
15 minute read · September 22, 2023

Exploring the Architecture of Apache Iceberg, Delta Lake, and Apache Hudi

Alex Merced · Senior Tech Evangelist, Dremio

In the age of data-centric applications, storing, accessing, and managing data can significantly influence an organization's ability to derive value from a data lakehouse. At the heart of this conversation are data lakehouse table formats, which are metadata layers that allow tools to interact with data lake storage like a traditional database. But why do these formats matter? The answer lies in performance, efficiency, and ease of data operations. This blog post aims to make the architecture of Apache Iceberg, Delta Lake, and Apache Hudi more accessible so you can better understand the high-level differences in their respective approaches to providing the lakehouse metadata layer.

The metadata layer these formats provide contains the details that query engines can use to plan efficient data operations. You can think of this metadata as serving a similar role to the Dewey Decimal System in a library. The Dewey Decimal System serves as an abstraction to help readers more quickly find the books they want to read without having to walk up and down the entire library. In the same manner, a table format’s metadata makes it possible for query engines to not have to scan every data file in a dataset.

Apache Iceberg

Originating at Netflix and later becoming a community-run Apache project, Apache Iceberg provides a data lakehouse metadata layer that many vendors support. It is built on a three-tier metadata tree.

Let’s examine the three tiers.

Metadata Files

At the heart of Apache Iceberg’s metadata is the metadata.json file, which defines all the table-level information such as the table’s schema, partitioning scheme, and current snapshot, along with a historical list of schemas, partition specs, and snapshots.

{
    "format-version": 2,
    "table-uuid": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "location": "/data/animals/",
    "last-sequence-number": 1002,
    "last-updated-ms": 1677619248000,
    "last-column-id": 3,
    "schemas": [
        {
            "schema-id": 1,
            "fields": [
                {
                    "id": 1,
                    "name": "id",
                    "type": "int"
                },
                {
                    "id": 2,
                    "name": "category",
                    "type": "string"
                },
                {
                    "id": 3,
                    "name": "name",
                    "type": "string"
                }
            ]
        }
    ],
    "current-schema-id": 1,
    "partition-specs": [
        {
            "spec-id": 1,
            "fields": [
                {
                    "source-id": 2,
                    "transform": "identity",
                    "field-id": 1001
                }
            ]
        }
    ],
    "default-spec-id": 1,
    "last-partition-id": 1001,
    "sort-orders": [
        {
            "order-id": 1,
            "fields": [
                {
                    "source-id": 1,
                    "direction": "asc",
                    "null-order": "nulls-first"
                }
            ]
        }
    ],
    "default-sort-order-id": 1
}
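
To make the structure concrete, here is a minimal Python sketch that reads a metadata.json like the sample above and resolves the current schema and default partition spec, the way an engine would as a first planning step. The local file path is hypothetical, and in practice a library such as PyIceberg does this work for you.

import json

# Load a metadata.json like the sample above (path is hypothetical).
with open("metadata.json") as f:
    metadata = json.load(f)

# Resolve the current schema by matching current-schema-id against the
# historical list of schemas kept in the file.
current_schema = next(
    s for s in metadata["schemas"]
    if s["schema-id"] == metadata["current-schema-id"]
)

# Resolve the default partition spec the same way.
default_spec = next(
    p for p in metadata["partition-specs"]
    if p["spec-id"] == metadata["default-spec-id"]
)

print("columns:", [f["name"] for f in current_schema["fields"]])
print("partition source column ids:",
      [f["source-id"] for f in default_spec["fields"]])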

Manifest Lists

Each snapshot listed in the metadata.json points to a “manifest list,” which lists all the manifest files that make up that particular snapshot, along with manifest-level stats (such as partition value ranges) that query engines can use to filter out entire manifests.

{
  "manifests": [
    {
      "manifest_path": "s3://bucket/path/to/manifest1.avro",
      "manifest_length": 1024,
      "partition_spec_id": 1,
      "content": 0,
      "sequence_number": 1000,
      "min_sequence_number": 999,
      "added_snapshot_id": 12345,
      "added_files_count": 10,
      "existing_files_count": 8,
      "deleted_files_count": 2,
      "added_rows_count": 100,
      "existing_rows_count": 80,
      "deleted_rows_count": 20,
      "partitions": [
        {
          "contains_null": false,
          "lower_bound": "encoded_value",
          "upper_bound": "encoded_value"
        },
        {
          "contains_null": true,
          "contains_nan": false,
          "lower_bound": "another_encoded_value",
          "upper_bound": "another_encoded_value"
        }
      ]
    },
    {
      "manifest_path": "s3://bucket/path/to/manifest2.avro",
      "manifest_length": 2048,
      "partition_spec_id": 2,
      "content": 1,
      "sequence_number": 1001,
      "min_sequence_number": 1000,
      "added_snapshot_id": 12346,
      "added_files_count": 5,
      "existing_files_count": 7,
      "deleted_files_count": 3,
      "added_rows_count": 50,
      "existing_rows_count": 70,
      "deleted_rows_count": 30,
      "partitions": [
        {
          "contains_null": false,
          "lower_bound": "yet_another_encoded_value",
          "upper_bound": "yet_another_encoded_value"
        }
      ]
    }
  ]
}
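
As a rough illustration of how an engine can use these manifest-level stats, the following Python sketch keeps only the manifests whose partition summaries could match a filter value. It assumes the lower_bound/upper_bound values have already been decoded into comparable Python values; in real manifest lists they are type-encoded binary values.

def manifest_may_contain(manifest, partition_value, field_index=0):
    # Check one partition field summary against a filter value.
    summary = manifest["partitions"][field_index]
    if partition_value is None:
        return summary.get("contains_null", False)
    return summary["lower_bound"] <= partition_value <= summary["upper_bound"]


def prune_manifests(manifest_list, partition_value):
    # Keep only manifests whose partition ranges could match; the engine
    # never needs to open the others.
    return [
        m for m in manifest_list["manifests"]
        if manifest_may_contain(m, partition_value)
    ]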

Manifests

Each manifest file listed in a snapshot’s manifest list contains a listing of the individual data files, along with per-file column stats that engines can use to skip data files that cannot match a query.

{
  "metadata": {
    "schema": "{\"type\":\"record\",\"name\":\"table_schema\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"age\",\"type\":\"int\"}]}",
    "schema-id": "1",
    "partition-spec": "{\"fields\":[{\"name\":\"age\",\"transform\":\"identity\",\"fieldId\":1,\"sourceId\":1}]}",
    "partition-spec-id": "1",
    "format-version": "2",
    "content": "data"
  },
  "entries": [
    {
      "status": 1,
      "snapshot_id": 101,
      "data_file": {
        "content": 0,
        "file_path": "s3://mybucket/data/datafile1.parquet",
        "file_format": "parquet",
        "partition": {
          "age": 30
        },
        "record_count": 5000,
        "file_size_in_bytes": 1000000,
        "column_sizes": {
          "1": 200000,
          "2": 300000,
          "3": 500000
        },
        "value_counts": {
          "1": 5000,
          "2": 4500,
          "3": 4000
        },
        "null_value_counts": {
          "1": 0,
          "2": 500,
          "3": 1000
        }
      }
    },
    {
      "status": 1,
      "snapshot_id": 102,
      "data_file": {
        "content": 0,
        "file_path": "s3://mybucket/data/datafile2.parquet",
        "file_format": "parquet",
        "partition": {
          "age": 40
        },
        "record_count": 6000,
        "file_size_in_bytes": 1200000,
        "column_sizes": {
          "1": 240000,
          "2": 360000,
          "3": 600000
        },
        "value_counts": {
          "1": 6000,
          "2": 5400,
          "3": 4800
        },
        "null_value_counts": {
          "1": 0,
          "2": 600,
          "3": 1200
        }
      }
    }
  ]
}
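
The same idea applies one level down. The sketch below, which assumes a manifest parsed into a Python dict like the sample above, selects only the data files whose partition values match a filter and, optionally, that can still contain NULLs in a given column based on null_value_counts.

def plan_data_files(manifest, partition_filter=None, null_column_id=None):
    # partition_filter: dict of partition column -> required value
    # null_column_id: column id (string key, as in the sample) that must
    #                 have at least one NULL for the file to be kept
    selected = []
    for entry in manifest["entries"]:
        data_file = entry["data_file"]

        # Skip files whose partition tuple doesn't match the filter.
        if partition_filter and any(
            data_file["partition"].get(col) != value
            for col, value in partition_filter.items()
        ):
            continue

        # Skip files that can't contain NULLs in the requested column.
        if null_column_id is not None and \
                data_file["null_value_counts"].get(null_column_id, 0) == 0:
            continue

        selected.append(data_file["file_path"])
    return selected


# Example: only datafile2.parquet survives a filter of age = 40.
# plan_data_files(manifest, partition_filter={"age": 40})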

Delta Lake

Developed by Databricks, Delta Lake provides a data lakehouse metadata layer that benefits from many features primarily available on the Databricks platform, along with those built into its own specification. Two types of files handle most of the work with Delta Lake: log files and checkpoint files.

Delta Logs

Mechanically, Delta logs are very similar to Git commits. In Git, each commit captures the lines of code added and removed since the last commit, whereas each Delta log captures the files added to and removed from the table since the last commit.

For instance, a log file (00000000000000000001.json) contains one JSON action per line and might read:

{"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}
{"commitInfo": {"timestamp": 1629292910020, "operation": "WRITE", "operationParameters": {"mode": "Overwrite", "partitionBy": "['date']"}, "isBlindAppend": false}}
{"add": {"path": "data/partition=date/parquetfile.parquet", "partitionValues": {"date": "2023-08-18"}, "modificationTime": 1629292910020, "size": 84123}}
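
To see why the Git analogy holds, here is a minimal Python sketch that replays a directory of Delta log files (one JSON action per line) and rebuilds the current set of data files by applying each add and remove action in commit order. The local path is hypothetical, and a real reader would also handle checkpoints, which the next section covers.

import json
from pathlib import Path

def reconstruct_table_state(delta_log_dir):
    live_files = {}
    # Commit files use zero-padded version numbers, so sorting the file
    # names replays them in commit order.
    for commit in sorted(Path(delta_log_dir).glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live_files[action["add"]["path"]] = action["add"]
            elif "remove" in action:
                live_files.pop(action["remove"]["path"], None)
    return list(live_files.values())

# files = reconstruct_table_state("my_table/_delta_log")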

Checkpoint Files

An engine can re-create the state of a table by going through each Delta log file and constructing the list of files in the table. However, after many commits, this process can begin to introduce latency. To deal with this, checkpoint files (written as Parquet) periodically summarize a group of log files so that every individual log file doesn’t have to be read to reconstruct the list of files in the dataset.

| path                                        | partitionValues | modificationTime   | size    |
|---------------------------------------------|-----------------|--------------------|---------|
| data/partition=date1/parquetfile1.parquet   | {date: date1}   | 1629291000010      | 84123   |
| data/partition=date2/parquetfile2.parquet   | {date: date2}   | 1629291500020      | 76234   |
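
Here is a sketch of how a checkpoint short-circuits that replay, assuming a single-part Parquet checkpoint and the _last_checkpoint file that records which version it covers. Only commits newer than the checkpoint still need to be read.

import json
from pathlib import Path
import pyarrow.parquet as pq

def state_from_checkpoint(delta_log_dir):
    log_dir = Path(delta_log_dir)

    # _last_checkpoint records the version of the most recent checkpoint.
    last = json.loads((log_dir / "_last_checkpoint").read_text())
    version = last["version"]

    # The checkpoint is a Parquet file whose rows are actions; rows with a
    # non-null "add" struct describe the live data files as of `version`.
    checkpoint = pq.read_table(str(log_dir / f"{version:020d}.checkpoint.parquet"))
    live_files = {
        row["add"]["path"]: row["add"]
        for row in checkpoint.to_pylist()
        if row.get("add")
    }

    # Only commits newer than the checkpoint still need to be replayed.
    for commit in sorted(log_dir.glob("*.json")):
        if int(commit.stem) <= version:
            continue
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live_files[action["add"]["path"]] = action["add"]
            elif "remove" in action:
                live_files.pop(action["remove"]["path"], None)
    return list(live_files.values())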

Apache Hudi

Apache Hudi is another table format that originated at Uber. Hudi’s approach revolves around capturing the timestamp and type of different operations and creating a timeline.

Directory Structure

Each Hudi table uses several directories to organize the metadata that tracks the table (a small sketch of walking this layout follows the list).

  • /voter_data/: This is the table’s root folder, which houses partition folders with data files and the .hoodie folder that houses all the metadata.
    • /.hoodie/: This is the folder that holds all the table metadata tracking table properties and file metadata.
      • /hoodie.properties: List of properties on how the table is structured.
      • /metadata/: This is where the metadata is saved, including transactions, bloom filters, and more.
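
As a quick illustration, the sketch below walks a table laid out this way and separates the .hoodie metadata from the partition folders of data files. The table path and layout details are assumptions based on the listing above.

from pathlib import Path

def describe_hudi_table(table_path):
    root = Path(table_path)
    hoodie = root / ".hoodie"

    print("table properties:", hoodie / "hoodie.properties")
    print("metadata table:  ", hoodie / "metadata")

    # Everything outside .hoodie is partition folders holding data files.
    for path in root.rglob("*"):
        if path == hoodie or hoodie in path.parents:
            continue
        if path.is_file():
            print("data file:", path.relative_to(root))

# describe_hudi_table("/data/voter_data")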

Hudi Metadata

The metadata folder in Hudi contains the metadata table, which holds several indexes for improving performance via data skipping and other optimizations.

  • Files Index: Stores file details like name, size, and status.
  • Column Stats Index: Contains statistics of specific columns, aiding in data skipping and speeding up queries.
  • Bloom Filter Index: Houses bloom filters of data files for more efficient data lookups.
  • Record Index: Maps record keys to locations for rapid retrieval, introduced in Hudi 0.14.0.

Hudi uses HFile, a file format from Apache HBase, to store these metadata records.
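
To show the idea behind the Bloom Filter Index, here is a toy Python sketch (not Hudi’s implementation) in which each data file carries a small bloom filter over its record keys, so a lookup can skip any file whose filter says the key is definitely absent. The file names and keys are hypothetical.

import hashlib

class TinyBloomFilter:
    # A toy stand-in for a per-file bloom filter; Hudi maintains its own
    # serialized filters rather than anything like this class.
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


# Build one filter per data file.
file_keys = {"file_a.parquet": ["k1", "k2"], "file_b.parquet": ["k3"]}
file_filters = {}
for file_name, keys in file_keys.items():
    bloom = TinyBloomFilter()
    for key in keys:
        bloom.add(key)
    file_filters[file_name] = bloom

# A lookup for "k3" only opens files whose filter might contain it.
candidates = [f for f, bloom in file_filters.items() if bloom.might_contain("k3")]
print(candidates)  # likely ["file_b.parquet"]; false positives possible, no false negatives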

Base and Log Files: The Core Content

In Hudi’s world, there are two main types of data files:

  • Base Files: These are the original data files written in Parquet or ORC.
  • Log Files: These are files that track changes to the data in the base file to be reconciled on read.

Naming Conventions

The naming conventions for these files are as follows (a small parsing sketch appears after the list):

  • Base Files: [File ID]_[Writing Version]_[Creation Time].[File Type]
  • Log Files: [Base File ID]_[Base Creation Time].[Log Type].[Log Version]_[Writing Version]
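
As an example of reading these conventions, here is a small Python sketch that splits a base file name into the parts named above. The example file name is made up purely for illustration.

def parse_base_file_name(file_name):
    # [File ID]_[Writing Version]_[Creation Time].[File Type]
    stem, file_type = file_name.rsplit(".", 1)
    file_id, writing_version, creation_time = stem.split("_", 2)
    return {
        "file_id": file_id,
        "writing_version": writing_version,
        "creation_time": creation_time,
        "file_type": file_type,
    }

print(parse_base_file_name("fg-101_1-0-1_20230922120000000.parquet"))
# {'file_id': 'fg-101', 'writing_version': '1-0-1',
#  'creation_time': '20230922120000000', 'file_type': 'parquet'}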

The Timeline Mechanics

Hudi loves order. Every action or change made to a table is recorded in a timeline, allowing you to see the entire history. This timeline also ensures that multiple changes don’t clash.

Actions on this timeline go through stages like:

  • Planning (requested)
  • Doing (inflight)
  • Done (commit)

These steps are captured in the .hoodie folder through files with the naming convention [timestamp].[transaction state (requested/inflight/commit)], and this is how Hudi identifies and reconciles concurrent transactions. If two transactions arrive at the same timestamp, one of them will see the other’s pending transaction and adjust accordingly.
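
The following sketch reads such a timeline folder and separates completed instants from pending ones, based on the [timestamp].[state] convention described above. The path is hypothetical, and real timeline file names also carry an action name (commit, deltacommit, clean, and so on) that this sketch glosses over.

from pathlib import Path

def read_timeline(hoodie_dir):
    timeline = []
    for path in Path(hoodie_dir).iterdir():
        parts = path.name.split(".")
        # Timeline files start with a numeric timestamp; skip everything
        # else (hoodie.properties, the metadata/ folder, and so on).
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        timestamp, state = parts[0], parts[-1]
        timeline.append((timestamp, state))

    timeline.sort()
    pending = [(ts, state) for ts, state in timeline
               if state in ("requested", "inflight")]
    return timeline, pending

# timeline, pending = read_timeline("/data/voter_data/.hoodie")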

Hudi Table Fields

Each table comes with some additional fields that assist in Hudi’s data lookups:

  • User Labels: These are the fields of the table the user specified when they created the table or updated the schema.
  • Hudi’s Own Labels: Fields created by Hudi to optimize operations, which include _hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path, and _hoodie_file_name.

Conclusion

|                   | Apache Iceberg                           | Delta Lake                | Apache Hudi                                                 |
|-------------------|------------------------------------------|---------------------------|-------------------------------------------------------------|
| Metadata Approach | 3-Tier Metadata Tree                     | Logs & Checkpoints        | Timeline                                                    |
| Data Skipping     | Based on Manifest and File Stats         | File Stats from Log Files | Based on Column Stats Indexes in Metadata Table             |
| File Types        | Metadata.json, Manifest Lists, Manifests | Log Files and Checkpoints | Log Files, Metadata Table, Requested/Inflight/Commit Files  |

Each format takes a very different approach to maintaining the metadata that enables ACID transactions, time travel, and schema evolution in the data lakehouse. Hopefully, this helps you better understand the internal structures of these data lakehouse table formats.
