22 minute read · June 12, 2024

How Apache Iceberg is Built for Open Optimized Performance

Alex Merced · Senior Tech Evangelist, Dremio

Apache Iceberg is a table format designed for data lakehouses. While many people focus on how table formats enable database-like ACID transactions on data lakes—allowing them to function like data warehouses, or "data lakehouses"—there is another equally powerful aspect: the metadata provided by these formats. This metadata can be used to execute transactions with optimal performance. Apache Iceberg includes several mechanisms that enable query engines, such as Dremio, to query data with enhanced performance. In this article, I will cover several of these mechanisms to explore the open and robust performance capabilities of Apache Iceberg.

Table Statistics

Before the advent of Apache Iceberg, managing statistics in Hive tables presented significant challenges. One of the primary issues was the need to manually run the ANALYZE command periodically to collect and update statistics. This cumbersome process often led to stale statistics, which could significantly degrade query performance. Users had to continually ensure statistics were up to date, which became onerous in large, dynamically changing data environments.

Apache Iceberg's Approach to Statistics

Apache Iceberg revolutionizes this aspect by generating and storing statistics as part of the metadata during write operations to a table. This means that statistics are always current, eliminating the need for manual interventions to update them. The statistics collected in Iceberg's metadata include essential information like record count, file size, value counts, null value counts, lower and upper bounds for each column, and more.

Types of Statistics Collected

Iceberg collects a variety of statistics that are instrumental in optimizing query performance. These include:

  • Record Count: The number of records in each file.
  • File Size: The total size of each file in bytes.
  • Value Counts: The number of values present in each column, including null and NaN values.
  • Null Value Counts: The number of null values in each column.
  • NaN Value Counts: The number of NaN values in each column.
  • Lower and Upper Bounds: The minimum and maximum values in each column.
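
These per-file statistics can be inspected directly. As a minimal sketch, assuming a Spark session with Iceberg configured and a placeholder table named demo.db.events, Spark surfaces them through Iceberg's "files" metadata table (the exact column set varies by Iceberg version):

    from pyspark.sql import SparkSession

    # Assumes an Iceberg-enabled Spark session and a catalog named "demo".
    spark = SparkSession.builder.getOrCreate()

    # The "files" metadata table exposes the per-file statistics Iceberg
    # records automatically at write time.
    spark.sql("""
        SELECT file_path,
               record_count,
               file_size_in_bytes,
               value_counts,
               null_value_counts,
               nan_value_counts,
               lower_bounds,
               upper_bounds
        FROM demo.db.events.files
    """).show(truncate=False)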

Query Optimization with Iceberg

Query engines leverage these statistics in several ways to enhance query performance:

  • File Pruning: Based on the collected statistics, query engines can prune files that do not match the query predicates. For instance, if a query is looking for records within a certain range, files whose bounds fall entirely outside this range can be skipped (a minimal sketch of this idea follows this list).
  • Cost-Based Optimization: Statistics like record count and file size help query engines in making more informed decisions about query planning and execution strategies. For example, knowing the size and distribution of data can help in choosing the most efficient join strategies or in optimizing resource allocation for query execution.
  • Dynamic Partition Pruning: Statistics enable dynamic partition pruning where only the necessary partitions are read based on the query filters, leading to more efficient data access and reduced I/O operations.
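
To make the file-pruning idea concrete, here is a minimal, engine-agnostic sketch. The file list and bounds are invented for illustration; a real engine reads these bounds from Iceberg's manifest metadata.

    # Skip any file whose [lower, upper] bounds cannot satisfy the
    # predicate "lo <= value <= hi". Bounds here are made up.
    files = [
        {"path": "f1.parquet", "lower": 0,   "upper": 99},
        {"path": "f2.parquet", "lower": 100, "upper": 199},
        {"path": "f3.parquet", "lower": 200, "upper": 299},
    ]

    def prune(files, lo, hi):
        # Keep only files whose bound range overlaps the query range.
        return [f for f in files if f["upper"] >= lo and f["lower"] <= hi]

    print(prune(files, 150, 180))  # only f2.parquet survives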

Partitioning Features

Apache Iceberg's hidden partitioning and partition evolution capabilities are game-changers in optimizing data management and performance in data lakehouses. These features significantly reduce the overhead and complexity traditionally associated with changing partition strategies, enhancing performance and efficiency.

Hidden Partitioning: The ability to track the partitioning strategy as a transformed value of a column, eliminating the need to persist separate “partition columns,” which complicate ingestion and querying of the data.

Partition Evolution: Because the metadata handles most of the partition tracking, changing your partition strategy for future writes without having to rewrite all previous data becomes possible.
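
As a rough sketch of what this looks like in practice, the Spark SQL below creates a table with hidden monthly partitioning and then evolves the spec so that future writes use daily partitions. Catalog, table, and column names are placeholders, and the exact DDL dialect depends on your engine.

    from pyspark.sql import SparkSession

    # Assumes an Iceberg-enabled Spark session and a catalog named "demo".
    spark = SparkSession.builder.getOrCreate()

    # Hidden partitioning: partition by a transform of ts; no separate
    # "month" column is persisted in the data files.
    spark.sql("""
        CREATE TABLE demo.db.events (id BIGINT, ts TIMESTAMP, payload STRING)
        USING iceberg
        PARTITIONED BY (months(ts))
    """)

    # Partition evolution: future writes use daily partitions; existing
    # data files are left as-is.
    spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD days(ts)")
    spark.sql("ALTER TABLE demo.db.events DROP PARTITION FIELD months(ts)")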

Advantages of Hidden Partitioning and Partition Evolution

1. Cost-Effective Partitioning Strategy Changes

Changing the partitioning strategy in a traditional data lake setup often involves extensive and expensive operations, including full table scans and re-writing large volumes of data. With Apache Iceberg, partitioning strategies can be modified with minimal cost. This is because Iceberg uses hidden partitioning, where partition transforms are applied dynamically during query planning rather than being physically persisted in the data files. This flexibility allows partition evolution, where quick adjustments to partitioning strategies can be made without the need to reprocess and rewrite existing data.

2. Reduced Full Table Scans

Before Apache Iceberg, tables regularly needed special “partition columns” to exist within the data, and analysts had to explicitly filter on these columns for the query engine to make use of the table's partitioning. With Apache Iceberg, the metadata structure makes this unnecessary. It eliminates the accidental full table scans, and the delays and costs they introduce, that occur when analysts forget to filter on an additional partition column (for example, filtering on a timestamp but forgetting to filter on a “month” or “day” field created for partitioning purposes).
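
For example, with the hidden monthly partitioning sketched earlier, a plain timestamp predicate is enough for partition pruning (table and column names are again placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # assumes Iceberg is configured

    # No synthetic "month" column needed: Iceberg applies the partition
    # transform to the ts predicate during planning and prunes accordingly.
    spark.sql("""
        SELECT * FROM demo.db.events
        WHERE ts >= TIMESTAMP '2024-05-01 00:00:00'
          AND ts <  TIMESTAMP '2024-06-01 00:00:00'
    """).show()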

3. Smaller Data Files

By not requiring partition transforms to be physically persisted in the data files, Apache Iceberg allows for smaller and more efficient data files. The partition information is stored in the metadata, making the data files leaner and reducing storage costs. This also speeds up data access, since smaller files can be transferred over the wire faster when multi-node clusters process the data.

Partition Transforms and Their Uses

Apache Iceberg supports a variety of partition transforms, each suited to different data scenarios. Here are some common transforms and their ideal use cases:

Bucket Transform

Description: Hashes the column value and then applies a modulus operation to distribute the values into a specified number of buckets.

Use Case: Useful for evenly distributing high-cardinality columns across a fixed number of partitions, such as user IDs or session IDs.

Truncate Transform

Description: Truncates the column value to a specified width.

Use Case: Effective for string columns or columns with long numeric values where only a prefix or a subset of the value is needed, such as product codes or URLs.

Year/Month/Day Transforms

Description: Extracts the year, month, or day from a timestamp column.

Use Case: Best for time-series data where queries are often filtered by specific time periods, such as logs, sensor data, or transaction records.

Hour Transform

Description: Extracts the hour from a timestamp column.

Use Case: Useful for data that needs to be analyzed on an hourly basis, such as clickstream data or event logs.
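
Putting a few of these together, here is a hedged sketch of a table DDL that combines the bucket, truncate, and hour transforms. All names are placeholders, and the transform argument order shown follows Iceberg's documented spec; check your engine's DDL dialect.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # assumes Iceberg is configured

    # bucket(16, user_id): spread a high-cardinality key across 16 buckets.
    # truncate(10, url):   partition by the first 10 characters of the URL.
    # hours(ts):           hourly partitions for time-based filtering.
    spark.sql("""
        CREATE TABLE demo.db.clicks (user_id BIGINT, url STRING, ts TIMESTAMP)
        USING iceberg
        PARTITIONED BY (bucket(16, user_id), truncate(10, url), hours(ts))
    """)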

Partition Statistics Files

The partition statistics file in Apache Iceberg tracks detailed statistics about the partitions in a table. This file aids query engines in optimizing queries by providing comprehensive insights into the distribution and characteristics of data across partitions. By leveraging this information, query engines can make informed decisions that enhance query performance and efficiency.

Structure and Information Tracked

The partition statistics file is structured to store detailed metrics for each partition. The key fields tracked in this file include:

  • Partition Data Tuple: The specific partition values based on the partition specification.
  • Partition Spec ID: The unique identifier for the partition specification used.
  • Data Record Count: The total number of records in the data files within the partition.
  • Data File Count: The number of data files in the partition.
  • Total Data File Size: The cumulative size of all data files in the partition, measured in bytes.
  • Position Delete Record Count: The number of records marked for deletion by position delete files.
  • Position Delete File Count: The number of position delete files in the partition.
  • Equality Delete Record Count: The number of records marked for deletion by equality delete files.
  • Equality Delete File Count: The number of equality delete files in the partition.
  • Total Record Count: The accurate count of records in the partition after applying delete files.
  • Last Updated At: The timestamp of the last update to the partition, in milliseconds from the Unix epoch.
  • Last Updated Snapshot ID: The ID of the snapshot that last updated the partition.
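
While the partition statistics file itself is an internal metadata structure, engines expose this kind of per-partition information for inspection. As a hedged sketch, Spark's "partitions" metadata table surfaces similar fields (assuming an Iceberg-enabled session and placeholder table names; the column set varies by Iceberg version):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # assumes Iceberg is configured

    spark.sql("""
        SELECT partition,
               spec_id,
               record_count,
               file_count,
               total_data_file_size_in_bytes,
               position_delete_record_count,
               equality_delete_record_count,
               last_updated_at,
               last_updated_snapshot_id
        FROM demo.db.events.partitions
    """).show(truncate=False)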

Using Partition Statistics to Optimize Queries

Query engines can utilize the information stored in partition statistics files to optimize query performance in several significant ways:

Cost-Based Optimization:

The metadata in partition statistics files aids in cost-based query optimization. Information such as the number of records, data file count, and file sizes helps the query planner choose the most efficient execution strategies, such as selecting the best join algorithms or deciding on parallel query execution.

Dynamic Partition Pruning:

With accurate and up-to-date partition statistics, query engines can dynamically prune partitions based on the query predicates. This reduces I/O operations and enhances query performance by focusing only on the relevant partitions.

Improved Query Planning:

The partition statistics file provides a comprehensive view of the data distribution, allowing query engines to plan queries more effectively. For example, knowing the exact size and record count of partitions helps in allocating resources appropriately and optimizing scan operations.

By leveraging the detailed information stored in partition statistics files, Apache Iceberg enables query engines to perform more efficient and optimized queries. This not only enhances overall performance but also reduces resource usage, making data processing more effective and scalable.

Puffin Files

Puffin files are a specialized file format in Apache Iceberg designed to store auxiliary information such as indexes and statistics about data managed in an Iceberg table. This information, which cannot be directly stored within Iceberg manifests, helps enhance the efficiency and performance of query execution. By leveraging Puffin files, query engines can access detailed metadata, allowing them to optimize query planning and execution.

Puffin File Structure

A Puffin file is composed of several key components:

  • Magic: Four bytes (0x50, 0x46, 0x41, 0x31, i.e., “PFA1”) identifying the file as Puffin and encoding its format version.
  • Blobs: Arbitrary pieces of information stored sequentially within the file.
  • Footer: Contains metadata necessary to interpret the blobs and consists of:
      • Magic: Same as the beginning of the file.
      • FooterPayload: UTF-8 encoded JSON payload, optionally compressed, describing the blobs.
      • FooterPayloadSize: The length of the FooterPayload in bytes.
      • Flags: Boolean flags indicating whether the FooterPayload is compressed.

The structure ensures that the information stored in the blobs can be efficiently accessed and interpreted by query engines.
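
To make the layout concrete, here is a minimal sketch of locating and decoding the footer payload in Python. It assumes an uncompressed footer and little-endian integers, per my reading of the Puffin spec; treat it as illustrative rather than a reference implementation.

    import json
    import struct

    MAGIC = b"PFA1"  # 0x50, 0x46, 0x41, 0x31

    def read_footer_payload(path):
        # The file ends with: FooterPayload, FooterPayloadSize (4 bytes),
        # Flags (4 bytes), Magic (4 bytes).
        with open(path, "rb") as f:
            data = f.read()
        if data[-4:] != MAGIC:
            raise ValueError("not a Puffin file")
        (payload_size,) = struct.unpack("<i", data[-12:-8])
        # Flags indicate whether the payload is compressed; this sketch
        # assumes it is not.
        payload = data[-12 - payload_size : -12]
        return json.loads(payload.decode("utf-8"))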

The footer payload, either uncompressed or LZ4-compressed, contains a JSON object representing the file's metadata. This FileMetadata object includes:

  • Blobs: A list of BlobMetadata objects.
  • Properties: Optional storage for arbitrary meta-information, like writer identification/version.

Each BlobMetadata object provides detailed information about individual blobs, including:

  • Type: The type of blob (e.g., "apache-datasketches-theta-v1").
  • Fields: A list of field IDs the blob was computed for.
  • Snapshot ID: The snapshot ID of the Iceberg table when the blob was computed.
  • Sequence Number: The sequence number of the snapshot.
  • Offset and Length: Location and size of the blob in the file.
  • Compression Codec: If the blob is compressed, the codec used.
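
A decoded FooterPayload therefore looks roughly like the following; all values are invented for illustration:

    # Hypothetical FileMetadata as a Python dict; field names follow the
    # BlobMetadata description above, values are made up.
    footer_payload = {
        "blobs": [
            {
                "type": "apache-datasketches-theta-v1",
                "fields": [1],                    # field IDs the blob covers
                "snapshot-id": 63872663765659739, # snapshot when computed
                "sequence-number": 2,
                "offset": 4,
                "length": 128,
            }
        ],
        "properties": {"created-by": "example-writer 1.0"},
    }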

Blob Types and Compression Codecs

The blobs stored in Puffin files can be of various types, such as:

apache-datasketches-theta-v1: A serialized Theta sketch produced by the Apache DataSketches library, providing an estimate of the number of distinct values.

The supported compression codecs include:

  • lz4: Single LZ4 compression frame with content size.
  • zstd: Single Zstandard compression frame with content size.

How Query Engines Use Puffin Files to Optimize Queries

Query engines can utilize the detailed metadata in Puffin files to significantly enhance query performance in several ways:

Improved Predicate Pushdown:

Puffin files store detailed statistics and indexes, enabling query engines to push down predicates more effectively. By accessing precise statistics, the engine can filter out irrelevant data early in the query process, reducing the amount of data scanned.

Efficient File Pruning:

The metadata in Puffin files allows query engines to prune unnecessary files from the query plan. For instance, using the apache-datasketches-theta-v1 blobs, engines can quickly determine which files do not contain the queried data, thus avoiding scanning these files.

Cost-Based Optimization:

Puffin files provide additional metadata that aids in cost-based query optimization. Information such as the number of distinct values (NDV) and data distribution helps the query planner make more informed decisions, optimizing join strategies, and resource allocation.

Dynamic Partition Pruning:

With accurate and up-to-date statistics, query engines can dynamically prune partitions, reading only the necessary partitions based on the query predicates. This reduces I/O operations and enhances query performance.

Enhanced Indexing:

Puffin files can store various types of indexes, such as bloom filters or Theta sketches, which help in fast data lookups and reduce the need for full table scans.

Example Use Cases for Puffin Files

Distinct Count Queries:

Using apache-datasketches-theta-v1 blobs, query engines can quickly estimate the number of distinct values in a column without scanning the entire dataset, making distinct count queries much faster.
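
To get a feel for why this is fast, here is a hedged sketch using the Apache DataSketches Python bindings (pip install datasketches), the same library family behind the apache-datasketches-theta-v1 blob type; the column values are simulated:

    from datasketches import update_theta_sketch

    sk = update_theta_sketch()
    for i in range(1_000_000):
        sk.update(i % 42_000)  # simulate a column with ~42k distinct values

    # Approximate NDV from a compact sketch, with no full-column scan.
    print(round(sk.get_estimate()))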

Range Queries:

Puffin files with detailed column statistics can help efficiently execute range queries by filtering out data files that do not fall within the specified range.

Join Operations:

Cost-based optimization using Puffin file metadata can lead to more efficient join operations by choosing the best join strategy based on data distribution and statistics.

By leveraging the rich metadata stored in Puffin files, Apache Iceberg enables query engines to perform more efficient and optimized queries, enhancing overall performance and reducing resource usage.

Conclusion

Apache Iceberg's value lies not only in its robust support for ACID transactions and seamless data lakehouse capabilities, but also in its comprehensive, open specification that facilitates optimized performance across any query engine. The detailed metrics and metadata collected by Iceberg, such as statistics, partition information, and Puffin files, are all part of this open standard. This ensures that users can reap the benefits of enhanced query performance without being tied to a specific query engine.

Query engines can leverage Iceberg's metadata to implement advanced optimization techniques, such as predicate pushdown, dynamic partition pruning, and cost-based optimization. The ability to collect and store detailed statistics during write operations, automatically maintain up-to-date metadata, and use advanced partitioning strategies significantly improves query efficiency and reduces operational overhead.

Moreover, many query engines add their own layers of query optimization to further exploit Iceberg's capabilities. For example, Dremio's reflections feature creates materializations based on Apache Iceberg that inherit all the benefits of Iceberg's metadata-driven optimizations. This synergy between Iceberg and query engines like Dremio showcases how open standards can foster innovation and deliver unparalleled performance improvements for data lakehouses.

Apache Iceberg's open and extensible design empowers users to achieve optimized query performance while maintaining flexibility and compatibility with a wide range of tools and platforms. Iceberg is indispensable in modern data architectures, driving efficiency, scalability, and cost-effectiveness for data-driven organizations.
