14 minute read · February 14, 2022
Apache Iceberg Version 0.13.0 Is Released
Senior Tech Evangelist, Dremio
On Wednesday, February 9th, the Apache Iceberg community released version 0.13.0 with lots of new features and improvements, along with a redesigned website and docs.
2022 has already been a great year for Apache Iceberg, which has seen new levels of support from platforms like Dremio, AWS, and Snowflake, and was chosen as the “Open Source Project of the Week” by Software Development Times for the first week of February 2022.
Let’s take a deeper look at many of the new features that come with Apache Iceberg 0.13.0. (Release Notes)
Catalog caching now supports cache expiration
Catalog caching is a technique that speeds up table reads by letting engines avoid re-reading a table’s metadata on every read.
Previously, when multiple readers and writers used Apache Iceberg tables from different engines, the cache could go stale, and a manual refresh was required to see the new data. With the new cache expiration feature, you can set a time interval after which the cache expires, forcing an automatic refresh that resolves this issue. This is configured with the setting cache.expiration-interval-ms, which is ignored if cache-enabled is set to false. Read more on this feature here.
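For example, in Spark these settings can be applied as catalog properties. Below is a minimal sketch, assuming a Hive-backed Iceberg catalog; the catalog name demo, the table name, and the 60-second interval are hypothetical.

```java
import org.apache.spark.sql.SparkSession;

public class CacheExpirationExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-cache-expiration")
        // Hypothetical Iceberg catalog named "demo" backed by a Hive metastore.
        .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.demo.type", "hive")
        // Caching must be enabled for the expiration interval to take effect.
        .config("spark.sql.catalog.demo.cache-enabled", "true")
        // Expire cached table metadata after 60 seconds, forcing a refresh.
        .config("spark.sql.catalog.demo.cache.expiration-interval-ms", "60000")
        .getOrCreate();

    // Reads after the interval elapses pick up commits made by other engines.
    spark.sql("SELECT * FROM demo.db.events").show();
  }
}
```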
Hadoop catalog can be used with S3 and other file systems safely by using a lock manager
When using Iceberg on object stores like S3 and committing to tables from multiple engines concurrently, you previously couldn’t use the Hadoop catalog safely. This is because the check Iceberg relies on to prevent two separate jobs or engines from committing at the same time isn’t atomic on object stores, so two concurrent commits could result in data loss.
Iceberg 0.13.0 fixes this issue by leveraging a lock table in services like DynamoDB so all catalogs can use locks for safe concurrency. Read more on this feature here.
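As a rough sketch, a Hadoop catalog in Spark might be paired with a DynamoDB-backed lock manager as shown below. The catalog name, warehouse path, and lock table name are hypothetical, and the lock manager class is the one documented in the Iceberg AWS module, so verify it against the docs for your version.

```java
import org.apache.spark.sql.SparkSession;

public class HadoopCatalogLockExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-hadoop-catalog-locking")
        // Hypothetical Hadoop catalog with an S3 warehouse.
        .config("spark.sql.catalog.hadoop_cat", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.hadoop_cat.type", "hadoop")
        .config("spark.sql.catalog.hadoop_cat.warehouse", "s3://my-bucket/warehouse")
        // Route commit locking through DynamoDB so concurrent writers are safe.
        .config("spark.sql.catalog.hadoop_cat.lock-impl",
                "org.apache.iceberg.aws.dynamodb.DynamoDbLockManager")
        // Name of the DynamoDB table that holds the locks (hypothetical).
        .config("spark.sql.catalog.hadoop_cat.lock.table", "iceberg_lock_table")
        .getOrCreate();
  }
}
```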
Catalog now supports registration of an Iceberg table from a given metadata file location
From a HiveCatalog you could drop a table, but there was previously no way to add an existing table to the catalog. With this update, you can register an existing table in your catalog by passing the location of the table’s newest metadata file. This is especially helpful for interacting with Hive external tables in Spark. Read more on this feature here.
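A minimal sketch of doing this through the catalog API, assuming a HiveCatalog; the metastore URI, namespace, and metadata file path below are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;

public class RegisterTableExample {
  public static void main(String[] args) {
    HiveCatalog catalog = new HiveCatalog();
    catalog.setConf(new Configuration());
    // Hypothetical metastore URI.
    catalog.initialize("hive", java.util.Map.of("uri", "thrift://metastore-host:9083"));

    // Register an existing table by pointing at its newest metadata file.
    Table table = catalog.registerTable(
        TableIdentifier.of("db", "events"),
        "s3://my-bucket/warehouse/db/events/metadata/00003-abc.metadata.json");

    System.out.println("Registered: " + table.location());
  }
}
```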
Deletes now supported for ORC Files
In Iceberg’s v2 format, delete files are used to track records that have been deleted. Previously, this wasn’t supported when the table’s underlying file format was ORC. Now, both position and equality deletes are supported for tables backed by ORC files. Read more here.
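As a quick illustration, the sketch below creates a v2 table backed by ORC data files and issues a row-level delete against it; it assumes an Iceberg catalog named demo is already configured, and the table and filter are hypothetical.

```java
import org.apache.spark.sql.SparkSession;

public class OrcDeleteExample {
  public static void main(String[] args) {
    // Assumes an Iceberg catalog named "demo" is configured on this session.
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-orc-deletes")
        .getOrCreate();

    // A v2 table whose data files are written as ORC.
    spark.sql("CREATE TABLE demo.db.logs (id BIGINT, level STRING) USING iceberg " +
              "TBLPROPERTIES ('write.format.default'='orc', 'format-version'='2')");

    // Row-level deletes now work even though the underlying files are ORC.
    spark.sql("DELETE FROM demo.db.logs WHERE level = 'DEBUG'");
  }
}
```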
Vendor Integrations
Along with the core updates detailed previously, several vendor integrations were added in version 0.13.0, including:
- Native GCS FileIO Support [#3711]
- Support for Aliyun Object Storage Service [#3553]
- Removed restrictions on the S3 endpoint to enable support for any S3-compatible storage [#3656] [#3658]
- AWS S3FileIO now supports server-side checksum validation [#3813]
- AWS GlueCatalog now displays more table information including table location, description [#3467], and columns [#3888]
- Using multiple FileIOs based on file path scheme is supported by configuring a ResolvingFileIO [#3593] (see the configuration sketch after this list)
- Dremio now supports Iceberg tables that use an AWS Glue catalog [20.0]
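As a sketch of the ResolvingFileIO item above, the implementation can be set as a catalog’s io-impl so each file path is routed to an appropriate FileIO by its scheme; the catalog name here is hypothetical.

```java
import org.apache.spark.sql.SparkSession;

public class ResolvingFileIOExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-resolving-fileio")
        .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.demo.type", "hive")
        // Pick a FileIO per file based on its scheme (e.g., s3:// vs. hdfs://).
        .config("spark.sql.catalog.demo.io-impl", "org.apache.iceberg.io.ResolvingFileIO")
        .getOrCreate();
  }
}
```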
Tooling Support
Version 0.13.0 also brings several updates to the support available for many of the most popular data processing tools.
Apache Spark
- Spark 3.2 support. [#3970]
- Merge-on-read delete support in Spark 3.2. [#3970]
- Compaction in Spark with RewriteDataFiles now supports table-based optimization and merge-on-read deletes. [#2829]
- Time travel queries in Spark now use the schema of the snapshot being queried instead of the latest schema in the metadata. [#3722]
- Spark vectorized reads now support row-level deletes. [#3557]
- The add_files procedure no longer writes duplicate metadata files when called multiple times on the same table. [#2779]
- Stored procedure support for RewriteDataFiles (see the sketch after this list). [#3375]
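A sketch of the new RewriteDataFiles stored procedure called through Spark SQL; it assumes an Iceberg catalog named demo is already configured, and the table name and sort arguments are illustrative.

```java
import org.apache.spark.sql.SparkSession;

public class RewriteDataFilesExample {
  public static void main(String[] args) {
    // Assumes an Iceberg catalog named "demo" is configured on this session.
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-rewrite-data-files")
        .getOrCreate();

    // Compact small data files across the whole table.
    spark.sql("CALL demo.system.rewrite_data_files(table => 'db.events')");

    // Optionally sort while rewriting.
    spark.sql("CALL demo.system.rewrite_data_files(" +
              "table => 'db.events', strategy => 'sort', sort_order => 'id ASC')");
  }
}
```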
Apache Flink
- Flink 1.13 and 1.14 support. [#3116]
- Easier creation of Iceberg tables from Flink. [#3666]
- Streaming upsert support (see the sketch after this list). [#2863]
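To illustrate the table creation and upsert items above, here is a hedged sketch using the Flink Table API; the catalog type, warehouse path, and table definition are all hypothetical.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkUpsertExample {
  public static void main(String[] args) {
    TableEnvironment tEnv = TableEnvironment.create(
        EnvironmentSettings.newInstance().inStreamingMode().build());

    // Hypothetical Hadoop-backed Iceberg catalog.
    tEnv.executeSql("CREATE CATALOG iceberg WITH (" +
        "'type'='iceberg', 'catalog-type'='hadoop', " +
        "'warehouse'='hdfs://namenode:8020/warehouse')");
    tEnv.executeSql("CREATE DATABASE IF NOT EXISTS iceberg.db");

    // A primary key plus write.upsert.enabled turns streaming writes into
    // upserts; upserts require a v2 table.
    tEnv.executeSql("CREATE TABLE IF NOT EXISTS iceberg.db.users (" +
        "id BIGINT, name STRING, PRIMARY KEY (id) NOT ENFORCED) " +
        "WITH ('format-version'='2', 'write.upsert.enabled'='true')");

    // Rows sharing the same key are applied as updates rather than appends.
    tEnv.executeSql("INSERT INTO iceberg.db.users VALUES (1, 'alice'), (2, 'bob')");
  }
}
```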
Apache Hive
- The table listing API call in the Hive catalog can now return non-Iceberg tables. [#3908]
Conclusion
Apache Iceberg in 2022 is adding the features data engineers want and need, which is clear from the attention and momentum the project has received just over a month into the year.
This momentum is just getting going, with multiple announcements already in 2022. Iceberg will play a large role in the Subsurface 2022 conference, held live online March 2-3, which features several talks on Apache Iceberg. Register for the free conference here so you don’t miss any of the Iceberg sessions.
Subsurface 2022 Iceberg Sessions
- Tuning Row-Level Operations in Apache Iceberg (Anton Okolnychyi - Apple)
- Streaming from an Iceberg Data Lake (Steven Wu - Apple)
- An Open Data Architecture in Action with Apache Iceberg (Brock Griffey - Dremio)
- The Write-Audit-Publish Pattern via Apache Iceberg (Sam Redai - Tabular)
- Iceberg Roadmap (Ryan Murray - Dremio)
- What can Iceberg do for You? (John Milstein - Capitalize Analytics)
- Lessons Learned Making Open Table Formats Enterprise-Ready (James Malone - Snowflake)
Recordings of Iceberg Sessions from Subsurface 2020-2021
- Lessons Learned from Running Apache Iceberg at Petabyte Scale (Anton Okolnychyi - Apple)
- Iceberg Case Studies (Ryan Blue - Co-creator of Apache Iceberg, Tabular)
- Iceberg at Adobe: Challenges, Lessons & Achievements (Gautam Kowshik - Adobe)
- Hiveberg: Integrating Apache Iceberg with the Hive Metastore (Adrian Woodhead and Christine Mathiesen - Expedia Group)
- Deep Dive into Iceberg SQL Extensions (Anton Okolnychyi - Apple)
If you’re not currently using Apache Iceberg for your data lake, give it a test run by creating some Iceberg tables using AWS Glue.