November 7, 2023
The Why and How of Using Apache Iceberg on Databricks
Senior Tech Evangelist, Dremio
The Databricks platform is widely used for extract, transform, and load (ETL), machine learning, and data science. When using Databricks, it's essential to save your data in a format compatible with the Databricks File System (DBFS) so that either the Databricks Spark or Databricks Photon engine can access it. Delta Lake is designed for this purpose: it provides ACID transactions, time travel, and schema evolution, features that are standard across many table formats. Moreover, when used with Databricks, Delta Lake offers additional features not available in open source Delta Lake, such as generated columns.
However, Delta Lake is not the sole table format available. We've discussed and compared several others in past articles, delving into their open source community contributions and partitioning capabilities. One such format is Apache Iceberg, which we've previously demonstrated how to use with Databricks Spark.
Why Use Apache Iceberg with Databricks
One primary reason to consider Apache Iceberg over Delta Lake when working with Databricks is to avoid vendor lock-in. While Delta Lake is controlled by Databricks, Apache Iceberg provides a more neutral playing field. Aligning with a neutral, widely adopted standard means that your data solutions remain flexible and can easily transition between different platforms or vendors without significant overhaul. This ensures that your data infrastructure remains adaptable to changing business needs and technology landscapes.
Furthermore, the robust Apache Iceberg ecosystem offers a rich array of tools like Dremio, BigQuery, Apache Drill, and Snowflake, which have deeper integrations with Apache Iceberg than with Delta Lake. This ecosystem advantage means businesses can seamlessly leverage a broader range of technologies.
Additionally, Apache Iceberg boasts unique features that can be pivotal for many data operations. Its partition evolution capability allows for changes to partitioning strategies post facto, giving teams the flexibility to adapt to evolving data patterns without having to rewrite data. The hidden partitioning feature abstracts away the complexities of partition management, ensuring efficient data access while maintaining simplicity. These features, combined with its wide ecosystem, make Apache Iceberg an attractive choice for many organizations using Databricks.
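To make these features concrete, here is a brief, illustrative sketch of hidden partitioning and partition evolution in Spark SQL. The catalog and table names are hypothetical, and it assumes the Iceberg Spark runtime and SQL extensions are configured (as shown in Method #1 below).

```python
# Illustrative only: assumes an Iceberg catalog named "iceberg_catalog" with the
# Iceberg Spark SQL extensions enabled, and the `spark` session provided by the notebook.

# Hidden partitioning: the table is partitioned by a transform of the ts column,
# so readers filtering on ts get partition pruning without managing a separate
# partition column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS iceberg_catalog.db.events (
        id BIGINT,
        ts TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

# Partition evolution: swap the partitioning strategy later without rewriting data.
# Existing files keep the spec they were written with; new writes use hours(ts).
spark.sql("""
    ALTER TABLE iceberg_catalog.db.events
    REPLACE PARTITION FIELD days(ts) WITH hours(ts)
""")
```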
There are two approaches to using Apache Iceberg on Databricks: use Apache Iceberg natively as your table format, or use Delta Lake 3.0’s “UniForm” feature to expose Apache Iceberg metadata on your Delta Lake tables.
Method #1 – Use Apache Iceberg Natively
Using Apache Iceberg natively on Databricks offers several advantages and considerations. By adding the Iceberg jar and tweaking the appropriate Spark configurations, you can use Databricks Spark with Iceberg natively. This integration ensures that every transaction is captured as an Iceberg snapshot, enabling time-travel capabilities with tools that support Apache Iceberg. One of the flexible features of this setup is the freedom to choose any catalog based on the Spark configurations. This flexibility paves the way for catalog versioning with tools like Nessie, which can facilitate multi-table transactions and create zero-copy environments both within and outside Databricks.
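As a rough sketch of what that setup can look like, the snippet below registers an Iceberg catalog through Spark configuration. On Databricks these properties are typically set in the cluster's Spark config (with the matching iceberg-spark-runtime jar attached to the cluster), and the catalog name, catalog type, and warehouse path here are illustrative assumptions; a Hive, REST, or Nessie catalog can be plugged in the same way.

```python
# A minimal, illustrative Iceberg configuration for Spark; names and paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Enable Iceberg's SQL extensions (partition-spec DDL, stored procedures, etc.)
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    # Register an Iceberg catalog named "iceberg_catalog", here backed by a Hadoop
    # catalog; a Hive, REST, or Nessie catalog implementation could be used instead.
    .config("spark.sql.catalog.iceberg_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg_catalog.type", "hadoop")
    .config("spark.sql.catalog.iceberg_catalog.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Every commit through this catalog becomes an Iceberg snapshot. Using the events
# table from the earlier sketch, inspect snapshots via Iceberg's metadata tables:
spark.sql("INSERT INTO iceberg_catalog.db.events VALUES (1, current_timestamp())")
spark.sql(
    "SELECT snapshot_id, committed_at FROM iceberg_catalog.db.events.snapshots"
).show()
```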
However, there are certain limitations. Because Databricks employs a customized version of Spark, MERGE INTO transactions are not supported on Iceberg tables from Databricks Spark. This restriction doesn't apply when using the open source version of Apache Spark or other tools that support Iceberg.
Method #2 – Using the Delta Lake UniForm Feature
Delta Lake's Universal Format (UniForm) bridges compatibility between Delta tables and Iceberg reader clients. In essence, UniForm leverages the shared foundation of Delta Lake and Iceberg: both use Parquet data files accompanied by a metadata layer. Rather than rewriting data, UniForm produces Iceberg metadata asynchronously, allowing Iceberg clients to interpret Delta tables as if they were native Iceberg tables. This means a single set of data files can serve both formats. It is achieved by exposing an Iceberg REST catalog interface through Unity Catalog, which acts as the Iceberg catalog.
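For reference, here is a hedged sketch of enabling UniForm when creating a Delta table on Databricks; the table properties follow the Databricks UniForm documentation at the time of writing, while the three-level table name is a placeholder.

```python
# Enabling UniForm (asynchronous Iceberg metadata generation) on a new Delta table.
# "main.sales.orders" is a placeholder Unity Catalog name.
spark.sql("""
    CREATE TABLE main.sales.orders (
        order_id BIGINT,
        amount   DECIMAL(10, 2)
    )
    TBLPROPERTIES (
        'delta.enableIcebergCompatV1' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```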
The primary benefit of UniForm is the interoperability it introduces. Given the vast ecosystem around data processing, the ability to read Delta tables with Iceberg reader clients broadens the scope of operations and analytics that can be performed. This can be valuable for organizations using a mixed environment of tools, such as Databricks, BigQuery, or Apache Spark.
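As one illustration of that interoperability, the sketch below reads a UniForm-enabled table with PyIceberg through the Iceberg REST interface exposed by Unity Catalog. The endpoint path and token handling are assumptions to verify against the Databricks and PyIceberg documentation for your workspace.

```python
# Hypothetical example of an external Iceberg client reading a UniForm table.
# The REST endpoint path and the personal access token are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "unity",
    **{
        "type": "rest",
        "uri": "https://<workspace-host>/api/2.1/unity-catalog/iceberg",  # assumed endpoint
        "token": "<databricks-personal-access-token>",
    },
)

# Load the Delta/UniForm table by its Unity Catalog name and scan it as Iceberg.
table = catalog.load_table("main.sales.orders")
print(table.scan().to_pandas().head())
```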
However, there are some limitations:
- Deletion vector support: UniForm doesn't support tables with deletion vectors enabled, limiting its compatibility with certain table configurations.
- Unsupported data types: Certain data types like LIST, MAP, and VOID are unsupported for Delta tables when UniForm is enabled, potentially restricting the types of data that can be managed.
- Write operations: While Iceberg clients can read data from UniForm-enabled tables, write operations are not supported, which can impact the ability to modify data in such tables.
- Client-specific limitations: Specific limitations may be tied to individual Iceberg reader clients, regardless of the UniForm feature, potentially affecting the behavior of certain client applications.
- Delta Lake features: Although some advanced Delta Lake features like Change Data Feed and Delta Sharing are operational for Delta clients with UniForm, they may require additional support when working with Iceberg.
- Concurrency constraints: Only one asynchronous Iceberg metadata write process can run at a time, potentially leading to delays in reflecting all commits to the Delta Table in the Apache Iceberg metadata.
Summary
| Feature/Consideration | Apache Iceberg Natively | Delta Lake UniForm Feature |
| --- | --- | --- |
| Integration | Directly with the Apache Iceberg jar | Through Universal Format (UniForm) |
| Snapshot capture | Every transaction captured as an Iceberg snapshot | Asynchronous Iceberg metadata generation |
| Time-travel capabilities | With any Iceberg-supporting tool | Iceberg time travel limited to select snapshots; possible inconsistency |
| Catalog versioning | Flexible, with tools like Project Nessie | Not possible; must use Unity Catalog |
| Transactions (MERGE INTO) | Not supported from Databricks' custom Spark | Supported |
| Driver resource consumption | No change | Might increase |
| Deletion vectors | Supported via Iceberg merge-on-read | Not supported |
| Unsupported data types | None; all Iceberg types supported | LIST, MAP, VOID |
| Write operations to the Iceberg table from other engines/tools | Supported | Not supported with UniForm |
| Advanced Delta features (CDC, Delta Sharing) | N/A | Limited support for Iceberg clients |
| Metadata consistency with latest data | Immediate | Asynchronous; may lag |
Navigating the intricacies of table formats in data analytics can be challenging. The Databricks platform provides a formidable setting for machine learning and data science applications, with Delta Lake being its flagship table format. However, as data landscapes continuously evolve, organizations must remain flexible and forward-thinking. In this light, Apache Iceberg emerges as a significant contender.
Its neutral stance, broad ecosystem compatibility, and unique features offer compelling advantages over Delta Lake, especially for those keen on avoiding vendor lock-in. But as with every technology decision, there are pros and cons to weigh. While Apache Iceberg's native integration offers a seamless experience, certain limitations of Databricks' customized Spark version might be a deal-breaker for some. On the flip side, while Delta Lake's UniForm feature provides broad compatibility, it comes with its own set of constraints, particularly around data types and metadata consistency.
Our deep dive into both methods reveals that there's no one-size-fits-all answer. The decision hinges on your organization's specific needs, the existing tech stack, and long-term data strategy. Whether you lean toward the native integration of Apache Iceberg or opt for the UniForm feature of Delta Lake, ensure that the choice aligns with your overarching business goals. As data becomes increasingly pivotal in decision-making, ensuring you have the right infrastructure to manage, analyze, and derive insights from it remains paramount.
Tutorials for Trying Out Dremio (all can be done locally on your laptop):
- Lakehouse on Your Laptop with Apache Iceberg, Nessie and Dremio
- Experience Dremio: dbt, git for data and more
- From Postgres to Apache Iceberg to BI Dashboard
- From MongoDB to Apache Iceberg to BI Dashboard
- From SQLServer to Apache Iceberg to BI Dashboard
- From MySQL to Apache Iceberg to BI Dashboard
- From Elasticsearch to Apache Iceberg to BI Dashboard
- From Apache Druid to Apache Iceberg to BI Dashboard
- From JSON/CSV/Parquet to Apache Iceberg to BI Dashboard
- From Kafka to Apache Iceberg to Dremio
Tutorials of Dremio with Cloud Services (AWS, Snowflake, etc.)