Gnarly Data Waves

GDW Special Edition 1 | July 11, 2024

Apache Iceberg Lakehouse Crash Course – What is a Data Lakehouse and What is a Table Format?

"An Apache Iceberg Lakehouse Crash Course," a comprehensive webinar series designed to deepen your understanding of Apache Iceberg and its role in modern data lakehouse architectures. Over ten sessions, we'll cover everything from the basics of data lakehouses and table formats to advanced topics like partitioning, optimization, and real-time streaming with Apache Iceberg. Don't miss this opportunity to enhance your data platform skills and learn from industry experts.

Watch this introductory session where we unravel the concepts of a data lakehouse and table formats.

Learn about:

  1. The evolution of data lakehouse architecture
  2. Key differences between data lakes and data warehouses
  3. How table formats like Apache Iceberg enhance data management
  4. Benefits of using a lakehouse model for your organization

An Apache Iceberg Lakehouse Crash Course is a 10-part web series designed to help you master Apache Iceberg, a powerful table format at the heart of data lakehouse architecture.

Our expert-led sessions will cover a wide range of topics to build your Apache Iceberg expertise.  Each session will offer detailed insights into the architecture and capabilities of Apache Iceberg, along with practical demonstrations.

Whether you’re a data engineer, architect, or analyst, this series will equip you with the knowledge and skills to leverage Apache Iceberg for building scalable, efficient, and high-performance data platforms.

Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors. 

Opening/Curriculum

Alex Merced:

Hey, everybody, this is Alex Merced, and welcome to the first session in this ten-part series on Apache Iceberg, our Apache Iceberg crash course. In this session, we'll be talking about what is a data lakehouse and what is a table format. But before we get started with the actual content, I have a few quick stops we need to make. First off, let's just take a look at the schedule we have for the ten sessions in this particular course, which will all be made available on demand afterward. Today, we'll be covering what is a data lakehouse and what is a table format—the foundation for what we're talking about when we talk about Apache Iceberg. Next week, we'll actually start getting into the architecture of the different table formats, and we'll just keep going from there. Here are the dates; I hope to see you at each one. They're going to be good times.

Also, I am one of the co-authors of Apache Iceberg: The Definitive Guide, and if you would like a free copy of this O'Reilly book, scan the QR code on your screen right now to get a free copy of Apache Iceberg: The Definitive Guide. Again, this is a special presentation as part of the Gnarly Data Waves program. We do regular broadcasts as part of Gnarly Data Waves, with all sorts of different speakers and presenters all around the data lakehouse. We have plenty of episodes with a lot of great content that have already aired, and many more to come, so do subscribe by either subscribing to the Dremio YouTube channel, subscribing to Gnarly Data Waves on Spotify or iTunes, or just heading over to dremio.com/gnarly-data-waves. Do continue watching our Gnarly Data Waves broadcasts. Final thing: if you have any questions throughout the presentation, after the presentation for any of these episodes, or just want to learn more about Apache Iceberg in general, head over to the Dremio community, where there is an Apache Iceberg category in which you can ask questions specifically about the Apache Iceberg format conceptually. I'll monitor that to answer any questions and point people in the right direction, and you can again find it at community.dremio.com.

Traditional Data Systems

Alex Merced:

With that, let's begin our presentation. What we're trying to do here is lay the foundation, lay out the context in which we need to understand the topic at hand, which is Apache Iceberg. This all revolves around the world of the data lake. To understand why we would need a data lake, we need to understand the systems that came beforehand. When it comes to the data world, we usually work with very enclosed data systems, such as databases and data warehouses. We love these pieces of software because they do a lot of things for us. They package and abstract away things that we need to do, such as how to store the data. If I create a table in Postgres, I have no idea how that data is physically being stored. I don't need to, in order to take advantage of the Postgres platform; this goes for any database or any data warehouse. They also abstract away the question of how do I know which bits of this stored data are individual tables? There's a table format; it's a way of understanding that, and it's baked in. How do I know that there are 10 tables, 20 tables, or 30 different tables within the system? Well, that's a catalog. That's baked in. Then I'm going to ask questions about that data, so there needs to be some sort of processing system that parses my SQL query, creates an execution plan from that SQL, and then executes that plan. Generally, all of these are baked into this system. Typically in a database, we're talking about transactional workloads, often with data structured into rows. If we're talking about data warehouses, we're often talking about analytical workloads, so the data is most often stored in columnar form. But the problem with this whole setup is that each system does it in its own way, and everything is tightly coupled. The data inside that system can only work within that system. If I need to use another system for some feature that I really like, or for some other use case that the system I'm using doesn't quite work for, I would have to move a copy of that data into that other system, which increases the amount of work and maintenance I have to do because I have to move the data in there and then maintain consistency of that data.

What is a Data Lake?

Alex Merced:

This begs the question: wouldn’t it be nice to have a copy of the data that anyone can access, where multiple tools can work with the same data? That’s when we started moving towards the concept of the data lake. The idea is: how about this? How about we just store the data as files, and then we can use different tools to work with those files? This began with something called Hadoop, which is basically a system to take a bunch of computers and treat them as one large storage layer with a whole ecosystem of tools, such as Hive, Pig, Impala, and whatnot, that work on this storage structure. Essentially, what it is, is you store data. I could store a bunch of Parquet files with structured analytical datasets, I could store images, basically unstructured data that doesn’t fit into that table-like structure, text files that might have sensor data or emails, and whatnot, I could just store everything there. That’s the beauty of the data lake—it can just take all the data. 

The problem is that maybe these four Parquet files make up one particular table, and there’s no way for me to automatically know that. The Hadoop system or an object storage system like S3 or ADLS doesn’t magically know that these files are a table and these files are another table. It just knows these are the files that are stored. This limited the ability of tools to interact with this storage layer because they don’t necessarily have that metadata, understanding, or context of the data on this storage layer to access it efficiently and intelligently. So, the data lake paradigm was often used for ad hoc querying and was limited in [many] ways. But still, the dream of having one copy of the data that can work with multiple tools was still there. This is where we started moving towards what’s called the Data Lakehouse. 

What is a Data Lakehouse?

Alex Merced:

When we look at the data lake, what do we have? We have a storage layer and tools that can work with that storage layer—tools like Dremio. One of the main things that we do is make it much easier and faster to query a data lake. We also have tools like AWS Athena and many more out there. But it’s not quite a data warehouse yet. There are a couple of pieces missing. If we take that storage layer and all those tools that we could use to query that storage layer, and then add a few layers in between to make it more like a database or a data warehouse system, what was missing? A table format and a catalog. So a way to identify things as tables and somehow be able to collect and understand which tables exist. That’s what’s missing. When we add that in there, you now have all the functionality of a data warehouse but in this sort of decoupled, modular structure. That’s what a data lakehouse [is]. The result is you have data sources; you can now just ingest the data once into your data lake storage, but because you have a table format providing metadata over those data files and a catalog tracking which tables exist, tools like Dremio, Snowflake, Apache Flink, Apache Spark, Upsolver can all begin working with that system—the data lakehouse system. That’s the beauty of it: one copy of the data that allows you to work with many tools to satisfy many use cases instead of having to worry about additional data movements and all the consistency challenges that come with that.
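
To make this decoupled layering concrete, here is a minimal PySpark sketch, not shown in the webinar: the catalog name "lake," the local warehouse path, and the table names are hypothetical, and the Apache Iceberg Spark runtime jar is assumed to be on the classpath. Storage sits underneath, an Iceberg catalog sits in the middle, and Spark is just one of the engines that could share those tables.

    from pyspark.sql import SparkSession

    # One possible wiring: a filesystem ("hadoop") Iceberg catalog named "lake"
    # over a local warehouse path. Swap in S3/ADLS paths and a different catalog
    # type for a real deployment.
    spark = (
        SparkSession.builder
        .appName("lakehouse-sketch")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.lake.type", "hadoop")
        .config("spark.sql.catalog.lake.warehouse", "file:///tmp/lakehouse")
        .getOrCreate()
    )

    # The same table is now addressable by any engine configured with this catalog.
    spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.db")
    spark.sql("CREATE TABLE IF NOT EXISTS lake.db.users (id BIGINT, name STRING) USING iceberg")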

Parquet Data Files

Alex Merced:

So, let's dive a little bit deeper into what is a table format. To appreciate this, we need to understand the fundamental unit when we talk about a data lake: a file. When it comes to analytics, the standard way to store structured data is a Parquet file, an Apache Parquet file. An Apache Parquet file improves upon data formats like CSV and JSON. When you use something like a CSV file or a JSON file, you have a couple of different problems. One, they're text files, which means you're using text encoding, which takes up more space for the same content. Two, especially CSV files, they're inherently row-based. So, basically, if I wanted to do something like give me the total of all sales, I would have to read every row when all I really want is, let's say, that total column, so I can add up the total of each sale. This doesn't lend itself well to efficient analytics. A Parquet file fixes these problems. One, it's in a binary format specifically engineered for the structure and purposes of analytics, so it can store the same information in a much smaller amount of space. Two, it structures the data as columns. Instead of having groups of rows, you have groups of columns. So, that column of total sales would essentially be a separate data structure that I could pull individually. I don't need to pull the rest of the rows' data to get that column. But you may not necessarily need that particular column for every record, so, to make it more efficient to scan, you have row groups. What they do is take those groups of columns and break them up into chunks of rows. Each row group is made up of the columns for those rows. The beauty of this is that the file has a footer with metadata about each row group. Imagine saying, "Hey, this row group contains all users with an age between 20 and 30. This row group has users between 30 and 40." This allows query engines, something like Dremio, to do what's called predicate pushdown. Instead of scanning the whole file, it can say, "Okay, based on the data in the footer, do I even need to scan this row group at all?" I can skip row groups based on that metadata and more quickly scan the file. Because it's columnar, I can grab only the columns I need. Overall, it allows for much faster data loading and processing: I load only the data I need from the file instead of loading the whole file, and then I can process that data very quickly because it is columnar and can be quickly converted in memory to formats like Apache Arrow, which allow for very fast in-memory processing.
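
As a concrete illustration of those two ideas, column pruning and row-group skipping, here is a small sketch using the pyarrow library (the file name and columns are made up for the example): the column list and filter let the reader consult the footer statistics and decode only what the query needs.

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    # A small table written with deliberately small row groups, so the footer
    # stores min/max statistics for each chunk of 25 rows.
    table = pa.table({
        "user_id": list(range(1, 101)),
        "age": [20 + (i % 40) for i in range(100)],
        "total": [round(i * 1.5, 2) for i in range(100)],
    })
    pq.write_table(table, "users.parquet", row_group_size=25)

    # Read back only the "total" column, filtered on "age": the reader can skip
    # row groups whose footer statistics cannot match, instead of scanning the
    # whole file.
    dataset = ds.dataset("users.parquet", format="parquet")
    result = dataset.to_table(columns=["total"], filter=ds.field("age") < 30)
    print(result.num_rows)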

Defining Multi-File Datasets

Alex Merced:

The problem is that some datasets are bigger than a single Parquet file. While a Parquet file is really good for doing analytics, what happens as your datasets get bigger and bigger? You might have a situation where your users table might be, let's say, in this case, three Parquet files. Each Parquet file has different row groups. In that case, I have to describe to whatever tool I'm using that these three files are a single table. If I do that manually, it leads to all sorts of potential problems. What happens when one of my analysts accidentally says that only the users-one and users-two Parquet files are the table, but forgets to include users-three? You could have inconsistently defined datasets when people do analytics because there isn't an authoritative way of defining the table. So we have all these Parquet files that make up one dataset, but it's up to the user to define the dataset because these files don't understand each other. They're not aware of each other. Each one is a unique file containing a piece of the data. We need some sort of glue to treat them all as a bundle. This is where a table format comes in.

The Hive Approach to a Table

Alex Merced:

The original table format was something called Hive. Before there were things like Apache Iceberg, Delta Lake, and Apache Hudi, which I'll talk about in more detail in the next session, there was Hive. A little history on Hive: it was created because you had all this data in Hadoop, in the Hadoop storage system. You had to use a tool called MapReduce to run analytics on it, and MapReduce required you to use a framework in Java to write an analytics job. You'd have to write this Java code that would define the dataset, say, "Hey, these files are the dataset," and then these are the operations I want to do to it. Most people doing analytics probably don't know Java these days. Even if you did know Java, it isn't the easiest way to write basic analytics queries. People wanted to do things the way they're used to, which is SQL, the Structured Query Language. Hive was created to make this process easier. It would say, hey, you give us the SQL, and we'll translate it into a MapReduce job. Other frameworks came along later, such as Tez, which could supplant that, but the idea was that Hive would translate to something—initially, it was translating to MapReduce. But to do SQL, you need to know what a table is. There has to be a way to understand, just based on the name of a table, what the table is. The paradigm Hive used was a folder. If there's a folder called "user," the files in that folder are the user table. This became the way we would recognize what's a table. And that worked; it allowed us to write SQL and made asking basic questions much easier.

The problem is the work that the query engine has to do because the way we define the table is so simple. We don't know how many files are in that folder. We know that the folder is the table, but now the engine has to do file-listing operations. It has to bring up a list of every file in the folder and then individually read each one. Every time you open and close a file, that's work the computer has to do, and the more work, the longer it will take to execute the query. I might not even need all those files, so I might end up having to read 1,000 files when I really only need 30 of them to accomplish the particular query I'm doing.

So, not ideal. It was also really hard to do granular updates because generally, you would track the tables through something called a Hive metastore. The Hive metastore would track, "Hey, this folder is the users table," but it would also track what's called partitions, so like subtables within that table. But that's it, so it would say, "Hey, here's the users folder and these folders are the partition folders within that table," but never track the individual files. If I wanted to update one record, I'd have to update the whole partition. That could be a much more expensive and time-consuming operation. You also had very limited table evolution. If I wanted to add a column, I would potentially have to rewrite every data file and rewrite the whole table. If I wanted to change anything about how the table works, how it's partitioned, the schema, and so on, it would require rebuilding it. All these issues made it difficult to do the same type of work you would do in a data warehouse in a data lake.

Modern Approach to a Table

Alex Merced:

So while this made it much more possible to do analytics on the data lake, it didn't make it easy enough to make the data lake where you do all your analytics. A new generation of table formats began to arise that work more like this. Essentially, you have a group of files, but instead of saying, "Hey, the table is the files in a folder," we say, "Hey, we're going to have this metadata off to the side." This metadata will tell you which files are in the table. It doesn't matter where they're physically located. The metadata tells you, "users-one.parquet, users-two.parquet." Even if there are another 300 files in the folder, they're not part of the table. Only the files defined in the metadata are part of the table. What is the schema of the table? Whatever the metadata says. Even if the files might be missing a column or have an additional column, we apply the schema defined in the metadata so that when the query engine creates the output after reading the files, it's always going to conform to that schema, which makes it possible to change the schema. The same thing applies to partitioning. We can say, "Hey, this file and that file are partitioned this way," and we're aware of it because of the metadata, allowing us all sorts of flexibility with how that data is managed. By having all this metadata, I can do, from the metadata, a lot of the figuring out that I'd otherwise have to do by opening each file individually. I can open a handful of files and get a lot of work done.
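
As a deliberately simplified illustration (this is not the actual Apache Iceberg metadata layout, which the next session covers), the idea can be sketched as a plain structure: the table is whatever the metadata says it is, regardless of what else happens to sit in the folder.

    # Hypothetical, simplified sketch of what table-format metadata conceptually
    # records; real Iceberg metadata is split across several files and tiers.
    conceptual_table_metadata = {
        "schema": [
            {"id": 1, "name": "user_id", "type": "long"},
            {"id": 2, "name": "signup_date", "type": "date"},
        ],
        "partitioning": [{"source_column": "signup_date", "transform": "month"}],
        "current_snapshot": {
            "data_files": [
                "warehouse/users/users-one.parquet",
                "warehouse/users/users-two.parquet",
                # users-three.parquet may sit in the same folder, but if it is
                # not listed here, it is simply not part of the table.
            ]
        },
    }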

An analogy I like to use when it comes to metadata is to imagine a kitchen. If I go to my kitchen, there are a bunch of drawers and cabinets. If I want to find one thing, like a fork, and I don't know where the forks are, I might have to open every cupboard and pull out every drawer to find it. That's going to take a while. But if I had a clipboard with a list of every item in the kitchen and their location, I could save myself a lot of time, grab the clipboard, say, "Hey, where are the forks," and go directly to where the forks are. That's essentially what metadata is—an index. It's a separate structure that tells you about the thing you want to look through, making it easier and faster to look through it.

Benefits 

Alex Merced:

By using this metadata-centered approach that defines not just what folder the data is in but which individual files the data is in, you're able to fix all the problems associated with the legacy Hive format, making life a lot easier. This enables much faster scanning of the data and the ability to have multiple historical versions of the data available, because, again, the table is whatever is in the metadata, not just the files in the folder. So I might have files in the folder from older versions of the table, and that's fine, because the metadata is not going to tell the engine to look at those files unless it is querying an older version of the table. So you can enable time travel, ACID transactions, and the ability to do many more changes, like updates and deletes, and do them in a safe way. You have atomicity, which means changes are all or nothing, meaning they either happen or they don't. You have consistency, so that anyone who accesses the table is generally accessing the same version of the table. Isolation keeps transactions that are happening at the same time from interfering with each other, and durability makes sure that, "Hey, this data exists and will continue to persist after the transaction." You have efficient updates because you have much more granular information in that metadata, and you have the ability to evolve the table.
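
For example, here is a minimal sketch of the time travel this enables, reusing the hypothetical "lake" catalog and Spark session from the earlier sketch and assuming a lake.db.users table with some history (the timestamp is made up, and the TIMESTAMP AS OF syntax requires a recent Spark release):

    # Inspect the table's snapshot history via Iceberg's metadata tables.
    spark.sql("SELECT committed_at, snapshot_id FROM lake.db.users.snapshots").show()

    # Query the table as it existed at an earlier point in time.
    spark.sql("SELECT * FROM lake.db.users TIMESTAMP AS OF '2024-07-01 00:00:00'").show()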

So these are things that become possible with modern table formats. The table formats that exist now achieve this and structure the solution in different ways. And there are three of them: Apache Iceberg, Apache Hudi, and Delta Lake. In the next session of this crash course, I'll be getting into the differences in the architecture of the three formats to give you a better idea of how they work and the trade-offs and choices they made. And a lot of that has to do with their origin stories and understanding where they come from. But we'll talk about that in the next session.

Laptop Exercise with Iceberg + Dremio Lakehouse

Alex Merced:

It's never too early to start getting hands-on with Apache Iceberg. So if you want to get hands-on with Apache Iceberg, scan that QR code that's on the left for a really fun hands-on exercise that you can do from your laptop without having to spin up any cloud infrastructure. If you like that exercise, and you want to do one that's a little bit deeper, we have these three exercises here that actually will take you through the whole experience end-to-end, from ingesting data, all the way to delivering a BI dashboard. You can find all of these and much more over there on the Dremio blog. And again, if you have any questions, make sure to go submit them over there [in] the Apache Iceberg category in the Dremio community. I'll be monitoring it for any questions, so that way I can give you guys your answers, and I'll see you all at the next section. Thank you so much. Have a great day, and enjoy. 

Q&A

Alex Merced: 

Hey, everybody, it's me. Hopefully, you all enjoyed the first of the ten sessions. We're going to cover each one in more depth. In this session, we really dove deep into what is a lakehouse and what is a table format to really help build that foundational understanding. This will help you not only know what Apache Iceberg is but also appreciate why it is the way it is, culminating with the final two episodes, which will include hands-on exercises with Iceberg, using Spark and Dremio. When possible, I will do a live Q&A at the end of the sessions. If you have any questions, please put them in the Q&A box on the screen; just click on Q&A and ask your questions, and I'll go through them.

So, the first question we have from Robbie is about the format of the footer. I'm assuming that's referencing, what is the format of the footer in a Parquet file? Now, all that's built into the Parquet specification. So essentially, it's its own sort of custom format of data within the footer, and the Parquet file is all binary encoded, not text encoded. So, I would refer you to the Apache Parquet website to see the details of the specification and the specific items within the footer and how it's structured.

Loading real-time messages in Iceberg: Is Iceberg optimized for real-time analytics? Because in this case, we are inserting the data row by row. Each insertion process makes a new snapshot, and we'll end up with several snapshots. I think this may make things slower. Basically, yes, but there are different tools you can use for streaming into Apache Iceberg. We'll be doing a separate session specifically about streaming. For example, let's say you're using the Kafka connector to ingest into Apache Iceberg. What it'll do is, as it collects data from the Kafka topic, it will write data files as it goes and then try to commit periodically. That way, instead of creating a commit for every file or every record that comes in, it will make little mini-batch commits that might happen every few seconds or few minutes, depending on your settings and the amount of data coming in. So it won't necessarily be that bad. But even then, even when you're streaming, and this goes for pretty much any table format, if you're streaming data in, you are going to have a lot of snapshots and a lot of small files being written. This is generally why you're going to want to have regular maintenance going on. So, in this case, compaction. We'll have a separate episode on optimizing Apache Iceberg tables as well. For example, if I'm streaming and I have data coming in consistently, I might want to have different tiers of optimization jobs. If I have data coming in, maybe I have a job that optimizes the data for the last hour, so that everything that came in during the last hour gets rewritten into fewer data files. And then maybe I might want to have another job that runs at the end of the day and rewrites the data for the day so the day is written more efficiently, and then maybe another periodic one, like at the end of the week or the month. Depending on where your SLAs are, you'd take a look at what the performance is and tweak the necessary maintenance from there. If a weekly maintenance job does the job and queries are fast enough, then that's all you need, but you keep fine-tuning those maintenance jobs until you get the performance you need. But yes, with streaming, there's always that challenge of writing more files as they come in. But Iceberg has a suite of tools and services that help with that. For example, there are automated table management services from companies like Dremio and Upsolver, and a variety of other companies can provide that for you. That way, as the data comes in, it kind of takes care of that for you. So let me just mark these questions as answered.
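
To sketch that tiered-maintenance idea (reusing the hypothetical "lake" catalog from earlier and a made-up lake.db.events table with an integer event_hour column), Iceberg's Spark procedures let a small job compact just the most recent slice while a broader job sweeps the whole table:

    # Hourly job: rewrite only the small files for the most recent hour.
    spark.sql("""
        CALL lake.system.rewrite_data_files(
            table => 'db.events',
            where => 'event_hour = 13'
        )
    """)

    # Broader (e.g., nightly or weekly) job: let compaction consider the whole table.
    spark.sql("CALL lake.system.rewrite_data_files(table => 'db.events')")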

Now, when I create a CSV format for the table I create, it changes the name of the table text. How can I keep the same name that I had given when creating the table? Well, I would need more context on that question because technically, it also depends on what tool you're specifically creating the Iceberg table from. So, I'm assuming what you're talking about is you have a batch of CSV files and you're trying to rewrite them as Iceberg tables, which would mean they would eventually get written into Parquet files. Now, essentially, it should just keep the name, depending on what tool you're using. So, if I use a tool like Dremio to do a COPY INTO, usually what would happen is I would create the table first, and then, using the COPY INTO command, I can say, "Hey, take the data from these CSV files and copy them into this Iceberg table," and it would automatically superimpose the right schema. It would just land that data fine, and the table name would stay the same. Or I could use a CREATE TABLE AS statement. The last two episodes will be about ingesting data, so we'll get hands-on with ingesting data into Iceberg using Spark and Dremio. But generally, for the naming challenge you're mentioning, if you can give me more details, post that in the Iceberg topic in the Dremio community at community.dremio.com. Just give me details about what tool you're using and the workflow so that I can better understand any nuances that might help address that.
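
As one hedged illustration of the CREATE TABLE AS route with Spark (the file path, catalog, and table names are hypothetical), the table keeps the name you give it and the engine handles writing the Parquet data files underneath:

    # Read the raw CSV files, inferring a schema from the headers.
    csv_df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/data/raw/customers/*.csv")
    )

    # CTAS-style write: creates (or replaces) an Iceberg table with this exact name.
    csv_df.writeTo("lake.db.customers").using("iceberg").createOrReplace()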

Will the subsequent sessions focus more on the Iceberg format? Yes, this [episode] and the second episode are going to be much more foundational. In the next episode, I'm going to go through each of the three formats—Iceberg, Hudi, Delta Lake—and talk about the structures, and how they work; you’ll get to see the mechanics of each one, and then going forward, we're going to go deeper and deeper into Iceberg. We did post a table of contents at the very beginning of the episode, and I'll do that in each episode to show you where we are in the schedule. These first two episodes lay the foundation and context. After that, it’s [about] getting specifically into Apache Iceberg’s features and structure, going through some example Apache Iceberg transactions, talking about issues with optimizing Iceberg tables, and streaming into Iceberg tables, getting into these more nuanced conversations. We’ll also point you to resources and hands-on exercises to try these different things out. So, that answers that question. Let me mark these off.

How [does] Iceberg handle partitions and how needed is [that]? [That’ll be] in episode number 5, [which] is a whole episode just on partitioning with Iceberg. It's really cool and probably my favorite thing about Iceberg, how Iceberg handles partitioning. In the metadata file––there’s a top-level metadata file with three tiers of metadata. In that top tier, there's an array of partitioning schemes that track the partitioning column and whether it is based on the transformed value of another column. You can say, "I want my partition to be based on the value of this timestamp column when transformed into a month value, a day value, or an hour value." We’ll get fine-grained about how that works with Iceberg. Episode number 5 [is all about] partitioning, hidden partitioning, partition evolution, and how Iceberg’s unique design enables that. It’s one of my favorites because it’s one of the places where Iceberg is the most different from everything else and brings unique value as a table format. Let me go back to the Q&A. 
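
As a small preview of what that looks like in practice (hypothetical table and column names, continuing the earlier "lake" catalog sketch), the partition is declared as a transform of the timestamp column rather than as a separate column you have to maintain yourself:

    spark.sql("""
        CREATE TABLE IF NOT EXISTS lake.db.orders (
            order_id BIGINT,
            amount   DOUBLE,
            order_ts TIMESTAMP
        )
        USING iceberg
        PARTITIONED BY (months(order_ts))  -- could also be days(order_ts) or hours(order_ts)
    """)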

Adaptive schema possible with Parquet: I'm going to assume that we're talking about schema evolution, what happens if I change the schema later on? The top-level metadata file tracks the schema of the table at a high level. If an engine like Dremio is reading the table, there might be different Parquet files written at different points in time with different schemas because the table's schema changes over time. Dremio is going to know, "Hey, this is the current schema, this is the source-of-truth schema." As it reads the data files, it will ensure that when it returns the result, it will coerce the data it reads from every data file into the current schema. And the Apache Iceberg documentation does document the allowed table evolutions. For example, if you want to change the schema, which data type conversions are supported or not? Certain things may not work, like you won't be able to change a text field into a float or something like that. But there are a lot of compatible data type conversions. Renaming is pretty easy to do because, within the metadata, every column is given a unique ID. So if you rename it, it still knows what the column is because it references it by the ID. If you change column zero from 'sales' to 'sales by timestamp,' it will still know it's that column because it still has the underlying ID of zero. Those IDs are actually written into the Parquet files when they're written for Iceberg, as well as into the metadata. So, it matches that up and doesn't have to worry about it, making renaming pretty easy to execute. You can adapt and change the schema. Generally, the metadata is there to help the engine figure out how to apply that to the data. So usually, from a user experience, it just feels like you're working with any data warehouse or database. You just alter the table and change the schema, and it just works.
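
A minimal sketch of what that user experience looks like (hypothetical table and column names, continuing the earlier "lake" catalog example): adds and renames are metadata changes, and the ID-based column tracking described above is what keeps already-written Parquet files readable.

    # Add a column: existing data files are untouched; readers see NULLs for old rows.
    spark.sql("ALTER TABLE lake.db.orders ADD COLUMNS (discount DOUBLE)")

    # Rename a column: safe because Iceberg tracks columns by ID, not by name.
    spark.sql("ALTER TABLE lake.db.orders RENAME COLUMN amount TO sale_amount")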

What are your thoughts on the reasons why Apache Iceberg is becoming the standard format over Hudi and Delta Lake? My opinion on this––this is sort of my opinion. I do spend a lot of time thinking about this question; there are a couple of different factors. I think one of the things that people want is a rich ecosystem around a standard, because we're not talking about just a tool. Like, Iceberg is not just a tool. For example, there are query engines and ingestion tools, and I can switch those as I need them. But how I write my data, the actual format that I write my data in, whether it's the file format or, in this case, the table format that's on top of those files, you want that to be something that's going to work with everything on a fairly equal footing. Because you don't want to be rewriting that data over and over again for different tools. So it's a very foundational piece. This is why the whole open thing matters, because the more open something is, the more neutral the playing ground that particular thing exists on. Apache Iceberg, by being an Apache project, signals that it exists on that neutral playing ground. You can say the same thing about Hudi as well. They're both Apache projects, so they both operate in a way that many companies contribute to them. There is no particular commercial entity that has control over the project. They evolve based on the community's needs.

While Delta Lake has been contributed to the Linux Foundation, the Linux Foundation doesn't necessarily impose the same kind of governance standard. At the end of the day, Databricks still has a fair bit of control over the evolution of the Delta Lake format. That's not necessarily a bad thing; there are pros and cons to that. The more control an entity has, the quicker it can update things. But it also means that if you're a company in the ecosystem looking to build your business around the data lake, and you were to go all in on Delta Lake, you run the risk that if Databricks changes the format completely overnight, there's not much you can do about that. You then get forced into code changes, making it harder to get the same ecosystem traction.

So that answers that part, but how about the part with Iceberg versus Hudi? I do think the documentation for all three formats wasn't the greatest in the early stages. But I find Iceberg to be one of the easier ones to understand. As you'll see when we get to the structure, the structure is fairly elegant and not hard to grasp, and that matters a lot. People don't want to adopt something they don't understand, that they don't get. They want to be able to say, "Okay, yeah, I get how this mechanically works, so I feel comfortable using it because I know it's not magic." I think being one of the easier ones to understand in those early stages helped it get more mind share. On top of it, the Iceberg project basically does a very specific thing. It stays within a specific lane, saying, "Okay, this is how you write the metadata," and that's pretty much it. It creates the standard for how you write metadata. On how you read that metadata, the Apache Iceberg project is neutral. It creates libraries so that tools like Dremio, Snowflake, and others can build that support, but it doesn't try to build out its own query engine or optimization service. The benefit of that is it creates a need for an ecosystem. This means you end up having many companies saying, "Hey, look, I want to offer this thing to the Iceberg community," which gives them an incentive to go out there and promote and talk about Iceberg and whatnot.

And that's what you've seen: you've seen companies like Snowflake, Dremio, Upsolver, and many others who spend a lot of time talking about Iceberg because they're able to offer something to that, and that creates this momentum that you hear about. So one, they can build a business around it because they know that there's not some other company out there that can just flip the script on them, and two, they know the scope of the project, so they know, "hey, here's where I can add value," and the project is not going to creep into that. While Hudi is very all-inclusive, Hudi has this thing called DeltaStreamer, and again, I'm going to get into a lot of this in the next episode. But it has this thing called DeltaStreamer that does a lot of the things that other companies would offer for you. So, basically, it becomes a little bit––and there's a reason for that. You'll see when I talk about Hudi that it was built for low-latency streaming, and it was built as a framework specifically for that. So, it's not just the metadata standard; it's the whole thing, the whole apparatus.

So, it was trying to solve a different problem. But the problem it was solving was streaming, not necessarily general-use-case tables, which is what Iceberg was trying to solve, saying, 'Hey, we just want to be able to have tables on the data lake. How do we create something that solves most table use cases?' Specifically, Netflix was trying to move away from Hive. So, I think a lot of those factors led to it. One is that Iceberg was designed for more general use cases, so it works pretty well in most places, making it a pretty good thing to default to. But also, the fact that it's open and has such an ecosystem, all these things really create that momentum. Cool.

Okay, so let me go on to the next question. And again, these are just my thoughts. These are my opinions; don't treat them as the final word, because all these formats will still have a place. There are reasons to use all three formats. None of these are going to disappear, but there are reasons why Iceberg is becoming the default, the standard that people gravitate towards, using the others for different types of use cases but defaulting to Iceberg. The partitioning features, to me, are also a really big one feature-wise. But I'll save that for the partitioning episode.

What are the factors to consider when choosing a table format for a project? I will talk more about that in the next episode. I'll get into the individual structure of each one. But the shorthand, I would say, is that you can handle streaming pretty well with Apache Iceberg. But Hudi was designed specifically for streaming. So where you have the lowest-latency needs, like streaming the data, using Hudi on the ingestion side can sometimes make sense. And then, if you're using Databricks, people oftentimes like to use Delta Lake because Databricks is built around the Delta Lake format. But again, when it comes to general use cases, in most situations, you're going to find that Iceberg just makes a lot of sense. It has the richest ecosystem and the richest documentation and support information out there. So oftentimes, it's a great place to default to. Like, if you basically say, 'Hey, I have my data gravity here,' and if I have certain things where I need to have a splash of other things, you do that. But the idea is, if you're building your data lakehouse and saying, 'Hey, this is how my data sits,' 90% of the time, I would opine that Iceberg makes the most sense. You're going to have the most flexibility to move between different tools and different architectures from there. And especially now with projects like Apache XTable, having those small exposures to the other formats when needed isn't as big a deal. So that's that question.

Can I read/access Delta tables using Iceberg tables? Well, okay, so basically, each table is just metadata and then the Parquet files. It doesn't matter what format it is; Parquet is Parquet. So, a query engine like Dremio could read the Parquet files. The question is, can it read the metadata? That's where Apache XTable comes in. It's a newer project coming out of the company Onehouse. Essentially, what it does is convert metadata. So, let's say I have this table. The Parquet files are fine as they are, but I want to use a tool that can only read Iceberg. So, I can take that Delta Lake table, take the metadata for the Delta Lake table, and convert it into Iceberg metadata. And then, a tool that can only read Iceberg can now read that table. That's essentially how it works. And I believe a similar idea is behind the Hudi side of the Delta Lake UniForm feature. But each format is its own thing. It comes down to the engine and what you can do. If I'm reading Iceberg metadata, I'm reading Iceberg metadata. If I'm reading Delta metadata, I'm reading Delta metadata. If I'm reading Hudi metadata… But you can theoretically have all three sets of metadata on top of the same set of Parquet files where the data is stored. And that's essentially the premise the Apache XTable project is working with.

Does Data Lake support both––yeah, yeah, yeah. You'll see as we get into the third and fourth episodes, which get into Iceberg's transactions, that you have all the ACID guarantees when you're working with Iceberg. The catalog is going to play a large role in that. You have this really nice ability to know, hey, you're not going to get stuck reading a partial transaction, because nothing becomes visible until the commit to the catalog happens. A read should never really affect somebody else's read. Basically, everything is kept in a nice way where everything's really isolated from each other, and you get those guarantees very nicely. So really, if you do that tutorial that I linked to during the presentation, you'll see that when you actually start working with Iceberg and you're just writing tables, if you're using Dremio, you won't even notice you're using Iceberg. You're just creating tables and updating tables like you would with any database. The beauty of it is that those tables are on your data lake, and if you want to, then tomorrow you can work with them with Spark. You could switch the tool, or you could switch the way you're interacting with those tables, based on the particular use case. And that's sort of the beauty of it. It's not like it's part of this one database or data warehouse system and stuck in that system. It's a table that is visible to a bunch of different tools, and ideally, it just works. And again, you'll see, if you do the tutorial, you get that feel when you're using the Dremio platform. It just works.

How can you generate Parquet files? You don't have to do it by hand; essentially, to write Parquet files, you use a library. That's really not something you do yourself. You could be using tools like Polars or Pandas, which have functions to write your data frames to a Parquet file. But generally, when you're working with Apache Iceberg, you don't even have to do that. The Iceberg project has its own Parquet writers that write Parquet files specifically the way Iceberg wants them, to make sure they have those matching IDs with the schema. When you're using something like Dremio, for example, and this goes for any tool that works with Iceberg, if I write a CREATE TABLE statement or an INSERT statement (we're actually going to have an episode where I'll walk through those different transactions), Dremio will be responsible for writing the Parquet files and then writing the metadata on top of those Parquet files. So from the user experience, when you work with tables, it's really just you writing SQL, and then the engine, the tool that you use to interact with those tables, is responsible for writing the Parquet, writing the metadata, and understanding all that. The reason why it's important to be aware of all that structure is just to have a deep understanding of why things work, which can be useful when you're troubleshooting between different engines. But honestly, the eventual reality is that it just won't matter. Tables will just be tables, and you'll just feel like, "Hey, my tables just work everywhere." They just exist, and the eventual goal is for that to not be something you have to think about. But the world's not quite there yet.
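
For completeness, here is a tiny sketch of the library route mentioned above (pandas with a Parquet engine such as pyarrow installed; the file name is arbitrary), though with Iceberg the query engine's own writers normally handle this for you:

    import pandas as pd

    df = pd.DataFrame({"user_id": [1, 2, 3], "total": [10.5, 20.0, 7.25]})
    df.to_parquet("sales.parquet")  # uses pyarrow (or fastparquet) under the hood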

To build an LLM foundation model, in what way is the data lakehouse concept essential? I would say the data lake and the data lakehouse are essential when it comes to building AI models for a couple of reasons. One, you're often dealing with really large datasets that might not be feasible to store in an enclosed system like a database or data warehouse. Sometimes you can, depending on the system. Two, you want to get that data as fast as possible for the training of that model. You don't want to go through several different movements of the data where you first move the data to this system, then to that system. In that case, having the data located on your data lake means you just move the data from source systems to the data lake and then train off of that. Another point is just the compute costs for training. Using standard cloud compute versus compute with a markup in a lot of other systems is going to be cheaper. And when you think about the kind of compute you need for training, that can be a nice savings. But really, it's more about being able to make the data available faster and having not just structured data in your data lake tables, but also unstructured data like audio files and image files, all in one place that your training and models can access. It's pretty nice. There's a reason why Databricks is so popular: it was designed around accessing data on the data lake. Apache Iceberg just opens that up into a much bigger world.

Do we get the recording of the session? The recordings for each session will be posted on the Dremio YouTube Channel between 24 to 48 hours after the session airs. So, yeah, they will be on there.

How do we update the Iceberg table without hitting the S3 request rate issue? That's one of the best things about Apache Iceberg: Apache Iceberg doesn't care about that. So, basically, if you have a really large partition, that's where you would hit this S3 request rate issue, because you might have a really big partition with thousands of Parquet files within one prefix in your S3. When there are too many requests to the same prefix in S3, you get this rate-limiting issue. But with Iceberg, you can actually spread out that partition. The files don't have to be in the same folder, so you can spread them across many folders and many prefixes, and you're not going to hit those rate limits. For example, Dremio does this by default. You don't even have to configure this. If you're using object storage when writing Iceberg tables with Dremio, it'll automatically write each partition in a folder that's hashed for that particular write. So the data files for that partition are spread out across many prefixes, and you avoid that whole issue altogether. Iceberg has a very unique story there, which is really nice, based on the way it tracks the data. It doesn't rely on that physical partition structure.
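
For the open-source route, a hedged sketch of the related Iceberg table property (hypothetical table name, same "lake" catalog as before): with the object-storage layout enabled, data file paths get a hash component so writes spread across many prefixes instead of piling onto one.

    spark.sql("""
        ALTER TABLE lake.db.events SET TBLPROPERTIES (
            'write.object-storage.enabled' = 'true'
        )
    """)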

The maintenance is managed by the query engines; does Iceberg specify any protocols about compaction? The Apache Iceberg project provides libraries to work with many of the open-source engines like Apache Spark. In the Apache Spark libraries provided by the Apache Iceberg project, there are procedures for things like compaction, rewriting data files, and expiring snapshots. Commercial query engines like Dremio also have simple SQL commands for these tasks. We're going to have a whole episode where I go through the syntax for both. With Dremio, you'd say something like "OPTIMIZE TABLE" and then the table name, and it just runs compaction. You don't need to think about it. It just makes the table faster. The engine still needs to do it, because, at the end of the day, there still needs to be something with the compute and the algorithms to do that efficiently when you're talking about really large datasets. But the tools are there, both on the open-source side and the commercial side, to make that very feasible and doable.
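
For instance, alongside the rewrite_data_files compaction call sketched earlier, the Spark procedures include snapshot expiration; here is a minimal sketch with a hypothetical table name and cutoff timestamp:

    # Remove snapshots older than the cutoff (and data files only they reference),
    # which keeps the metadata and storage from growing without bound.
    spark.sql("""
        CALL lake.system.expire_snapshots(
            table => 'db.events',
            older_than => TIMESTAMP '2024-07-04 00:00:00'
        )
    """)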

What about the Iceberg catalog, JDBC, Nessie, etcetera? Oh, I have a lot to say about that. That's a whole other episode. I think that's going to be episode 6 or 7 where I talk about catalogs. I have a lot to say about the evolution of catalogs, the REST catalog. I'm going to save it for when we get there because that's a very loaded topic. But if you keep sticking through the series, we will get to catalogs and do a deep dive into that.

Under what circumstances would Iceberg be a more appropriate choice in comparison to Delta Lake? I would say if you're not––I mean, you could still use Iceberg if you are using Databricks, but I'm just saying that if Databricks isn't your main tool––Delta Lake makes a lot of sense when you are using Databricks. It doesn't make as much sense when you're not, because many of the best Delta Lake features are built into Databricks and aren't necessarily available outside of it. So, if you're using anything else, use Iceberg. If you don't want to be committed to using Databricks forever, that's another reason to use Apache Iceberg, as it gives you the flexibility to switch tools later on without being stuck in one ecosystem.

Again, the partitioning features––partition evolution, you only get with Apache Iceberg, which is a cool feature. We'll cover this in the partitioning episode.

Can I combine change data capture to get old OLTP tables from a database? Yes, Iceberg does have CDC features. They are likely to become more advanced over time. There's ongoing work on CDC within Apache Iceberg tables. You can also use tools like Debezium and others to facilitate that. Essentially, there's a Spark procedure that generates a change log, which you can use to make these updates in other systems.
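
A hedged sketch of that changelog procedure (hypothetical table and view names; available in recent Iceberg releases through the Spark integration): it materializes a view of row-level changes between snapshots that downstream systems can consume.

    spark.sql("""
        CALL lake.system.create_changelog_view(
            table => 'db.orders',
            changelog_view => 'orders_changes'
        )
    """)

    # Each row carries change metadata (e.g., the kind of change and the snapshot
    # it came from) alongside the table's own columns.
    spark.sql("SELECT * FROM orders_changes").show()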

How does Apache Iceberg compare to Databricks Delta Lake? I think I've answered that question a few times now, so I'll move on from that.

Can you explain the difference between hidden partitioning in Iceberg and partitioning when running a Reflection? Hidden partitioning and partitioning when running a Reflection are two different contexts. A Reflection is a feature specifically within the Dremio platform. Think of it as materialized views or cubes on steroids, or as an automated pipeline where you define objects you want as views, turn on Reflections, and all the pipeline logic is done behind the scenes, almost like a one-click pipeline. There are a lot of different ways to think about Reflections, and I have a couple of great articles on Reflections. But Reflections are essentially just Iceberg tables that Dremio will use to substitute for certain queries, because you might have a faster version of the same table for that particular query, as it's partitioned or sorted differently. So it's essentially using the same partitioning features of Iceberg. You're just creating a separate version of the table that gets substituted, which might be more optimized for the particular query. The example I usually like to use is voter data because I think voter data is pretty easy to understand. I might have some queries that query voters based on party, so I might create a Reflection that's partitioned by party. But then I also might have other people who query the data based on state. So then, I have another Reflection that's partitioned by state. It's still just an Apache Iceberg table, but they're two different versions partitioned differently. When a query comes into Dremio and I say, "Oh, give me all voters in Nebraska," it'll choose the right version of the table to more appropriately answer that query. So it'll use the one that's partitioned by state. I would say they're two different things. Apache Iceberg has partitioning features, and Reflections are a feature of Dremio that's built on top of Apache Iceberg. It takes advantage of those partitioning features to get you more bang for your materialization dollar. But I'll leave it at that, just because I won't get too deep into Reflections for this particular series. This is more about Apache Iceberg than Reflections. But I do have a great article, "The Who, What, and Why of Reflections," that you can find on the Dremio blog that goes really deep into it.

Could you give a comparison of the three databases? I'm thinking of the three table formats with use cases to understand: Cassandra, Druid, and Iceberg. When I think of Cassandra––I'm a little bit less familiar with Druid, even though I have worked with it a little bit. But I would just say it's a different use case. When I think of Cassandra, I think of a database. I know Cassandra can do some analytic work, but I usually think of Cassandra more on the operational side, doing transactional work. So I would have Cassandra powering my application, but then I would still want to eventually migrate that data to Iceberg tables for analytics. So just different use cases. As for Druid, I haven't worked with it as much to speak too intelligently there. I did write a tutorial where you get hands-on with Druid, but I wouldn't say I'm an expert in Druid.

Currently, we use Hive tables and want to migrate to Iceberg tables. How do…can we point Iceberg tables to the same [underlying] HDFS Parquet files? Yeah, you can use the same Hive metastore. So you can use your existing Hive metastore as your catalog. That's generally going to be the easiest next step, and then you can basically choose a couple of tables to migrate over. You can either rewrite the tables or use things like the 'Add Files' procedure. I generally would recommend rewriting the data files, like actually rewriting the table. The reason is that when you use the procedure in Spark that migrates the table in place, the underlying Parquet files are not rewritten, and those Parquet files don't have field IDs that match up with the metadata. It'll still work, but I think there are some advantages for engine compatibility to have those files written by an Iceberg writer from the get-go. So, that would be my advice. But again, I would do it one table at a time. Since you can use the same metastore, you don't have to move everything right away. You can move one table over to Iceberg, make sure all your systems are set up, and do it in phases. That way, you can do a couple of small mini-migrations to figure out all the logistical steps you may have to worry about with your particular architecture. And then, once you've proven it out and said, "Okay, now we know how to get from point A to point B, and have it working nice and smooth," then you do a larger migration. Depending on your architecture, you can also work with our solution architects. We can help figure out that plan. Dremio is often really good for these kinds of migrations because you can connect multiple systems across different formats and provide a unified interface. So, your end users don't have to worry about what format the data is in.
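
For the in-place option mentioned above, here is a hedged sketch of the add_files Spark procedure, assuming (as the answer suggests) that the existing Hive metastore is wired up as the Iceberg catalog through Spark's session catalog; the database and table names are hypothetical.

    # Assumes spark_catalog is configured as an Iceberg SparkSessionCatalog backed
    # by the existing Hive metastore, and that db.users_iceberg already exists with
    # a matching schema. add_files registers the existing Parquet files in Iceberg's
    # metadata; it does not rewrite them (hence the missing-field-ID caveat above).
    spark.sql("""
        CALL spark_catalog.system.add_files(
            table => 'db.users_iceberg',
            source_table => 'db.users_hive'
        )
    """)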

How does Iceberg handle CDC? Is it built-in or [does it] require custom code? It depends on how you're doing it. There is a built-in CDC procedure. I'll talk more about this when we get to that particular episode about streaming.

How is deletion handled by the metadata? If a partition is deleted, do we have to rewrite the table metadata again? Every time you change a table, new metadata gets written, but you're not rewriting the entire metadata. Iceberg is very intentional about trying to break everything up so that, when you change a table, you rewrite as little as possible. But we'll get into that when we discuss the structure of the metadata in future episodes.

There are procedures that are not supported by some of the catalog types. For example, migrate is not supported with the JDBC catalog. That should become less of a problem. When we get to catalogs, I'll talk about something called the REST catalog, which is going to become the standard going forward. Once everyone standardizes on the REST catalog specification and all catalogs work through that layer, you won't have all these issues with mixing and matching the compatibility of different procedures, or at least not as much. At the end of the day, different catalogs do have different underlying mechanics, but the REST catalog will clean up a lot of that going forward.

Is there any certification available for Apache Iceberg, or is a managed variant offered by vendors? Stay tuned. I'll say that—stay tuned. So, none yet, I guess, would be the answer then.

Closing

Alex Merced:

But with that, that answers all the questions. Again, thank you so much for coming to this first session. The next session will be on Tuesday. Questions will be strictly on the Dremio community. So again, make sure you sign up for the Dremio community at dremio.com, and you can post your questions in the Apache Iceberg category. I'll keep an eye on it. Generally, I'll do these live Q&As at the end of these sessions as often as I can, which should be most of them. But on Tuesday, leave your questions on community.dremio.com. On Tuesday, we'll be going through the structure of the three formats, and in the next session after that, we'll start getting deeper into how Iceberg transactions work. Thank you all so much. It's great to see so much engagement and so many people interested in the topic of Apache Iceberg. It's a pretty exciting time to be in the Data Lakehouse space. Thank you all so much for being with me today, and I'll see you in the next session.
