May 2, 2024
Apache XTable (incubating): Interoperability among lakehouse table formats
Apache Hudi, Iceberg, and Delta Lake have emerged as leading open-source projects, providing powerful primitives for transaction and metadata layers, commonly referred to as table formats, on top of decoupled cloud storage. When data is written to a distributed file system, these three formats are not significantly different. They all offer a table abstraction over a set of files, including a schema, commit history, partitions, and column stats. When engineers and organizations must choose a table format, they face a challenging decision. Each project has a rich set of features that may cater to different use cases. So, the question arises: do we really need to choose?
Enter Apache XTable—an open-source project that provides omni-directional interoperability between lakehouse table formats. XTable doesn’t introduce a new or separate format but offers abstractions and tools for translating table format metadata. This allows you to write data in any format of your choice and convert the source table format to one or more targets that can be consumed by the compute engine of your choosing. This presentation will showcase how XTable addresses the challenging problem of selecting a specific table format and the growing need for interoperability in today’s lakehouse workloads.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Dipankar Mazumdar:
So thanks everyone for joining the session. My name is Dipankar, and today I'm going to talk about Apache XTable and how you can interoperate between different lakehouse table formats like Hudi, Iceberg, and Delta. A little bit about me: I'm currently a staff developer advocate at onehouse.ai, where I focus on open source projects like Apache Hudi and Apache XTable. Previously, I have also contributed to projects like Apache Iceberg and Apache Arrow, and my focus has been pretty much in the data architecture space.
Evolution of Data Architecture
So in terms of the agenda, we will start with some background on the lakehouse table formats. For folks who are already aware of them, it will be kind of a refresher. Then we'll go to the second part, which is which format to choose, and how XTable comes in and brings the interoperability perspective. And finally, we'll get into the technical overview and go into more detail on the nuts and bolts. All right. So over the years, organizations have been heavily investing in building centralized, reliable, and scalable data platforms. The whole idea was that if they could democratize data, it would allow more data consumers, like data analysts and data scientists, to use it, and they could further enable workloads like BI and machine learning. Data warehouses were the first systems that enabled us to deal with this, but they lacked the ability to deal with unstructured data for more advanced use cases such as machine learning. Then came the era of data lakes with the Hadoop ecosystem, which was on-premise. And data lakes were great: organizations could basically store all of their data, whether structured or unstructured, it didn't really matter, into a file system like the Hadoop file system. Eventually, with the rise of the cloud and its relatively cheap pricing, we now have cloud data lake storage like S3, Azure, and Google Cloud from the three major vendors. So you can basically scale out the storage as you like, with no real limitation and at a lower cost. But of course there were problems, which is why we have the lakehouse architecture and why we're talking about all of this.
So one issue is that you can't just dump all of your data into a data lake and have your analysts and scientists deal with the physically stored data. What I mean is, if you wanted to do BI, you still needed to organize the data and get it out in a more structured form, so you typically used a data warehouse again, which gives you a two-tier kind of architecture: a data lake plus a data warehouse. That was costly and high maintenance. And the most important problem was that data lakes lacked atomicity and consistency guarantees for transactions. Things like updates and deletes, ACID transactions, you could not do on the data lake. For example, if two concurrent writers are writing to the same dataset at the same time, there are no guarantees about whether things will succeed or fail. And also, it was getting really hard to make sense of all of the data, because in a data lake there is no real governance, no real quality checks, and people usually dump the data as it is collected from the operational systems. So it's really hard to make decisions and run analytical workloads on that kind of data when you don't have those quality and governance policies.
These issues gave rise to a new architecture, the data lakehouse. As you might have guessed from the name, it's a combination of a data lake and a warehouse in some form. But it's not just a mere combination of the two; the idea is to take the best of both worlds. For instance, data warehouses have great data management and optimization capabilities, things like compaction, clustering, and the ability to do time travel and schema evolution. These are the things that were previously available to users and customers of data warehouses. And similarly, data lakes provide the scalability part. Like I said, even cost-wise, object stores are comparatively very cheap, minus the request fees and so on, which is really great.
So a lakehouse architecture basically builds on this idea, and the most important thing that ultimately lays the foundation of a lakehouse is the table format. Let's quickly talk about the various components at a very high level, starting from the bottom up in this particular diagram. At the bottom you have the storage, which is the cloud storage. This is where your data lands after ingestion from, let's say, transactional systems, operational systems, and CRMs. Azure Blob, Amazon S3, Google Cloud Storage: all these vendors provide an object store, and it provides the required performance and security as well. Then we have the file formats on top of that, which is where the actual data is stored, usually in a columnar format, which gives advantages for reading and scanning the data and is also really good compression-wise and performance-wise. File formats like Parquet are commonly used, as well as ORC, and even things like JSON and CSV; it can be pretty much anything. The most important part is the transactional layer that you see on top of the file formats and the data lake storage. That's where the table format comes into the picture. At a high level, and we're going to talk a bit more about it in the next slide, the table format abstracts away the complexity of the physical data structures and allows different users to work with a table schema without having to deal with the physical bits and bytes. Apache Hudi, Iceberg, and Delta Lake are the three popular table formats, and they are widely gaining enterprise adoption. Obviously, in the past we had the Hive table format from the Hadoop world, which kind of led us to today's formats. And on top of the transaction layer, you see the compute engines, the BI tools, and the ML frameworks, whatever is responsible for interacting with and consuming the data.
One question that always came to my mind when I was learning about the lakehouse architecture is how it relates to existing architectures like the data warehouse. The main distinction, if I have to compare a data warehouse with something like a lakehouse, is that in a lakehouse each of the components you see on this slide is not bundled up into one single system the way it is in a data warehouse. What I mean is that all these components I talked about are also present in a data warehouse, but the file format and the table format parts are proprietary and very much designed to work with the vendor's own compute engine. That's what locks you in, in a way, because you have to load the data into their storage and then you can only access it from their compute engine. Again, I'm seeing the pattern slightly changing with customers' needs; now we even have warehouse vendors supporting different table formats. But if you're interested in a deeper dive into table formats and the lakehouse architecture, I really suggest reading the Lakehouse paper by Databricks. I also presented and published a paper during my time back at Dremio, where we compared the lakehouse to a data warehouse and looked into how it stacks up against that kind of engine.
What is a Table Format?
So what is a table format? With databases, we have seen the concept of a table before; it's not a new concept. Companies like Oracle and Microsoft have been doing it with relational databases for some time now. Like I said, it's basically a way to organize a dataset's files to present them as a single table, so users don't really have to know about the physical bytes and files; they just get that abstraction on top. In the big data world with Hadoop and HDFS, we ultimately ended up with the Hive table format. It was important because we wanted more people to use more data on the data lake; otherwise you would have data engineering teams writing complex MapReduce jobs to process the data, which can be really cumbersome and is really limited, skills-wise, to a particular set of people. So with Hive, we have a table in the database, and it's composed of one or more directories. You see one directory here, but of course there are a lot of directories in production, and the contents of those directories are the data files themselves.
The Hive table format was one of the table formats that gave us ACID-based transactions on top of Hadoop data lakes. But today we have moved a bit away from the Hadoop ecosystem, and in a good way; it's all inspiration, right? Now we have three major table formats: Hudi started in the Uber camp, Iceberg started in the Netflix camp, and Delta came out of Databricks. Overall, the goals of these formats are pretty much the same. Some of you may already be aware of them, but for folks who are new to the space, here is a bird's-eye view of what these formats look like and how they differ. Like I said, these are basically the metadata layer on top of the actual data files, the Parquet files, for example. How these metadata files are organized and designed is a bit different, and that's really the only difference you see. For example, Delta does this with a JSON-based transaction log, Iceberg uses a set of metadata files, and Hudi takes a timeline-based approach along with indexing. Once you have that metadata abstraction on top of the files in storage, you can explore fundamental capabilities like SQL semantics: you can write SQL queries on top of that and do things like schema evolution and time travel. All those data warehouse capabilities are now possible here. These metadata files also allow us to track statistical information about the data files, which is super important for things like performance optimization and tuning. Those are really the benefits of this.
So, like I said, although these three table formats came out of three different camps, the goals are mostly the same. The goal is that customers should be able to have an open data architecture, meaning that data should not be locked into a particular system where they can only bring in one compute engine. That's the whole idea behind it. Now, coming to the main point: fundamentally, these three table formats are not that different, like I said. When data is written to a distributed file system, these formats have a similar structure; it's kind of an abstraction on top of the Parquet files. For example, if you look at the Hudi table structure, you see the .hoodie metadata directory and a bunch of Parquet files. Similarly, for Delta, you see the transaction log and you also see the Parquet files. And for Iceberg, you see the manifest files, the metadata file, and the manifest list along with the Parquet files. So really the intersection, the common part here, is that the data files are pretty much the same; the metadata is the only thing that changes.
Which Format Should I Choose?
So that brings us to the million-dollar question: which format should you choose? This is probably one of the most common questions I have received in my two or three years of working with customers on lakehouses and productionizing lakehouses. To be honest, it's a tough question. Everyone's workloads, use cases, and ecosystems are different, so there are different factors to consider. You'll probably see a bunch of comparison articles around this that target things like performance, features, and community, but it's a really difficult thing to compare and choose and then really stick to one format. To give you a feel, here are some guiding principles. For example, Hudi can be good for low-latency streaming pipelines. Delta could be good if you're already in a Databricks environment, and it's easy to get started with as well. Whereas Iceberg works with engines like Dremio, and community-wise a lot of people are building on it; even things like partition evolution and hidden partitioning are unique to Iceberg, and those are some of the areas where I really like how Iceberg shines. So overall, it's tricky to nail down and stick to a particular format, to get married to a particular format. Up until now, we have learned that the decision to choose a format can be challenging and involves many considerations. The challenge lies in selecting the best format, which can be a really arduous task, and the ideal choice may also shift over time as the workloads in your environment evolve. It's also important to acknowledge that we are now seeing more and more open table formats being developed. Apache Paimon is one example, which came out of the Flink environment recently. We are also seeing more specialized table formats; Havasu, for example, is based on Iceberg but is aimed at geospatial workloads.
Ultimately, these are all good developments. They are helping customers own their data, have open data architectures, and not be limited to a particular workload. But in general, in the past couple of years, in my experience, what it comes down to is, first, a feature-level comparison. What people do is say, okay, maybe Iceberg has this feature but Delta doesn't, that kind of comparison, and it depends on your workload as well. The second part is the complexity of the implementation itself: how easy is it to migrate from an existing format or start from scratch, and are there any additional system dependencies that you need to worry about? And the third part is what type of community support there is: how open is it really, and how are the developments happening? Because these are all open source projects. On top of that, there are also some new challenges that we are learning about as we productionize lakehouses. For example, newer workloads in the lakehouse architecture involve multiple teams using multiple different table formats. Organizations want to go beyond a single table format and be able to interoperate, talk with different teams, and use data from different teams. So I think the challenge here is that you need to be able to interoperate and not silo your data. When we talk about data silos in a lakehouse environment, this is what we are talking about, in a way.
Introducing Apache XTable
Okay, so the question is, do we really have to choose, or is there another way? That's where Apache XTable comes in. We have seen this kind of problem, and the solution is basically the XTable project that we are presenting here. Apache XTable is not a new table format; it's just a translation layer on top of the existing table formats. The idea is that your data should be universal. The flexibility should be there for teams to choose and cater to different use cases. You should be able to write in one format and use whatever query engine works best for you, whether it's Dremio for Iceberg or Spark for Delta or whatever it is. The overall idea, I guess, is to write once and query everywhere.
So XTable is an open source project that allows omnidirectional interoperability. The reason I say omnidirectional is because you are not limited to a one-off conversion from one particular format to another; this is more about interoperating between formats. The important part is that it's not a new format. Like I said, it's just an abstraction and a set of tools that allows you to translate the metadata, only the metadata part. The most important thing to note here is that we are not touching any data files as part of this translation. There are no costly data rewrites; there is nothing happening to the data in the background. The only thing we are doing is, for example, if we have Iceberg metadata, we use that metadata to translate to, let's say, a different table format. That's the only thing we are doing. So the overall goal with XTable is that you start with any table format of your choice, and if there is a specific requirement in the organization, you can switch to another, depending on the factors I'm going to discuss. And all you have to do in this case, as you can see in this particular screenshot, is choose your source table format, choose your target table formats, run XTable, and that's all. You have the metadata there, and now you are ready to plug in: for example, with spark.read.format("hudi") you can read Hudi tables, and likewise Delta tables, Iceberg tables, whatever it is. Again, I'm going to present more of a high-level walkthrough, a demo if you will, to show what we did with Dremio in this case.
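To make the "choose your source, choose your targets, run XTable" step concrete, here is a minimal sketch of a conversion run, driven from a Python script. The bucket path, table name, and jar file name are placeholders, and the config keys and CLI flag follow the XTable documentation at the time of this talk, so treat the details as assumptions that may differ between releases.

```python
# Illustrative sketch of an XTable metadata translation run (not a verbatim
# command from the talk). Assumes Java and a bundled XTable utilities jar are
# available locally; jar name, config keys, and paths are placeholders.
import subprocess
import textwrap

config = textwrap.dedent("""\
    sourceFormat: HUDI          # the format the table was originally written in
    targetFormats:
      - DELTA                   # formats to expose the same table as
      - ICEBERG
    datasets:
      - tableBasePath: s3://my-bucket/warehouse/trips   # hypothetical table path
        tableName: trips
""")

with open("my_config.yaml", "w") as f:
    f.write(config)

# Run the translation: only metadata is generated; no data files are rewritten.
subprocess.run(
    ["java", "-jar", "xtable-utilities-bundled.jar", "--datasetConfig", "my_config.yaml"],
    check=True,
)
```

After a run like this, the same set of Parquet files carries Hudi, Delta, and Iceberg metadata side by side, so an engine can read whichever format it understands, for example spark.read.format("delta").load("s3://my-bucket/warehouse/trips").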
Now I'm going to go to a more technical level and try to explain what XTable's architecture looks like. At a high level, it is composed of three key components: the source reader, the target writer, and the central core logic itself. Let's take a look at these three individual components. The first is the source reader. These are table-format-specific modules; they're responsible for reading metadata from the source table, and they operate through a pluggable interface. What I mean by pluggable is that if a new format comes along today, you can just take that interface and implement it, so you're not writing a bunch of new code from scratch; from a technical level, it's really easy to plug into XTable as a new table format. What this piece of the architecture takes care of is extracting information like the schema, transactions, and partitioning info, and translating it into XTable's unified internal representation.
The next part is the target writer, and it mirrors the source reader. Its role is to take that internal representation of the metadata and accurately map it to the target format's metadata structure. This includes recreating the schema, transaction log, and partition details, all of that, in the new format. And the most important part of the XTable architecture, the central processing unit, is the core logic. The core logic orchestrates the entire translation process, including initializing all the components, managing the sources and targets, and handling tasks like caching for efficiency, state management, incremental processing, and telemetry. All of those things are taken care of by the core logic part of XTable.
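As a purely illustrative sketch of that plug-in shape (the real XTable interfaces are Java, and these names are hypothetical), the idea is that a source reader turns a format's metadata into a format-neutral representation, and a target writer maps that representation onto another format's metadata layout:

```python
# Hypothetical Python sketch of the pluggable reader/writer contract described
# above -- NOT the actual XTable (Java) interfaces, just the general shape.
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass
class InternalTable:
    """Hypothetical unified internal representation of one table snapshot."""
    schema: Dict[str, str]        # column name -> type
    partition_fields: List[str]
    data_files: List[str]         # file paths (plus column stats in practice)


class SourceReader(Protocol):
    """Reads metadata from a source table format."""
    def read_snapshot(self, table_base_path: str) -> InternalTable: ...


class TargetWriter(Protocol):
    """Writes the unified representation as a target format's metadata."""
    def write_metadata(self, table: InternalTable, table_base_path: str) -> None: ...
```

A new table format would only need to supply its own reader and writer against this kind of contract, which is why the core logic itself never has to change.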
XTable Sync Process
Now that we have a decent understanding of XTable, let's understand a bit more technically what happens during the sync, and then look at the two translation modes, or sync modes, that are available. I use sync and translation interchangeably because they mean pretty much the same thing. XTable provides two synchronization modes: one is incremental and the other is full. The reason we have these two modes is that incremental mode is pretty lightweight and has better performance, especially on large tables.
What XTable does is detect which commits in the source table format have not yet been translated to the target, and it focuses solely on converting those commits. By incrementally reading commits this way, you reduce the time and the computational resources needed. It also means you can invoke XTable frequently, on a commit-by-commit basis, efficiently reducing the staleness of your data. And if anything prevents incremental mode from working properly, the tool will fall back to the full sync mode. So what exactly happens during the sync process? The syncing of data files is very important, along with their column-level statistics and partition metadata; these are the things we sync in this entire translation between the source and target tables. Then there are schema updates: schema changes need to be reflected from the source to the target table formats, which is super important as well.
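Here is a hedged sketch of that control flow, purely to illustrate the incremental-versus-full decision described above; it is not XTable's actual implementation, and the object methods and exception are hypothetical:

```python
# Illustrative control flow only: try an incremental, commit-by-commit sync,
# and fall back to a full sync if incremental mode cannot proceed.
class IncrementalSyncError(Exception):
    """Raised (hypothetically) when incremental translation cannot proceed."""


def sync(source, target, last_synced_commit=None):
    if last_synced_commit is not None:
        try:
            # Translate only the source commits the target has not seen yet.
            for commit in source.commits_after(last_synced_commit):
                target.apply_commit(commit)   # metadata only: schema, stats, partitions
            return "incremental"
        except IncrementalSyncError:
            pass  # e.g. the metadata needed for incremental catch-up is unavailable
    # Full sync: rebuild the target metadata from the source's current snapshot.
    target.write_snapshot(source.current_snapshot())
    return "full"
```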
The third part is metadata maintenance for the target formats. For Hudi, unreferenced files will be marked as cleaned. For Iceberg, snapshots will be expired after a configured amount of time. And for Delta, the transaction log will be retained for a configured amount of time and cleared up later.
XTable Attributes
So just to sum up the high-level attributes of what XTable provides: XTable is omnidirectional, meaning you can start from any format, convert to any other format, and round-trip; you can make any combination that you want, with very little performance overhead, because no data is ever copied or rewritten. Like I said, it's only a small amount of metadata that we're converting. The second part is that we have two synchronization modes, which we talked about: the incremental mode is lightweight with better performance, and the full sync mode syncs the entire table. And of course, XTable is designed to be easily adapted and extended in response to emerging table formats, like I talked about. That just means you pretty much have to implement the pluggable interfaces, and that opens up avenues for all of these translations.
Case for Interoperability
So XTable provides a new direction here for dealing with the tough decision of selecting and sticking with a table format. Now, while we started this talk with the need for interoperability, it also brings up some critical questions, such as this one from the community, if you read this particular tweet. The tweet says: taking a step back to view table formats, how many teams out there are running or needing data to be in all formats like Iceberg, Hudi, or Delta? Maybe I'm not seeing the full picture here. I would think one of the formats would suffice in an organization end to end. And that's a valid question. It's a new way of doing architecture development, so it's tough to understand why we need these different things. To provide some context, here are some of the aspects to consider. Number one is compute vendors and table formats: most of the compute engine vendors today are inclined towards a particular format, and for good reason. You want to be able to do better optimization and better feature extensibility with a particular format, and doing that for even one format is always challenging because we are dealing with so many new things, so doing it for all three doesn't really make sense in a way. But this contradicts the initial idea of not locking data into a specific format, and in fact that is one of the biggest motivations behind an open lakehouse architecture: you should be able to interoperate across your data.
The second part is the coexistence of table formats. Most organizations, and I'm working with those customers and learning more as well, have more than one table format in their architecture, especially larger organizations, because they want to use different compute engines, which are developed specifically for a particular format. And the third part is rollback, or migration itself. The decision to select a table format is correlated with another question: what if there is a need to revert or migrate? Migration is usually the standard answer, but in this case it is very costly and can involve tremendous engineering effort and planning. It's really a lot of work.
Okay. In the interest of time I'm not going to do a live demo, but this is what we recently did, and I have a blog post I can share later as well. Basically, the idea is that Dremio doesn't support Hudi tables today; it's mostly Iceberg and Delta. But we wanted to show, okay, what if we have a Hudi table, how can Dremio use it? So the use case was: let's say team A in a particular organization is using Hudi with Spark to write some kind of streaming workloads, and they're writing data into Hudi. Then there's another team, team B, in the organization who uses Dremio with Iceberg, but they also want to use the dataset from the Hudi side. They want to merge it in Dremio and do their analytics in Tableau or whatever BI tool. So what do they do in this case? What we did is use XTable to translate the metadata. It's a simple translation: the Hudi table gets exposed as an Iceberg table to Dremio, and that way Dremio just reads it as an Iceberg table and you can do your analytics on top of that.
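For flavor, here is a minimal sketch of team A's side of that scenario, writing a Hudi table with PySpark. This is illustrative, not the exact code from the demo: it assumes the Hudi Spark bundle is on the classpath, and the table name, path, and key fields are hypothetical.

```python
# Sketch of team A writing a Hudi table with PySpark (hypothetical table,
# path, and schema; assumes the Hudi Spark bundle is available).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("team-a-hudi-writer")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.createDataFrame(
    [(1, "2024-05-01 10:00:00", 9.5)],
    ["trip_id", "ts", "fare"],
)

(
    df.write.format("hudi")
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .mode("append")
    .save("s3://my-bucket/warehouse/trips")
)
```

Running XTable against that same path with HUDI as the source format and ICEBERG as the target then produces Iceberg metadata next to the existing Parquet files, which is what lets Dremio query the table as a regular Iceberg table.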
XTable Community Reception
Okay, so that's pretty much it in terms of the technical content I had. XTable is an open source project. We launched it with Onehouse, Microsoft, and Google last year, and it is under the Apache License 2.0 on GitHub. So if you want to get involved in any form, feel free to open a PR or join the discussions. It's also exciting to see some early momentum in the project. These are, again, vanity metrics, but they are useful for understanding the overall growth of the project: we had close to 700 stars and a thousand GitHub clones, and we were also a trending GitHub repository worldwide. So that's really interesting to see, the kind of trend and where we are going. Our goal is obviously the seamless interoperability I talked about, to eliminate more data silos, and of course we want to be vendor neutral. So we donated this project to the Apache Software Foundation, and now it is incubating as an Apache project. That means our next goal is to take it forward as an Apache top-level project so we can graduate. I also wanted to give a shout-out to some of the initial committers involved in this project from various organizations: we have folks from my organization, Onehouse, as well as Microsoft, Google, Dremio, Cloudera, and Walmart. So really good momentum there, good collaboration, and good diversity as well. I'm going to skip the how-to-get-started part; you can just check out the XTable documentation site, and it should be pretty easy to get started. We even have some hands-on examples: you can use the Docker setup, open up a container, and run your conversion. And that's pretty much it.