May 2, 2024
Beyond Tables: What’s Next for Apache Iceberg in Data Architecture
Iceberg’s core purpose is to enable multiple engines to use the same table simultaneously, with ACID guarantees and full SQL semantics. While building Iceberg-based data architecture, the community has added new specifications for use cases like catalog interaction, views, remote scan planning, and encryption. This talk will cover the other standards and projects the community is working on and how those projects unlock better patterns in data architecture.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Ryan Blue:
All right, well awesome, thank you for attending this session, everyone. This is always a fun time. I think we’ve been doing this for about three years now, so it’s awesome to be back speaking at Subsurface. This year I wanted to do something a little different. Normally I talk about big efforts and things going on in the Java implementation. This time I wanted to get into something more forward-looking: not just what we’ve done in the last year, but some of the things that are upcoming in the Iceberg community, and why we’re working on them. Hence the slightly clickbait title. Really quickly, hi, I’m Ryan Blue. You may have seen me around in the Iceberg community. I’m the co-creator of Iceberg from when I was at Netflix, and I’m currently the CEO of Tabular.
Apache Iceberg: A Universal Analytic Table Format
Most people, when they think about Iceberg, think about this. This is one of my typical slides explaining exactly what Apache Iceberg is, and I think it fits the definition that most people have. Where we started out was that we wanted to take our data lake at Netflix and upgrade it to essentially the same capabilities as a data warehouse. We say, “Iceberg upgrades data lakes to perform and act like data warehouses.” And crucially, we didn’t actually realize this initially, but we were going for a universal analytic table format, one that works in basically any engine, because at the time we had Pig, and Hive, and Spark, and Trino, and all sorts of things trying to work with our data. There were essentially three pillars. We wanted warehouse behavior: it needed to perform and act like a warehouse, so we built SQL transactions, schema evolution, declarative approaches, and things like that. It also had to work with everything that we had, so we built the universal storage capability there. Then we also found ways to add new capabilities. The model that we use for snapshots and different versions enables time travel and rollback, branching and tagging, and all sorts of new things. But all of this that we talk about and usually think about as Apache Iceberg is really just focused on a table format and a single table. Where we’re at today is that we’re actually moving beyond tables, as you can tell from my title, and I want to go into exactly why and what is happening.
I think one place to start is the universal aspect of this, which I think of as a happy accident. Essentially, what we wanted was ACID transactions that worked even if we were using different compute engines, like Spark, Trino, or Dremio. Those compute engines didn’t need to know anything about one another, and we didn’t need heavyweight locking in, say, the Hive catalog that slowed transactions down or anything like that. Because we wanted safe transactions, we were focused very much on the idea that Spark can modify a table safely while Trino and Flink are also acting on that table at the same time, and no one sees uncommitted data and we don’t have bad data leaking anywhere. What we ended up solving was a larger problem, and that is universal analytic storage. The assumption that we had one central data repository and multiple engines ended up being this larger contribution that we frankly didn’t realize at the time. I think of Iceberg today as solving a larger problem in the table space: shared database storage. That is a pretty big step forward. We’ve never actually done that before; we’ve never had databases sharing storage safely. Now, we’ve had some databases sharing storage with Hive and Hive-like formats, but they’ve never been able to really do it safely. This has become a surprising game-changer, especially as we see adoption of these open formats in even closed-source data warehouse engines like BigQuery and Snowflake and Redshift. It’s a pretty big step forward for those engines to start adopting something from the Hadoop ecosystem. I think of this as a quiet revolution. This hasn’t happened before in data. We’ve never had the ability for, say, MySQL and Postgres to share underlying tables, yet that’s exactly the kind of thing that’s going on in the analytic database world today. That’s weird.
Architecture Realignment: Centralized vs. Specialized
What we’re seeing is that this is causing an architectural realignment. I think some people call this the lakehouse, and this is how I actually think of the lakehouse trend today: now that databases and engines can share storage, we’re seeing them come apart and realign around two broad categories. One is centralized architecture: what needs to be everywhere? The other is specialized architecture: what do you only need in certain places? For example, tables and your data are a centralized piece; you only want one copy of your data. You don’t want to need to copy your data between Dremio and Spark and Redshift. You want to have one centralized copy that you maintain. You’re not copying data around, you’re not trying to keep it in sync, and you’re not worrying about problems when your data gets out of sync: “We ran this query over here, and it didn’t give us the same results as the query over there. What’s going on?” Those sorts of challenges have plagued the data landscape for a very long time, and we can finally move past them thanks to this centralized and shared storage.
The other aspect of this is specialized. You can imagine there are a lot of specialized things that are much better off focusing on those specialized elements. PuppyGraph is here; they’re a sponsor, so go check out their talk. That is a great example of a specialized component: a graph database running graph queries on data stored in Iceberg tables. It’s just a different way of working with the data, and it’s a different specialized engine. Streaming and batch and ad hoc are all different choices an engine might make, and you use different compute resources and frameworks for different purposes. That’s the specialized element. This is an interesting realignment because the question is, what goes in the specialized bucket and what goes in the centralized bucket? I think that this is the biggest transformation that we’re going through over the next few years: figuring out what is centralized and what is specialized.
This also corresponds with a pretty big shift in business models and just the way things have been built for a very long time. We’ve never had databases that didn’t have very tight control of and tight integration with storage. This is a completely unprecedented space, and it’s why I think of it as a quiet revolution. It’s never happened before, and I really think that the design of databases, or at least analytic databases, is fundamentally changing. It’s getting more like the Hadoop landscape, with all of the engines that we’ve had over there, but it is not going to be a short or quick process. A lot of people have always sat on data, because if you put your data in my database, the data naturally pulls workloads to that database. So this is a very revolutionary change: where your data sits no longer determines what compute engine you use on it. This is a very new and interesting space.
Modular Architecture
For that reason, I think that one of the big tenets moving forward is that neutrality for that centralized storage component is going to be crucial. You want to make sure that your storage component, your table space, works with everything out there equally well. That’s my little tangent on that, but what I want to bring us back to is: what is the Iceberg community doing in response to this shift in the data landscape? To pull this together, what we’ve been talking about in the community is a move towards modular architecture. This is, I think, how you build that centralized and specialized realignment. People need to control their data and have one shared, centralized copy, but they should be able to plug in specialized compute resources, just like Legos. Everything should just work. Another part of this is that if we want to be able to take any specialized compute and plug it into any storage layer, we need to move some things. We need to secure the data itself and not the access paths to the data, because it doesn’t help if you have three different database engines sitting on top of your data, each with very different ideas of what permissions look like. We also have a really, really huge dependency on open standards. The Iceberg table format, I think, is driving this revolution, but we need more things like views, authorization and authentication, and even catalog protocols. We also need declarative patterns and services, but this open standards piece is where I want to go next, because this is where the Iceberg community is heading. This is essentially the big place where the lens we’ve set up for looking at the analytic database space explains what the Iceberg community is building today.
Let’s take a look at what else is needed in the centralized or modular space and essentially explain why Iceberg is working on all this other stuff that’s not tables. First of all, if you want a centralized data layer, you end up needing a catalog standard. At the start of the Iceberg project, we had different catalogs everywhere. Some people had Hive, some people had Glue. Netflix had our own catalog called MetaCat, and we knew a lot of other organizations that had their own catalogs. We were also providing all of the compute services, so we made a simple API that let you plug in your own catalog. As the space matured, and now that we have closed-source databases, we need everything to talk the same language in order to connect to catalogs. This is that pluggability, the idea that a new database engine should be able to act on your data and understand what tables are out there. The community came together a couple of years ago and started on the REST spec, or the REST protocol standard, just to ensure that connectivity. This is going really, really well. There are a lot of organizations adopting it, and I’m hoping that this week we see some announcements about this, but the REST standard I think is quickly becoming the standard for catalog interaction. We also used this as an opportunity to fix several long-standing issues with Iceberg and the metadata model. That is super important so that we have that pluggability. Adoption across engines is increasing, and that’s really step two: getting everything talking that same language.
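To make this concrete, here is a minimal sketch of connecting to an Iceberg REST catalog with PyIceberg. The endpoint, credential, warehouse, and table names are placeholders, not anything from the talk; the point is that any client speaking the REST protocol can discover and load the same tables, regardless of which engine wrote them.

```python
# A minimal sketch, assuming a REST catalog is reachable at the placeholder URI.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "rest_catalog",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com",      # hypothetical REST catalog endpoint
        "credential": "client-id:client-secret",   # exchanged for a token by the client
        "warehouse": "analytics",                   # hypothetical warehouse name
    },
)

# List namespaces and load a table entirely through the REST protocol.
print(catalog.list_namespaces())
table = catalog.load_table("examples.events")       # hypothetical namespace.table
print(table.schema())
```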
Then there are extensions to that REST standard. We’re already working on a set of those to make sure that it’s easier to work with data and to do things that you need to at the catalog level. We’re going to talk a little bit more later about identity passing and access control and things like that, but really we’re making good progress here towards server-side scan planning so that any language can go interact with a table and get back a list of files that it needs to read, making the Iceberg clients far easier to write and maintain, and also do things like append files via post. AWS is working on both of these features, and it’s a really, really great area to see improvements in. So huge thanks to AWS for driving the way there.
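For context, today’s clients plan scans locally: they read table metadata and manifests themselves and compute the list of files to scan. The sketch below uses PyIceberg’s existing client-side planning to show the kind of result a server-side scan-planning endpoint would hand back instead; the table name and filter are hypothetical, and this is not the REST extension itself.

```python
# Client-side scan planning with PyIceberg, for illustration only.
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

catalog = load_catalog("rest_catalog")               # configured as in the previous sketch
table = catalog.load_table("examples.events")

scan = table.scan(
    row_filter=EqualTo("event_date", "2024-05-01"),  # pruning happens during planning
    selected_fields=("event_id", "user_id"),
)

# Each task points at a data file (plus any delete files) an engine must read.
for task in scan.plan_files():
    print(task.file.file_path, task.file.record_count)
```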
Views
The next thing that we’re working on, and we’ve had a lot of discussion about it in the community, is views. Just like you have a catalog of tables that you want to be universally accessible, it doesn’t really make sense to leave views out of that picture, because views and very simple SQL transformations are such a core part of working with data effectively. You don’t technically need views, since you can copy the view definition everywhere, but it’s so much more convenient to have views and to be able to work with them that way. So in a world where essentially your catalog and everything below it is shared, we also need to share views. The first step is common views, where we simply have one SQL definition and hope that it works across engines. That’s not a great plan. It probably covers 95% of cases, but once you start getting into certain types of joins or lateral view explode and some of those things, common views break down quickly. So another thing that we’ve done is allow views in the Iceberg spec to have multiple SQL dialects, so an engine can come in and say, “I’m looking for the Spark dialect,” and pull up the dialect that is specific to Spark. Being able to store multiple blobs of SQL that all do the same thing allows us to have this sort of compatibility for views. And there are projects like Coral. I think Walaa is giving a talk, either at this conference or the next one, about Coral and this automatic translation between dialects. Long term, I think that we’ll probably want to use some sort of intermediate representation, so an engine will take its SQL and bake it down into an intermediate representation that, like the Iceberg format, is a cross-engine representation. There are some candidates out there like Substrait for this, and I think that that’s going to be important. That’s still well into the future, but we are working towards that goal of having views that simply work across engines. That is going to be really, really cool and a very useful tool in this space where we have centralized data but specialized execution. We’re also currently working on materialized views. That’s not necessarily about any one system producing the materializations, but about how we keep track of the metadata for where a materialized copy of a view lives, how we keep track of how up-to-date it is, and how an engine decides whether to run the view or use the materialized table. Views are very, very important, so we’re spending a lot of time here. Views also unlock a lot of better patterns, and the one I’d call out is better CDC patterns, where you don’t need to keep your mirror table up-to-date every minute; you can have longer periods of time and then just merge in all of the latest changes at read time. That is a much more efficient pattern, and views really unlock it, especially views that work across engines. You don’t necessarily have to materialize everything; we can do some things more lazily.
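As an illustration, here is a simplified Python sketch, not the full Iceberg view spec, of how one view version can carry multiple SQL representations, one per dialect. The table name, SQL text, and helper function are made up for the example; only the general shape (representations with a type, dialect, and SQL string) reflects the spec.

```python
# Simplified sketch of multi-dialect view representations; field names follow
# the general shape of the Iceberg view spec but omit most required metadata.
view_version = {
    "version-id": 1,
    "schema-id": 0,
    "default-namespace": ["examples"],
    "representations": [
        {
            "type": "sql",
            "dialect": "spark",
            "sql": "SELECT event_id, explode(tags) AS tag FROM examples.events",
        },
        {
            "type": "sql",
            "dialect": "trino",
            "sql": "SELECT event_id, tag FROM examples.events CROSS JOIN UNNEST(tags) AS t(tag)",
        },
    ],
}

def sql_for_dialect(version: dict, dialect: str):
    """Pick the SQL text an engine should use, if a matching dialect is stored."""
    for rep in version["representations"]:
        if rep["type"] == "sql" and rep["dialect"] == dialect:
            return rep["sql"]
    return None

print(sql_for_dialect(view_version, "trino"))
```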
Access Controls, Identity, and Policy
Next up, I think there’s a lot to be done in terms of access controls, identity, and governance policy. This is one area where it’s very clear you don’t want three or four or five different sets of policy. You need the same identities across all of your engines, and you want to work with your data as though the data itself is secured, not the traditional way, which is to secure every endpoint separately. Centralized policy and universal enforcement are really key. This also comes into play when we’re talking about modularity. How do I delegate? If someone comes in with a new engine that they want to hook up, how do we delegate access and make it very, very easy? Luckily, we have open standards for these things. Exchanging identities, so users and groups and things like that, that’s just SCIM 2, an existing standard that’s used very broadly. OAuth is the same thing that you use to allow Facebook to access your Google Workspace contacts, and hopefully no one’s connecting Facebook and Google Workspace, but that same mechanism for delegation can be used to enable delegation within a storage system. All of these things together, along with the REST catalog, allow a storage platform to build very high-quality, fine-grained access controls using identities that come from Okta or sign-ins that come from Active Directory, with one set of policy that’s centralized and enforced universally, no matter what is accessing the data. This, I think, is an area where Iceberg is not building the implementation itself; what Iceberg is doing is facilitating this and recommending the standards to build with. This is a central area that the REST protocol is looking at.
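A hedged sketch of what that delegation flow can look like: exchange client credentials with an identity provider for a short-lived token using a standard OAuth2 client-credentials grant, then present that token to the REST catalog, which enforces the centralized policy. The identity provider URL, catalog URI, client ID, and scope are all placeholders, not anything defined by Iceberg.

```python
# OAuth2 client-credentials delegation sketch with placeholder endpoints.
import requests
from pyiceberg.catalog import load_catalog

token_response = requests.post(
    "https://idp.example.com/oauth2/token",          # hypothetical identity provider
    data={
        "grant_type": "client_credentials",
        "client_id": "etl-pipeline",
        "client_secret": "s3cr3t",
        "scope": "catalog",
    },
)
token_response.raise_for_status()
access_token = token_response.json()["access_token"]

# Every engine or tool presenting this identity sees the same policy decisions,
# because enforcement lives with the data rather than with each engine.
catalog = load_catalog(
    "rest_catalog",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com",        # hypothetical REST catalog endpoint
        "token": access_token,                        # sent as a Bearer token
    },
)
```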
Table Encryption
The last one is table encryption. This isn’t really about centralization, but I wanted to highlight this specification and this effort in the community because it shows that the community is maturing beyond just simple table concerns. A lot of people have use cases for which they need to encrypt the data. We’re building standard data encryption and metadata encryption into a spec that goes into Iceberg. This uses native Parquet and ORC encryption, which both preserve the row-group and column skipping that those formats are known for, but it also has streams for encrypting metadata files and has key management all the way up the Iceberg table structure and that metadata tree, so that it’s very easy to secure your data without a ton of hits to your key management system. Table encryption, I think, is just a maturation of the community, and huge thanks to Apple for leading the way on all of this table encryption work.
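For a feel of the building block underneath this, here is a sketch of native Parquet modular encryption via PyArrow. This is not the Iceberg encryption spec itself, and the in-memory "KMS" below is deliberately insecure so the example is self-contained; in a real deployment the master keys would live in an external key management system.

```python
# Parquet modular encryption sketch using PyArrow with a mock, insecure KMS.
import base64

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe

MASTER_KEYS = {"footer_key": "0123456789012345", "col_key": "1234567890123450"}


class InMemoryKmsClient(pe.KmsClient):
    """Mock KMS for illustration only: 'wrapping' is just base64 concatenation."""

    def __init__(self, config):
        super().__init__()
        self._keys = config.custom_kms_conf

    def wrap_key(self, key_bytes, master_key_identifier):
        master = self._keys[master_key_identifier].encode("utf-8")
        return base64.b64encode(master + key_bytes)

    def unwrap_key(self, wrapped_key, master_key_identifier):
        master = self._keys[master_key_identifier].encode("utf-8")
        unwrapped = base64.b64decode(wrapped_key)
        assert unwrapped[: len(master)] == master
        return unwrapped[len(master):]


kms_config = pe.KmsConnectionConfig(custom_kms_conf=MASTER_KEYS)
crypto_factory = pe.CryptoFactory(lambda config: InMemoryKmsClient(config))

# Encrypt the footer with one master key and the sensitive column with another;
# row-group and column pruning still work on the encrypted file.
encryption_config = pe.EncryptionConfiguration(
    footer_key="footer_key",
    column_keys={"col_key": ["ssn"]},
)

table = pa.table({"id": [1, 2, 3], "ssn": ["111", "222", "333"]})
with pq.ParquetWriter(
    "events.parquet.encrypted",
    table.schema,
    encryption_properties=crypto_factory.file_encryption_properties(
        kms_config, encryption_config
    ),
) as writer:
    writer.write_table(table)

# Reading back requires the same key material, obtained through the KMS client.
decryption_properties = crypto_factory.file_decryption_properties(kms_config)
restored = pq.ParquetFile(
    "events.parquet.encrypted", decryption_properties=decryption_properties
).read()
print(restored.to_pydict())
```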
And with that, that’s basically it for me. Thank you for listening to my thoughts on how the data landscape is changing and what we’re doing about that in the Iceberg community. Quick shout out to Tabular, that’s where I work. You can follow the links here if you’re interested.