May 2, 2024
Best Practices for Building an Iceberg Data Lakehouse with Dremio
A data lakehouse combines the flexibility and scalability of the data lake with the data management, governance, and analytics of the data warehouse. Open table formats like Apache Iceberg make it possible to efficiently manage and leverage data while maintaining complete control over your organization’s most critical asset. In this session, we’ll share best practices for building an Iceberg data lakehouse, including: ingesting data into Iceberg tables, automating table optimization for performance, and building and sharing virtual data products. We’ll also share how Dremio’s Data as Code capabilities enable data teams to apply a DataOps framework to their lakehouse management strategy, including version control, CI/CD, governance, and observability.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Alex Merced:
Hey everybody, I guess many of you have just met me at the book signing, so for those who stopped by and grabbed the book, thank you so much, I really appreciate you wanting to read words that I wrote. And yeah, overall what we're going to do is talk about best practices for building an Iceberg-based lake house. Now, best practices could cover a whole lot more than the time that we have, so we're going to focus on three particular areas and just talk about, hey, if I'm using specifically Apache Iceberg, what are some things that I should think about? So that's going to be the goal. And again, my name is Alex Merced, I'm now Senior Technical Evangelist here at Dremio and one of the co-authors of Apache Iceberg: The Definitive Guide. I've worked as a developer for different companies in the past, many of you I've met at other conferences and places I've spoken, whether online meetups and whatnot, and I also host a bunch of podcasts, so I talk a lot is the point.
Parquet File Row Grouping Size
But we're going to be focusing on at least three areas for best practices for managing your data lake house. So the first area is your data, because oftentimes when we use databases and data warehouses, we don't think as much about this, because the database or the data warehouse is doing a lot of that optimization of the data under the hood. We're just basically adding records to the table, and how that data gets stored, well, we don't think about it. But on your data lake, and specifically in your data lake house when you have Apache Iceberg tables, initially you may be doing that maintenance yourself.
Now I'm going to give you the punchline ahead of time: you don't necessarily have to, and I'll talk about that afterwards. But if you don't have tools that do this for you automatically, you have to think about things like your Parquet files. First, my Iceberg table is made up of Iceberg metadata and Parquet files. Well, I want to think about how those Parquet files are written, because Parquet files are going to let me group my data into different row groups, okay? And the benefit of a row group is that if I'm Dremio and I'm reading a Parquet file, I can take a look at the metadata for each row group within the file and determine, hey, do I even need to read this row group, okay? So if I have more row groups, theoretically I can skip more row groups to speed up my query, which is referred to as predicate pushdown, but also if I have more row groups, I'm storing more metadata about each row group. So there's a tradeoff, okay? It's always about tradeoffs.
So if I write my Parquet files with smaller row groups, then theoretically, if I have a lot of data with a lot of different values, that might improve my ability to prune row groups for a query engine, and it also could be better for parallel processing, because a query engine like Dremio can pass those row groups to different nodes on the engine to process in parallel. And it's better for memory usage, because you're not filling up memory with one big giant row group at any given time. That's good, but it means more metadata, because each row group gets its own set of metadata, so that could be a larger file for fewer records, okay? And it's maybe not as great for compressing the data, because each row group is compressed and encoded separately. When it comes to bigger row groups, well, that's less metadata. So basically, if you're talking about really big data sets, where one data file is probably going to have very similar records, you might prefer fewer, bigger row groups, because the data's going to be very similar. You're not going to be pruning the records in that row group as much, so in that case you might prefer less metadata and the compression improvements you're going to get out of those bigger row groups. But again, bigger row groups can also mean more memory being used when you read those row groups.
You're also not going to be able to distribute the work as much; if there are only two row groups, I can only parallelize that by two, okay? And there's less predicate pushdown. So again, just think about how your Parquet files get written. Nowadays, the tools that you use will often make a lot of these decisions for you. So if you're using Dremio and Dremio is writing Iceberg tables, it'll make a lot of these adjustments and setups for you. But at the end of the day, these are decisions that do need to get made at some level when it comes to thinking about how the data lands in my data lake for my data lake house. And be aware of it, so that if you ever hit some bottleneck or somewhere you need to tweak, you know what the different levers are that you can pull.
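For illustration, here's a minimal sketch of controlling row group size when writing Parquet directly with pyarrow; the file name, columns, and row counts are just placeholders. When an engine writes Iceberg tables for you, the equivalent knob is typically a table property such as write.parquet.row-group-size-bytes rather than code like this.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table; in practice this would be millions of rows.
table = pa.table({
    "state": ["KS", "KS", "NY", "NY"],
    "party": ["A", "B", "A", "B"],
    "age":   [34, 51, 29, 62],
})

# row_group_size caps the number of rows per row group: smaller groups mean
# more per-group metadata (better pruning and parallelism), bigger groups mean
# less metadata and usually better compression.
pq.write_table(table, "voters.parquet", row_group_size=2, compression="zstd")

md = pq.ParquetFile("voters.parquet").metadata
print(md.num_row_groups, md.row_group(0).num_rows)  # 2 row groups of 2 rows each
```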
Compaction
Compaction, okay? This is going to be another area where Dremio is going to be able to help you. And this also has to do with the way you partition tables and things like that. But the bottom line is, you might end up with lots of files that might be big or small, and what you would prefer to have, especially for really large data sets, is fewer, bigger files. Every time a query engine like Dremio has to open a file, read it, close it, then open the next file, that's going to slow down your query. So the fewer files you have for the same data, the faster that query can run. Usually, if you're using a data warehouse or a database, these kinds of things are just taken care of for you, but on the data lake, there's no process automatically compacting those files.
Now, if you're using Apache Iceberg, Apache Iceberg does have built-in Spark procedures that can do that for you. Dremio has an OPTIMIZE command that can do this for you. But even better, if you're using something like the Dremio lakehouse catalog that we have on Dremio Cloud, we have an auto-optimize feature, so this kind of thing will just be done automatically for you. That way it feels more like using a database or a data warehouse, where you just don't have to think about these things, okay? But again, absent some sort of optimization service, and there are a few of them now available for Iceberg, you'd be doing this yourself. One of the other big pros of using Apache Iceberg is that, unlike other formats where you usually only have one option for optimizing or managing your tables, with Iceberg you're now starting to see a flourishing ecosystem of vendors offering optimization services. So while I'm a big fan, of course, of Dremio's services, the idea that you have choice, and no vendor lock-in, is one of the big selling points of Apache Iceberg. And the great thing is, again, Dremio is an open ecosystem. So whatever service you use, that catalog can still plug into Dremio and you can still use Dremio to run those queries.
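As a hedged sketch, here's what compaction looks like with Iceberg's built-in Spark procedure; the catalog, namespace, and table names are placeholders, and the session is assumed to already be configured with the Iceberg extensions. In Dremio, the rough equivalent is the OPTIMIZE TABLE command, which the auto-optimize feature can run for you.

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession already configured with Iceberg's SQL extensions
# and a catalog named "my_catalog" (placeholder).
spark = SparkSession.builder.getOrCreate()

# Rewrite many small files into fewer, larger ones (bin-pack is the default strategy).
spark.sql("""
  CALL my_catalog.system.rewrite_data_files(
    table   => 'db.voters',
    options => map('target-file-size-bytes', '536870912')
  )
""").show()

# Dremio's SQL equivalent (run in Dremio, not Spark) would be roughly:
#   OPTIMIZE TABLE my_catalog.db.voters
```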
But compaction is also an important thing to think about with how your data is stored, okay? Especially if you're doing something like streaming, where you're maybe only ingesting a few records at a time, so you're writing really small files, okay? And it's something you also want to be careful about when you're doing partitioning, because you might say, "Hey, let me partition my table by the first letter of everybody's name." That's great. But what if every time you ingest 1,000 people, only one person has a name that starts with M? Now you're creating a bunch of small files in that particular partition, which can affect your performance, and you can clean that up by doing compaction, okay? So regular compaction, always a good idea. But separate from the idea of partitioning, there's also the idea of clustering or sorting your data. So I might be partitioning the data by, say, first name, but let's pretend it's voter data; I might separately want to sort the data by political party, okay? And that's also going to help me reduce the number of files that Dremio might have to query, especially for large data sets, okay? Because right here, if I'm looking for all the records that start with A, I have to open up file one, file two, and file three. So Dremio would have to read three files. But if I had sorted that data before I wrote the files, well, then I'd have three files and I'd only have to open up file one. So now Dremio has cut the file operations down to a third, resulting in a much faster query.
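Here's a hedged sketch of declaring a sort order so new writes come out clustered, and then rewriting existing files to match. Names are placeholders, and the exact options depend on your Iceberg version.

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg-enabled SparkSession with a catalog named "my_catalog" (placeholder).
spark = SparkSession.builder.getOrCreate()

# Declare the table's write sort order: new data files get clustered by party.
spark.sql("ALTER TABLE my_catalog.db.voters WRITE ORDERED BY party")

# Rewrite the existing files so they follow the sort order too.
spark.sql("""
  CALL my_catalog.system.rewrite_data_files(
    table    => 'db.voters',
    strategy => 'sort'
  )
""").show()
```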
So while Dremio is really fast, regardless of what query engine you're using, good data management at the file level and table level will give you faster performance. And again, these are all things that auto-optimization services from Dremio, and a growing list of providers in the Iceberg lakehouse space, can provide for you, okay? Another great reason to be in the Iceberg lakehouse space.
Z-Order Sorting and Partitioning
Okay, then there's Z-order sorting, which is just a particular way of sorting, okay? And this is going to be for when you have queries that frequently access multiple columns. What Z-order sorting, or Z-order clustering, will do is write the rows whose values fall within these ranges into these data files, the values within those ranges into those data files, and so forth. So what happens if I'm looking for someone who has an X of 25 and a Y of 75? Guess what, I know I'm looking in these data files and I can skip all the other data files, okay? Which means from a query engine perspective, I'm reducing a lot of the surface area of which files I need to scan, okay?
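A minimal sketch, using the same placeholder catalog and an invented table, of a Z-order rewrite over two columns:

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg-enabled SparkSession with a catalog named "my_catalog" (placeholder).
spark = SparkSession.builder.getOrCreate()

# Z-order interleaves the sort across both columns, so each file covers a compact
# region of (x, y) and the engine can skip files for lookups on either column.
spark.sql("""
  CALL my_catalog.system.rewrite_data_files(
    table      => 'db.points',
    strategy   => 'sort',
    sort_order => 'zorder(x, y)'
  )
""").show()
```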
And then last, partitioning: being able to say, hey, there's a particular column I know I'm always filtering on. Again, using voter data as an example, let's say I have national voter data and I'm giving each state party access to that data. Well, the Kansas party really only cares about the Kansas data, so having a partition by state makes a lot of sense, because they're only really hitting that particular group of voters. In that case, partitioning by a regularly queried column can really be beneficial, because then you have data files that just cover that particular value, which is different from the sorting, because if I go back to the sorting for a second, you might have a file that has two unique values depending on how the sort goes, okay? So the moral of the story here is that you want to manage those data files when you have a data lake house, and these are all techniques to help you manage that data lake house. But again, a lot of this stuff can be taken care of for you when using an optimization service for this first category.
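For the voter example, here's a hedged sketch of declaring the partition; the table and column names are made up, and the catalog is the same placeholder as before. Iceberg also lets you evolve the partition spec on an existing table without rewriting old data.

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg-enabled SparkSession with a catalog named "my_catalog" (placeholder).
spark = SparkSession.builder.getOrCreate()

# Partition by the column that's almost always in the WHERE clause.
spark.sql("""
  CREATE TABLE IF NOT EXISTS my_catalog.db.voters (
    voter_id BIGINT,
    name     STRING,
    party    STRING,
    state    STRING
  )
  USING iceberg
  PARTITIONED BY (state)
""")

# On an existing, unpartitioned table you could instead evolve the spec:
#   spark.sql("ALTER TABLE my_catalog.db.voters ADD PARTITION FIELD state")
```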
Choosing an Apache Iceberg Catalog
Now, another issue when it comes to an Apache Iceberg lake house specifically: one of the big decisions you have to make up front is which catalog you're going to use. Now, one of the announcements this morning in the keynote is that Dremio is going to start supporting the REST catalog option, okay, or the REST catalog interface. And let me just talk about that, because that's going to really transform this particular question. Before, you would have to choose between all these different catalogs, okay? And you still do, but the implications are going to be a little bit better going forward. So first, let's talk about the catalogs that have existed. Basically, whichever Iceberg catalog you use, that catalog tracks your Iceberg tables. But even though five query engines can all know how to read Iceberg, they still need to consult your catalog first.
So they need to know how to talk to your catalog. And that's where the problems would come in, because certain engines would work with certain catalogs, other engines would work with other catalogs, and so forth. Dremio works with lots of catalogs because, again, we try to be an open platform. We try to be able to talk to everything, so that you're not forced to go in any particular direction; you go the way that makes the most sense for your use case, and we'll support you and be your best friend along the way. But of course, you've heard us talk about this open source project we started a while back called Project Nessie, which is now part of Dremio's integrated lakehouse catalog. And the benefit of this particular catalog, and again, each catalog has different pros and cons, is that Nessie brings you Git-like functionality. So at 4:30 we're going to be having our evening keynote, and there I'll actually be doing a demo of that kind of functionality. So if you want to see how that looks in action, I'll be doing that there.
The beauty of it is, since you're doing this at the catalog level, it opens up a whole new world of multi-table semantics, meaning things like multi-table rollbacks. I don't have to roll back each table one by one; I can roll back my entire catalog. Multi-table tagging: if I want to tag, say, "this is my end-of-month view of all my tables," I can do that instead of tagging each table individually. So it gives you these multi-table semantics when it comes to versioning and time travel, et cetera. And again, at Dremio we have a cloud-managed version of it, which makes that a very convenient option, because you don't have to stand up another service if you don't want to. But because it is open source, you could deploy your own Nessie server, and especially for Dremio customers, we do support you in doing so, okay? And then again, it works with lots of different engines: Spark, Dremio, Flink, Presto all support Nessie, so you can run queries there. If you have other use cases, like if you're streaming data with Flink, you can do that into a Nessie catalog. If you're doing batch with Spark, you can do that with a Nessie catalog, okay? But you would deploy a Nessie server.
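Here's a hedged sketch of wiring Spark to a Nessie catalog and using the Git-like commands from the Nessie Spark SQL extensions. The URL, bucket path, branch names, and library versions are placeholders; check the Nessie docs for versions that match your Spark and Iceberg builds.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Versions below are illustrative placeholders; match them to your Spark/Iceberg.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,"
            "org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.90.2")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v2")
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# Git-like, multi-table semantics at the catalog level:
spark.sql("CREATE BRANCH IF NOT EXISTS etl IN nessie")   # isolate ingestion work
spark.sql("USE REFERENCE etl IN nessie")                 # write on the branch
# ... run your ETL against nessie.db.* tables here ...
spark.sql("MERGE BRANCH etl INTO main IN nessie")        # publish all tables at once
spark.sql("CREATE TAG end_of_month IN nessie")           # tag the whole catalog
```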
Hive Metastore. Many of you might already be on Hadoop and might already have a Hive catalog. In that case, that makes it a really easy starting place for Iceberg, okay? I wouldn't necessarily go to Hive if you don't have Hive already, but if you already have Hive, then that becomes a good default to start with, okay? The downside is that if you don't have Hive, you have to deploy Hive, and it doesn't really give you any extra features other than tracking your Iceberg tables. But again, if it's there, it becomes an easy starting place. You're not having to find new pieces to get started on your Iceberg journey, okay?
Then we have AWS Glue, which, if you are an AWS shop, is a great way to go because it connects to all the other services like Athena, Redshift, and whatnot, so it makes your tables portable within that ecosystem. But if you're using Azure or GCP, that may not be as appetizing, okay, because it won't connect to those services as well. So again, each catalog has its own pros and cons. Then there's Hadoop, which is the file-storage catalog, so literally any storage layer can be treated as a catalog. Now, a couple of caveats about this: the way that particular catalog works, because you don't have a database or other service tracking your tables, is that each table has a little file called version-hint.text in its folder, and that's how a query engine like Dremio tracks, hey, which version of the table are we at? The problem is, a lot of object stores can't give you the same update guarantees when you're updating a particular file, so this is generally not recommended for production, especially if you have concurrent writes, because two writes might both try to update that version-hint.text at the same time, and now you start having consistency issues. So generally, you do want some sort of service-based catalog as your production catalog for Iceberg.
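For comparison, here's a rough sketch of how a Spark session would point at a Hive Metastore, AWS Glue, or file-system (Hadoop) catalog. The catalog names, URIs, and bucket paths are placeholders, and the Iceberg runtime jars are assumed to already be on the classpath.

```python
from pyspark.sql import SparkSession

builder = SparkSession.builder  # Iceberg runtime jars assumed to be available

# Hive Metastore-backed catalog: an easy default if you already run Hive.
builder = (builder
    .config("spark.sql.catalog.hive_cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hive_cat.type", "hive")
    .config("spark.sql.catalog.hive_cat.uri", "thrift://metastore-host:9083")
    .config("spark.sql.catalog.hive_cat.warehouse", "s3a://my-bucket/warehouse"))

# AWS Glue catalog: convenient if you are all-in on AWS services.
builder = (builder
    .config("spark.sql.catalog.glue_cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_cat.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_cat.warehouse", "s3://my-bucket/warehouse")
    .config("spark.sql.catalog.glue_cat.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"))

# Hadoop (file-system) catalog: tracks the current version via a version-hint file,
# so it lacks the atomic-update guarantees you want for concurrent production writes.
builder = (builder
    .config("spark.sql.catalog.fs_cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.fs_cat.type", "hadoop")
    .config("spark.sql.catalog.fs_cat.warehouse", "s3a://my-bucket/hadoop-warehouse"))

spark = builder.getOrCreate()
```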
But again, all of this was a very big decision up until now, because it would lock you into certain tools and out of others, okay? Now, what the REST catalog does is create a standard REST API. Any catalog can implement that REST API, and any engine can read from it. So as different tools and different catalogs adopt the REST catalog standard, a tool like Dremio can connect to absolutely any catalog and doesn't have to build individual support every time a new catalog comes out, because one connector can connect to them all, and that one catalog can connect to every tool that supports the standard. Basically, it'll eliminate that whole compatibility question with catalogs, which will make the Iceberg story a lot simpler. So the point is, when it comes to best practices, one of the first things you have to decide is which catalog, and I'm generally going to lean towards the Nessie story, because that Git-like functionality opens up a lot of practices, regardless of what query engine you're using, that eliminate the need to duplicate data, which saves you money on storage, and also gives you really easy recovery: the ability to just roll back when you make a mistake or something goes wrong and, well, fix things, instead of having to backfill the data all weekend.
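And the REST option is just another catalog entry on the engine side: one standard connector, whatever service sits behind it. A hedged sketch, with a made-up endpoint; authentication properties vary by implementation.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.rest_cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.rest_cat.type", "rest")
    .config("spark.sql.catalog.rest_cat.uri", "https://my-catalog-service.example.com/api/catalog")
    # Credential/auth properties depend on the REST catalog implementation; check its docs.
    .getOrCreate()
)

spark.sql("SHOW TABLES IN rest_cat.db").show()
```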
So that's the story there. And the cool thing is, if you choose one catalog nowadays, you're not stuck there, because out of the Nessie project they created a CLI tool for moving tables from one catalog to another. So you can actually use that CLI tool and say, "Hey, I have all these tables in this catalog and I'd like to move them over to this one," making it really easy to migrate. So you can start with, say, Hive if you already have Hadoop, and then later on, if you want to switch over to something like Nessie, you can easily just move those references over. Basically, the whole catalog question has become less of a friction point than it used to be, okay? Why? One, it's much easier to migrate between catalogs, and two, now with the REST catalog spec, you have a universal standard that eliminates a lot of those issues.
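Under the hood, moving a table between catalogs is mostly re-registering its current metadata file with the new catalog; the migration CLI automates that across a whole catalog, but Iceberg's register_table procedure shows the idea for a single table. The catalog, table, and metadata path below are made up.

```python
from pyspark.sql import SparkSession

# Assumes both the source and target (here "nessie") catalogs are configured in the session.
spark = SparkSession.builder.getOrCreate()

# Point the target catalog at the table's existing metadata file; no data files are copied.
spark.sql("""
  CALL nessie.system.register_table(
    table         => 'db.voters',
    metadata_file => 's3a://my-bucket/warehouse/db/voters/metadata/00042-abc.metadata.json'
  )
""").show()
```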
Building Your Semantic Layer
And three, building your semantic layer. So the general takeaways are: one, try to use an optimization service; there are ways to optimize your data yourself, but if you can have a service that does it automatically for you, that's going to make it a lot easier to have a data lake house, okay? Two, choosing the right catalog is going to make a difference in what your lake house capabilities are. And three, building your semantic layer. The idea here is basically predefining all the versions of your data that you need for consumption, okay? Because oftentimes what would happen is that you would have your tables, and then people would define little subsets of the data they want in their favorite BI tool. So they might create a Tableau extract or a Power BI extract of those specific sub data sets, but those would only live in that BI tool. So if you have two teams using two different BI tools, you're redefining that data set twice, okay? And that's costly and not exactly efficient. The great thing about using a tool like Dremio that has a built-in semantic layer is that you can define that sub data set once as a virtual view, and now Power BI and Tableau can both take advantage of it. You can turn on something like Reflections and they both take advantage of it, without having to do it twice. You're not paving the same road twice.
So again, building that semantic layer: you can do it with Dremio. It's not the only option in the world, but, well, we like it. The bottom line is you want to organize that semantic layer in a way that makes sense. Here we're showing more of a data mesh type structure, okay? You see here I have what you can think of as either a virtual data mart or a data product for each of my different business units. And then within that, you can organize it by your favorite three-layer terminology, whether you like bronze, silver, gold; raw, curated, production; or raw, business, application. At the end of the day, the idea is you have your raw data, you do some work on it, and then you make your production-consumable bits at the end, okay?
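To make the layering concrete, here's a rough sketch using session-scoped SQL views, shown through Spark just to keep the examples in one language; in Dremio you would create the same views once in the semantic layer, where they persist and are shared by Tableau, Power BI, and every other consumer. All table, folder, and column names are invented.

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg-enabled SparkSession with a catalog named "my_catalog" (placeholder).
spark = SparkSession.builder.getOrCreate()

# Silver / curated layer: join and clean the raw (bronze) tables once.
spark.sql("""
  CREATE OR REPLACE TEMPORARY VIEW curated_orders AS
  SELECT o.order_id, o.customer_id, c.region, o.amount
  FROM my_catalog.raw.orders o
  JOIN my_catalog.raw.customers c ON o.customer_id = c.customer_id
  WHERE o.status = 'COMPLETE'
""")

# Gold / business layer: the view that BI dashboards actually hit.
spark.sql("""
  CREATE OR REPLACE TEMPORARY VIEW revenue_by_region AS
  SELECT region, SUM(amount) AS total_revenue
  FROM curated_orders
  GROUP BY region
""")
```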
But the cool thing about doing it virtually, through virtual data marts, is that you're eliminating the copies you would traditionally make, which is going to be a better practice. And then, when you do need it, you can just turn on Reflections to speed up the bits that need it, okay? So taking that virtual data mart approach, and using that semantic layer as the place where you define and do a lot of that last-mile ETL work, is going to be a time saver, increase self-service, and, well, save you money, because you're not storing as much and you're not running as much compute, especially if you're taking advantage of things like Reflections, which just swap in those optimized bits behind the scenes, so overall you're going to get a better story, okay? And then, again, now we have the Reflection recommender, because that was always the trouble before, right? For those who've been using Dremio, it's like, when should I use Reflections? What is the right Reflection? Or maybe you make five Reflections, and then you have too many Reflections, and that becomes the bottleneck, okay? We at Dremio are always listening to you, so we know that, and that's why we built the Reflection recommender, to help you identify where those good opportunities are to save money.
And then we now have incremental Reflections, so if your data is in Iceberg, and again, this is another reason why the Iceberg lake house is so good, those Reflections can be updated incrementally. Instead of doing a full rebuild of Reflections periodically, you're just getting the delta, so those Reflections update faster, and yeah, that works pretty well. And again, there's so much more to come, as you saw in Cedric's keynote, and even more that I can't say yet because it's too far down the road, but it's coming, I'm telling you, it's coming, and it's cool, so there's a lot to be optimistic about, a lot to be excited about.
But again, here's just another example of how this would look. The purple ones are physical data sets, actual physical tables stored on my data lake, okay? I might join them, so over here, this might be my curated view, or silver data, okay? And then I start generating these views in that gold layer that are going to be what people actually build BI dashboards on. But since I'm using a virtual data mart, and I'm doing this modeling virtually, none of these are copies, so we're not duplicating any data. Because of that, though, assuming I'm not using Reflections, when I run this SQL, it's rerunning this SQL, which is rerunning this SQL, and after so many layers of views, that's going to add up. In the past, what we would do is go create a BI extract or a materialized view, and that would create yet more abstractions. And if I created a materialized view, what would I have to do? I'd have to go educate my analysts: don't query business view E, query the materialized business unit view E, because if they accidentally queried this name, they'd still end up with the same problem, okay? They're not saving themselves any money. So you could use these old ways of doing things, but they still required work on the consumption layer, which still meant mistakes, which still meant slow queries, and we wanted to avoid that. So, getting to this whole best practice: if we can use Reflections, we don't have to worry about that. I just flip a switch, turn on Reflections there, and now when the user runs the query, Dremio behind the scenes will just say, hey, we're going to query reflection J instead, and that user ends up with a faster query without having to think about it.
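As a hedged sketch of the "flip a switch" part: in Dremio you define the business view once and attach a reflection to it, and queries against the view name get substituted automatically. The run_dremio_sql helper below is hypothetical (stand-in for the Dremio SQL runner, REST API, or an Arrow Flight client), the view and column names are invented, and the reflection DDL syntax can vary by Dremio version, so treat this as illustrative rather than exact.

```python
# Hypothetical helper: replace with however you actually submit SQL to Dremio.
def run_dremio_sql(sql: str) -> None:
    print(sql)  # placeholder so the sketch runs on its own

# Define the shared business view once in the semantic layer.
run_dremio_sql("""
CREATE OR REPLACE VIEW marketing.gold.revenue_by_region AS
SELECT region, SUM(amount) AS total_revenue
FROM marketing.silver.curated_orders
GROUP BY region
""")

# Accelerate it with a reflection; users keep querying the view name,
# and Dremio swaps in the reflection behind the scenes.
run_dremio_sql("""
ALTER DATASET marketing.gold.revenue_by_region
CREATE RAW REFLECTION revenue_by_region_raw
USING DISPLAY (region, total_revenue)
""")
```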
And that's always the ideal world, right? Because at the end of the day, the more you have to ask people to change the way they've been doing things, the harder it is for them to do it. So the idea is that if you want to avoid the mistakes, you want a system that's going to do it for them automatically, and that way they just keep doing what they're doing, and they're just like, thank you, it's faster now, okay? And that's why Reflections, I would say, are a best-practice way of accelerating your data lake house. There are so many different aspects of managing a data lake house, but again: how your data is structured is going to matter, and Dremio will help you write those ideal Parquet files and manage that data, okay? Choosing the right catalog is going to matter, and what the right catalog is for you is going to depend on your use case, but Dremio is probably going to be able to connect to that catalog, so you have those choices, okay? And finally, you don't want to duplicate that data, and you want to make sure your users don't have to deal with too much change management every time you try to accelerate what they're using. Using something like Dremio with Reflections is going to really help you have a lake house that's easy to use for your consumers, but also easy for you to manage, especially with all those observability features to help you observe and manage those Reflections.