Gnarly Data Waves
Episode 32 | September 12, 2023
Introduction to Dremio Arctic: Catalog Versioning and Iceberg Table Optimization
Join this webinar for an introduction to Dremio Arctic, a data lakehouse management service that features easy catalog versioning with data as code and automatic optimization for your Apache Iceberg tables. Learn how Arctic helps data teams deliver a consistent, accurate, and high quality view of their data to all of their data consumers with a no-copy architecture.
The data lakehouse is an architectural strategy that combines the flexibility and scalability of data lake storage with the data management, data governance, and data analytics capabilities of the data warehouse. As more organizations adopt this architecture, data teams need a way to deliver a consistent, accurate, and performant view of their data for all of their data consumers. In this session, we will share how Dremio Arctic, a data lakehouse management service:
- Enables easy catalog versioning using data as code, so everyone has access to consistent, accurate, and high quality data.
- Automatically optimizes Apache Iceberg tables, reducing management overhead and storage costs while ensuring high performance on large tables.
- Eliminates the need to manage and maintain multiple copies of the data for development, testing, and production.
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Introduction
Alex Merced:
Hey, everybody! This is Alex Merced, developer advocate here at Dremio, and welcome to another episode of Gnarly Data Waves. [In] this episode, we'll be talking about an introduction to Dremio Arctic: catalog versioning and Iceberg table optimization, an exciting topic that I'm sure you're going to enjoy.
But before we get in there, I want to remind you that you can Test Drive Dremio by heading over to dremio.com and clicking that Test Drive button, where you can test the Dremio Lakehouse platform with no costs and no obligations, [and] see what the big deal about the data lakehouse is, and what can be done directly from your data lake.
Also, make sure [to] pick up an early copy of Apache Iceberg: The Definitive Guide. Right now the early release copy has about 180 pages' worth of Iceberg content; the final manuscript will be 300-plus pages, released early next year. So go check it out for free––get an early copy, scan that QR code.
And Dremio will be at many different events, such as Big Data and AI Paris, Big Data London, and Coalesce by dbt. And Dipankar and I, the developer advocates here at Dremio, will be at Data Day Texas in January, doing an Apache Iceberg Ask Me Anything, so make sure to be there. It's gonna be a good time. I was at [Data Day] Texas last year, and it was a delightful time, so make sure not to miss it.
Okay, but with no further ado, let's get on to our feature presentation––Introduction to Dremio Arctic: Catalog Versioning and Iceberg Table Optimization, with Jeremiah Morrow, product marketing director here at Dremio for Iceberg and Dremio Arctic. Jeremiah, the stage is yours.
Introduction to Dremio Arctic: Catalog Versioning and Iceberg Table Optimization
Jeremiah Morrow:
Thanks, Alex. Hi, everyone. My name is Jeremiah, and I'm responsible for product marketing here at Dremio. These days, I'm mainly focused on Apache Iceberg as well as our lakehouse management service, Dremio Arctic. And today's Gnarly Data Waves episode is all about Dremio Arctic. Here's a quick agenda for the next 30 minutes or so. First, to review, I'm gonna talk about the rise of data lakes, and how we got to the architectural inefficiencies we see today. Then I'll talk about what's needed to bring us from data lakes to a data lakehouse. Spoiler––table formats are a key part of that story, and we love Iceberg here, so I'm going to talk a little bit about Apache Iceberg.
Then I'll talk about how you can enhance and expand that Iceberg experience with Dremio Arctic, including automatic optimization for tables and catalog versioning with data-as-code, and then I will briefly show you some of those capabilities in a quick Arctic demo. Afterward, we should have plenty of time for Q&A. So get your questions ready, and whenever you like you can ask your questions, and we will address them at the end of the presentation.
A Brief Review of Data Lakes
Jeremiah Morrow:
So data lakes and data warehouses––how did we get here? It started decades ago when we were all using a data warehouse, an EDW. And it was really good at what it was designed for. It could store, organize, and analyze primarily structured data from business systems sitting alongside it, housed in a data center, and it could provide reporting on that data within a reasonable amount of time. But then we started collecting more data and more data types, including semi-structured and unstructured data. Data was coming in from a lot of different sources: mobile devices, social media, the Internet of Things. A lot of those sources were outside of the data center, and the data warehouse was built for a time and place when that scale of data was not there. It couldn't keep up with data growth, and it also couldn't meet much more aggressive performance SLAs––near real-time, for example.
So we built data lakes, first with Hadoop on-premises and then, as cloud vendors grew, with object storage. And data lakes were also really good at what they were designed to do. They were really good as cheap and efficient storage for a large volume and variety of data types, and data scientists loved them for small-scale data science projects with small teams. But the data lake never really replaced the data warehouse for enterprise BI and reporting: it struggled especially with concurrency, and it was missing support for ACID transactions. So the data warehouse lived on.
And so this is what we see in probably 99% of the enterprise organizations we talk to today. This is what I call a cooperative data architecture. You'll have one or many data warehouses sitting alongside one or many data lakes, both of which are doing pretty much what they've always done: data warehouses for BI and reporting, and data lakes for data science. And so if a data consumer needs an important source of data from the data lake for a dashboard or report, the data team needs to move that data over via some ELT or ETL process to make it available for data consumers. That's fine on a small scale. But over time, this architecture takes a lot of work and effort to maintain, especially as more of the data that we need for BI and reporting lands first in the data lake: those ETL and ELT processes begin to proliferate, and every new pipeline becomes another asset that the data team needs to manage and maintain in order to deliver data access. And most data teams have a goal of presenting a unified view of all of the data for end users.
Cooperative Data Architecture: Data Lakes + Warehouses
Jeremiah Morrow:
And there's a really fun philosophical conflict right now in the market about which of these architectures is ultimately the one we should be consolidating on. Is it the data warehouse? Or is it this new concept called the data lakehouse, which combines the flexibility and scalability of data lake storage with the analytics––particularly the BI and reporting capabilities––as well as the data management and data governance that you get in a data warehouse? And I think, based on the fact that data lake storage is increasingly becoming the de facto landing spot for more and more of our data, if both architectures were completely equal in terms of capabilities, most of us would probably choose the data lakehouse. So the first attempt at giving data consumers direct access to the data lake was to put a query engine on top of data lake storage. That idea has been around for a really long time at this point, relatively speaking, and it does serve a purpose and it works in some cases. We obviously have a lot of Dremio customers who are happy analyzing data in place, using our Arrow-based query engine and the semantic layer to join and query data wherever it lives, including data lake storage, on-premises, and in the cloud.
But to build a lakehouse, we need to add something new. We need a layer in between data lake storage and the execution engines. We need to enable all of the things we can do in the warehouse, including governance, write support, [and] storage optimization. We need to accelerate query performance to give customers the BI performance at scale that they need. And we need to make all of that very easy. So we've added a few layers to the stack, and these are efforts to deliver some of the management and governance capabilities that you find in a data warehouse. The first attempt at organizing the lakehouse was file formats. File formats were proven to improve performance and also compress data for storage optimization, and a lot of companies have seen the benefits of standardizing on a file format like Apache Parquet, for example.
Table formats are sort of the natural evolution of file formats, and they build on a lot of those optimizations to make data even easier to manage, the way a company would manage a data warehouse. Once you've adopted a table format, the final piece of that puzzle is a lakehouse catalog that makes data management really easy and efficient, that provides the security and governance capabilities that enterprise organizations need for their data, and that makes it very easy for all of your end users to access a consistent and high-quality view of your data––and for us, that's Dremio Arctic. So these are the steps that it takes to build a lakehouse––to go from data lake to data lakehouse, adding all of these pieces in red.
Most companies are at this point using file formats. They've seen the advantages of file formats––Parquet does a lot to compress the data and deliver higher performance than CSV, for example––and most of those customers are using some sort of file format in their data lake. So let's talk about table formats. And specifically, let's talk about Apache Iceberg, because it has a lot of cool features for data management that we love here at Dremio.
First, it's an open table format, which is important for a couple of reasons. Proprietary technology is part of the reason we are where we are in terms of that graph I showed you earlier, with the architectural inefficiencies. I've talked to a lot of customers over the last 5 years or so who say that they need to get off of their legacy data warehouse: it's slow, it's expensive to maintain, and in order to achieve the performance they're looking for, they have to duplicate data in the form of BI cubes and extracts, all of which are an additional layer of management. It was never designed for the scale of data that we have today. The problem is, migrating off of those platforms takes a lot of work and a lot of time, and so keeping your data in open formats like Apache Iceberg ensures that you don't have that difficulty the next time there's an innovative data tool that you want to leverage. By keeping data in the data lake in an open format, you will always be able to access it, and it will be available to you, no matter what tools come down the road.
And second, Dremio certainly believes––and I definitely believe––in a multi-engine future. One execution engine doesn't do every data job well, and so you and your organization should have access to the most efficient and the best tool for every analytic workload.
Let’s Talk About Iceberg
Jeremiah Morrow:
So Iceberg is an open table format. It's got the most contributors of any table format [in] the market today. It's got the support of a lot of technology companies, meaning that by choosing Iceberg, you can be assured that you're going to find support within the ecosystem––that your tools are going to work with Iceberg now and in the future. Iceberg was built to solve a lot of the challenges with table formats like Hive, especially for large tables. Iceberg enables easy table optimization with features like garbage collection and compaction, which solves some common challenges for enterprise data lakes. It enables consistency with ACID transactions, and it ensures high-performance queries even at massive scale. So if you're undecided on a table format for your data lake, take a serious look at Apache Iceberg, and definitely talk to our developer advocate team. Alex and Dipankar are regulars on Gnarly Data Waves. They put some great content out on social media and LinkedIn. [They're a] very good source of information. They've written and produced a ton of really valuable stuff around Iceberg.
But for Dremio, Iceberg is just the beginning. We wanted to make Iceberg even easier for companies to use by automating a lot of those optimizations that Iceberg provides, so data teams don't need to think about them––they just work. And we wanted to provide the tools that enterprise data teams absolutely need to adopt a data lakehouse, and for a lot of our customers that's security and governance. And finally, we wanted to make it easier and more cost-efficient than ever to provide a consistent and accurate view of your data for every data consumer, continuing the Dremio tradition of not needing to manage and maintain multiple copies of data––and so Dremio Arctic uses data-as-code for that.
Dremio Arctic
Jeremiah Morrow:
So that's what we built with Dremio Arctic––it is a lakehouse management service that makes it really easy to manage your Iceberg data lake, and it features data optimization, data-as-code, and enterprise-grade security and governance. Now we can go through a few of those features.
First, Arctic is a modern lakehouse catalog. It's Iceberg-native, so a lot of features are based on Iceberg functionality, and we're adding to that with some automation. Arctic is based on the open source Project Nessie, which is built into the Iceberg project. Arctic is accessible by multiple execution engines: you can use Dremio for interactive BI and dashboards, which it is excellent at, and you can also use other execution engines for different analytic workloads. One of the key use cases for Dremio is enabling a data mesh, and Arctic definitely enables that architecture with the ability to manage multiple catalogs in complete isolation, so [it] give[s] your domain owners federated ownership of their data and enables them to share data across the organization very easily.
And of course, we feature access control and integration with user groups and directories, with additional RBAC features coming soon. From an optimization standpoint, we took some of the features in Iceberg and added the ability to essentially set it and forget it. Today, a lot of this is schedule-based: you set it on a schedule, and it runs in the background without you having to manage it. And full automation is coming down the road.
Automatic Optimization
Jeremiah Morrow:
The first is table optimization, which is based on Iceberg's compaction feature. It solves a really common problem we hear about within enterprises for their data lakes, where, because of streaming or micro-batching, you're ingesting a bunch of small files that need to be rewritten into larger files for performance. For a lot of companies, this is a very manual, incredibly labor-intensive and tedious task that the data team just has to do in order to deliver analytics on streaming and micro-batch data. With Arctic, you can just schedule compaction jobs to take place at regular intervals, and then just forget about it. It runs in the background.
The second is table clean-up, which is based on Iceberg's vacuum function: vacuum removes unused files to optimize storage. Again, just like with compaction, this is a manual task that Arctic can do at regular intervals, so the data team just doesn't have to worry about it. And together, these capabilities ensure high performance on your Iceberg tables and also lower storage costs as well.
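To make the two operations concrete, here is a minimal sketch of their manual equivalents in Dremio SQL, assuming a hypothetical table named Demos.sales; Arctic's scheduled optimization runs the equivalent of these for you, and the exact options available may differ by Dremio version.

```sql
-- Compaction: rewrite many small data files into fewer, larger ones
-- so queries scan fewer files and less metadata.
OPTIMIZE TABLE Demos.sales;

-- Table clean-up: expire snapshots older than a cutoff you choose and
-- remove the data files they no longer reference, reclaiming storage.
VACUUM TABLE Demos.sales
EXPIRE SNAPSHOTS older_than '2023-06-01 00:00:00.000';
```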
What is Data-as-code?
Jeremiah Morrow:
The next feature that I wanna talk about is my favorite, and it is data-as-code. At its core, data-as-code is the practice of applying software development principles to data management and governance. To set the stage, here's essentially what our customers tell us they want. They want every data consumer to have access to a consistent, accurate, high-quality view of their data. They want to make it very easy to make changes to the data in isolation without impacting other users. And all of that, using traditional tools and traditional data management techniques, is incredibly difficult. Often it means having to stand up, manage, and maintain multiple environments for dev, test, and production. On the other hand, from a data consumer perspective, data scientists often want access to production-quality data, but in a safe way that enables them to do what they want without impacting other users, and so they require their own environment. All of these multiple environments are brand-new copies of the data. They're new data pipelines. And all of them need to be managed and maintained by the data team.

So data-as-code uses essentially git-like features––branches, tags, and commits––to enable very easy version control and delivery of multiple consistent, accurate views of production data without having to build multiple environments. It's a zero-copy architecture. It uses metadata pointers––a feature called snapshots––in order to show the state of the data lake at a certain point in time.
And so what you can see here is multiple branches off of the main branch. Again, zero-copy clones. In this example, I have my main branch of the data, and that's my production branch, where my end users are using that data for things like dashboards and reports. If I want to make changes to that data, I can create a separate branch. So in this example, I have an ETL workflow: I'm adding new data in, and then I can check it for quality before I merge it into the main branch.
Every step of this process is visible through a commit history. I can use tags to call out specific commits––that's those green bubbles there. And all of the work is done, again, with no copies, and my users don't see any changes to the data until I merge the ETL branch with the main branch. On the lower branch––the data science branch––if I want my data scientists to have access to production data, I can create a data science branch. In just a few seconds, they have access to production data that they can work with however they want, in complete isolation from my production users, and tags and commits give data scientists some great tools for collaboration and also for model reproducibility. So overall, data-as-code gives data teams the ability to work on data in isolation while delivering a high-quality version of that data to their end users.
It delivers very easy version control, and another level of governance, where every single change to the data is tracked, and recovering from mistakes takes just a click of a button: you can roll back to a previous state of the branch.
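Here is a minimal sketch of that branch-and-merge workflow in Dremio SQL, using the same branching commands shown later in the demo. The catalog name Demos and the table names are hypothetical, so treat this as illustrative rather than a copy-paste recipe.

```sql
-- Create an isolated ETL branch of the catalog (a zero-copy operation).
CREATE BRANCH etl_jan IN Demos;

-- Point this session at the branch; main is untouched by anything we do here.
USE BRANCH etl_jan IN Demos;

-- Load and validate new data on the branch.
INSERT INTO Demos.sales SELECT * FROM Demos.staging_sales;

-- Optionally tag the validated state for auditability and reproducibility.
CREATE TAG sales_jan_validated IN Demos;

-- Publish: consumers on main only see the new data after the merge.
MERGE BRANCH etl_jan INTO main IN Demos;
```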
Use Case: Data Product Development
Jeremiah Morrow:
I wanted to share a use case––I mentioned data mesh, so I wanna share an example workflow within that context. Data-as-code makes a lot of sense for managing data products, because now we can treat them like software applications. If you talk to a software development team today, developers would never ship a product without thorough testing, without quality assurance, and without some form of CI/CD. So why, as a domain owner, would I ship a data product without the same level of version control, ease of delivery, and consistency?
So in this graphic, we have a sample domain-level catalog, and we have a marketing data product that shows web traffic. We, the data team, need to make regular updates to the data product as web traffic obviously changes over time, and multiple users within marketing and other departments might be using the web traffic data in their dashboards. So we need to make these updates in isolation, without impacting their dashboards. With data-as-code, the data product that we present to our end users is the main or production branch, and we use an ETL branch to bring new data in. We can do all of our testing; we can even run a dashboard against the branch to make sure everything is working once we've added the new data and the new data is represented in the results, as sketched below.
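For that quality-check step, here is a sketch of what a verification query might look like in Dremio SQL, assuming a hypothetical domain catalog called marketing with a web_traffic table and an etl_web_traffic branch; the AT BRANCH clause lets a test query or dashboard read the branch without touching production.

```sql
-- Run the dashboard's query against the ETL branch to verify the new data
-- shows up in the results before anything reaches production users.
SELECT page, SUM(visits) AS total_visits
FROM marketing.web_traffic AT BRANCH etl_web_traffic
GROUP BY page;
```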
After those quality checks, we can merge to main, and our production customers will see the new view of the data after the merge. Super cool, right? So that's Dremio Arctic in a nutshell: easily manage your Iceberg tables with automatic optimization, and manage your data like software code to deliver a consistent and accurate view of the data to all of your data consumers in a no-copy architecture. Now, let's look at Dremio Arctic in a quick demo to see some of these capabilities in action.
Demo: Dremio Cloud UI, Integration of Dremio Arctic
Alex Merced:
Hey, everybody! This is Alex Merced, developer advocate here at Dremio, with a quick demonstration of the new Dremio Cloud UI, and the new integration of Dremio Arctic as the default catalog in Sonar.
Okay. So, basically, the Dremio Cloud platform has 2 different types of projects. There are Sonar projects, which are the query engine––that's the screen I'm on right now. And then there are Arctic catalogs. Now, when you create a Sonar project, every Sonar project is going to have a default catalog. In this case, my default catalog is located over here, and that's going to be sort of my main place for organizing in this Sonar project. But I can create other Arctic catalogs, which then show up like other sources connecting to object storage or databases. The cool thing about this is that it fulfills the same role that the Spaces feature did in the past, where I can now break down this space, or this catalog, into different sections. In this case, you know, if I were doing a data mesh, I could break it up into different folders, like accounting, marketing, sales, which can then be governed and controlled.
But you get the added benefit that any views I create and any Apache Iceberg tables that I create within the catalog are now versioned. Every time you create a folder, that's a commit. Anytime you create or alter an Apache Iceberg table in the catalog, that is a commit. Every time you create or save a view, that is a commit in the catalog. So let me just show you some transactions that kind of illustrate some of that.
Okay, so I'm gonna head over to the SQL editor in my Sonar project. And what I'm gonna want to do is create, let's say, a new table. Now, I could just say, create a table, and I'm going to do this in my Demos folder. Actually, before we do this, I can create a new folder pretty easily. So if I go to Demos...
Okay, what I'm gonna do is make a folder for today. So I'm gonna click 'add' here, and I want to add a new folder. Now, today is September eighth, so we'll do this for Sept 8.
Cool. And now I've just created that folder. And again, the creation of that folder is actually a commit. So if I head over to here to Arctic catalogs, I can go browse my different Arctic catalogs. This was in my demos catalog. And then I can see here in commits––I can see my commit history, and I can see, hey, we created that folder. And the cool thing is when you have multiple users in Dremio, I can see which user made the transaction. I get the transaction ID, which I can use to do rollbacks for auditing purposes, or to time travel. I can then also again see exactly what was done there. So I know what was done and who did it, and I can see when they did it––in this case, 15 seconds ago.
Okay, so I get this nice auditable view for me to understand what's kind of going on in any particular Arctic catalog. And then again, I can also audit my branches while I'm here, any tags within that catalog, and then browse the data in that catalog. But let me go back––that was our Arctic project. Now that we've created that folder, we'll go back to my Sonar project. I'm going to go back to my SQL editor and I'm going to create a table. This would be in my Demos catalog, and it's going to be in that September eighth folder, which is september8… And we're gonna call this table something really simple, and it's just gonna have one field, name, which is going to be a varchar. We'll do a couple transactions right off the bat––and I'm going to just insert a few records: insert into Demos.september8.names VALUES. And we'll insert my name.
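Reconstructed from the narration, the statements in this run of the SQL editor look roughly like the following (folder, table, and value names as described; this is a sketch, not the exact on-screen SQL):

```sql
-- Create a simple Iceberg table in the new folder of the Demos catalog;
-- the CREATE and each INSERT become commits in the catalog's history.
CREATE TABLE Demos.september8.names (name VARCHAR);

INSERT INTO Demos.september8.names VALUES ('Alex Merced');
```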
And a cool thing I can do [is] run multiple transactions in one Dremio session, in one run of the SQL runner. So I'm going to hit run, and that's going to begin running these queries.
So I can then see the progress of all these queries that I've lined up right here. Okay, so it created the table, and now it's going to insert the records. And then, once these inserts are done, we'll be able to inspect the jobs. So now I can click here on query one to see that the table was created successfully. I can click here on query two [and] see [that] one record was inserted. Okay, now, what if I wanted to insert more records? It might get a little tedious to keep typing in Demos.september8.names. Now, there's a cool thing you can do––let's insert another record.
What if I just want to use the table name? So I want to say, insert into names VALUES 'Jeremiah Morrow.' Now, if I type it this way, how does it know that I specifically want to go inside Demos, inside that September eighth folder? Well, there's this context section right here. What I can do is set the context specifically to my Demos catalog, to that September eighth folder, and that sets the context of the session. So going forward, when I run these queries from that context, it's always going to assume that part of the namespace. And again, I can even do that to change branches. As we'll see here later on, I'll be able to create multiple branches, and I can even switch that branch context here if I wanted to, for that run of the SQL editor.
Okay, but I can just run that query [and] insert another name into the table. And we'll get that done. So that's inserted. And now what we'll do is query the table: select star from names. Because again, we're still in the same context, I don't have to type out the whole namespace. Okay, cool. There's my 2 records, so we can see... And again, that's querying from the main branch. Now let's say we wanted to add some records––what I can do is create a branch. And the cool thing is that Arctic is backed by an open source project called Nessie. So everything that I'm doing here you can do in any tool. You could be doing this in Spark, you could be doing this in Flink, you could be doing this in any tool that supports a Nessie catalog, in the sense of being able to utilize different branches. So what I'm going to do is create a branch and then insert more records from the branch. That way, we're isolating those records so they're not necessarily visible to our mainline production branch. So then, what I can do here is, first, we'll create a branch. We'll say, "create branch," and we'll just call it september8.
And then I have to say which catalog it is––so this will be in the Demos catalog. Okay: create branch september8 in Demos. Then we're going to want to switch to that branch, so that every query going forward is in the context of that branch. So: use branch september8 in Demos… And then I can go do my insert. So, insert into… And just to be 100% sure, I'm gonna type out the full namespace. So we'll say Demos.september8.names. And then what I want to insert here is "Dipankar Mazumdar" and "Jason Hughes".
Okay. So there's going to be our query. We'll run that. So what it's going to do first is create the branch. Now that [the] catalog has a new branch, it's collecting all the commits and transactions that occur in that catalog on that branch.
Okay, now I've switched over to that branch and I've done the insert. So now I've inserted these records. Now, just to prove to you that I've made these changes solely on the branch, here's what we're going to do. First off, we're going to query from names: select star from names. Okay? And again, our context is set to main––this is always going to be where the SQL runner starts; I can change the context using those use commands, but this is always the starting place––so it's starting from main. Then what I'll do is switch over with use branch september8 in Demos, and then we'll try that again––select star from names––and again, just to be a hundred percent sure, I'm just gonna do the whole full namespace: Demos.september8.names. And I'm going to just do that over here.
So again, this first query should be coming from the main branch. Then we're switching branches, and then we're going to query it from that branch that we created. So it's going to run those queries, and now it's querying the branch. Cool. Now, if I take a look at the first query, you see it's only Jeremiah and Alex there, because those were the transactions we did when we were still on the main branch. But then, after switching branches––which, again, basically says, now we're switched over to the september8 branch––Jason and Dipankar are added. Okay, so you notice that data is now isolated there. So this has several different implications. Again, that can be used for isolating ingestion.
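Pulling the branch portion of the demo together, the two runs of the SQL editor look roughly like this (reconstructed from the narration, with branch and table names as described):

```sql
-- First run: create the branch, switch this session to it, and insert.
-- These rows are only visible on the september8 branch.
CREATE BRANCH september8 IN Demos;
USE BRANCH september8 IN Demos;
INSERT INTO Demos.september8.names
VALUES ('Dipankar Mazumdar'), ('Jason Hughes');

-- Second run (each run starts back on main): compare the two branches.
SELECT * FROM Demos.september8.names;   -- main: Alex and Jeremiah only
USE BRANCH september8 IN Demos;
SELECT * FROM Demos.september8.names;   -- branch: all four names
```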
Okay, because of that, I can do multiple transactions in that branch before ever having to merge and publish those transactions, enabling multi-table transactions. Or, on occasion, I might just create a branch to create an environment for somebody else to play with, so they can go do whatever they want in that branch––experiment from that branch, add, insert, [and] delete records––knowing that any changes they make from that branch will not affect what's visible to the mainline production branch, which all your front-line analysts are running analytics on.
So you enable all these possibilities, and all of it is without duplicating the data or having to create an isolated copy of the data to do your changes on and swap it out. None of that. It's just all zero-copy.
So that's all very well and good, but not only do you get that––that's just sort of showing off the power of Nessie. And again, I could go create a branch, and then, you know, have Spark ingest data into the branch. I'd be able to see the data, and work with the data from that branch here, and then publish it when I feel like it's ready. So, again, different possibilities.
But there are other things you can do. If I go to my Arctic project, I go to Demos, and I can go to any particular table. So here's September eighth, and here's the names table that we just created. I can click over here, and I can set up automatic table optimization. Basically, what you would do is go to your project settings. The first thing I'd have to do is set up my engine for this particular catalog. You can see here, right where it says configuration, I click here, and this will allow us to configure an engine for optimization purposes using your AWS credentials. So I would just select the cloud that I already have set up, and then I can say, okay, I want, let's say, a small engine, here's where any data gets written, and provide the AWS access key and all that stuff. And basically, it'll use that engine anytime we want to optimize our table.
And again, it knows when to optimize tables, because I can go back here, choose any particular table, and then from here I can set a schedule and say, okay, optimize this table. So things like compaction, you don't have to think about. It will run that compaction periodically on the schedule that you set, so you don't have to worry about the small files problem becoming too much of a problem. You just don't have to think about maintenance––it's happening behind the scenes, [and] you're getting that isolation. And it's now all really well integrated into the Dremio UI, so it makes it really easy. But again, it's not exclusive to Dremio being able to use that Arctic catalog––that Arctic catalog can be connected to from Spark [or] from Flink pretty easily.
If you want to see some examples of that, there are several tutorials I've done on Nessie showing how to connect Nessie catalogs to things like Flink and Spark over on the Dremio blog. It would essentially be the same if you were using Arctic––you would just basically get your token from Dremio for authentication purposes, but otherwise it would be sort of the same process, because, again, under the hood, it's still a Nessie catalog that's providing the catalog functionality. It's enabling that branching and that merging and so forth.
So hopefully you guys enjoyed this again. My name is Alex Merced, developer advocate here at Dremio. I'll see you all around again. Make sure to head over to dremio.com/blog to learn more, and I'll see you all around.
Presentation Closing
Jeremiah Morrow:
So that was a quick demonstration of the capabilities of Dremio Arctic, and specifically data-as-code, in action. Another thing that we have done, if you're ready to give Dremio Arctic a try: Dremio Arctic is now the default catalog in Dremio Cloud. So every single new project in Dremio Cloud will get a Dremio Arctic catalog.
So you get our semantic layer capabilities, which are already very cool for joining and sharing data views. And you can expand on the semantic layer with automatic optimization for your Iceberg tables as well as catalog versioning with data-as-code, and we have multiple tutorials if you want to get up and running and try out data-as-code for yourself. So check out Dremio Cloud for free at www.dremio.com/get-started/, and if you have any questions about Arctic, or how to use any of these capabilities, feel free to reach out to myself or the developer advocate team. We're all very happy to help you get up and running. And that concludes the formal presentation for today. Happy to start answering any questions that you may have from the audience.
Q&A
Alex Merced:
Hey, everybody! Oh, welcome back! So if you have any questions, do put them in the Q&A box so we can get to them and help answer anything you're wondering about Dremio Arctic, Nessie, and catalog versioning. Our first question: Can you explain the concept behind data-as-code? Jeremiah, go for it.
Jeremiah Morrow:
Yeah. So the concept behind data-as-code is the idea that software development was really transformed by GitHub, and the way that we deliver software code now with easy versioning, easy governance, [and] collaboration––all of the things that GitHub gave software developers are now best practice. So why not introduce those concepts to managing and delivering data? That is the basic idea behind data-as-code––treat it exactly like software developers have treated software development and software code. And the other idea behind it is zero-copy cloning. So your data literally exists as code in a branch. You're not managing and maintaining multiple environments of the exact same copy of data; you're giving everybody a clean version of that production data to do whatever they need to do with. Yeah, you might be on mute, Alex.
Alex Merced:
Okay, so our next question is: can we write functions or procedures or packages? Let me take that one. If you're saying you want to write SQL functions, the actual SQL you use would be specific to the engine you're using. In this case, Dremio does support scalar and tabular UDFs, so you'd be able to use those UDFs and save views based on those UDFs, or save tables that were affected by those UDFs, into your catalog, because the catalog itself isn't necessarily tracking the SQL or [the] Python you're using. If you're talking about custom Python functions, [it's the] same deal. The way you want to think about it is that everything Nessie tracks is referred to as a key––essentially a name––and attached to each name there's a little blob of metadata. For an Iceberg table specifically, that little blob of metadata is just saying, okay, here is where the actual Iceberg metadata is located. So somewhere on your S3, or your Azure blob storage, or your Hadoop storage, it's just pointing to where that metadata is, and then that metadata is just pointing to a bunch of Parquet files. So basically, how you run those SQL operations will be engine-specific, not necessarily Nessie- or Iceberg-specific. So in that case, yes, if you're using Dremio, you can write your own SQL functions. If you're using Python libraries––let's say you're using PySpark, and you wanna write custom libraries for working with PySpark––you can do so and use that with your Arctic catalog, because the Arctic catalog is merely the interface through which that engine will be able to discover your Iceberg tables.
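As an illustration of the Dremio-side option mentioned here, a minimal sketch of a scalar UDF defined in Dremio SQL and used against the demo table; the function name and logic are hypothetical, and the exact CREATE FUNCTION options supported depend on your Dremio version.

```sql
-- Define a simple scalar UDF in Dremio; the Arctic catalog versions the
-- tables and views it is used with, not the function definition itself.
CREATE FUNCTION clean_name (raw VARCHAR)
RETURNS VARCHAR
RETURN SELECT TRIM(UPPER(raw));

-- Use the UDF against a table in the Arctic catalog.
SELECT clean_name(name) AS name FROM Demos.september8.names;
```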
Next question: does Iceberg solve the issues associated with Hive, especially ACID transactions? Are there any other shortcomings of Iceberg we need to be aware of, or any specific use cases? Mind if I take that one, too?
Jeremiah Morrow:
Go for it.
Alex Merced:
Okay, okay, here's what I'll say about this one. When it comes to the shortcomings of Hive, one is ACID transactions. Apache Iceberg––and pretty much all the table formats solve this problem––what they all do is practice what's called snapshot isolation. With Hive, you didn't have individual snapshots; you just basically had the Hive metastore telling you, hey, these are the folders where the table is, and whatever files are in those folders are your table. So you couldn't really see the past history of the table; you could only see what was in those folders at that time.
Well, modern table formats capture the files that are part of the table at any particular time. The mechanism differs between formats––I like the reusability and modularity of Iceberg's approach, where it creates these manifests, these lists of files, that can then be included in multiple snapshots. But essentially, each snapshot is tracked, and the lists of files included in that snapshot are all tracked in the metadata. So I can go back and see those previous snapshots. So you have that problem solved. Now, when it comes to ACID transactions: because you have this linear lineage of snapshots, every snapshot has an ID, and the writer checks it at the beginning of a transaction. So let's say we have a linear numbering like 1, 2, 3, 4. Currently, the snapshot is 4, and I'm about to write a new transaction.
The transaction at the beginning would predict, well, when I'm done, I should be writing snapshot 5. But if somebody else decides to write to that table at the same time, they're gonna project that they should be snapshot 5 as well. So you now have 2 writes predicting snapshot 5. One of them is gonna finish first, and when the first one finishes, it commits as snapshot 5. It's happy.
But the second transaction, when it gets done, before it actually commits, is gonna check to say, hey, am I still gonna be snapshot 5? And it's gonna [say]: wait, no, there is already a snapshot 5. So that's how it knows, oh, I can't commit this transaction. It will then re-attempt it. It won't necessarily rewrite all the files, but it'll go back, take a look at the current snapshot, make sure that it can do its transaction with the new history, and then project 6, and then commit 6. So you have this optimistic concurrency control: everyone writes as if they're going to commit, but then you double-check at the beginning and the end to make sure that there is always consistency in those writes. So in that case you have ACID transactions solved, and a slew of other problems with Hive are solved.
Now, as far as challenges that there are with Iceberg today, they come in 2 sort of flavors, and for the most part Nessie––[and] Dremio Arctic––solves these problems, or at least one of them. The first problem is multi-table transactions. Right now, in pure Iceberg, we cannot do multi-table transactions. You can only do one transaction at a time, on one table at a time, because the specification is based on a single table.
Now, with the Nessie catalog, on the other hand, because you're capturing commits at the catalog level, you can isolate transactions at the catalog level, which means you can do transactions across multiple tables in an isolated manner and publish them simultaneously. So Nessie does address that specific Iceberg shortcoming. They are trying to come up with an Iceberg-specific solution to it, but it still doesn't do everything that Nessie does for you.
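As a sketch of what that catalog-level isolation enables, here is what a multi-table change might look like using the same Dremio branching commands from the demo; the catalog, branch, and table names are hypothetical.

```sql
-- Stage changes to several tables on one branch, in isolation from main.
CREATE BRANCH month_end IN Demos;
USE BRANCH month_end IN Demos;

INSERT INTO Demos.finance.orders   SELECT * FROM Demos.staging.orders;
INSERT INTO Demos.finance.payments SELECT * FROM Demos.staging.payments;

-- Publish both changes together; readers of main see them at the same time.
MERGE BRANCH month_end INTO main IN Demos;
```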
And then the other thing that's currently being fixed is that Iceberg tracks the individual files in your tables using absolute paths. This does create a challenge when you're trying to, let's say, move the location of table A from folder A to folder B, because it's expecting the fully absolute path to whatever S3 folder or Hadoop folder those files are going to be in. There is a proposal on how to fix that, so that's a temporary thing that should be addressed fairly soon, and for most situations it really shouldn't be an issue, because you shouldn't be moving your physical files around on a regular basis––that probably would not be ideal.
Jeremiah Morrow:
Real quick, [a] couple of call-outs, because you're a little too humble to call it out yourself. Alex is actually writing a book on Iceberg, and there is an entire chapter about the origins of Iceberg and how it developed out of some of the shortcomings of Hive, so check out the early access to that, because that is just raw, unfiltered thoughts from our team.
And the other thing is October tenth, we will be talking about exactly what Alex just mentioned in terms of multi-table transactions with Dremio Arctic. So October tenth. Stay tuned and now back to your regular scheduled program.
Alex Merced:
Well, yes, that. Yes. And if you want to get the early copy of that book, just head over to dremio.com. The link's right there on the main page, where you can get the early access copy. We're getting pretty close to wrapping up the first draft of the manuscript, so I'm pretty excited.

[Next question:] How far back can you go with time travel snapshots? Is it similar to Snowflake's 9 days? There is no set time period. Basically, you get to manage this––there are settings you can change on the table––but at the end of the day, it's gonna come down to when you decide to run expiration-type procedures. That would be an expire snapshots procedure in Spark or a vacuum procedure in Dremio. When you run these procedures, you'll actually say, okay, I want to expire all snapshots prior to this date, and then that will be as far back as you can time travel. Now, with Dremio's table optimization, you'll also be able to set an automatic expiration, where it'll say, okay, always make sure everything in the catalog beyond 90 days, 100 days, 5 days, is expired. So you'll be able to set those settings yourself. You're not married to a number that Dremio decided, or any provider decided, because the data is yours. The data is in your storage. We just facilitate doing those transactions on that data for you. So you get to decide what your policies are, and you have that flexibility.
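For reference, minimal sketches of the manual expiration commands being described, in Dremio SQL and in Spark SQL's Iceberg procedure form; the table name, catalog name, and cutoff timestamp are hypothetical, and Arctic's scheduled table clean-up can run the equivalent for you.

```sql
-- Dremio: expire snapshots older than a cutoff you choose; anything older
-- is no longer reachable for time travel, and its unused files can be removed.
VACUUM TABLE Demos.september8.names
EXPIRE SNAPSHOTS older_than '2023-06-10 00:00:00.000';

-- Spark SQL equivalent, using Iceberg's expire_snapshots procedure
-- against a catalog configured here as "arctic".
CALL arctic.system.expire_snapshots(
  table => 'september8.names',
  older_than => TIMESTAMP '2023-06-10 00:00:00'
);
```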
I think that's all [the] questions. But again, we have a lot of great topics coming up in the coming weeks. Next week we're gonna be doing actually more about table formats––so if you wanna hear more about Iceberg, Delta Lake, [and] Hudi, next week we're doing the who, what, and why of data lakehouse table formats. We'll talk about how each one's architected, how each one works, the pros and cons, and all that stuff. Then after that, I'm gonna be talking about materialized views versus Dremio data reflections, because a lot of times people think of data reflections, which is a Dremio feature, very similarly to materialized views. But it does so much more and offers [so much]. It's really cool. So if you wanna really capture the nuances of what the distinctions are, make sure you're there for that one. We're also gonna be doing the Arctic one we mentioned, on multi-table transactions and zero-copy clones. So we have so much great content: if you can't be here live every week, make sure to subscribe to Gnarly Data Waves on iTunes or Spotify, [and] subscribe to youtube.com/dremio, where we'll also post the recordings of these as well. But either way, I'll see you all next week. Have a great day again. Thank you, Jeremiah, for being here this week, and I'll see you all soon.