May 2, 2024
Eliminating Data Downtime While Accelerating Data Science with Data Version Control
It’s 2024. Data teams still spend days trying to identify and fix broken data pipelines and revert data to a healthy state, and data scientists still spend days waiting for infrastructure to build models and run “what-if” experiments.
In this session, learn how you can alleviate both problems – recover from bad data pipelines in less than a second and give data scientists immediate access to sandbox environments for data simulations – using Git-style version control and branching.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Ben Hudson:
I wanted to welcome you to the session “Eliminating Data Downtime While Accelerating Data Science with Data Branching.” Alternatively, this session could also be named “How Git for Data Could Make Data Engineering Less Painful.” So, in this session, we’ll walk through a few things. We’ll start by setting the stage, talking about data engineering responsibilities and why some of these responsibilities might be a source of pain today. We’ll talk about a possible solution to that pain, looking at other disciplines and how they solve their own problems in their own worlds, and then demonstrate our solution in action. By the way, if anyone has any questions about the talk at any time, please feel free to paste them in the chat and we will get to them at the end of the session; we’ll have a quick Q&A, so I’ll leave time for that.
Data Engineering Responsibilities
So let’s talk about data engineering responsibilities. I did a quick Google search and here’s what the top few results say. In summary, data engineering is all about making data available and accessible for downstream analysis, for downstream users. Data engineers accomplish these things by doing a few tasks. First, one of the core tasks that data engineers do is loading new data into their downstream systems. They need to get data from a bunch of sources. They need to transform it and validate it so it’s useful for anyone downstream looking at that data. They need to clean it up, they need to normalize it, and they need to make sure it looks nice so there are no null values, no duplicates, no messy data in there. Today, this could require multiple physical environments: for example, a development environment where they make the changes, a test or quality assurance environment where they validate them, and then once that’s all good, they’ll deploy it into production. Bottom line is that it could require multiple physical environments, it could get expensive depending on the volume of data that you have, and there’s a whole complex, risky change management system and process to go along with that as well.
In addition to loading new data, you might have to make changes or updates to existing data as well. So for example, you might have to update someone’s email address or phone number in your database, you might have to delete someone’s records for compliance purposes, or you might even have to do something as simple as fix a typo. You might also want to evolve or update the schema of the table itself, for example if you wanted to add a new column to your table; this is an example of schema evolution. Again, it’s the same process as above, I just copied and pasted the slide because it’s the same thing: it could require multiple physical environments, and the same change management process goes along with everything today.
However, for both of the above, things could go wrong. When things do go south, companies rely on data engineers to save the day and recover from mistakes. Unfortunately, this could take hours or even days, and every second of downtime means another second where data analysts and data scientists can’t make decisions and run analyses on data, because data engineers first have to spend hours or days pinpointing the error, where things went wrong, and then rolling back from those mistakes as well. So after they’ve identified whether it’s just one table, one row, or the entire system that’s messed up, they’ll have to assess and triage the situation, and then roll back those mistakes and revert to a better state, sometimes with something as drastic as a database restore.
On top of all these responsibilities, data engineers have to make data available to end users who want their own copies or versions of data for their own use cases. So for example, you might have a data scientist or a team of data scientists who want to make changes or experiment with data, so they can run what-if experiments or simulations and predictive analyses. Today, what a lot of the companies we talk to do is spin up a new environment every time someone wants to experiment with data. It could be slow, it could be expensive, but at the end of the day, it’s the end users who are frustrated because they have to wait for their systems to be provisioned. They want their data fast. All in all, data engineering could be tricky.
Software Engineering Responsibilities
In parallel, our friends in software engineering could have it a little bit safer because the systems they work with, or usually work with, support version control and branching. Had to add an XKCD comic just because. But let’s look at the tasks that software engineers accomplish. First, whether they’re writing or testing new code, whether it’s for a new feature or a bug fix, instead of requiring multiple physical environments, software engineers simply create a branch for their feature or their fix where they can make changes and validate and test them through their CI/CD pipelines. And then once they’re done, they can simply open a pull request, have someone review it, or some automated system review it, and then merge their changes into production. And the benefit of this is that, number one, they get their own safe sandbox environment to make their own changes. But number two, merging a branch means that their changes can be released into production really quickly instead of taking a long time. When something does go wrong, as well, rewinding the repository to a prior state is fairly simple. Quote, unquote, fairly. You can revert or reset commits, right? Basically rewind your repository to somewhere back in time.
Data Engineering Responsibilities (Revised)
So this looks a lot cleaner and a lot less painful than the world of data engineering. Why can’t we do this in the world of data engineering? And the answer is, yeah, this is possible, right? What if we bring those disciplines from software engineering, the principles of version control and branching, to the world of data engineering? What could the world look like in a really blue-sky situation? So let’s look at the first two scenarios of loading new data or making changes or updates to existing data. Instead of requiring multiple physical environments, you can use branches instead. So you can make changes in a dev branch, validate them in a test or QA branch, and then deploy into your main branch or production branch. It’s a lot less painful because it’s all done in one physical environment. Everything is managed using branches and snapshots instead. And again, once data has been merged into the main branch, once changes have been merged, end users get that data immediately. So they get the freshest copy of data for free.
When something does go wrong, again, just like Git, because everything is commit-based, you’re able to rewind your environment to a prior state. So instead of spending days of downtime and risking a lot of money from not being able to analyze data and make decisions, you can recover from mistakes in less than a second. So you eliminate data downtime. And then last but not least, making data available to end users for their own use cases is as simple as creating an experimentation branch, a quick sandbox, which you can give to your data scientists or your data analysts for downstream usage. And then once they’re done with it, you can easily clean that up and delete it with a click of a button. So it’s instant experimentation for anyone who wants it, without any data copies. It’s all for free, without making physical copies of data.
How Do We Make This Possible? Iceberg Edition
How do we make this possible in the world of lakehouses? Let’s look at a quick architecture diagram and see where these components or capabilities could live. So this is a classic stack diagram of an Iceberg-based lakehouse where, in our lakehouse, you’ve got your file formats, which store the raw data, you’ve got your table format, which is our layer of metadata that makes ACID transactions possible on top of the data lake, put simply, and of course catalogs to support that. So within the world of Iceberg, you have the choice of using many different catalogs. For example, the REST catalog is a very popular option. You could use Hive Metastore or AWS Glue. And Dremio comes with its own catalog as well, and we’ve got an open source version called Nessie. And once your data is in Iceberg tables, any engine can work with those tables and can read, write, and interact with that data in a really multi-engine, interoperable manner.
So how do we make this possible? Given our stack diagram, the answer is in catalogs. As I mentioned earlier, when you want to use a table format like Iceberg, you need a catalog, right, because the catalog maintains the state of tables, so query engines can interact with that data in a safe, multi-engine, interoperable manner. The answer for us is that there are catalogs out there today in the world of Iceberg that make it possible for users to manage data using Git-like semantics. And we’ll give you a quick demo of what that looks like.
Git for Data Version Control
So with these capabilities, with these catalogs, what can we do now that we have version control? Let’s walk through a few scenarios and what’s possible. Let’s talk about the art of the possible. So number one, with Git for Data, as we’ll call this, you can easily make changes, ensure data quality, and share your new data immediately using branches. So for example, if you wanted to make changes or upload new data into your database or your data warehouse, or your Iceberg tables, rather, in this situation, instead of going into different dev, test, and QA environments, simply create a branch. And this is all going to be done through SQL, which we’ll demonstrate in a second. Create a branch, check out that branch, run a COPY INTO to load data, and then do all your validation. Once all the validation is done, then you can merge it into prod, into your production branch.
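For reference, here is a rough sketch of that flow in Dremio-style SQL. The catalog, branch, table, and source names are placeholders, and the exact syntax can vary by version:

-- Create an isolated ETL branch off of main
CREATE BRANCH etl_load AT BRANCH main IN catalog;

-- Point the session at the new branch
USE BRANCH etl_load IN catalog;

-- Load the new files; the rows are visible only on this branch
COPY INTO catalog.rides FROM '@my_s3_source/new_rides/' FILE_FORMAT 'csv';

-- Validate, for example with a simple row-count sanity check
SELECT COUNT(*) FROM catalog.rides;

-- Once validation passes, publish the changes to production
MERGE BRANCH etl_load INTO main IN catalog;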
In addition, if something goes wrong, recovering from a mistake is as easy as one simple command, just an ALTER BRANCH, and what this command will do is essentially rewind the clock to a prior commit. So in the world of Arctic, Dremio’s catalog, or Nessie, which is the open source version, every single atomic change that happens to the catalog, whether it’s an insert, update, or delete to the data, is tracked. What the catalog does is basically track the files that make up the state of the lakehouse at that point in time, for every single atomic change. So we can represent these as a tree of commits, as shown on the right-hand side, and it’s really easy to rewind the clock to a prior state in time.
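A sketch of that rollback, assuming Dremio’s branch-assignment syntax; the commit hash below is purely illustrative:

-- Point main back at a known-good commit (the hash is a placeholder)
ALTER BRANCH main ASSIGN COMMIT 'ab12cd34ef56' IN catalog;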
If someone needs their own copy or version of data for downstream experimentation, say a data scientist, then again, it’s as simple as creating a branch and giving that branch to them. Once they’re done with it, once your data scientist is happy with their experimentation and doesn’t need it anymore, they can drop the branch.
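A minimal sketch of that lifecycle, again with placeholder names:

-- Give the data science team an isolated, zero-copy view of production data
CREATE BRANCH ds_experiment AT BRANCH main IN catalog;

-- The data scientists run their what-if changes and queries on ds_experiment

-- When the experiment is over, clean up; there are no physical copies to delete
DROP BRANCH ds_experiment IN catalog;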
Another benefit of version control is that you can really easily reproduce models and analyses. So for example, if you wanted to reproduce a model from last quarter: how did we get to these conclusions? How did the numbers look at this point in time? Because everything is versioned, you can add tags. So for example, at the end of your fiscal year, you can add a fiscal-year tag to your data, and that becomes your source of truth; you can always refer to that state of the lakehouse in a consistent manner. Version control makes it very easy to reproduce any analysis that you want from data in your lakehouse.
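A sketch of tagging and time-consistent reads, with made-up table and column names and approximate syntax:

-- Tag the exact state of the catalog at fiscal year end
CREATE TAG fy2024_close AT BRANCH main IN catalog;

-- Months later, reproduce the analysis against exactly that state
SELECT region, SUM(revenue) AS total_revenue
FROM catalog.sales AT TAG fy2024_close
GROUP BY region;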
And then last but not least, again, it might be a little obvious, but version control makes it possible for you to audit changes, understand who did what and when on your catalog. Every change is logged.
Live Demo
All right. We are breezing through this session right now. I can’t see the questions that are popping up, if there are any, so please feel free to paste anything that you might have. What we’ve done so far is we’ve set the stage by talking about problems with data engineering, reasons why data engineering might be painful today, and we’ve talked about how it might be possible to bring concepts from version control, Git, and software engineering into the world of data to make those things less painful in the world of data engineering. I’ve shown a few slides, but no session at Subsurface is complete without a live demo. So I’m going to switch to a live demo, and we’re going to put those slides into action. Hopefully my screen is still working, and I’m going to switch into a demo environment.
Hopefully everyone can see the screen. But what we’re going to do now is run through a demo. So what I’ve got here is the Dremio UI, which has a data catalog built in, our data explorer. We’ve got a SQL runner, and within the UI, you can administer all parts of your lakehouse, so you can see all the jobs that have happened, you can change settings on your lakehouse, et cetera. But for the purposes of this presentation, I’m going to stay within the SQL runner, which allows you to write SQL and run queries against your data. You can also save queries if you want, or save entire scripts as well. But what we’re going to do is a quick demonstration of how you can use branching to make changes and get those changes into main really quickly, easily, and painlessly.
So I’ve got some commands here. What we’re going to do is actually explore this data set. So I’ve got a few commands. Number one, what’s in this data set? We’re going to show you this quick data set here. I’ve run the commands already for this query, just to make things a little simpler so we can get a head start. But the data set I have here contains information about a bike sharing service in New York City. This is a data set that combines different types of data. It’s got some date/time data, some timestamps. It’s got some geospatial data here on the right-hand side, if you wanted to plot that on a graph, some latitudes and longitudes, et cetera, et cetera. This is a medium-sized data set; as you can see on the bottom right-hand side, we have 30 million rows of data. And just for fun, for the purposes of this demonstration, I’ve added a quick analysis query here. Bottom line: show me the most popular types of bike rides. So you can have a regular bicycle or an electric bicycle in New York City, and we can see that electric bikes aren’t as popular as regular bikes just yet. That’s the summary that we have so far. And this is going to be our running example for this demonstration.
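A sketch of what that running analysis query might look like; the table and column names, like nyc_bike_trips and rideable_type, are assumptions based on the bike-share schema described above:

-- Most popular ride types across the full 30-million-row data set
SELECT rideable_type, COUNT(*) AS num_rides
FROM catalog.nyc_bike_trips
GROUP BY rideable_type
ORDER BY num_rides DESC;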
So we’ve got our data, 30 million rows of it. As a data engineer, remember, one of our jobs is to load new data. So suppose we want to load some new data into the main data set. As a data engineer, I want to make sure my ETL job works and the data looks all right before exposing my updates to end users. In my Dremio catalog, everything I’m doing here is going to be on Iceberg tables in Parquet files, all living in Amazon S3. Let’s first create a branch, and let me show you how easy it is to make changes. So I run a SQL command that creates a test branch named gnarly test. Great. Now we can see that the branch gnarly test has been created off of my main branch. I’ve got some queries here that are going to create a new table and load a bunch of new rows into my data set. And after that, I’m going to do a quick sanity check just to make sure my data is there. So I’m going to click run. And we can see that my new table has been created, some new data is being inserted, and we’ll do a quick sanity check. The queries have run fairly quickly. So we’ve created a table and inserted eight rows of data into it. And you can see what this data looks like: if I do a select from sleigh rides, so it’s our Santa’s-workshop-themed data, you can see that I’ve got a new data set with a new rideable type, a new type of vehicle, for my bike service. Fantastic.
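For reference, a sketch of the branch creation just shown; the branch name is written here as gnarly_test, and the exact syntax may differ slightly by Dremio version:

-- Create the test branch off of main
CREATE BRANCH gnarly_test AT BRANCH main IN catalog;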
So let’s then insert this data into the main data set. I’m going to run this again, and what this is going to do is merge my data from the sleigh rides data set into my data set. Oops, sorry, excuse me. I’m going to go back in time, actually. I’m going to go back here just because. One second, I need to do this one command very quickly. There we go. What I forgot to do is actually switch to my gnarly test branch; that’s what I should have done. So now I’m on my gnarly test branch, and I’m going to now, for real, create the table, run all my inserts, and then check my data, okay? So I’m going to run everything again, because this is what happens in a live demo. So let’s wait a few seconds for all these commands to run. Create table: all good. Insert data: good, my data is there. So I’ve done my sanity check. And now, for real this time, I’ve inserted eight rows, right? And now, in the context of my test branch, so this is my ETL environment or my dev environment, let’s test my running query on this updated data.
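A sketch of the corrected sequence, run in the context of the test branch; the sleigh_rides schema and values here are made up for illustration:

-- Switch the session to the test branch before making any changes
USE BRANCH gnarly_test IN catalog;

-- Create the Santa's-workshop-themed table and load the demo rows
CREATE TABLE catalog.sleigh_rides (
  ride_id       VARCHAR,
  rideable_type VARCHAR,
  start_station VARCHAR,
  end_station   VARCHAR
);

INSERT INTO catalog.sleigh_rides VALUES
  ('sr-001', 'sleigh', 'North Pole Workshop', 'Central Park');

-- Sanity check: the new rows exist only on gnarly_test
SELECT * FROM catalog.sleigh_rides;

-- Add the sleigh rides to the main trips table, still only on this branch
INSERT INTO catalog.nyc_bike_trips (ride_id, rideable_type)
SELECT ride_id, rideable_type FROM catalog.sleigh_rides;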
So now, if I run this query on my test branch where I’ve inserted my data, you can see that I have a new result here: the new sleigh ride type is part of my analysis results. The thing that’s really cool about this is that I am currently doing this in the context of my dev branch, in my safe sandbox environment. Anyone who’s querying the main branch, say a downstream analyst running a dashboard on the main copy of data, has no idea what I’m doing, no idea that I, as a data engineer, am making changes to the data. So let’s simulate this right now. I’m going to run this quick query that does the same analysis, but against the data at the main branch. So you can easily add this AT BRANCH syntax, which says give me the data from the main branch, and you can specify any branch that you want here. And as you can see, anyone who queries the main branch only sees the regular version of data that we had before. They don’t know what’s going on in my gnarly test branch. But let’s say we’re happy with the changes now. I want to merge my changes back into the main branch, so our data analysts can see all this great new data that we have.
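A sketch of the two queries being contrasted here, reusing the hypothetical names from above:

-- On the test branch: the new sleigh rides show up in the results
SELECT rideable_type, COUNT(*) AS num_rides
FROM catalog.nyc_bike_trips AT BRANCH gnarly_test
GROUP BY rideable_type;

-- On main: downstream analysts still see only the original data
SELECT rideable_type, COUNT(*) AS num_rides
FROM catalog.nyc_bike_trips AT BRANCH main
GROUP BY rideable_type;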
All it takes, instead of a lot of change management and coordination, is a quick MERGE BRANCH. Of course, we’ll assume that the data looks good for the purposes of this demonstration; usually that validation has to happen first. This is all that you need to do. So I’m going to merge branch gnarly test into main, merging my changes into production. And this took less than a second; it was really quick. So now, as a data analyst, I get to benefit from this new data immediately. If I run that same query against the main branch, as a data analyst, you can see that I immediately have access to this fresh new copy of data. Great. So, what we’ve done as a data engineer is: I’ve made changes, I’ve looked at the changes, seen that they’re all there, and shared these new changes into the main branch for downstream data analysts. And as a data analyst, number one, I have no idea what the data engineer is doing, because they’re working in their own isolated environment. But number two, once they’re happy with the changes, me as a data analyst, I get to benefit from this new data immediately. But suppose something went wrong and we need to roll back our changes, for example.
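A sketch of that merge, with the placeholder names used above:

-- Publish the branch's changes into production; analysts on main see them immediately
MERGE BRANCH gnarly_test INTO main IN catalog;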
Rewinding changes takes less than a second. With our catalog, with these Git for Data capabilities, it’s as easy as rewinding the state of your data lakehouse to a prior commit. So I’ve got this command here, which says: move the commit pointer of the main branch back in time. And you can do this using a tag or a commit as your reference point, basically, or even a timestamp. But for this demonstration, I have picked out, or cherry-picked, a specific commit point that I want in the past. So: rewind the clock to this specific snapshot in time, this specific commit, for the lakehouse. So I’m going to run this. And within a second, I’ve rewound the clock. And if I run my analysis query again, my running example for the day, we can see, once this query finishes, that it’s almost as if nothing ever happened, right? Me as a data analyst, I see my old version of data immediately. No downtime, didn’t have to do anything. Very little hassle. All I’ve been doing so far today is just running a bunch of SQL commands, right? So in short, we made it really, really easy to make, validate, share, and undo changes using a single copy of data. This is all in Iceberg and Parquet. Really quickly and painlessly, right?
So hopefully that gives you a quick idea of how things could work in this new world of Git for data. But to recap what we’ve done, and I will put this into slideshow mode, because that usually helps with seeing the slides: we’ve talked about the pains of data engineering responsibilities, we’ve talked about possible solutions to those pains, and we’ve demonstrated a possible solution to those pains in action as well. If you’d like to give this a try yourself, you can go to Dremio.com. You can sign up for the free edition of Dremio Cloud, which is our fully managed service for Dremio, if you’d like. You could also download the open source catalog if you’re into Iceberg, or if you’re experienced with Iceberg already. If you have questions, please feel free to ask anything in the chat, or reach out to me via email or on LinkedIn as well.