May 2, 2024
DataOps at Dremio Lakehouse Management
DataOps is a framework that brings users closer to the data, streamlines data management and operations, accelerates time to insight, and reduces total cost of ownership. We will show how common DevOps tools can be integrated with Dremio Lakehouse Management to accelerate data product development. Starting from a simple feature enhancement request for an existing business report at Dremio, we explain how this approach improves the research, testing, and deployment phases of the data development lifecycle for data engineers.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Hey folks, welcome to our session here. We’re gonna talk about DataOps at Dremio with two very smart individuals. Brock is a master principal solutions architect at Dremio and AJ is a Dremio data engineer. So I’ll hand it over to them. If you’re online, please feel free to comment and I will pass your feedback or questions along to the presenters.
AJ Jensen:
All right, hello everyone, I’m AJ. And today, we’re gonna talk a little bit about what DataOps at Dremio looks like. Just for context, I’m the only data engineer at Dremio and the only full-time member of our data and analytics team. So I don’t know about you guys, but from time to time, I will get a message like this and not really know what happened or how something broke, because while the growth looks really great on the top chart, it’s suddenly off by 60,000%. So obviously, something went wrong. Today, we’re gonna talk about the solution we’ve implemented internally so that we have solid data governance and change management for our own internal data lake, while still keeping the great ease of access that Dremio provides for our internal analysts.
Dremio Semantic Layer
Brock Griffey:
Thank you, AJ. So I’m gonna start off talking a little bit about the Dremio semantic layer and the great value it provides. So when you think about it, we have all these data sources coming in, and we’re loading that data into the data lake. And that’s what we do here at Dremio: we’re loading data directly into the data lake. We have tools like Fivetran loading data directly into Iceberg tables, and we’re pulling data from Salesforce, Jira, Intercom, and a whole bunch of other data sources. Now, once we have that in there, we’re gonna use our own product for everything, so we’re actually using Dremio to do all of the analytics on top of that. And we’re gonna build a nice semantic layer through those bronze, silver, and gold layers. We’re gonna create a common transformation at the very bottom layer. Then, for the business users, we’re gonna expose a business view and allow them to go get access to that, start drilling into the data, and have that self-service capability on top of it. And at the end of the day, they wanna go use that in their own applications. So you have an application layer inside there where we’re gonna say, “Hey, this is what our BI tools are gonna hit against.”
While this is great, you end up with some problems in this kind of architecture. So just to talk through this: I have a view, and this view came from Salesforce data. Obviously, my example here is a little bit simplified so it can fit on the screen. But I have a customer name. That’s a custom column coming in, and we are going to change that column eventually, but right now, we’re gonna keep it as is. So I bring that up to the business layer and now I have another view that’s just selecting from the layer below. So we’re building layer on top of layer, a view on top of a view. And at the end of the day, someone’s gonna take that, put it in their own dashboard or their own application, and expose it to their end users. But what happens if I wanna change that now? I don’t want just customer name; I want first name and last name. Well, that’s great. Now I have a more curated data set. That’s awesome. I have that self-service around the data. But I just broke this view, and everything else downstream broke as well. So how do we prevent that from happening, and how do we make it easier for our users to work with this data without making it more complex?
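To make the scenario concrete, here is a rough sketch of that layering in code. All source, view, and column names are hypothetical stand-ins, not the actual view definitions from Dremio's internal lake:

```python
# Illustrative only: hypothetical names, sketching the view-on-view
# layering described above.

# Bottom layer: a common transformation over raw Salesforce data.
bronze_view = """
CREATE VIEW bronze.salesforce_accounts AS
SELECT account_id, customer_name, region
FROM raw.salesforce_accounts
"""

# Business layer: a view selecting from the view below it.
business_view = """
CREATE VIEW business.accounts AS
SELECT account_id, customer_name, region
FROM bronze.salesforce_accounts
"""

# The breaking change: customer_name is split into first and last name.
# business.accounts, and every dashboard built on top of it, still
# selects customer_name and now fails to resolve that column.
breaking_change = """
CREATE OR REPLACE VIEW bronze.salesforce_accounts AS
SELECT account_id, first_name, last_name, region
FROM raw.salesforce_accounts
"""
```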
Normal Development Process
So to talk about a normal process: how people would traditionally solve this is through a development life cycle. They would have a test environment, a dev environment, and a production environment. They would have to maintain copies of the data, or at least a small amount of sample data, in each environment. And then as they’re writing code, they write against the dev environment and hope and pray that it works against prod as well, because the data in dev is probably never the same as in prod, as much as we try to keep them in sync. In order to migrate that code up to the next level, we would put it into some kind of repository like GitHub and make it available for the next layer to actually run that code in their environment. You could use tools like dbt and other things to help do this, but at the same time, you still have to maintain and manage many different environments, which just makes it more complex. As you get ready to actually move this to a production environment, you now have to schedule downtime, or have a window for when we’re gonna do the upgrades and move the data into production. Maybe I’m doing DML operations, maybe I’m just changing or altering views. Whatever it might be, you have to schedule some downtime to make sure that no one is going to be affected by this.
And at the end of the day, once the data finally gets migrated to the production environment and you’ve validated everything, you hope that everything’s working perfectly. But if it’s not and you need to roll back, that can be very costly and time-consuming. You might need to do a full database restore if something incorrect happened in production. So we thought about this.
The Solution
How do we solve this with our own tools, and how are we doing this at Dremio? Well, the solution I have, part of the solution at least, is our catalog. So when you think about this, Dremio’s catalog brings several things, but the part we’re gonna focus on is the git-as-code, or data-as-code, capability. This git-as-code approach allows us to create branches of the data without creating copies of the data. So now I can have a branch on my catalog that allows me to operate on the production data without affecting production. I have my own isolated area where I can do my own version control. There’s governance built in, so I can’t do things I’m not supposed to do. I can see what I’m allowed to see, and I can work in that environment. So instead of having those complicated five or six steps we had before, we have simplified this. Most users in the past may have worked directly inside DBeaver or other tools to come up with the code they wanted, copied that code to their repository, and then checked it in. We’ve come up with a process here that automatically synchronizes your changes when you create a branch in Dremio. It’s a simple statement in Dremio: just create a branch from your main catalog. Now you have a new branch you can work in. When you do that, we have a process that automatically synchronizes every change you make in there to your repository. So now you have a single place in your repository where you can go look, validate that these changes are happening, and see what each change is, and AJ will show you a demo in a little bit of what that looks like. And this just gives you an easier way to maintain this lifecycle. You no longer have to manually do all these code changes. You can issue the commands directly in Dremio, and it synchronizes your repository.
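As a minimal sketch of what that branching step could look like, the snippet below submits SQL over Dremio's documented REST SQL endpoint (POST /api/v3/sql). The server URL, token, catalog name, and branch name are placeholders, and the exact CREATE BRANCH and AT BRANCH syntax should be checked against your Dremio version:

```python
# A minimal sketch, assuming a Dremio REST endpoint and a personal
# access token; catalog and branch names are hypothetical.
import json
import urllib.request

DREMIO_URL = "https://dremio.example.com"  # hypothetical server
TOKEN = "my-personal-access-token"         # hypothetical credential

def run_sql(sql: str) -> str:
    """Submit one SQL statement to Dremio and return the job id."""
    req = urllib.request.Request(
        f"{DREMIO_URL}/api/v3/sql",
        data=json.dumps({"sql": sql}).encode(),
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]

# One statement gives you an isolated, zero-copy branch of production data.
run_sql("CREATE BRANCH feature_split_name AT BRANCH main IN dremiodata")

# Work against the branch without touching production, e.g. by pinning
# queries to it with an AT BRANCH clause.
run_sql("SELECT * FROM dremiodata.business.item "
        "AT BRANCH feature_split_name LIMIT 10")
```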
At the end, you can have automated testing that runs. Now, this solves the other problem we were talking about earlier: how do we fix things downstream? How do we know something’s gonna affect something downstream? Automated tests that run, regression tests that run. When you are building out an environment like this, you’re gonna build out tests for each view to validate: I’ve made a change here, is it gonna break something downstream? And so we can use our repository to automate running tests. That’s what AJ here has done: automated this and made it very simple for end users. I want new data, so I’m gonna create a branch, work in that branch, and check it in. Then the tests automatically run and say, yes, your changes are not gonna break someone else’s stuff. And when we’re ready, we just merge it. Very simple commands, and now it’s three steps instead of the five or six steps. So I’m gonna hand it back over to AJ and he’s gonna talk about how he did this.
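A rough sketch of that gating logic, assuming the tests live as .sql files under a data_tests/ folder and use the @{branch} placeholder AJ shows later in the demo; the execute_and_check helper is stubbed here, and a concrete single-test runner appears in the demo section below:

```python
# A rough sketch: run every test in data_tests/ against a working branch
# and only allow the merge when all of them pass.
from pathlib import Path

def execute_and_check(sql: str) -> bool:
    """Run the query on Dremio and compare it to its expected value.
    Stubbed here; a fleshed-out version is sketched in the demo below."""
    raise NotImplementedError

def run_suite(branch: str) -> bool:
    all_passed = True
    for test_file in sorted(Path("data_tests").glob("*.sql")):
        # Substitute the working branch into the templatized query.
        sql = test_file.read_text().replace("@{branch}", branch)
        passed = execute_and_check(sql)
        print(f"{test_file.name}: {'PASS' if passed else 'FAIL'}")
        all_passed = all_passed and passed
    return all_passed

# Merge the branch only when run_suite("<your-branch>") returns True.
```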
Demo Scenario
AJ Jensen:
All right, so for our demo scenario here, we have, at the top, our analyst, who is working on a reporting layer view to see all sales across different regions or categories, rolled up to the date level. And then we have our data engineer, who is tasked with updating the business layer column names or the logic for one of those columns. And that’s where I come in. I’m primarily working with the Dremio server as well as the AWS CodeCommit service, which nobody really knows a lot about; just replace AWS CodeCommit with GitHub in your head and everything else will make a lot more sense. CodeCommit’s free and I just kind of took a liking to it while I was developing this, so I built the whole thing using it. But we’ll talk a little bit more about the implications for something like GitHub or GitHub Actions later on.
And then lastly, we have what I’ve termed the Lakehouse Manager, which is an automated system that looks at the Dremio catalog and detects changes on a branch you tell it to look at. When it sees those changes, it goes and makes a corresponding update in the repo in CodeCommit. You can then tell it to run tests, get reports back on those tests, and if everything looks good, you can merge in the repo and you’ll actually see a merge happen in the Dremio catalog. So I have a quick little demo for you here. This is a CodeCommit pull request. And if you look here, we can see two files. The first one has just been added to this test branch, and the second one has been changed. So we’ve done a little split on the I_Product_Name column. And you can see from the names here, this business/item.sql, that we are mirroring the structure of the folder layout in the Dremio catalog. But it’s just a .sql file instead of the actual view. It’ll become a little bit more clear here in a second when we click over to Dremio.
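Behind that mirroring, the sync step could look something like the sketch below, using boto3's CodeCommit client. The repo name and the idea that the Lakehouse Manager hands this function the changed view's path and SQL are assumptions about its internals:

```python
# A simplified sketch: mirror one changed Dremio view into CodeCommit at
# the matching path, e.g. the business.item view becomes business/item.sql.
import boto3

codecommit = boto3.client("codecommit")
REPO = "dremio-lakehouse"  # hypothetical repository name

def sync_view(branch: str, view_path: str, view_sql: str) -> None:
    # CodeCommit needs the branch head as the parent commit for the write.
    head = codecommit.get_branch(
        repositoryName=REPO, branchName=branch,
    )["branch"]["commitId"]
    codecommit.put_file(
        repositoryName=REPO,
        branchName=branch,
        filePath=f"{view_path}.sql",  # mirrors the Dremio folder layout
        fileContent=view_sql.encode(),
        parentCommitId=head,
        commitMessage=f"Sync {view_path} from Dremio branch '{branch}'",
    )

# e.g. after detecting a change to dremiodata.business.item on branch "test":
# sync_view("test", "business/item", "SELECT i_item_sk, ... FROM ...")
```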
So right here, I’ve commented on this file and said !test. You don’t have to comment on a specific file to test that file; any file you comment on, it’ll run the entire test suite. So I’m gonna click over to the Activity tab in CodeCommit and hit refresh. And here we can see that the Lakehouse Manager, who is a bot, has run the test suite and determined that all tests have passed, and we can merge if we’d like. But before we do that, let’s go take a look at what happened in Dremio. So here we can see the actual view definition. You can see dremiodata.business, and the view name is item. That’s the exact same SQL text as what’s in the repo. And we’re on our test branch; you can see the little icon right there for a branch. And let’s take a look at our Jobs tab in Dremio real quick. So this was the test that we ran, and you can see it did a quick check on the total sales for this one product name. So where did that test come from? Well, remember I said that the repo in CodeCommit mirrors what’s happening in Dremio. There’s only one exception to that, and that is the data tests folder, which holds the tests that we’ve defined.
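One plausible way to wire up that !test trigger is to poll the pull request's comments with boto3 and reply with the outcome. The comment-scanning loop is an assumption about how the bot works, and run_suite is the runner sketched earlier:

```python
# A sketch of the "!test" trigger: scan pull request comments, run the
# suite when one says !test, and reply with the result. In practice you
# would also track which comments were already handled.
import boto3

codecommit = boto3.client("codecommit")

def run_suite(branch: str) -> bool:
    """Stub: the suite runner sketched earlier in this session."""
    raise NotImplementedError

def handle_test_comments(pull_request_id: str, branch: str) -> None:
    threads = codecommit.get_comments_for_pull_request(
        pullRequestId=pull_request_id,
    )["commentsForPullRequestData"]
    for thread in threads:
        for comment in thread.get("comments", []):
            if comment["content"].strip() == "!test":
                passed = run_suite(branch)
                codecommit.post_comment_reply(
                    inReplyTo=comment["commentId"],
                    content="All tests passed. Safe to merge."
                            if passed else "Tests failed.",
                )
```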
So in this case, really it’s just a unit test for your analytics. You can see on line six here, I have this query templatized. Where it says @ and then branch inside curly braces, that is, @{branch}, I’m gonna actually read that text into Python and replace that little signifier with the branch that I wanna run the test on. And when I get the results back, I’m gonna make sure that it matches the expected value. If it doesn’t, I’ll report back that the test failed. In this case, it passed though. So what we can do, and this is really just me showing off at this point, is go take a quick look and see that we’ve done an alter on our view. Here, I’ll back that up real quick. Whoops. So you can see the same alter operations made it into CodeCommit as well.
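A minimal sketch of one such templatized test, assuming Dremio's documented /api/v3/sql and /api/v3/job endpoints; the server URL, token, view name, product name, placement of the AT BRANCH clause, and expected value are all illustrative placeholders rather than details from the session:

```python
# Read a templatized test query, substitute the branch into the
# @{branch} placeholder, run it on Dremio, and compare the single
# returned value against the expected result.
import json
import time
import urllib.request

DREMIO_URL = "https://dremio.example.com"  # hypothetical server
TOKEN = "my-personal-access-token"         # hypothetical credential

def _call(method, path, body=None):
    """Small helper around Dremio's REST API."""
    req = urllib.request.Request(
        f"{DREMIO_URL}{path}",
        method=method,
        data=json.dumps(body).encode() if body else None,
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_test(sql_template: str, branch: str, expected) -> bool:
    sql = sql_template.replace("@{branch}", branch)
    job_id = _call("POST", "/api/v3/sql", {"sql": sql})["id"]
    # Poll the job until it reaches a terminal state.
    state = _call("GET", f"/api/v3/job/{job_id}")["jobState"]
    while state not in ("COMPLETED", "FAILED", "CANCELED"):
        time.sleep(1)
        state = _call("GET", f"/api/v3/job/{job_id}")["jobState"]
    if state != "COMPLETED":
        return False
    rows = _call("GET", f"/api/v3/job/{job_id}/results").get("rows", [])
    # Pass only if the single returned value matches the expected one.
    return bool(rows) and list(rows[0].values())[0] == expected

# e.g. total sales for one product name, pinned to the branch under test:
template = ("SELECT SUM(total_sales) AS total "
            "FROM dremiodata.application.sales AT BRANCH @{branch} "
            "WHERE product_name = 'ought'")
print(run_test(template, "test", expected=1386.51))
```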
So our history of catalog commits has been translated right onto the CodeCommit repo’s branch. And so, I mean, everything looks like it’s tracking well to me. So I’m gonna go back over here to the pull request and I am going to hit merge. And if any of you guys want my email address, it will be here in just a second. Feel free to come complain if you didn’t like my presentation or something. So the merge is gonna run here in CodeCommit, and in just a couple seconds, we will be able to go check in Dremio and see what happens.
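Under the hood, that merge step plausibly looks like the sketch below: fast-forward the pull request in CodeCommit, then issue the corresponding MERGE BRANCH statement so the Dremio catalog follows. The run_sql helper repeats the earlier branching sketch, and the ids, names, and exact MERGE BRANCH syntax are placeholders to verify against your Dremio version:

```python
# A sketch of the merge: fast-forward the reviewed pull request in
# CodeCommit, then mirror the merge in the Dremio catalog.
import json
import urllib.request
import boto3

DREMIO_URL = "https://dremio.example.com"  # hypothetical server
TOKEN = "my-personal-access-token"         # hypothetical credential

def run_sql(sql: str) -> str:
    """Submit one SQL statement to Dremio (as in the branching sketch)."""
    req = urllib.request.Request(
        f"{DREMIO_URL}/api/v3/sql",
        data=json.dumps({"sql": sql}).encode(),
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]

codecommit = boto3.client("codecommit")
codecommit.merge_pull_request_by_fast_forward(
    pullRequestId="42",                 # placeholder pull request id
    repositoryName="dremio-lakehouse",  # hypothetical repository name
)

# Keep the catalog in lockstep with the repo.
run_sql("MERGE BRANCH test INTO main IN dremiodata")
```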
So we’re back on our Jobs tab now. It’s gonna take probably another 10 seconds, I think. But we’re gonna see here that a merge statement has been run that has taken the test branch and merged it into the main branch in our catalog. Whoops, actually I don’t know how to get out of the video, sorry. There we go. So what’s coming next: I do actually wanna move this over to GitHub Actions, as much as I might like CodeCommit. It is definitely not what most people use, for sure. And if you are interested in the code behind the Lakehouse Manager, you can follow that URL. I uploaded a bunch of stuff that I’ve been working on for this project earlier today. And it really is, for the most part, just a bunch of vanilla Python, urllib and boto3, nothing too fancy. That’s actually all we have, I think.