May 2, 2024

Diving Deeper: To Data Governance Excellence with AWS and Dremio

Companies need control over their data, who has access to it, and what can be done with it. Join Leon Stigter, Sr Product Manager at AWS, on a journey into Data Governance on AWS. In this session you’ll learn why Data Governance matters, how AWS sees Data Governance, and how Dremio integrates with AWS Glue Data Catalog and AWS Lake Formation to help you get the right data to the right person at the right time.

Topics Covered

Governance and Management

Speaker

Leon Stigter

Sr. Technical Product Manager, AWS Lake Formation, AWS

Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Hello everyone. Thank you for joining us. My name is Shriram Kamath. I’m a product manager at Dremio. Some ground rules for this presentation. You’ll be able to ask questions, but they’ll be at the end of the session. So for today’s presentation, we have Leon Stigter, and he’s going to talk about data governance and excellence with Dremio and AWS. Over to you, Leon.

Leon Stigter:

Awesome. Thank you very much. Appreciate the intro. Good afternoon, folks. First off, I hope everyone is having an amazing conference so far. I certainly enjoyed the keynotes this morning. I learned a great deal about some of the awesome innovations that Dremio is doing, not just in Iceberg and all of the other things that Dremio has, but also from a lot of the customer stories that we’ve heard. Now, as we walk through these slides, you’ll definitely recognize some of the themes that the gentleman from TD Bank touched on as well. I typically love to have these sessions a bit more interactive, which, I absolutely understand, is going to be a bit harder considering there are people remote and people here in the audience.

But as a quick show of hands, who here is a data lake administrator for their company? Good. I see no hands. Oh, one. Sorry. Kind of. Okay. Awesome. Who here would describe themselves as a data engineer, data scientist, that kind of role? Okay. I see a few more hands going up. Great. Now, who here is expecting a sales pitch? Okay. I see no hands going up. That’s good. That means I don’t have to change anything about my slides. So I do want to say that while I’m not trying to sell you on any technology, I am going to try to sell you on why data governance is so important. And as we do that, as the title of the presentation says, I am going to touch on a few AWS services as well as Dremio. So please bear with me as I talk about technology. But again, I’m not trying to sell it to you.

Why Does Data Governance Matter?

So as we go over the next, let’s say, 25 minutes, I’m going to be talking about three main topics. Why does data governance matter? How does data governance work on AWS? And obviously, as we’re at Dremio Subsurface Live, how does Dremio actually fit in there? So let’s get started with that first one. Why does data governance actually matter? And if you’re a parent like me, then “why” is one of those quintessential questions that your toddler is going to ask you at least 5 million times per day. I see a few heads nodding. That’s great. It means I’m not alone. That makes me feel good. Now, obviously, he’s not asking me, “Why does data governance matter? Why is it harder to do in larger companies? And why does caring about data actually matter in the first place?” But those are very important questions to ask ourselves as we’re dealing with a lot of data.

To get to the bottom of that question, I want to pose a question to you all. And I’m going to guess it’s going to be rhetorical. But is your company interested in growth? I’m going to guess the answer for most of you is going to be yes. Our companies are interested in growth. Is there anyone whose company is not interested in growth? Okay. Good. And being data driven actually helps stack the odds in your favor. As you can see on the slide, according to a Forrester survey, if you are data driven, if you believe data is your strategic asset, then you’re 8.5 times more likely to have at least 20% growth. Now, is your company data driven? Can I see a few hands? Okay. I see a few hands, but not all. That’s good. And why am I saying that’s good? It’s because of this number. So according to Harvard Business Review, 74% of companies that try to be data driven aren’t really treating data as a strategic asset. So that means that only 26% of companies are.

Now, usually, if you’re not really thinking about data as a strategic asset, you’re also not spending a lot of time thinking about the consequences of accessing data. The consequences of who has access to the data. What’s the quality of the data? Where does the data come from? Those are all very important questions as you think about data as a strategy. So there’s this wide gap, and I could probably illustrate it by moving along this stage, between having data, having quality data, and having the ability to make strategic, data-driven decisions about what is best for your company.

And it turns out that that is especially true if you’re dealing with multiple departments, so in a larger company. Just think about that for a second. We’re here very close to the financial center of New York. And we have a lot of banks, or financial institutions, I should say, surrounding us. And those all sell different financial products. They can sell mortgages. They can sell insurance. They obviously sell banking services. Just think about the fact that for all of them, there exists a customer. But for all of them, the customer is a bit different. It depends, if you are an insurance customer, what the attributes are that describe you. It depends, if you are a mortgage customer, what kind of attributes describe you. So thinking about where the data comes from, who has access to it, what the data actually is, is incredibly important as you think about data.

Now, imagine this. And I saw one data lake admin in the audience, so that person has to imagine at least a little less. But I want to walk you through a scenario, right? So imagine this. You are the data lake administrator of your company. It’s a nice sunny morning, kind of like today was. You are enjoying your first cup of coffee, cup of tea, any beverage that you might like in the morning. You open up a newspaper or open up your favorite news site on a tablet or a phone. I don’t judge. And as you scroll through, as you swipe the pages, this is the headline that you see. And before I read it, I just want to make sure everyone knows this is obviously a fake, generated article. But: “Any Company suffers a massive data breach. Millions of customer records are exposed.” Now, that’s not great. Obviously, making the news, making a headline with this kind of news article, never is a good idea. And maybe it happens and it doesn’t make the news. Maybe “just,” in air quotes, someone within your company has the wrong access to the wrong data at the wrong time. And it’s, you know, really contained within your company. None of that is great. Let me also say that I hope that no company on earth ever has to have such a headline. But we all know that it does happen.

So, making sure that you have the right access to the right data in the right context with the right permissions and hopefully with the right quality of data is actually really important to us. What we hear from our customers on why data governance is so hard are just some of the things that are on the screen right now. Now, I’m not going to read all of them. If you want, you can obviously take a screenshot. But I do think there are a few root causes, if you will, to really look at. First off, data governance is typically not really aligned or at least not adequately aligned to the business needs of a company. And I’m definitely going to touch on that a few more times over the next slide because the relationship between IT and business, especially for data, is just so incredibly important.

Second, data governance, at least for most companies that we’ve talked to, doesn’t really cover all the different capabilities and actions that you would have to think about for data governance. Typically, it’s just having a data catalog where you catalog all your data, or just having security, making sure you have the right permissions, or just knowing where your data comes from with data lineage. But in fact, you need to have a whole lot more. And third, but certainly not least, what we see is that the responsibilities for data governance are either completely centralized or completely decentralized, while, in fact, you kind of want to have a balance between that. You want a central data team, and we’ll touch on that in a second, to help facilitate that. But you also want to make sure that the data producer and consumer teams have some say in how their data governance works as well.

And these things are getting more attention by the day because as we use more and more data, as we produce more and more data, we obviously want to experiment with things like large language models and the data that we have within our company. We want to feed that in. Now, we want to make sure that as that data feeds in, that that is of sufficient quality. Because if it isn’t, the answers that our large language models are going to give just aren’t going to be as useful. We’re not going to get that return on investment that we’re looking for.

How Does Data Governance Work on AWS?

So hopefully by now, we can at least somewhat agree that data governance is really useful for us. Is anyone disagreeing with me on that one? Okay, good. Now, let’s talk about how data governance works on AWS. Before we really dive in, does anyone want to raise their hand and give a definition of what data governance means? Okay, in that case, I’ll try. Obviously I had help from Gartner on this one. So this is the Gartner definition of what data governance is. I’m not going to read it out loud because you can obviously read it yourselves. Now, I think that it’s very useful to understand the formal definition. I also think it’s really useful to think about a more practical definition, if you will. And in that case, I think that might be something like: data governance ensures that the data is in a condition necessary to really support the business initiatives and operations. And I promise, we are going to come back a few times to how supporting business initiatives is actually really important for data governance.

We obviously know, as I talked about a bit before, that there is this partnership between IT and operations, and that is really to get the right data in the right condition, because it takes people, process, and technology. So if that’s the purpose of data governance, it’s really to enable our end users to find data, to access data, to share data. We want to make sure that all kinds of users, whether they are data scientists, whether they are end users, whether they are data lake admins, get the right access at the right time to the right data. Once they find the data, we obviously want to make sure that it’s in a condition to be useful for the business purpose. Now, I do want to make one caveat there. We obviously want to make sure, we as in the broader we in our industry, that there is room for experimentation. There should be room for data sets that really haven’t been curated as much to be used to see whether or not they fulfill that business purpose. As we get access to data, we want to keep it safe. We want to keep it secure. We don’t want to end up on the front page of whatever newspaper you’re reading. And, increasingly important now, we want to comply with the relevant regulations and policies while still making it possible to easily access data.

So within AWS, we’ve come up with this data governance framework that has a whole bunch of capabilities, I’ll call them, starting with ingest and store, making sure that you understand where the data is coming from with things like data integration, master data management, to really understanding what the data is through data profiling, data lineage, data catalog, and obviously protecting your data, making sure that it’s compliant with the regulations that you have, making sure that you understand the lifecycle of the data. Now, at AWS, we obviously have a ton of services that solve these kinds of capabilities, but as AWS grew on the idea of making sure that our customers should be able to choose the right tool for the job, what we see a lot is that our customers are increasingly choosing an AWS partner like Dremio to fulfill parts of what you see on the screen right now. And to us, that is absolutely okay. We, as in like the IT, we need to make sure that we provide our business friends with the right tools to do their job, whichever it might be.

Components of a Modern Data Strategy

So as we think about how to do that, how we can simplify, how we can help our business teams do that, I think this would be a definition of a modern data strategy: an agile plan of aligned action, spanning people, process, and technology, that really accelerates creating value in direct support of strategic business initiatives. Like I said, I was going to come back to the business initiatives, or business objectives, sorry, on this slide, quite a few times. If you are more mathematically inclined, I guess it would be mindset plus people plus process times technology.

Now, I want to touch on the orange words here for a second. Agile, because we’re not making a plan that’s going to last for the next three to five years. Our environment, our data is changing too rapidly for that to be the case. Aligned, because we really want to align it between mindset, people, and process; if we just think about technology, then that’s not going to work. We all know how shadow IT is something that we kind of want to prevent. We want to partner with the business. Accelerates, because it’s not good enough to just think about how you’re creating value now; it is also about how you can increasingly help your company get more value. And direct support, and this is obviously touching on some of the root causes of issues with data governance that we talked about earlier, because we want to make sure that it’s tied to an actual business strategy.

So, as we think about who is doing all that work, I think generally speaking, there are three types of teams, I would say. Producers, consumers, and data teams. You have the producers on one end that are really the domain experts of what they do. They own the data. They know what the quality is. They know where the data comes from. And on the complete other side, you have the consumers, the teams that want to take that data, turn it into something that helps them propel forward. Now, I do want to say that every team within a company can be both a producer and consumer together. They can be the same team. Just think about a forecasting team, for example, which is going to take sales data produced by some other team, but is going to take that, turn it into a forecasting model and share that with some other organization. So, they are both producer and consumer. Now, while I’ve drawn it in the middle, it doesn’t necessarily mean that it’s going to be a centralized team. It’s a data team. It’s really the team that sort of facilitates the exchange of data like a marketplace. They help producers publish their data, share their data to some form of catalog so that you can easily consume it from the other side, and they help the consumers really discover the data. Finding data is, even today, just incredibly hard.

AWS Lake Formation

Now, as I talked about at AWS, we have a bunch of products. Some of them you see on this slide right now. However, that little section over there where it says, “Build your own solution with third parties,” I think that is such an important part of any type of company’s data strategy, right? You want to make sure, as I said before, that you choose the right tool for the job. What we increasingly see is that our customers put their data into Glue Data Catalog with Lake Formation permissions on top of that and use that with integrated engines like Amazon Athena, Amazon EMR, and obviously AWS partners like Dremio. I do want to touch on Lake Formation a bit because I’m going to use that in the next few slides as we walk through an actual example of how that might look like. For AWS, Lake Formation is a central place where people can essentially manage their access control, manage who has access to what data at what time. It, as you could see in the previous slides, really is about enforcing security across multiple different services, whether they are AWS or whether they are from our partners.

How Does Dremio Fit In?

So now the obvious question, as I talked about what data governance is and how it works on AWS, is where does Dremio actually fit in? Now, I know that, especially if you’ve seen the keynote this morning, Dremio does an incredible amount of work. They do so much. However, for the next few slides, I really want to focus on the process and consume side, so the foundational query engine of Dremio, if you will. So let’s run a query. Who is not excited about that one? Thank you. I do appreciate that. By the way, this is from the TPC-DS dataset, so if you want to play along, you absolutely can. We’re going to select everything from the tables store sales and customers, where the customer key equals the customer ID. So a relatively straightforward join, right? Good.
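As a rough illustration of the query described above, here is a minimal Python sketch that runs the same kind of join against an in-memory SQLite database rather than Dremio. The column names `ss_customer_sk` and `c_customer_sk` follow TPC-DS naming, but which columns were actually on the slide is an assumption, and the sample rows are invented.

```python
import sqlite3

# In-memory stand-in for the two TPC-DS tables named in the talk.
# Column names are assumed; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (c_customer_sk INTEGER, c_first_name TEXT);
    CREATE TABLE store_sales (ss_customer_sk INTEGER, ss_net_paid REAL);
    INSERT INTO customer VALUES (1, 'Alejandro'), (2, 'Martha');
    INSERT INTO store_sales VALUES (1, 19.99), (1, 5.00), (3, 42.00);
""")

# Select every column from both tables, matching sales to customers by key.
QUERY = """
    SELECT *
    FROM store_sales
    JOIN customer ON ss_customer_sk = c_customer_sk
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Only sales rows with a matching customer key come back; the sale for customer key 3 has no customer record and is dropped by the inner join.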

And this is where I think Dremio has a very interesting integration with Lake Formation. By the way, total shout out to the Dremio team, who on their website, in their documentation, did an excellent job of explaining how to set up that integration, not just in words, but actual artifacts that you can just put into your environment, whether that’s a Dremio engine or your AWS account. Really nicely done. So as we walk through that query, the Dremio engine is going to do a few things. It’s going to do a few more than these four, but these are the four most important ones. It’s going to check whether the tables that we mentioned have Lake Formation permissions, and it’s going to check if they are registered in the Glue Data Catalog to begin with. It’s going to check the user’s ARN, like the username, if you will, to find the permissions that user actually has on that data, because we want to make sure we get the right access to the right user.

Then Dremio uses Lake Formation to really figure that one out, and ultimately, it processes the query, assuming that you have access. If you don’t, it will stop you right there. The main thing here being that if you do not have access, if you are not allowed to access that specific dataset, you will not get access. You will not get a response from Dremio. So let’s walk through that, right? What I’m going to do is sort of follow those same four steps, and I’m just going to play that I’m the Dremio engine, and I’m going to use a bunch of screenshots from the AWS console. First step: Dremio checks whether the tables have Lake Formation permissions, and in our query, we had two tables. We had customer and we had store sales. So in this case, it might be a bit hard to see on camera, but we have those two tables, and they have Lake Formation permissions on them. So the first is a checkmark. First off, the tables exist. That’s rather important to know, and they have permissions on there.
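The four checks can be sketched in plain Python. This is a toy model of the flow as described in the talk, not Dremio’s or Lake Formation’s actual implementation; the `Catalog` class, the `authorize` function, and the ARNs are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy stand-in for a data catalog plus a permission store."""
    tables: set = field(default_factory=set)
    # Maps (principal ARN, table name) -> set of granted permissions.
    grants: dict = field(default_factory=dict)


def authorize(catalog, principal_arn, tables, needed="SELECT"):
    """Return True only if every table is registered and the principal holds `needed`."""
    for table in tables:
        # Step 1: the table must be registered in the catalog.
        if table not in catalog.tables:
            return False
        # Steps 2 and 3: look up what this principal may do on the table.
        granted = catalog.grants.get((principal_arn, table), set())
        if needed not in granted:
            return False
    # Step 4: all checks passed, so the query may be processed.
    return True


# Hypothetical ARNs and grants, invented for this example.
USER = "arn:aws:iam::111122223333:user/user00"
OTHER = "arn:aws:iam::111122223333:user/other"
catalog = Catalog(
    tables={"store_sales", "customer"},
    grants={
        (USER, "store_sales"): {"DESCRIBE", "SELECT"},
        (USER, "customer"): {"DESCRIBE", "SELECT"},
    },
)

allowed = authorize(catalog, USER, ["store_sales", "customer"])
denied = authorize(catalog, OTHER, ["customer"])
print(allowed, denied)
```

The sketch denies by default: a missing table or a missing grant stops the request before any data is read, which mirrors the point that a user without access gets no response.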

Second step, who’s actually doing this? And this is, I think, one of the most important steps. We need to find out who is running that query and whether or not they actually have permissions. So in my case, I am user00, I think aptly named, and as you can see on the screen, I have permissions to access store sales and customer. You see two records, one for the table, one for the columns. Next step is to figure out what kind of access I actually have, and in my case, I have describe and select. So that means that the Dremio engine knows that I can see the metadata, so essentially all the column names, and that I can actually run a select query, which is really useful to know because if we go back to our query, I wanted to have all the columns from both tables. So based on how I’ve set up the integration between Dremio and AWS, I am allowed to do this. That’s great. I can now actually see stuff.

So let’s make that a bit more difficult, right? Let’s do select name, email, street from customers. Incredibly complex query, not because it’s actually a really hard query to run. It really only takes three input fields, or three columns, from one table. But as you’re about to hit enter on that query, your legal person steps into the room and says, “What about PII? We are a multinational company. We need to make sure that some of those records are only accessible by people that should be able to access them.”

So let’s think about that, right? We’re basically going to go through the same steps. I still have access to the tables, which hasn’t changed. I’m still the same user. I haven’t changed, which is good. I still have the same permissions. We haven’t changed that either. And the big difference, though, is that we are now adding a data cell filter. We are adding a row in there that says, “Give me everything where country equals to Netherlands.” And, oh, by the way, because I’m not in sales, but just imagine I was, I’m not allowed to see any email addresses. So that gets us to one of those powerful capabilities that Dremio supports from Lake Formation, which is what we call cell-level security. If you think about a table as like a multidimensional or two-dimensional grid of rows and columns, then cell-level security really allows you to, on the individual cell level, specify whether or not you should have access. So using that cell filter I just had, I could do stuff like I want to block access to email, and I also only want that person to see any data where the country equals to Netherlands.
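To make the filter concrete, here is a hedged sketch of what such a definition could look like as a request payload. The key names follow the `TableData` structure of the Lake Formation `CreateDataCellsFilter` API; the account ID, database name, filter name, and the `country` and `email` column names are placeholders assumed for this example and were not shown in the talk.

```python
import json

# Assumed names: the catalog ID, database, table, and filter name are
# placeholders invented for this sketch.
table_data = {
    "TableCatalogId": "111122223333",
    "DatabaseName": "tpcds",
    "TableName": "customer",
    "Name": "nl_no_email",
    # Row-level part: only rows for the Netherlands are visible.
    "RowFilter": {"FilterExpression": "country = 'Netherlands'"},
    # Column-level part: every column except email is visible.
    "ColumnWildcard": {"ExcludedColumnNames": ["email"]},
}

# With boto3 installed and AWS credentials configured, this dictionary could
# be passed as
#   boto3.client("lakeformation").create_data_cells_filter(TableData=table_data)
# The sketch only prints it, so it runs without an AWS account.
print(json.dumps(table_data, indent=2))
```

Combining a row filter with a column exclusion in one definition is what makes the restriction cell-level: a cell is visible only if both its row and its column pass.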

And that’s great. So as I run this query, I only get to see Alejandro, John, and Martha, who live on Main Street 1, Any Street 2B, and Common Street 6, simply because that’s what it says in my data filter. Now, as I said, Any Company is a multinational company, so we absolutely have sales teams in the UK and in the U.S. So from that perspective, someone in the U.S. might actually be able to see everything. They might be able to see that we have a customer called Pat, who also happens to work at Any Company, and who lives at 123 Any Street. So what a user can see, or is allowed to see, really depends on the user.
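The effect of per-user filters can be simulated in a few lines of Python. The names and streets come from the talk; the email addresses, the country values, the user labels, and the `select` helper are invented for illustration. How a real engine treats a column that is requested but hidden (omit it, return nulls, or reject the query) is not stated in the talk, so this sketch simply omits it.

```python
# Sample rows based on the names in the talk; emails and layout are invented.
CUSTOMERS = [
    {"name": "Alejandro", "email": "alejandro@example.com",
     "street": "Main Street 1", "country": "Netherlands"},
    {"name": "John", "email": "john@example.com",
     "street": "Any Street 2B", "country": "Netherlands"},
    {"name": "Martha", "email": "martha@example.com",
     "street": "Common Street 6", "country": "Netherlands"},
    {"name": "Pat", "email": "pat@example.com",
     "street": "123 Any Street", "country": "United States"},
]

# Per-user filters: a row predicate plus a set of hidden columns.
FILTERS = {
    "nl_user": {"row": lambda r: r["country"] == "Netherlands", "hidden": {"email"}},
    "us_user": {"row": lambda r: True, "hidden": set()},
}


def select(user, columns):
    """Return the requested columns for the rows and cells this user may see."""
    user_filter = FILTERS[user]
    visible = [c for c in columns if c not in user_filter["hidden"]]
    return [
        {c: row[c] for c in visible}
        for row in CUSTOMERS
        if user_filter["row"](row)
    ]


nl_rows = select("nl_user", ["name", "email", "street"])
us_rows = select("us_user", ["name", "email", "street"])
for row in nl_rows:
    print(row)
```

The same query text yields three rows without email addresses for the first user and all four rows with every column for the second, which is the point of the example: the result depends on who is asking.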

So imagine how relaxed you could feel as a data lake administrator knowing that the data in your data lake is going to the right person at the right time in the right context with the right permissions. Is that not going to make you feel good? That makes sure that as you open your newspaper, as you look at your favorite news website, the only articles about Any Company, you know, your company, that you’re going to see are happy ones.

So just to recap, as we have about three minutes left, we talked about three main things. Why does data governance matter? To sort of recap into one sentence, it gets the right data with the right quality to the right people in the right context with the right permissions. How does it work on AWS? We looked at that data governance framework with a whole bunch of different data governance capabilities. And as I mentioned, it’s not just an AWS thing. It is customers– I would almost call it choose your own adventure. Customers get to choose what they want to do with which software. And how does Dremio fit in? As I walk through an example using part of what Dremio offers, it really is about making sure that the permissions I set are actually what I’m going to get as the query executes. So with that in mind, I certainly enjoyed my time here. I thank you all for being here. I hope you have an amazing rest of your conference.