May 2, 2024
How Dremio’s Data Lakehouse Reduces TCO for Analytics
Reducing total cost of ownership (TCO) while delivering faster time-to-insight is top of mind for every analytics organization. This session will share how Dremio’s Unified Lakehouse Platform helps organizations drive down TCO. We’ll explore strategies for optimizing performance and outcomes while reducing cost. Learn from real customer examples how you can reduce analytics cost and increase productivity by minimizing data copies, eliminating BI extracts, and saving valuable data engineering time.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Lenoy Jacobs:
All right. Hey everybody. Good morning, good afternoon, good evening, wherever you are. Thank you for joining me today. Welcome to my talk on how Dremio’s data lakehouse reduces TCO for analytics, or rather, how to reduce your cloud data warehousing cost by a lot by using a data lakehouse instead. What is a data lakehouse, you might ask? We’ll get into that in just a moment. We’ll start by diving into some of the cost drivers that come with using cloud data warehouses. Then we’ll look at how to eliminate a lot of those cost drivers when we move our workloads onto a data lakehouse architecture. And finally, we’ll look at some real-world examples, comparing with cloud data warehouses like Snowflake. But it’s hard to talk about a solution before we’ve talked about the problem, so let’s do that first.
What Cloud Data Warehouse Customers Tell Us
Being in the data lakehouse space, we at Dremio talk to a lot of people who are trying to improve their data platforms, and from them we hear a lot about the problems they’re having with their existing platforms. Here are some of the things we hear from cloud data warehouse customers. One, there is a lot of data lock-in. They cannot access their data efficiently except through the data warehouse platform, and if you want to export the data out, it can get expensive to do so. You do see some movement away from that, with certain cloud data warehouse vendors trying to support open table formats, but it’s still very much the case that you have to get the data into something like Iceberg to get the full advantage of that support. Two, it’s not ideal for BI and self-service. There’s a lot of ease of use and plenty of features in data warehouses, but at the end of the day you still fall back on old patterns like BI extracts and cubes to improve BI performance. And when people look at what something like Snowflake is costing, the cost of those additional workloads to create the BI cubes and extracts generally isn’t factored in, so the total typically ends up being far more expensive than you realize once you account for all the work you do around the data warehouse to optimize it. Third, it’s expensive to maintain. Data teams spend a lot of time and resources maintaining expensive queries and optimizing materialized views on top of them. Even the cost of storage adds up: you may not realize that you’re tracking and storing historical versions of your data, generally for something like 90 days, and you’re paying for that. Add things like egress and ingress costs from ETL, which brings me to the last point: expensive ETL. When you’re ingesting data from the data lake into the data warehouse, you’re moving data out of your data lake, so you’re hitting egress costs, and those add up just from the movement of data from the lake to the warehouse. So you have all the costs of operating the data warehouse, plus the cost of getting the data into it, plus the cost of optimizing it, not just for your own workloads but for the workloads of every other user of the platform. And in most initial cost calculations, you’re mostly focused on what your own workload costs, not on all the additional cost drivers that make those data warehousing bills so large.
Current Approaches to Data Management
So why is that? It all comes down to this traditional approach to data management, and to asking whether there’s a better way to do it. Whether it’s Snowflake or some other cloud data warehouse, the pattern is the same, and you run into the same problems. So what are those problems? You start off with your sources, on the left-hand side here: application databases, the different files generated throughout your business, SaaS applications like Salesforce or some other third-party app. You’re collecting all sorts of different data, and it typically ends up in your data lake. Why? Because the data lake is a place where you can store both structured and unstructured data, and it’s an inexpensive place to store it. On the data lake, people typically follow what’s called a medallion architecture: they land all the data in a bronze zone, clean and standardize it into a silver zone, a silver version of the data, and then transform it one more time to make it ready for consumption, which is the gold layer. Right there you’ve already created three copies, because you took the raw data, transformed it into silver, and transformed that into gold, and each one is a physical copy in this traditional pattern. So you’re generating costs there. And then you take the bits of your gold data that you want in the data warehouse and ETL them into your cloud data warehouse.
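To make that pattern concrete, here is a minimal PySpark sketch of the bronze/silver/gold copies described above. The bucket, paths, column names, and cleaning rules are hypothetical; the point is that each zone is another physical write, before anything even reaches the warehouse.

```python
# Hypothetical sketch of the bronze -> silver -> gold pattern with PySpark.
# Paths, columns, and cleaning rules are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: land the raw export as-is (first physical copy).
raw = spark.read.json("s3://my-lake/landing/orders/")          # hypothetical bucket
raw.write.mode("overwrite").parquet("s3://my-lake/bronze/orders/")

# Silver: clean and standardize (second physical copy).
silver = (
    spark.read.parquet("s3://my-lake/bronze/orders/")
    .dropDuplicates(["order_id"])                               # hypothetical key
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .filter(F.col("amount") > 0)
)
silver.write.mode("overwrite").parquet("s3://my-lake/silver/orders/")

# Gold: aggregate for consumption (third physical copy), which is then
# often ETL'd yet again into the cloud data warehouse.
gold = silver.groupBy("customer_id").agg(F.sum("amount").alias("lifetime_value"))
gold.write.mode("overwrite").parquet("s3://my-lake/gold/customer_value/")
```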
Once you do that, you create the curated zone, which is essentially a copy of the gold data. Then you start building your summary tables, and accelerations of those summary tables to help your BI dashboards. You create processes around those materialized views, and now you have to worry about maintaining those materialized views and summary tables and keeping them in sync, updated, and fresh. Then you generally end up generating departmental data marts. Those may be some combination of logical views, but a lot of the time you’re creating additional copies so that people within those departments can work with the data in their own way. So there are more physical copies. Each of those data movements also means ETL code that has to be written, tested, deployed, and managed. It all gets more complex, which increases your time to insight because you’re taking longer to get to the insight, and all of it increases your costs: you’re spending compute to build this, more compute to maintain it, and you’re increasing your storage footprint. You can see how this gets expensive really quickly, and I’m sure you’ve experienced this with your cloud bills, with your data warehousing bills; that’s why you’re in this session. You’re also generating extracts and cubes that are external to the data warehouse, things like Tableau extracts or Power BI imports. You create separate collections of pre-aggregated data, and that gets expensive and complicated. And the more complicated all of this gets, the less self-serve it gets. We get away from the goal of giving our data analysts and data scientists more direct access to the data so they can serve themselves. We start having data copies everywhere, and because there are so many versions of the data, all of it has to be governed if you want to comply with governance regulations, and that gets really hard to do across all of these different systems. Then there’s the data in your data warehouse, which, as I mentioned before, gets locked in; the warehouse generally uses some proprietary internal format. If you decide to move to another data warehouse or some other tool, guess what: you’re going to have to move into that platform, into some other proprietary format. Again, more data movement, more costs. You get the picture. Ideally, you would want that time to insight to be as small as possible.
If you could get close to an instantaneous time to insight, wouldn’t you want that? So how do we do it? The idea is to take the data consumers you see over here and do what we call shifting them left. They’re currently pointed at the cloud data warehouse; we’re going to shift them left onto the data lake. That means we get to treat our data lake as a data warehouse, and that is what we call a data lakehouse.
Ideal Enterprise-Grade Lakehouse
So let’s look at what makes up a data lakehouse. It’s got a lot of different pieces, so think about it this way. You start at the bottom with your object storage; that can be S3, Azure storage, or Google Cloud Storage. What you want is for the files that have landed in the data lake, your Parquet files, ORC files, Avro files, to be treated as tables, and that’s where the open table format comes into the picture. Open table formats like Iceberg let you identify different groups of files as tables. More often than not, you don’t have just one big table, you have lots of tables, and you need a way to track them all; that’s where a catalog comes in. The catalog helps you discover those tables across multiple tools. And what’s the point of tracking your tables if you’re not going to do any analytics with them? You need a query engine for that: something that can run queries, do transformations, and do whatever processing work you want on that data.
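As a rough illustration of how those three pieces fit together, here is a minimal PySpark sketch that configures an Iceberg catalog, creates a table, and queries it. The catalog name, warehouse path, and table names are assumptions; a real deployment would point the warehouse at object storage and use a shared catalog service rather than a local path.

```python
# Minimal sketch: files become a table (Iceberg), the table is tracked by a
# catalog ("lake"), and an engine (Spark here) runs the queries.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-sketch")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    # Local path for the sketch; in practice this would be S3/ADLS/GCS.
    .config("spark.sql.catalog.lake.warehouse", "/tmp/lake-warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.sales.orders
    (order_id BIGINT, customer_id BIGINT, amount DOUBLE)
    USING iceberg
""")
spark.sql("INSERT INTO lake.sales.orders VALUES (1, 10, 19.99), (2, 11, 42.00)")
spark.sql("SELECT COUNT(*) AS orders, SUM(amount) AS revenue FROM lake.sales.orders").show()
```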
Finally, for your end users, you want to deliver the data in a way that’s easy for them to understand, easy to find and discover, and well governed, and that’s why you need a semantic layer. Traditionally, the semantic layer has always been part of BI tools; here it makes a lot more sense to offload that semantic-layer logic into your data lakehouse. This is the place where your users can go and get a unified view of their data, discover it, and bring it into their different use cases, whether that’s data science workloads, dashboards, or data applications, using standard interfaces like ODBC, JDBC, and REST APIs to grab the data. But there’s one more thing in this picture. Most of the data in your data lake is treated as tables through these table formats, but not all of your data is ever going to be in the data lake. You may have data that comes from a marketplace, say AWS’s data-sharing marketplace, or data sitting in Snowflake because you’re using Snowflake’s data marketplace. You may have other data sitting in a database that just isn’t worth moving to the data lake; you’d rather query it directly from the database itself. So ideally, an enterprise-grade data lakehouse platform includes some form of virtualization to reach that long tail of additional data. Think of it as an 80/20 rule: you want 80% of your data on the data lake, but you’re still going to have 20% coming from all these other sources, so you need a tool that can give access to that additional 20%, along with a platform that unifies all the pieces that make up the lakehouse and makes it usable. And that is essentially what something like Dremio provides.
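For example, a consumer pulling governed data from the semantic layer over one of those standard interfaces might look like the following Python sketch using ODBC. The DSN name and the view name are assumptions; any ODBC- or JDBC-capable lakehouse endpoint would look much the same from the analyst’s side.

```python
# Sketch: an analyst queries a curated, governed view over a standard ODBC
# interface, rather than pulling a BI extract or a copy of the data.
import pyodbc

# Hypothetical DSN configured for the lakehouse's SQL endpoint.
conn = pyodbc.connect("DSN=lakehouse_prod", autocommit=True)
cur = conn.cursor()

cur.execute(
    "SELECT region, SUM(amount) AS revenue "
    "FROM analytics.curated_orders "       # hypothetical semantic-layer view
    "GROUP BY region"
)
for region, revenue in cur.fetchall():
    print(region, revenue)

conn.close()
```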
Dremio is a data lakehouse platform, a unified lakehouse platform for self-service analytics. You can connect your object stores, whether that’s Azure Storage, S3, or Google Cloud Storage, and even on-prem data sources: things like MinIO or VAST Data or NetApp or any of the data lake storage vendors. You can take a hybrid lakehouse approach, with some data on your cloud storage and some on your on-prem storage. You can also connect that long tail of additional sources: Dremio can connect to Snowflake, to NoSQL databases like MongoDB and Elasticsearch, to Postgres and SQL Server, and to a number of other sources. It provides the things you need to tie it all together. Dremio provides the semantic layer we talked about, so you have that nice view where your end users can easily discover data, and that data can be documented and governed. Everything in the semantic layer is done virtually, so there are no copies of your data. It also provides a SQL query engine with best-in-class price-performance and all sorts of acceleration features. And it gives you the lakehouse management features: it comes with a built-in catalog and provides automatic table-format optimization, garbage collection, all of that. So using the data lakehouse feels like using a cloud data warehouse: one unified platform view, yet you still have that open nature. You’re still shifting left, moving all these workloads to your data lake, but you have a platform that feels like a cloud data warehouse, with that ease-of-use factor. And then you can pass the data to your data science tools, your dashboards, and your applications.
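As an illustration of that virtual, zero-copy modeling, the sketch below issues a CREATE VIEW that joins a lake table with an external database source. The source names, schema, and columns are made up, and the SQL is written in a generic ANSI style rather than any specific vendor’s dialect; the idea is that the definition is just metadata until someone queries it.

```python
# Sketch: a semantic-layer style virtual view joining a lake table with an
# external (federated) database source. Nothing is copied or materialized.
import pyodbc

conn = pyodbc.connect("DSN=lakehouse_prod", autocommit=True)  # hypothetical DSN
cur = conn.cursor()

cur.execute("""
    CREATE OR REPLACE VIEW analytics.customer_360 AS
    SELECT o.customer_id,
           SUM(o.amount) AS lifetime_value,
           MAX(c.segment) AS segment
    FROM   s3_lake.sales.orders o              -- hypothetical lake source
    JOIN   postgres_crm.public.customers c     -- hypothetical database source
           ON o.customer_id = c.customer_id
    GROUP BY o.customer_id
""")
conn.close()
```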
So let’s dig in a little deeper; I won’t spend too much time on this slide. Again, Dremio enables a lot of this through its unified access layer. We give you an easy view, nice and organized in one place, with minimal copies of your data, because everything you do in Dremio’s semantic layer is virtual. Dremio provides a SQL query engine that lets you query the data easily, and quickly; speed also means less cost. If you don’t have to ETL the data into the data warehouse, you save money there. You’re using less storage, so you save money there. If your queries run faster, you save on compute, and if you’re using cheaper compute, you save there as well. So you can see where all of these cost savings come from. And then you have the lakehouse management. Ideally you want those Iceberg tables, those lakehouse tables, to be kept optimized, so you get things like Iceberg compaction and cleanup of unused files. That way you’re not spending more than you need to on storage, and your queries are never slower than they need to be, because your datasets stay optimized. Again, faster queries mean saved money, and more efficient queries mean shorter-running compute.
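To show what that kind of table maintenance involves under the hood, here is a sketch using Iceberg’s stock Spark procedures for compaction and snapshot expiry, the sort of housekeeping a managed lakehouse automates for you. It assumes a SparkSession configured with the Iceberg extensions and the hypothetical “lake” catalog from the earlier sketch.

```python
# Compaction: rewrite many small files into fewer, larger ones so scans stay fast.
spark.sql("CALL lake.system.rewrite_data_files(table => 'sales.orders')")

# Garbage collection: expire old snapshots so you stop paying to store
# history you no longer need.
spark.sql(
    "CALL lake.system.expire_snapshots(table => 'sales.orders', retain_last => 5)"
)
```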
Warehouse and Lakehouse: Better Together
So how would you bring all of that together? We are not saying go and rip out your Snowflake or Redshift or Teradata; we’re not telling you to pull out of them. There are still plenty of reasons to use these cloud data warehouses. For example, as mentioned, Snowflake has a data marketplace, which can be useful for acquiring datasets. Instead, the idea is to start shifting left and bring these workloads directly to your data lake, to your data lakehouse platform. That’s when you start seeing a lot of these cost reductions: a reduction in storage costs; a reduction in ingress and egress costs, because you’re not moving data into your data warehousing platform as much anymore; and a reduction in the cost of ETL, especially the ETL you do to generate all those departmental data marts. Instead, you can model all of that data virtually in the lakehouse without making a billion copies of it, and you can work with all that data, including that long tail of additional data in non-data-lake sources, all from one place.
So if you look at this diagram, you still have the vast majority of your data in batch sources, which you ETL and land in your data lake. As before, that part is not changing, but instead you’re landing it in an Iceberg table, tracked by your Iceberg catalog. Dremio is the interface to manage and work on your data lake, providing the UI, but it can also connect to your data warehouses. Dremio can connect to Snowflake and Redshift, so you’ll still be able to see the shared datasets you’re using there, use them to enrich your lakehouse, and then deliver the data to your data science notebooks, applications, BI tools, and so on. The Dremio engine itself does a lot of other things under the hood to further decrease your costs. It has a lot of caching, so it doesn’t hit S3 as often and you save on S3 access costs. Dremio has things like reflections to speed up queries even further, without the typical headaches of materialized views. Dremio handles that management for you and makes it much more usable: you don’t have to create as many reflections as you would materialized views, and your analysts and data scientists don’t even need to know they exist, because Dremio will intelligently use those reflections to speed up queries on the right datasets. The idea is that your users just go explore, find the data they’re looking for, and query it, without worrying about where the data is or which optimized version of it they should use, and they do it in a quick and efficient way, reducing that time to insight.
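To illustrate the idea behind that kind of acceleration (this is deliberately not Dremio’s reflection syntax), here is a generic pre-aggregation in Spark SQL against the hypothetical tables from the earlier sketches. The difference with reflections is that the engine builds, refreshes, and substitutes structures like this automatically, so users keep querying the original dataset and never have to know the summary exists.

```python
# Generic illustration of what an acceleration structure does; NOT Dremio's
# reflection syntax. A managed engine would maintain and use this for you.

# A pre-aggregated summary kept alongside the detail table:
spark.sql("""
    CREATE OR REPLACE TABLE lake.sales.orders_by_customer USING iceberg AS
    SELECT customer_id, COUNT(*) AS order_cnt, SUM(amount) AS total_amount
    FROM lake.sales.orders
    GROUP BY customer_id
""")

# Without transparent query rewriting, users must know to hit the summary:
spark.sql(
    "SELECT customer_id, total_amount FROM lake.sales.orders_by_customer"
).show()

# With reflections, users keep querying lake.sales.orders and the engine
# substitutes the summary when it can: no retraining, no extra objects for
# users to keep track of.
```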
Real World Examples
All right, so let’s look at some real-world examples, comparing with Snowflake. Here we have a company looking at the three-year total cost of ownership of its analytical platform, so we’re looking at the end-to-end TCO of analytics: basically all the costs we mentioned earlier that you would incur using something like Snowflake, things like the different ETL costs, the egress costs, the compute costs, and so on. Here we’re comparing a Dremio large cluster, which is eight nodes, against a Snowflake large warehouse. With Dremio, you see a cost over three years of 1.4 million versus Snowflake’s 2.9 million when you break it all down, which is roughly a 50% reduction. And this is just one example with Snowflake; some customers have seen more than 50% in different situations. It’s transformative, because think about what you can do with the money you save: what data projects that aren’t currently being funded could you now fund? You not only get insights faster, you increase the kinds of insight you’re getting. Those savings go a long way, and there’s a lot of value beyond just dollars and cents.
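For reference, a quick back-of-the-envelope check on those quoted figures:

```python
# Three-year TCO figures as quoted above.
dremio_tco_3yr = 1.4e6     # $1.4M, 8-node Dremio large cluster
snowflake_tco_3yr = 2.9e6  # $2.9M, Snowflake large warehouse

savings = snowflake_tco_3yr - dremio_tco_3yr
pct_reduction = savings / snowflake_tco_3yr * 100
print(f"${savings / 1e6:.1f}M saved, about a {pct_reduction:.0f}% reduction")
# -> $1.5M saved, about a 52% reduction
```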
Here’s another example, a customer who saw 75% TCO savings: $3 million in savings in just one department of the company. Imagine if they adopted this pattern across all of their departments. The original story is that they were using Snowflake. You can see here, they have data on S3, they move it into Snowflake, they generate extracts, they generate all the data marts, and they extract 700 million records of data into Tableau. Even with the extract, it would take about four minutes per use of the dashboard. So if someone turned a knob on the dashboard to see the data a little differently, they would have to wait three to four minutes per click. Again, that’s not the time to insight you’re looking for. With Dremio, it got a lot simpler. They just landed their data in the data lake, which they were doing already, so there’s one copy of the data, and Dremio sits on top and delivers the data directly to Tableau. There was no need for Tableau extracts; these were all live queries, and they went from three to four minutes down to five to fifteen seconds per use of the dashboard, per click. And at the end of the day, that’s what matters: it allowed business decisions to be made much faster. You get that nimbleness, you can make those decisions quickly, and you need a platform that lets you do that. Here’s another example, where we see 91% lower TCO with Dremio compared to Snowflake. This company is a global leader in the manufacturing of commercial vehicles. Again, they were using Snowflake, and they compared the same workload on Dremio versus Snowflake. It cost them $47 to run that workload on Snowflake using Azure and AWS, and running it with Dremio on the data lake was much cheaper; there’s quite a difference in cost to do the same work. Again, with Dremio there is no data movement and no need to copy the data, because we’re working with the full data in place here; that’s the full picture.
All right, here’s my last example. Here we see Dremio working with a large pharmaceutical company whose workloads were primarily on Azure. As a comparison, we ran those same workloads on Dremio using Parquet files in Azure Storage, on a 10-node cluster of 120-gig, 16-CPU machines, while Snowflake was using an extra-large cluster, so this is an apples-to-apples comparison. What happened is that the workload took 30 seconds to complete on Snowflake, while the same workload in Dremio took about seven seconds. So again, it’s quite a bit faster, more than four times faster. And think about it: if you can get data insights faster, you can get the data faster to make those nimble decisions we talked about, and you’re able to transform the business faster. That’s huge.
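Again, a quick check on the quoted runtimes:

```python
# Runtimes as quoted above for the same workload.
snowflake_runtime_s = 30
dremio_runtime_s = 7
print(f"{snowflake_runtime_s / dremio_runtime_s:.1f}x faster")  # -> 4.3x faster
```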
All right, so let’s sum it up: Dremio as a lakehouse solution versus a typical data warehouse, starting from the data engineering standpoint. Data ingest, we talked about this: it’s pretty straightforward. The data is already on the data lake, so you don’t need to ingest it. With data warehouses you do, and that can incur costs. Granted, as mentioned, some data warehouses have external table features, but to get the full performance of the warehouse they recommend you load the data into it, which means you’re going to have the ingestion costs, the egress costs, and the additional storage costs. Data transformation: with Dremio it’s virtual, so you don’t make copies of your data. You do last-mile ETL to add columns or remodel some data, and a lot of that can be done virtually, with performance; that’s where Dremio’s secret sauce is. It’s able to make those virtual data marts practical. With data warehouses, you still create those data marts, but they’re all physical copies of your data, and more copies mean more work, more code, more governance work, plus all that storage cost.
Major transformations, well, that’s not a Dremio feature yet. What I mean by a major transformation is, for example, converting a file format, from Avro to Parquet or from Avro to Iceberg, something like that. That is not a core capability of Dremio yet, so ideally you would pair Dremio with something like Spark (a quick sketch of that pairing follows just below). From a data warehousing standpoint, of course, this is the bread and butter; you can easily do it, but it typically tends to be expensive when you do it directly in the data warehouse compared to using something like Spark. Now, as far as the user experience goes, Dremio’s semantic layer is built in; it’s part of what makes Dremio unique. It’s the user interface where your users can easily discover, organize, and curate data without moving it and without creating copies. With data warehouses that’s not available; there isn’t a built-in semantic layer, so you have to pair them with third-party services like Tableau or Power BI or AtScale. It’s easier to configure these things with Dremio; with a data warehouse you have more pieces to integrate across those third-party tools.
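Here is a minimal sketch of that pairing for a heavy format conversion: rewriting an Avro dataset as an Iceberg table on the lake with Spark. The paths and table names are assumptions, and it presumes a SparkSession with the spark-avro package and the Iceberg runtime and “lake” catalog configured as in the earlier sketch.

```python
# Sketch: hand the heavy lift (Avro -> Iceberg) to Spark, then query the
# resulting lakehouse table from your engine of choice.
avro_df = spark.read.format("avro").load("s3://my-lake/landing/events_avro/")

(
    avro_df.writeTo("lake.sales.clickstream")  # hypothetical Iceberg table
    .using("iceberg")
    .createOrReplace()
)
```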
Acceleration: Dremio has things like caching; it caches data that gets queried often, so you don’t have to keep fetching it from S3. And it has the reflections we talked about. Data warehouses do have materialized views and materialized aggregates, but there’s a lot of maintenance work, and your users have to be trained on which materialized view to use, because they can’t just use the original table. Some data warehouses have table-redirect functionality, but it’s often limited to a single redirect per table at most; with Dremio, any number of tables can be redirected. Data curation and federation are easy to do with Dremio, because everything is virtual, and you can create things like external reflections. With data warehouses, you largely have to ETL; there isn’t a virtual way of doing it. You can create views, but more often than not you end up creating copies and materializing those views. And as far as querying data goes, this is the cool thing about Dremio: it can automatically rewrite your queries. This is where the reflections feature comes in; it creates reflections to speed up raw queries, and when a query comes in for the original dataset, Dremio automatically uses the reflection. Your users don’t even need to worry about which reflection to use; Dremio just does that for you. So it’s not just about cost, it’s also about ease of use, and Dremio’s whole purpose is to make things easier and more efficient. Okay, that’s about it for my presentation. There is a QR code here that gives you a lot more detail on what I’ve talked about today, including a white paper; you can scan it to access that.