Episode 31 | September 6, 2023
Unlock the potential of data engineering in our "ELT, ETL & the Dremio Data Lakehouse" webinar! Discover how Dremio's no-copy architecture revolutionizes ETL & ELT patterns, optimizing data processing and cutting costs.
Join us for an insightful webinar titled “ELT, ETL, and the Dremio Data Lakehouse,” where we explore the cutting-edge capabilities of Dremio in revolutionizing data engineering and analytics workflows. This webinar delves into the strategic use of Dremio’s innovative technologies to optimize Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) patterns for enhanced efficiency and cost-effectiveness.
The session will commence with an in-depth exploration of traditional ETL and ELT methodologies, highlighting the challenges faced by organizations in managing large-scale data transformations. We will analyze the critical role of ELT patterns in the modern data landscape and the growing significance of data lakes for storage and processing.
Subsequently, we will introduce Dremio, a powerful and flexible data lakehouse platform, as a game-changer for executing ETL and ELT operations. Dremio’s unique architecture empowers users to directly query data residing in the data lake, eliminating the need for unnecessary data copies and reducing data movement overhead significantly.
During the webinar, attendees will gain valuable insights into how Dremio’s no-copy architecture minimizes data redundancy, accelerates data processing, and drastically reduces the associated costs. By harnessing the full potential of data lake storage, organizations can simplify their data engineering workflows, enhance data availability, and achieve unparalleled performance for analytical workloads.
Key webinar takeaways:
Don’t miss this opportunity to explore the next-generation data engineering possibilities with Dremio’s data lakehouse platform. Join us for a thought-provoking discussion on how to elevate your organization’s data capabilities, streamline ETL and ELT operations, and make data-driven decisions.
Watch or listen on your favorite platform
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Alex Merced:
Hey, everybody. This is Alex Merced, developer advocate here at Dremio, and your host here every week on Gnarly Data Waves. This week, I will be presenting about ETL, ELT, and the Dremio Lakehouse. So we'll go over what ETL and ELT are, to make sure we're all on the same page, and then we'll talk about how Dremio can facilitate both patterns and address a lot of the challenges with each pattern. So we'll go through that.
But before we do all that, of course, I'd like to remind you guys to go try out the Dremio Lakehouse. Go head over to Dremio.com, and click on the Test Drive button. No cost, no obligation. Get to try out working with Dremio, so that way you can see what Dremio brings to the table, and what it means to use the Dremio Lakehouse platform.
We try to be everywhere and anywhere, at all times. So we're going to be at all these events coming up: Big Data and AI in Paris in September; Big Data London this month; Coalesce (dbt) in October; and Dipankar and I, the developer advocates here at Dremio, are going to be doing an Apache Iceberg ‘Ask Me Anything’ over at Data Day Texas in January. You'll be able to come down in person and ask us anything you want. We'll also be doing office hours while we're on-site over there. And yeah, it's gonna be a good time, celebrating being near the release of Apache Iceberg: The Definitive Guide at that point early next year.
And on that note, again, my name is Alex Merced. I'm a Developer Advocate here at Dremio. We're talking about ETL, ELT, and the Dremio Lakehouse, but first a little bit about who I am, if you're not as familiar with me. I've worked in data and in web for companies like Crossfield, Digital Campus Guard, and General Assembly. I'm one of the co-authors of Apache Iceberg: The Definitive Guide, which I mentioned earlier. I've spoken at many events, like P99Conf, Data Council, and Data Day Texas. I run several podcasts, which I'll mention again in a minute, and I've written different libraries for Python and JavaScript. I like code. I like tech. I like talking about tech. That's essentially the long and short of it.
And, on the note of the book––if you want to get an early copy––right now there are about 160, maybe even 180, pages available in the early release. So lots of information about Apache Iceberg is already available to you when you get that early release copy, which just got updated last week. If you grabbed it a month ago, go pick it up again, because there are new chapters there. Okay? You just have to scan that QR code right there, and you get a free copy!
I also mentioned that I host some podcasts. You guys are here already on Gnarly Data Waves, which we do every week. But I also do SELECT * From Data Lake with Dipankar Mazumdar, the other developer advocate here at Dremio, and I do a solo podcast called Data Nation. All of these are accessible on Spotify and iTunes, so subscribe! We've always got stuff to say; there are all sorts of interesting things going on in the data world. Now we will begin with our featured presentation, talking about ETL, ELT, and the Dremio Lakehouse.
And again, throughout the presentation, if you have any questions, just put them in the chat. So first off, we're just going to go over what these things are, just to make sure that we're all on the same page.
And essentially, first, we'll go over ETL. ETL is a pattern that has been around forever when it comes to taking data from point A and moving it to point B. Essentially, you'll have a source of the data, which could be a transactional database or an IoT device. But you have somewhere the data is coming from, and we're gonna take the data out of that place. What you need to do is transform the data in the process of moving it, so that you structure it however you need it for your analytical needs in the destination. So let's say you're landing it as Parquet files in Amazon S3. The data would come from, let's say, a Postgres database. You might change the schema, you might denormalize some tables, do whatever other transformations you decide you want to do, do some data quality checks, clean up bad records, backfill things––make all these transformations so that the already curated data is landing in those Parquet files in S3. So the idea is that the transformation is happening before the data arrives at its location, not after. That's going to be the key distinction between the two patterns.
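To make that concrete, here's a rough sketch of the kind of in-flight transformation an ETL job might run before writing the results out as Parquet files in S3. The table and column names are made up purely for illustration.

```sql
-- Hypothetical in-flight ETL transformation: denormalize two source tables,
-- drop bad records, and shape the columns the way analysts want them,
-- before the result lands as Parquet in S3.
SELECT
  o.order_id,
  o.order_date,
  c.customer_id,
  c.region,
  o.amount
FROM orders o
JOIN customers c
  ON c.customer_id = o.customer_id
WHERE o.amount IS NOT NULL        -- basic data quality check
  AND o.order_date IS NOT NULL;   -- clean up bad records before landing
```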
So what are the pros and cons of this? When it comes to ETL, one of the benefits of taking the ETL approach is quality control. Because in this case, I'm doing all of that work before the data lands. So that means the data can land exactly the way I want, and I could have pre-checked it for things like null records and referential integrity––all of that could have been done in that transformation process before the data landed. So you have more data quality assurances about the data that lands at your destination, whether that's the data lake, the data warehouse, or wherever.
Performance––since the data transformation occurs before loading, there are no transformations that have to occur after, [in an] ideal world. So in that case, when you're querying the data, you're just basically querying the data as it is. So you're not also spending time doing additional sort of schema transformations or other transformations in those final queries. So it can, especially for larger data sets, improve performance.
Structured data––it works really well for structured data, because at the end of the day, what you're doing is you're trying to refine the structure in that transformation process.
But there are some cons to an ETL approach––latency. Because you're doing the transformation in between, the actual moving of the data involves two things: you have to export the data, and then you also have to wait for the transformation. There's a lot more computing to be done in that process, so it can take longer, creating latency for people who want to query the data where it lands, because it's not arriving there as fast as it would have if it wasn't transformed in that step. So there's a latency consideration.
There's complexity. Basically, that one step is doing all the work. All the logic––how do I get the data out of the source, and what do I do to transform it before I land it in the new location––is being done in one job, one process. So there's a lot more complexity in that process. And as you scale it up, as you start to deal with more and more data and more and more complex transformations, it can get harder to scale. Since that one leg of the journey is doing more, there's just more work to do when you scale it.
Okay, so that's sort of the ETL story. There are some best practices you can follow, such as data profiling. It's much easier to design these jobs well if you spend some time learning the data before you build the job. So profile the source data, make sure you understand the existing structure––where it's starting and where it's supposed to end up––and map out that beginning-to-end process. It's essentially like pseudocode when you're writing code: thinking through what it is I'm trying to do before I go and execute. Creating those data profiles lets you work through that logical layer before you go through the implementation.
Okay, incremental loading. This is things like CDC (change data capture)––using ways to capture the changes in the data so that, as data changes, you're not having to reload the full data set but just the incremental changes, which speeds up those jobs.
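As a rough sketch (with made-up table and column names), an incremental load driven by a watermark column might look like this; a true CDC setup would read the source database's change log instead.

```sql
-- Append only the rows that changed since the last load, rather than
-- reloading the entire source table.
INSERT INTO lake.sales.orders
SELECT *
FROM postgres.public.orders src
WHERE src.updated_at > (SELECT MAX(updated_at) FROM lake.sales.orders);
```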
And then monitoring and logging––that way you have logs you can use to pick up on any errors or unexpected behavior, and monitoring so you have some way to actually see this information and get an idea of the health of your pipeline.
Okay, so that's sort of the ETL side. So again, ETL––we're taking the data out, transforming it, then landing it. And then data lineage––that's going to be a best practice, regardless of which way we do this, that you keep track of how the data got from point A to point B, and there's gonna be different mechanisms for doing that.
But then there's also ELT. In the ELT story, basically, what you're saying is: you know what, let's not transform the data right away. I'm just gonna export the data, just move it from point A to point B. So I'll take those raw tables out of that Postgres database, I won't denormalize them, I won't make any changes to their schema, at least not yet. Then I'm gonna land them as Parquet files in S3, and I'll let the processes after that transform them––which could mean you have logical views that do those transformations on a per-query basis, or you create other physical copies of the data afterwards. So you'd take the Parquet files and transform them into other Parquet files with your transformations applied. But the idea is that it happens after that initial load. You're initially loading, essentially, the raw data and then transforming from there. And again, this can have some pros and cons too.
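A minimal ELT sketch, assuming the raw table has already been landed: the transformation lives in a logical view rather than in the load step. Names are illustrative.

```sql
-- Raw data landed as-is; the cleanup happens at query time
-- through a view defined on top of it.
CREATE VIEW analytics.orders_clean AS
SELECT
  order_id,
  customer_id,
  CAST(order_date AS DATE) AS order_date,
  amount
FROM lake.raw.orders
WHERE amount IS NOT NULL;
```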
I mean, right off the top of my head, it's possible you might end up storing more data, because you're just landing the raw data. Data that would have been removed through data cleanup or data validation, stuff like that, is gonna be sitting in that raw landing zone. And on top of that, depending on where you're doing the transformations afterwards, you may be doing them in a more expensive place.
So let's imagine a scenario where we're applying ELT to a data warehouse. The idea is that you may be taking the raw data from your data lake, or from your Postgres database, and directly loading it into a data warehouse as a bunch of raw, still-normalized tables. So now all that denormalization, all that additional transformation work, is gonna be done with the warehouse compute, which may be more expensive. Essentially, you're doing those transformations with more expensive compute, which can increase your costs.
But again, there are pros and cons to this approach, and you'll see later on that Dremio does a lot to help make this––especially the ELT part––more practical, so you don't hit those sorts of costs and problems.
So pros of ELT––flexibility. In that case, I don't have to commit to how I'm transforming my data upfront. I just get my data into, let's say, the data lake, land it as Parquet files or Iceberg tables, whatever form you want, and then I can transform it from there.
I don't have to necessarily make that commitment upfront. So it's a little flexible. I can make changes. It's scalable because basically, each leg of the journey is kind of doing less work. So it's easier to kind of think through the scaling issues. It's easier to think through 2 simpler jobs than trying to think through one complex job, how to scale that, how to make that work, and meet SLAs.
Real-time analysis––essentially, since the data is going to land there faster, people who can use that raw data to do their analysis get to it faster, because they don't have to wait as long for those transformations to happen.
Okay, cons of ELT––data quality. Again, you're loading the raw data, so you haven't done any of those data quality checks. So for people who are using that raw data in analytics, that could cause problems.
Performance––now, this is more performance on the query side versus on the ingestion side. So basically here, we're having a faster ingestion because we're not doing those transformations in the ingestion. But we might be doing them on our queries because we might be doing those transformations through logical views, and now those logical views are doing those processes and doing that compute at that time. And again, that gets more expensive if you're doing it on a data warehouse compute, which oftentimes is marked up from just pure cloud compute. So then, you're paying whatever the cost of your AWS or your Azure instances, plus whatever the markup of the data warehouse is for that compute.
So best practices for ELT––schema-on-read. Essentially, this works regardless of what the data currently looks like in the Parquet files––and this is what happens when you're using something like Apache Iceberg, where the schema is tracked in the metadata. Let's say I've been writing Parquet files to that Iceberg table for a long period of time and the schema has changed, so the schema of the table is now different from, let's say, the original handful of Parquet files that were written. What happens? The engine applies the schema of the table to the data on read, regardless of the structure of the individual Parquet files. It'll say: okay, based on the Iceberg metadata, this is the schema, so I'm going to make sure the data comes back matching the schema as it was in this snapshot. So things like Apache Iceberg help make that schema-on-read approach doable.
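Roughly, schema-on-read with Iceberg looks like this: the schema change is recorded in table metadata, and older Parquet files written before the change are still read against the current table schema. Table and column names are made up, and the DDL syntax can vary slightly by engine.

```sql
-- Evolve the table schema in Iceberg metadata only; no Parquet files are rewritten.
ALTER TABLE lake.sales.orders ADD COLUMNS (discount_pct DOUBLE);

-- Older files don't contain discount_pct, but the engine applies the table's
-- current schema on read, so those rows simply come back with NULL.
SELECT order_id, amount, discount_pct
FROM lake.sales.orders;
```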
Okay. Metadata management––maintain metadata to track data transformations and ensure data lineage. Again, this is another place where Apache Iceberg is going to be really helpful. And Dremio actually has data lineage built into its semantic layer. So if you were to do ELT with Dremio, once you land your raw data in the data lake, all the transformations going forward could occur on the Dremio platform through different views or by creating new tables, however you would like, and all of that lineage of how you get from point A to point B will be visible on the Dremio platform. So when you're doing ELT, it'll actually let you capture the full lineage from the moment you've landed the raw data to whatever views you provide to your final end user.
Data governance––establish data governance practices to monitor and improve data quality. But the bottom line is, you want data quality checks, you want to make sure that the right people have the right access to the data. That way, as you land all this raw data, you're not accidentally exposing any data you don't want, since you're not cleaning up any PII ahead of time or doing any of that stuff, you're just landing the raw data.
Okay. But with either approach, the bottom line is, especially if you're using cloud infrastructure, you have three areas your costs are coming from. Storage––the more data I store, the more I spend. Compute––the longer those instances are running to process those queries, whether they're running Spark, Dremio, or some other proprietary compute software, the more you're paying for them; so the faster the queries can be done, the less time those compute instances need to be active, saving you money. And access––you're also gonna get charged for all those API calls to S3, so not only are you paying for the physical storage, but for the access to that data. And then there are also egress costs if you're moving data out of the network. So there are all these different places where you may get hit with costs if you don't plan intelligently how you set everything up and take advantage of the innovations that can optimize these costs.
And just to reiterate, when you're using, let's say, X data warehouse, you're paying a markup on that compute. And particularly if you're doing ELT into a data warehouse, you're landing the raw data, so you might be landing more data in the warehouse––you're paying for storage that's marked up. Then you're doing the transformations in the data warehouse, so you're paying for compute that's marked up. And generally, the data warehouse is going to be marked up more. So you're paying a premium price to do things that you could have done for less had you, let's say, done ETL and transformed the data before landing it there, where you may just be paying the pure compute cost, depending on what platform and tools you're using for that data integration. So, bottom line––possibly more expensive compute when you're using ELT. But I will show you that ELT can be done in a very affordable way when we use Dremio.
Okay, now, if the destination is the data lakehouse––again, if you're using Dremio as your data lakehouse platform––the raw data and transformed data exist on a lower-cost data lake. So, bottom line, you're just paying for that good, pure S3 storage cost, which is going to be cheaper. If you're using Dremio as your engine––Dremio Community Edition or whatnot––you're not paying any additional costs other than the instances that you're deploying. And then, generally, if you're using the Enterprise edition or Dremio Cloud, the markups on the compute are going to be less than a data warehouse's. So you're using cheaper compute and cheaper storage, and with either pattern the overall cost is going to be lower if the lakehouse becomes the center of your analytics universe.
The thing is, you don't have to transform things right away. For example, you land raw data, and you don't even necessarily need to move data, because with Dremio you can federate that data. So, for example, if I have 200 tables in Postgres, instead of having to export all those Postgres tables and land them as Iceberg tables in S3––which may be something I want to do, depending on different factors––I may not need to. It may be good enough for me to just connect Postgres to Dremio, and then I can access those tables. And if latency is an issue when accessing the Postgres tables directly for your analytic needs, Dremio has a feature called data reflections. When you turn on a data reflection, Dremio does create an Iceberg table under the hood that is physically represented. But you don't have to come up with a pipeline that keeps it in sync so that every time the Postgres table changes, the reflection updates. And your end users don't need to think about there being a separate copy of the table––they just query the Postgres table, and Dremio uses the data reflection to process the query performantly. Dremio handles the logic of making sure it always matches, regularly updating it to match that Postgres table.
So then you take a lot of that, and you make it more user-friendly for the end user, and also easier from the pipeline perspective. The reflection becomes your pipeline, in that kind of sense, and again, lower cost storage, lower cost compute relative to landing raw data in a warehouse and then doing your transformations from there.
And again, because Dremio federates all sorts of different sources, even data warehouses––you can connect your data from Snowflake and work with it in Dremio alongside data lake tables on your S3 or Azure storage, Postgres tables, MySQL tables, Iceberg tables. And with Iceberg you have read and write, so you can actually handle some of your ETL work right there from within Dremio and do something like incrementally update an Iceberg table using data from a Postgres table with an INSERT ... SELECT statement.
You know, if data is sitting on your AWS S3, your Azure Blob Storage, Google Cloud, Hadoop––all of these sources can connect to Dremio. At that point, you've moved no data, so you have no cost as far as compute to move that data, and no storage cost for duplicating the data. And then you can curate logical views––Dremio is going to be a very performant engine, so a lot of the time just the performance of Dremio will be enough. But in the situations where it isn't––maybe because pushing queries down to the particular Postgres or MySQL server you're connecting to causes some latency you want to eliminate––you turn on data reflections, and data reflections will periodically make sure they sync with that underlying source. They use Iceberg, so you're getting the benefits of Iceberg and its metadata to speed up that process. The other cool thing about reflections is that when you create them, you can actually choose partitioning and sorting rules. So it may be pulling the data from the Postgres table, but it will sort it and partition it for you, so you get a representation that really gives you crisp performance.
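As a sketch of what turning on a reflection with partitioning and sorting rules might look like (the dataset path and column names are illustrative, and the exact DDL can vary by Dremio version):

```sql
-- Ask Dremio to maintain a physical, Iceberg-backed representation of a
-- federated Postgres table, partitioned and sorted for faster queries.
ALTER DATASET postgres.public.orders
CREATE RAW REFLECTION orders_reflection
USING DISPLAY (order_id, customer_id, order_date, amount)
PARTITION BY (order_date)
LOCALSORT BY (customer_id);
```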
Now, what happens when your SLAs get tighter and tighter? That's when you start thinking: okay, maybe instead of using data reflections, I want to actually start landing my data directly into Iceberg. And that's where you might want to use Dremio to take the data from the Postgres database and, through CREATE TABLE AS statements or INSERT ... SELECT statements, move that data over to an Iceberg table. In that case, on the refresh side, you're always going to have access to the freshest data, because the way Iceberg works is that it always refers to the catalog, so it's always grabbing the latest snapshot. But then you would have to set a schedule––okay, we're going to do these inserts this often. You can do that in Dremio, you can do that in Spark, you can do that in Flink; however you land it in Iceberg is fine, as long as the catalog connects to Dremio. And Dremio has a built-in Nessie-powered catalog called Arctic that gives you other benefits, like automatic table optimization. So you can move that data into there, and then you get the automatic table optimization, and the whole process gets easier. But again, ELT in particular becomes a lot more practical because you can do all the transformations after the data becomes accessible to Dremio. You can do that all logically, so most of the time you have to do no movement of data, no copying of data, so you're not increasing your storage costs. Only when you need to, you turn on data reflections. And once you've exhausted data reflections, then you explore landing the data directly in Iceberg to get that extra crispiness.
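Here's a sketch of landing and then topping up an Iceberg table from a federated Postgres source, all in Dremio SQL. The catalog, table, and column names are hypothetical.

```sql
-- Initial load: materialize the Postgres table as an Iceberg table in the
-- Nessie-backed (Arctic) catalog.
CREATE TABLE arctic.sales.orders AS
SELECT * FROM postgres.public.orders;

-- On a schedule, append whatever arrived since the last run.
INSERT INTO arctic.sales.orders
SELECT *
FROM postgres.public.orders src
WHERE src.created_at > (SELECT MAX(created_at) FROM arctic.sales.orders);
```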
But again, you're able to do that, and Dremio can capture the lineage of all those transformations through logical views in its UI, so it gives you all the tools to really make ELT powerful. And if you're doing ETL, Dremio is still great, because you can still curate all your data and do all those transformations before you land the data in your data lake, and Dremio is still going to provide you top-notch performance when querying the data lake, plus tools like data reflections to squeeze out even better performance, and the ability to federate that data with other data you may have elsewhere that you may not want to land in your data lake. So whether you're doing ETL or ELT, Dremio is going to offer you ways to save money.
And then one last thing I want to point out is the C3 cloud cache that Dremio has. Again, one of those cost drivers we talked about was access costs. Dremio clusters have this thing called the Columnar Cloud Cache where, as you run queries, it picks up on query patterns and starts caching frequently accessed data on the nodes, with the result that you're not making as many calls to S3 or Azure. So not only are you reducing your compute costs because the queries are faster from the caching, but you're reducing your S3 access costs because you're not making as many API calls. You can sometimes see dramatic decreases in your cost, because you're attacking the compute, the storage, and the access when you're using all of Dremio's features and optimizations under the hood––really making whichever way you're moving your data, ETL or ELT, very manageable. A lot of times you'll read articles that really criticize ELT, because they're describing that scenario where you land all your raw data in your data warehouse, pay for that extra-expensive storage, and use that extra-expensive compute to do all your transformations. But ELT can be a very valuable pattern when you're using tools like Dremio that aren't necessarily going to surge your costs, when you take advantage of Dremio's particular architecture with that particular pattern.
Now, just a summary of what we're talking about––basically, with Dremio, you have a zero-copy, logical-view-first architecture. At the end of the day, you don't have to make any copies.
When warranted, because again, the performance should be pretty good up front, but when you need a little bit more, you turn on data reflections. And Dremio will manage that. So essentially, the data reflections are almost like turning on a data pipeline without you having to design it, it'll just automatically handle the movement and syncing of data.
If you're moving data from an Iceberg table to another Iceberg table, you now have incremental processing built into all of that, so the reflections will update incrementally––even better performance, faster, crisper freshness. Less data movement means lower compute costs and lower storage costs, especially when you're doing ELT; in this case, we're even taking out the export part because we're just connecting the data directly to Dremio. The Columnar Cloud Cache is going to lower those cloud access costs, and data reflections create those reusable physical representations built in Iceberg that can be partitioned and sorted to your specification. And you can actually build multiple reflections for the same data set. So if I know there are two categories of queries on the same table, I can create two reflections: one optimized for these queries and another optimized for those queries. Dremio's cost-based optimizer will know which reflection to use based on the query, so again you're reducing those compute costs. And it wouldn't necessarily recreate the entire table––Dremio can create pieces of reflections so that it can mix and match them, which is where you get the best bang for your buck. It makes them very modular and reusable, and they're reusable across multiple views: any views derived from a particular table get the benefit of those reflections as well. I'll talk a lot more deeply about reflections, I think next month, when I do a presentation on reflections in particular.
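For example, on top of the raw reflection sketched earlier, you might add a second, aggregation-oriented reflection on the same dataset so the optimizer can pick whichever one fits a given query. Names are again illustrative, and the syntax can differ by Dremio version.

```sql
-- A second reflection on the same dataset, pre-aggregated for dashboards
-- that roll orders up by region and date.
ALTER DATASET arctic.sales.orders
CREATE AGGREGATE REFLECTION orders_by_region
USING DIMENSIONS (region, order_date)
MEASURES (amount (SUM, COUNT));
```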
You can denormalize across disparate sources. So if you want to do denormalization, you can do that through logical views in Dremio, and you can do it across multiple sources. And again, you can always just turn on a reflection on that logical view to create a physical representation. Essentially, you've just turned on a data pipeline at the flip of a switch, because Dremio handles the periodic work of going back, re-running those queries, and recreating that physical representation.
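A quick sketch of that kind of cross-source denormalization as a logical view (source and table names are made up); a reflection can then be enabled on the view just as shown above if it needs acceleration.

```sql
-- Join a Postgres dimension table with an Iceberg fact table on S3 in a
-- single logical view; no data is copied to create it.
CREATE VIEW analytics.customer_orders AS
SELECT
  c.customer_id,
  c.customer_name,
  o.order_id,
  o.order_date,
  o.amount
FROM postgres.public.customers c
JOIN s3lake.sales.orders o
  ON o.customer_id = c.customer_id;
```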
The semantic layer in Dremio is gonna allow you to document and govern your data. Earlier we talked about monitoring as a best practice, particularly for ELT. Well, Dremio provides you with lineage built into the platform. It has all types of governance––role-based access controls, column and row masking rules––so you have robust control over who has access to what data. That way, you can have the raw data there and not have to worry about people having access to the wrong data. You just make Dremio the central access point for your data, regardless of where it lives.
And then, you have Dremio Arctic, which is a catalog. Essentially, if all your data is sitting in Iceberg using the Dremio Arctic catalog and you wanted to give somebody access to a copy of the environment, normally what you would do is create a copy of the data and say, "Okay, here's your copy of the data that you can play with, that won't affect my copy." But now you're duplicating your storage. With Dremio Arctic, you can literally just create a branch, because it has those git-like functionalities, and that branch does not create a copy of the data, yet any changes on that branch do not affect the main production branch that everyone else is querying. So you can create these experimental environments for other people or other teams without having to duplicate your data. Dremio really does a lot to help reduce the need to copy, which again blows up your storage costs. And then, through caching, through data reflections, and an overall super fast engine, you get reduced compute costs and better performance, and the semantic layer provides a very easy interface for your end users. So it just makes getting the data into your data lakehouse easy, and makes using it in your data lakehouse pretty easy.
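In rough terms, that branch workflow looks something like this; the catalog, branch, and table names are illustrative, and the exact syntax may differ by Dremio version.

```sql
-- Create an isolated, zero-copy branch of the Arctic catalog.
CREATE BRANCH experiment IN arctic;

-- Changes made on the branch are invisible to anyone querying main.
INSERT INTO arctic.sales.orders AT BRANCH experiment
SELECT * FROM arctic.staging.new_orders;

-- If the experiment checks out, fold it back into production.
MERGE BRANCH experiment INTO main IN arctic;
```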
So hopefully, you guys enjoyed this presentation. Again, my name is Alex Merced. You can follow me on Twitter at @amdatalakehouse, and again, subscribe to any of these podcasts, ideally all of these podcasts, on Spotify and iTunes.
And now we’re getting into Q&A, so I'm gonna go take a look at the chat and the Q&A box to see what questions people may have…And I see there's a question to share the QR code again…So let me go back to that slide…so that way, you guys can see that QR code for the book…and there you go. So I'll leave that up, for anyone who wants to take a picture of that. Okay?
Okay, so here we go. Why is Dremio's lakehouse storage cheaper than a data warehouse? Generally, if you are using a data warehouse, oftentimes the storage in the data warehouse is marked up. So you may be using S3, but when you're using Snowflake's S3, you're paying the Snowflake price, per se. I'm just using that as an example––there are other warehouses, choose whatever warehouse––and the same thing on the compute side. Now, when you're using a data lakehouse pattern––and that's any lakehouse pattern, that's just the basic idea––you're landing your data, keeping it, and doing all your analytics from the lakehouse directly on your own, let's say, S3 storage or your own Azure blob storage, instead of using the data warehouse's storage. Then you're paying those raw storage costs; you're not paying a markup on that. So that's one level where the data lakehouse makes it cheaper. Where Dremio makes things cheaper is that it reduces the amount of copying you need to do, on one level through data reflections––there's another question about data reflections, so I'll explain that more in a moment––and through all these different levels Dremio has, it makes it much easier to keep multiple versions of the data without copying it, which saves you storage cost. And Dremio's engine is very performant, mainly because it's based on Apache Arrow, [and it] has a lot of other really cool under-the-hood tools. There's actually a really good video from our most recent Subsurface conference, called "How to run thousands of concurrent queries on Dremio," where one of our SAs goes through a scenario where they had thousands of concurrent queries running performantly on Dremio, and the architecture behind it and what enabled it––it really gets into a lot of the performance optimizations in Dremio. I'll make sure that's linked in the follow-up email.
And also the C3 columnar cloud cache––that's gonna reduce the access costs, because you're basically caching Parquet files or micro-blocks of data so that you're not having to repeatedly access them from S3. And it's also just faster to access them directly on the node. And then you have data reflections, which are kind of like materialized views––I'm gonna do a whole presentation outlining the differences between materialized views and data reflections. But with data reflections, essentially, if I turn on a data reflection on table A, what it's going to do, depending on the details, is create a physical Iceberg version of that table in my data lake storage, so in my S3. It will be partitioned the way I specify, and that will be used to speed up queries on that table. And if I create views from that table, that reflection will also be used to speed up any queries on the views. So it's not like materialized views, which are matched by namespace––with those, if I have two views on the same table, each of them would need its own separate materialized view. Here you have a materialization that can be used at multiple different levels, so you reduce the number of materializations you need, again lowering the amount of data duplication, and it's automatically synced for you. Generally you can set a schedule––there are settings you can change, but I think the default is every 5 minutes––and it will go check to see if there are any updates and then update the reflection, so it keeps using the reflection to speed up those queries.
Okay. How can I compare the main branch of my Iceberg data and the dev/testing branch in order to detect data changes? In your queries, you can specify which branch you're querying. So one thing you could do is run two select-all statements, one querying the table from one branch and one querying the table from the other branch, and then do something like a union-all and get the difference between the two. Then you'll see the differences between the two tables as far as the actual records. If you want to see the difference in which Parquet files are on your storage, you can use your Iceberg metadata and query the list of files that are currently part of the table, and then do the same thing: one select-all for all the files in table A, another select-all for table A on the other branch, and compare them to discover the differences. And you can do that in a single query right there from the Dremio UI! So it's pretty convenient, because in the query you can specify which branch you're querying, and you can run multiple queries.
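Here's a sketch of that union-all approach in a single query, using hypothetical table and column names: rows that show up exactly once across the two branches are the ones that differ.

```sql
-- Rows that exist in only one of the two branches.
SELECT order_id, customer_id, amount, COUNT(*) AS branch_count
FROM (
  SELECT order_id, customer_id, amount
  FROM arctic.sales.orders AT BRANCH main
  UNION ALL
  SELECT order_id, customer_id, amount
  FROM arctic.sales.orders AT BRANCH dev
) both_branches
GROUP BY order_id, customer_id, amount
HAVING COUNT(*) = 1;
```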
Another note is this––you'll know if there are data changes, because you do have an audit log for your Dremio Arctic tables. You can go to the Dremio Arctic UI and actually see all the commits. You can see who made the commits, and generally there'll be a commit message that gives you an idea of what was changed. So if you see no new commits, you'll know there have been no data changes, and if you do see new commits, you can investigate further from there. You don't necessarily have to go create the query right away; you can take a quick peek at the UI, look at the commit history, and then decide whether you want to do further investigation.
Okay. "Hi there! I have missed most of the session, but want to catch up with the rest of all the beneficial information you shared. Can I get the recording of the session?" Yes, the recordings of these episodes are always posted on Dremio's YouTube channel––that's youtube.com/dremio––within 24 to 48 hours of the broadcast. It will also be posted as an episode of Gnarly Data Waves on the podcast, which you can subscribe to on Spotify or iTunes, so you can always find all the previous episodes there. So this episode will be posted there and, again, also on the YouTube channel, youtube.com/dremio. And I've also just posted several really cool videos on the Dremio YouTube channel over the last week or so, including one video where I show you how to build a little lakehouse on your laptop, so that you have a little laptop environment where you can practice and try things out. So check all that out at youtube.com/dremio. With that, let's see here––do I see any other questions?
And yeah, I think that's all the questions. Thank you all for coming this week; come again next week. For those of you interested in what I was talking about with Dremio Arctic and the branching and merging, next week we'll be talking about Dremio Arctic, so be there to catch that. And again, we have so many more episodes coming up, so you can always go to dremio.com/gnarly-data-waves to see what episodes are upcoming and register for them.
And again, all the episodes will always be on the podcast and on the Dremio Youtube channel. So you can always just subscribe there to make sure you're always aware of what's coming out. And thank you guys very much. Have a great day, and I'll see you all soon.