May 3, 2024

Delivering the Fastest Time to Analytic Insight with Dremio Reflections

One of the hallmarks of the Dremio Unified Analytics Platform is Dremio’s innovative SQL query acceleration technology, Reflections. A Reflection is an optimized relational cache of source data that can be used to speed-up data processing by 10 to 100x. Learn how you can leverage and optimize Dremio Reflections for query acceleration at scale.

Gain a better understanding of how Reflections work, how they expedite query execution and power dashboards and reports. Discover how to manage Reflections efficiently at scale, and how to optimize performance across diverse datasets.

Topics Covered

Lakehouse Analytics

Performance and Cost Optimization

Sign up to watch all Subsurface 2024 sessions

Speaker

Benny Chow

Principal Software Engineer

Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Benny Chow:

Today I’m going to be talking to you about Dremio’s query acceleration technology, and I think hopefully by now you know it’s basically called Reflections. All right, so as part of the agenda today, it’s going to be a technical talk, and I’m going to be showing you actually a lot of demos in the UI as we go through this, but we’re going to start with basically what are Reflections and how they’re kind of different from materialized views. Then we’re going to go into basically query planning, which is about matching Reflections into user queries, and then finally Reflection Management, which is basically about keeping Reflections up to date. All right, so let’s just get into it.

Dremio Reflections

So what are Reflections? Now I like to think of Reflections in terms of the use cases that they can solve, right? So when you have a complex semantic layer and you’re running queries where you’re passing through tens to hundreds of views or tens of hundreds of joins, querying from billions of records of data, there’s no way that’s going to perform well for DASPR queries. So what you have to do is basically pre-compute this data, do it up front once so that your DASPR queries can be potentially a sub-second or in the seconds. So a key use case for Reflections is for BI. Now another use case is when you’re querying, say, millions of CSV JSON files, which is really not ideal for analytical querying. And so with Reflections, you can basically optimize the physical storage by basically creating a materialization that’s in, say, parquet data file format. And furthermore, you can partition and cluster the data so that you basically lay out the physical file so it’s ideal for query access, so that you achieve the maximum, say, partition and file pruning, right?

And finally, another use case is when you’re querying, say, Oracle data, join to like data out in S3. And that could be an expensive, like federated join. And so you want to materialize that so you don’t do that every single time when the DASPR runs. So these are all common use cases for Reflections. Now when talking about Reflections, oftentimes people will say, oh, Reflections are materialized views. But Reflections are really much more than just a materialized view. I like to think of Reflections as almost like a relational cache, right? And the reason why I describe it that way is because Reflections, when we match Reflections into user queries, we think of these as basically query trees, right? And so you can imagine that like when a user runs a SQL statement, right, we turn that into a query tree, right? And basically with Reflections, we’re basically matching subtrees of the user query with subtrees of the Reflection. And so we’re basically doing all these like query rewrites.

And so the key point here is that from a user’s point of view, you don’t have to reference the Reflection. They don’t have to know anything about these Reflections. The user could query against the view that the Reflection is materialized against. Or the user could just directly run the SQL inside the view, and our planner will match that query tree against the query tree of the Reflection.

Reflection Example

So let me show you an example of that in the UI. All right, so let’s just quickly create a Reflection. So here I have a view that’s based on the TPC-DS query number three. And so this is a very straightforward SQL statement. I mean like this is a fact table joined to two dimension tables on this joint key and that joint key where we’re grouping by these dimensions, and then summing by this particular measure. So this is a really basic query. And then let’s go ahead and create a Reflection. So I’ll turn that on here. All right, so it’s materialized. Let’s go ahead and then let’s run a query against this Reflection. All right, and then if we click on the query profile, we’ll see that it was indeed accelerated. All right, so this is a very straightforward case where I directly queried the view that the Reflection was based on. So of course, the planner chose to use it, right?

But I could also directly run the SQL that the Reflection is based on. So let’s take that whole SQL statement and directly run it. And here I’m not referencing the view at all, right? And so we run it and look at the query profile. We’ll see that the Reflection was still matched into the user query, right? And so this is where I talked about where we take the query tree of the user query and are able to match that with the query tree of the Reflection. And so, you know, if you look at the profile details and look at this tab right here, our planning tab, I’m not going to go into this in super detail, but basically this was the query tree of the Reflection, I mean, of the user query. And you can see how we ended up rewriting that to use just a simple scan on the Reflection, right?

What to Materialize?

All right, let’s go back to slides. Cool. All right, so what is it that you want to materialize in a Reflection, right? So you basically want to materialize expensive operations, such as scans on, you know, slow file formats, filters, projects, joins, and aggregations. But basically the idea is that you want to take all this stuff, all these operations out of query time and basically compute it once and reuse it.

Now as far as what Dremio Reflections support, there is very, very few limitations as far as what can be inside the SQL constructs, right? So we support any join type, any aggregation type, you know, including nested aggregates such as group bys on top of group bys, window functions, set operations, time travel queries, UDFs, user defined functions, and more recently, row column access control policies. And what that means is, you know, you can have like security policies defined on a table, right? So that like you can restrict access to different roles or different users. And what we can do is as part of material, when we materialize a Reflection that contains this table, we’ll actually dynamically pull that filter out of the materialization. And then when the user queries something that could use this Reflection, we dynamically reapply this filter at query time. And so now what you’ve achieved is that like, you know, the same materialization could be used by different users, different roles while still enforcing your access control policies.

Now as far as like limitations, you know, stuff that we wouldn’t support is if you have something that’s like non-deterministic in your SQL. So for example, if you have random function in there, right, that’s something that we can’t materialize, right, and it doesn’t make sense. So behind the scenes, Reflections, the materialization storage table is an Apache iceberg table, right? And we also use a nested catalog, you know, for maintaining the location of the tables.

Query Planning with Reflections

All right, so let’s talk about the second agenda item, which is basically query planning with Reflections. And really, this is about how we match a Reflection into a user query. So you can imagine that planning queries is a complicated process, right? And our planner goes through multiple phases. And probably the most complicated phase is called logical planning. And logical planning is where we do the Reflection matching. So when I talk about Reflection matching, I like to break it up in basically three steps, right? So the first step is you need to figure out which Reflections are the ones that you can consider for this query, right? And in general, these are going to be Reflections that have overlapping tables or views with the user query, right? Once we have that initial list of candidates, right, then we have to go through the much more expensive process of trying to match the Reflection into the user query, right? And one of the first things we’ll do is we’ll, for example, normalize the query trees. And so, for example, we will do something like we’ll destroy join orders, right? So for example, if you’re joining like 10 tables, right, we’ll destroy the join orders that we can match a subset of those joins against the Reflection, right? Another thing that we’ll do to match queries is we’ll rewrite the original user query into alternative forms so that we can say match Reflections into this new form, right? And you know, one example is called the AgJoin user query alternative. And actually, I’m going to demo and talk about this one on the next slide. And this is how we match the aggregate Reflections, right?

So once you do this, then we try to actually match and unify the Reflections into the user query. And what that means is like now we’re looking at the query trees of the Reflection bottom up and trying to match that into the user query. And one of the important steps here is called unification. So the idea here is that like if your Reflection contains more data than what your user query has, right, what your user query needs, right, then you basically have to add an additional filter on top of the materialization storage table before you can put that into the user query, right? And so that’s an example of a simple unification, you know, rule that we’ll do. So once you have all that, then you have a bunch of replacement plans, right, where like you’ve got Reflections that could be used in different parts of the original query, user query tree with different alternative forms. So all that is then given to our cost-based optimizer. And here we’re basically searching through all these different plans to try to find the best one, right? And basically, we’ll have a cost model and then we’ll have statistics, right, on like for example, the row counts of each table of the Reflection versus the base table. And then we’ll also use things like partition stats, right, to basically estimate, you know, what is the row count of this table after filtering, of course, right, so filter scan. And then we’ll basically, in the end, produce the best plan, right? And remember, the best plan can mix and match Reflections, right, in different parts of the query trees to find one that has the lowest cost at the top of the plan.

Matching an Aggregate Reflection

All right, so let’s talk about an example of matching an aggregate Reflection into a user’s query. And I will also demo this after this. So let’s imagine we have a Reflection on top of a fact table sales, right? So here, I’ve got fact table sales and I’m basically summing a measure and then grouping by the product ID. Now here, I come in with a user query that joins, that basically takes the sales table and joins with the product table, obviously on the product ID key, right, and then you sum by the measure and then group by the product name. So if you look at this on face value, this user query doesn’t really look like this Reflection, right? But what you notice is that this aggregate operator can actually be split and decomposed below this join, right? And so when you do that, then you basically now need to aggregate by the join key, in this case, which is product ID. And so once you do that and rewrite this query, now this side of the query of the join now looks a lot like the Reflection right here, right? And so then our planner is able to put this Reflection into this left side of the join, right? So basically, this turns into a scan on the materialization storage table. Optionally, you can also do this on the right side of the join too, which is shown here.

All right, so let’s show how this looks like in the UI. All right, so let’s go back to my view again. And what I’ll do here is I’m going to go turn off that first Reflection I created because I want this query to match using what I just had described, right? So if we look at this view, right, look at the SQL statement, basically our fact table here is store sales, right? And we join to these two dimensions on these two keys, okay, these two join keys. And then what we’re doing is then summing discount amount. So let’s go ahead and create that aggregate Reflection on top of store sales.

Okay, so here’s store sales, go in there, go to Reflections, let’s create an aggregate Reflection. Let me go clear all this stuff out. So we’ve got date key, item key, and then there was a discount amount, all right, okay, now it’s materialized. Let’s go back to the user query, I’ll just directly run the view, okay. And then let’s check the query plan, and now we see that the aggregate Reflection, right, was matched into the user query, right? So this is, you know, you can imagine it was a totally different, you know, query, like we pushed the aggregate below the join, right, and then put the material, then we put the Reflection, you know, underneath the join. And you know, all this, all this that I just described, all this rewrite logic, the user query alternatives, the user replacement plans, they can all be seen in the acceleration profile, right? So this, this is all this, you know, once you turn on verbose profile, like this information is basically logged here, right? We talk about considered, matched, chosen, and then, for example, you know, the replacement plan that we generated was something called a, you know, ag join, you know, alternative.

Reflection Management

Okay, all right, so now I’ve talked to you a bit about matching, let’s go on to the final topic, which is Reflection Management. So Reflection Management is basically about, like, keeping your Reflections up to date. But like, I like to think about it as, like, trying to do it so that you do it such that you minimize your compute and storage costs, right? Because if you’re always rebuilding your Reflections every time the data changes, that can get expensive, right? So what are, like, the important cost considerations, right? The first and foremost, probably the most important thing is really is when you’re refreshing your Reflection, right, basically, like, updating it with the latest data, is it a full or incremental refresh, right? Obviously, full is going to be much more expensive than an incremental refresh. So, you know, what you want to achieve is try to get incremental refreshes.

Now, our system will pick the best method for you, but then there’s certain conditions under which, okay, we can do an incremental refresh. And I’m going to talk about that on the next slide. The other thing to think about is, well, how long did a refresh take for a given workload management configuration, right? So when you’re refreshing your Reflections, you can set up workload management rules to target a specific engine, you know, which then has a number of nodes and cores per node, and also, like, a concurrency level set on the engine. And those things will really determine, like, how long it takes to refresh a Reflection or how many can you do in parallel, right? Other things are like, you know, do you have engine replica scaling set up so you can scale out engines, right? So that you can, you know, if you’re refreshing more concurrent Reflections, it will automatically scale up the engines for you.

And then another important topic is, well, how much storage are you using for each materialization, right? So, it’s not uncommon for your materialization tables to be like, you know, 100 gigabytes of disk space, right? But the thing to keep in mind, though, is that, like, it’s not like, you know, a Reflection is just 100 gigabytes, right? I mean, oftentimes, you’ll have 100 gigabyte Reflections sitting on S3, right? But, you know, in the background, there’s another, you know, Reflection being built or materialized, you know, with the latest data, right? So, you often have, like, copies of the materialization storage table, right? And, you know, the planner will have to drain queries off of the previous materialization before it moves on to the new one, right? So, you have copies of this, and there’s ways to control, like, how many copies do you have to have. And then, really, another important aspect, obviously, is if the Reflection is actually being used, right? So, you do all this work to materialize it. You need to know, right, what is the impact of this Reflection on user queries, right? And the impact is not just that it’s being used because, you know, it’s being, you know, it’s, you know, it’s a seller of queries, but there could be also a negative impact when, for example, there’s too many Reflections in your system, and, like, you spend too much time considering matching and not using your Reflection, right? So, you need to know about that as well.

Reflection Management Example

All right. So, how do the users see all this stuff, right? Actually, it’s quite easy. So, let me show you how it looks in the UI. So, we have a nice Reflection management UI where – okay, so, here is the last Reflection that I just built. It’s the aggregate Reflection that we just looked at together. So, here we can see, for example, what is the current Refresh status, right? So, is it scheduled, or is it currently refreshing in the background, or is it maybe a manual refresh, something that, like, the system doesn’t run automatically for you? Here we see when was the data last queried from the source table, or basically the last refreshing table, right? So, this tells you kind of, like, what is the – you know, if your data is stale or not. You know, then you have, like, metrics about, like, for example, record counts, the current footprint, total footprint. And remember, total footprint could be more than a current footprint when you have multiple copies of the Reflection of the materialization stories table present.

There’s your last refresh. How long did it take, right? I mean, big Reflections can take, you know, 30 minutes. And then, whether the refresh method that we automatically determined for you was full or incremental. And then, Reflections also have an expiration, right? So, you know, by default, I think it’s, like, three hours. So, you know, you can say that, okay, you know, end user, your data will never be more than three hours stale, right? And that’s what this guarantees. And then, finally, you have all the different counts, right? So, this tells you, like, how many times was the Reflection actually used in a query versus how many times was it matched, right? So, match produced replacement plans, which goes into logical planning. And so, this gives you kind of insight into the effect on planning time and also usage.

And then, finally, this link right here, you can click into that. And then, this is the actual materialization job, right? So, this is the job that was used to, you know, to create the Parquet files and then, you know, generate the iceberg table or in case it was an incremental refresh, then, you know, what was done to incrementally refresh the Reflection. One of the cool things is, in the profile, we also show the refresh method. So, here, there’s a section called Refresh Decision, and it tells you exactly whether we did a full or incremental. If we did, if we could have done incremental, what method did we use? Was it append-only versus snapshot-based? And so, also, if it was a Reflection where we couldn’t do incremental because it contained, you know, for example, like an outer joint, right, then, you know, we would tell you that here.

Incremental Refresh

All right, okay, let’s talk a bit about incremental refresh. So, you know, ideally, you’re doing, you’re refreshing your Reflections incrementally, right? But whether we can do an incremental refresh versus a full depends on a couple of factors, right? The first one is, what is your source data, right? Meaning, like, what is the data that’s being put into the Reflection, right? So, if your source data is append-only, as in, like, you know, you’re ingesting, like, say, event log data, right, then we support any source, meaning that it doesn’t matter if it’s JDBC, MongoDB, if it’s the Iceberg Table, right? As long as it’s a, basically, you’re impending data, then we can support incremental refresh, right?

Now, if there’s updates and deletes, right, and the source tables are Iceberg, right, then we support something called a partition-based refresh, right? And what that means is that we’re able to look at the source data, get the snapshot IDs, right, and then figure out what partitions in Reflections have changed and, basically, rebuild those partitions, right? And this is how we can handle the updates and deletes. As far as, like, the SQL constructs that are allowed inside a Reflection, we support, you know, these aggregate functions, right? We support an unlimited number of inner joins, right, and then a single group by. And so, as I said before, we’ll select the best method for you when the Reflection is refreshed, right? As in, like, you know, can it be incremental versus full? And if it’s incremental, should we just do – can we do append-only or update and delete, right? Because these two have different, you know, materialization plans, right, associated to them.

Full Refresh

All right. Now, as far as a full refresh, full refresh can be expensive, right? And so, you know, the key thing about full refresh is you should only be triggering them when you need to, right, such as when there’s data changes. And so, we have different refresh policies that we support. So, for example, like, we have an interval-based refresh policy. So, for example, like, every hour or every 30 minutes, right, and this is good for, like, if you’re streaming data into a source table, right, and you’re streaming data into a source table and then you just want to update this table on a periodic basis, right? We also support schedule-based. So, if now you’re doing batch processing, right, so, you know, you’re doing ETL loads nightly, right, then every weekday at 5 a.m., you know, kick off the reflection refresh, right? And then, of course, you can do it manually using, you know, the REST API and also the SQL API. Now, we also have optimization to make full refresh less expensive, right? So, we can skip refreshes when we detect that there’s no data changes, right? So, for example, if your reflection is based on all iceberg tables, we will track all the snapshot IDs, you know, that were, you know, that were used to materialize reflection. And then on the next refresh, we’ll compare the latest with the ones that we tracked and basically ditch it. If there’s no actual data change, right, then we won’t do a refresh.

Now, another advanced optimization that we have is that reflections can actually wait on other reflections to refresh. So, actually, you might not know that, like, when you materialize reflection, our planner can see that the job that is used to materialize reflection can actually be accelerated by another reflection, right? And so, when you have this sort of chain of reflections, when you want to reflect the top of a reflection, it would be better to wait for the bottom ones to refresh first before refreshing the top ones, right? So, wait for your upstream dependencies first, right? So, there’s this concept of waiting or pending statuses on reflections.

Reflection Custom Tips

Okay. So, that’s what I have to talk about for reflection management. I just want to end with, you know, some tips on how to create reflections, right? So, you know, if you’re following Dremio’s semantics layer best practices, then you’ll typically have your layered, you know, views, right? So, you might have staging, business, and application. And so, generally, you want to create reflections on the top-level shared views, right? Because these are the views that are accessed, you know, by, for example, the end users, the applications, the dashboards. But also, more importantly, these are the views that contain, like, hundreds of joins and, you know, hundreds of scans, right? So, you want to basically, again, you know, pre-compute all that.

Another tip is, you know, don’t just enable raw reflections directly on source tables, right? Because what you’re really just doing there is just copying the data, which might not be providing much value. Instead, you really need to think about, you know, whether you can partition or sort the data so that you optimize the file pruning, right, for your queries, right? That’s what you should be thinking about there. Now, you know, we also have a reflection recommender, right, which you can basically, you know, feed your slow queries to, and then it’ll give you recommendations, right? So, it’ll look at the query tree and basically say, oh, okay, I can recommend you to create a raw reflection on this view, or maybe it’ll recommend, it’ll see that, okay, what I can do is actually create a view off to the side, right, and then create a reflection on that view and then have the planner, you know, match that view’s reflection into the same user query, right? And finally, you know, and I showed you this already, is that if you need to troubleshoot what’s going on with reflection matching, then you go to acceleration profile, and you need to have both profile enabled.