52 minute read · May 1, 2018
Dremio 2.0 – Technical Deep Dive
Webinar Transcript
Kelly:
I’m going to start presenting here. This is a session on the Dremio 2.0 release from last week. If you’re here for a different webinar, sorry about that. Hopefully you’re in the right place. If for some reason you drop off or something like that happens, you’ll get a recording of the session forwarded to you afterward. Hopefully you’ll stay on and ask lots of great questions along the way.

Okay, great. So let’s get started. We have a ton of material to cover, and we’ve broken it down into a few different sections. I’m joined today by our products team, led by Ron, VP of Engineering, and Can, who’s the key product manager here at Dremio, responsible for the 2.0 release.

We’re going to start by looking at the REST APIs and some updates to the jobs user interface. One of the things we’ve heard from lots of customers, and lots of folks in the community, is: we’d love to be able to interact with Dremio through a RESTful interface of some kind. We now have a pretty rich set of capabilities in the product that lets you do everything from administering Dremio to issuing SQL queries. We still have other things we plan to add in the future, but there’s a lot here now in 2.0 that should make it possible to provision a Dremio cluster and interact with it from an application over REST, in addition to the existing ODBC and JDBC options. There’s documentation for the REST API, and there’s also a new tutorial on using it that I encourage you to take a look at. If you google “Dremio REST” you’ll find the tutorial and a link to the documentation, but here they are for us to check out. So that’s what’s new in the REST APIs. If you have any questions about that feature, post them to the Q&A and we’ll make sure to get to them at the end of the webinar.

On the jobs screen, one of the key things we have is the ability to see all the jobs that have been run in the cluster, no matter how those queries were issued, whether through the Dremio interface or through an external client over ODBC, JDBC, or REST. Can, do you want to talk a little about some of the updates here? What’s changed?
Can:
Yes, one of the biggest things we’ve done is make sure that the jobs page shows more detail about the state of individual jobs. Now, instead of only showing whether a job is running or completed, we also show specific statuses for each job: when it’s starting, when planning is going on, whether it’s queued, as well as all of the other states we already had. This should help both users and administrators get a much clearer picture of what’s going on in the cluster in terms of the workloads that are active.
Kelly:
What’s the second point here, about how queries are being truncated?
Can:
Yes. This is more of a guard against runaway situations. We’ve seen some of our users running very large queries in the UI. In Dremio, when you work with a query in the UI you can preview it, and at that point we have special sampling pushdowns for each of the sources you work with, so you’re working on a sample. But when you actually run a query, the result can be more than what a human could consume in the UI, and there’s not really a point in returning all of it. What we’re doing now is automatically truncating the results to a reasonable size so that you don’t use up all the space on your cluster. This only affects queries run from the UI, and it should actually improve performance for very large queries.
Kelly:
If I submit my queries over ODBC or JDBC, this does not apply?
Can:
Of course. Yeah.
Kelly:
This is just for people working in the Dremio UI.
Can:
You got it.
Kelly:
Great.

Alright. Another thing we’ve added to improve the experience for people writing SQL is automatic error highlighting. Here’s an example I threw together while we were waiting for everyone to join the call. I’ve used an incorrect column name: I added a stray underscore to one of the column names in the data set, just as a demo. You can see that the problematic part of the query is underlined and highlighted in red, and the error message below the SQL window tells you exactly what the issue is. It’ll be easier for you to write queries, and as you make mistakes, Dremio will help guide you to the right place. This is just the beginning; we have a lot of exciting things coming in this area to keep making Dremio a world-class SQL editor.

Let’s talk about some changes to data sources, specifically data lakes. In this release we added support for Azure ADLS, which means you can use ADLS both as a source for data and as your reflection store. You can persist reflections in ADLS, which effectively brings it on par with S3, which we’ve supported since version 1.0. In this release, what are some of the things we did to make S3 work better? Can, do you want to talk about that?
Can:
Yeah, of course. We did a couple of things. We’ve updated our logic for handling bucket metadata, so adding a new S3 store should be much snappier than it was before, even if you have thousands of buckets. The second big update is around working with reflections on S3. Reflection creation on S3 is now much more memory efficient, and we also made some changes to the way we write reflection data to S3. You should expect a much better experience, both working with a large set of buckets and working with reflections on S3.
Kelly:
I’ve personally experienced these improvements firsthand. It’s nice to see; I think it’s much snappier.

Okay, here’s the big exciting change in this release: Starflake reflections, as well as external reflections. Let’s talk through it. As a reminder, what the heck is a reflection anyway?

Reflections are how Dremio accelerates queries. They’re optimized representations of data; I think of them like indexes in a database. You can add a reflection to a data set in Dremio and users don’t need to change their queries. They don’t connect to some new physical object. They keep their queries, but suddenly they start to go faster. You can have many reflections, of many different types. We have always had two core types: aggregation reflections and raw reflections. Aggregation reflections are really useful for speeding up BI queries, with lots of grouping and aggregation. Raw reflections are about speeding up needle-in-the-haystack type queries.

Let’s talk first about external reflections. What’s the purpose of external reflections and what can you do with them?
Can:
Great question, Kelly. External reflections are very useful for customers who already have and are maintaining ETL jobs or data warehouse stores. They’re already aggregating tables across some dimensions; there’s already logic for that, and they want to reuse it in Dremio without actually having to rebuild that logic inside Dremio. In many cases these jobs are already maintained and already running. The idea is to be able to effectively integrate these existing digests, summary tables, or partitioned data sets that you already have into Dremio’s optimization process.

When a query comes in, we not only leverage what we call internal reflections, basically the raw and aggregation reflections you have inside Dremio, but also the reflections that might live in external systems, giving you additional flexibility and performance.
Kelly:
So I could do something like use Hive to create an aggregated representation of a data set, or maybe I could do that in Spark, and I’ve already got that process running. With this, I could register that with Dremio, and now Dremio can consider it as one of the options to execute a query, if that’s the most cost-effective way?
Can:
You got it. You got it.
Kelly:
That’s great. For external reflections there are really two components: building them, and then seamlessly substituting them. So if you’re already invested in various digests, Dremio can seamlessly substitute them, so your users don’t have to know the names of all the different digests. You just submit the queries out of Tableau and you get the benefit.
Can:
Yeah.
Ron:
One of the things that you can see here in this metadata table, with a DESCRIBE on sys.reflections, is that there’s a column at the very end for external reflections. If you want to understand the complete set of reflections the query planner has available, the catalog of those reflections is contained in the reflections table. As you can see, I selected some of the columns out of it, just to give you a sense of what that looks like.

Mike Ferguson asked a good question here: “Can Dremio gather statistics on external reflections?” Today it does not, and that’s a really important point. That’s one of the differences between Dremio merely substituting a reflection versus being the manager of the reflection; one of the advantages of having Dremio manage a reflection is those statistics. The only thing Dremio does when you create an external reflection is make sure that the schema matches what you express when creating it. There’s a slide on that, so I’m getting ahead of myself, but I see several questions on it, so I thought I’d address it. That’s Mike’s question.
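For reference, a query along these lines surfaces that catalog. This is a sketch: the column names here are illustrative, so run DESCRIBE to confirm the exact schema in your version.

```sql
-- Illustrative only: column names vary by version; DESCRIBE shows
-- the exact schema of the reflections catalog.
DESCRIBE sys.reflections;

SELECT name, type, status, num_failures, external_reflection
FROM sys.reflections;
```
Kelly: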
Okay.
Ron:
Go ahead. Sorry for the interruption.
Kelly:
Great.

Alright, so now let’s talk about how you actually use this thing. There’s not a piece of the user interface that allows you to set up an external reflection; it’s something you need to do via SQL today. So here’s the syntax. There’ll be a quiz at the end if you remember the syntax. No, I’m just kidding, there won’t be a quiz. But it’s in the documentation, so you can [crosstalk 00:11:22] register that particular data set as an external reflection.
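As a rough sketch of that registration (all dataset and table names below are made up for illustration; check the documentation for the exact form in your version):

```sql
-- Register an existing summary table, e.g. one maintained by a Hive
-- or Spark job, as an external reflection on a Dremio dataset.
ALTER DATASET myspace.yelp.reviews
CREATE EXTERNAL REFLECTION reviews_by_city
USING hive."default".reviews_by_city_digest;

-- Dremio only verifies that the schemas match; it does not build or
-- refresh this table itself.
```
Can: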
Yes.
Kelly:
Any other-
Ron:
There’s a quick question here that Alexander is asking: how does Dremio figure out if a reflection is suitable for a certain query?

Real quick: we take any query that comes in, whether from a BI tool or one you enter yourself, and we parse it into a query plan. We then look at the query plans of the various reflections and identify whether the plan for a reflection can be substituted for a subset of the original query’s plan, and that subset might be a perfect match for part of the query. We call that process substitution, and if a substitution happens, your query is accelerated. You’ll see that in the form of a flame icon in the job details.
Can:
One interesting note there, actually: during the substitution process, it does not have to be a one-to-one match. If Dremio can understand that it can derive a certain metric or dimension the query needs from an existing reflection, it will rewrite the user query to leverage the reflections. So it’s not only one-to-one matching; it’s also rewriting and trying multiple versions of the same query to see if it can match.
Ron:
I see Gary’s asked a question about what the limitations are on the size of a reflection. There really isn’t a limit in terms of number of rows or physical size. Reflections are stored in a file system, which could be HDFS, S3, ADLS, or a NAS, and that’s one of the really nice features here: you can scale out with your cluster size. The reflections are distributed across the file system so you can scale out your workload as much as you like.
Kelly:
One question on external reflections: what sources are suitable for external reflections? What can I use to store an external reflection?
Ron:
Pretty much anything, actually.
Kelly:
Yeah.
Ron:
Anything we support. It could be Oracle, it could be Elasticsearch, it could be HDFS.
Can:
Yeah, exactly. You just need to tell Dremio how your summary or digest table relates to the original data. Once Dremio understands that relationship you’re good to go. [crosstalk 00:13:43]
Kelly:
So, a couple of things. There are already production clusters with well over 100 nodes using reflections; that gives you a sense of the scale reflections operate at. The initial use of external reflections we’re seeing from some of our enterprise users has been on HDFS, because they already had a Spark job or a Hive job creating digests before they became aware of Dremio.
Ron:
I suggest maybe we keep going. I know there are some more good questions; we’re going to get to them at the end. I want to be sensitive to time and all the stuff we have to cover.
Kelly:
Yeah. So let’s talk about Starflake reflections, because I think this is … What are you chuckling about, Ron?
Ron:
Because you almost said Star Lord reflections.
Kelly:
I did not almost say Star Lord.
Ron:
Sure.
Kelly:
For the sake of Star Lord, let’s talk about Starflake reflections. It’s common, we’ve seen, for customers to have organized their data in a star or snowflake schema, either in a database or in HDFS or S3, with a central fact table and a number of dimension tables. If you set up a single reflection that includes all of those tables, Dremio can detect that this is a star or snowflake schema and can automatically use that reflection to accelerate queries that involve all of the tables or any subset of them.

What this means is that you can reduce the number of reflections you need to create, and Dremio can keep the maintenance and creation of these reflections as efficient as possible. Any other comments or thoughts on that?
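As a concrete illustration (the table names are hypothetical, Yelp-style), imagine a virtual data set that joins a fact table to two dimension tables. A single reflection defined on this join can, with Starflake detection, also accelerate queries that touch only a subset of these tables, provided the joins do not expand rows:

```sql
-- review is the fact table; "user" and business are dimensions.
SELECT r.review_id, r.stars, u.name AS user_name, b.city
FROM yelp.review r
JOIN yelp."user" u ON r.user_id = u.user_id
JOIN yelp.business b ON r.business_id = b.business_id;
```
Can: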
No, you really hit it on the spot. Previous to the 2.0 release we already had the capability of accelerating snowflake and star schemas, but now we’ve added the flexibility of also accelerating joins over a subset of the tables that are part of that schema. Users can even run ad hoc queries without knowing anything about that model and still get the benefit of the acceleration. That empowers the admins and the data curators a lot, by not having to maintain all this configuration. It makes change management very easy, and it makes the data consumers more productive, because they don’t have to keep modeling.
Kelly:
I think an exciting thing about this to me is that typically you’d think, “I have a bunch of data sets in S3 or HDFS; if I want the performance I need, now I have to load them into a data warehouse.” But here you can get the performance of acceleration directly off your data in the file system. You get interactive speed without having to use a proprietary data warehouse.

In terms of the user experience, we have these new icons that I think are fun: the fire-and-ice icon here in the upper right. Basically, when a Starflake reflection has been used to accelerate a query, this is what you would see on the job. When you see just the snowflake, the Starflake on its own, without the flame-
Can:
That’s actually coming up … It’s not just … So we shouldn’t.
Kelly:
Sorry about that.
Can:
It’s okay. Basically, Dremio knows whether it can use Starflake acceleration when we build the reflection. So we tell the user if we’ve identified that: “Hey, this looks like a Starflake schema, the joins are not expanding, meaning we can leverage Starflake.” Even before you run a query, you’ll know whether the reflection can be used that way. Then, as you’re running queries, if we end up using that specific way of accelerating the query, we’ll show you the flame identifier.
Ron:
Cool.
Kelly:
So that’s Starflake, and that’s reflections. I know we have a couple of questions we didn’t quite answer in the live Q&A; we’ll get to those at the end. Let’s move on to the subject of reflection maintenance.

To give you a sense of what we’ll cover here: we’ll talk about refresh policies and how reflection maintenance work is managed and handled, then a little bit about the system metadata you can look at to better understand this stuff. So Ron, do you want to talk a little about reflection refresh and what’s going on in the screen we’re looking at here?
Ron:
Yeah. What we’re hovering over here is the refresh policy tab for a particular physical data set. The refreshing of reflections happens at the level of the physical data set, which may have one or more reflections. The two most important parts I get to define here are the refresh period and the expiry.

What I have configured here is for the reflections on my physical data set to be refreshed every hour, doing a full update. Every hour, Dremio will create a new materialization of each reflection. I’m also stating that a reflection materialization expires and should no longer be used once it’s three hours old. Once a materialization reaches that age, it’s deprecated, which means that new queries that arrive will not use it. Queries that were already in flight may still be using it, so when something becomes deprecated, we do not immediately delete it; we wait for queries to no longer be using that materialization. You should be aware of that when planning for the disk space that will be used by the system. When you delete a reflection, or when one expires, we have background processes that clean it up later, once queries are no longer using it.

Can, did you have something you wanted to add to that?
Can:
I think we’re going to cover some of this a little more.
Ron:
Yep.
Can:
So-
Ron:
That’s fine. One of the things that was very much requested was “never refresh,” and that wasn’t because people don’t ever want to refresh. Well, there were two use cases. One was: I know this is something I want to reflect, and this is data that isn’t changing much, or ever, so don’t bother refreshing it.
Kelly:
Mm-hmm (affirmative).
Ron:
So we added that. The other part of it was: I want to be in full control of when the refresh happens, and I want to drive it from my own job. We now have refresh available as a button and, of course, via the API.
Kelly:
Cool. I think you covered that.
Ron:
Yes.
Kelly:
These points here. Can, do you want to talk about the reflections admin screen? This is specific to the Enterprise Edition. What’s new in the admin screen?
Can:
The admin screen has changed to be organized by reflection, actually. Previously this screen was organized by data set, and you had to go into individual data sets to see the reflection configurations. Now we present it by reflection: the screen is grouped by data set, and then you see the individual reflections for each. This gives you a much easier overall picture of what’s costing the most, and you can compare the cost and usage of each of these reflections.

I think one of the biggest improvements on this screen is the reflection statuses, which we have more details on in the next slide. Basically, as you see, there’s a green light, a warning triangle, and a red error sign. We’ve updated the screen to make it much easier to understand what’s going on in the pipeline; each of these statuses tells you whether or not you need to pay attention. Green basically means everything is okay: my current reflection can accelerate, it’s ready, and my pipeline is healthy, meaning the refresh jobs are also working fine.

The warning triangle means, at a high level, your reflection can currently accelerate, but there might be a potential problem in the pipeline. It’s not a terminal problem, meaning Dremio will re-attempt building that same reflection and will let you know if that fails as well, at which point it turns into a red light. Red means the reflection is in a terminal state; the moment the existing materialization expires, you won’t have coverage anymore.

Now, as an end user and an admin, you get a much clearer understanding of what’s going on, both from a current acceleration standpoint (can this accelerate?) and in terms of what the pipeline looks like: am I going to be able to keep accelerating, is it keeping up with updates, all those details.
Kelly:
Cool. Okay.
Ron:
This is a summary of the different states that a reflection can be in; Can mostly covered these. We also track the number of consecutive failures, among other nice things. There’s lots more information and control available to you in this release. If you go and look at a particular data set and the reflections that apply to it, we have a nice new capability where recommended reflections no longer block. Do you want to talk a little bit about that?
Can:
Yeah, one of the top three asks we’ve been getting … let me go back a bit. When you open up the reflections screen, Dremio actually sends a discovery query against the underlying data set to understand its characteristics, things like cardinality, and basically build a profile of the data. We then use this information to give you suggestions for aggregation reflections.

With this release, we’ve made it so that you can skip the recommendations or keep doing other work while you’re waiting for them. It no longer blocks, so you can go ahead with the rest of your workflow without having to wait for the recommendations, and skipping is an option, so you don’t have to cancel.

The other big change you’ll see on this screen is that each reflection can now be enabled and disabled on its own. This is a big change. It means you have much finer-grained control over the multiple reflections for a data set. You can disable a reflection to test things out without having to delete it or drop it. This gives, again, much easier control and easier management of the overall process.
Ron:
Okay, I have a question. What if I had 100 reflections?
Can:
Yeah.
Ron:
And they’re all on a similar maintenance schedule?
Can:
Yeah.
Ron:
How do I … Does Dremio update all of those simultaneously? Or what does it do?
Can:
It’s a good question. Every time you add, update, or delete a reflection, what happens in the background is that Dremio updates what we call a dependency graph. Let’s assume you have two reflections in the system: a raw reflection that has all the columns of the data set, and an aggregation reflection on the same data set. If you were to enable these at the same time, Dremio would make the aggregation reflection depend on the raw reflection. It would first build the raw reflection, and then use the raw reflection’s data to build the second reflection.

This, first of all, makes sure that you don’t hit the source system multiple times; you don’t have to incur the cost of unnecessary reads. The second thing is that it actually provides higher performance, because the idea is that this is like a funnel: you have the reflection that covers the most data, and then each level covers less and less. It’s more efficient to build off of less information than to do a full scan of the original data set multiple times, once for each of the reflections you have.
Ron:
I think that’s really important for people to understand. I would also point out that there’s a bit of workload management within the system, such that if I create 100 reflections on 100 different data sources, they won’t all be created at the same time, even though they’re not dependent on each other. There’s a queue that schedules them, and you can control the size of the queue and how much concurrency there is in reflection building, so you can keep the system from being overwhelmed by building reflections while it’s also servicing queries.

That’s a little hint of some upcoming work that will give you more fine-grained control of workload management in the not-too-distant future.
Kelly:
Great. Alright. System tables: you have sys.reflections, sys.materializations, sys.refreshes. You can run DESCRIBE on any of these to see what the schema is. Have fun, enjoy those tables.
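For example (treat this as a sketch; the exact columns vary by version):

```sql
-- Inspect the schemas of the reflection-related system tables.
DESCRIBE sys.materializations;
DESCRIBE sys.refreshes;

-- Peek at recent refresh activity.
SELECT * FROM sys.refreshes LIMIT 10;
```
Ron: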
I put those in there mainly so you guys know that this is how a lot of our internal data is surfaced. If you wanted to build an alternate UI for Dremio, or you’re curious how the REST API is powered, it actually all gets its information from there. We try to standardize how that gets done. I wanted to give you a little bit more of a look under the covers.
Kelly:
Okay, let’s talk a little bit about metadata, and keep those questions coming.

Actually, let’s answer one quick question on reflections first. Mike asks, “Can I create an aggregation reflection on top of an external reflection?” That’s a good question.

Remember, reflections are reflections of a data set. Say you have a data set, and you created an external reflection for it that is stored, for example, in Hadoop, because you’ve already created a Hive job for that particular digest. Then, when you go configure an aggregation reflection, under the covers Dremio runs a query that collects the data needed to materialize that aggregation reflection. That query, just like the substitution I described earlier, goes through the same substitution process and may choose the external reflection, through the optimization process, as the way to run. It may therefore use the external reflection. It’s important not to think of an external reflection as a new table in the system, but rather as a reflection of an existing data set.

Mike, I hope that answered your question.
Ron:
I had the same question, Mike.
Kelly:
Alright, and by the way, we have some tutorials forthcoming on external reflections, because I think they’re a very helpful add-on.

So let’s talk about metadata a little bit. Dremio gathers metadata from all these sources, and you now have a lot of control over how that metadata is harvested and how frequently it is refreshed. It’s important to understand that there are different kinds of metadata-gathering processes; some are fairly cheap to perform and some are much more expensive. We want to give you lots of control in figuring out how to best optimize the environment you’re working in.

Let’s talk a little bit about what kind of information we gather from sources when we connect to them.
Can:
The first thing to note here is that there are two levels of probing: data set discovery and data set details.

When you add a source to Dremio, we go through the data set discovery process. For things like relational databases, we capture the schema names, database names, and table names, the high-level objects basically. For something like S3, we capture the bucket names, and so on and so forth. This is the information you need to be able to browse the catalog, without any of the column information or the details you would actually need to execute a query.

The second level of probing is what we call deep data set details. These are the details we need to execute queries against the system, and Dremio caches them because we don’t want to incur the cost of retrieving them on every query, as much as possible. What you’re seeing on the left side of the screen are the various options for controlling this.

For now, let’s focus on what the data set details actually are. These are things like file and format information; the schema, meaning column types and column names; estimates or any statistics that we might have, things like Hive table statistics or table sizes for relational databases; proprietary information; file size information, and so forth. On top of this we keep what we call splits: being able to tie work to locations and individual partition values.

One final thing: for file systems we also have what we call a read signature, which is a way for us to understand whether a deep probe is needed without actually doing the deep probe. Basically, we compute a version for the underlying data set and compare that version instead of doing the full update; if the version has changed, then we do the full update. That gives you a much more efficient method.

Related to this, there are multiple modes in which Dremio can cache these data set details. In some modes, Dremio will only update metadata for the data sets that you touch, meaning the ones you preview in the UI or run queries against. Once you’ve touched a data set, Dremio puts it on the schedule and tries to refresh it at the refresh interval, and it will expire the metadata after the expiration period, like any cache. In that case, if you then go query that data set, we would do an inline metadata update and get that information.
Ron:
Sounds good. Some of our customers have hundreds of thousands of partitions in Hive and thousands upon thousands of tables in their Oracle databases, so it’s useful to have this flexibility.
Can:
Exactly.
Kelly:
Okay, so let’s talk about the freshness guarantee. What is that all about?
Can:
Yes, so basically, it goes back to the whole expiration point we were talking about. Actually, let me talk about something else first.

First of all, we’re trying to keep your metadata up to date so that when a query comes in, we don’t have to go back to the source system, as I was saying, because that’s costly. If for some reason we cannot keep up with the updates and still have to execute that query, we will do an inline metadata update; we call that falling through, in our system.

If, for example, you added a table to your source that Dremio does not yet know about, Dremio will actually go to the underlying data source, try to discover it, and still let you query it, so you’re not left stranded. The point here is: we will try to do our best effort. We have scheduled updates, and we make that effort before you need the data.
Kelly:
Cool. Okay. We now have some new capabilities that give you control over this metadata, allowing you to explicitly invoke a refresh and also to forget. When would you ever want to forget metadata?
Can:
If for some reason your metadata gets into a state that does not make sense, that’s when you would use forget metadata. Our first recommendation is always to refresh it and see if that helps, because something like 95% of problems will be solved by that. In case something still goes wrong, that’s when you would use forget metadata. The next time you query the data set, it will appear in the catalog again.
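As a sketch, both operations are exposed through SQL along these lines (the dataset path is hypothetical; check the documentation for the exact syntax in your version):

```sql
-- Explicitly refresh the cached metadata for a physical dataset.
ALTER PDS hive."default".transactions REFRESH METADATA;

-- Last resort: make Dremio forget everything it knows about the
-- dataset; it is rediscovered the next time it is queried.
ALTER PDS hive."default".transactions FORGET METADATA;
```
Kelly: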
Forget metadata potentially has big impacts. It’s not something you want to use lightly; it’s sort of a last-resort effort.
Can:
Exactly, because it has implications for things tied to the data set, like reflections and security permissions. It’s not like a refresh.
Kelly:
Okay. Source down. Naturally, in distributed systems, networks go down and servers go down, and Dremio may be connected to hundreds of different source systems, or maybe just a few. If it ever loses connectivity with those sources, what happens? How does Dremio behave when a source goes offline?
Can:
What we do is check each source at an interval to understand its state. If a source goes down, then instead of failing your query right away, we show a specific message to the end user saying, “Hey, there are connectivity issues with your source,” and we tell them about the potential problems. The big point here is that previously, a source that was in a bad state could affect other users’ queries. With this release, we make sure that each source is isolated from a catalog and planning standpoint, and we have solidified the source health-check mechanism to make sure we detect these changes as soon as possible and reflect them in the system, to keep the rest of the workloads supported.
Kelly:
So Ron, if I have a source and I have a raw reflection on that source, should I just think of that like a backup, like a failover copy of the data, where Dremio can go query the raw reflection instead of failing the query?
Ron:
No, you shouldn’t think of it that way. Correct.
Kelly:
Okay. So if a source is down, from Dremio’s perspective, I can’t run a query even if I have the data in a raw reflection? I believe that’s true: the planner checks whether the underlying source is still available before considering substitutions of reflections. The thing that’s key here is that when we first released the very first version of the product, some systems being down could actually prevent us from running any query at all, so this is adding more maturity there.

We see that the source is marked as red. If you’ve never seen that, one way you can easily simulate it, if you have Dremio running on your laptop, is to just turn off your wifi connection.
Ron:
Yep.
Kelly:
You’ll see everything go red.

I think this is the end of the slides on the metadata section. So, schema learning. Dremio learns about schemas during these probing activities and builds the schema into its catalog, the Dremio data catalog, and then periodically updates and maintains that catalog. If you alter the schema of the source, Dremio should automatically pick up those changes the next time it goes to collect metadata and update the catalog, without an administrator needing to intervene.

I’ve spent a fair bit of time working with MongoDB and also Elasticsearch, and there the schema can change completely unannounced from record to record. So in addition to the metadata-based ability to probe and collect schemas, we actually do this during queries as well. If we detect during a query that a schema has changed, we’ll automatically update the catalog from the query. It’s a really nice feature that is a reflection, pardon the pun, of the reality of these modern sources: the schema is not always declared and it’s not always maintained. But that doesn’t mean you have to have teams of administrators watching for these things; Dremio will automatically adapt and learn about the changes itself.

That’s it in terms of the material we prepared for you. I know we went over a ton of things, and we’ve left a few minutes here at the end for some Q&A. You guys asked great questions.
Ron:
Yeah!
Kelly:
Ron, do you want to start?
Ron:
Sure. Could you comment on the expected capacity planning for storing reflections, and the factors that affect reflection storage?

I don’t think there’s a very simple answer for that, unfortunately. It depends on the size of the data, on understanding your workload, and on which reflections can serve as common denominators for many queries in your workload, on the one hand. On the other side, there are some options when you create reflections, and in particular one of them helps you choose between a larger number of smaller files versus fewer, larger files in the system. You can see that in the UI when you go to an aggregation reflection in the advanced screen, in the upper right hand … I’m doing this from memory, so hopefully I’m right.
Kelly:
I’ll look at it.
Ron:
What you want to pull up is the demo that … yeah. I believe if you go to an aggregation reflection on the advanced screen, in the upper right hand corner you will see a little gear and-
Kelly:
[crosstalk 00:40:49]
Ron:
Yeah, the advanced mode. Just click on the little gear in the upper right hand corner, and that lets you choose how the distribution will work when you distribute a reflection. What you can choose here is whether that distribution happens globally or locally. If it happens globally, we’re going to split up the streams based on the distribution you requested, which will help potentially minimize the number of files produced. If you’re on HDFS and you’re worried about the total number of files, you would choose that. If we do it locally, that means we potentially get much higher parallelization, and therefore we potentially minimize refresh time. You get to make those choices depending on whether you’ve got S3 or HDFS or something else as the underlying store. Hopefully that answers your question.

Other than that, a lot of it depends on the refresh rate of the data and so on; it’s kind of hard to generalize. One thing to keep in mind is that the data persists as Parquet, which has a number of different compression schemes in place. The question is: how big is the source data that you’re reflecting? If it’s an aggregation reflection, then that data tends to be summarized. I’ve opened up an example in which the source data is about 300 or 400 gigabytes of CSV. Here’s the aggregated representation: here are all the dimensions that were picked, and here are all the measures. The aggregated file size is 128 megabytes, so it’s dramatically smaller than the source data. If I did a raw reflection on this data set, it would be significantly larger, because you would have the row-level details on disk.
Kelly:
And that’s actually a really good point: one of the reasons you may want to choose fewer files, as opposed to many files, is better compression.
Ron:
Okay, let’s move on to another question. Does metadata capture lineage of reflections? I’m not sure if I understood the question, but we do capture … Oh I see, Kelly was down to 14% and we were all watching!

We do capture where the data came from, and we maintain an internal dependency graph between all the reflections. Kelly, do you want to show us something?
Kelly:
Yeah.
Ron:
Okay. So here’s a virtual data set that is dependent on these different physical data sets, and you can think of the reflection dependency graph as being somewhat similar, where one reflection might be dependent on another reflection. That sense of lineage and dependency is something we track internally.

This is a screen to help you understand the lineage of data between virtual data sets and physical data sets. You shouldn’t need to worry about the lineage of the reflections, because those are managed by Dremio. You should be thinking, in my opinion, about the lineage of the virtual data sets and physical data sets, which is what is visualized here. If you’ve never seen this screen before, it’s a feature that is only available in the Enterprise Edition.
Kelly:
Can, you want to … I think there was one you wanted to take? Which one did you want to take?
Can:
I can pick one. Does Dremio preserve the original indexes of external data sources? Dremio is a read-only system. If, let’s say, we’re working against an Oracle table, and that Oracle table has an index on a certain column, and we end up pushing down a query that would use that index, then it will be used. There’s no explicit action on our part, because we don’t manage that stuff. The source system stays as you configured it; if the Dremio pushdown can leverage an index, the source will leverage it. And it’s not only Oracle; the same applies to the other sources.
Kelly:
You want to tackle any of the other ones?
Can:
Yeah. Is there any support for capturing data updates, merging data in raw reflections? Today there are two main reflection update capabilities. We have a full refresh mechanism: if your data is mutating, we do a full refresh. We also have an incremental update mode for append-only data sets. For file-system-based data sets we can look at the files that were added since the last update, or, depending on the source system, we can look at the rows that were added since the last update and add those to the reflection.
Ron:
I will say that one of the things on the roadmap we’re working towards is to enhance that: to make the full update a partition-aware full update. If you know which areas of your data are mutating and how, you could plan your strategy for creating reflections and distributing them in a way that partitions the data, so that only specific partitions need to be refreshed, which is a lot more efficient. That’s coming in the future.
Kelly:
Okay. The other ones here …
Ron:
Give me one here.
Kelly:
Does Dremio support geospatial queries?
Ron:
Well, you can do inequality queries on geospatial coordinates, but you can’t do the type of stuff you’re used to being able to do in something like Azure or in Postgres with geospatial support. Very rudimentary things in Dremio today; it’s something we’re considering for a future release.
Kelly:
Yeah, that answers that one. I see a couple of fun questions on the other side. “I heard that Dremio uses the Arrow format; is there a cost of conversion to the Arrow format from, say, ORC tables or others?”

Arrow is our in-memory representation. The conversion cost is about reading from some other system, whether it’s already Arrow or not, into Arrow, and we’ve not seen that be significant. The cost of pulling data from those other systems into Dremio will also greatly depend on the latency of the network and the response time of that other system. Mark, I hope that answers your question.

Sorry, I’m just reading through the next question. “Is there a way to create and manage layers of virtual data sets that can be stacked on top of each other, and to create reflections on the most popular virtual data sets?” Go ahead and answer, Can.
Can:
That’s actually one of the most common ways Dremio is used. You basically build multiple layers addressing different audiences in your organization. For example, you might have a layer of virtual data sets for your data engineering organization, with the lower-level details, which you build the reflections on top of. Then you might have a more business-oriented space, where you build data sets curated for that audience. Those data sets will be accelerated by the underlying reflection layer. You can also track how many jobs each of your data sets gets, and based on that you can decide, “Hey, maybe I need a reflection at this point or this other point.”
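A minimal sketch of that layering, with made-up names throughout (virtual data sets can be created in SQL with CREATE VDS):

```sql
-- Lower-level, curated layer for the data engineering space.
CREATE VDS staging.reviews_clean AS
SELECT review_id, business_id, user_id, stars
FROM yelp.review
WHERE stars IS NOT NULL;

-- Business-facing layer stacked on top; reflections built on the
-- layer below can accelerate queries against this one.
CREATE VDS business.reviews_by_city AS
SELECT b.city, COUNT(*) AS review_count, AVG(r.stars) AS avg_stars
FROM staging.reviews_clean r
JOIN yelp.business b ON r.business_id = b.business_id
GROUP BY b.city;
```
Ron: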
Yeah, I can see here in Kelly’s example on the screen that there are 129 different jobs that have been run against business, 83 against review, 59 against user. What about this one, the marketing Yelp-join view-
Can:
It’s a URB-
Ron:
URB, or whatever that was. User-review-business-
Can:
Oh! Okay. Sure, of course.
Ron:
We may learn from that, and from the feedback system, that it may not make sense to create a reflection on that data set. There’s the balance of how big that data set is versus the originals. That’s, I think, what you were referring to: you can create a much more complex lineage of virtual data sets, look at which ones are being used more, and then choose your reflections based on that. Choose your reflections based on workload.
Can:
You’ve got it. And what we see with customers is that, as you build up more reflections, you actually start seeing patterns and merging reflections, so that a single reflection serves multiple use cases.
Kelly:
Couple more questions and a couple more minutes.
Ron:
Any ones you want to hear about, Kelly? We have questions …
Kelly:
You guys kind of went through all of those …
Ron:
There’s a question here about a specific example of dimensions and measures, referring to a specific example we covered at 11:14, so that one’s a tough one, because I’m not sure what the question is asking.
Can:
I think I know what the question is; I’ll give my take on it.
Ron:
It sounds like, yeah, but-
Can:
When building a reflection, you might mix and match dimensions and measures from multiple sources. Your virtual data set can first model the join; you might say, “Hey, from Mongo I’m going to take the reviews table, from Elastic I’m going to take this other index, and I’m going to join them,” and on top of that you can build your reflection. At that point, Dremio doesn’t really care where an individual field is coming from; we treat the virtual data set as the source of the reflection. From the user’s standpoint, where each field came from is irrelevant for that part of the data set.
Ron:
Jess asked a question about whether we’ve considered a feature around alerts when a schema change happens. Yes, we have. Mind you, we’re not perpetually scanning the entire underlying data source, so we need to think about how to surface that information. If you have billions of documents in a MongoDB database and one of them happens to have a new array field that the others don’t, it may take us a long time before we notice that change. We need to be careful with the semantics of what those alerts mean, as well as thinking about how to surface that information. You can see in our logs that we encountered a new schema today, and certainly reflecting that, oh, poor choice of words, surfacing in a system table how many times a day or a week we’re seeing schema changes, and what they were when we saw them, would be useful. We’re certainly thinking about that.

“Can you comment on accelerated query latency, sub-second, etc.? Any chance of plugging Dremio APIs into data-serving applications?” Query latencies are definitely sub-second; we’ve got quite a few that are in the tens of milliseconds, though I don’t necessarily have them in the single milliseconds. We built this system for interactive analytics; the benefit of reflections and accelerated queries is all about getting to sub-second queries. I’m not sure what you mean by data-serving applications, but they can certainly be interactive applications, is what I’ll call them, just to make sure that I don’t read too much into it.
Kelly:
If you had a data-serving application, you could interact with Dremio over JDBC or ODBC or the REST API.
Ron:
Yes.
Kelly:
Up to you.
Ron:
Yep.
Kelly:
Anything else here? Any other questions from you folks on the line? We’re going to wrap up here in just a minute.
Can:
[crosstalk 00:52:33] We went through everything.
Kelly:
Tons of good questions here.
Can:
Yeah.
Kelly:
Oh! One was about a webhook?
Can:
Oh, he answered that one.
Ron:
I already answered it.
Kelly:
Oh, you did?
Ron:
We can ask … I’m swimming in text still.
Kelly:
How do I get started with Dremio? Ah, I think that’s a great question! Go to our website; there’s a wealth of tutorials that have been written, some by us and some by people in the community. You can go to the download page. It’s open source, so there are editions for Mac OS, for Linux, and for Windows, with production builds for Linux. Dremio is intended to be run in a cluster; you can run it as a YARN application in your Hadoop cluster if you like. But we also made it so you can try it out on your laptop and get a sense of how it works, even though it’s clearly intended for large production systems with many nodes in the cluster.

There are tutorials, and there is a community forum for you to ask questions as you’re getting started. As for integration with Spark: via external reflections, yes, we do that today; beyond that, not yet, but forthcoming, Mark.
Ron:
A little bit later this year, Mark, I think you’ll have some exciting news from us in this area.
Kelly:
Okay, great. So thank you all so much. We’ll circulate a recording of this session to you. If you have questions, you can also go to [email protected] and ask them there, and we do our best to answer things quickly. If you have any feedback or comments, you can always reach me at [email protected]. Really appreciate your attendance and attention today. Enjoy Dremio, it’s for everybody. Thank you so much, take care.
Ron:
Bye!
Can:
Bye.