49 minute read · May 26, 2019
Dremio 3.2 – Technical Deep Dive
· Director of Technical Marketing, Dremio
· Senior Director Product Management, Dremio
Webinar Transcript
Lucio Daza:
Excellent. So there is a couple of things before we get started that I want to run by the audience. The first one is, this presentation is for you. We know you are going to have questions. We want to answer those questions. So if you want to communicate with us, there are three ways that you can do that. You can communicate with the rest of the panelists, or the audience through the chat window, also you can raise your hand as well as using the Q&A button to ask any questions. Quick thing that I want to point out, we have a Q&A session at the end of the webinar. However, if you have any questions, do not wait until the end to ask those questions. Go ahead and hit that button right there and put the question in the panel. We will go ahead and capture your question. And if we don’t answer it throughout the presentation, we’ll do it at the Q&A session at the end. And if not, we will go ahead and follow up with you through email, and we’ll provide the information that you are requesting.

Lucio Daza:
So another thing, we’re going to be playing a fun game throughout the webinar. So I am going to be giving away three or four Dremio T-shirts, highly acclaimed, everyone wants them. They are awesome to take a nap, or to go jogging, or just to brag about them. So I am going to be dropping questions throughout the presentation, and the first person who answers that question, in the chat window, will get some swag from us. All right? So let’s go ahead and make that work. So with that said, let’s see where we are today? So, May 2019, exciting times. We have … It almost feels like it was a couple of weeks ago when we did the latest release, Dremio 3.1. So today, we’re going to be talking about 3.2, and unlike many other products out there where only major releases contain new features, here in Dremio, we provide new features on every release. So we’re going to be talking about a lot of exciting stuff, new features, new improvements that we have in this release. Also, if you have any questions about previous releases, feel free to let us know. We’ll try to address those as well.

Lucio Daza:
And also, a quick highlight that I want to mention before we get started, we have a new deploy page. So you probably already discovered this, if you went to our website, or you probably already tried what you’re seeing in the presentation right now. But on our new deployment page, there is no more guessing about, “Where can I find the binaries to download and install Dremio?” Now you can go and download templates to deploy in AWS, or Azure, and also to deploy using Kubernetes, and Helm. So there are many different ways right now that you can deploy a tool like this. So we wanted to make all these resources and all these options available for you in one location. Okay, so how cool is that? So now go to Dremio.com, and instead of the download button, we replaced that with, Deploy. So, go ahead and take a look and see what your options are now. All right? So, a quick recap of what we did in 3.1.

Lucio Daza:
So if you attended our latest Technical Deep Dive, you might remember that we did some cool stuff around Multi-tenant Workload Controls. We also had Enhanced Previews, and we’re going to be talking about that today, as well. We had some enhancements around Gandiva, which made things super fast. Also, we had a new Reflections Matching Algorithm, and of course Advanced Relational Push-Downs. And today, we are going to be talking about all this cool stuff that we just included in our product, and we’re going to be talking about enhancements in Cloud Native Data Lake connections. We, also, are going to be talking about Concurrency and Scale Performance for large-scale deployments. We’re going to also be talking about Functional Enhancements that we have in our product. So, it’s not just all this stuff that you’re going to see under the hood; the user experience was also greatly enhanced, and we’re going to be talking about that.

Lucio Daza:
One thing to note is Dremio 3.2 comes with over 200 improvements. So we are going to do our best to talk about the ones that we considered were the coolest ones. However, if you get a chance, go and check out our release notes and see what’s new out there. If there was anything that we missed here, please go there and check it out. A bunch of cool stuff. So, Tom, let me catch your attention here for a second because I know that a few months ago Microsoft released a new … I don’t know if I want to call it a [inaudible 00:08:00], but they released a new version of ADLS, and this is ADLS Gen 2. So how are we doing in terms of adapting to all the evolution of this new data source? What are we doing on Dremio’s side to get our audience to be able to work with this?

Tom Fry:
Well Lucio, thanks a lot for mentioning that. One of the things that we do here at Dremio, is we strive to make it easy to connect to data wherever it may be stored. You know we see customers moving not just to public and hybrid Cloud environments, but also moving to Cloud storage services as well. And in particular, Cloud storage services such as Azure storage, and AWS S3, are really becoming central to organizations’ data lake story. So one of the things that Microsoft did recently, was they really broadened the scope of the services they offer in storage. And they introduced multiple new storage services.

Tom Fry:
The first of which they’re branding, Azure Data Lake Storage Gen 2. And the second is a second version of their Azure Blob storage service. So what we did here at Dremio, is we added support, not just for these new Azure storage services that they recently offered, but we also expanded our support for Azure Blob storage, and included that in the matrix of our support on the Azure platform. So with this we really have full support in the Azure ecosystem, really for all the storage services that Azure offers. Whether it’s Azure Blob storage, or Azure Data Lake Storage Gen 1, or Gen 2.

Now we have it explained on our website. There’s some confusion around some of the branding and positioning for some of these different storage services in Azure, and we have within Dremio two different storage connections. The first is the traditional one that is already available, which is for Azure Data Lake Storage Gen 1. And then we have the second storage source that you see here in this picture, which is the Azure Storage source, and this covers both Azure Blob storage, and ADLS Gen 2. So the easy way to think about it is, if you’re using Azure Data Lake Storage Gen 1, use the existing traditional connector. And for all those other Azure storage properties, our new Azure Storage source connector is the one to use.

Lucio Daza:
That is great. So, I also wanted to mention if you’re not too familiar with ADLS, and want to check out its architecture and understand everything that composes this awesome technology, we have a great explainer in our library. So, go to dremio.com, find the library. We have a section called, explainers, where we have a super cool write-up on ADLS Gen 1 and Gen 2, and what the benefits are. Also, pricing models, and so on and so forth. I like to brag about it because I wrote that. No, but it’s very cool. Joking aside, it’s very interesting. So, besides adding these new Cloud storage sources, are there any other Cloud storage improvements that were made in this version, Tom?

Tom Fry:
Right, so I’m really glad that you asked that. One of the things that we did here with the latest release is, we made some significant investments in terms of, how do we optimize reading of columnar data? Particularly on Cloud storage services. So particularly for Amazon S3, and particularly for Azure storage. What we wanted to do is back up and say, “Why is this important? Why do we care about optimizing for columnar file formats? And why is that different?” So if you look at it historically, databases store data in what we call, row-oriented format. Which means, all the fields for a single record are grouped together, and this is optimized for transactional systems, traditional OLTP type systems. However, for analytics workloads, this organization was inefficient, and modern databases implemented what we’d call, columnar-oriented format, which groups a single field for multiple records together. And this is very optimized for read-oriented analytics, and the workload pattern you’d typically see with that. And initially these were typically proprietary systems, think traditional EDWs, and so their file formats were naturally proprietary as well. And with the emergence of open-source databases and open-source data services, Hadoop, HBase, et cetera, the industry moved towards and created non-proprietary columnar file formats. The most popular and widely used of these are Parquet and ORC. And these formats have really become central to many organizations’ data lake storage, where you’ll see that as an organization builds out their data lake, Parquet and ORC file formats are really central to how they organize and how they store data today. However, there are some inefficiencies in the way that traditional file systems read columnar file formats such as Parquet.

File systems try to predict future read requests, and read ahead automatically. And so this enables file systems to kind of prefetch data, and have it immediately available for when it’s requested.
However, if you look at the workload patterns on columnar file formats, these predictions are often inaccurate, and this results in wasteful reads, but much more importantly it stalls the processing pipeline. Because whenever you have a miss, when data’s not accurately prefetched, the system has to make a new request and wait for the data to arrive. Traditionally, this waiting was not as impactful since data was local and quickly accessible. Think for example, if you’re running [inaudible] on top of HDFS. The file’s local to where the compute is going to be. So the wait was kind of short.

However, with Cloud storage services, this pause in waiting is much more impactful because Cloud storage is remote, and latencies are one to two orders of magnitude higher. So what did Dremio do? What we essentially did was, we combined our understanding of columnar file formats, and analytical SQL workload patterns, in order to more intelligently predict likely access patterns within a Parquet file, or within an ORC file, and optimize reading within these files. So by doing this, we’re able to significantly improve our read-ahead hits, and as a result really gain a lot of the benefits of data pipelining throughout the system. Meaning, that we much more accurately predict what is the next set of data that we need, and bring that into the system before we need it. And the result is, the system not only achieves higher bandwidth when working with a Cloud storage source, but we also have much higher CPU utilization since the pipeline is full, and we also really improve our CPU and memory resource utilization. And most importantly, this results in significantly faster query response times that users see at the end.

Lucio Daza:
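To make the stall-versus-pipeline idea concrete, here is a toy simulation. This is illustrative only — the chunk layout and the two prefetch policies are invented for this sketch, not taken from Dremio’s implementation — contrasting a naive sequential read-ahead with a format-aware one that learns the needed column chunks from the file footer:

```python
# Toy model of read-ahead against a columnar file (hypothetical numbers).
# A query touches only some column chunks; a naive prefetcher guesses
# "the next byte ranges", while a format-aware prefetcher uses the file
# footer to know exactly which chunks the query will read.

# Offsets of the column chunks the query actually needs (non-contiguous,
# as is typical when a query selects a few columns out of many).
needed_chunks = [0, 3, 6, 9, 12]

def count_stalls(prefetch_plan, needed):
    """Each needed chunk not already prefetched causes a stall:
    a synchronous round trip to remote storage."""
    stalls = 0
    prefetched = set()
    for chunk in needed:
        if chunk not in prefetched:
            stalls += 1          # pipeline pauses, fetch on demand
        # after serving this chunk, prefetch whatever the plan predicts
        prefetched.update(prefetch_plan(chunk))
    return stalls

# Naive: assume the reader will continue sequentially.
naive = lambda chunk: {chunk + 1, chunk + 2}

# Format-aware: the footer tells us all needed chunks up front.
aware = lambda chunk: set(needed_chunks)

print(count_stalls(naive, needed_chunks))  # 5 — every needed chunk misses
print(count_stalls(aware, needed_chunks))  # 1 — only the first read stalls
```

In the naive case every needed chunk is a miss, so the reader stalls on each one; the footer-aware policy stalls only on the very first read, which is the “full pipeline” behavior described above, and on remote Cloud storage each avoided stall saves a full high-latency round trip.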
That is great. So let me ask you this. There was a lot of information on this slide, and you probably already walked us through that, but this EKG-looking graph, what am I looking at here between Dremio 3.1 and Dremio 3.2? The first one looks like what my heart looked like when I was interviewing with Dremio. The second one looks like when I received my offer, I had a complete flatline. So, go ahead and walk us through that and see what exactly we are looking at here on these graphs?

Tom Fry:
Sure. So what you’re really seeing with these graphs is a visualization of a little bit of what was just touched on. So when you do these read-aheads, and the data that you tried to guess you might need was wrong, the system has to pause and stop and make a new request to the Cloud storage service. And particularly with Cloud storage services, because the latencies are a little bit higher than what you’d see from a local storage system, you’d see these down times where the system is essentially stalled and waiting for data. So what you’re seeing in that Dremio 3.1 pattern is typical of pretty much a lot of analytical systems, when reading from columnar-oriented data. So you’ll see, we kind of read a little bit of data, and then we realize we had a miss and so we have to kind of pause while we request new data. So we wait for a period of time. And then that data comes in and we process some data, and then we have another miss, and so we’ll have another pause. And this process kind of goes back and forth, so then you see this choppiness in the first graph.

And by more accurately predicting where in a Parquet file or an ORC file we’d want to read data, we essentially don’t see this pausing and restarting. We’re able to largely fill the pipeline, which means the system’s always reading some type of data, and the system’s always processing some other data that it already prefetched. And these instances of, “Hey, you know we made a mistake, and we pulled in the wrong data,” happen much less frequently. They do happen occasionally, and sometimes … You know no prediction’s ever 100% accurate. So there still will always be sometimes a little bit of a pause, but it’s drastically reduced. And as a result of this, we’re able to make much more efficient use of our resources, and we’ve seen right out of the box two to four times faster query response times. Not at the file system level, this is at the user level.
So including all the overheads, query planning, returning result sets, et cetera. Even considering all those overheads, we still see significantly faster query response times through this technology.

Lucio Daza:
This is great, and thank you for clarifying that. So, I believe there is a talk based on this, right? Where customers can go and take a look?

Tom Fry:
Yeah. So we included a link here, and it’s also posted on our website. Our CTO Jacques gave a great talk at Strata, talking about a variety of different things with Gandiva and a couple other technologies, really focused on what we’re doing with Dremio 3.2 in terms of this predictive pipelining. So if you’re interested in the technical specifics behind this, there’s a nice in-depth talk by our CTO that he presented at Strata recently. So I highly encourage going to look at that, it’s a great talk.

Lucio Daza:
Awesome. Great. So, I think this is a good time to take a quick pause. We received a question from Mr. or Mrs. [Anaoymous@MB] which is, how did you decide on your mascot, on your logo? So Gnarly The Narwhal, I honestly don’t know. I think it’s because he looks so cool. But my smarty pants answer would have been, we chose him very carefully. However, I told you guys that we were going to go ahead and play a game for you to receive some free and cool swag, so I want to go ahead and ask you this question. And the rule of engagement here is the first person to hit me on the chat with, of course a name, and the answer, we will go ahead and ship you some cool stickers and also a T-shirt. So the question is the following, as you may know a group of bees is called a colony. And a group of rhinos, for those who didn’t know, is called a crash. So who can please tell me what a group of narwhals is called?

So we’re going to give it a few seconds and see if somebody jumps in with an answer, and let’s see if we have a winner. Let me go ahead and open … Alex, you got it. You got it. Congratulations. So Alex, we’ll get back with you, and you are the proud winner of a narwhal Dremio T-shirt. Please, we’ll reach out to you and we will need, of course, an address and a size for your T-shirt. But congrats. Yeah, it is a blessing for those of you who didn’t see the answer. So a group of narwhals is called a blessing. All right, so now we are going to continue on the webinar, and Tom, this columnar predictive pipelining is great. I think there is a lot of improvement, or customers are going to see a lot of improvement and benefits in terms of overhead with their resources and CPU and memory, and of course the queries are going to come back faster. However, what is the actual benefit of this technology? Like what is it that we’re going to be noticing here the most?

Tom Fry:
So that’s a great question. So the real impact is what you see as an end user at the query response time level. And what we’re showing here is actually the combination of a couple different things. The improvements we made with predictive pipelining are actually cumulative with some improvements that Azure has made on their storage services as well. So one of the things that Azure’s been promoting is that Azure Data Lake Storage Gen 2 is faster than Azure Data Lake Storage Gen 1. And what this line is showing you is, what is the combined benefit of putting these two things together, when you’re not only switching from Dremio 3.1 to Dremio 3.2, but also upgrading your storage sources to Azure Data Lake Storage Gen 2 as well. And when we put these two things together, we’re seeing right out-of-the-box a four to six times performance improvement. This is showing a common benchmark, TPC-H at scale factor 100, comparing improvements between Dremio versions, combined with the Gen 1 to Gen 2 Azure storage improvements as well. So we think this is really fantastic, and that it’s something that people can make immediate use of.

Lucio Daza:
So one of the things that happened in 3.1, is that we had some features that were, quote unquote, preview features. So if customers wanted to enable them, they needed to get in touch with us, and we would tell them how to go ahead and enable them and so on. Is there anything that the audience would have to do in this case to go ahead and enable these features, or what’s the deal here?

Tom Fry:
So all these improvements we made on the storage side, both in terms of the new Azure storage services, and the new predictive pipelining technology, are fully GA in the 3.2 release. They’ll be automatically enabled for users running Dremio. So there’s no configuration option to switch, or some flag, or setting to set. If you use the new Gen 2 storage service, you’re going to see performance benefits out of that. And you will automatically see the benefits of the predictive pipelining technology for all of the AWS and Azure storage services that you use. So no configuration setting, nothing that needs to be done. As soon as you upgrade, you’ll see those benefits.

Lucio Daza:
Awesome. Out-of-the-box, you heard him people. Out-of-the-box benefits. All right, let’s go ahead and talk about the following. I mentioned in the overview, that we have some new ways to deploy and run Dremio, and up to now, you needed multiple servers, right? So what do we have now … Let me see if I can switch here. How can we go ahead and deploy Dremio in this latest version?

Tom Fry:
Sure. So many organizations today utilize Kubernetes and containers to really simplify operations. So instead of spending significant IT time administrating servers and standing up clusters and manually spinning things up and down, Kubernetes enables IT to automate what traditionally were more manual tasks, or had to involve some type of manual intervention. So Dremio already supports Helm charts for Kubernetes, in multiple deployment scenarios, including Azure AKS, Amazon EKS, and standalone Kubernetes clusters for on-prem, hybrid, and Cloud environments. What we did in 3.2 was we made significant improvements really aimed at simplifying a lot of common tasks that people encounter in production environments.

So we really wanted to harden our Helm charts in two different categories. One, was to make sure that they have a lot of best practices built into their default configuration for production environments. And the second was, to enhance them to kind of enable cluster life cycle management topics, and make them much easier. So if you think about the types of activities that are typically done, starting up a cluster, expanding a cluster, upgrading a cluster, et cetera. We’ve really made these all essentially one command line type activities. So what would traditionally take IT a bit of time, now is just kind of provided out-of-the-box, and very easy to do.

Lucio Daza:
That is great. So I imagine there are a lot of administrative activities that we’re making easier. And I think these are the fundamentals of moving towards the Cloud, right? You want to deal with less hardware, and especially technologies like this that allow you to containerize your applications are going to allow you to get rid of all the overhead of having to deal with massive environments. I think we are on time for another question, and then again, no Googling people. Alex, you already got your shirt, so sorry man, you don’t get to participate again.

But now the next question is the following, so pay close attention to this, why is Kubernetes sometimes abbreviated as k8s? So I’m going to give you a couple of seconds. Why is Kubernetes abbreviated as k8s? And let’s see who the winner of the new shirt is going to be. Simon, you got it. Wow. That was fast. All right, so Simon got the … the right answer. It’s because we have eight letters between k and s in Kubernetes. That is great. So let’s go ahead and move on. What if I don’t have a Kubernetes environment, Tom? Why do I need to … What are my options in this case?

Tom Fry:
Sure, so as popular as Kubernetes is becoming, it does require you to have that service available and set up, and not all IT organizations do. Some other organizations also have policies that require them to run on bare metal instances either for security, or regulatory, or a variety of other compliance issues. So for that we have template-based provisioning, that really enables one click deployments in the Cloud. This makes it easy for any organization to quickly deploy Dremio with no other services or dependencies required, within their own storage account. So if you have your Azure storage account, you can quickly basically use this one click deployment to go and spin up your own servers, and all the hardware and software and configuration will be done for you. Initially, today, we’re offering ARM templates for fast deployments in Azure. And AWS support is coming soon. And you’ll be able to find these, find Dremio, in their respective marketplaces as well.

Lucio Daza:
So I believe … Let me get something clear here, because I believe I saw only the community edition available on the website? And if I want to, can I bring my own license if I want to use the enterprise edition?

Tom Fry:
That’s exactly right. So what you have from the website is the community edition version of Dremio, available for free to go and try however you will. If you’re interested in trying the enterprise edition, you can always contact your Dremio support channels for that, and we can enable basically the enterprise edition in any of these environments as well.

Lucio Daza:
Awesome. Great. So let’s go ahead and talk about the large scale use cases. So just to kind of do a quick recap, we’re talking about deployment on the Cloud. We’re talking about ADLS connections. So we’re talking a lot about … There’s the potential for very large scale use cases. Are there any improvements included in this version that are aimed at those use cases?

Tom Fry:
Yeah, so I’m really glad that you asked that. So, as we’ve made a lot of improvements really focused on Cloud environments, of course that comes with wanting to support Cloud scale as well. You know, Cloud makes it very easy to be able to provision resources on demand. Grow your data lakes on demand, and as a result you need a storage service that can handle those volumes, in what we’d call, Cloud scale datasets. So we’ve made a variety of improvements, and continue to make a variety of improvements, focused along that vector. The first that I’d like to talk about are improvements that we’ve made to our Reflection management to help support Reflections on very large scale datasets. So let’s back up and review, what are Reflections and what are they used for?

Reflections enable Dremio to greatly accelerate operations and also to offload requests from storage systems. Essentially another way you could think of it is, a Reflection enables Dremio to preextract data from an external relational database or an external Cloud storage service, and bring that into the Dremio cluster to essentially have that available for immediate query. Some users use this to accelerate their systems. Some users use it to essentially offload their storage systems. And Dremio performs all the management for this. There are no IT administrative tasks. It’s essentially a data curator specifying, “Here’s a virtual dataset. Please accelerate this for me.” And Dremio handles the rest for you. So this is a feature that’s heavily used by users. We have many customers in production with … One was showing me many thousands of Reflections that they had set up. So it’s a great feature.

Now the issue is, if you’re, let’s say, going to build a Reflection on top of a very large Cloud scale dataset, think hundreds of terabytes or petabyte sized data, you don’t want to refresh all that data every single time.
Not only is the data volume too big, but it’s also inefficient because most data in a data lake is historical, and is unchanging. So what users would like to do, and what we have partially enabled in the past, is the ability to essentially implement an incremental refresh of a Reflection. This means that you basically keep your existing data in place, and you only look at the recently changed data and just bring that into the system. So if you think about it, if you have a massive dataset that spans 10 years, you don’t want to refresh the 10 year history. The 10 year history sits in Dremio, it sits in a Reflection, and each day you just pull in the most recent data.

Now previously, in Dremio 3.1 and earlier, we had limited options for how users could configure, and specify to Dremio, which data has changed that they want to perform the incremental refresh on. And you were really limited to just BigInt columns. So today with 3.2 we essentially expanded our support for incremental refresh to cover essentially all sortable types. And you can see here in the slide which data types we support for incremental refresh now. In particular the date and timestamp fields drew a lot of interest from users, because thinking about data on a time basis is how a lot of people want to essentially think about their data, and refresh their Reflections. So to say, essentially, on a new day I’m only going to capture the most recent day and pull that data [inaudible] into incremental refresh, or to use timestamps as well to say, “Just since the last refresh, look at the timestamp and pull in the most recent data.” So these activities are now very easy to specify within Dremio. There’s a very simple UI to configure this, and it’s been a pretty heavily requested feature. So we’re happy to get it out.

Lucio Daza:
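The incremental refresh idea can be sketched in a few lines. Note this is a generic high-water-mark pattern under assumed names (`reflection`, `updated_at`), not Dremio’s actual code — in Dremio you configure this through the UI on any sortable column:

```python
# Sketch of timestamp-based incremental refresh (illustrative only).
# Only rows whose sortable refresh key is newer than the last refresh's
# watermark are scanned; previously materialized rows stay in place.
from datetime import datetime

reflection = []           # previously materialized rows stay in place
watermark = datetime.min  # highest refresh-key value seen so far

def incremental_refresh(source_rows, key="updated_at"):
    """Pull in only rows whose refresh key (BigInt, date, timestamp,
    or any other sortable type) is newer than the watermark."""
    global watermark
    fresh = [r for r in source_rows if r[key] > watermark]
    reflection.extend(fresh)
    if fresh:
        watermark = max(r[key] for r in fresh)
    return len(fresh)     # rows actually scanned into the Reflection

rows_day1 = [{"id": 1, "updated_at": datetime(2019, 5, 1)},
             {"id": 2, "updated_at": datetime(2019, 5, 2)}]
rows_day2 = rows_day1 + [{"id": 3, "updated_at": datetime(2019, 5, 3)}]

incremental_refresh(rows_day1)  # first refresh loads both rows
incremental_refresh(rows_day2)  # second refresh loads only the new row
```

The same watermark logic applies to a BigInt, date, or any other sortable refresh key: only rows past the high-water mark are scanned, so a 10-year history is never re-read.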
So I see here … Let me get a little picky here on what I’m seeing on the screen, because I see on the left hand side the BigInt, which you mentioned, and also timestamp, and so on and so forth. So is this affecting only one of the data sources that we support, or is this a change that has been applied across all the data sources that we can connect to?

Tom Fry:
So yeah, that’s a great question. I’m glad you brought that up. So this is across all of our different data sources. So whether this is a physical dataset that’s backed by Parquet files in S3, or you’re talking to a relational database like Oracle, you can basically use incremental refresh on all these columns across all of our data sources. It’s essentially tied to the concept of a virtual and physical dataset, and not to the sources themselves. So as we add future sources, these will apply to them as well.

Lucio Daza:
Great. All right, so quick question, and this is actually a question for the audience because we have another T-shirt to give. And you mentioned a couple of cool things here in this slide. So, then again Simon, congrats. Alex, congrats. The next person is going to win another T-shirt, and this person is going to tell me very quickly, within the next few seconds, no Googling, what does S3 in Amazon stand for? So let’s go ahead and give it a couple of seconds. What does S3 stand for in Amazon? Simple … Mike you got it. Simple Storage Service. So Mike, you got it. You’re the winner. Thank … Congratulations. All right so Tom, this is great. And I think this is going to … This is one of the … As I mentioned earlier, we have over 200 improvements, and this is one of the coolest ones. And now let me ask you this, I mean we’re talking about large scale data here. And now of course, we can connect to this data. How about performance? What are some of the things that we’re doing in terms of these large scale data use cases? What are we doing in terms of making all this efficient, and making sure that things work the way they should?

Tom Fry:
I’m glad you asked that. In addition to the incremental refresh enhancements that we just showed, which were heavily requested, we’ve made a lot of investments in terms of scaling the size and scope of a Dremio cluster, and the size of the workloads we can support. This is everything from the size of datasets to how advanced or how large SQL operations can be. And also thinking about expanding the [inaudible 00:35:07], the number of users that a Dremio cluster can support. There are actually numerous enhancements that we made in this release, and what we’re capturing here are some of the more impactful ones, but the improvements that we’ve made are not limited to just what you’re seeing on this slide. These are some of the ones that we found to be more interesting to talk about.

So the first is what we’re calling, Large Metadata. So if you think about working with large datasets, you’re not just storing all that data in one file. If you’re talking about hundreds of terabytes, or petabyte sized data, the data’s spread across many files, at a very large scale, and you might be talking about millions or tens of millions, or even hundreds of millions of files. And the thing is, when you’re working with a very large number of files, each file has a bit of information that needs to be stored in memory for that file. That includes the file’s name, basic attributes about it, some security attributes about it, you know kind of user access, et cetera. And because each file essentially consumes a little bit of memory, that places some constraints in terms of the number of files you can support.

If you’re going to look at supporting millions of files within Hive, for example, that’s many gigabytes of memory.
And as a result, what Dremio used to do is pull all that metadata into memory at once, and that actually became a limiting factor, where the limiting factor wasn’t so much the data itself, it was just the file information at that scale. So what we’ve done in Hive is essentially really broken up this process to be much more [inaudible 00:36:41]. We’ll work on partitions, and even files within a partition, in more [inaudible] batches. There’s no performance degradation at all by doing this, but what it enables us to do is to kind of work on the problem in chunks. So if you have petabyte sized data with tens of millions of files, we don’t have to do that all at once, we’re going to kind of be working on it in sequential chunks, and we can essentially really get to very large scale through that. So that was a major improvement, and we already have that running in production with one customer today.

Another major improvement was really thinking about more complex, larger SQL operations and SQL queries, and also how to handle them on larger Dremio clusters. So when we start to think about large SQL plans, and we start to think about what many Dremio clusters or nodes in a cluster are [inaudible], we identified multiple areas where we were actually being redundant with information. We might be sending the same part of a plan for different query fragments, or to the same executor multiple times. So we identified multiple different areas to create optimizations by essentially really normalizing our query plan, and then sending some information once even when it would span multiple query fragments. And by doing this we saw plan generation times being reduced by over 10x, and this isn’t on our internal kind of contrived examples, these are on real customer examples that we’re working with.
So by reducing our planning times through normalized query plans, we can handle much more complex SQL at much larger scale.

Another major improvement was user concurrency. We had some limitations before on our coordinator node in terms of the number of users we could support at a single time. What we did was rework our coordinator node in terms of how we think about pipelining, planning, and sequencing of operations, in order to not only handle higher concurrency but also more gracefully handle high-volume bursts. If you have 1,000 users that all submit a query at the exact same time, we’ll not only be able to handle 1,000 at a time now, we’ll actually handle that more gracefully as well.

We also made a variety of execution engine improvements. One of the things we wanted to lift out and highlight was higher-cardinality operations on variable-length fields such as strings. Before, in parts of our execution engine, you could only get to a certain cardinality level for GROUP BY columns on these types of fields. Now we’ve essentially made it unlimited, based on whatever memory or resources you’re able to throw at the problem. So this is just a small capture of the type of scaling improvements that we’ve made, but we wanted to highlight that this is definitely an area that we continue to invest heavily in, and we’ll continue to do so as well.Lucio Daza:
This is awesome, and equally mind-blowing to me that we have to be so concerned about dealing with the data about the data that we are about to start working with. And on that topic, I wanted to ask you something. You mentioned something pretty cool here in the large metadata quadrant that we have there on the slide. How many files can be supported in Hive now? And I believe you mentioned a production use case; is that something you can share a little bit about?Tom Fry:
Sure. So we currently have running workloads with tens of millions of files; that’s a real production workload in progress right now, and we’ve designed this to easily handle 10 or 100 times larger scale than that. So that easily gets you well through multi-petabyte-sized datasets today, out of the box.Lucio Daza:
Nice. That’s [inaudible 00:40:26]. So audience, attention, there is another question coming in; let’s go ahead and save that one for the next slide. So Tom, we talked about Azure connectivity. We talked about concurrency, scale, and performance. We also talked about Reflection enhancements. And I forgot to mention to the audience: Reflections is a very deep topic, and if you want to know more about data Reflections, we have this awesome class in Dremio University that you can enroll in. It’s free to do so, and it’s 100% dedicated to data Reflections. There is a lot of reading, a lot of theoretical information about what Reflections are. There is also a virtual lab that you can deploy to work with Dremio, and if you are like me and learn by practicing, you will be able to follow along in the exercises and take a look at how Dremio works with data Reflections. So we talked about all this cool stuff. Tom, beyond database performance, are there any other performance improvements that we have in store for customers in this release?Tom Fry:
Sure. So one of the things that we also tried to improve was our web UI when working with very large datasets. As users curate data, they look at a dataset and make transformations: maybe they join different tables together, change the data types of fields, add in … You can do a variety of transformations to a virtual dataset. We constantly update a preview of what that dataset would look like based upon the latest iterative changes that you’ve made. And this works great for the sample datasets in the system, and for smaller datasets. But when you get to very large datasets, even the preview can take several seconds.

So what we were finding, because people wanted to curate data in very large datasets, is that they would make a transformation and then the UI would essentially pause while it generated that preview. So users would have to pause for several seconds, make another change, and then the UI would pause again. This really only happens with very large scale datasets, but as we saw it becoming an issue, we wanted to unblock people and enable a more seamless experience. A lot of times users, when they curate data, already know: “I want to make these [inaudible] transformations in place. I want to do operation A, then B, then C, then D, and then I want to go look at the result.”

So what we implemented was a new dataset rendering engine that essentially works in the background, so users can iterate and make changes to their virtual datasets on the fly, and there are no blocking operations within the UI. The preview is then generated for the latest change you made. So if you went through and made five changes to your virtual dataset, then as you wait, the preview will catch up in a few seconds and show you the latest changes.
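That “latest change wins” behavior can be sketched roughly as follows; the class and method names are hypothetical, not Dremio’s actual UI internals:

```python
class PreviewRenderer:
    """Sketch of a 'latest change wins' preview engine: edits never
    block, and only the most recent dataset state gets rendered.
    (Illustrative names only; not Dremio's actual implementation.)"""

    def __init__(self):
        self.pending = None   # most recent requested state, if any
        self.renders = []     # previews actually computed

    def apply_edit(self, state):
        # Called on every user edit: record the latest state and
        # return immediately, so the UI never pauses.
        self.pending = state

    def tick(self):
        # Called whenever the renderer is free: render only the
        # latest state, silently skipping stale intermediates.
        if self.pending is not None:
            self.renders.append(self.pending)
            self.pending = None

r = PreviewRenderer()
for step in ["join", "rename", "cast", "filter", "sort"]:
    r.apply_edit(step)   # five rapid edits, none of them blocks
r.tick()                 # the preview catches up on the final state only
```

The key design point is that intermediate states are coalesced rather than queued: five rapid edits produce one preview of the final state instead of five sequential blocking renders.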
So when we looked at user workloads in terms of point-and-click type operations, we think this makes a much more seamless interface for users who are curating very large datasets.Lucio Daza:
That’s great. And now users can go ahead and edit their original SQL directly on the listing page, right?Tom Fry:
Yeah. So that was another thing. We made a variety of other little UI changes as well for common tasks, to reduce the number of clicks, and if you want to edit the original SQL directly, that’s more accessible now.Lucio Daza:
That’s awesome. So, all right, now a question for the audience, as we only have a few minutes left. The first person in the chat window to let us know what ORC stands for will get another T-shirt in the next couple of minutes, as I’m doing my transition here. What is ORC? Please let us know. The first person to let me know without Googling the acronym will win a shirt. All right. So let’s go ahead and continue. We have now functional improvements in … You got it, [Ram 00:44:34]. So we have functional improvements in 3.2. What can you tell us about that, Tom?Tom Fry:
Sure. So another improvement we have is expanded support for complex data types. Let’s talk about what complex data types are and why they’re important. Traditional relational databases only supported structured data, meaning you needed a schema defined by IT that was largely inflexible and difficult to change. Yet in the real world, users have data whose structure changes over time, and they need the flexibility to dynamically add or change fields, maybe without changing the underlying structure. This is really where complex data types come into play, because they allow multiple different data points to be specified within a single column. This could be done with a variety of different organizations, such as a list, a struct, or a map.

But fundamentally, what they all allow is for multiple different fields to be defined in a single column, and for people to be able to add new [inaudible] fields within that data structure. Prior to 3.2, Dremio supported reading complex types from Hive tables in Parquet format. And because of the importance of ORC and how prevalent it is in a lot of data lakes, we’ve expanded support for Hive tables backed by ORC files to include complex data types as well. What you’re seeing in the visual here is an example of using a variety of list and struct type fields, not only in the data that you’re reading, but in terms of specifying the data to filter on and push predicates down on.Lucio Daza:
That is great. So I mentioned to the audience at the beginning that we had over 200 improvements and features in our latest release; here we are listing the rest of the 199 that we didn’t talk about. So what can you tell us about the stuff that we didn’t get to, Tom? What can we tease our users with here?Tom Fry:
Sure. So there’s much more to the release than we could possibly cover within an hour. Our documentation and release notes are very specific about all the things that have been improved and added: bug fixes, new enhancements, et cetera. So I would definitely encourage people to go to our documentation and look at all the additional things that we’ve done. This is just a scattering of different items from that list of 200 that you mentioned. It’s everything from LDAP improvements, to improving operations in YARN environments. We’ve made SQL improvements in terms of the types of SQL language that we support. We’ve made improvements to over-the-wire encryption, in terms of the support that we have there. And a variety of other UI enhancements that we’ve implemented as well. So there’s a very long tail of improvements. One of the things that we strive to do with each release is to constantly hear from users how to improve the product and what would make it more usable, and I think that list of 200 is reflective of that.Lucio Daza:
This is great. I believe we have around 10 minutes to go. So before we go … And Tom, I want to thank you so much for your time and all this great information. As Tom mentioned, there is a ton of stuff that you can learn about this release, and not only this release but previous releases as well, that we didn’t touch on here. So go ahead and check the release notes. Also, if you haven’t done so, go to Dremio University and enroll. Dremio University is a project that we launched earlier this year. It’s free to enroll, and it’s free to take the classes that we have in there.

We have Dremio Fundamentals, if you’re not familiar with Dremio or if you want to polish your knowledge of Dremio. There is also Dremio Data Reflections, which as I mentioned a few minutes ago is a very deep, very technical topic, yet something very rich and very beneficial to know. So go ahead and take that class as well. And there is Dremio for Data Consumers. One of the cool things that you can do in Dremio University is launch a virtual lab and follow the exercises there without having to install anything. All you have to do is click a button and it will launch the lab for you for, I believe, 48 hours.

So you can go ahead and give it a try. Also, we haven’t mentioned this at the beginning or throughout the webinar, but Dremio is open source; it’s free to try and free to download. Go ahead and go to our Deploy page and, depending on your environment, get the template or download the binary, and you will be good to go. So, we have some cool questions. Obviously we will not be able to get to all of them; as I mentioned at the beginning, if we don’t get to your question please just hang in there, we will follow up with you over email. So the first question that we have here is from Juan, who is asking: do incremental refreshes only work on physical datasets, or also on virtual datasets?
Tom, do you want to take that one?Tom Fry:
Yeah. So the answer is yes, it will work on both. Depending on the dataset types that you specify, it should be able to work on both types. [crosstalk 00:50:23].Lucio Daza:
Great. Nice. Great [inaudible 00:50:26]. So the answer is, yes. There you go, Juan. So we have a couple more questions here. I believe this one is related to the Cloud storage … Or no, let me see if I’m getting this right. I think this is related to the pipelining. I guess I’m trying to paraphrase here: why does predictive pipelining technology matter? What is the benefit in terms of magnitude and latency, and so on?Tom Fry:

Sure, so the reason why it matters is that Cloud storage services in particular have much higher latency before you get that first bit of data from them, compared to a traditional local file system. So as I was talking about before, think about reading data. If your pipeline is mispredicting which data you want to pull, and you say, “Hey, wait. I pulled the wrong data. Here’s actually the data I want,” you have to kind of stop and wait for that bit of data to come to you. If that’s coming from a local file system, that pause is actually very short. But the reason why we really focused on this, especially for Azure and AWS, is that this type of pause is much more impactful in Cloud environments. As more and more of our customers are using Azure and AWS S3 as central to their organization’s data lake story, we wanted to make sure that we’re running as fast as possible on these Cloud storage services. This technology was really focused on making sure that we are essentially unblocked, and can run at full speed in those environments.Lucio Daza:
Great. Great. Thank you. And we have another question. I believe this is related to our deployment section, where we were talking about the Helm charts and the templates as well. So the question is: are the ARM templates suitable for production?Tom Fry:
So that’s a great question, and I’m glad somebody asked it. Both are suitable for production; that’s the easy answer. What we’re showing in the slides, and you can review it after the webinar is posted, is that with the Helm charts we’ve included a lot of operational hooks to make common IT tasks easy. So if you have a Kubernetes environment, we think our updated Helm charts will result in a simpler IT environment, with fewer or easier administrative tasks; it’s easy to specify what you want to do when you want to scale a cluster up or down. For example, with the Helm charts you can set up scripts that say, “On Monday morning I want to double the size of my Dremio cluster, and on the weekends I want to shut it down.” These are the kinds of things that are very easy to automate with the Kubernetes Helm charts. But that’s really what the difference is. Both can be used for production environments; the difference is that the Helm charts come with a richer environment in terms of the types of enhancements that simplify some of the IT tasks. The ARM-based templates are also suitable for production environments; you just might have to take a little more manual effort for some of those types of activities.Lucio Daza:
Great. Great. All right, so I think we have time for one more question, and this one is about concurrency and scale. He says: you mentioned large metadata is supported for Hive; what about other sources?Tom Fry:
So that’s a great question. We implemented it first for Hive-based sources because we had some immediate demand there, but it’s really an approach that we should be taking across the board, and we have engineering plans; it’s on our roadmap. We’re going to be expanding the same type of support across all of our different storage sources. By doing that, we’ll be able to take that large metadata concept, thinking about tens of millions if not hundreds of millions of files, and expand it to all of our different storage sources. So that’s on the roadmap and we look forward to delivering it in a bit.Lucio Daza:
Wonderful. All right, so I think that is it for us today. Tom, thank you so much. This has been a great session with a lot of good information. I truly appreciate the time and all the technical detail that you provided today. I want to congratulate Alex, Simon, Mike, and Ram: you guys are going to get your shirts, and we’ll get in touch with you. Congrats on getting those answers right. And to all of you who are listening, a recording of this technical session will be available in the next couple of days on our website, as well as a transcript.

So if there is something that you need more clarification on, or if you dialed in later, you can go ahead and catch up on the stuff that you missed. If you want to see a demo of Dremio, join us every Friday for our live demo session. I host those personally; I will walk you through some of the use cases that we can tackle, explain a little bit of the UI, and show how we work through some of these use cases. So I look forward to seeing you there. Other than that, I appreciate everyone’s time. Thank you. Thank you, Tom. I hope everyone has a great day, and for those questions I didn’t get to, we’ll go ahead and follow up. Stay tuned for more news about Dremio. Have a wonderful day, everybody. Thank you.