44 minute read · May 21, 2019
Interactive Data Science and BI on the Hadoop Data Lake
Kelly Stirman · CMO and VP Strategy, Dremio
Webinar Transcript
Kelly Stirman: Okay, so let's get started. We want to talk today about this concept of interactive data science and BI on the Hadoop data lake, and the idea is… It is probably familiar to those of you who are working in the Hadoop stack. There's a lot of interest in data science, there's always been an interest in BI, but the prospect of truly interactive speed and access to the data in your Hadoop environment I think has been challenging for a number of years, and that's something we want to focus on today with Dremio, giving you an understanding of how our open-source product fits into the Hadoop stack and how it helps you deliver on this idea.

Okay, so just a little bit about Dremio if you're not familiar with us, just to give you a little bit of grounding, and I'll go through this very quickly. We were founded in 2015, we spent a couple of years building the platform, came out of stealth almost two years ago, and launched Dremio 1.0 a couple of Julys ago. When we started Dremio, we helped start another open-source project that you may have heard of called Apache Arrow. We'll talk more about Arrow a little bit later.

This is a very interesting and exciting project that I hope you will learn a little bit about today and follow going forward, because if you care at all about analytics, if you care about data science, you should care about Apache Arrow. It's become a really critical part of lots of different projects in the space, and it's been a very exciting thing to be a part of. It's grown from a few thousand downloads a month to over 3.5 million downloads a month currently, so it's really taken off in popularity, and a lot of that is driven by the data science communities.

We have a product that we call a data-as-a-service platform, and we can talk more about that later, but just as a quick summary, companies love the idea of everything as-a-service but they don't normally have that for their data. So how do you make data more service-oriented, something like microservices for data analytics, and make data and analytics more self-service no matter what tool you want to use? We're based in Santa Clara here in Silicon Valley, and what we talk about today and what you see is open-source, so this is something you can go and try out yourself right away. Try it out on your laptop, try it in your Hadoop cluster, try it in the cloud, wherever you like.

So, the leadership team here has a deep background in open-source, distributed systems and big data technology. Our two co-founders came out of MapR and started Dremio about four years ago. I ran strategy and a number of other functions at MongoDB, our engineering leadership came from AppDynamics, and we work with the pandas community and specifically Wes McKinney to make sure that consumers of data through Dremio have an exceptional experience. So if you're working in a data frame in Python or in R, we want to make sure that your experience consuming data through Dremio is really the best possible experience. Our investors include some real luminaries here in Silicon Valley: Lightspeed, Redpoint, Norwest and Cisco Ventures.
So just a little bit about us, and quickly how we think about the market: what's out there, what's going on, and what are some of the patterns that we observe? And I think you'll find that there's a lot of… This will seem familiar. So first of all, when we talk about BI, users of BI tools and data scientists, we refer to them as data consumers. These are people for whom access to data is a key part of their job, it's typically a part of their daily workflows, and even though people have different tools that they like, the common denominator is data and access to data.

And one thing we know from talking to companies, and I'd be surprised if this isn't the case at your company, is there's no one tool. Different teams, different individuals have a favorite tool, and so among your data scientists, some people prefer Python, some people prefer R, or Scala or some other language. And on the BI side you've got, you know, Tableau's very popular, increasingly Power BI is very popular, but you've still got lots of enclaves of MicroStrategy and BusinessObjects, and if you just sort of go looking around the corner into the dark recesses of your company, you're going to find pockets of different tools.

And so all of these different data consumers have their favorite tool and they want access to the data; meanwhile the data that they want is not in one place either. It starts in this… It's effectively born in some sort of an operational system that is deployed on a relational database, or it's deployed on a third-party application, or it's deployed on a NoSQL database, and so the question is how are you going to get the data from these systems where the data's born into the hands of the data consumers? So what does everybody do? Well, the first step for many companies now is… You know, the strategy is, "Well, let's get the data into the data lake as the initial step, let's first get it all in one place under a common set of infrastructure and technologies and security controls so we can begin to work with the data."

And the way that you move the data from those operational sources into the data lake is through some kind of an ETL tool, or a data pipeline that's written in a script of some kind that you've developed. You basically make a copy of the data from the operational source and put it into the data lake, and so for those of you deployed on Hadoop, that's HDFS and your Hadoop stack, whatever distribution of Hadoop you're using. If you're in the cloud, that could be something like ADLS or S3, an object store of some kind that makes it easy for you to put the data somewhere. So that's the first step.

The next step of course is, well, these tools are not designed to consume data from a file system or an object store. The tools all assume a SQL type of interface, and that's how they operate at their best, and that's where they're most functional.
So you begin perhaps to look at something like Hive, or if you're in the cloud maybe it's something like Athena, maybe you're looking at Presto, but there's some kind of a SQL interface that gives you access to the data that you have in the object store or file system.

Now, my experience is that in a lot of cases the functionality or the performance of those SQL interfaces is inadequate for these different tools and their particular needs, and so what you really find much more frequently is that the next step is that you take a subsection of the data, you know, a partition or an aggregated representation, or some slice of the overall dataset in your data lake, and you put it into a data warehouse or a data mart in a traditional relational system, so something like Redshift on AWS, or it could be Teradata.

If you have a large Teradata investment, it's extremely popular and hard to unwind yourself from Teradata because of its performance and security and workload management features. Or maybe it's something like Vertica, but you're taking some sample of the data in your data lake and putting it into a relational database to get the kind of SQL interface that you're looking for, and the way you do that is with ETL or another set of scripts, and you're making another copy of the data to put it into one of those data warehouses or data marts.

Now, almost always what we find is that the performance is still insufficient, and so what companies do next is they build things like cubes, or aggregation tables, or extracts for a particular tool, right? And it tends to be the case that the cube you build or the extract you build is specifically for one of those tools, and the way you do that is proprietary to that tool, so if you have a mix of three or four different tools, you might find yourself doing this three or four different ways to support those different tools and their needs. And so there's another kind of copy step and another thing to orchestrate in this larger pipeline, and another copy to govern and secure and destroy, et cetera, et cetera.

So you end up with this pretty complicated series of steps, and for most companies we talk to it's not three steps, it's something like 10 to 15 steps, with many copies of data along the way. It's at this point you begin to have the kind of interface and speed that your data consumers expect, but you've ended up in this really tricky situation where, okay, we fulfilled the requirements and we have this data to a point where the data consumer can use it, however we've made the data consumer now entirely dependent on IT. If they have a new piece of data that they want to make available, then their only option is to go to IT and begin a data engineering project, and that of course is something that takes weeks and months, probably quarters, to go from a new ticket to data that's available to the data consumer in their favorite tool.
And so that is really at the heart of the problem that we talk to companies about: this architecture, this process is very heavy, very expensive, has long lead times, and results in a situation where the data consumer is entirely dependent on IT. This is a picture I was drawing in the '90s as a data warehouse architect; the technologies have changed to a certain extent, but the basic idea and series of steps really hasn't changed, and so that's really why we started Dremio, to say "There's got to be a better way to do this."

And so when we look at what we set out to do when we started the company and started to build the product, it's, "Okay, let's accept that data consumers are going to use a mix of tools and there's never going to be one single tool, so we need to make sure that this solution that we provide works with all the different BI and data science tools." We effectively do that by embracing ANSI standard SQL and traditional protocols like ODBC, JDBC, and REST to make it so that pretty much any tool that can work with SQL over ODBC, JDBC, or REST can work with Dremio, and from the perspective of tools we just look like Oracle or SQL Server, or something like that.
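Just to make that connectivity point concrete, here's a minimal sketch of what a client connection might look like from Python over ODBC; the DSN name and credentials are placeholders for illustration, not the exact configuration discussed in the webinar.

```python
# Minimal sketch: connecting to Dremio over ODBC and sending standard SQL.
# "Dremio Connector" and the credentials are placeholders -- set them to
# match your own ODBC driver installation.
import pyodbc

conn = pyodbc.connect(
    "DSN=Dremio Connector;UID=your_user;PWD=your_password",
    autocommit=True,
)
cursor = conn.cursor()

# From the tool's point of view, Dremio looks like any other SQL database.
cursor.execute("SELECT 1")
print(cursor.fetchone())
conn.close()
```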
The second thing is that one of the reasons you end up with all these copies of the data and all the different underlying technologies is performance, getting to the interactive experience, which is essential for the data consumer to do their job effectively. So in Dremio we have approached this in a fundamentally different way that we call Data Reflections. It's a patented approach that allows us to optimize different types of [inaudible 00:13:15] workloads in a way that allows you to not have to build cubes or extracts or aggregation tables, or all those other kinds of things you do just to get the performance that you need, and so we talk about this as data acceleration, giving you 10x to 1,000x acceleration on the raw underlying data.

Next, we wanted this to be a solution that is designed to be self-service so that the data consumer can do much more of the work themselves instead of being so dependent on IT. We also recognize that there's an opportunity here for Dremio to add a lot of value in terms of security and governance. One of the downsides of moving data into things like file systems and object stores is that you don't have the kind of security controls that a relational database provides, things like fine-grained access controls and masking of sensitive data and so on. So those are capabilities that we've built into Dremio that give you those controls no matter what the underlying data source is.

We've built something that's deeply integrated with the Hadoop ecosystem, so we'll talk more about that later, but this is an application you can deploy in your Hadoop cluster very, very simply and build on all the investments that you already have there, and Dremio itself of course is open-source. So that's what Dremio is. If you go back to that picture with all the different data sources at the bottom and your data consumers with their favorite tools at the top, we've integrated a number of distinct technologies into one self-service, massively scalable platform.

So we talked… Let me just go over these very quickly. We talked a little bit about data acceleration, which is something that's essential to solve a problem like this, we talked a little bit about the security and governance controls, and we talked a little bit about self-service: making it so people can come in and build their own datasets, search a catalog and find the things that they're looking for, and collaborate with other people on their team. In the bottom-left, we know that most companies don't have an inventory of their data assets, so in Dremio we provide a searchable catalog of all data assets, both physical and virtual, and a semantic layer that lets teams describe the data on their own terms instead of being dependent on IT to describe the data assets.
We have the world's fastest SQL engine for the data lake as part of the Dremio solution, and we could spend hours talking about how that works, but this is a fundamentally new engine for the data lake that is built on Apache Arrow. And then we also recognize that most companies do not have all their data assets in the data lake. That may be the vision, but getting there is a long journey, and so Dremio is able to query not just the data in your data lake, but also relational databases and other sources outside of the lake, and even lets you join data outside of the data lake with data in the data lake, which is one of our most common use cases.

And that's all about letting you get to value faster, letting you solve analytical problems faster, because if the first step is, "Well, the first thing we need to do is to move all that data into the data lake," that might be a very long journey before you can even begin to ask a question. Wouldn't it be nice if you could just begin to explore the data wherever it is, before all the work of moving the data into the data lake? That's what this data virtualization capability is all about.
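To make that cross-source idea concrete, here's a hedged sketch of what a query joining data lake files with an external relational table might look like when submitted to Dremio from Python; the source names (hdfs_lake, postgres_crm), schemas, and columns are hypothetical examples, not taken from the webinar.

```python
# Hedged sketch: one SQL statement joining a dataset in the data lake with a
# table in an external Postgres source, executed by Dremio. All source,
# table, and column names below are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Dremio Connector;UID=your_user;PWD=your_password",
                      autocommit=True)

query = """
SELECT c.region,
       SUM(o.order_total) AS revenue
FROM   hdfs_lake.sales.orders o             -- files in HDFS / the data lake
JOIN   postgres_crm.public.customers c      -- a table in an external database
  ON   o.customer_id = c.customer_id
GROUP  BY c.region
"""

for row in conn.cursor().execute(query):
    print(row.region, row.revenue)
```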
So that's Dremio in a nutshell: a solution that runs between all the tools of your data consumers and wherever your data happens to be. Dremio is not making copies of the data, it is not another silo or repository. It is a process that you can deploy directly in your Hadoop cluster, provisioned and managed by YARN, and then your tools are going to connect to Dremio over ODBC, JDBC, or REST and send standard SQL queries. Your users are going to log into Dremio through a browser and have this really nice self-service experience, and we'll take a look at a demonstration of the product in just a couple of minutes. But just to give you some grounding, that is at the heart of what Dremio is at a very high level.

Okay, so what are people doing with Dremio? Let's talk a little bit about that. Most of Dremio's customers are Global 2000 large companies with lots of data, lots of different environments, and significant investments in their data lake. So what we see is things like, "Hey, I'd really love to drive more of my BI and data science workloads onto my Hadoop cluster, onto my data lake," and that's one of the things that Dremio helps our customers do: drive and consolidate more of those workloads into the infrastructure powering your Hadoop cluster.

Another is opening up the data lake to a wider range of users. Traditionally you have to be a software engineer to really use Hadoop; there isn't really a nice, simple way for a Tableau user to take advantage of the power of the Hadoop cluster. With Dremio you have this terrific self-service experience that lets people go and take advantage of the Hadoop investments you've made directly, without IT being the intermediary.

Another is offloading of analytics from operational systems and the enterprise data warehouse. Even if you have a Teradata data warehouse, or an Exadata data warehouse, in many cases those systems are at capacity, and companies are looking for ways to reduce the workloads on those systems, to offload some of the work. With Dremio you can basically query those systems directly and let Dremio do all of the heavy lifting in terms of the analytics, and what that lets you do is minimize the amount of work that is being driven to those legacy systems.

Another is companies that are re-platforming to cloud. We're talking today about how Dremio works in your Hadoop cluster, but there really are no Hadoop dependencies, so this is something that you can begin with your Hadoop cluster, and if part of your strategy over the next few years is to move to AWS or Azure, then you can think about Dremio as being a part of your strategy that works in the near term and the long term.

And then finally we have a number of companies that are looking to retire legacy data warehouses and data marts and basically re-deploy those workloads entirely into their data lake. Dremio makes that a possibility because again we provide the performance and the security controls and the workload management features that are really required for true production workloads, and those are capabilities that I think aren't quite sophisticated enough in Hadoop.

So there are lots of good outcomes from these kinds of projects in terms of increased speed, faster time to value, lower overhead costs, consolidation of infrastructure and resources, and many other great things. And what we see with most companies is that a Dremio project begins with a particular workload in mind, and then over time they find more and more reasons to drive their workloads to their Hadoop cluster, and to Dremio running in their Hadoop cluster. So that's a little bit about what people are doing with Dremio and how they're benefiting from the technology.

So I want to talk for a minute here about Apache Arrow, and again I'll get to a demonstration in about five or ten minutes to kind of wrap up the session today, and then get to some of your questions. As a reminder, there's a Q&A feature here, so please, if you have questions, ask them through the Q&A and we'll get to them in just a couple of minutes.
So let me talk a little bit about Apache Arrow. What is Apache Arrow? At a high level, it's two main things. First of all, Apache Arrow is a specification for a columnar in-memory format. Most of you are familiar, I'm sure, with on-disk formats like Parquet and ORC, and both of those formats have a lot in common, but one of the things they have in common is they're both columnar, and we learned more than 10 years ago that in terms of analytics there are huge advantages to organizing the data in columns instead of rows. If you're doing write-intensive workloads, of course, being row-oriented is very advantageous, but for analytics columns are the way to go.

And Parquet and ORC give you a way to organize data on-disk so that, if you have the right SQL engine, you can get MPP-type performance out of an open-source solution using commodity infrastructure, which is pretty exciting. What has been lacking in the industry is a standard for how the data is represented in-memory, and of course these days most analytics is optimized for in-memory processing. But when you looked at different technologies, whether that was a SQL engine or a data frame, or something like Spark, everyone had their own proprietary way of organizing and representing data in-memory, and in many cases it wasn't columnar, it was row-oriented. So while you got advantages in terms of IO efficiency by storing data in columns, you lost those advantages by moving to a row-oriented format for doing analytics in-memory, and if anything the resources are more scarce in-memory in terms of RAM and on-CPU memory, and CPU cycles or GPU cycles, right?

So the idea of… Or where things began with Arrow is: why don't we, instead of reinventing the wheel every time, come together as a group of communities, and this is about a dozen different communities, and define a standard that we can all implement and work on together, so that we can all benefit from one common best way to organize data for in-memory analytics that is extremely efficient in terms of CPU, GPU and RAM, and well-tested and well-documented, et cetera. So that is how Arrow began, as "Let's define a specification." And then what you have in addition to the specification are implementations of libraries in over 10 programming languages that let you access and operate on those in-memory data structures, and over time the Arrow project has grown to include not just these implementations of the specification, but also libraries of functions that give you optimal implementations of common things like finding distinct values, or performing certain sorts of calculations.

And so what you see today is there are dozens of different open-source technologies that are using Arrow. In Dremio we use Arrow extensively; everything we do in-memory and our entire SQL engine is based on Arrow. Spark is using Arrow, Python uses Arrow, so the fastest way to read data from Parquet into a data frame is using Arrow, and that's part of what you're doing now if you're using pandas, and there are a number of other applications out there, and the number of technologies using Arrow continues to grow.
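As a small, hedged illustration of that Parquet-to-data-frame path with pyarrow (the file path below is a placeholder):

```python
# Minimal sketch: reading Parquet into Arrow's columnar in-memory format and
# then handing it to pandas. The file path is a placeholder.
import pyarrow.parquet as pq

table = pq.read_table("taxi_rides.parquet")  # an Arrow Table, columnar in memory
df = table.to_pandas()                       # convert the columns to a DataFrame
print(df.head())
```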
So now you see this, even just looking in the Python community. This chart is a little outdated, and it looks like a cumulative chart, but it's not, those are monthly totals, so it's really taken off. Last month was about 3.5 million downloads.

So that's at a high level what Arrow is all about. Let's talk a little bit about where Arrow is headed beyond the kind of current state of [inaudible 00:26:50]. So one of the things that you have now is, okay, if I have this data in Arrow, in an Arrow data structure, how do I generate the most efficient possible way of operating on those Arrow data structures?
There's a project called Gandiva that was donated to the Arrow community last year by Dremio. It uses LLVM to dynamically compile arbitrary expressions into optimal machine code for the particular environment where Arrow is deployed. In Dremio's case, what this lets us do is take very large, complicated SQL queries, potentially hundreds of lines with complex case statements and branching logic, SQL that is non-trivial to optimize, and compile them into very, very efficient machine code at query time using this Gandiva compiler that's now part of Arrow.

And this can have huge performance advantages; some of our customers have seen, on their large queries, anywhere from a 5x to an 80x performance speed-up by using this new compiler. And one of the exciting things about making this part of the Arrow project is you now have the community developing user-defined functions and other low-level operators that everyone can take advantage of to do really powerful, interesting things with the Arrow data structure.

Then the second thing is this concept called Arrow Flight, which we have been collaborating on with other members of the Arrow community, in particular Ursa Labs and Two Sigma, to rethink the way applications consume data for analytics. For over 30 years we've been using the protocols of ODBC and JDBC, which were designed for small dataset sizes and for row-oriented applications, and Arrow Flight is a complete rethinking of how to interact with a remote system from a client technology. Think about something like a BI tool, or something like a data frame in your favorite language: how do you request a payload of data from a server, and what's the most efficient way to do that?

Well, if you look at how that's implemented in ODBC and JDBC, it actually happens on a cell-by-cell basis, so if you have a result set with a million rows and 10 columns of data, you have 10 million function invocations to access each cell of data happening behind the scenes to make ODBC and JDBC work. Arrow Flight is a completely different way of doing this, part of the Arrow project, that defines a protocol for exchanging data and operates at the column level.

And what we've seen in early work in this project is things like a Jupyter notebook requesting data from Dremio being anywhere from 100 to 1,000 times more efficient in exchanging data over the wire between the two technologies. This is a very exciting part of the project, and as I mentioned earlier, if you care about data science, if you care about the cutting edge of infrastructure and efficiency for large workloads, Arrow Flight is really charting new territory and doing some big, impactful, exciting things, and it's something we're excited and proud to be a part of.
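To give a feel for the client side of Flight, here's a minimal, hedged sketch of fetching a result set with pyarrow's Flight client. The endpoint, the descriptor, and the absence of authentication are simplifying assumptions for illustration; a real server such as Dremio layers its own handshake and descriptors on top of this basic pattern.

```python
# Hedged sketch: pulling a dataset over Arrow Flight into pandas.
# The endpoint and the way the ticket is obtained are placeholders.
import pyarrow.flight as flight

client = flight.FlightClient("grpc://localhost:32010")

# Ask the server how to fetch the dataset described by this path.
descriptor = flight.FlightDescriptor.for_path("example_dataset")
info = client.get_flight_info(descriptor)

# Stream the columnar batches and materialize them as a pandas DataFrame.
reader = client.do_get(info.endpoints[0].ticket)
df = reader.read_all().to_pandas()
print(df.shape)
```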
So that's a little bit about the future of Arrow, and again, we helped get Arrow started in the beginning, going back to when we started the company. It's a large community now with hundreds of active contributors, and it's grown very, very quickly in adoption. I think it's something to really watch, and it's very core to the whole Dremio architecture.

So I'll tell you about one customer, TransUnion, who has a multi-petabyte scale data lake deployed on Hadoop. They have huge volumes of data, and most queries operate on tens or hundreds of terabytes at a time. They have deployed Dremio in that data lake and are using Data Reflections to accelerate data for data science and BI workloads. It's been a great partnership working with them, and one of the really nice things they've been able to do is dramatically reduce the number of data engineers needed just to keep the system functional and meeting the SLAs of the business. They've gone from 14 or 15 full-time data engineers to a small fraction of that, while keeping the system operational and exceeding the SLAs that they'd set out originally with the project. So that's just a little bit about one customer and how they're using Dremio.
And for those of you wondering, "Well, I'm using a SQL engine in my Hadoop deployment, how does this relate to that?" Whether you're using Hive LLAP, or Drill, or Impala, one of the pieces of Dremio's architecture is this fastest SQL engine for the data lake, and don't take our word for it, it's open-source, go try it out yourself. This is a TPC-H benchmark, done by Cisco, comparing Dremio to the latest Hive LLAP, and it shows Dremio being on average anywhere from five to 150 times faster than Hive LLAP, and delivering sub-second latency on a variety of different workloads. The way we think about it is that Hive is still an important part of the solution for doing big, heavy-lifting ETL jobs, but when you want the interactive performance and all the other great things that Dremio brings to the picture, then you can certainly make use of the two things together.

So with that, as I promised, I want to get to a quick demonstration, just to show you what Dremio looks like. This is a small four-node cluster operating on data, and the client application in this case will be Tableau. So this is what it looks like to log into Dremio through a browser, so this is the experience as a data consumer. You may already have a dashboard. In your experience with Dremio, you may not log in, right? You may just go into Tableau, click a refresh and get a really fast update to your dashboard. But for people who are maybe building a new dashboard, or building a new notebook in Jupyter, that journey might begin with the first question, which is: where is the data that I need for this particular job that I'm taking on?

And so one of the things Dremio does when it connects to different back ends, whether that's your data lake or systems external to your data lake, is build a catalog of information and make that searchable, so it's easy for users to find different datasets, because most companies don't have an inventory of their data assets, and we think it's important that a data consumer can easily find data no matter where you have it in your company.
So let's say I've been assigned to do some analysis of taxi rides in New York City. Many of you will be familiar with this dataset, it's a great public dataset. So my journey might begin by just doing a Google-style keyword search, and I know that the taxis have tips, so I could just put in the keyword "Tip," and what I get back are these search results, and each of these search results corresponds to a dataset that Dremio knows about. Purple means it's a physical dataset. Green means it's a virtual dataset. And from here I can see tags, so Dremio allows users to assign tags to different datasets so they can apply a canonical vocabulary to organize their different datasets. You can preview the schema of this dataset here and see the different columns and data types available. You can also see how many jobs have been run on this virtual dataset, so this one seems to be pretty popular, it's got over 4,000 jobs. But what we probably want to do is inspect the data, so what does this actually look like?

So I can click on the result to get a quick preview of the data, and here, in an interface that feels a little bit like Excel, I can see the… Let me, sorry, turn this off for a second. The names of the columns, and this little icon to denote the data type. From this sample I can visually inspect the data to see if this is what I'm looking for. Now, I can also look at the catalog entry for this dataset, and here I have a wiki page that describes maybe different reports that use this dataset, information about some of the fields, the total size of the data, so this is on the order of half a terabyte of data, a description, information on who to contact if I have questions, and then a list of all the fields available in the dataset. And this is a wiki page, so you can grant users of Dremio the ability to come in here and edit this page directly, and update it and add information as they see fit.

Now finally, the last thing I could do from here is see the graph, and what the graph tells me is the relationship of this particular dataset to its physical source, in this case a data lake running on ADLS, as well as its child datasets. So one of the common things we see is Dremio lets you build your own virtual datasets, and lets you do that very easily without writing code, and everyone wants their own version of the data, right? People want to name the columns a certain way, they want to add calculated columns, they want to join different datasets together or filter datasets in different ways, and we think that's great, and so in Dremio you can do that very easily and virtually, without making copies of the data.

And as this happens, it's, in my opinion, a little bit like how PowerPoint is used in the enterprise, right? Nobody builds a presentation by starting with a blank PowerPoint presentation. Everybody takes a presentation and then adds some slides, edits some slides, removes some slides, and that's how they create presentations. Same thing with datasets.
You're going to take an existing dataset, make some of your own changes, share it with a colleague, and that world of creating and sharing and collaborating all takes place in Dremio, and we track the lineage and provenance of the data automatically behind the scenes. That's what this graph view is for: understanding the larger context of this dataset.

So between the graph and the catalog and the visual sample of the data, I have a very good sense of what this data is and whether it's exactly what I'm looking for. And if it's not, maybe I'll go and create a new dataset based on this dataset. We aren't going to have time to get into that today, but that is another part of the journey that we've designed Dremio for the data consumer to do on their own.

But let's say from here I like this dataset and I want to launch Tableau connected over ODBC to this dataset. So Tableau's going to launch, connected using standard ODBC to Dremio. I'm going to log in to Dremio using my Active Directory or LDAP credentials, and let's just see what we have to work with here. I said before it's about half a terabyte of data, so it's about a billion rows, and these are CSV files stored in the data lake, right? And so from here I can start to work with the data. Now, keep in mind I have no extract behind the scenes. This is SQL going from… Sorry, coming from Tableau over ODBC to Dremio, and each of these queries is being executed dynamically to give you this nice, fast, interactive experience on the data.
And what's nice here is all the features that I'm accustomed to in Tableau are available to me. There's not some limited subset that only works in certain cases; pretty much everything you can do in Tableau, or are used to doing with Oracle or SQL Server, you can do here. So there, very quickly, in five or six clicks, I've been able to see that the number of taxi rides year over year is relatively flat and unchanged across these billion rows of data, but I can see that on average people are paying significantly more per year, and that the tip is a big part of that: the tip amount has gone up almost 3x since 2008. And I could look at seasonal variations, when's the best time of year to be a taxi driver, and see that in the colder months, September through December, people tend to tip more, but then in January and February after the holidays people go back to tipping less, even though it's still cold outside.

So again, all these clicks are SQL queries running through Dremio and being executed in less than a second. This is a small four-node cluster, and running these same queries through Hive takes seven or eight minutes per click, so it's about a thousand times faster in this case than running the same queries through Hive. This is Dremio deployed as a YARN application in that Hadoop cluster, giving you this nice, fast, interactive experience.
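To give a sense of the kind of SQL those Tableau clicks turn into (something like the tips-by-year view above), here's a hedged sketch of an aggregate query sent to Dremio over ODBC from Python. The dataset path and column names follow the public NYC taxi data (tpep_pickup_datetime, tip_amount), but the exact schema used in the demo is an assumption.

```python
# Hedged sketch: the kind of aggregate SQL a BI click generates, sent to
# Dremio over ODBC. Dataset path and column names are assumptions based on
# the public NYC taxi data, not the exact demo schema.
import pyodbc

conn = pyodbc.connect("DSN=Dremio Connector;UID=your_user;PWD=your_password",
                      autocommit=True)

sql = """
SELECT EXTRACT(YEAR FROM tpep_pickup_datetime) AS ride_year,
       AVG(tip_amount)                         AS avg_tip
FROM   lake.nyc_taxi.trips
GROUP  BY EXTRACT(YEAR FROM tpep_pickup_datetime)
ORDER  BY ride_year
"""

for ride_year, avg_tip in conn.cursor().execute(sql):
    print(ride_year, round(avg_tip, 2))
```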
So that's part of the journey that we expect data consumers to go through. Just going back to the presentation here… Sorry. We have another scenario that, if you're interested, we'd be happy to follow up with you on: the experience of a data engineer provisioning new datasets on demand, and the row- and column-level access control and masking of sensitive data that Dremio provides on data in any source.

Just quickly, to talk about how this thing is deployed, let's go back to this picture that we drew before with data consumers and their favorite tools. They're connecting to Dremio over ODBC, JDBC, or REST, and hopefully later this year they'll have the option of going over Arrow Flight. We actually demonstrated that at Strata a few weeks ago: the difference between pyODBC and Arrow Flight for a pandas data frame accessing a billion rows of data was going from two and a half hours over ODBC to under three minutes through Arrow Flight. So we hope that will be an option later this year.

So here is your Dremio cluster, which is deployed and orchestrated via YARN. You install one Dremio node on an edge node and bring it up, and then you can basically communicate with the YARN ResourceManager to provision as many instances of Dremio as you like, with a certain number of cores and memory per instance, and you can have that associated with a YARN queue so you can allocate resources as appropriate. The coordinator nodes are what your client tools connect to, and you're typically going to have three to five coordinator nodes in a cluster, and then some number of executor nodes, in many cases executor nodes running on every node in your Hadoop cluster. We have some companies running 10 executor nodes, and we have some companies running over 1,000 executor nodes, but all of that fits nicely into your Hadoop architecture.

And then Dremio is reading the files in HDFS directly with its own readers; we have the world's fastest readers for Parquet and ORC, vectorized readers that read into Arrow data structures directly. And then, as I mentioned before, we can also query systems outside of the data lake, so we can query Postgres and Oracle, and Teradata and SQL Server, and other systems, and let you query those systems directly or join between data in those systems and your data lake. And the Data Reflections that I mentioned before are stored in HDFS, inside of your Hadoop deployment.

So that's a little bit about how Dremio is architected and how it fits in, and just to run through the quick checklist: first-class support for HDFS and MapR FS. To the extent you have a lot of information in your Hive Metastore, you can connect Dremio to the Hive Metastore and we pick up all that schema information and make it available in the catalog where people can search and find things. We support Kerberos and impersonation of users in your data lake. We support YARN. We have the world's fastest Parquet and ORC readers that I mentioned before. We support Apache Ranger and we support ZooKeeper. So as I mentioned at the beginning, this is all very tightly integrated into your Hadoop stack, and works very cleanly and nicely in that kind of an architecture.

So now I want to get to some of the questions that have been submitted here, and I really appreciate all the great questions. I don't think we'll have a chance to get through all of them, but I want to try to get through a few, and if you have a question please submit it through Q&A; if we don't get to it we'll get back to you over email. So the first question here is, "Do you support ADLS Gen2?" The quick answer is yes, actually as of this week we support ADLS Gen2.
Now, today we've been focused on Dremio in the Hadoop data lake. As I mentioned at the top of the call, we have no Hadoop dependencies, so to the extent your data lake is in Azure or in S3, Dremio can work with data in those object stores as well as data in other systems that you have in the cloud or on-prem, and give you this great experience: fast access to data, the catalog, the ability to build and provision new datasets, lineage and provenance of the data, row- and column-level access control. All of that works whether you're using Hadoop or not, so you have lots of options, and I think that's important because many companies are re-platforming to cloud, and in some of those cases Hadoop itself is going to be a part of their strategy, while in other cases they're looking to move their analytics onto the services available in the cloud, and Dremio works both ways.

So that's the first question; hopefully you feel good about that. We support ADLS Gen2, ADLS Gen1, and Azure Blob Storage. The next question I want to look at is, "How is performance impacted in the example of the taxi data?" I'm paraphrasing the question a little bit here. "If it's a billion rows of data and I apply a filter from Tableau, is the filter executed in Dremio or the database?" In this example you have CSV files stored in HDFS, right? So there is no database to apply the filter; you need a SQL engine to apply the filter, and in Dremio that SQL engine is part of the solution. So if you apply a filter in Tableau to say, "Look, I want to look at a particular year of data," or "I want to look at all the car rides that have more than two or three passengers," then Dremio will apply that filter in its SQL engine and return the results.

And I think part of what this question is getting at is, "You showed me these fast queries on the whole dataset, well, what if I want to drill into the dataset in some way, do I still have the same performance?" And the answer is yes, you're able to perform this sort of aggregate-level analysis on the data at very, very high performance without losing functionality. And I want to also bring up an interesting point, which the person isn't asking explicitly, but I think the next natural question is, "Okay, you showed me fast group-by type queries, what about when I have a needle-in-the-haystack query, like bring me the 100 most expensive taxi rides over a three-year period that had more than two passengers?" There you're expecting 100 rows of results, not an aggregation of the data.

Well, we have a way of optimizing those kinds of queries as well, and that's the kind of thing you can't do in a cube, right? In cubes you lose row-level fidelity, so in Dremio you have the ability to optimize queries that are these kind of group-by aggregation style queries, as well as the needle-in-the-haystack queries, and our Data Reflections solve both of those.
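For illustration, here's a hedged sketch of the two query shapes just described, written as SQL strings in Python; the dataset path and column names are assumptions based on the public NYC taxi data, not the demo schema.

```python
# Hedged sketch of the two query shapes discussed above. Dataset path and
# column names are assumptions, not the actual demo schema.

# 1. Aggregation-style ("group-by") query.
aggregate_query = """
SELECT passenger_count, AVG(tip_amount) AS avg_tip
FROM   lake.nyc_taxi.trips
GROUP  BY passenger_count
"""

# 2. "Needle in the haystack" query that needs row-level fidelity --
#    something a pre-built cube cannot answer.
needle_query = """
SELECT *
FROM   lake.nyc_taxi.trips
WHERE  passenger_count > 2
  AND  tpep_pickup_datetime BETWEEN DATE '2009-01-01' AND DATE '2011-12-31'
ORDER  BY total_amount DESC
LIMIT  100
"""
```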
Okay, so the next question here is, "Could you please comment on the possibilities for setting up column-based and row-based security in order to secure the data for authorized users?" Yes, I'm happy to comment on that, and you can also read about it in our documentation; if you search on Dremio "column-level access control" or something like that, I'm sure you'll find it.

So, in Dremio every query has a concept of an external user variable, so when a query is submitted to Dremio there is implicitly a user submitting it, and in the virtual dataset that you're using to access the data you can say, basically in something like a case statement, if the user is this user, Bob or Jane, or if the user is part of this LDAP group, then they can see this column, otherwise they don't see the column.

And that's also how we handle masking of sensitive data: same sort of logic, if the user falls into this list of users or groups then they can see the unmasked data, otherwise mask the data with this regular expression. So you have that ability to handle column-level access, and then you have a similar approach to handling row-level access, but that is managed in the WHERE clause. There are examples of both in our documentation, and all of this is applied dynamically at query time.

So as users change group membership, that is always reflected in the next query, and the masked representation of the data is not a copy of the data, it's always applied dynamically at query time. Hopefully that gives you a bit of an answer there, and by the way, this works on any data that Dremio can access, so whether it's in your data lake or in an external source, you have these fine-grained access controls at your disposal.

Okay, next question, and this may be the last question I have time for… Actually I think I can get to two more questions. This one is, "In cases where your data sources contain very cryptic and not user-friendly table and field names, would you perform the transformation to user-friendly names in Dremio or would you recommend this mapping to be done at the source?" Well, I think this is one of the really nice things about Dremio: we have this concept of a virtual dataset, and if you're familiar with the idea of a view in a relational database, it's very similar in that you can change the names to be whatever you want. You can add calculated fields, you can apply filters. From the perspective of a user it is a first-class relational object, just like a table: you can join it to other datasets, you can apply SQL expressions on that virtual dataset, and it's in the virtual dataset that, in my opinion, you should rename the columns to be whatever you think they should be, as well as letting users name things however they want to name them.
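Pulling the last two answers together, here's a hedged sketch of the SQL behind a virtual dataset that renames cryptic source columns, masks a sensitive field for non-privileged users, and restricts rows in the WHERE clause. The function names query_user() and is_member() follow the pattern Dremio documents for this kind of access control, but treat the exact syntax, sources, and columns here as assumptions rather than a copy of anything shown in the webinar.

```python
# Hedged sketch: the SQL behind a virtual dataset combining friendly column
# names, column masking, and row-level access. Sources, columns, and group
# names are hypothetical; query_user()/is_member() are per Dremio's
# documented pattern, but verify against your version.
virtual_dataset_sql = """
SELECT cust_nm                                   AS customer_name,
       CASE
         WHEN query_user() IN ('bob', 'jane')
           OR is_member('finance') THEN ssn_no   -- unmasked for these users
         ELSE '***-**-****'                      -- masked for everyone else
       END                                       AS ssn,
       ord_amt                                   AS order_amount
FROM   src_db.sales.ord_fct
WHERE  is_member('global_sales')                 -- row-level access in WHERE
   OR  (is_member('emea_sales') AND region = 'EMEA')
"""
```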
You can of course have your own kind of common data model that everyone uses, but then you can allow users to build on top of that and have their own sense of the world, and I think that's a nice set of options.

The last question I have time for… I'll pick a quick, easy one: "Do you offer training?" If you're interested in learning more about Dremio or trying it out, of course it's open-source, you can just try it, but we also have this really nice thing called Dremio University. It's free, it will provision a Dremio instance on your behalf, and for the duration of the course you have a private instance that you can do whatever you like with; it's our Enterprise Edition. There are currently three courses available and we're adding new courses every month, so please take a look at Dremio University if you're interested in getting hands-on with the product and learning more about how it works.

Okay, I really appreciate it… For those of you whose questions I didn't get to, and there are a bunch of them here, we will get back to you over email. Thank you for attending today; we will also be sending you a recording and transcript of the webinar. I appreciate your patience at the beginning as well. Hope you have a terrific day, and I look forward to seeing you out there in the Dremio community. Take care, bye bye.