48-minute read · November 10, 2020
A Modern Architecture for Interactive Analytics on AWS Data Lakes
· Roy Hasson, Sr. Manager, Business Development - Analytics and Data Lakes, AWS
· Stephen Faig, Research Director, Unisphere Research and DBTA
· Gabe Jakobson, Senior Solutions Architect, Dremio
Session Abstract
Built upon cost-efficient cloud object stores such as Amazon S3, cloud data lakes benefit from an open and loosely coupled architecture that minimizes the risk of vendor lock-in as well as the risk of being locked out of future innovation. However, the many benefits of cloud data lakes are negated if data is duplicated into a data warehouse and then again into cubes, BI extracts, and aggregation tables. Because of this, many organizations are now striving to find the right balance between their data warehouse and data lake investments. During this webinar, we’ll discuss how to find and best implement that balance for your organization. We’ll also provide a live demo that shows how Dremio and AWS Glue make it possible to run BI workloads directly on the S3 data lake. You’ll learn:
- Which BI and data science workloads are a better fit for cloud data lakes
- How to ensure your data architecture meets the needs of both your data teams and analysts
- Techniques for accelerating analytics queries on your S3 cloud data lake
- How Dremio and AWS enable you to get maximum value from your cloud data lake
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Opening
Stephen Faig:
Welcome to today’s webinar brought to you by Dremio and AWS. I’m Stephen Faig, Director of Database Trends and Applications at Unisphere Research. I will be your host for today’s broadcast. Our presentation today is titled, “A Modern Architecture for Interactive Analytics on AWS Data Lakes.” Before we begin, I want to explain how you can be a part of this broadcast. There will be a question and answer session. If you have a question during the presentation, just type it into the question box provided and click on the submit button. We’ll try to get to as many questions as possible, but if your question is not selected during the show, you will receive an email response. Now to introduce our speakers for today: we have Roy Hasson, Senior Manager of Business Development, Analytics, and Data Lakes at AWS, and Gabe Jakobson, Senior Solutions Architect at Dremio. Now I’m going to pass the event over to Roy.
Serving your user and the business
Roy Hasson:
Hi, everybody. Again, thank you very much for the introduction. Just quickly, my name is Roy Hasson. I lead the analytics specialist team at AWS. I’ve been working with a lot of different customers around building data lakes and modern data architectures for their organizations, to really help them become more data-driven organizations.
So, in talking to our customers, the problems and challenges they describe around why they want to build a data lake, and the kind of value they want to get out of it, really center around two main areas. I’ve tried to present this here in a slide. The first one is: how do I let my users—my personas—be more self-sufficient, find the right tool, and just be very efficient? Just get the work done without having to rely on too many other people. So that’s one aspect of it. The other aspect is from an infrastructure perspective. We still want to make sure that we’re following best practices, that we’re not building ourselves into a corner, and that we’re using technologies that are scalable, easy to use, and [can] ultimately deliver the features our personas really need.
When we’re talking to these different personas, they obviously have different requirements. But when you talk to the infrastructure or IT team, they want to make sure there’s no data duplication, which caused the original data silo problem. [They want] tools that are easy to use, so the infrastructure team doesn’t have to spend a lot of time benchmarking, configuring, tuning, testing, and monitoring. [They want everything to just work,] right? And also, it needs to be scalable. So as you add more data, as more users come in to use the platform, the platform just automatically scales with it. So those are the two big challenges that we’re going to try to address with a data lake approach.
The Data Lake – Start with storage
Roy Hasson:
So just kind of taking this step by step, we typically start with a data lake. We start with Amazon S3 as our storage layer. It’s scalable, highly durable, highly available, and has great integration with APIs and all the other services out there. So it’s a great place to put the data. Now, the big benefit here is we reduce duplication. We’ve got a central location for all of our data. There are no more silos. We can manage it in a central place through policies that are managed by a central team. Or even if we distribute it across different teams, it’s still basically one platform. Like I said, it’s well integrated, and it’s really cost-effective, right? There are a lot of features that help you do intelligent tiering—move hot data to less expensive storage as it becomes colder, so you can save money. You can also put data in compressed formats and save there as well. So there are a lot of options [here]. So that’s a great platform for us to store our data.
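As an aside for readers who want to try this outside the webinar: a minimal sketch of the kind of lifecycle tiering rule Roy is describing, written with boto3, might look like the following. The bucket name, prefix, and transition windows are placeholders, not values from the talk.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix, used only for illustration.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                # Let S3 move objects to cheaper storage classes as they cool down.
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```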
The Data Lake – Catalog and Metadata
Roy Hasson:
The next thing that we want to do is, once we have the data in S3, for this data lake to actually be meaningful to you, we want to be able to catalog it, right? Extract the schemas, extract the information about the data and the files into a single, yet decoupled, layer. We call this the decoupled metadata layer. We have a service called the AWS Glue Data Catalog that’s Hive-compatible, which means it’s going to work with the same big data engines that you know and love, like Spark and Presto and Hive, etcetera. But it’s fully serverless. It’s fully managed for you. It’s decoupled from the compute engine itself. So you can use it as a single layer to discover which datasets are available for you in S3, but also to provide integration at the metadata layer to other systems. And you’ll hear how Dremio kind of plugs into that as well. So again, it enables ubiquitous access to the data and central management of metadata. You can come in, find all the datasets that are available to you, and plug into them right from there. It enables that ecosystem of partners to come in and provide additional functionality. If that [were] baked into an existing data store, it would be very hard to integrate with. And that’s kind of the challenge that existed with a Hive metastore in the beginning. There weren’t really a lot of good APIs, and you didn’t see a lot of good integration outside of the [Hadoop] ecosystem. So with the Glue Data Catalog exposing all this metadata through APIs, we now open up the ecosystem. It’s a really powerful way to future-proof your platform.
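To make the decoupled metadata layer concrete, here is a small, hypothetical boto3 sketch that walks the Glue Data Catalog the way any engine or tool could. Database and table names come back from your own catalog; pagination is omitted for brevity.

```python
import boto3

glue = boto3.client("glue")

# Walk the catalog: databases, the tables they contain, and where the files live.
for db in glue.get_databases()["DatabaseList"]:
    for table in glue.get_tables(DatabaseName=db["Name"])["TableList"]:
        location = table["StorageDescriptor"]["Location"]  # S3 path backing the table
        columns = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
        print(db["Name"], table["Name"], location, columns)
```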
The Data Lake – Discovery and Governance
Roy Hasson:
Moving up, the next layer in your data lake is: how do you control the authorization and access policies to that data, right? You can set policies on your storage, [like] on S3. You can set policies on your catalog, on your metadata, right? [For example,] you can list this table, but you cannot read it, or you can read it but not write into it. The same goes for data in S3. But how do we do it in a way that’s easier to manage? I don’t want to use IAM policies. They’re great, they’re very flexible, but a little more challenging. Lake Formation is an authorization layer that sits on top of your data lake in S3 and allows you to define fine-grained access controls to your data. [For instance,] you can say Roy has access to table users, columns one, two, and seven, but not the rest of the columns. That’s a really powerful mechanism to implement both governance—so we know who’s accessing what, with audit on top of that—and also the fine-grained authorization. So with that, you also have a better way to discover your data. You have central governance authorization I talked about. It’s auditable. And the last piece is that it enables what I call the universal data-sharing idea. You see tools out there, [like] databases or systems, that say, ‘Hey, we can share data between our systems.’ And that’s awesome, right? You can share from one data warehouse to another, from one query engine to another. But how do you do it in a more universal or generic way? And Lake Formation allows you to do it. So you can say, hey, I have a table registered in my catalog that maybe sits in my dev test account. I want to be able to share that with another account. Or maybe I have datasets in my production account, and I want my dev test account to have access to it. So I can easily, through Lake Formation, say, ‘Take this table and share it with this other account,’ and then you control all the access to it. It’s really powerful, and it doesn’t depend on any upper-level tools to give you that sharing capability. You get it right out of the box with this data lake.
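The column-level grant Roy describes (“Roy has access to table users, columns one, two, and seven”) maps to Lake Formation’s grant-permissions API. Below is a hedged boto3 sketch; the principal ARN, database, table, and column names are placeholders.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant one user SELECT on three columns of the "users" table;
# the remaining columns stay inaccessible to that principal.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:user/roy"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "analytics",
            "Name": "users",
            "ColumnNames": ["col1", "col2", "col7"],
        }
    },
    Permissions=["SELECT"],
)
```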
The Lake House – Data Consumers
Roy Hasson:
So at this point, we’ve built a foundational data lake, right? This is what a typical data lake would look like. The next step is—and you probably hear a lot about the concept of a lake house and what it [means]—now that you’ve built a data lake, and you can see this in the box at the bottom, the three layers—that’s the foundational data lake. Now, on top of that, you want to be able to bring consumer applications, right? Those are systems like Amazon Athena that run queries on S3. That may be Amazon EMR with Spark or Presto to analyze that data, or even Amazon Redshift if you want a unified query engine for both data inside a data warehouse and in your lake. This allows you to plug into this data lake, right? So that’s the concept of a lake house. Now, there are many different tools and capabilities. So Athena is just a query engine. It’s not a database; it doesn’t store any data, but it allows you to interactively work with the data in place, which is very cost-effective. Compare that to a data warehouse. In a traditional world, [with] a data warehouse, everything would have to be loaded into the data warehouse, and that’s where you use the data. But in this lake house world, that concept is now decoupled. You can still put data in a data warehouse, but think of that as more of a cache. It’s something that stays there for a little bit just to serve the queries, while the remainder of the data resides in your data lake. So this whole lake house platform is a lot more flexible. It’s portable. The data sharing I talked about allows you to share data between all these different systems, and it really eliminates the lock-in aspect of any one solution that controls the full platform end to end.
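As an illustration of querying data in place rather than loading it into a warehouse first, here is a minimal, hypothetical Athena example with boto3; the database, table, and results location are placeholders.

```python
import boto3

athena = boto3.client("athena")

# Run SQL directly against data cataloged in Glue and stored in S3.
# "analytics", "sales", and the output bucket are placeholder names.
response = athena.start_query_execution(
    QueryString="SELECT region, COUNT(*) AS orders FROM sales GROUP BY region",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])  # poll this ID for status and results
```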
The Lake House – Easy to use for every persona
Roy Hasson:
And then, lastly, if we kind of talk about where Dremio fits in: again, it’s both a query layer that’s able to access the data, and you can see by the arrow here, it’s able to connect to the Glue catalog that I mentioned before, which allows us to plug into the lake directly. So, you can come into Dremio, and you’ll be able to query the data right there. But you can also layer capabilities on top of it if you wanted more flexibility, [giving] your users a self-service way of accessing the data. Now, compare that to a data warehouse—it’s not a data warehouse, right? It’s more of an intelligent query engine. In the next section, we’ll dive deeper into that. But if we were to step back to the original problem statement I called out, from an infrastructure perspective, the data lake that I showed you here is flexible, scalable, highly available, and cost-effective. It eliminates a lot of the complexities of data duplication and [simplifies] access to data. And how do I help my personas? That’s where the best tool for the job comes in. Whether it’s Dremio, Redshift, or Athena, there are different use cases where one is better than the other. On top of that, you layer your visualization, your machine learning, your studio experiences, and your user interfaces to interact with the data. So, that’s, very quickly, what a lake house architecture is. The data lake is the foundation, you plug in data consumers, visualization, and user interfaces on top of that, and that extends the data lake into a lake house architecture. So, from this point on, I’ll hand it over to Gabe, and he’ll walk through what Dremio is and how it plugs into this. Gabe, you wanna take it away?
In a Pure Data Warehouse World, BI Users Are Being Left Behind
Gabriel Jakobson:
Absolutely. Thank you, Roy. So, let’s build on what Roy has just said and talk about Dremio’s vision for a modern data lake architecture, specifically from the perspective of analytics. So, if we step back for a minute and look at the whole problem from a data consumer-centric point of view, data consumers are not necessarily happy in this day and age of exploding data. If you’re a data consumer, in many cases, you may be experiencing slow queries or slow time to insight. Why is that? Well, if you have very large amounts of data, say, for example, in a data lake or a very large data warehouse, in some cases, you [might] not be able to run fast queries against very large datasets, right? So live queries get slower. And if you want to extract data using a product like Tableau or Power BI, there is a limit to how large extracts can be. What’s more, extracts actually produce a delay, right? So, you [experience] slowness in time to insight because if you’re extracting data, naturally, the whole process [slows down]. In fact, if you look at this little ETL pipeline here, it’s not really ETL, but for all intents and purposes, if you take data that currently exists in your data lake and try to chunk it out into data warehouses so it’s optimized for analytics, then cube and aggregate it [to] create all kinds of insights, this is a lot of work for IT. From the perspective of a data consumer, it slows down your time to insight.
This may sound a little bit trivial, but I’m always surprised by something that I think is fundamentally different between data warehouses and data lakes, and something that people often overlook or don’t take time to meditate on. If I gave you a value—say I just typed a number here—what is this value? It could be one of many things, right? Maybe it’s my revenue for the year. Maybe it’s my revenue per quarter. I wish! We don’t know whether it’s dollars or pounds. Maybe it’s not a number at all. Maybe it’s someone’s Social Security number. In the world that’s existed for the past few decades, in a classic data warehouse world, this value has a very specific definition. We know, for example, that it’s an employee’s Social Security number. We know what table it lives in. We know it’s a VARCHAR. We know it can have exactly ten digits. We know how it relates to other data. These are all things that we take for granted in a data warehouse. A data lake is different. In a data lake, for those of you who have played with S3, which is wonderful, you have a bunch of buckets, and you can throw anything in there, right? It could be a movie, it could be a picture, it could be data. So this number or this value doesn’t necessarily have a natural meaning in a data lake. We have to assign a meaning to it, and then we can query it and build analytics on top of it.
To Make a Data Lake Useful for Analytics…
Gabriel Jakobson:
So in order to make a data lake really useful for analytics, we need a few things to exist. Data has to have a meaning. That number that we saw before could be part of a movie stream. It has to have an actual meaning of what that number is, if it’s even a number. It has to relate to other data, right? The nature of analytics is such that, usually, you want numbers to relate to other numbers, measures to relate to attributes. You’re not dealing with just a single element. You need to be able to use SQL, right, the most common language for querying data generally, and that SQL needs to preferably run at speed. You want your data consumers to be able to execute queries very fast. So what does Dremio do? We’ll dive deeper into this. We can query S3 directly with four times the performance, and this number varies. This number could actually be dramatically larger than four times, with minimal data movement. That’s really important. Most of your queries are actual live queries against your data lake via Dremio, while you maintain control of your data. So your data doesn’t migrate or move anywhere else. Your data is your data, and Dremio fits on top of it as an engine. I’ll show you a schematic diagram. In fact, there it is.
Simplified Data Access
Gabriel Jakobson:
So imagine your data lake, S3, and Glue down here at the bottom, with all kinds of external data sources, which we’ll talk about a little bit as well. At the top, you’ve got your data consumers, your data scientists, your BI users, etc. Now in the middle, you’ve got this invisible, from an end user’s perspective, data lake engine, which gives you speed but also [provides] a semantic layer. From the perspective of BI users, for example, you want consistent business logic, right? BI users are not data scientists; they’re not just playing with data. They need to be able to present reports and KPIs, and those need to be consistent and accurate. You don’t wanna wait for IT to keep on ETL-ing your data, right? As an end user, you want to be able to access it directly. You wanna be able to discover data and share it easily. From an IT perspective, you also want to make sure you have good data governance, right? Security and governance are key, especially when it comes to PII data. You don’t want reactive, tedious work; you want easy collaboration. And this is something that Dremio provides you with. That’s the value that Dremio provides.
AWS Glue Tells You Where to Find the Data and the Data Definitions
Gabriel Jakobson:
So, what you see here is AWS Glue. The first step in creating a semantic layer is leveraging AWS Glue as a data dictionary, right? If you remember that [large value] I showed before—so, leveraging AWS Glue, the value gets a meaning, right? AWS Glue crawls your data lake and understands what the various tables are. Tables are typically comprised of thousands, sometimes tens of thousands, maybe even hundreds of thousands, of small partitioned files in your data lake, right? So, it’s really important to have AWS Glue as one avenue to catalog your data and help Dremio understand what your data is at a very granular, fundamental level. This is not business logic—this is more about what a particular number is. Is it an integer? Is it a [varchar]? Where does it live on disk?
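The crawling step Gabe describes corresponds to a Glue crawler. A minimal boto3 sketch might look like this; the crawler name, IAM role, database, and S3 path are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Point a crawler at an S3 prefix; it infers schemas from the files
# and registers the resulting tables in the Glue Data Catalog.
glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="analytics",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake-bucket/sales/"}]},
)
glue.start_crawler(Name="sales-crawler")
```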
Dremio’s Semantic Layer is Where You Assign Business Meaning to Your Data and Organize It
Gabriel Jakobson:
So, the next step is the Dremio semantic layer. I’ll show you a demo screenshot of that. But once Dremio is connected to Glue and has pulled in the metadata [to understand] what your data looks like in terms of tables, then it’s time to really assign business meaning to it. You can’t overemphasize the importance of this. The idea [is] that while you may have lots and lots of different tables in your data lake with all kinds of names, [many of them] may have ‘revenue’ as a column, right? But ‘revenue’ is just a name; if you’re the CFO of a company, ‘revenue’ has a very specific meaning to you. So, you need a semantic layer that assigns that business meaning to a number that just lives somewhere in your data lake.
Demo
Gabriel Jakobson:
So in terms of a demo, in terms of logistics, passing screen share around wasn’t very practical for this presentation, so I’ve taken a couple of screenshots. Anyone who is interested in seeing the product in a real demo can sign up with us through Dremio, or you could actually download or fire up Dremio on AWS and start using it yourself. But this is Dremio’s IT view, right? So if you’re a data consumer, you would not see this. But if you’re a data engineer or in IT, this is something that you would see. Down here at the bottom, we’ve got data sources. So you could connect to any data lake out there, and most other RDBMSs or other data sources. In this particular case, we chose to connect to Glue, right? Once we connected to Glue, AWS Glue gives us a list of all the tables it discerned from the underlying data lake in S3. Double-clicking on any one of those tables, Dremio opens up the table and treats it as analytical data. It treats a file that lives on S3 as a real table. If you, as a data consumer, were to fire up Tableau or any BI tool connecting to Dremio, in this particular case, you’d click on the Tableau icon, and Tableau fires up. From a Tableau perspective, all of the dimensions and measures that were discovered by Glue and curated by Dremio become available for you to [drag and drop]. This is really something worth seeing live because if you were to drag and drop attributes and measures, you’d be able to create an ad hoc chart lightning fast, right? For example, creating this chart in terms of actual data reads only takes about a second or two. So as an end user, right, as a BI user, you just drag and drop attributes and measures. And even though, if you look at the previous slide, there are over a billion records in this particular table, [doing] all kinds of ad hoc analytics only takes seconds. If you’re a Tableau user, this feels as if you have an extract that lives on your hard drive, on your Mac’s hard drive or PC’s hard drive. In reality, though, everything happening here is happening via Dremio.
So now we’re back to the Dremio console. This is what your data engineers and IT would generally see. Any query that Dremio receives, for example from Tableau, gets registered with Dremio. Dremio analyzes the query, plans the fastest route to execution, speeds up the query along the way using about a handful of different technologies, and then returns the data back to the Tableau user—extremely, extremely fast, in most cases under one second. So when you actually test-drive Dremio, what you will see is the speed.
Metadata is Solved with Dremio; Now Let’s Talk about Speed
Gabriel Jakobson:
Let’s talk about interactive analytics. What gives us the speed, right? So we understand metadata, or the metadata layer. Let’s talk about speed. I mentioned in passing a handful of technologies that allow Dremio to really achieve that kind of crazy performance—four times speedup on ad hoc queries, up to a thousand times on BI dashboards or things that can be cached. So, reduced infrastructure costs—we’ll talk about that a little bit. But the idea of speed is, Dremio, as a company, started an open-source project called Apache Arrow, which has become one of the most popular Apache open-source projects out there—fifteen million downloads a month. So, if you look at what Apache Arrow does with your data [in memory], if you look at our ability to have a scalable, scale-out, scale-in MPP engine that predictably queries your data lake, if you kind of put everything together, you get tremendous speed. I don’t want to dwell on every one of—I mean, I could talk at length about every one of these different technologies—but generally speaking, this is how we achieve speed. Also, elastic scaling is really important in terms of cost. So, we have the concept of engines. And when we sense that there is a need—or when a certain condition is met for us to expand our execution engine—we do so transparently, automatically. So if you have a flood of queries coming in, we can fire up more executor nodes, crunch through your queries, and then scale back down so that you save on cost.
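For readers curious what Arrow’s columnar format looks like in practice, here is a small, self-contained pyarrow sketch (not from the webinar); the Parquet file name is a placeholder. It is only meant to show columnar reads and aggregation, not Dremio’s internal engine.

```python
import pyarrow.parquet as pq
import pyarrow.compute as pc

# Read two columns of a Parquet file into Arrow's columnar in-memory format.
# "sales.parquet" is a placeholder file name.
table = pq.read_table("sales.parquet", columns=["region", "revenue"])

# Columnar layout lets analytical operations work on whole columns at once,
# without row-by-row deserialization.
total_revenue = pc.sum(table.column("revenue"))
print(table.num_rows, total_revenue.as_py())
```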
Many Analytics Workloads are Ideal for a Data Lake
Gabriel Jakobson:
Many analytics workloads are ideal for data lakes. A lot of these, you could just imagine—things like dashboard and report acceleration. We talked about this; that’s ideal. Ad hoc query acceleration—also ideal for data lakes. If you’re a data engineer, again, data lakes are ideal for you, and the same goes if you want a standardized semantic layer, which we talked about at length. Let’s not forget our data scientists, right? There is more and more data science out there—ML, AI, data exploration, feature engineering—which is really, really important and can be a rather painful process. This is where, you know, if you’re an AI practitioner, you need to be able to scan over millions of different data points and, you know, engineer them in all kinds of ways to build a predictive model, right? So this is something where data lakes are really making a big impact.
Find a Better Balance by Migrating Suitable Workloads to your Data Lake
Gabriel Jakobson:
In terms of migrating workloads, if you have an on-prem data warehouse, this is really a good time to migrate some, if not all, of the workloads—at least some of [them]—to the data lake, especially the non-transactional workloads. We have support for transactional workloads coming up, and if you guys ping Dremio separately, I can explain the difference. But for now, the majority of your non-transactional workloads—your normal data—is definitely a good candidate for migration from Teradata and other on-prem solutions. Same thing with Snowflake and cloud offerings. If you have a data warehouse in the cloud, it’s really ideal to start saving money. Just take some of the data, migrate it over to a data lake, and leverage both AWS for the data lake and Dremio to speed it up—to speed up your queries and create a semantic layer. Great use case.
Data Lakes and Existing Data Warehouses Can Co-Exist
Gabriel Jakobson:
Another point that I want to make sure I touch on is that it’s not an either-or. Right? It’s a world of coexistence. We understand that data warehouses are not going to disappear overnight. It’s not like you’re going to, you know, turn off the lights on a data warehouse on Friday, do migrations to a data lake over the weekend, turn on the lights on Monday, and 100% of your data is in a data lake. That’s not going to happen. So, another good thing about Dremio is the fact that we support that type of coexistence. We connect both to data lakes, obviously, and also to external data sources, to your regular RDBMSs, to Amazon Redshift, or to, you know, Teradata, or Snowflake, or any other data warehouse. And we can federate queries between the two, so that if you’ve started your data lake journey and migrated some of your data to a data lake, your users can submit a query to Dremio, and the query can be satisfied partly by data that lives in the data lake and partly by data from your traditional data warehouse. Dremio would then understand the query, federate it to both sources, combine the result sets, and give you back the results extremely fast.
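A hypothetical sketch of the kind of federated query Gabe describes is shown below, issued from Python over ODBC. The DSN, credentials, and the source names (a Glue-backed lake source and a Redshift source) are placeholders that assume those sources have already been added in Dremio and that the ODBC driver is installed.

```python
import pyodbc

# Connect through Dremio's ODBC driver; the DSN and credentials are placeholders.
conn = pyodbc.connect("DSN=Dremio;UID=analyst;PWD=secret", autocommit=True)

# One query spans two sources: a Glue-cataloged lake source ("glue_lake") and a
# warehouse source ("redshift_dw"). Both names are examples; they match whatever
# the sources were called when they were added in Dremio.
sql = """
    SELECT o.region,
           SUM(o.amount) AS lake_revenue,
           SUM(b.budget) AS warehouse_budget
    FROM glue_lake.analytics.orders AS o
    JOIN redshift_dw.finance.budgets AS b
      ON o.region = b.region
    GROUP BY o.region
"""
for row in conn.cursor().execute(sql):
    print(row)
```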
How Do You Get There?
Gabriel Jakobson:
So, how do you get there? These are just some tips. Right? I think the next slide shows some of our, you know, impressive customer list. So, how do you get there generally? What do most people do? Well, if you already have a lot of data lake presence, then you’re ready. If you don’t, maybe start by migrating some workloads to your data lake. Right? If you’re receiving new data or rearchitecting your data warehouses, take some of those workloads, put them on a data lake, and have the two paradigms coexist. Don’t boil the ocean. Right? Start with some data, some use cases. Maybe gather some specific use cases, and Dremio can help you with that. Let’s get those use cases to work and prove themselves, and then we can move more data to your data lake. And really important—make it transparent to your BI users. You never, ever want your BI or data consumers to suffer for your back-end decisions. Right? You only want them to benefit. So, as you do this migration, make sure the performance they experience from the data lake exceeds their expectations and is transparent to them.
Summary
Gabriel Jakobson:
Just as a summary slide, it’s really impressive, and this is a very partial list of some of our large customers out there. Two things to remember about Dremio: lightning-fast queries—dramatic acceleration of data lake queries—and a self-service semantic layer, which means you can provision datasets very fast. You can share them. And really important, there is zero lock-in, zero loss of control. With a lot of solutions out there, primarily data warehouse solutions, you’re uploading your data somewhere else. You’re uploading your data to, you know, somebody’s data warehouse, and then you lose control of your data. With Dremio, your data stays in your data lake. Dremio touches it to execute queries but never moves it.
How do you get started? You can try our AWS edition for free. Just go to—just Google AWS Marketplace Dremio—and you’re going to see a screen that looks like this. You can fire up either the normal trial of Dremio or Dremio Enterprise, which has more security features, and start testing it out. It takes very little time, and it’s really fast.
And over to you, Stephen.
Q&A
Stephen Faig:
Thank you very much, Gabe. So we’re going to dive into questions from our viewers. First question: How does Dremio make Tableau faster?
Gabriel Jakobson:
Yeah, absolutely, Stephen. So, we kinda saw it before. The whole idea is that when Dremio receives a Tableau query, Dremio has about five different techniques for executing the query faster if it’s a live query. One of the techniques is called a reflection. So, if we see a query that’s very repetitive in nature, we can use our proprietary technology to materialize the results of that query and reuse it for other queries. That dramatically speeds up your Tableau experience. But in general, the fact that you don’t need to wait for Tableau extracts to happen, and that you can create and execute live queries against Tableau and get the results fast, is what improves your experience.
Stephen Faig:
Understood. Thanks for clarifying, Gabe. Next question: I heard Dremio reflections are really fast. What are they, and do I have to use them every time I want a query to run faster?
Gabriel Jakobson:
Sure. So Dremio reflections are a super useful technology, right? It’s really impressive how fast they are. However, I kind of think of them as almost like an afterburner on a fighter jet engine. I understand this is not an analogy everyone would get, right? But the idea of reflection is that Dremio natively, without reflection, is super fast. In some cases, it makes sense to further accelerate particular query patterns, and that’s where reflections come in. So Dremio is fast without reflection. Reflections are useful in places where you want that extra oomph to execute queries.
Stephen Faig:
Understood. Moving to our next question. Can Dremio run queries that read data both from a data lake and a data warehouse at the same time?
Gabriel Jakobson:
Absolutely. So Dremio understands your underlying data, right? Dremio understands that some of your data—hopefully, the bulk of your data—lives in a data lake. Dremio understands what data lives in data warehouses. Dremio has a slew of connectors into data warehouses. We’re also supporting you in writing your own connectors, right? So we have a Dremio hub where you can create your own data warehouse connectors. Dremio receives the query, analyzes the query, and then federates the query out to either the data lake and/or your data warehouses, fetches the result set from both, combines the result sets, and gives you, as the end user, the answer.
Stephen Faig:
Understood. Thanks, Gabe. Our next question: We have someone curious if there are other AWS services that integrate with Dremio, and what if data is stored on EFS or FSx Lustre—can Dremio be used?
Gabriel Jakobson:
Yeah, sure. So I can answer this. Gabe again. Roy, I don’t know if you want to take a crack at that as well. But from a Dremio perspective, data is data. And as long as we can get to the data, we can work with it. Dremio actually makes use of EFS on its own. A lot of the question becomes not about whether Dremio can get to the data—the answer there usually is yes, just because you’re using pretty standard JDBC drivers and/or APIs. So the data can always be accessed. The real way to answer this is whether that’s an ideal architecture. For example, EFS is a place where you could store data, but it’s not ideal for analytics. It’s not built for speed, right? For query analytical speed. So that’s where we would work with you and encourage you to think about real data lake architecture and maybe shift your data over to S3, just to make the whole process more efficient. But can we read from it? Absolutely.
Stephen Faig:
Understood. Moving to our next question, we have someone curious about what compute costs would look like using Dremio. I’m not sure if that’s something you can really get into—it’s probably specific to each environment—but I don’t know if there’s anything you could say around that.
Gabriel Jakobson:
Yeah. No. It’s a good question. It’s a question we get a lot. So, compute cost—what we try to get to is that, to a certain extent, you kind of pay for what you use. Or, let me phrase it differently. I’m really not the account exec who usually talks about pricing, but what we try to do overall is align compute with cost. So, when you think about data warehouses, you’re typically paying for both storage and compute: when data lives in the data warehouse, you pay for that whether you’re crunching through queries or not. The paradigm with data lakes is that your data lives on S3 at a very low price point. I think it’s roughly twenty dollars per terabyte per month—extremely low cost. Right? So it doesn’t cost you much to have your data hosted on S3. What does cost you money is compute, right? And so what Dremio tries to do is—since Dremio is the compute portion, right?—when Dremio receives queries, it fires up a compute engine that actually executes the queries. That’s the real cost. So what we try to do is address that via workload management, so you can decide on the priorities of various queries and align those queries with various “engines.” An engine is a collection of [EC2] nodes, which Dremio manages on its own. So, if some queries are really important, we can fire up larger engines for you. Those are more costly, but the queries would typically execute faster. In other cases, we can stop those EC2 instances so that you save money when you don’t need the execution engine. So, we’ve put a lot of work into that specific aspect of saving you money by controlling your compute cost. Hopefully, that answered the question.
Stephen Faig:
No. Thank you very much, Gabe. I appreciate it. Next question: Can Dremio run queries in a hybrid cloud?
Gabriel Jakobson:
Yes. So, [the answer is yes], but it comes with a bunch of caveats. Right? The answer generally is yes, we’re designed for that. But in terms of best practices, we would need to look at it more closely. In fact, most of our large clients do have something like that in place. In today’s day and age, it’s pretty unlikely for data to live only in one place, right? If you’re talking about Fortune 100 companies, [most of them] have some kind of on-prem data presence. Some stick with AWS, others use Azure. So, the answer is yes, but it’s something we should discuss architecturally, to see what actually makes sense in terms of reducing costs and increasing efficiency.
Stephen Faig:
Understood. Next question: Are there any tools or drivers to move data from legacy systems, like IBM iSeries or DB2, to a data lake?
Gabriel Jakobson:
If the question is about Dremio, the answer is no. Dremio doesn’t really get into the business of ETL, moving data from point A to point B. There are a lot of great tools out there that could be used for that. That’s probably more a question for Roy’s side, but maybe something in the EMR world of AWS would make sense here. From Dremio’s perspective, we only deal with the data once it’s in the data lake. We can give you advice on how to move your data, but we’re not in the business of actually moving it to the data lake.
Roy Hasson:
Yeah, that’s a good point. There are definitely a number of tools available. If I were to look at the AWS ecosystem, the first one that comes to mind is AWS Database Migration Service, which allows you to bring data from various database sources into S3, writing it in Parquet format so it’s ready for Dremio and other tools to query. There’s also a service called Amazon AppFlow that brings data from SaaS providers like Salesforce. So that’s another option. We also have several ISV partners, similar to Dremio, such as Fivetran, Informatica, and Talend, that bring connectors for different platforms. So, there are many ways to bring data in a structured format to S3. Once the data is there, you can do a lot with it to prepare it for Dremio to consume.
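As a hypothetical illustration of the DMS option Roy mentions (landing migrated data in S3 as Parquet), a boto3 sketch of an S3 target endpoint might look like this; the identifiers, bucket, folder, and role ARN are placeholders.

```python
import boto3

dms = boto3.client("dms")

# Define an S3 target endpoint that writes migrated tables as Parquet,
# so the files land in the lake ready to be cataloged and queried.
dms.create_endpoint(
    EndpointIdentifier="lake-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "BucketName": "my-data-lake-bucket",
        "BucketFolder": "raw/erp",
        "DataFormat": "parquet",
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access",
    },
)
```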
Stephen Faig:
Understood. Thanks, Roy. Our next question: Are there scheduling options available in Dremio?
Gabriel Jakobson:
Scheduling options as far as…? I think I need some more context.
Stephen Faig:
Okay. If the person who asked that question can follow up with additional context, we’ll try to get you an answer. Moving along, next question: Are there additional benefits someone could get by having Dremio on top of Snowflake?
Gabriel Jakobson:
So, Dremio on top of Snowflake is possible. [What] we see is that we’re not really on top of Snowflake; rather, a lot of your data lives in the data lake. However, you’re also paying for Snowflake, and you have some useful data in Snowflake, and we can bridge both, right? So we can query your data in the data lake, which is Dremio’s reason for existence. However, we can also access data in Snowflake and federate queries to Snowflake, but we would not live on top of it. So, the answer would depend on what you mean by “on top of.” In other words, we can read data from Snowflake, and we could accelerate it. We could use reflections to make some Snowflake queries faster. However, we don’t position ourselves as living on top of Snowflake. It’s just a data source for us that, hopefully, in the future, would get migrated to a data lake.
Stephen Faig:
Understood. Thanks for clarifying, Gabe. Okay, so we’re getting close to the top of the hour. At this point, I just wanted to offer both you and Roy an opportunity to deliver final remarks—what you’d like our viewers today to walk away [with]. Gabe, we can start with you. Final remarks?
Gabriel Jakobson:
Well, I think that most companies we talk to either have a data-related strategy or are formulating one. We would definitely encourage you to talk to Dremio. We’re very easy to find—just Google us, look us up, fire us up on AWS, try us out, contact us as a company, ask us questions. We’re really good at giving answers, right? We’re really good at helping you shape your data lake architecture. A data lake is not just a matter of… If you remember some of my earlier slides, data lakes and data warehouses are not the same. You can’t just take data from a data warehouse, dump it into a data lake, run the same queries, and be done, right? There are some nuances and differences between the two. We can help guide you on that journey, so absolutely count on Dremio. And thank you so much for attending.
Stephen Faig:
Okay. And Roy, any final [thoughts]?
Roy Hasson:
Yeah. No. I think one thing I want to make sure folks take away from this whole conversation is that, in the days prior to what we call a modern data lake architecture, to solve data problems, you typically took the data and put it into a tool, right? Whether it’s a database, a data warehouse, on-prem Hadoop, or whatever that may be. That was a quick way to solve the problem, but it was not a scalable solution. What we see now is that customers are really struggling to move away from that approach and into something more scalable and flexible. So, when reevaluating or building a modern data architecture—or modernizing what you already have—think about the decoupled approach that we talked about today. Starting off with a data lake, using Amazon S3 and Glue as a way to save, store, catalog, and expose the data. This allows you to open it up for different tools, whether it’s Dremio, Redshift, Athena, or whatever you end up choosing to consume the data, rather than being locked into one particular solution just because it’s a little faster or does something better. Dremio plugs into a data lake story really, really well. So, I think Gabe did a great job of explaining the benefits, but what I want you to take away is that you have to start with a foundational data lake. If you don’t have that, the rest is going to be much, much harder to do, and you end up, you know, painting yourself into a corner. So again, just make sure that when you’re building a modern data architecture or data lake, you start with a foundational data lake and then layer these new capabilities, like Dremio, on top of that.
Stephen Faig:
Thank you, Roy. I’d like to thank both speakers today: Roy Hasson, Senior Manager of Business Development, Analytics, and Data Lakes at AWS, and Gabe Jakobson, Senior Solutions Architect at Dremio. As I mentioned earlier, all questions will be answered via email. If you would like to review this presentation or send it to a colleague, you can use the same URL that you used for today’s live event. It will be archived, and you’ll receive an email once the archive is posted. And again, just for participating in today’s event, you could win a one-hundred-dollar Amazon gift card. The winner will be announced on November 30th. We will reach out to you via email if you are the lucky viewer.
Thank you again, everyone, for joining us today, and we hope to see you again soon.