August 5, 2020
Dremio vs. Presto Benchmarks – Top 3 Performance and Cost Comparisons That Matter Most
Transcript
Serge Leontiev
Thank you, Louise. Welcome to our webinar, and thank you so much for joining us today. As Louise mentioned, today we're going to share our recent benchmarking results. We measured the performance and cost efficiency of the Dremio data lake query engine versus Presto-based query engines, so today we'll discuss a comparison between Dremio and multiple flavors of Presto, such as PrestoDB, PrestoSQL, Starburst Presto, and AWS Athena; the benchmarking methodology we used; and the architectural reasons and technical factors driving the difference in outcomes.

Please allow me to introduce myself. My name is Serge Leontiev, and I lead technical product marketing at Dremio, the data lake engine company. Our company provides a cloud data lake query engine that offers interactive query speed and unprecedented performance directly on cloud storage. Before we proceed, I'd like to run a quick poll to see if you are familiar with data lake query engines. Let's take 30 seconds for you to answer the poll questions, and after that we'll take a look at the results.

If you're not familiar with Dremio, this is a great opportunity to learn more about our company. We currently power many of the world's leading companies with our solution. Dremio is a co-creator of the Apache Arrow project, the columnar in-memory processing framework, and Dremio recently raised over $100 million, with a good handful of investors supporting us. So let's take a look at those poll results. Yes, the majority of you are familiar with data lake query engines, which is great. For any of you who are not familiar with them, let me give you a quick overview of what data lake query engines provide.

Basically, it's a new category of compute engine that has arisen recently. Data lake query engines allow you to connect to and provide fast, SQL-based access directly on data lake storage, which could be ADLS, S3, or on-premises S3-compatible storage. They offer fast query execution for BI queries, reporting queries, and ad hoc queries. They also usually offer a logical data model and semantic layer that let data engineers and data consumers easily discover data sets and run queries against them, plus the ability to connect to external data sources, if needed, to enrich your data. And, most importantly, they provide fine-grained access control, data masking, and an additional layer of security, along with workload isolation and elastic compute. You can connect to a data lake engine with analytical tools such as Jupyter or Spark, or with BI tools such as Tableau, Power BI, Looker, and so on.

With that in mind, let's take a look at the data analytics infrastructure market to see where data lake engines fit. On this graph, as you can see, Dremio, Presto, and AWS Athena all fall into the same quadrant, representing open and loosely coupled solutions that are geared toward delivering data insight rather than performing data processing. Unlike traditional data warehouses, which require you to move data from the data lake into proprietary systems and formats before users can access it, a data lake query engine lets you query data directly from cloud data lake storage, where data is usually stored in open formats such as Parquet, JSON, or CSV.
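As an aside, this direct-on-storage access pattern is easy to see with open source tools as well. Here is a minimal sketch in Python, assuming a pyarrow build with S3 filesystem support and a hypothetical bucket path; it reads only the columns it needs straight from Parquet files on object storage, with no warehouse load step:

```python
import pyarrow.dataset as ds

# Point the scanner directly at Parquet files on object storage
# (bucket and path are hypothetical).
dataset = ds.dataset("s3://example-bucket/sales/", format="parquet")

# Parquet's columnar layout means we can read just the columns a query needs.
table = dataset.to_table(columns=["order_date", "amount"])
print(table.num_rows)
```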
Just for your information, the global data lake market size is expected to grow from $7.9 billion in 2019 to approximately $20 billion in 2025, at an annual growth rate of approximately 20.6%.

Let's do another poll, just to get an idea of which query engines you are currently using in your organization. This will help me understand which query engines are used within your organizations, so I can see whether the content we're going to discuss will resonate with you. We wanted a good comparison across all of these competitors, and it's interesting for me to see which one is used most. Let's wait another few seconds before we get those results. Now let's take a look at them. Oh, it's well spread out across all the query engines, which is great. Thank you so much for that information, I really appreciate it; this is very helpful for us to understand.

So let's continue with today's agenda. First, why benchmarking, why we're doing this benchmarking exercise at all. Second, we're going to discuss execution time by query type; this is quite an important topic, especially from the business user's point of view. Then, from the data engineering and data architect's point of view, we're going to discuss execution cost and performance comparison. And at the end, we'll dive a little bit into the architectural analysis, highlight differences between the solutions, and answer your live questions in the Q&A session.

So why benchmarking? The key requirements that data teams face today are, first of all, the ability to provide interactive query performance directly on cloud data lake storage to business users and analysts, without the time, expense, and complexity of moving and copying data into traditional data warehouses just to achieve the same interactive query performance. And certainly, one of the goals is cost efficiency for a given level of performance. This is one of the emerging key decision criteria for data teams, and it applies to virtually any cloud-based solution. With the elasticity of the cloud, absolute performance alone is no longer the right way to benchmark a solution, simply because any scale-out architecture can add more compute resources to deliver a given level of performance. It's more important, I believe, to understand the efficiency of a given solution: what amount of compute resources is really required to deliver that level of performance.

With that in mind, the top three performance and cost comparisons that matter most are, first, BI and reporting query performance and cost efficiency, because the majority of the queries that people and teams run against data lakes are BI queries. But we also shouldn't completely discard interactive ad hoc query performance, because interactive queries are another type of query that customers commonly run. And finally, as we discussed, the average performance and cost efficiency across all query types. Whenever you're considering a data lake query engine, you're definitely looking at the bottom-line numbers, the dollar amounts you'll have to spend on supporting infrastructure: does it deliver the query performance you're looking for, and how much money are you going to spend running all those queries on the interactive data lake?

So, what we tested.
As we discussed earlier, Dremio, Presto, AWS Athena, and Starburst Presto all fall into the same category of data lake query engines, and our goal was to compare all the offerings side by side. We leveraged AWS Marketplace offerings as much as we could, with the ultimate goal that others like you could easily reproduce our test results without spending time on setting up, configuring, and provisioning cloud infrastructure. However, with the open source versions of Presto we had no choice but to manually provision instances by following the deployment guides. We started out around April 2020, this spring, and selected the most current product versions available at that time. The products definitely evolve, and new versions come out every month or so, and we will continue to test them against each other in the future. But at that time, the following query engines were tested: Dremio AWS Edition version 4.2.1; PrestoDB, the Facebook codebase of Presto, the original Presto, version 0.233.1; PrestoSQL version 332; Starburst Enterprise Presto 323e; and AWS Athena.

To make sure we were comparing apples to apples, all Dremio and Presto instances were configured with default and recommended settings, so we weren't fine-tuning anything; we installed everything out of the box. We used the same EC2 instance type, m5d.8xlarge, a general-purpose compute instance, for all benchmarking tests. However, in the case of Starburst Presto, we selected the EC2 instance from its CloudFormation template that was the closest match to m5d.8xlarge by number of vCPUs and network bandwidth, simply because m5d.8xlarge wasn't available for selection at all. With that in mind, the r4.8xlarge instances we ended up with are memory optimized and actually offer twice as much RAM as the m5d.8xlarge: we're talking about 244 GiB of RAM versus 128. Both instance types have exactly the same number of vCPUs and the same network bandwidth allocated to them, and the r4.8xlarge instances are a little bit more expensive than m5d.8xlarge. Obviously, the extra memory is a competitive advantage for Starburst, but we decided to proceed anyway and see how Dremio, running on general-purpose compute instances, would compare.

And lastly, AWS Athena: being a serverless offering, no one really knows what resources are allocated behind the scenes, and we have no control over it. So we simply followed best practices with AWS Athena. Ideally, you should run your workloads off peak hours, which depend on the time of day and day of the week, so we scheduled our tests for evening hours or over the weekend, just to have a less crowded environment and be able to get more or less accurate benchmarking results.

Now let's talk about the benchmarking methodology and tools we used during this exercise. We chose TPC-DS, a trusted industry-standard benchmark for general-purpose decision support systems. It's geared toward online analytical processing and offers a variety of BI, reporting, analytical, and ad hoc queries whose logic represents the typical analytical workloads our customers face. And we used the TPC-DS-provided tools to generate the data sets and queries for this particular benchmarking test.
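For readers who want to reproduce this step, the TPC-DS toolkit ships dsdgen (the data generator) and dsqgen (the query generator) as command-line tools. Here is a minimal Python sketch of driving dsdgen in parallel chunks; the install path and output directory are hypothetical, and the binaries have to be built from the toolkit first:

```python
import subprocess

DSDGEN = "/opt/tpcds/tools/dsdgen"   # hypothetical build location
OUT_DIR = "/data/tpcds_sf1000"       # hypothetical output directory

# Generate the scale factor 1000 (~1 TB) data set in four parallel chunks.
for child in range(1, 5):
    subprocess.run(
        [DSDGEN, "-SCALE", "1000", "-DIR", OUT_DIR,
         "-PARALLEL", "4", "-CHILD", str(child)],
        check=True,
    )
```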
We used two different scale factors: scale factor 1000, which is approximately one terabyte of data, and scale factor 10000, which represents 10 terabytes of data, to test the linear scalability of the engines at different data scales. The generated data sets were converted to Parquet format, the most pervasive open source columnar format in data analytics, broadly supported by the industry, including technologies such as Dremio, Presto, and Spark.

I'd like to highlight that initially we used a Parquet row group size setting of 256 megabytes, the default in the engine when you generate those Parquet files. However, during preliminary testing, the query success rate on the Presto engines was lower than we expected, and a lot of queries on the smaller Presto cluster sizes simply failed with an insufficient-resources error, due to memory limitations and Presto's inability to efficiently split data processing between workers. To mitigate that, we decided to reduce the row group size to 128 megabytes, so the engines would be ingesting smaller chunks of the Parquet files. With this tweak, we were able to improve the query success rate for Presto.

However, this is not a recommended practice: if you have a wide table with wide columns, a lot of data in the columns, that's simply not going to work. Per Apache Parquet best practices, the recommended row group size would be around 256 megabytes, 512 megabytes, or one gigabyte for better performance and compression rates.
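To make the row group tweak concrete, here is a minimal sketch using pyarrow (an assumption on our part; the transcript does not say which writer was used). Note that pyarrow expresses row_group_size in rows rather than bytes, so targeting roughly 128 MB means estimating rows per group from the average encoded row width; the numbers below are made-up placeholders:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy stand-in for a converted TPC-DS table.
table = pa.table({"ss_item_sk": [1, 2, 3], "ss_net_paid": [9.99, 4.50, 12.00]})

# pyarrow sizes row groups in rows, not bytes. To target ~128 MB groups,
# divide the byte budget by an estimated average encoded row width.
target_bytes = 128 * 1024 * 1024
est_row_bytes = 64                      # made-up estimate for illustration
rows_per_group = target_bytes // est_row_bytes

pq.write_table(table, "store_sales.parquet", row_group_size=rows_per_group)
```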
Like we mentioned before, we wanted our results to be repeatable. With that in mind, we identified 58 unmodified TPC-DS queries that we were able to execute on both the Dremio and Presto engines, and this subset equally represents BI, analytical, and ad hoc queries. We also tested the engines' linear scalability by incrementing the node count by four: we started at four nodes and went up to 20-node clusters for each engine, excluding, obviously, AWS Athena. And lastly, we used Apache JMeter as the test suite, since it offers an open and flexible framework that can easily leverage any JDBC driver and provides transparent, easily digestible results.

So let's dive in, but first let's explore engine performance by specific query type, because data consumer teams and data analysts run specific kinds of queries. There are BI and reporting queries, often used for data visualizations, dashboards, and reports, which usually take longer to execute when optimization hasn't been done properly. There are analytical queries, more targeted at mining through the data to find patterns, quite often used for machine learning or AI. And there are ad hoc queries, which are dynamic and interactive in nature; users want those results quickly, and you don't want to wait minutes or even hours for an ad hoc query result.

But before we proceed, let me cover Dremio's acceleration technologies first. This is important to understand because in the next section we're going to compare results based on default and advanced acceleration. Dremio offers end-to-end acceleration, starting with cloud-data-lake-optimized, massively parallel, high-performance readers, so we can read data directly from ADLS, S3, or on-premises S3-compatible storage at a very high rate. Dremio also provides a real-time, distributed, NVMe-based cache called C3, an abbreviation for columnar cloud cache: each executor node caches data to its attached drives, so consecutive runs hit the cache rather than going back through the whole data set.

Dremio also offers transparent materialized views with highly granular matching, called data reflections. That's an optional acceleration technology; it's not enabled by default, because in order to enable it you have to follow best practices and find the patterns in your queries that could be covered by reflections. And the key, the heart of the Dremio engine, is a distributed, vectorized execution model based on Apache Arrow for columnar in-memory analytics; we leverage Gandiva, its LLVM-based execution kernel, to take full advantage of native CPU execution. Dremio also offers the Arrow Flight RPC interface. We haven't used it for this particular test, but we're planning to share more about it in the future. It's a high-speed protocol that offers up to 1,000 times faster data transfer between Dremio and client applications, and it's geared toward displacing and replacing the legacy JDBC and ODBC protocols.

So with that, let's talk about query acceleration with Dremio data reflections. It's a patented feature of Dremio, what we would call advanced acceleration. Like I mentioned before, it's not enabled by default; it's something you have to enable and configure. For this exercise, we identified a set of TPC-DS queries that can be optimized by a common reflection, and then compared the results with the data we had already collected.
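For illustration only, here is what defining such a reflection can look like. This is a hedged sketch: the connection string, dataset, and column names are hypothetical, and the DDL follows the reflection commands in Dremio's documentation rather than anything published with this benchmark:

```python
import pyodbc  # assumes the Dremio ODBC driver and a DSN named "Dremio" are set up

conn = pyodbc.connect("DSN=Dremio;UID=user;PWD=secret", autocommit=True)
cur = conn.cursor()

# Define an aggregation reflection on a (hypothetical) TPC-DS dataset:
# dimensions are the grouping columns, measures are the aggregated columns.
cur.execute("""
    ALTER DATASET tpcds.store_sales
    CREATE AGGREGATE REFLECTION agg_sales
    USING DIMENSIONS (ss_store_sk, ss_sold_date_sk)
    MEASURES (ss_quantity, ss_net_paid)
""")
```

Once a reflection is built, the planner can substitute it transparently; queries keep selecting from the original dataset.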
With that, we saw performance improvements with data reflections over the Presto distributions of up to 1,700 times faster for BI and reporting queries, and up to 3,000 times faster for ad hoc queries.

If you take a look at this particular graph, it represents query execution time for BI and reporting queries with data reflections at scale factor 1000, so approximately one terabyte of data. As you can see from the graph, the execution time with data reflections was reduced from minutes to seconds. That's up to a 67 times performance improvement compared to PrestoDB, up to 60 times faster than PrestoSQL and Starburst Presto, and up to 23 times faster than Dremio without reflections.

At the larger scale, the execution time improvement we've seen goes from hours down to one to three seconds. We saw performance up to 1,000 times faster than Starburst Presto, 1,700 times faster than PrestoSQL, and up to 700 times faster than Dremio without reflections. So as you can see, reflections can give you an enormous improvement in query performance on larger data sets.

Now let's take a look at ad hoc query performance with data reflections. As you can see on this graph, at scale factor 1000, one terabyte, a medium-sized data set, the execution time was improved from minutes to sub-second; we were able to return some queries in less than a second. And again, the query performance improvement was up to 100 times faster than PrestoDB, 60 times faster than PrestoSQL, 25 times faster than Starburst Presto, and 10 times faster than Dremio without reflections.

Moving to the larger data set, the data reflections give you even better numbers, and the execution times go from hours to seconds: we saw 821 times faster than Starburst Presto, 3,000 times faster than PrestoSQL, and 380 times faster than Dremio without reflections. On this graph and the previous ones, the times are in milliseconds, so divide by 1,000 to get seconds, by another 60 to get minutes, and by another 60 to get hours; this one here is approximately one hour on this graph.

So as you can see, Dremio truly allows you to greatly improve query performance and save a lot of infrastructure cost by leveraging advanced query acceleration. You don't have to run your engine for an hour to get results; instead, you can get results in a few seconds, and you save hundreds of dollars simply by not stressing your infrastructure running those queries. But even with default acceleration, Dremio offers excellent performance out of the box, so let's review the execution time results by the same query types next.

If you take a look at query 28 from the TPC-DS set, it's a typical analytical query. We ran this query at scale factor 1000, the one-terabyte data set. From this graph you can see that Dremio shows better results, six times faster than any Presto distribution, on four-node clusters, the smallest size you can get. And at 20 nodes, Dremio is still four times faster than any Presto distribution, returning results in under six seconds, while it takes Presto 20 seconds or more to get those results. Athena was four times slower than Dremio on eight nodes, and our assumption is that Athena is probably running on eight to 12 nodes, because its results were more or less in the same ballpark as the other Presto distributions at those node sizes. As you can also see from this graph, Dremio achieved maximum parallelization at 16 nodes, meaning that adding more nodes doesn't make sense: the query has a low runtime, and the fixed cost of running a query takes a larger portion of the overall runtime. Maximum parallelization was reached based on the data structure, and adding nodes beyond 16 doesn't help, while Presto is still scaling roughly linearly.

The same query at scale factor 10000, the bigger data set, 10 terabytes of data: on a four-node cluster, Dremio returned results for this particular query within three minutes, versus the 16 to 20 minutes it took the Presto distributions to process and return results for the same query. PrestoSQL actually requires five times more nodes to achieve performance similar to Dremio on a four-node cluster: you can see 193 seconds for Dremio on four nodes, and 193.4 seconds for PrestoSQL on 20 nodes. Beyond 20 nodes, Dremio continued to scale linearly; because it's a larger data set, we can keep adding nodes, and Dremio keeps scaling, maintaining that gap in execution time and delivering queries faster than Presto.

The BI reporting query was one of the important ones we wanted to pay attention to, and those queries, like I mentioned before, usually run for a longer period of time. On the one-terabyte data set, scale factor 1000, both PrestoDB and Athena simply failed to execute this query. Athena actually showed very poor performance in its overall query execution rate, beyond an acceptable level; we were getting errors that Athena couldn't even process the query at this scale.
With four nodes, Dremio was up to 9.5 times faster for this particular query, returning results in 41 seconds, while it took more than three minutes for the Presto distributions to process it and return results. Obviously, when you add more nodes, performance improves; however, if you look at this graph, it took Starburst Presto five times more nodes to achieve performance similar to the 41 seconds that Dremio delivers on four nodes. And Dremio again reached optimal performance at 16 nodes: we don't really have to scale beyond that; we get into the 18-second range for this query at 16 nodes while maintaining the gap in execution time and performance.

At the larger scale it's the same story for this query: Athena failed to execute it at this scale too, and on a four-node cluster Dremio returned results within 14 minutes. It's more data, I believe four or five million rows in one of the tables we have to go through to process this data set. And we're comparing 14 minutes against up to one hour and 28 minutes for the Presto distributions. So either you wait 14 minutes to get your data, or you pay for an hour and 28 minutes of running compute resources. Obviously, yes, you can scale; you can add capacity, you can add more nodes. But if you look at the graph again, Presto requires approximately three to four times more nodes to get into the same ballpark as Dremio: where Dremio is at 840 seconds on four nodes, it takes Starburst Presto 12 to 16 nodes, and PrestoSQL is not even there. And beyond 20 nodes, Dremio keeps scaling and maintaining the gap in execution time nicely for this particular query type and data set size.

The ad hoc query, as I mentioned before, is a dynamic query; people expect interactive performance for this type of query. Query 15 is the one we chose to look at. At the smaller scale factor, 1000, one terabyte, you can see from this graph that Dremio consistently shows a two-to-three-second execution time across any number of nodes. Why is that? Again, we achieved efficient parallelization of the query processing at four nodes, and the fixed cost of running the query takes the majority of the time, so throwing in additional nodes would be an unnecessary waste of cloud compute resources.

With that, Dremio shows up to 12 times better performance than PrestoDB on a four-node cluster, and even at 20 nodes Dremio is still on average four times faster. Athena was actually eight times slower than an eight-node Dremio cluster, and even slower at the larger size; the Athena numbers, again, were approximately in the same ballpark as eight-node Presto. And at the larger scale you can see the same picture: on a four-node cluster, Dremio returns results within three minutes, versus 16 to 20 minutes for the Prestos. We're talking about interactive query performance, about getting results faster. PrestoSQL was up to 11 times slower than Dremio; Athena was up to 12.8, call it 13, times slower than an eight-node Dremio cluster; and Starburst Presto on average was doing quite well, but was still three times slower than Dremio. And as you can see from this graph, at 20 nodes Dremio gained approximately a 40% performance boost.
Beyond 20 nodes, Dremio still continues to scale linearly, maintaining the gap in execution time and performance.

Now, execution cost and performance comparison. We did some calculations to work out the average number of queries each engine was able to execute in one minute at the one-terabyte scale factor, at different node counts. Basically, we divided the number of successfully processed queries by the total execution time to average the results. As you can see, Dremio was able to execute more queries per minute than any distribution of Presto, scaling linearly up to 16 nodes, which is kind of a magic number at this scale: optimal performance is achieved at that node count, and after that the engine is simply no longer stressed by the size of the data set.

From this graph it's clear that the fastest Presto distribution we measured, Starburst Presto, requires approximately 20 nodes to get to performance similar to an eight-node Dremio cluster at this scale. And if we look at the larger scale, the 10-terabyte scale factor, we measured the number of queries completed in 10 minutes rather than one minute, given the 10 times larger data set. This graph shows an even bigger gap in linear performance at the bigger scale factor: at this scale, the Dremio engine continues to scale linearly beyond 20 nodes and maintains its performance lead, as you can clearly see.

For example, Starburst Presto achieves performance similar to an eight-node Dremio engine only when its own cluster scales to 16 or more nodes, applying twice the compute resources compared to Dremio. And don't forget that Starburst Presto was running on heavily memory-optimized EC2 instances, so it had a competitive advantage in this exercise: more RAM for query processing. The larger data set clearly shows how much more performance the Dremio engine delivers as data grows; give us more data and we will be more powerful. And PrestoSQL, obviously, couldn't even get into the same ballpark as Starburst Presto or the Dremio engine.
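The queries-per-minute arithmetic described above is simple enough to show in a few lines; the figures below are made-up placeholders, not numbers from the benchmark:

```python
def queries_per_minute(successful_queries: int, total_seconds: float) -> float:
    """Average throughput: successful queries divided by total wall-clock minutes."""
    return successful_queries / (total_seconds / 60.0)

# E.g., 58 successful queries in 29 minutes of total execution time -> 2.0 QPM.
print(queries_per_minute(successful_queries=58, total_seconds=29 * 60))
```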
And finally, as we get toward the end of our presentation, let's talk a little bit about the architectural differences that allow Dremio to outperform any distribution of Presto at any scale. First of all, yes, Dremio and Presto use exactly the same pattern: you have a coordinator node that receives all incoming queries and redistributes them among the worker, or executor, nodes, which read the data, process it, and return the results back to the client.

However, the execution model is quite different. Dremio processes data in columnar form end to end: we read from Parquet files, which are columnar, process in memory with Apache Arrow, which is columnar, and then return the results. Presto does it differently: it converts the columnar data into row-based data and then processes it row by row as it goes through the data set.

The engine architecture is different too. Dremio AWS Edition, along with the enterprise and open source editions, offers the notion of multiple engines: you can dedicate engines to different workload types, and based on the workload and the queue, it can spin up four-, eight-, 12-, or 20-node engines with different compute sizes that correlate to your workload needs. Presto, by contrast, is just a single engine; there's no scalability from that point of view.

The runtime is different: as I mentioned before, Dremio uses LLVM code generation, basically native code, to take full advantage of the CPU's native compute capabilities, while Presto runs on Java. We offer NVMe caching technology: Dremio's C3 cache is on by default, so every time you query data it gets cached, and a second run of the same or a similar query returns data from the cache. Presto just announced caching capabilities, I think in one of its latest versions, and it's still beta; I did run some tests on it, and the results weren't too impressive. I actually got lower performance with the cache enabled than without it.

Then there's query acceleration: you can achieve up to 1,000 times faster performance by leveraging data reflections, while Presto offers no technology like that at all. And lastly, cloud-data-lake-optimized readers: we offer predictive pipelining, asynchronous reads, and massively parallel readers that optimize the read path at larger scales for high performance, while Presto offers none of those.
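The columnar-versus-row distinction is easy to feel even outside either engine. Here is a toy sketch using Apache Arrow's Python bindings, not Dremio's or Presto's actual internals: the same sum computed once with a vectorized columnar kernel and once with a value-at-a-time loop:

```python
import time
import pyarrow as pa
import pyarrow.compute as pc

n = 5_000_000
batch = pa.record_batch([pa.array(range(n), type=pa.int64())], names=["amount"])
column = batch.column(0)

t0 = time.time()
vectorized_total = pc.sum(column).as_py()        # columnar, vectorized kernel
t1 = time.time()
row_total = sum(v.as_py() for v in column)       # value-at-a-time loop
t2 = time.time()

print(vectorized_total == row_total)
print(f"vectorized: {t1 - t0:.3f}s, row-at-a-time: {t2 - t1:.3f}s")
```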
Louise Westoby
Yeah, thank you, Serge. As Serge mentioned, we'll go ahead and take some time for questions now; be sure to type them into the question box in your control panel. If you wouldn't mind going to the next slide? While we're queuing up the questions, I want to share a couple of resources I thought you might find useful if you're interested in digging deeper into the benchmark results. The first is a report that documents all of these results in more detail; that report will be made available to all of today's webinar attendees, so please look for an email from us with the details. The second is our upcoming webinar series that dives deeper into some of the specific Presto distros; more information on those will be available soon. All right, with that, do you want to move on to the next slide? I think we'll go ahead and open it up to Q&A. So Serge, just a question about the benchmarking process: can you talk a little bit about why you selected TPC-DS rather than TPC-H for benchmarking?
Serge Leontiev
Excellent question, yes. TPC-DS and TPC-H are based on the same kind of decision support benchmarking. While TPC-H is geared more toward ad hoc decision support, where you measure the performance of ad hoc queries and query concurrency, TPC-DS offers a deeper and broader pool of queries, dealing with different analytical and reporting queries besides just ad hoc queries. So from our point of view, TPC-DS better reflects the way cloud query engines operate in real life.
Louise Westoby
Okay, thanks Serge. Next question: did you generate Hive metastore stats and use the cost-based optimizer for Presto?
Serge Leontiev
Absolutely. We ran the ANALYZE command every time we added node capacity to the clusters. We were trying to make sure we were leveling the field, so we executed ANALYZE every time.
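For context, this is roughly what that step looks like with the Presto Python client; the connection details and schema names here are hypothetical:

```python
import prestodb  # presto-python-client

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com", port=8080,
    user="bench", catalog="hive", schema="tpcds_sf1000",
)
cur = conn.cursor()

# Refresh table and column statistics so the cost-based optimizer
# has current information after the cluster topology changes.
cur.execute("ANALYZE store_sales")
cur.fetchall()  # drain the result to let the statement complete
```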
Louise Westoby
Okay, thanks. Next question: do you have the timings for the first runs of the queries, where SSD caching with C3 would not have an effect?
Serge Leontiev
Absolutely, yes. And actually, it doesn't make much difference. All of the TPC-DS queries are based on the same data structures, the same set of tables. When we ran the first three or four queries, yes, the C3 SSD cache would definitely be empty, but starting from around query four or five I saw improvements in performance, and the query plan showed that we were using the C3 cache. You can disable C3 if you want to, though I'm not sure why you would. So yes, we do have results showing the difference in performance, but there's not that much difference between them, because the cache gets populated quickly with these particular types of queries.
Louise Westoby
Okay. Next question, and I apologize, Serge, if I misread this: is the ad hoc query a SELECT asterisk (they included the little asterisk symbol) query?
Serge Leontiev
Sometimes yes, sometimes not. It depends. With an ad hoc query you're basically shopping around; you're not quite sure what you're looking for. You may start with SELECT * and then add WHERE clauses, or filter the number of fields you're selecting. Usually you don't have a predefined result in mind, so SELECT * is one of those queries, but it's not the only kind of ad hoc query you can run.
Louise Westoby
Okay, great, and now I know how to pronounce that. All right, next question: do materialized views get a benefit on the first run of a query, or do they need subsequent runs to show a benefit?
Serge Leontiev
It will benefit on the first run of the query. When you define your data reflections, the query engine starts building those reflections for you, basically creating a copy of the data representation, the materialized view, on C3 storage. Then, when you're executing your queries, the query optimizer checks whether it can match your query against a particular reflection, and if it can, it reuses it, even on the first run. Good question.
Louise Westoby
Yup. For the QPM benchmark, was each query executed sequentially, or was there concurrent query stream execution?
Serge Leontiev
It was executed sequentially; we decided not to do concurrent execution this time. For all the tests we did, it was sequential execution of the TPC-DS queries, one after another, yeah.
Louise Westoby
And the next question's somewhat related: how did you launch the queries? Were they run sequentially on the same EC2 instances without restarting the cluster?
Serge Leontiev
Yes, absolutely. We had configured the clusters for Dremio and the Prestos as separate infrastructure instances that were up and running the whole time. Then I configured a separate EC2 instance just for benchmarking, to run the Apache JMeter client. It was an m5d.8xlarge instance, to match the network bandwidth, so we wouldn't lose performance or add latency because of the network. And I executed the JMeter client on that instance against the different targets, using the JDBC driver for Dremio, the JDBC driver for Presto, and the JDBC driver for Athena, to sequentially run all the queries, one by one, yeah.
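A minimal Python analogue of that JMeter test plan, for readers who prefer scripting it: the helper below times each query sequentially over an already-open DB-API cursor (such as the prestodb example earlier) and records failures instead of aborting. The function and query list are our own illustration, not part of the published harness:

```python
import time

def time_queries(cursor, queries):
    """Run (name, sql) pairs sequentially, returning (name, seconds, status)."""
    results = []
    for name, sql in queries:
        start = time.time()
        try:
            cursor.execute(sql)
            cursor.fetchall()              # drain the result set fully
            results.append((name, time.time() - start, "ok"))
        except Exception as exc:           # e.g., insufficient-resources errors
            results.append((name, time.time() - start, f"failed: {exc}"))
    return results
```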
Louise Westoby
Okay, the next question has to do with Kubernetes: can we deploy Dremio nodes across multiple Kubernetes clusters?
Serge Leontiev
Absolutely. There's a recommended approach: if you go to our website and look at the EKS deployment guide, we provide the tools and everything you need, so you can deploy directly on your cloud infrastructure or in a private cloud. We use Helm charts for the infrastructure deployment, and you can control those deployments and clusters however you want.
Louise Westoby
In the case of reflections, how were the reflections defined in Dremio? Which parts were configured to be materialized? Are you able to share that definition?
Serge Leontiev
Yes, we actually have a very good document outlining reflection best practices; if you go to our website, it should be in the white papers or in the documentation. It all depends: they could be raw reflections, which basically materialize the whole data set, or aggregation reflections, where you select aggregated measures and store those numbers in the reflection. I would highly recommend taking a look at that reflection best practices guide; you can find a lot of answers to these questions there.
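To complement the aggregation reflection sketch earlier, a raw reflection definition takes roughly this shape in Dremio's documented DDL; the dataset and column names remain hypothetical:

```python
# A raw reflection materializes the selected columns of the whole dataset,
# rather than pre-aggregated measures. Execute it the same way as the earlier
# aggregate example, e.g. cur.execute(RAW_REFLECTION_DDL).
RAW_REFLECTION_DDL = """
    ALTER DATASET tpcds.store_sales
    CREATE RAW REFLECTION raw_sales
    USING DISPLAY (ss_item_sk, ss_store_sk, ss_sold_date_sk, ss_net_paid)
"""
```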
Louise Westoby
Okay, thank you. There's a question about Gandiva: could you please explain a little more about Gandiva?
Serge Leontiev
Yeah, Gandiva is basically the execution kernel that we use alongside Apache Arrow. It allows us to use the CPU's native capabilities for data processing: instead of running everything through an interpreted, Java-based execution engine, you're directing everything straight to the CPU. We have a very good write-up by Tomer, and if you go to the Gandiva page, I think it's under Apache Arrow, it gives you a very good, thorough explanation of how Gandiva works and how it gives you an additional performance boost.
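Gandiva has Python bindings in some pyarrow builds, which makes the idea easy to demo: you describe an expression as a tree, Gandiva JIT-compiles it with LLVM, and the resulting projector runs over Arrow record batches. A hedged sketch, with toy data:

```python
import pyarrow as pa
import pyarrow.gandiva as gandiva  # available in pyarrow builds with Gandiva enabled

field_a = pa.field("a", pa.float64())
schema = pa.schema([field_a])

# Build the expression tree for: a * 2.0
builder = gandiva.TreeExprBuilder()
node = builder.make_function(
    "multiply",
    [builder.make_field(field_a), builder.make_literal(2.0, pa.float64())],
    pa.float64(),
)
expr = builder.make_expression(node, pa.field("a_times_2", pa.float64()))

# LLVM-compile the expression once, then evaluate it over a record batch.
projector = gandiva.make_projector(schema, [expr], pa.default_memory_pool())
batch = pa.record_batch([pa.array([1.0, 2.0, 3.0])], names=["a"])
print(projector.evaluate(batch))  # [[2, 4, 6]]
```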
Louise Westoby
Okay. The next question is about process: do you have a percentage breakdown of these improvements? And I believe that's covered in the papers, is that correct?
Serge Leontiev
Yeah, that's covered; refer to the benchmarking paper I put together. In this webinar we covered just a high-level overview, but in the paper itself I went into deeper detail about each and every competitor, and you can see the numbers there. On reflections, for example, I published a couple of additional results in the paper as well, so you can compare the execution times. We're also going to publish, per the other questions, a couple of result sets and make them available for you, so you can take a look and run them yourselves. For example, I generated these graphs using Tableau, mixing up those data sets to create the graphs; you can do the same, or follow the same practices and topology, run the same JMeter tests, and get similar results as well.
Louise Westoby
Okay. Next question: how much impact do you think the C3 NVMe cache has on the performance benchmarks? Are the results different if that capability is turned off?
Serge Leontiev
Yeah, it would be different. I saw at least a 30% performance boost with C3 versus without it. C3 definitely helps and improves query performance, for sure.
Louise Westoby
Okay, thanks. A question about Redshift Spectrum and Snowflake: why haven't you benchmarked Redshift Spectrum and Snowflake?
Serge Leontiev
Yes, so first of all, Redshift itself and Snowflake are data warehousing solutions. Even though they provide capabilities to query data in cloud data lake storage, like Spectrum, and with Snowflake you can create a kind of [inaudible 00:56:50] or load your data to S3, they're not quite the same as query engines themselves. They offer the ability to access the data, but they won't give you similar performance; to improve query performance, you'd have to move your data into Redshift or Snowflake to get comparable results. So for that reason, for this particular round of benchmarking, we concentrated our effort on the data lake query engine space, as you'll remember from the quadrant I showed you at the beginning. Snowflake is something we're looking at in terms of benchmarking, but benchmarking against Snowflake would definitely be different; there it would be great for us to compare the costs, the cost savings of Dremio versus Snowflake infrastructure.
Louise Westoby
Okay. Next question: is there any built-in workload management? Can Dremio tell us which tables, joins, and aggregations are being used the most?
Serge Leontiev
Absolutely, that's one of the features we offer out of the box: data lineage. Through data lineage you can see the full path of where the data came from. Absolutely, yes.
Louise Westoby
Okay, and one very last question: will you publish the tabular results of your benchmarks?
Serge Leontiev
Yes, as I mentioned before, absolutely. We're going to publish our tabular results and make them available for you guys.
Louise Westoby
Okay. All right, thanks Serge. Lots of fantastic questions; unfortunately, we did not get to them all. For those of you whose questions we didn't answer, we will follow up via email, so please look for a message from us. I think there were also some additional deep-dive questions that are answered in the white paper, so, as a reminder, the report will be sent to you via email shortly. Thank you again for joining us today, and we look forward to seeing you again next time.