May 3, 2024

Revolutionizing Data Lakes: How Dremio and MinIO Deliver a Modern AI Infrastructure

As businesses continue to transition to AI/ML use cases, Hadoop-based data lakes are often unable to keep up with the large-scale data performance and processing needs that these application workloads require. This presentation explores the adoption of Dremio and MinIO for constructing data lakes optimized for these advanced tasks. The key challenges that legacy data platforms face with AI-specific workflows, such as model training and refinement, feature selection, and real-time inference and decision-making, typically relate to scalability, performance bottlenecks, and rigidity in data handling. A modern data lake stack composed of Dremio and MinIO addresses these challenges directly.

Sign up to watch all Subsurface 2024 sessions

Speakers

Ugur Tigli

CTO, MinIO

Brock Griffey

Solutions Architect, Dremio

Video Synopsis

Unified Lakehouse Platform
What is MinIO
What Matters to Enterprise Data Leaders
Object Storage as Primary Storage
The Leading Object Store
MinIO’s Value Drivers and Differentiation
MinIO and Dremio
Modern System Architecture for Generative AI

Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Brock Griffey:

Hi everyone, I’m Brock Griffey, I’m a master principal solution architect here at Dremio. I’ve been with the company for about five years and been really working very hard with our customers on how Dremio can deliver a modern architecture and working with technologies such as MinIO here. I’m joined by Ugur, the CTO of MinIO, and we’re going to talk a little bit about the journey.

Unified Lakehouse Platform

So most of you are pretty familiar with the Dremio platform and what we offer here. We have the ability to offer unified analytics, a query engine on top of all that, as well as a lake house management piece, and be able to deploy anywhere you want. Dremio can go into the cloud, it can go on-prem, and go wherever you need that data to be and where you need that access to be. And of course, a part of this open architecture is the ability to use any object storage. And this is where we really shine, is we can utilize things like MinIO to actually get better performance over legacy systems like Hadoop. So how do we make this migration better for you? Ugur will be able to talk through that process and what he’s seen with his customers and how we can make that easier for you. And why would you want to make that easier? Well, if you’ve ever been stuck in the Hadoop architecture, you know that there are a lot of complexities around it. Management is a nightmare. You have growing costs constantly, scaling is very, very difficult, and the performance doesn’t get better, it just gets worse over time. And the lack of self-service really makes this a very complicated problem to solve. And the biggest challenge that people have had is, within Hadoop ecosystem, is a paradigm of compute and storage being one thing. You can’t scale one without scaling the other. If I need to add more storage, I now have to add more compute, I have to add more space to my infrastructure in order to get that in there. This is not something that most customers want to go through. It adds to complexity and problems. I’m going to hand it over to Ugur here to talk a little bit more.

Ugur Tigli:

Thank you. I think I’m good with this. Hi everyone. Nice to see you here. What Brock said in his last point is exactly the reason why MinIO and Dremio are a great combination. We see time and time again people not understanding how the query engine, the compute part, and the storage part fit together. They are hand in glove. They have to work together in order to get a system or an architecture that works well for any kind of use case, any kind of business application where you need to get outcomes for your end customer. In order to achieve that, we truly believe that you have to separate, disaggregate, the storage components and grow them separately from the query engine component or any kind of engine that’s sitting on top of that. With S3 APIs and modern, high-performance object storage, you can achieve that easily. On the last point that Brock was making, Hadoop compute and storage have been together for many years. Hadoop was designed probably 20 years before any of these new topics that we are talking about. It was the right engineering design for that time, for the network of that time. It was a one-gigabit network at the time, perhaps, the maximum network you could get. And having compute and storage together was the right choice. But now, in a cloud-native world, an AI-ready world, things are changing. The modernization of the data lakes, how you approach data, how you query, and how you extract your information is a totally different world. And I’m going to talk about it a little bit at the end. But first, I want to introduce you to what MinIO does in this equation and then jump into MinIO plus Dremio and what we see in the field.

What is MinIO

MinIO is a high-performance, Kubernetes-native, cloud-native object store. We have designed the whole system to be very high-performance. There are a few design principles, which I’m going to cover in a bit, but one of them is high performance. We have been very open about it. We come from an open-source culture and open-source roots. Our founding core engineering team had already built a distributed POSIX file system in a previous life, called GlusterFS. That was an open-source approach, and it was acquired by Red Hat. And they had this idea of having an S3 API stack and a full stack of storage. MinIO is that full-stack storage, a drop-in replacement for the AWS S3 service or the whole API stack. However you look at it, we’ve been quite popular and successful in various use cases, one of them being the on-prem replacement of cloud services, S3-like data services, for all enterprises that use large-scale on-prem clouds. But we can work on bare metal. We can work on Kubernetes. We can work on any instance from a public cloud perspective as well, technically speaking, because we compile the code for ARM, Intel, AMD, and all of the chipsets that you can see. It’s a software-only company and a software-only product, and it works on any platform, no matter which platform you use. We are above 300 commercial customers now. We have many more users in the open-source community. And the way it works is a lot of people use our code in the development phase, and when they go into a production phase, they come to us for direct engineering support and commercial engagement.

What Matters to Enterprise Data Leaders

Some of the reasons why people choose us are summarized here. I’m not going to read through all of them, but performance at scale is really hard to do. It’s the query engine, Spark, Dremio, running against any type of data with Hadoop, with any other system, even in AWS. When you go large-scale, things get really hard because you have bottlenecks at the network. You have bottlenecks at the querying if you don’t have multiple executors. In one of the other talks, I heard that they had 35 executors going against the same data lake in their production environment. And that needs to go hand-in-hand, scaling linearly, in order to get great performance. AI/ML workloads, you’re all familiar with that, being in this conference or in this session. So as ML was growing, and now with generative AI and other flavors of AI, it’s all about throughput and getting the data closer to the GPUs, whether for inference or for training. Whatever the case may be, high-performance data is becoming the bottleneck in most cases. And you need to enable that. And in terms of the modern storage requirements nowadays, it’s all about high-performance network, high-throughput network, low latency. And the HDDs, the hard disk drives, the classical disk drives, are not cutting it for any of those modern workloads. You really have to push NVMe type of technology in order to fill the pipes.

The difference is about– according to my calculation, it’s about 10 to 12x difference between throughput on a generic workload on hard disk drive to an NVMe. With four to six NVMes on a server today, you can easily fill the 100-gigabit pipe. And then primary storage is– object storage was always thought to be the archival storage in the past with the cloud-native workloads and the advances in the whole AI/ML drivers for this data. You’re now getting object storage because of the S3 APIs, thanks to AWS S3 and Amazon opening up that path. API is the standard interface for all applications. That’s from notebooks to any kind of any AI application or AI databases nowadays. They all use object storage underneath. And that becomes the primary storage for all of those newer use cases and newer application. And we became the primary storage, not the object storage that’s been sitting in an appliance that’s using very slow hard disk drives to archive and park the data, more of a tiering or in a backup case. Now we are in the center of this new world, with especially generative AI. This is coming back further and further into the forefront from vector databases to others. They all use object storage.
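To make the throughput arithmetic above concrete, here is a minimal Python sketch. The per-drive figures (0.25 GB/s for a hard disk drive, 3 GB/s for an NVMe drive) are round numbers assumed for illustration, not measurements from the talk; real drives and workloads vary widely.

```python
import math

# Assumed, illustrative per-drive sequential throughput in gigabytes per second.
# These are round numbers chosen for the sketch, not figures from the talk.
HDD_GBYTES_PER_S = 0.25
NVME_GBYTES_PER_S = 3.0


def link_gbytes_per_s(link_gbits_per_s: float) -> float:
    """Convert a network link speed from gigabits/s to gigabytes/s."""
    return link_gbits_per_s / 8.0


def drives_to_fill_link(link_gbits_per_s: float, drive_gbytes_per_s: float) -> int:
    """Smallest number of drives whose combined throughput meets the link rate."""
    return math.ceil(link_gbytes_per_s(link_gbits_per_s) / drive_gbytes_per_s)


if __name__ == "__main__":
    print("NVMe/HDD throughput ratio:", NVME_GBYTES_PER_S / HDD_GBYTES_PER_S)
    print("NVMe drives to fill 100 Gbit/s:", drives_to_fill_link(100, NVME_GBYTES_PER_S))
    print("HDDs to fill 100 Gbit/s:", drives_to_fill_link(100, HDD_GBYTES_PER_S))
```

Under these assumptions the ratio and the drive count land in the same range the speaker describes (roughly 10 to 12x, and four to six NVMe drives per 100-gigabit link).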

Object Storage as Primary Storage

This is just the landscape, even from databases, which are the last I would have thought would adopt object storage. Object is the primary. They are doing this with external tables. They are also doing this with other query mechanisms that they extend, from Snowflake to Microsoft SQL. They all support external tables nowadays. And the external table overflow data can be sitting on top of MinIO, on top of object storage. And they all take S3 as the de facto standard. That’s the standard API. And the other parts of the picture here are log analytics, AI/ML, and streaming. Milvus, for example, is one of the newer generation vector DBs. And on their website, in their documentation, they support object storage from the get-go. They skipped the part about legacy technologies. They just directly support object storage. And MinIO is the one that they recommend in their deployments. And similar stories: Spark is adopting it. And TensorFlow has supported object storage from day one. If you’re on cloud, they recommend S3. If you’re on-prem, they recommend MinIO. So those integration pieces have always made us the primary storage for those use cases. If you talk about object storage for other generic, mainstream IT infrastructure type use cases, people may talk about it being archival. But for those database and AI/ML types of use cases, we’ve always been the primary storage. And that gave us the popularity in many of those use cases and is how we grew our customer base.

The Leading Object Store

This is the list of all of the enterprise features we have. I’m not going to read through or mention them one by one. But I’m going to pick a few classic, very important features that enable some of those use cases. Number one is erasure coding and bitrot protection. On erasure coding, in the previous session also, they were talking about how they modernized Hadoop into a modern data lake or modern architecture with MinIO. And the most important piece is that Hadoop keeps three copies. From the get-go, just by changing the three copies into erasure coding, you’re going to 1.2x to 1.5x. Or, as it was mentioned there, with three copies, 33% is what you get of the raw or available disk space or hardware you have. You get 33% yield. With erasure coding, you’re getting 75% yield. And that’s the equation. And from the get-go, in any kind of TCO model, any kind of architectural analysis, or any kind of cost analysis you do, you are getting about half of the savings up front from just that change. And then you add up the simplification and the other things that go around it. So RAID in the classical sense, what all storage systems use, is actually a subset of erasure coding. Erasure coding is a general mathematical concept that includes RAID itself. And the other part that I want to mention here is that you’re using S3 APIs across all systems, whether it’s on the cloud running MinIO, or on-prem, or in any kind of system that you have at the edge, on a single node, single instance. The API integration and sticking to one API is key here. No code changes, no application changes. If you believe in the de facto standard of S3, and most of the cloud-native world uses that today, you can use that even in other clouds. It doesn’t have to be the AWS S3 service. It can be on any other cloud, because MinIO runs across all of these clouds. You can run it on Google Cloud, you can run it on Azure as well.
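The yield comparison above can be worked through in a few lines of Python. The 12 data + 4 parity shard layout is an assumption chosen because it reproduces the 75% figure mentioned in the talk; it is not presented as the configuration the speaker had in mind.

```python
def replication_yield(copies: int) -> float:
    """Usable fraction of raw capacity when every object is stored `copies` times."""
    return 1.0 / copies


def erasure_yield(data_shards: int, parity_shards: int) -> float:
    """Usable fraction of raw capacity with data + parity erasure coding."""
    return data_shards / (data_shards + parity_shards)


def raw_needed(usable: float, yield_fraction: float) -> float:
    """Raw capacity required to provide `usable` capacity at a given yield."""
    return usable / yield_fraction


if __name__ == "__main__":
    hadoop = replication_yield(3)   # three copies, about 33% yield
    ec = erasure_yield(12, 4)       # assumed 12+4 layout, 75% yield
    print(f"3x replication: {hadoop:.0%} yield, {raw_needed(1.0, hadoop):.2f}x raw")
    print(f"12+4 erasure coding: {ec:.0%} yield, {raw_needed(1.0, ec):.2f}x raw")
    saved = 1 - raw_needed(1.0, ec) / raw_needed(1.0, hadoop)
    print(f"Raw capacity saved for the same usable data: {saved:.0%}")
```

Under that assumed layout, the same usable capacity needs about 1.33x raw instead of 3x, which is where the "roughly half the savings up front" in a cost model comes from.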
And all other things that you need from an integration to enterprise, from access management, to encryption, to other services in an enterprise, we have done all of that integration work already. And that’s what got us to today. And we have that 300 plus customers, commercial customers, using us all in a setup of an enterprise with LDAP, KMS, and things like that.
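As a rough illustration of the "no code changes" point about sticking to the S3 API, the sketch below shows how an application written against AWS S3 might be pointed at another S3-compatible endpoint by changing only client configuration. It assumes the third-party boto3 SDK; the endpoint hostname, credentials, and bucket name are hypothetical placeholders.

```python
from typing import Optional


def s3_client_kwargs(endpoint_url: Optional[str] = None,
                     access_key: Optional[str] = None,
                     secret_key: Optional[str] = None) -> dict:
    """Build keyword arguments for boto3.client("s3", ...).

    With no arguments the SDK defaults apply (AWS S3). Passing endpoint_url
    redirects the same client to another S3-compatible service.
    """
    kwargs = {}
    if endpoint_url:
        kwargs["endpoint_url"] = endpoint_url
    if access_key and secret_key:
        kwargs["aws_access_key_id"] = access_key
        kwargs["aws_secret_access_key"] = secret_key
    return kwargs


def make_s3_client(**config):
    """Create the client; boto3 is imported lazily so the helper above stays dependency-free."""
    import boto3  # third-party AWS SDK for Python
    return boto3.client("s3", **s3_client_kwargs(**config))


if __name__ == "__main__":
    # Hypothetical on-prem endpoint and credentials, for illustration only.
    client = make_s3_client(endpoint_url="http://minio.example.internal:9000",
                            access_key="EXAMPLE_KEY", secret_key="EXAMPLE_SECRET")
    # The application code below is the same whichever endpoint is configured.
    client.put_object(Bucket="example-bucket", Key="hello.txt", Body=b"hello")
```

The design choice here is to keep the endpoint in configuration rather than in application logic, so the same calls run against cloud or on-prem storage.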

MinIO’s Value Drivers and Differentiation

The reason why we are chosen by many of these modern data lake architectures and the other use cases that I was mentioning is that we are very performant. We have written blogs about this. We are very open about this. We use the registers in the chipsets of the last 10 years to do the erasure coding calculations. And that allows us to push throughput at a higher rate than anybody not taking that type of approach. It’s kind of offloading into registers, which are available as extensions called SIMD instruction sets in most of the chipsets. We use those SIMD instruction sets. AVX-512 is the one that Intel uses. Others have similar names and similar functionality. We wrote the code and optimized it for all of them. That’s why we can get to very high performance. That 325 gigabytes per second, capital B, is from a benchmark we did with 32 nodes. As you add more nodes, the performance scales linearly.
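The benchmark figure can be broken down per node with simple arithmetic. The sketch below assumes the 325 figure is aggregate gigabytes per second across the 32 nodes, and it takes the linear-scaling claim at face value for the projection; the projected number is an extrapolation for illustration, not a measured result.

```python
def per_node_throughput(aggregate_gbytes_per_s: float, nodes: int) -> float:
    """Average throughput contributed by each node."""
    return aggregate_gbytes_per_s / nodes


def projected_throughput(per_node_gbytes_per_s: float, nodes: int) -> float:
    """Aggregate throughput if scaling were perfectly linear (an idealization)."""
    return per_node_gbytes_per_s * nodes


if __name__ == "__main__":
    per_node = per_node_throughput(325, 32)
    print(f"Per node: {per_node:.2f} GB/s")
    # Hypothetical 64-node cluster under the linear-scaling assumption.
    print(f"Projected for 64 nodes: {projected_throughput(per_node, 64):.0f} GB/s")
```

At roughly 10 GB/s per node, each node is close to what a single 100-gigabit network link can carry (12.5 GB/s), which is consistent with the network being the limiting resource.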

Cloud-native, because we have always been very lightweight. We have written everything in Golang. The whole binary was about 50 megabytes before we put the UI in it. Now that we have the UI, it’s about 120 megabytes or so. We’ve kept things very simple in the way we do things. We try not to bloat the software. If you look at other software-based storage systems, they tend to start with block-based storage and then add a file system, and then add another gateway. They just kept on adding floors and floors of software architecture. I call it the seven-layer wedding cake. At the end of the day, your S3 is so slow by the time it gets to the block. We just use commodity raw devices mounted with an XFS type of file system. And then we stitch them all together in memory, where we know exactly where the objects are. And we simply get our high performance from there. And we are cloud-native as a result of that, because from the Docker days to Kubernetes, everybody started using us because of our simplicity and our light footprint. And of course, we’ve always been cloud-native in the way that they are used to, from Prometheus to log messages going into the console. They were used to the way we work and the way we implemented the product.

Simplicity has always been there. We’ve been extremely simple. Developer loves it, because it’s a single command line. You just get the binary. It’s a static binary. You run it against a sandbox or against a file system that you have local mount point. We start working. And that’s kind of the beauty of MinIO, and that’s why we have been quite popular, both on open source and the commercial side as well. And AI ready is more about the AI ecosystem. We are at the center of it. I already explained from Milvus to other. Now the whole AI world, it’s kind of an increase or incremental progress of the things that we already know, and you guys know it best, from the ML days, from the data architectures that was done from ML Ops. Now it’s AI Ops, but it’s just added a few other pieces from the vector databases to do foundational models. The ecosystem got larger, but we are still in the center of it because of the integration to object storage. And there’s a nice slide at the end that I’m going to share how we play a role in that, because all of those components, apart from the more transactional DB needs in a ML or an AI workflow, the rest of them can sit on object storage. And that’s why we talk about us being AI ready.

MinIO and Dremio

MinIO and Dremio are also very much together in this modernization world. Of the three areas that Dremio focuses on, two are the same world that MinIO focuses on. Hadoop modernization is one where Dremio plays a big role: a lot of people are moving away from Hadoop, the legacy technology, which for one reason or another is going downhill. And they are shifting to more of a query engine approach to data and trying to use open table formats as a way to solve some of the other problems and get rid of all the bells and whistles of the Hadoop world. And as in one of the slides Brock had, it’s Hadoop being compute and storage together. If you disaggregate that, you open up so many opportunities architecturally. I joined MinIO seven years ago. Before that, I was responsible for running data centers, storage, and compute for Bank of America Merrill Lynch. We had a Hadoop cluster of 135 nodes or so, I remember. And the utilization on the CPUs was 2%, literally 2%. And to me, that’s a waste of silicon. I mean, 2% for querying, even at the peak, when you take the average. It was being bought and deployed just as a way to get storage. That was just my example from the past, but many other companies have been using Hadoop in a way that’s totally inefficient.
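The utilization example above can be turned into a small sizing sketch. It uses the 135-node, 2% CPU figures from the talk and deliberately ignores peak load, headroom, and redundancy, so the result is an illustration of the coupling problem rather than a sizing recommendation.

```python
import math


def coupled_cluster_nodes(storage_nodes_needed: int, compute_nodes_needed: int) -> int:
    """When compute and storage scale together, the larger requirement sets the size."""
    return max(storage_nodes_needed, compute_nodes_needed)


def busy_node_equivalent(total_nodes: int, avg_cpu_utilization: float) -> int:
    """Whole nodes' worth of CPU actually consumed at the given average utilization."""
    return math.ceil(total_nodes * avg_cpu_utilization)


if __name__ == "__main__":
    storage_nodes = 135  # cluster sized by storage need, as in the talk
    compute_nodes = busy_node_equivalent(storage_nodes, 0.02)
    print("Compute actually used (node equivalents):", compute_nodes)
    print("Coupled cluster size:", coupled_cluster_nodes(storage_nodes, compute_nodes))
    # With disaggregation, the compute tier could be sized near `compute_nodes`
    # (plus headroom) independently of how many storage nodes are deployed.
```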

Once you break that, you have a lot of cost savings, and you have a lot of performance gains. All these bottlenecks go away. And MinIO plus Dremio are the two components you need in that modern architecture. A lot of MinIO customers just don’t think about all the bells and whistles from Hadoop. Maybe they will have the Spark job still running, but Spark and Dremio all working against the same data lake is what they care about. Because they can achieve the high performance, they can achieve the simplicity, with a simple architecture of Dremio being the query engine, catalog, everything on the top. And for object storage, whether it’s the open table formats or any other classical format that they’re using, it’s still the same object storage. You’re just storing basic data on top of it, and it’s highly scalable, both on the data side and on the query engine side. That’s the key here. And Iceberg makes life so much better. We heard that everybody has a different approach to it, but the moment you have the advantages of Iceberg metadata and things like versioning and treating the data with more of a Git-style approach, you can achieve a lot more with a lot less infrastructure and a lot fewer people managing that infrastructure. And then you get control of your systems and architecture back with technologies like this. And that’s what we are talking about when we talk about any type of data architecture modernization of Hadoop and data lakes.

If there is no modernization of Hadoop, then there is the modern data lake approach when you’re starting from fresh, basically, then it makes sense to just have that architecture that I just showed in this slide. You don’t need anything else for 95% of the use cases as far as I’ve seen in the field. You can achieve most of the things that you want to achieve with this two, three components, and that resonates with a lot of customers, a lot of enterprises, and they are shifting slowly. Of course, it’s hard to let go of the old technology, but there is a way to kind of start small and then grow into it. And as they see the benefits, it’s really hard to go back into the old style Hadoop world.

Modern System Architecture for Generative AI

In the last couple of minutes, I just want to talk about how we link this to the modern way of doing things in the AI world as well. So all the things that we talked about, modern data lakes, and then, depending on how we define the lake architecture, modern data lakehouse, warehouse, and all of those buzzwords, at the end of the day it’s all about where you keep your data, how modern you can make your ETL process and how you can do it more like CI/CD, what kind of format you’re using for your data, and the query engine at the end. So those three components are what you need, in my opinion. But then in the generative AI world, that ecosystem changes a bit. At the core is still what we are used to and what we were talking about, the modern data lake, but then you have the ML ops and AI ops. It’s becoming AI ops now. There are different kinds of pipeline phases that also use data and that need a tremendous amount of high-throughput, fast-performance data. In the inference, in processing the data, as well as in the training, especially training, as we have seen at MinIO. I personally worked with the Anthropic folks. Their need was just to push the throughput to the maximum in a public cloud environment. At the time, they were working with one of the public cloud vendors. I think they were close with them, and they needed to use their infrastructure. But they were getting bottlenecked at the network interface.

So in the training, in the ML ops or AI ops type of world, you really need to make sure that, whether you’re using Ethernet or other technology, it doesn’t matter what the transport mechanism is. You just need to make sure that those GPUs are getting filled with data at all times in a scalable fashion. And that’s the challenge there. And then once you’ve done the training, there are snapshots of the models that you have to save back, and those also require a lot of data. So we play a role on that front as well. And the last piece is the vector database. Most of the vector databases already run on object storage, so they utilize object storage anyway. So the picture is kind of upside down, but they are riding on top of MinIO. And the middle section, the modern data lake architecture, is what we’ve been talking about previously as well, Dremio plus MinIO.

On the left side, the Hugging Face Hub is more about the repository part, the repository of foundational models. And then this whole picture gets more complicated when you go to large enterprises, when you do AI models, foundational models, plus RAG. The RAG ecosystem is new and it’s just starting, but most enterprises will need to have control of their data in the AI world. And then, combined with RAG, there is their proprietary data that needs to be blended into the output of this whole ecosystem. That’s another layer of complexity, but that also requires on-prem data. And if most of the data is already on an object storage system like MinIO, it’s easier to integrate it into an AI workload or an AI ecosystem like this. So this is just to give you a flavor of what we are discussing with some of the customers. Some of them have implemented the parts on the right and left. And then the RAG is coming down the pipe. They are just thinking about the RAG piece and about how to integrate their own data with the foundational models, whether those are open or public, it doesn’t matter. At some point those models will be a commodity. The most important part will be RAG plus the foundational model, in my opinion. But that’s to be seen. We don’t know how that’s going to pan out.

In summary, I want to give some time for questions. We are– Dremio plus MinIO is both going after the same type of a kind of a base architectural solution that’s going to be flexible, highly performant, and modern. And that’s why we are kind of at the forefront of these architectures, because you can use any system you would like underneath as a data lake or a modern data lake. But you need to have those characteristics. It has to be highly adaptable. It has to be highly performant. And it has to be easy to run and operate. And that’s what you get with MinIO. And the simplicity in the architecture by all the other things unifying query engines on the top and the catalog and all the many things that Hadoop was providing to achieve the same goal. With Dremio, you can achieve the same things at a higher performance. And then the same thing with the object storage at the bottom with MinIO. High performance, simplicity, and protection of the data, and enterprise-grade features that we talked about.

We talked about the cloud operating model in some of our blogs. And we’ve been talking about that. Most of our popularity comes from the simplicity. The whole cloud operating model we talk about is that cloud is not a location. Cloud is how you do things. If you do it in a more lightweight way, with the same deployments again and again, more Kubernetes-managed, or in a simplified, Docker container-based way, not as monolithic software but as cloud-native software, it solves many of today’s IT infrastructure problems. You can deploy it on the cloud. You can deploy it on-prem. It doesn’t matter where you are running it. The whole public cloud, private cloud, hybrid cloud discussion goes away because you’re just deploying anywhere you need to. And that’s what drives the superior economics and cost savings and all of that. A lot of our customers were born in the cloud. Now they’re coming back on-prem because of the egress costs and other things, or they just want to control their data because of AI and RAG types of approaches. They just want the data to stay in place. And then we already covered the build-for-the-future part. I talked about how we are playing a critical role at the center of the AI ecosystem, whether it’s vector databases or the AI/ML ops phases. You just need high-performance storage to feed into the training phases of the AI model and everything together.