May 2, 2024

Next-Gen DataOps with Iceberg & Git for Data

Tomer Shiran, co-founder of Dremio, unveils a streamlined approach to DataOps that champions simplicity, data quality, and self-service, all essential for powering AI innovations. By integrating Apache Iceberg and Dremio’s open “git-for-data” model, the keynote showcases how Dremio transforms data management with an emphasis on easy-to-use SQL interfaces, enabling engineers, data scientists, and analysts to move past complex data engineering tasks like Spark pipelines.

Shiran highlights the creation of high-quality “data products” within a self-service platform, allowing for seamless collaboration and ensuring the integrity of data fed into dashboards and AI models. The talk underlines the critical role of data quality in AI performance and showcases how Dremio’s approach simplifies data access and manipulation, making advanced data operations accessible to a broader audience through familiar SQL.

This session offers insights into adopting a DataOps framework that not only boosts operational efficiency but also accelerates AI-driven insights, all within a user-friendly environment. Join us to discover how Dremio is pioneering an open future where data management is as straightforward as writing SQL queries, paving the way for more dynamic and effective AI applications.

Topics Covered

AI & Data Science
DataOps and ELT/ETL
Keynotes

Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Tomer Shiran

Can you hear me? Is it working? Can you hear me? Yes? OK, there we go. All right, thank you, Colleen. Thanks, everybody, for being here, and of course, all the folks online. And we have a lot more folks dialing in from all over the world. So yeah, thank you for joining today. This talk will be about next-gen DataOps with Iceberg and Git for Data.

Apache Iceberg

All right, so as a company, when we think about data platforms, as you all know, we’re big believers in open standards and open formats. And so when we started Dremio and we built the early versions of it, what made it possible was the fact that there was a file format called Apache Parquet. It was kind of a standard. Industry had already pretty much agreed that that was the standard. It was pretty high performance. It was columnar and all those kinds of things. And then when we set out to build Dremio, we saw an opportunity to create another open standard, this time for in-memory processing, but more importantly, for exchanging data between systems. And so we created something called Apache Arrow. And that was basically Dremio’s internal memory format. We decided to open source it because there wasn’t any columnar memory formats out there. Every database had its own thing. Oracle had an in-memory columnar format. SAP HANA had one. But there wasn’t anything open source. And so we did that. And Arrow today is downloaded something like 70 million times a month. So everybody uses it. And it’s not, as you can see, every year or every month or even every few years that a new standard emerges that enables you to do new things. And it’s actually very difficult to create these standards because you have to get agreement from the whole industry. And so we did that with Arrow. Fortunately, that was very successful. 

But what happened a few years ago is that a few folks at Netflix set out to create a new table format. So they realized the need to create a higher level abstraction than a file format, did it as an Apache project, and got some consensus in the community around the need for that and for different systems to integrate with it. And so at Dremio, we became basically the first vendor to pick that up. It was still a Netflix project, but they had just contributed it to Apache, the Apache Software Foundation. And we started evangelizing it. And so we started creating lots of events and conferences and lots of blog content and all that kind of stuff. And fast forward to today, Iceberg has indeed become a standard. And what’s important about it is that it allows us to do things that we just couldn’t do before. Data lakes couldn’t do before. And if you look at this table here, it’s kind of a short comparison between what was possible with a data lake pre-Iceberg and pre-table formats and what’s possible today. And so you can see the common thing is, of course, you can query data. You can read data. But you have all these new things that you can do today, such as inserting, updating, deleting individual records, evolving schemas very easily. This gives the opportunity for systems like Dremio to automatically optimize the data behind the scenes instead of companies having to figure out somehow how to optimize their data and make sure there aren’t too many small files or files that are too big and things like that. And then there are things like time travel, going back in time and having versioning. All of these types of things are made possible by Iceberg. And so now the data lake is way more powerful. And then, of course, we have a new name for that. We call it a data lakehouse.
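As a rough illustration of why a table abstraction enables record-level changes, time travel, and rollback on top of immutable files, here is a minimal Python sketch. It is a toy model under stated assumptions, not Iceberg's actual metadata layout or API: the names `Snapshot` and `ToyTable` are hypothetical, and each snapshot is simply the list of data files visible at that point.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple  # names of the immutable data files visible in this snapshot


@dataclass
class ToyTable:
    """Hypothetical toy table: every write produces a new snapshot."""
    snapshots: list = field(default_factory=list)
    current: int = -1  # index of the current snapshot, -1 means empty table

    def commit(self, add=(), remove=()):
        # Start from the files of the current snapshot, never mutate them.
        base = self.snapshots[self.current].files if self.current >= 0 else ()
        files = tuple(f for f in base if f not in remove) + tuple(add)
        snap = Snapshot(len(self.snapshots), files)
        self.snapshots.append(snap)
        self.current = snap.snapshot_id
        return snap.snapshot_id

    def scan(self, snapshot_id=None):
        # Time travel: read any earlier snapshot by id.
        sid = self.current if snapshot_id is None else snapshot_id
        return list(self.snapshots[sid].files)

    def rollback(self, snapshot_id):
        # Rolling back only moves the "current" pointer.
        self.current = snapshot_id


table = ToyTable()
s0 = table.commit(add=["a.parquet", "b.parquet"])
# A delete or update rewrites the affected file and swaps it in a new snapshot.
s1 = table.commit(add=["c.parquet"], remove=["a.parquet"])
latest_files = table.scan()
earlier_files = table.scan(snapshot_id=s0)
table.rollback(s0)
after_rollback = table.scan()
```

Because old snapshots are kept rather than overwritten, reading the table "as of" an earlier point and rolling back are both pointer operations in this model, with no data copied.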

What’s also really important about all these enhancements that are now available in the data lake is that you now no longer need a data warehouse. The reason we used to copy data from data lakes into data warehouses and all the complexity that comes with that is because we couldn’t do all these things that were needed. Data warehouses could do it, but data lakes couldn’t because the common format was a file, Parquet files. So now with an abstraction of a table, you can actually do all these things and you no longer need the warehouse. And the important thing about that is that this takes away a lot of cost and complexity. If you have to manage the copying of data, you have to build that pipeline. You have to maintain that pipeline. You need to pay for all these different systems and manage them. Security becomes a big problem because permissions don’t travel with data. So as soon as you move data from the lake into the warehouse, now you have two places to manage permissions and opportunities for those to get out of sync. And so all sorts of problems. So having one platform that can do all these things, that’s a big advantage. 

Iceberg in the Last Year

So the Iceberg community, not just Dremio, but all the companies and individuals contributing to the project, have made a ton of progress in the last year. It’s actually been a really exciting year for the project. And this talk is way too short to cover all the different things that have happened in the project. But I wanted to highlight a few things. And so in the last year, we’ve added support for a variety of different languages with Iceberg. So you can actually just open the Python library and start interacting with Iceberg directly, both read and write. We have Rust support now. We have Go support now in the project. And so lots of new languages supported. And also many different capabilities. And so Iceberg, for example, in the last year now has support for views. And so not just tables, but you can actually have views. And the views live in the same catalog side by side with the tables. The REST catalog API has been embraced, as you heard this morning, by us and others as well. And so that’s kind of the next generation standard interface for the project. We’ve added partition statistics to the project and lots of performance capabilities. So really an exciting year for this project. 

The other, I think, even more exciting thing is that in the last year, I think it’s pretty clear that Iceberg has become the industry standard. So a year ago, or certainly two years ago, it wasn’t clear. A few different projects were kind of competing for this. But what we’ve seen over the last year really is kind of Iceberg cementing itself as the industry standard table format. Just in the last couple of months, Confluent announced that they’ve built their next generation kind of engine and storage on Iceberg as their table format. Salesforce announced that its data cloud is built on Iceberg. You all know that Snowflake and Amazon and Google have also selected Iceberg as their table format. And of course, Dremio, as you heard as well. And so really, the vast majority of the industry has now standardized on Iceberg. And that’s great because a lot of the value that the data lake brings to the table is this interoperability, the openness and the flexibility to choose. You know that even if you use, say, Dremio’s query engine today, you know that in the future you could use something else. If something else gets created that’s better, you just point at the same data and you get to use it. And so that flexibility to choose the right technology for the job and to not be locked in is one of the benefits of an open format. And I think that’s true not just in our world of data analytics, but everywhere. If you look at the last couple of weeks, we saw Meta introduce the Llama 3 model. Very powerful, kind of similar performance to GPT-4. And they also announced that they’re spending $10 billion on this thing a year. And the stock went down because of that. But their willingness to spend $10 billion a year on GPUs and all the work to create this model means that now even maybe companies that have bet on some other technology for building LLMs, for building GenAI models, well, now there’s a new model in town.
And so that flexibility to be able to move when new technology is created is really important. It’s important, I think, not just in our world, but basically everywhere, especially as things are moving at these speeds. 

If you compare Iceberg to other projects such as Delta Lake, it has a much broader and more diverse developer community. This means that you get a lot more innovation, much faster innovation, but also longevity for the project. I hear from a lot of companies that we work with how tired they are of having to move between platforms, all these platform transitions. The moving off of Hadoop to something else, or from Teradata to something else. These are really painful, multi-year projects. And pretty much nobody wants to be doing them. And so it matters to have a way to store data in a format that’s both very strong technically and that you know will be around for a very long time, because of all these companies behind it, both technology vendors and companies like Netflix, and Apple, and Airbnb, and so forth.

As you’ve all seen, and those especially in the room here have gotten physical copies signed by Alex. If you haven’t, maybe there’s still an opportunity to do that. But if not, scan the QR code if you haven’t already done that, and you can download the book for free. The O’Reilly book on Apache Iceberg. 

Dremio’s SQL Query Engine For Iceberg

All right, so we talked about the flexibility. And when you think about Iceberg, one of the nice things about it is that you get to choose from a variety of different engines to work on this data. And you also have the choice of different catalogs to use. This is something you don’t get with other table formats. So I want to talk a little bit about what we’ve done specifically at Dremio in terms of our query engine for Iceberg, and also our catalog for Iceberg. All right. So Sender talked this morning about performance. Being a query engine for Iceberg, one of the important things, of course, is you want it to be fast, right? Because at the end of the day, lots of engines can scale. But the curve, the slope of that price performance curve, that’s how much you’re paying for your cloud infrastructure or your on-prem infrastructure, right? And so the fact that Dremio is more than 100% faster than Trino, more than 70% faster than Snowflake, that basically means that you’re saving a lot of money by using a faster engine. Sometimes that raw performance isn’t enough, right? And we all have been in this world where we’re using a data warehouse or a data query engine, and it’s not fast enough for various workloads, for example, dashboards, right? And then what ends up happening is you end up creating Tableau extracts or Power BI imports, or you end up creating pre-computed tables within the warehouse, such as aggregations and things like that, for every workload. And now you have to manage that complex ETL to keep those pre-computed tables in sync, to keep them up to date. And it just becomes a big mess, right? You’re managing tens of thousands of these kind of extracts or copies of data. And so we created Reflections, and we’ve enhanced them a lot in the last year if you haven’t used Dremio. Lots of new capabilities around Reflections. Pretty soon, live Reflections that automatically stay in sync incrementally. 
But what Reflections give you is that ability to get sub-second performance for these workloads without having to manually create these things, without having to point the applications at any kind of materialization of data. You simply work with the tables and the views, and the Reflections kind of work behind the scenes automatically to give you very fast performance. 

But speed isn’t the only thing. It’s got to be easy, right? And so our kind of mindset, our focus around Iceberg is to make it so that you don’t even need to understand that you’re using Iceberg, right? You want to get that interoperability and that no vendor lock-in. But you want it to be as easy as using Postgres, or MySQL, or any other database. And so what we’ve done here is we’ve made it so that you can simply use SQL commands to interact with data. To create tables, you have a CREATE TABLE statement. To change the schema of a table, you have an ALTER TABLE statement. Standard ANSI SQL, right? You want to mutate records? Just INSERT INTO, or UPDATE, or DELETE, right? So standard SQL. Underneath the hood, it’s Iceberg. But your users think about it just like any other database table, right?
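To make the list of statements concrete, here is a small runnable sketch of the same sequence: CREATE TABLE, ALTER TABLE, INSERT, UPDATE, DELETE. It runs against an in-memory SQLite database purely as a stand-in so the example is self-contained; it does not use Dremio or Iceberg, and the `inventory` table and its columns are invented for illustration. Exact DDL options in any given engine may differ.

```python
import sqlite3

# In-memory SQLite is only a stand-in engine for demonstrating plain SQL.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table.
cur.execute(
    "CREATE TABLE inventory (product_id INTEGER, location TEXT, quantity INTEGER)"
)

# Evolve the schema by adding a column.
cur.execute("ALTER TABLE inventory ADD COLUMN updated_at TEXT")

# Insert, update, and delete individual records.
cur.executemany(
    "INSERT INTO inventory (product_id, location, quantity) VALUES (?, ?, ?)",
    [(1, "NYC", 10), (2, "SF", 5), (3, "NYC", 0)],
)
cur.execute("UPDATE inventory SET quantity = quantity + 5 WHERE product_id = 2")
cur.execute("DELETE FROM inventory WHERE quantity = 0")

rows = cur.execute(
    "SELECT product_id, location, quantity FROM inventory ORDER BY product_id"
).fetchall()
columns = [c[1] for c in cur.execute("PRAGMA table_info(inventory)")]
```

The point of the talk is that the user-facing surface looks like this regardless of the storage format underneath.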

Equally important is how does the data get into the Iceberg format, right? Maybe we’ve spent years building this whole pipeline that goes into, like, Parquet tables, right? Or Glue, or Hive Metastore tables. So how do we go into Iceberg? Over time, all these pipelines, I’m sure, will get updated. But it’s got to be really easy to get it in. And so we’ve started, for example, by providing SQL commands that make it very easy to get data into Iceberg. So there’s a COPY INTO command that takes a bunch of files, say in an S3 bucket, and loads them into an Iceberg table. And those can be in many different formats. You can have bad records that are automatically handled. And so there’s all sorts of capabilities in this command to handle a variety of different scenarios. You can use a CTAS (CREATE TABLE AS SELECT) statement to take data from some other database, relational or NoSQL, and get that data into an Iceberg table. And of course, an INSERT or UPDATE statement as well. Similarly, going from one table format or metastore into another as well.

Auto-Ingest Pipelines

And so today, we’re happy to announce the availability of auto-ingest pipelines. And what this allows you to do is not just run a SQL command to do a one-time ingestion of data, but this gives you a continuous stream of ingestion. So if you have an S3 bucket or an Azure storage folder, and data is getting dropped into there, files are periodically getting dropped into that location, we’ll now automatically ingest that on a continuous basis into an Iceberg table. And this happens with basically zero effort. And so you create a pipe. As you can see here, well, actually, you create the table that you want, the Iceberg table, if you don’t already have it. And then you create a new object called a pipe. So you run a CREATE PIPE statement. And then the data is automatically ingested into the Iceberg table. Dremio will sign up or register to get notifications from AWS and then Azure as well and basically automatically understand when new data is arriving and load that into the table. So you don’t have to do anything, and data is immediately ingested. There’s not this long wait period or once-an-hour polling kind of thing. It happens based on a notification from the object storage. And so auto-ingest pipelines make it really easy when you have that continuous stream of kind of files arriving, and you want to get that into an Iceberg table.
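The difference between notification-driven ingestion and periodic polling can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not Dremio's pipe implementation or any cloud provider's notification API: `ToyBucket` and `ToyPipe` are hypothetical names, the "table" is a plain list, files are CSV strings, and skipping already-loaded file names is an assumption made here to keep duplicate notifications harmless.

```python
import csv
import io


class ToyBucket:
    """Hypothetical object store that notifies listeners on every new object."""

    def __init__(self):
        self.objects = {}
        self.listeners = []

    def subscribe(self, callback):
        self.listeners.append(callback)

    def put(self, key, body):
        self.objects[key] = body
        for callback in self.listeners:
            callback(key)  # push notification, no polling loop anywhere


class ToyPipe:
    """Hypothetical pipe: loads each new file under a prefix into a table."""

    def __init__(self, bucket, prefix, table):
        self.bucket, self.prefix, self.table = bucket, prefix, table
        self.loaded = set()
        bucket.subscribe(self.on_event)

    def on_event(self, key):
        # Ignore files outside the watched location and repeated notifications.
        if not key.startswith(self.prefix) or key in self.loaded:
            return
        reader = csv.DictReader(io.StringIO(self.bucket.objects[key]))
        self.table.extend(reader)
        self.loaded.add(key)


orders = []  # stands in for the target table
bucket = ToyBucket()
pipe = ToyPipe(bucket, "incoming/", orders)

bucket.put("incoming/day1.csv", "id,amount\n1,10\n2,20\n")
bucket.put("other/ignored.csv", "id,amount\n9,99\n")
bucket.put("incoming/day1.csv", "id,amount\n1,10\n2,20\n")  # duplicate event
```

Rows appear in the table as a side effect of the file landing, which is the "zero effort" behaviour described above; nothing in the sketch runs on a schedule.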

Apache Kafka to Iceberg

Sometimes what you have is a scenario where you’re not going from files, but you’re going from, say, a Kafka cluster or a Kafka topic. Data is being pushed into Kafka. And so what we’re also announcing today is the availability of ingestion from Kafka into Iceberg. So now if you have data producers that are pushing data into a Kafka topic, you can, again, with very little effort– this is based on Kafka Connect– you can have that data flow directly into an Iceberg table, automatically update records as they’re coming in, and you have an Iceberg table in your system ready to be queried with Dremio’s query engine, with Spark, with Snowflake, with anything else. So two new ways to ingest kind of streams of data, continuously arriving data. And so if you look at where we are today as a whole, both things that we’ve had and we’ve added in the last year, but also the things that we just released and announced today, we have the broadest range of capabilities in terms of ingesting data into Iceberg. So you can see here, we have SQL commands to get data in from object storage, copy into, kind of the most important one there. We have auto-ingest pipes, where you can create pipe and ingest the data on a continuous basis from files. There’s ingestion through Kafka. That’s for streams. You can get data in from a database. And you can also work with many of our partners to ingest data from various SaaS applications and various databases. Companies like Fivetran, Airbyte, Upsolver, and others, which make it also easy to get data from hundreds of different locations into an Iceberg table. So we work with these companies to provide that kind of seamless experience with Iceberg specifically. 
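The idea of records in a topic automatically updating rows in a table can be illustrated with a small keyed-upsert sketch. This is only a conceptual model, not Kafka Connect, not a Kafka client, and not the actual Iceberg sink: the function name `apply_records` is hypothetical, the table is a dictionary keyed by record key, and treating a `None` value as a delete follows Kafka's tombstone convention as an assumption about how deletes might be represented.

```python
def apply_records(table, records):
    """Apply a stream of (key, value) records to a keyed table, in order."""
    for key, value in records:
        if value is None:
            # Tombstone: remove the row for this key if it exists.
            table.pop(key, None)
        else:
            # Upsert: insert a new row or replace the existing one.
            table[key] = value
    return table


stream = [
    ("sku-1", {"qty": 5}),
    ("sku-2", {"qty": 3}),
    ("sku-1", {"qty": 7}),   # later record for the same key wins
    ("sku-2", None),         # tombstone deletes sku-2
]
inventory = apply_records({}, stream)
```

Applying records in arrival order is what keeps the table consistent with the latest state of each key in the topic.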

Flexibility to Choose Catalogs

All right. So that’s about the engine. So we’ve talked about how we’ve made it really easy, really fast, and took care of the ingestion. So it should now be really easy to get started with Iceberg and to really build your kind of Iceberg-based lakehouse. But as I mentioned, with Iceberg, you also have the choice of a catalog. And the reason we started working on this and created a catalog in the first place was that all the options out there were kind of outdated. You had Hive Metastore, which had been created back in the Hadoop days. It was over a decade old. So it comes with lots of baggage, lots of complexity, things like hosted versions of that, like AWS Glue. And so really, there weren’t good options. And we saw the opportunity to create something that was a lot more modern, a lot more kind of built for the cloud, much more scalable and lightweight. And so we started by creating this open source project called Project Nessie, which is a modern open source Iceberg catalog. And the idea was to create a native Apache Iceberg catalog. This is actually integrated into the Apache Iceberg project itself. It scales to millions of tables, thousands of transactions per second, super easy to deploy. There’s Docker images, Helm charts, all that kind of stuff. We built this with an Apache license. So it’s a permissive open source license. You can use it for anything. This morning, we also announced that it’s part of the Dremio Enterprise edition as well. So if you’re using Dremio in kind of its software form, you now are able to use Nessie as well, with full support. And we also talked about the fact that we will soon support the REST catalog API as well, which is kind of the next generation interface in the Iceberg project.

But what really stands out with Nessie is its support for Git-style data management. So the idea here was to really rethink how data is managed. If you think about developers and how they work with source code, that’s changed a lot in the last decade or two. We have things like Git and GitHub and CI/CD, really sophisticated. And with data, it’s like, here’s a bunch of tables. You can run SQL commands on them. And so we wanted to reimagine how data is managed. And we saw, basically, the concepts of Git as being very applicable to the world of data as well, things like commits and tags and branches. And so that’s what we built here with Nessie. We didn’t use Git because Git is very slow. It’s designed for developers and so a few commits per second. So it’s a different world here in terms of performance. But the same concepts are actually very applicable to the world of data. 

Git for Data

So what does it mean to do “Git for data,” in quotes? What are the benefits of that? So first of all, branching, being able to create isolated sandboxes, no copying of data involved. And that has lots of use cases. For example, being able to create a branch for ingestion so that you can test the data before you actually put it in the main branch where everybody can see it. Being able to create a branch for experimentation, so some data scientist wants to work on the data. Rather than creating a copy of tables for them, you just create a branch. There’s no cost to it. And they can work on that branch and throw it away when they’re done. Version control is another benefit of “Git for data.” So the ability to reproduce models, and dashboards, and things like that. The ability to take a point in time in the past and be able to query the data as it was at that point in time. That’s what you mean by version control. Also being able to roll back to a previous point in time. So I’m sure all of you have been in a situation where somebody accidentally deletes data, or messes up the schema, or maybe deletes a bunch of the data, or corrupts it, or you find out that bad data was brought in and it’s been propagated to 100 other tables. So you no longer have to worry about that, because you can roll back and just go back in time. The same way with source code, with GitHub or Git. You can just go back in time. And then finally, having a log of everything that’s going on in the system. Again, similar to Git. We’ve all seen that log in GitHub where you see what every developer has done, what they changed, and so forth. And so having that same kind of concept for data, where you have a log and you can see every single change made to every single table, who made that change, when did they make it, how did they make it, was it a specific query in Dremio, was it this Spark job ID, that kind of thing. So lots of benefits to Git for data. 
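The four benefits listed here (branching without copies, version control, rollback, and a change log) all fall out of one data structure: named references pointing at immutable commits. The Python sketch below is a toy model of that idea, not Nessie's API or storage design; `Commit` and `ToyCatalog` are hypothetical names, and branches and tags are both plain references here, whereas a real catalog would keep tags from moving.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    commit_id: int
    parent: Optional[int]
    tables: tuple  # sorted (table_name, snapshot_name) pairs
    author: str
    message: str


class ToyCatalog:
    """Hypothetical catalog: refs (branches/tags) point at immutable commits."""

    def __init__(self):
        self.commits = [Commit(0, None, (), "system", "init")]
        self.refs = {"main": 0}

    def commit(self, branch, table, snapshot, author, message):
        head = self.commits[self.refs[branch]]
        tables = dict(head.tables)
        tables[table] = snapshot  # only the pointer to table data changes
        new = Commit(len(self.commits), head.commit_id,
                     tuple(sorted(tables.items())), author, message)
        self.commits.append(new)
        self.refs[branch] = new.commit_id
        return new.commit_id

    def create_ref(self, name, source):
        # Creating a branch or tag copies one pointer, never any table data.
        self.refs[name] = self.refs[source]

    def read(self, ref, table):
        return dict(self.commits[self.refs[ref]].tables).get(table)

    def rollback(self, branch, commit_id):
        self.refs[branch] = commit_id

    def log(self, ref):
        out, cid = [], self.refs[ref]
        while cid is not None:
            c = self.commits[cid]
            out.append((c.commit_id, c.author, c.message))
            cid = c.parent
        return out


cat = ToyCatalog()
good = cat.commit("main", "sales", "snap-1", "etl", "load day 1")
cat.create_ref("eod-day1", "main")        # tag a point in time
cat.create_ref("experiment", "main")      # isolated sandbox
cat.commit("experiment", "sales", "snap-2", "data-scientist", "try new cleaning")
cat.commit("main", "sales", "snap-bad", "etl", "bad load")
cat.rollback("main", good)                # undo the bad load
```

In this model the experiment branch never affects what readers of `main` see, and undoing a bad load is a single pointer move rather than a backfill.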

Let me give you another example. This is from the world of financial services. So in this scenario, we have the main branch, and that’s where everybody is working on the data. All the dashboards are being served off of the main branch. The machine learning models, the AI models are all being built off of the main branch. And maybe once a day, somebody creates a tag so that it’s very easy to go back and query the data based on that end of business day kind of timestamp or something like that. Maybe there are some folks that are doing cross-date analysis. They want to query the data and compare what it was like three days ago with what it’s like today. It’s very easy within the SQL query. You just say @tag and provide the tag name, and then you’re querying that table at that specific point in time. And now, for every time they want to ingest the data, what this company is doing is they’re ingesting the data in a separate branch. And so they create a branch, and all the data is ingested into that branch, transformed within that branch, and then actually tested within that branch. And this is the important part. So you’re never polluting the main branch with data that’s either inconsistent, incorrect, incomplete. Only once all the calculations and adjustments are done, the data has been tested by the controller, then it’s merged back into the main branch. And so now you have a new point in time in the main branch. Maybe you then run the create tag statement, create a tag. So you have that consistent tagging of all these versions. And then again, next time you want to ingest data– and here it’s kind of based on day, but it could be hourly or whatever the schedule is. Some might be scheduled, some might be ad hoc. But basically, every time you’re ingesting new data, you do it in a separate branch. And only when you’re sure it’s correct and it’s been tested, you integrate it into the main branch. 
As you can see, the main branch remains very clean, always has the correct data. 
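The ingest, test, then merge workflow described in this example can be sketched as a single function. This is a minimal illustration under stated assumptions, not Dremio or Nessie syntax: the catalog is a plain dictionary of branch name to tables, the function name `ingest_with_branch` and the two checks are hypothetical, and the "merge" simply publishes the branch over `main`, which assumes nobody else changed `main` in the meantime.

```python
def ingest_with_branch(catalog, table, new_rows, checks, branch="etl"):
    """Ingest into a temporary branch, validate there, publish only if clean."""
    # Branching copies the table-name -> rows mapping, not the rows themselves.
    catalog[branch] = dict(catalog["main"])
    catalog[branch][table] = catalog[branch].get(table, ()) + tuple(new_rows)

    # Run every data-quality check against the branch, never against main.
    ok = all(check(catalog[branch][table]) for check in checks)
    if ok:
        catalog["main"] = catalog.pop(branch)  # publish the validated state
    else:
        del catalog[branch]                    # discard, main is untouched
    return ok


def no_null_amounts(rows):
    return all(r["amount"] is not None for r in rows)


def unique_ids(rows):
    return len({r["id"] for r in rows}) == len(rows)


catalog = {"main": {"trades": ({"id": 1, "amount": 100.0},)}}
checks = [no_null_amounts, unique_ids]

bad = ingest_with_branch(catalog, "trades", [{"id": 2, "amount": None}], checks)
rows_after_bad = len(catalog["main"]["trades"])

good = ingest_with_branch(catalog, "trades", [{"id": 2, "amount": 250.0}], checks)
rows_after_good = len(catalog["main"]["trades"])
```

The failed load never becomes visible on `main`, which is the property the speaker is emphasizing: dashboards and models reading `main` only ever see data that passed the checks.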

Another use case for this is data exploration or experimentation, what-if analysis. And so rather than, say, a data scientist, or in this case, a controller, wants to do some analysis and ask questions on the data, they can do that, again, in a separate branch. So a new branch is created, and they can start doing work on that separate branch and then drop it when they’re done. So again, you can see this step here does not involve any copying of data. And there’s no pollution of the main branch. It remains clean. You don’t have to worry about somebody accidentally doing something to the tables that are being used by other people. Kind of remains very clean, very elegant. And the benefits of this are no copying of data, a lot better productivity. If you think about how fast you can move when you have this kind of isolation between the work that different people are doing and the temporary things that are happening in the main branch, it’s a lot more safety that you have, less errors. So all sorts of benefits to this kind of approach to data management, which is not surprising, because we know that in the world of code, software development, this is how people work. And this allows them to collaborate much faster. 

So what’s also exciting about this– I talked about Nessie as kind of an open source project and how that’s used and the benefits of Git for Data. This is now also available as a free service on Dremio Cloud. And so if you sign up today on Dremio’s website, you just go to dremio.com and sign up for Dremio Cloud, you can actually use this service for free. It works on both AWS and Azure. It’s a service. There’s no software to manage. You can connect to whatever user you’re using. Users and permissions can be defined. You have a GitHub-style kind of user experience. You can see a few screenshots here. And there’s also additional capabilities like automatically optimizing Iceberg tables and vacuuming things that are no longer used and so forth. All right, so we talked about Iceberg, how it’s become the industry standard. We’ve talked about what we’ve done at Dremio with the query engine for Iceberg and what we’ve done with a catalog for Iceberg. But to bring this all together and also demo how this applies, how this enables you to do DataOps in a much more modern way, I want to bring to the stage Alex Merced, who you may have met already today signing books. And Alex is going to walk you through that. Alex is a developer evangelist at Dremio. Thank you, Alex.

Alex Merced

DataOps For Iceberg

Hey, everybody. Nice to see you all again. Hey. But what I’m going to talk about today is basically how do we bring this all together, this whole concept of DataOps, Git for data when it comes to Iceberg. And seeing it come together, but also the question, why? What is the transformation to your practices, to your business that occurs when you embrace all these practices? So right now, you probably live in a world that looks a lot like this, where you have all your data sources and then large layers and chains of data pipelines that are brittle, that break, that require a lot of maintenance, and that are used to eventually populate data products and data marts within your systems. And this leads to the result that you end up spending 80% of your time doing things you don’t want to do, which is oftentimes manually merging and reconciling data from multiple sources, repetitive manual processes, going through that data and cleaning it up. This is all stuff that’s just maintaining what you already have versus expanding your footprint, expanding your data footprint, bringing new types of value to your data consumers. Wouldn’t it be nice if we could do less of this 80% and more of that 20% that brings new value to your business? Well, that’s possible. How about we just stop it? We stop optimizing the data for each workload. We can use live reflections. So that way, basically, as you update your data, your reflections update too, and everything just kind of works and is nice and fast. Wouldn’t it be nice if we could just stop having data quality issues in production? Why? Because we created a branch, ingested data in the branch, validated it, so those problems just don’t show up in production. Wouldn’t it be nice to not have so many data copies? Because every time I need to do that type of ingestion, instead of creating a demo or ingestion environment and creating all these copies for all these different environments, I can just create a branch.
And now I have an isolated environment that didn’t require a copy of my data, that I can work in in an isolated way, and that could also be used for creating development or experimental environments. And now, when it comes to recovery, all I need to do is roll back my catalog, and I can be back to a good state without all the work of having to backfill and do all this stuff. I can just roll back and try the job again.

And that brings us to a world that looks a lot more like this, where we have our data sources. We can just easily ingest them into Apache Iceberg using Dremio, and then we can use integrations with Dremio, like dbt, as a way to create a very repeatable, easy-to-manage semantic layer across hundreds, if not thousands, of views as you curate across many, many data sets, and then easily keep those data sets accelerated using a feature like Reflections, which, if you joined me for some of my talks this afternoon, you’ve certainly heard me talk about today. But yeah, it all becomes a much easier world. And it’s not just easier, but it also becomes faster as far as how you deliver your data to your end users. It also becomes cheaper, because you’re not duplicating your data as much. You’re also optimizing your other costs, such as storage, compute, data access, minimizing egress costs. So this sounds like a really good world. It sounds almost like too good to be true, but it isn’t. It’s real. It exists. And you can actually physically see it. We’re going to do this demonstration, because in the same way that MC Hammer says it’s hammer time, it’s demo time. Oh, not yet. Not yet. Stop, stop, stop. Ah, OK.

The Traditional Solution

Now, next slide. OK. Spoilers. OK, but what we said was going to happen here is that we are Retail Company X. We have a variety of products in our inventory at different locations across several different categories. Now, what we want to do is we want to create a dashboard to see our inventory across different locations and different product categories. And the problem is we have our data spread across different data sources. We have some data in Postgres. We have some data in MongoDB. You’ll see this all throughout there. And again, everything you’re going to see here today, you can do and replicate yourself. We have just published a blog on Dremio.com that will walk you step by step through what you’re about to see today. Also, at the same time as this talk started, two videos were published to YouTube, on Dremio’s YouTube channel, that’ll walk through the same steps again. So basically, you can do this to your heart’s content. And it’s actually pretty cool. It’s a pretty cool exercise that you can do strictly right there on your laptop. Try it out. So no need to go spin up cloud infrastructure. You can try it out, see it firsthand, and do it step by step, everything including the dbt part, everything.

But first, before we do the demo, I want to walk through how this could look like in today’s world or in the traditional world. So in traditional world, I would have my product table in my MongoDB. I’d have my inventory table in my Postgres in this scenario. And what I would do is I would ETL that data into my data lake. OK, that sounds fine. I’d probably ETL it as raw Parquet data. And in my data lake, I would probably land that data into what we might call a bronze layer. Any kind of three layer cake, we’ve all seen different names for these three layers. But we’ll have our raw bronze layer, which will be a physical copy in Parquet. Then I’ll make it curated a little bit more, curate that data into a silver layer, and then curate that data again into a gold layer, which– OK, now we’ve got our data nice and ready. But we’re still not done yet. Because oftentimes, we’re not serving our BI dashboards directly from our data lake. We’re serving them from our data warehouse. So then I have to ETL that data again into the data warehouse. And then from there, I’m going to have to curate a bunch of data marts for my different business units, marketing, accounting, supply chain, et cetera. So we have all these data marts. We’re making more copies of the data. So again, copy, copy, copy, copy, pipeline, pipeline, pipeline, pipeline. That means compute cost, storage cost, egress cost, the whole deal. And then finally, I can allow the analysts to begin working with the data. We can finally hand off the data to the analyst, who then brings it into their favorite BI tool. But again, it’s still not probably fast enough. So now, either they are making extracts in their BI tool, or we’re making materialized views and cubes over there at the data warehouse. And all of these things need to be maintained, and tracked, and synced, and all this fun stuff. This sounds complicated, and expensive, and a lot of work. Could we do better? Yes. OK.

Dremio Solution

So how could it look if we do it better? OK, once again, we start with our two tables. That’s all fine and good– products, inventory. And what I would do is I would still ETL it to the data– well, not the data lake house. In this case, we’re using Dremio, our unified lake house platform, to do it. And we would only make one copy. We just land our raw data into the data lake in a raw layer. So there’s our copy. And then our other layers– in this case, we’ll call it a curated layer and a production layer– well, that’ll all be views. And we could actually hand off to the analysts as early as curated if we wanted to. At this point, we can be like, you know what? Analysts, you have access to the raw data. If you want to do, model the data as you need to on top of that, if you want. Or we can model it ourselves, and then provide that final production layer directly to the analysts, however we’d like to do it. But you have that flexibility, because you’re not duplicating the data. And then once we have our layers of virtual views, we can just build our BI dashboards directly off of that. And if for some reason we need some additional acceleration, like you saw in Isha’s demo in the opening keynote, we can use a feature like Reflections. And again, when I initially ingested the data, I could have used Git for Data to isolate and validate that data. And when it comes to creating all my views, I can use DBT to help easily do that sprawling, or that generation, or that scaffolding of all those views in a way that’s easily replicatable, using a lot of DBT’s cool features, which you’ll see in action. And yeah, now it is demo time. 

Live Demo

So here we have a demo. The bottom line is that we’re going to do the exact exercise we just mentioned. So here I have Dremio. And in Dremio, you would connect your data sources, which makes it really easy to work with all your data. And then I already have my MongoDB and Postgres data sources connected. So now all I have to do is create a data product to do the ingestion in. So oftentimes, a data product in Dremio can be as simple as just a subfolder in your default project catalog, which is exactly what I’m doing. I’m creating a folder called Supply Chain for a supply chain data product. And now that I’ve created that folder, I want to create folders for my three layers. So for that raw, curated, and production layer. So I’m going to create these three folders. So there’s curated. Here’s production coming in. And then now I have my three folders. So now I can begin populating these folders with data. And again, anything you see me do in the Dremio UI is something that can be done with an SQL command. And anything that can be done with an SQL command in Dremio can be automated through any of the interfaces that Dremio has, whether it’s JDBC, ODBC, Apache Arrow Flight, REST API. So bottom line is, again, anything you see here, while the UI is super cool, and super useful, and super intuitive, it can all be automated. But here, all I’m doing is I’m ingesting the table. And the way I’m doing that is just through a simple Create Table As statement. So here I am, and just saying, hey, let’s create this products table in my raw folder. And let’s just ingest it from that MongoDB products table. So Create Table As. And I’m going to do the same thing for the Postgres table. Just a quick CTAS statement. And we populate that. And again, I can use the Dremio UI to just quickly drag and drop namespaces. So that way, I don’t have to type everything out. Notice I have it in dark mode. So I have a dark mode, light mode. OK, all these little nice quality of life things. 
And notice, I’m running multiple queries at the same time. And as I run them, each of them get their own tab. And I can see that they’re both confirmed. So that way, I don’t have to run one query, hit Enter, enter another query. And there they are. There are my two physical data sets in my raw folder. And I’m good to go. 
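
Editor's sketch: the Create Table As (CTAS) ingestion described above can be tried out with a minimal stand-in. This example uses Python's built-in sqlite3 module in place of Dremio, so it only illustrates the pattern of landing one physical copy per source in a raw layer. All table and column names here are hypothetical; the talk does not show the actual statements.

```python
import sqlite3

# sqlite3 stands in for Dremio here; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-ins for the connected MongoDB and Postgres sources.
    CREATE TABLE mongo_products (product_id INTEGER, name TEXT, category TEXT);
    CREATE TABLE postgres_inventory (product_id INTEGER, location TEXT, quantity_available INTEGER);
    INSERT INTO mongo_products VALUES (1, 'widget', 'tools'), (2, 'gadget', 'toys');
    INSERT INTO postgres_inventory VALUES (1, 'NYC', 5), (2, 'LA', 7), (1, 'LA', 3);

    -- One CTAS per source lands a physical copy in the raw layer.
    CREATE TABLE raw_products AS SELECT * FROM mongo_products;
    CREATE TABLE raw_inventory AS SELECT * FROM postgres_inventory;
""")

raw_product_count = conn.execute("SELECT COUNT(*) FROM raw_products").fetchone()[0]
raw_inventory_count = conn.execute("SELECT COUNT(*) FROM raw_inventory").fetchone()[0]
print(raw_product_count, raw_inventory_count)
```

In Dremio the target would presumably be a fully qualified name inside the raw folder of the data product, and the sources would be the connected MongoDB and Postgres tables rather than local stand-ins.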

Now what I want to do is I want to create my three layers of views. So what I’m going to do is I’m going to use DBT. Because with DBT, I can define all those layers. So here, I actually configured my DBT project to have a section for each layer. So any SQL in my raw folder is going to go into that raw layer. Anything in that curated folder is going to go into my curated layer. And anything that’s in my production folder is going to go into my production layer. Now that I have it all configured, all I have to do is just give DBT the SQL that defines the views I want to define. And use DBT’s nice Jinja-like syntax to define the relationships between those views. And it’ll always run all those SQL statements in the right order to create the semantic layer that I want. And I’m going to create two SQL statements. So this one’s just a raw join. Just joining the two tables. That’s with all the columns. Just so that way, I have the join ready to go. But that’s not what I want for production. So in my production layer, I’m going to have another view derived from that view that just has the columns that I want. And now that I have that, and you see here, I can reference the other model, and that makes sure that they run in the right order. And now I just run the DBT run command down here, which I’ll go run in my terminal. That is going to run those models. DBT will send all that SQL over to Dremio, create those views, and it’s done. Now I’m going to go back to Dremio in a second. And you’ll see that all those things that I defined here are now visible in my Dremio semantic layer. So see, there’s those views. I can query them. And the cool thing is, let’s say I’m going here. I’m going to query both views in a second. But I’m going to notice that there’s one column that I want to rename. So you’ll see here there’s a quantity available. I don’t want that to be quantity available. I just want it to be quantity. And there’s another column that I forgot to add altogether. 
Well, I’m going to go back and fix it. I just go back, edit the model, rename that column, run the DBT models again, and it’s up to date. So when I’m getting those tickets to make updates to a model, or to make updates to a table, or to create a new view, it becomes really easy to execute that using this integration. And there we go. Those changes are there, and we’re good to go. So now we have all the data nice and sprawled out across our three layers. 
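
Editor's sketch: the reason referencing one model from another guarantees the right run order can be illustrated in a few lines of Python. This is not dbt itself. It only mimics the idea that each `{{ ref('...') }}` declares a dependency and that models are then run in dependency order. The model names and SQL are hypothetical, since the talk does not show the actual files.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical models: a curated join and a production view derived from it.
MODELS = {
    "inventory_report": (
        "SELECT name, category, location, quantity_available AS quantity "
        "FROM {{ ref('inventory_joined') }}"
    ),
    "inventory_joined": (
        "SELECT * FROM raw_products p "
        "JOIN raw_inventory i ON p.product_id = i.product_id"
    ),
}

# Matches {{ ref('model_name') }} and captures the referenced model name.
REF_PATTERN = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")

def run_order(models):
    """Return model names so that each model comes after the models it references."""
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())

order = run_order(MODELS)
print(order)
```

Renaming a column, as in the demo, would then be a one-line change to the production model followed by another run.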

Ingesting New Data with Git For Data

But new data is coming in. Again, it’s not like the data stays the same forever. So we’re going to assume that new data has been coming in, and I want to ingest that data. So I have my main branch, and that’s all fine and good. But if I ingest the data there, then the data becomes immediately available to all my production queries. And what if something’s wrong? Then that’s going to result in bad queries. I don’t want to do that. So what I’m going to do is I’m going to make a branch, and you’ll see how easy this is. I just make a branch. I can ingest the data, and then I can validate that data. I’ll just ingest and validate. And once I know I ran my dedupes, I’ve done my referential integrity checks, I’ve done my null checks, and everything looks fine and good, I can then merge that data back into the main branch, and that’s across not just one table, but all the tables I’ve made changes to. So I’m publishing multi-table transactions simultaneously that get that multi-table consistency that we need to make sure that people’s queries are consistent. 
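
Editor's sketch: the three kinds of validation mentioned above (duplicate checks, referential integrity checks, null checks) might look something like the following. The example runs against sqlite3 as a stand-in for a Dremio branch, and the table names, column names, and sample rows are all hypothetical. Each check returns the number of offending rows, so zero means the check passes.

```python
import sqlite3

def validate(conn):
    """Return a dict of check name -> number of offending rows (0 means pass)."""
    checks = {
        # Product ids that appear more than once.
        "duplicate_product_ids": """
            SELECT COUNT(*) FROM (
                SELECT product_id FROM products
                GROUP BY product_id HAVING COUNT(*) > 1
            )""",
        # Inventory rows whose product_id has no matching product.
        "orphan_inventory_rows": """
            SELECT COUNT(*) FROM inventory i
            LEFT JOIN products p ON p.product_id = i.product_id
            WHERE p.product_id IS NULL""",
        # Inventory rows with a missing quantity.
        "null_quantities": "SELECT COUNT(*) FROM inventory WHERE quantity IS NULL",
    }
    return {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}

# Deliberately dirty sample data so every check finds exactly one problem.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER, name TEXT);
    CREATE TABLE inventory (product_id INTEGER, location TEXT, quantity INTEGER);
    INSERT INTO products VALUES (1, 'widget'), (2, 'gadget'), (2, 'gadget');
    INSERT INTO inventory VALUES (1, 'NYC', 5), (3, 'LA', NULL);
""")
results = validate(conn)
print(results)
```

A workflow along these lines would only go on to publish when every count is zero.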

So now let’s actually see that in action, demo time. Cool, so basically what I’m going to do first is I’m going to run this query just to show you that there are new records. So you can see here I’m running one query against our main branch in the table that we created, and then another one against our Postgres and MongoDB tables. So I’m going to run those queries. Again, each query gets a tab for the results, making it really easy to see their results. And as I go through each one, you’re going to notice that the Postgres and Mongo table now have 60 records, while the Iceberg tables have 50 records. So that means they’re not in sync. I need to add those 10 records over to the Iceberg table. So how can I do that? Well, we’re going to be using branches. We’re going to isolate that data. All real nice. And again, really nice. I love working in the SQL editor, because it’s just such a nice UI. Again, you get also really cool features like autocomplete. You get syntax highlighting. It’s a really, really quality SQL crafting experience. But here I have my SQL that I’m going to use for ingestion. And basically, the flow of this is that I’m creating a branch. After I create the branch, I’m then switching over to that branch, and then I’m running my insert queries. So that way, any changes after this use statement occur inside the context of that branch. So now I’m going to run those queries. And then we’ll see that those changes will be, well, in the branch. So again, create branch, use branch, and then I just do all my SQL. And I’m going to run those queries right about– I did it at some point. I know it’s happening. Come on, come on, me. Hit the button. It’s coming. There we are. There we go. I hit it. OK, so I hit the button. The queries ran. And huzzah. So now all four queries ran. Now what I’m going to do is I’m going to run these queries that show you the data as it is in the main branch and the data as it is in the ingestion branch. 
They show you that those new 10 records are only in the ingestion branch. We have successfully isolated the data. So if I look here and I go through each result, I see here 50 records. So main branch only has 50 records, nothing changed. But I go to the ingestion branch, 60 records. Guess what? Production users are not seeing the new data. How cool is that? I have protected them from accidentally seeing inconsistent data. And it was just simple SQL. And again, any simple SQL or any complex SQL to Dremio can be automated through JDBC, ODBC, Apache Arrow Flight, REST API. You have your options to design the workflows that you want to design. 
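
Editor's sketch: since the whole flow is plain SQL, it can be generated and submitted by a script. The Python below only builds the statement strings, in the order the demo describes: create the branch, switch to it, run the inserts, compare row counts per branch, and finally merge. The `CREATE BRANCH`, `USE BRANCH`, `AT BRANCH`, and `MERGE BRANCH` forms are written as I understand Dremio's branching SQL and should be checked against the current Dremio documentation. The catalog, branch, table names, and the `WHERE` filter are hypothetical.

```python
def ingestion_statements(catalog, branch, insert_sql):
    """Ordered statements: create the branch, switch to it, then run the inserts."""
    return [
        f"CREATE BRANCH {branch} IN {catalog}",
        f"USE BRANCH {branch} IN {catalog}",
        *insert_sql,
    ]

def count_at_branch(table, branch):
    """Row-count query for a table as seen from one branch."""
    return f"SELECT COUNT(*) FROM {table} AT BRANCH {branch}"

def publish_statement(catalog, branch, target="main"):
    """Merge the validated branch back into the target branch."""
    return f"MERGE BRANCH {branch} INTO {target} IN {catalog}"

# Hypothetical source and target names; the filter is only an illustration.
inserts = [
    "INSERT INTO arctic.supply_chain.raw.products "
    "SELECT * FROM mongodb.store.products WHERE product_id > 50",
    "INSERT INTO arctic.supply_chain.raw.inventory "
    "SELECT * FROM postgres.public.inventory WHERE product_id > 50",
]

statements = ingestion_statements("arctic", "ingestion", inserts)
check = count_at_branch("arctic.supply_chain.raw.products", "ingestion")
publish = publish_statement("arctic", "ingestion")

for sql in [*statements, check, publish]:
    print(sql + ";")
```

The strings could then be sent over any of the interfaces mentioned in the talk, with the merge held back until the validation queries pass.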

But now that we’ve established that we have isolated the data, what if I ran all my quality tests? They passed all those quality tests, and I want to merge them in and publish them over to production. Well, let’s see how that would look. It’s as simple as this. A SQL statement that just says merge branch, copying over the name of the branch. I paste the name of the branch in there, into our main branch in the particular catalog that I called Arctic. And yeah, we just run that. And done. There we go. I’ve just published the changes. It was as simple as that. I just ran a quick merge statement. Now all those changes, they’re now available. And just to prove that, seeing is better than believing, I’m going to run those queries again. So you can see, now all of them have the 60 records. So now those additional 10 records are now available everywhere. I’ve officially published those changes over to main. And we’ve completed the ingestion of that data. So to recap, we’ve created a data product. We created those tables from our other data sources pretty easily using a CTAS statement. Now we have the data. We’ve ingested some additional data. And now we can just go build some dashboards. How cool is that? And again, this all took– right now we’re doing this in 10 minutes. But you do the tutorial at home, it really only takes you like 20 to 40 minutes to actually do this whole thing at home.

But now basically what I want to do is I want to make sure that that dashboard’s really fast. So what I’m going to do is I’m going to create a reflection. I’m going to click on editing the dataset over here in a moment. I’m going to kind of hover over there. There we go, I clicked it. And now I’m going to be able to click here and say, hey, I want reflections. I want aggregate reflections, because we’re talking about the BI dashboard here. So I’m going to go down to aggregate reflections, choose the dimensions and measures that I want to optimize for. 
So I’m creating a dashboard that covers things like, hey, what’s my products by location, products by category. So I want to make sure I optimize by those dimensions. So I’m going to drag those and drop there. It’s pretty straightforward. I hit Save. Those reflections are generated. Dremio’s now aware of them. So now any queries on this dataset, if those are aggregation queries within these dimensions and measures, Dremio will just know, let me substitute these reflections in that query. The analyst who might be building that dashboard doesn’t have to even know that this was created. They’re just going to notice it’s faster. 
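
Editor's sketch: conceptually, an aggregate reflection behaves like a precomputed rollup over the chosen dimensions and measures, which the engine swaps in for matching aggregation queries. sqlite3 has no such feature, so the example below builds the rollup by hand and shows that a dashboard-style query returns the same answer from either the base data or the rollup. In Dremio the substitution happens automatically, and the table and column names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory_report (product TEXT, category TEXT, location TEXT, quantity INTEGER);
    INSERT INTO inventory_report VALUES
        ('widget', 'tools', 'NYC', 5),
        ('widget', 'tools', 'LA', 3),
        ('gadget', 'toys', 'LA', 7);

    -- Hand-built rollup: dimensions (location, category), measure SUM(quantity).
    CREATE TABLE inventory_rollup AS
        SELECT location, category, SUM(quantity) AS total_quantity
        FROM inventory_report
        GROUP BY location, category;
""")

# The "inventory by location" chart, answered from the base data and from the rollup.
from_base = conn.execute(
    "SELECT location, SUM(quantity) FROM inventory_report "
    "GROUP BY location ORDER BY location"
).fetchall()
from_rollup = conn.execute(
    "SELECT location, SUM(total_quantity) FROM inventory_rollup "
    "GROUP BY location ORDER BY location"
).fetchall()
print(from_base, from_rollup)
```

The rollup has far fewer rows than the base data at realistic scale, which is where the speedup for matching queries comes from.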

So now here I’m in Apache Superset, literally running right off my laptop. And I’m going to build out my three dashboards, or my three charts. So first, I’m building out the chart inventory by location, showing me inventory by each one. Then I’m going to do inventory by category. So there’s inventory by category. And then I’m going to do basically the quantity of each product bar chart, which will be coming up next. And then I’m going to put those on a dashboard and done. I’ve built my dashboard. And you’ll see that after we see the whole dashboard, I’ll go back and take a look at the jobs on Dremio. And you’ll see that all of them are now sub-second queries. The reason is that Dremio didn’t have to crunch all those aggregations every time. It just used those reflections. So now every time we access this dashboard, it’s super fast. And this could work for data sets of any size. Because basically, Dremio is going to manage calculating those aggregations ahead of time, making your dashboards super fun and fast. And yes, I do mean super fun. So yeah, there you go, sub-second, sub-second, sub-second, sub-second. And you can see that they’re using reflections. Because right over here, you see little reflection symbols. So bottom line is basically building your data lakehouse from end to end was really just that simple. And that’s the data ops story here with Dremio. We were able to isolate our ingestion. We were able to easily replicate and monitor the changes to our views. And yeah, it works. Hopefully, you guys try out that blog at home. But with that, I’m going to hand it back to Tomer.

Tomer Shiran

All right, thank you. Thank you, Alex. That was amazing. I know for a lot of folks here in the audience and joining us remotely, they probably think that’s too good to be true, too easy. Maybe that’s because you wear Dremio socks and you wait for your previous self to click buttons. But no, that’s all real. And to prove it, you can go to that blog post. You can actually do this in 20 to 40 minutes. But another way in which I want to make sure that you believe that it is really that easy and that valuable is by bringing to the stage Tian de Klerk, who’s director of data engineering and reporting at S&P Global. And he’s going to talk about how S&P is using these technologies to implement data ops end to end.

Tian de Klerk

Hi, everyone. I’m Tian de Klerk, and I run IT business intelligence inside of S&P Global. Specifically, I’m located in the corporate division under DTS. At S&P Global, we’re in the business of data, and the company consists of many divisions, all focused on delivering value through analytics or data across industries like finance, energy, and transportation. I am specifically located in the IT business intelligence team, like I kind of covered. And our job is to deliver value through the data we get from our IT systems. The IT business intelligence team is tasked with delivering reporting and data on and around IT, the goal being to provide visibility to corporate and divisional stakeholders. We achieve this by connecting to various products through REST APIs or other automations available to us and then ingesting that data into our data lake. We then combine and transform the data from various sources to draw meaningful business insights and deliver that, or reporting, to our end users as well. 

S&P Global Data Challenge

Our challenge originally, at conception, was that we’re an internal team. And so we’re not directly revenue generating. So our solution needed to be cost effective. We ended up connecting Power BI directly to Azure Blob Storage and ingested the data into Power BI that way and did most of the heavy lifting inside of Power BI. We also used flat files as we needed our business stakeholders to be able to download and use those files, which may have been a bad choice to start out with. And then lastly, we had CMDB data stored inside of Cosmos DB, which in this case was not the right tool for the job and was highly inefficient for our use case. We could see that in both cost and general processing power. All of this, however, did serve us well for quite a long period until the transformations became too large for Power BI to handle and potentially for the Cosmos DB instance to handle. And we realized we needed something between Power BI and our data lake. We already started investigating some warehousing solutions, but realized it required too much finesse and too many hoops to jump through to get the data into the warehouse. And I heard from a colleague about Dremio, and we started investigating. And essentially, in our investigation, we realized this tool would work for us, just based on its ease of use, native integration with Power BI, and how easy it was to just get started. 

S&P Global Data Platform

So we ended up with this architecture. It’s really simple, but very powerful. We essentially placed Dremio straight onto our Azure data lake and ingest the data into a raw layer, which we basically control using REST APIs to create an Iceberg table, which we then keep maintained with the REST APIs. This serves as an acceleration layer already at base, because we realized the CSVs, our flat files we originally started with, weren’t keeping up, even with reflections. We maintain the Iceberg tables by upserting daily. And then in addition, we delete any records that are no longer in the source system. We then utilize the data from the raw catalog and import that directly into Curated using Views. And then from Curated, we combine or transform these sets to then build Presentation views, and both Curated and Presentation can be used in our reporting. 
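
Editor's sketch: the daily maintenance described above, upserting and then deleting records that no longer exist in the source, can be illustrated with sqlite3 as a stand-in. The talk does not show S&P Global's actual statements. On an Iceberg table in Dremio this would more likely be expressed with a `MERGE` statement plus a `DELETE`, which is an assumption on my part. The `source` and `target` tables and their columns are hypothetical.

```python
import sqlite3

def sync(conn):
    """Upsert every source row into the target, then delete rows missing from the source."""
    # SQLite upsert syntax (3.24+); "WHERE true" avoids a parsing ambiguity
    # when ON CONFLICT follows an INSERT ... SELECT.
    conn.execute("""
        INSERT INTO target (id, status)
        SELECT id, status FROM source WHERE true
        ON CONFLICT(id) DO UPDATE SET status = excluded.status
    """)
    conn.execute("DELETE FROM target WHERE id NOT IN (SELECT id FROM source)")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE target (id INTEGER PRIMARY KEY, status TEXT);
    -- Yesterday's state of the target.
    INSERT INTO target VALUES (1, 'old'), (2, 'active'), (3, 'retired');
    -- Today's source: 1 changed, 3 removed, 4 added.
    INSERT INTO source VALUES (1, 'new'), (2, 'active'), (4, 'active');
""")
sync(conn)
rows = conn.execute("SELECT id, status FROM target ORDER BY id").fetchall()
print(rows)
```

After the sync, the target matches the source exactly: the changed row is updated, the new row is added, and the removed row is gone.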

We also picked Arctic catalogs to essentially store our various semantic layers. And within raw, we’d essentially branch out every time we would create a table. And once we’re happy with it, we’d merge that into production. And then next, for any daily merges or deletes we do, we’d also branch out. When we’re happy with those, we push them back in. For Curated and Presentation, since they mostly contain Views, we would branch out any changes we make on a daily basis, or if customers require transformations, or we just like to mess around with it. And if we were happy with our changes, we’d merge them back in. So essentially running data as code. So we can plan feature releases. We can communicate with our stakeholders and make sure they’re happy with the transformations before we actually push it to production. 

So our team’s first goal with Dremio was to replace the legacy solution we had in place. And it did that. This change saves us about 50% in our monthly Azure spend, given our poor original architecture and our use case. But the second purpose was moving processing off of Power BI and into Dremio, making data sets more portable and reusable. It had the added benefit of removing the strain from Power BI, improving report refresh times by at least 30%. We learned a few lessons along the way. We learned to first make sure the data’s stored in a sensible way. Our data lake was a bit of a mess. And so some data sets took a bit longer to get to a state where we could potentially pick it up and put it into Dremio. Also make sure your file format works for you. And if it doesn’t, well, you get to use Dremio to help you migrate that to something better, like we were doing with the Iceberg tables. The other lesson is to architect with governance in mind. I think everyone has touched on governance today. But it’s really important, as getting all your data centralized is a big goal for a lot of companies, but it’s also a problem statement. Dremio provides us with all the features to govern the data, but only you know your use case. So keep that in mind as you’re building. 

What’s Next for S&P Global

What’s next for us? We’re still working through our challenges in terms of governance and ownership. Since we are the IT business intelligence team, our data doesn’t necessarily belong to us. We pull it from various teams. And they’re not necessarily the users. So questions like permissioning or regarding data quality, these are the questions we’re relying on our processes to start answering. We’re also still not fully utilizing data as code (Git for data) in our environments and processes quite to its fullest extent. And so there’s a bit more we can potentially do there. Lastly, we would like to provide leaders the ability to ask questions of our data. That requires having all the data present in a single platform and including the metadata and the governance around that, which will help reduce barriers to potential GenAI solutions and implementations. And on that, I’m going to hand it back over to Tomer. Thank you very much, everyone.

Tomer Shiran

DataOps with Iceberg and Dremio

Thank you. Thank you, Tian. So just to wrap up this session and recap this, we saw a few things here. We talked about the rise of Iceberg and how it’s now become the industry standard. We talked about specifically the things that Dremio was doing as a company in our product with both the SQL Engine and the catalog. And then we saw the end-to-end workflow, the DataOps workflow, everything from ingesting the data to managing changes to it, using Git for Data, DBT, and other tools. Again, you can go and try out that tutorial that Alex was talking about. Very simple. You can do it at home. And then we saw a real-world use case at S&P Global. So feel free to scan this QR code. It basically takes you to the Getting Started page. You can sign up for Dremio Cloud, like I said, and start using all of this for free. So with that, we’re going to wrap up this keynote.