May 3, 2024
The Modern Data Stack: Data Intelligence, Lakehouse, and GenAI for Business
What is a modern data stack and how can it help your business be more data-driven?
Businesses are adopting data intelligence and lakehouse architectures to build GenAI and data-driven applications. Ari Kaplan helped lead the adoption of data science throughout baseball, including creating the Chicago Cubs analytics department. This session features a lively discussion of the most up-to-date business trends for building data and AI at scale and velocity, and of how businesses are now training their own proprietary LLMs on their own data with transparency, governance, and explainability.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Ari Kaplan:
What I’m going to be speaking about is a little bit of the history of lakehouses and what they’re used for, but I think most importantly, what is the most modern state today? What is the next thing? How is GenAI being used? So whether people are developers, architects, engineers, or business execs: what is the latest and greatest in lakehouses, and what is next in the marketplace, which we believe will be data intelligence on top of the lakehouse. To have one slide on my life: I’ve had a pretty wild journey. A lot of people know of me from the sports analytics world. I created and led the Chicago Cubs analytics department and was involved in that whole Moneyball paradigm, using data and AI for good, for example, finding missing prisoners of war, or working with McLaren Formula 1 racing. All sorts of fun stuff, but above and beyond sports, I’ve been through all the different iterations of technology, from the early dot-com days to having been president at one time of the worldwide Oracle user group, when Oracle acquired Java, PeopleSoft, MySQL, and others. And now we’re part of the next revolution, which has been lakehouse technology and now GenAI, which is everywhere. But I’m going to be focusing on how it all comes together and what people can practically do now with all of this technology.
The Databricks Company
A little bit on Databricks. We are a Dremio partner, so I appreciate that. If you hadn’t heard of us, we actually coined the term lakehouse, which is the combination of the data lake and the data warehouse. We have over 6,000 employees, $1.5 billion in revenue, and are also now a pioneer of generative AI. But what is pretty wild, pretty exciting for me, is that the whole foundation is the Delta Lake, MLflow, and Apache Spark open source technology. And one eye-opening thing, having joined Databricks about a year and a half ago, has been just how tremendous the popularity of open source is: tens of millions, hundreds of millions of downloads every single year, and a community building to make sure it’s the latest and greatest, scalable, secure, and open. We’re also a recognized leader with the analysts. The Forrester Wave just came out a couple of days ago and we’re in the top right, and we’re in the Gartner Magic Quadrant for a couple of different categories, such as database management and data science and machine learning.
So we’ve heard a couple of keynotes, a couple of presentations. It’s pretty clear data and AI are going to be the drivers of the winners in every single industry, and that couldn’t be more true. The lakehouse solved a very real need in the market. I was a computer programmer in my earlier days with Oracle and other database and data warehouse vendors, pretty much business intelligence. What happened there: you’re writing SQL queries, things are structured, and transactions are ACID, if you know what that is (atomic, consistent, isolated, durable), meaning discrete, reliable queries. But then you had the data lake, and that’s where the lakehouse, the play on words between the data warehouse and the data lake, comes together. And that is where you have largely unstructured data.
That’s very good for predictive analytics, things like real-time streaming, things like PDFs, PowerPoints, Word documents, videos, social media, stuff like that. They are very different; you can see this kind of dividing line in the middle. And this is the data maturity curve: from looking at past reports, what happened in the past, like sales transactions, to being more prescriptive, what might happen in the future based on the complexities of the world and of real life. And then the other complexity is we started getting multiple types of data. It’s called multimodal. You can get numeric data, so traditional AI is predicting based on numbers, dates, and simple text. And now AI has gotten smarter. You can do things like the bottom right, that kind of humorous image of dogs or muffins. Top right, words: how do you analyze documents?
How do you analyze massive amounts of encyclopedic information, reports, and summary documents, aligning them with all the other data? At the bottom are time series, a special case of AI: how do you predict future patterns of numbers based on the past, based on outliers, based on something happening in real life, weather patterns changing, interest rates fluctuating? How might that affect things in the future? Then in the top left, that’s geospatial data. And while this has always been a data source, in the context of AI it’s not just where products were sold. It could be, for example, which geographies in my supply chain are over- or underrepresented. Where could I place my routing trucks, for example, or ambulances, to preemptively be closer to the right place at the right time? Where are we over- or understaffed in our stores? So that’s geospatial.
All of it coming together gives you the best AI outputs for your decisions. But there’s a huge challenge, since you would typically have a different solution for your data lake, your machine learning, your streaming, GenAI, data warehouse, governance, data science, BI, and orchestrating it all. That could be many, many different tools stitched together. That creates problems. Number one, it costs a bunch more, but it also leaves a lot of your data siloed. The prior speaker was mentioning that 80 to 90% of your data is either unstructured or just not searchable. That’s a big challenge. What data do you have? How is it accessed? How is it protected? Data privacy: if you have six vendors, you’re going to have six audit logs, and you’re not going to be able to see from the raw data to the final decision along every step of the way. How is that data coming together? And then you’re also dependent on highly technical staff. I’ve also been hearing a lot at this conference, listening in remotely, about collaboration and democratization. So how can you get your highly technical staff focused on the highly complex, but your less technical staff able to ask questions as well? That’s the challenge the lakehouse addresses, and unified platforms are key.
Where Is the Industry Headed?
Now, where is the industry headed? You have the data lakehouse, an open, unified foundation for all your data. And about a year and a half ago, GenAI jumped on the scene. A lot of us had been doing bits and parts of GenAI, but now it’s really mainstream. When I say GenAI, it’s in addition to having LLMs, being able to fine-tune them or do RAG, augmenting them with your data, but it’s also being able to improve your operations. If you’re a software developer, you can now use GenAI in products to debug your code, for example. Or if you’re a data engineer, you could use GenAI to automate some of your workload management, or, if you’re in a data warehouse like Databricks SQL, to scale performance up and down based on prior workload knowledge so that you’re not overspending for utilization that’s just sitting there. So the terminology that we hope will catch on for the entire industry is the data intelligence platform. That’s where you have a lakehouse as the foundation, your data is democratized, and AI is used to get intelligence from all of your data. It could be GenAI, it could be traditional AI.
Data Intelligence Platform
So this is what the architecture of a typical data intelligence platform looks like, going from the bottom up. At the bottom is the actual cloud environment; there are different main vendors, Google, Amazon, Microsoft, et cetera. That’s where the data lake stores raw data in volumes and files. And then on top of it, you have something like Delta Lake, or there are other open standards out there, Iceberg, Hudi, but you can weave it all together uniformly. So you have that same reliability and also eliminate the need to move data back and forth everywhere. You can do management and security on your data, wherever it is. Then Unity Catalog is the capability for everything security, governance, and cataloging all of your assets. I will be doing a quick demo, and part of that is super exciting, since once things are registered in Unity Catalog, your data intelligence knows the context, what your data actually contains, what it’s saying. So you don’t really need an exact keyword search. It also helps data science with Mosaic (MosaicML was acquired for $1.3 billion about a year ago), ETL (extract, transform, load), being able to do real-time analytics as things are happening, and then orchestrating it all. And it includes one of the most performant, fastest-growing data warehouses out there, Databricks SQL.
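To make that layering concrete, here is a minimal PySpark sketch of the pattern described above: raw files land in cloud storage, and a Delta table on top gives every downstream reader the same reliable copy. The paths, table names, and columns are hypothetical, not from the talk.

```python
from pyspark.sql import SparkSession

# Assumes Delta Lake is available to the Spark session (preconfigured
# on Databricks). All paths, names, and columns here are hypothetical.
spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# Raw files sit in the data lake layer (volumes and files).
raw = spark.read.option("header", True).csv("/Volumes/main/raw/sales/")

# Registering them as a Delta table adds ACID reliability on top of the
# same storage, so data never has to be copied into a separate warehouse.
raw.write.format("delta").mode("overwrite").saveAsTable("main.bronze.sales")

spark.table("main.bronze.sales").groupBy("region").count().show()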
So where does intelligence fit in there? With LLMs, it’s being able to create, tune, and serve custom large language models, whether it’s DBRX, which is our own, or any number of the open source or proprietary LLMs out there. As data is coming in, how can you use AI or intelligence to automate data quality? As you’re ingesting, how is the data merging with fuzzy logic? Are there outliers that can be identified? It’s using AI to run your jobs based on past runs, so you can scale the compute and the storage up and down automatically, or at least guided by humans. Then with Databricks SQL, one of the cool things is text-to-SQL. They say that the greatest, most popular programming language these days is your natural language, English or what have you. The foundation for all of this is good data; good AI requires good data. So you really need that unified data and governance underneath. And when I say governance, some people think, “Ugh, another tool I have to add that kind of limits me,” but there are so many positive things. Lineage is key: being able to have transparency when you build a machine learning model, for example, to see where the raw data came in, how it was transformed, and who was able to access it; and when you build a model, to see which attributes, features, or columns were relevant. That’s all very important, with lineage every step of the way.
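As an illustration of the text-to-SQL idea, here is a rough sketch. The `ask_llm` function is a hypothetical stand-in for whatever model-serving endpoint you call, and the schema and question are made up; a canned answer stands in so the sketch runs.

```python
# Rough sketch of text-to-SQL; ask_llm is a hypothetical stand-in for a
# call to whatever LLM endpoint you use (DBRX, an open model, etc.).
def ask_llm(prompt: str) -> str:
    # In practice this calls a model-serving endpoint; a canned answer
    # stands in here so the sketch runs end to end.
    return ("SELECT director, SUM(gross_revenue) AS total FROM movies "
            "GROUP BY director ORDER BY total DESC")

schema = "movies(title STRING, director STRING, gross_revenue DOUBLE)"
question = "Which director brings in the big bucks?"
prompt = (
    f"Given the table schema {schema}, write one SQL query that answers: "
    f"{question}. Return only the SQL."
)
generated_sql = ask_llm(prompt)
# Review the generated SQL, then run it against the warehouse:
# spark.sql(generated_sql).show()
```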
Access control is more important now than ever, when you have this siloed data or you want to bring it in, whether it’s through a data clean room where you share with third-party companies only the data you want. And then discovery: all these assets are really hard to find. Auditing: what happened, when and where, for compliance. And then monitoring: a lot of times, with the price to use platforms, if you’re charged per compute, you want to be able to monitor and control. This employee can spend a million dollars a day (I’m exaggerating), this employee can only spend $5,000 a day, or which assets are being used or not used. So all of that, through AI and Unity Catalog, is super important to enable. And this is an animated GIF where you can see lineage from the source on the left: how it gets refined, and what columns are transformed as data moves downstream or back upstream. That ensures data quality; it captures the lineage to make sure you understand exactly what’s going on. It’s not a black box.
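As a small example of what that access control looks like in practice, these are Unity Catalog-style SQL grants run from a notebook cell; the catalog, table, and group names are hypothetical.

```python
# Unity Catalog-style grants, run as SQL from a notebook cell.
# Catalog, schema, table, and group names are all hypothetical.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.gold.movies TO `analysts`")

# Auditing starts with being able to inspect who can see what:
spark.sql("SHOW GRANTS ON TABLE main.gold.movies").show()
```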
And then in terms of data privacy, there are some techniques where you just mask the data. And there are other techniques, like my banking and credit card system uses, where it never actually sends your email or your social security number or whatever the private data is. In this case, you can see hexadecimal. So you’re never even transporting the raw data; it’s transformed on the backend, if you do it appropriately, in the clean room manifestation. And then you can do, for example, lakehouse monitoring, seeing everything that’s going on, not just in your data, but in your notebooks, your GitHub repository, your AI models, your LLMs, schemas, tables, catalogs, et cetera.
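A minimal sketch of that “never send the raw value” idea: replace the private field with a keyed hash (the hexadecimal seen in the demo) before data leaves your environment. The key handling here is deliberately simplified; a real setup would pull the key from a secret manager.

```python
import hashlib
import hmac

# The secret key is hypothetical and would live in a secret manager,
# never in code.
SECRET_KEY = b"replace-with-a-managed-secret"

def tokenize(value: str) -> str:
    """Deterministic, non-reversible token for joining in a clean room."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

print(tokenize("jane.doe@example.com"))
# Both parties tokenize with the same key, so records still join on the
# token without either side ever seeing the other's raw emails.
```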
Live Demo Using Data Intelligence on a Lakehouse Platform
So let me jump in and do a quick demo of using data intelligence on a lakehouse platform. In this case, we’re helping with code assistance, and this is our interface. If you follow me on LinkedIn, there are all these other videos, and we have a whole demo center if you want to learn more, but here I’m in my own workspace. I’m just going to show four key things. If you haven’t seen these before, you might have heard of GitHub Copilot or other copilots out there; there are a lot. This is the Databricks Assistant, which is built in. Long story short: I was in India literally two days ago and gave a presentation to 700 people, and about 60% of the developers there are already using some copilot or another in their environment. So it’s very important to start using and understanding these, since they are going to be table stakes, commonplace. I’ll show how you can use it to write code, to edit code, to document things, and to do intelligent search, where it’s not just a keyword search; it understands the context of what you’re looking to do. So I made a notebook and imported IMDb, the Internet Movie Database, every movie made, and imported just the top 1,000. What I can do is go ahead and read it in; we have a volume already set up. Maybe I’ll make that a little bit larger. And this is what the data looks like.
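For reference, the read-in step is roughly one line of PySpark; the volume path and file name below are assumptions, not the demo’s actual paths.

```python
# Read the IMDb top-1000 CSV from a volume; path is hypothetical.
df = spark.read.option("header", True).csv(
    "/Volumes/main/demo/imdb_top_1000.csv"
)
df.show(5, truncate=False)
```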
The main thing here: if you’ve ever been a developer, anyone in this audience, this could be a familiar problem. The data is what you’d call dirty. The title column isn’t just the title; it has a number, like the ranking, and in parentheses it has the year. That is interesting, but you pretty much want to clean it out. You want to break the title into multiple columns. So here’s where it gets fun. You have a prompt, and there are a couple of ways to do it. One is you click on the Assistant, and you’ll see a copiloting assistant pop up on the right side of the screen. I just copied and pasted: here’s an example of the title column; I want you to write a function to extract the release date and the title from the title column. And you can see it not only creates the code, but it explains what happened. That alone is a mic drop moment if you’ve never seen it before. For a younger me from two years ago, being able to see something like this, and it works in real time, is incredible.
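The assistant’s answer is roughly the following kind of PySpark, continuing from the DataFrame read above and assuming titles look like "1. The Shawshank Redemption (1994)"; the exact input format is an assumption based on the narration.

```python
from pyspark.sql import functions as F

# Extract the four-digit year in parentheses, then strip the leading
# rank and trailing year from the title itself.
df_clean = (
    df.withColumn(
        "release_year",
        F.regexp_extract("title", r"\((\d{4})\)", 1).cast("int"),
    )
    .withColumn(
        "title",
        F.regexp_extract("title", r"^\d+\.\s*(.*?)\s*\(\d{4}\)$", 1),
    )
)
df_clean.select("title", "release_year").show(5, truncate=False)
```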
Then you just click to replace the active cell content, and then you can run it. This is 100% generated by the Databricks AI Assistant, and now you see the title no longer has the release year; it’s been put in its own column. And you can do similar things, additional transformations: hey, there’s another column that has a vertical pipe; let’s go ahead and pump that in and make two new columns. I can go ahead and run that, and now it has new columns with the gross revenue and the number of votes. If you didn’t have this, it could have taken me hours to figure all of this out. Regex, if you’re not an expert in string manipulation, takes time, and it takes time to debug. And speaking of debugging, if you do make an error, like if I hit a couple of spaces here and then run it, it will give an error. Then you can diagnose the error with just /fix, and it says: it looks like you have this issue, and here’s how you fix it. It recognized the extra spaces. That’s super cool, in my opinion. I’ve had customers say that just that fix feature, fixing up broken code, is extremely helpful.
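The pipe-splitting transformation is similar in shape, continuing from the previous snippet; the column name and the "gross | votes" layout are assumptions from the narration.

```python
from pyspark.sql import functions as F

# Split a pipe-delimited column such as "28341469 | 2343110" into
# gross revenue and vote count. Column names are assumptions.
parts = F.split(F.col("gross_and_votes"), r"\|")
df_split = (
    df_clean
    .withColumn("gross_revenue", F.trim(parts.getItem(0)))
    .withColumn("num_votes", F.trim(parts.getItem(1)).cast("long"))
)
df_split.select("title", "gross_revenue", "num_votes").show(5)
```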
Then you do additional prompts. I mentioned there are a couple of ways to do it. One is through the side assistant; another is in the notebook itself: please extract the director from the cast column. You can accept it or reject it. I’ll go ahead and accept it, and then I can run it, and it’s slicing and dicing the data as I wanted. Looking at the time, I’ll accelerate: changing your data types is a nuance in computer programming that’s kind of important, changing things from date fields to integer fields, things like that, and which columns are being run. The bottom line is you literally can just say what you want a program to do, in this case in Python, but you can switch it to SQL, to do all sorts of different things: write code, write documentation, debug code, count the stars, diagnose errors, which I showed you. This says, oh, there’s an error, diagnose it: you can’t just do one plus a string; you have to convert that number to a string before concatenating with the string. So I plop it back in, hit run, and now it works.
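The type error he describes is the classic one below; /fix points at exactly this kind of thing.

```python
count = 1
# label = "Row " + count      # TypeError: can only concatenate str (not "int") to str
label = "Row " + str(count)   # the fix: convert the number to a string first
print(label)                  # -> Row 1
```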
And then the next thing is the power of Unity Catalog. This is just text, a prompt, and when I run it, this is the SQL it generated; it pumped the table back into Unity Catalog, and this is where it starts getting cool. Let me do two quick things. First, pop up the table. This is where you can automatically generate a description of what the table contains: the movies table contains information about your movies, your release date, your revenue, et cetera. That’s pretty good; I accept it. Then I can have AI generate similar tags on each of the columns: what the title contains, the release date, the gross, things like that. Now that I’ve done that, I can ask a question, and this is where semantic search is really cool: is there a table I can use to find which director brings in the big bucks? See what it comes up with, and it comes up with a budget table and the movies table. To bring that to light, if you didn’t follow what I just did: I used a natural language query, and nowhere in this question did I use the word budget or movie, but it found three potential tables, including the movies silver table. “Big bucks” might be a different phrase for revenue, and “director” is not actually a keyword match. That was pretty darn cool; it figured out movies might be a good example. And then there’s popularity: the more people query a table, the more it rises up, or it sinks down to the bottom.
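Behind the scenes, registering the result and attaching the generated description is roughly the following, continuing from the previous snippets; the table name and comment text are hypothetical.

```python
# Register the cleaned DataFrame in Unity Catalog and attach the kind
# of table comment the AI generates. Names and text are hypothetical.
df_split.write.format("delta").mode("overwrite").saveAsTable(
    "main.demo.movies_silver"
)
spark.sql("""
    COMMENT ON TABLE main.demo.movies_silver IS
    'Top-1000 IMDb movies with title, release year, gross revenue, and votes.'
""")
# With descriptions and column tags in place, semantic search can match
# "which director brings in the big bucks" to this table with no
# keyword overlap at all.
```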
GenAI Journey
So that was a cool demo. Looking at the little bit of time left: I mentioned workload management, and if you’re interested in large language models, I co-authored a blog, so check my LinkedIn profile or Google my name and LLM, and it should come up. That was LLMs and AI improving the developer, engineer, and IT workflows. But now companies are looking to generate their own LLMs based on their own data. A year and a half ago, that was very expensive, millions of dollars, with the skill set hard to find. Now it’s getting easier and easier to do. I encourage you to look at Mosaic from Databricks; we have our own LLM, and you can use others. The journey of maturity starts with prompt engineering: if you have kids, or you’ve ever just said “write an essay” or “summarize this,” that’s prompt engineering. Write it in a Shakespeare theme: prompt engineering. Then there’s retrieval-augmented generation, where you have your own data and you want to use an existing LLM infrastructure. RAG is where like 90% of the world is today, and it’s made easier with things like vector databases, which are getting more and more popular.
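Conceptually, a minimal RAG loop looks like the sketch below: retrieve the most relevant document for a question, then build a grounded prompt. The `embed` function is a toy stand-in for a real embedding model, the documents are made up, and a vector database would do the similarity search at scale.

```python
import numpy as np

# Toy embedding for illustration only: hash characters into a vector.
def embed(text: str) -> np.ndarray:
    vec = np.zeros(64)
    for i, ch in enumerate(text.lower()):
        vec[(i + ord(ch)) % 64] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

docs = [
    "Q3 revenue grew 12%, driven by the new product line.",
    "The analytics team retrained the demand forecast model in April.",
]
doc_vecs = [embed(d) for d in docs]

# Rank documents by cosine similarity to the question and keep the top k.
def retrieve(question: str, k: int = 1) -> list:
    q = embed(question)
    order = np.argsort([float(q @ v) for v in doc_vecs])[::-1]
    return [docs[i] for i in order[:k]]

question = "How did revenue do last quarter?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # this grounded prompt is then sent to the LLM of your choice
```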
Fine-tuning is where you can insert a larger amount of your corporate data and control it within your environment, so your data, your queries, your prompts, and your sample code do not get leaked to the world. And then pre-training is the latest thing: you can generate your entire model from scratch so it really understands the context of your data, all private and really optimized. That’s where the future is: getting easier and easier to do, with a couple of commands, and something like 99% less expensive for companies than it was a year and a half ago. So I want to thank everyone. You can follow me on LinkedIn for more adventures and more content on the subject, and I wish everyone well on your data intelligence platform journey and your lakehouse journey.