May 3, 2024

Demystifying Medallion and Lakehouse Architectures

Lakehouse and medallion architectures are popular subjects, but also controversial, with discrepancies between theory and practice. In this session, we explore the history of data warehousing and examine various design patterns for building a proper Lakehouse architecture.

Topics Covered

Data Mesh and Fabric


Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Piethein Strengholt:

Today’s presentation is about demystifying the Medallion and Lakehouse architectures, and this is one of my favorite subjects. The Medallion and Lakehouse architectures are very popular and common. It’s also a proven pattern, and it’s easy to set up. Nevertheless, it comes with many nuances, so I’m happy to guide you through them. But before we dive deep into the content, let me quickly introduce myself. I’m the Chief Data Officer for Microsoft in the Netherlands, and I’ve been in that role for three years now. Before that, I worked for ABN AMRO, where I was the Chief Data Architect. I think in that phase I designed something similar to what you would today call a data mesh architecture. I’m also the author of this book, Data Management at Scale, which you see on the left; the second edition came out last year in May. I also blog a lot on data architecture, data management, and data strategy, and I recently opened a new YouTube channel, Data Pancake. So please feel free to check it out.

Medallion Architecture

On the Medallion architecture: there are two terms I think we need to demystify first. The Medallion architecture, well, it’s just an easy way to explain how data should be logically managed and separated. We use terms like bronze, silver, and gold, hence the word Medallion. But honestly, I think the practice of logically organizing and managing your data in separate layers isn’t really new. I found this picture recently on LinkedIn, and I thought it was nice to include it here. That logical separation of managing your data dates back to the late 90s, or even the 80s, when we started working with data warehouses. And when we look at data warehouses, this is typically what we see. On the left we have the source systems. The data there is often complex in its nature and structure, there might be business logic encapsulated, and we use these systems for transactional purposes. Then we bring the data into our data warehousing architecture, either directly or via an intermediate layer, which could be an operational data store, for instance. We bring the data to a staging layer, which is often a low-cost storage environment; think of a file store or a low-cost relational database management system. Next, we bring it into the data warehousing environment. First we have an integration layer, where we integrate, combine, and unify all of the data. Typically we also use metadata for describing that data, maybe for invoking the transformation pipelines. And after we have integrated, unified, and corrected all of that data, we bring it to the next layer: the data marts, for consuming the data on the right-hand side, typically with reporting tools.

And this data architecture has been around for quite a while. Its main purpose is really consolidating and integrating data from different source systems into one central repository, which we call the data warehouse. But the decoupling of the layers was already there; that’s a best practice that dates back many, many years. So that part isn’t really new. Then the other term, lakehouses: how do these differ from traditional data warehouses? In my view, lakehouses combine the benefits of those first-generation warehouses and the later, second-generation data lakes, in which we could use low-cost storage for storing vast volumes of data, apply a schema-on-read approach, and even combine structured and unstructured data. Lakehouses tend to bring these two different architectures together, so you don’t have to run them side by side separately. Instead, they are tightly integrated into one more unified architecture.

Also, compared to the previous slide, where you could vary the technologies you would like to use for your staging layer or your integration layer, here everything is unified. We usually have one single technology for storing all of that data. Typically, when I look at lakehouses, we do this in the cloud, so we separate storage and compute from each other, meaning you can independently scale these architectures up and down. Lakehouses are also a bit more open compared to those more traditional data warehouses. Often I see we use open columnar table formats; Delta Lake, for instance, is a popular one. That supports ACID transactions, which I think maps and links back to the data warehouses, and it also brings versioning, schema enforcement, and schema evolution. And then lastly, we have an open-source in-memory compute engine for massive data processing. It supports Python, Scala, SQL, and Java, and with this engine you can also do machine learning and data science besides traditional ETL and data integration. So to summarize, lakehouses combine the best practices, the benefits, of first-generation warehouses and second-generation data lakes.
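To make those ingredients a bit more concrete, here is a minimal sketch, assuming a Spark session with the Delta Lake (delta-spark) package available; the table path and columns are hypothetical examples, not from the talk.

from pyspark.sql import SparkSession

# Assumes the delta-spark package is installed; these configs enable Delta Lake.
spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# A tiny, hypothetical orders dataset.
orders = spark.createDataFrame(
    [(1, "2024-05-01", 120.0), (2, "2024-05-02", 80.5)],
    ["order_id", "order_date", "amount"],
)

# Writing in an open columnar table format (Delta) gives ACID commits,
# schema enforcement, and versioning on plain low-cost storage.
orders.write.format("delta").mode("overwrite").save("/lake/bronze/orders")

# Time travel: read an earlier version of the same table.
first_version = (
    spark.read.format("delta").option("versionAsOf", 0).load("/lake/bronze/orders")
)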

How the Medallion Lakehouse Architecture Often is Presented

So when I talk to customers, this is typically how the Medallion architecture is presented. On the left, again, very similar to data warehouses, we have the source systems. We ingest the data first into a bronze layer, which could be our staging layer. From there onwards, we bring it to the silver layer, then the gold layer, and only then do we start consuming that data. We need to orchestrate all of this. About those three different layers: bronze is typically the raw, technical data. It could be a one-to-one copy, or we apply some mediation in between, but it’s still the unprocessed data.

Then in silver, we typically start to clean, filter, maybe de-duplicate, and make some lightweight corrections to the data. And lastly, the gold layer is really for consumption, where we meet the functional requirements we typically see on the consuming side. So this is, at a high level, how this architecture is presented.

However, it’s really about the nuances, and this is more how things look in practice. Yet again, we have the source systems on the left, and often I see this optional landing area. This makes sense, for instance, when you would like to decouple the ingestion from left to right. There could be a network sitting in between, or the source systems themselves want to be in control rather than having us read directly from them. So the landing zone or landing area can be an ideal intermediate environment to store the data in first. From there, we ingest it into the bronze layer. Here I often see that we at least validate the schema. Depending on how the data is ingested, we either append it (if it’s append-only), overwrite it, or layer it using a partition scheme with folders. For instance, when we do full extracts, we often make folders using a year/month/day partitioning scheme, and all of that raw, unmodified data sits in our bronze layer first.
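As an illustration of that bronze pattern, here is a minimal sketch of landing a full extract with a year/month/day partition scheme, assuming the Spark session from the earlier sketch; the paths, schema, and source file are hypothetical.

from datetime import date
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical expected schema; reading with FAILFAST validates it on ingest.
expected = StructType([
    StructField("customer_id", StringType()),
    StructField("name", StringType()),
    StructField("balance", DoubleType()),
])

raw = (
    spark.read.option("header", True)
    .schema(expected)
    .option("mode", "FAILFAST")
    .csv("/lake/landing/crm/customers.csv")
)

# Stamp the extract with the load date and append it unmodified to bronze,
# partitioned into year/month/day folders.
today = date.today()
(
    raw.withColumn("year", F.lit(today.year))
    .withColumn("month", F.lit(today.month))
    .withColumn("day", F.lit(today.day))
    .write.format("delta")
    .mode("append")
    .partitionBy("year", "month", "day")
    .save("/lake/bronze/crm/customers")
)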

In the next layer, the silver layer, I often see that we start to functionally validate the data. We clean it a bit, filter it, and apply lightweight modifications. Often I also see that we historize the data within this silver layer: the structure more or less matches the structure of the bronze layer, but we use slowly changing dimensions for storing the historical data, so we can time travel on that data functionally, not just technically. Then I sometimes see a second silver layer. The data is still source-oriented, but customers want to conform it a bit upfront, before they start integrating and combining it with data from other source systems. I see more standardization applied, maybe on the data types, and often the reference data is mapped and enriched. Master data management often plays a big role here as well: maybe we source data from that silver layer, bring it to a master data management solution, and then enrich that second silver layer with the master identifiers, for instance the customer master identifiers.
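A minimal sketch of that historization step, using a type 2 slowly changing dimension on a Delta table. The paths, keys, and columns are hypothetical, and it assumes the silver table already carries is_current, valid_from, and valid_to columns and that the incoming batch holds one row per customer.

from delta.tables import DeltaTable
from pyspark.sql import functions as F

silver_path = "/lake/silver/crm/customers"
silver = DeltaTable.forPath(spark, silver_path)

# Hypothetical: the latest extract, reduced to its business columns.
updates = (
    spark.read.format("delta")
    .load("/lake/bronze/crm/customers")
    .select("customer_id", "name", "balance")
)

# Keep only rows that are new or whose attributes actually changed.
current = silver.toDF().filter("is_current = true")
changed_or_new = updates.join(
    current.select("customer_id", "name", "balance"),
    on=["customer_id", "name", "balance"],
    how="left_anti",
)

# Step 1: close the currently valid record for changed customers.
(
    silver.alias("t")
    .merge(
        changed_or_new.alias("s"),
        "t.customer_id = s.customer_id AND t.is_current = true",
    )
    .whenMatchedUpdate(set={"is_current": "false", "valid_to": "current_timestamp()"})
    .execute()
)

# Step 2: append the new versions as the current records.
(
    changed_or_new.withColumn("is_current", F.lit(True))
    .withColumn("valid_from", F.current_timestamp())
    .withColumn("valid_to", F.lit(None).cast("timestamp"))
    .write.format("delta")
    .mode("append")
    .save(silver_path)
)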

Then in the next layer, the gold layer could really become an integration layer, where we combine, integrate, and unify all of the data across source systems. That is quite similar to what we saw in the previous slide with the typical data warehouse: the integration layer, the unified layer which holds all of the integrated, combined data from many different source systems. In this layer I also see a lot of business logic being applied. And then the next layer, yet another gold layer, could be for meeting the requirements on the consuming side, where we want to serve data towards our use cases. In that layer I often see additional calculations being applied; maybe we only select a subset of the data, we aggregate the data for the use cases on the right-hand side, we filter, and often feature engineering also plays a role in that last layer. From there onwards, we start serving the data to the right. So again, that classic bronze, silver, gold layering comes with lots of nuances and can really look different depending on your needs.
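A minimal sketch of those two gold steps: an integration join across source systems, followed by a use-case-oriented aggregate. The paths, join key, and measures are hypothetical.

from pyspark.sql import functions as F

# Hypothetical silver tables from two different source systems.
customers = (
    spark.read.format("delta")
    .load("/lake/silver/crm/customers")
    .filter("is_current = true")
)
orders = spark.read.format("delta").load("/lake/silver/sales/orders")

# Gold integration layer: combine data across source systems on a shared key.
integrated = orders.join(
    customers.select("customer_id", "name"), on="customer_id", how="inner"
)
integrated.write.format("delta").mode("overwrite").save(
    "/lake/gold/integrated/customer_orders"
)

# Gold use-case layer: a smaller, aggregated table serving one consumer.
revenue_per_customer = (
    integrated.groupBy("customer_id", "name")
    .agg(
        F.sum("amount").alias("total_revenue"),
        F.count("order_id").alias("order_count"),
    )
)
revenue_per_customer.write.format("delta").mode("overwrite").save(
    "/lake/gold/usecase/revenue_per_customer"
)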

My Best Practices

What I’ve done here, and I won’t discuss this in much detail, is write down my best practices, what I typically see for all of these different layers. You see the best practices for the landing layer and the ingestion layer, which are all part of bronze. Then we have the silver layer: our cleaned, validated, sometimes conformed layer. And gold can really split up depending on the problems or use cases you are dealing with. It could, for instance, contain a more classic integration layer or an analytical layer. It could contain a use case layer, similar to the data marts we saw in the typical data warehouse architecture. And sometimes I also see a distribution layer, where we share data with other domains. That data is typically a bit more stable, well managed, and described with more metadata, and we typically call the data in this layer our data products. So that’s the distribution layer.

So this is roughly how we could layer the architecture. Next, a nuance or a modification to that architecture. The architecture I presented looks pretty simple in itself, but often I see customers start to complement it with additional metadata or data management services. Maybe initially you started working with your first sources and wrote your Python scripts, but over time you learn that all of those hard-coded paths and data quality validations noted down in notebooks don’t really make sense. You would like to make it metadata-driven, so you store all of that schema metadata in a small SQL repository and use it as input for declarative scripts you invoke with pipelines. Then you also use that metadata for validating data quality, for instance, so you add an additional data quality component. Maybe you also want to capture the logging of all of these different steps and store it in a more centrally managed database, because there are more teams operating lakehouses and you may depend on a specific order in which data is ingested across all of these different architectures. So separating out that logging component and managing it more centrally makes sense. That’s another additional component.
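A minimal sketch of that metadata-driven approach, assuming a small metadata table reachable over JDBC that stores a source path, target path, DDL schema string, and data quality rule per dataset; the JDBC URL, table names, and rule format are hypothetical.

from pyspark.sql import functions as F

# Hypothetical metadata repository: one row per dataset to ingest,
# with columns source_path, target_path, schema_ddl, quality_rule.
metadata = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://meta-db;databaseName=metadata")
    .option("dbtable", "dbo.ingestion_config")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)

for row in metadata.collect():
    # The schema is declared as a DDL string in the repository,
    # e.g. "customer_id STRING, name STRING, balance DOUBLE".
    df = (
        spark.read.option("header", True)
        .schema(row["schema_ddl"])
        .csv(row["source_path"])
    )

    # Data quality: the rule is a SQL expression stored as metadata,
    # e.g. "customer_id IS NOT NULL AND balance >= 0".
    failed = df.filter(~F.expr(row["quality_rule"])).count()

    # Ingest the data and log the quality result in a central table.
    df.write.format("delta").mode("append").save(row["target_path"])
    spark.createDataFrame(
        [(row["source_path"], failed)], ["source", "failed_records"]
    ).write.format("delta").mode("append").save("/lake/ops/dq_results")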

A data catalog often plays an important role here. Data quality also matters, and maybe you would like to capture the data quality results in another centrally managed component or repository from which you can oversee data quality end to end, across all of these different teams. Imagine a situation where you have multiple teams operating multiple lakehouse architectures; it would be nice to oversee data quality end to end. And then there could be additional components for the transformation logic. What I often see is that the first layer of ingestion is easy to standardize with metadata. Those transformations are relatively straightforward, often syntactic: simple unions, joins, filters, aggregations, maybe you split or concatenate data. But closer to the right, more business logic is involved in writing the scripts. A popular transformation framework here is, for instance, dbt, and I see that being used a lot among the customers I work with.

So the architecture in itself starts to grow with more components. Another best practice I often share with my customers is to separate the distribution of data to other domains from the usage of data within the domain itself. Why would you want separate layers here? If you share the tables you use for your own use cases directly with other domains, and you later need disruptive changes to that layer, to the table structures, for instance, it’s likely that all of those consumers outside of this architecture will be directly impacted. So the best practice is to separate those concerns into an additional layer: one layer for distributing data across domains, and another gold layer for the use-case-specific data within the boundaries of the domain. That’s another nuance or variant of this architecture.

Another iteration could be adding real-time ingestion. Data comes in via streams, or you use change data capture to capture the changes made to the operational systems and emit those as events into this architecture. Then you have a real-time stream processing engine in which you maybe compare events, apply windowing functions, or enrich the data. I sometimes see customers use the lakehouse tables in the middle to look up data and enrich those real-time events. Those events can even be distributed to other teams.
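A minimal sketch of that streaming path, assuming events arrive as JSON on a Kafka topic (with the Spark Kafka connector available) and are enriched against a silver lakehouse table before being aggregated per time window; the broker, topic, paths, and columns are hypothetical.

from pyspark.sql import functions as F

# Hypothetical event stream: JSON order events on a Kafka topic.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
    .select(
        F.col("timestamp"),
        F.get_json_object(F.col("value").cast("string"), "$.customer_id").alias("customer_id"),
        F.get_json_object(F.col("value").cast("string"), "$.amount").cast("double").alias("amount"),
    )
)

# Enrich the stream with a lakehouse table (a stream-static join).
customers = spark.read.format("delta").load("/lake/silver/crm/customers")
enriched = events.join(customers.select("customer_id", "segment"), "customer_id", "left")

# Windowed aggregation: revenue per customer segment per five minutes,
# with a watermark so late events are eventually dropped.
per_window = (
    enriched.withWatermark("timestamp", "10 minutes")
    .groupBy(F.window("timestamp", "5 minutes"), "segment")
    .agg(F.sum("amount").alias("revenue"))
)

# Persist the results back into the lake for downstream consumers.
query = (
    per_window.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/lake/_checkpoints/revenue_by_segment")
    .start("/lake/gold/streaming/revenue_by_segment")
)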

Another variant could be that in your gold layer you want better, more predictable performance, or the queries are more singleton queries: you constantly fetch single records, you make small updates, or you have high concurrency as a requirement, or need faster updates and reads. It could very well be that you replace the Spark engine with a more traditional relational database management system, like SQL here, for more predictable performance. This can also make sense when you have lots of joins; relational database systems in general handle massive joins better than the Spark architecture, you could say. So a relational database could also make sense there.

Also, the real-time events are sometimes pushed into the lake as well. So you could keep those and persist them, for instance, in your silver layer or in the bronze layer. This is another nuance or modification to that common architecture we saw in the beginning.

The bronze layer can also look really different depending on the scenarios and the ways you ingest data. This is an example I encountered while working with a customer. The bronze layer here is really a conglomerate: on the one hand, data that is persisted and ingested into this architecture through extracts (that’s the middle lane); on the other hand, mirroring, where we capture the data in real time and keep an in-sync, near-real-time copy. Those could be other tables sitting in your bronze layer. And we could even apply some virtualization, where we shortcut data that resides elsewhere: we leave the data as is, and only when we need to read it do we access it, so we bring it in virtually. Your bronze layer then becomes more of a conglomerate of different approaches used side by side. This makes perfect sense in lots of scenarios.

Another nuance or variant I see is where a larger central team takes ownership of all data ingestion and integration for the larger enterprise, or for a larger scope in that respect. We ingest the data, we clean it up, and then we have this classic gold integration layer, in which all data is integrated and combined. From there, we start to make subsets: we select data subsets for different use cases, apply optimizations there, and again apply different types of business logic for those different use cases. So your gold layer then splits into different layers for the different use cases on the right-hand side.

Sometimes I also see a self-service layer. Here we have yet another layer, but this one is predominantly for ad hoc data exploration and experimentation, for business users who want to run experiments themselves or sometimes even write some SQL statements, and who want to use and play with the data directly. In this architecture, you see we provide read-only shortcuts in another layer, and we also allow users to upload their own data, but this is managed separately from the production environment, the production workspace. So again, another variant.

We could even combine the application integration architecture with the data management architecture here. We ingest data in real time, store it in the lake, and do batch data processing, but on the consuming side you see a large variety of patterns that consumers and even other domains can work with. We apply request-response, so maybe we persist the data yet again, another copy in, say, a NoSQL environment or a document store, and serve that data out in real time to other consumers. We could give them virtualized access. We could even distribute events to other consumers, or we put an API gateway in place, segregating the strongly consistent reads you want to take directly from a source system from the eventually consistent reads you see more on the lower right.

So that also works, and all of these different patterns, I sometimes see, can even be combined and integrated into one giant, big, crazy architecture like you see here. In this architecture, we’ve combined and integrated all of these different patterns, and you even see we duplicate data for the different read patterns we would like to support. We have even brought in a time series database here for large volumes of time-series-related data, for instance, but that data, again, could come from your lakehouse architecture. So this architecture can really grow over time.

Another variant I see is where you don’t provision the same individual architecture from a blueprint again and again; instead, you allow for variants, specific lakehouse architectures depending on the requirements you see. This is an example where different lakehouse architectures work side by side with other lakehouse architectures. All of these lakehouse architectures are managed and owned by different teams, and the architectures themselves really look different depending on the requirements. There could be one layer in this example, or even not a physical layer but a virtual layer, where we shortcut or access the data directly, virtually. So that could be yet another variant, where we see all of these different architectures sitting side by side.

And then, bringing it all together, and this is my last slide before we have room for questions: in a typical large enterprise, you see many of these architectures running side by side, managed by different horizontal teams, and in the end it’s really about creating a community. It’s a team sport, managing all of these different architectures. What’s important here is to think about setting standards. It doesn’t make sense if each of these teams decides for itself what the best way is to model data, serve it to other consumers, or measure data quality, for instance. Data quality here is really critical, and the central enabling teams at the top should set and build the foundation and work closely with members from all of these other domains to set these standards and create a community of enablement. And that brings me almost to the end.

Recapping Medallion Architecture

So, recapping: the Medallion architecture, yes, it’s a nice framework for data processing, but as you have seen today, it’s not an end-to-end data architecture; it comes with lots of nuances. Lakehouses, as we also discussed today, combine the benefits of classic data warehouses and data lakes. The silver layer, I often see at organizations, can play the role of an operational data store: if you keep it source-oriented, it can very well also support the business with the operational queries or operational analytics they would like to perform. Gold is often the most obscure layer because of all of the different requirements and the breadth of the architecture. Lastly, standardization is key. The typical end-to-end architecture we saw in the previous slide is usually a blend of many diverse technologies, which you need to bring together. Spark is great for big data processing, but at the customers I see it is usually not enough to handle all of the different requirements, so you also need to think about complementing this architecture with additional components and technologies.