May 3, 2024
Sharing a Lakehouse at Nordea Asset Management: How we implemented Data Domains with Dremio
Nordea Asset Management has been on a data journey over the past few years. Data is the raw material for asset management, and we consume plenty of it. Good stewardship of this data is not merely a matter of understanding the business – it is the business. This session will review how Nordea used Dremio to implement data domains. We will talk about our approach and results.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Anders Bogsnes:
So, very happy to be here, calling in from Copenhagen, Denmark. I wish I could have been there in New York as well, but I guess I’ll have to be next year. So, yes, my name is Anders Bogsnes, I am the Head of Investments Engineering at Nordea Asset Management. I’ve been working in the space of data, Python, machine learning, and analytics for the past 10-ish years; my former jobs were in Python enablement, machine learning, and analytics. Currently, I’m in charge of engineering in the investments domain, father of two, husband of one, as well as a PyData Copenhagen organizer. And hopefully we’ll get some really nice questions in the chat, because I hope to get some discussion going.
Get to Know Nordea Asset Management
So, just to give you a little bit of history about how we came into this journey: Nordea Asset Management is a subsidiary of Nordea, the bank, which is one of the Nordics’ biggest banks. We have around 250 billion euros of assets under management and around 900 employees spread across Stockholm, Copenhagen, Lisbon, and a few other smaller offices. One of the advantages of being a subsidiary of a very big company like Nordea is that we are able to be front runners on a lot of ideas. Some seven or eight years ago, some people from Nordea came to Nordea Asset Management after they had tried to implement domains there and hadn’t quite succeeded in the way they wanted to.
When they joined Nordea Asset Management, they saw an opportunity to redo it and learn from what had gone wrong in their previous implementation. Part of it was helped by the fact that, being fresh blood, they realized there was a lot of overlap in the various functions that existed at the time. So they decided to map out the domains based on their experiences, and that led to the first draft of our domains way before data mesh was a buzzword. That put us in a very good position when the data mesh concepts came out: we had already mapped out our data domains and identified the business functions that we had. So converting those into data domains and adding data products, as a concept at least, wasn’t a big deal. We came into this from the other end from where a lot of companies come into it: we had the governance in place and, while our domains were of course not 100% perfectly defined, we came into it with a knowledge of what our domains should be. Then we had to find the technology to enable data products in the company. That’s a fairly unique position to be in when I speak to my contemporaries across different industries.
NAM Data Domains
So for those of you who aren’t very familiar with data domains, the basic idea is that you have an encapsulation of a business function. We have 12 different data domains in Nordea Asset Management. Now, there was supposed to be lots of fancy animation here, so you’ll just have to imagine that in your head. One of our big domains, for example, is ESG. We are very strong on green bonds, stocks, and responsible investments, so that domain curates a lot of data around what we know about companies and their ESG profile, and exposes it as a data product that, for example, our limits domain can take in and use for checking that we are maintaining the agreements in the portfolios that we sell. At the same time, other domains expose data products for performance calculations, reference data, and so on. Obviously, a very big part of what we do is buying data to go into our various models and analytics. All of these are exposed as data products that other domains can consume.
So the concept was there, but the execution was basically this: the core application database that we have is currently some 250 terabytes of data, and it was great if you knew how to use it. You could just sign in, use SQL to get all the data that you wanted, join it, and use it without having to worry too much about getting access to the right thing. So it was good for speed, but anyone who has been in this position is probably cringing right now, because, yes, it is super painful 15 years later. And of course, this is what got us to where we are. Taking it from the top: an application database has its own schema, and it’s not a schema that is meant for human eyes. You kind of have to have 10 years of experience in the application to know that, well, you have this table over here, and if it is a stock that we’re talking about, then you can join it to this table B using this column C. But if it’s an obligation or a bond, then it’s actually a different table altogether, and you just have to know that. So it doesn’t scale very well, and commingling all this data without being able to trust that the data you get is actually the data you need caused us a lot of headaches.
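To make the “you just have to know” problem concrete, here is a small, purely illustrative sketch; the table and column names are hypothetical stand-ins for an undocumented application schema, not our actual model.

```python
# Purely illustrative: the right join depends on the instrument type, and the
# table/column names are hypothetical stand-ins for an undocumented schema.
def position_sql(instrument_type: str) -> str:
    if instrument_type == "stock":
        # Equities happen to join to table B via column C...
        return """
            SELECT p.portfolio_id, p.quantity, b.ticker
            FROM positions p
            JOIN table_b b ON p.column_c = b.column_c
        """
    if instrument_type == "bond":
        # ...but bonds live in a different table altogether, and you just have to know that.
        return """
            SELECT p.portfolio_id, p.quantity, d.isin
            FROM positions p
            JOIN table_d d ON p.column_e = d.column_e
        """
    raise ValueError(f"Unknown instrument type: {instrument_type}")
```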
So the first stab at it was: OK, we all agree that giving people direct access to the application database is not the best approach, so let’s move everything out into various databases. We can build some REST APIs on top of it, and we can use Kafka (we’re a big Kafka user at Nordea Asset Management) to expose various events happening, and everything will be great. Everyone can now consume perfectly managed data products through REST APIs, or they can get them from Kafka directly if there are events they need to react to. In reality, this works to some extent. In this setup, we at least have much better defined data products. What goes into a transaction or what goes into a holding is now codified in code, and if I need access to that, at least I can go to an API and know that the data I’m getting is agreed to be the source for a transaction or a holding.
The problem is what we have given the people who actually have to work with the data in the end. Again, we have to remember where we came from. Being an asset manager, we have a lot of clever people hired who are very good at wrangling data and writing SQL to get the data that they want, and they’re testing out various hypotheses about the markets, trading strategies, all that sort of thing. Now we told them that they can no longer just write the SQL to get the data they want: they need to go to 10 different APIs and maybe five different Kafka topics to actually get the data they need in order to build their strategy. So the user experience went from, “I just write the SQL and I get the data that I need,” to, “Oh, by the way, you also need to maybe write a Python program and set up a Kafka connector to sync it to a database, and then you can actually get started on using the data as input for your strategy.” So the user experience: not great.
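As a rough sketch of what that workflow looked like for an analyst (the endpoint URLs and field names below are hypothetical), what used to be a single SQL join became something like this:

```python
# Hypothetical endpoints and field names; the point is the shape of the workflow,
# not the specific services.
import pandas as pd
import requests

transactions = pd.DataFrame(
    requests.get("https://transactions-api.internal/v1/transactions", timeout=30).json()
)
holdings = pd.DataFrame(
    requests.get("https://holdings-api.internal/v1/holdings", timeout=30).json()
)

# What used to be one SQL join now happens on the analyst's laptop, after they
# have figured out which services to call, how to authenticate, and how to page.
portfolio_view = transactions.merge(holdings, on=["portfolio_id", "isin"])
```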
The Lakehouse
So the next step from here, leveraging the fact that we now have more defined data contracts and data products, is that we’re moving into a lakehouse. The idea is that we want to give our users the experience they had before. They just want to be able to write some SQL and join the tables: they want to join transactions and holdings and all these entities they’ve always wanted to, without having to write a bunch of different API calls, connect to various Kafka topics, and then join it all on their local machine. That’s where the lakehouse comes into the picture. It becomes the gathering point for all these various data products, which makes it easy for people to access that data without having to write a ton of code.
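A minimal sketch of that target experience, assuming an ODBC connection to the lakehouse; the DSN and the analytics.* table names are hypothetical, and any SQL client that can reach the lakehouse would do.

```python
# One SQL statement against the lakehouse instead of a pile of API calls.
# The DSN and table names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=lakehouse", autocommit=True)
rows = conn.cursor().execute(
    """
    SELECT t.trade_date, t.quantity, h.market_value
    FROM analytics.transactions t
    JOIN analytics.holdings h
      ON t.portfolio_id = h.portfolio_id AND t.isin = h.isin
    WHERE t.trade_date >= '2024-01-01'
    """
).fetchall()
```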
Outer Architecture vs Inner Architecture
So at Nordea Asset Management, we think a lot about outer architecture versus inner architecture. In our domains, everything that happens inside the domain is what we consider inner architecture, and outer architecture is how those domains communicate with each other. We have strong views on what you need to do as a domain when you’re talking to another domain; what you do inside your domain is the responsibility of the various value streams and the product owners of those value streams. The inner architecture is drawn as a gray box here because it’s not a black box. It’s not that we don’t care at all; it’s that the choices made inside the value stream in the domain are always going to be trade-offs, and those trade-offs are best made by the people on the ground instead of having a central architecture function telling them how to do it.
That doesn’t mean we don’t have sensible defaults. If you don’t know what you’re doing, or it’s not a core competency you particularly care about, then you have a number of components that you can plug and play in order to get those capabilities. And pretty much every data domain looks the same to some extent. You have a number of data sources: those could be external vendors, or they could be other data domains exposing data products that you consume. You extract data from those sources, you do some kind of transformation and enrichment, and then you publish it. The contract that we’re setting up between our domains is that you publish to three different places. We still love Kafka; Kafka is still a great tool for what we call the operational plane. So if I need to react to someone else making a trade, then I will consume that from Kafka.
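As a minimal sketch of that operational-plane consumption; the topic name, broker address, message shape, and the kafka-python client are assumptions for illustration.

```python
# Assumes the kafka-python client; topic, broker, and message fields are hypothetical.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "trading.trades",                        # hypothetical topic from the trading domain
    bootstrap_servers="kafka.internal:9092",
    group_id="limits-domain",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    trade = message.value
    # React to the event, e.g. re-run limit checks for the affected portfolio.
    print(f"Trade in {trade['portfolio_id']}: {trade['quantity']} of {trade['isin']}")
```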
At the same time, we also have the use case of, say, what are all the trades that have been done over the past 10 years? That means we need to put that data somewhere, and that somewhere is the lakehouse, in the analytical plane. So the contract is: you put it into Iceberg on AWS S3. How you then choose to consume it from there is more of a sensible default, which we’ll touch on in a bit. The last task that you have as a value stream is to publish to the data catalog. The data catalog, for us, is the place where we want to store the source of truth for quality and lineage. In the old setup, there wasn’t much trust between the various domains and data products, because you didn’t really know what you were getting from inside that shared database. Part of the problem was that everyone was consuming data and then having to run a lot of data quality checks to make sure that what they were getting was actually right, which meant we had a lot of repeated calculations and a lot of repeated jobs basically redoing the same checks, because it was the consumer’s responsibility to check the data and not the producer’s.
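A sketch of the analytical-plane side of that contract, assuming PyIceberg with a pre-configured catalog; the catalog name, namespace, and columns below are hypothetical.

```python
# Assumes PyIceberg >= 0.6 with a catalog configured via .pyiceberg.yaml or
# environment variables; names, columns, and values are hypothetical.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lakehouse")            # hypothetical catalog name
table = catalog.load_table("trading.trades")   # hypothetical namespace.table

batch = pa.table(
    {
        "portfolio_id": ["P-001", "P-002"],
        "isin": ["XS0000000001", "XS0000000002"],
        "quantity": [1000, -250],
        "trade_date": ["2024-05-03", "2024-05-03"],
    }
)
table.append(batch)  # lands in Iceberg on S3; consumers pick it up via Dremio, Spark, etc.
```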
What we want to do is flip that responsibility around and say: it’s actually the publisher who is responsible for running the data checks, as part of the data contract. They’re the ones who know what’s happened. After they run their data checks, they must publish the results into the catalog, so that downstream consumers can read those quality results from there and decide what they want to do with the data based on them. This gives us a number of opportunities, for example graceful degradation: one column might have a data quality constraint that it can’t be null, but if my downstream pipeline doesn’t care about that column, then it shouldn’t break my pipeline. I shouldn’t just stop it because there was a data quality problem somewhere I don’t depend on.
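A minimal sketch of that consumer-side graceful degradation; the check format and catalog payload here are hypothetical, not our actual catalog model.

```python
# Only fail the pipeline if a failed check touches a column this pipeline
# actually depends on. The check payload shape is hypothetical.
REQUIRED_COLUMNS = {"portfolio_id", "isin", "quantity"}

def should_fail(quality_checks: list[dict]) -> bool:
    for check in quality_checks:
        if not check["passed"] and check["column"] in REQUIRED_COLUMNS:
            return True    # a column we rely on is broken: stop the pipeline
    return False           # failures only in columns we ignore: keep going

# A failed not-null check on a column we never read does not stop the run.
checks = [{"column": "trader_note", "check": "not_null", "passed": False}]
assert should_fail(checks) is False
```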
The other part is, of course, discoverability and lineage. When I’m building my data product or doing my analysis, I need to know what data products are available on the shelf, and that’s the secondary purpose of the data catalog: to add a bit of structure and metadata to our data landscape. The third very important property is that it enables federated computational governance, which, if you’ll remember, is the fourth pillar of data mesh. All the data governance rules that we set up, we want to embed into the data catalog, because it has all the metadata of our data sources. So if we have a rule that says every data set, or every data product, should have an owner, it’s the data catalog that can enforce that for us. In the same way, we can enforce constraints such as every data product must have freshness checks, and every data product must have documentation. And the data catalog can, of course, help us by scraping the metadata out of the sources to populate schemas and similar things.
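As a sketch of what federated computational governance can look like as code; the rules and metadata shape below are illustrative, not our actual catalog model.

```python
# Governance rules expressed as code that the catalog can evaluate against every
# data product's metadata. Rules and metadata shape are illustrative.
RULES = {
    "has_owner": lambda dp: bool(dp.get("owner")),
    "has_documentation": lambda dp: bool(dp.get("description")),
    "has_freshness_check": lambda dp: "freshness" in dp.get("checks", []),
}

def violations(data_product: dict) -> list[str]:
    return [name for name, rule in RULES.items() if not rule(data_product)]

product = {"name": "esg.scores", "owner": "esg-domain", "checks": ["freshness"]}
print(violations(product))   # -> ['has_documentation']
```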
The final piece of the puzzle here is, of course, our on-premise legacy. This is, if I may say so myself, a nice architectural drawing. It is what we want to do, what we aspire to do. It doesn’t mean we’re going to move away from our on-premise legacy setup anytime soon, and if we wanted to, we would have to do a big bang migration, which is always extremely painful and generally doesn’t work. So the sensible default around Iceberg that we chose was Dremio, because not only does it very easily handle the use case of, “I have lots of people in my teams who know SQL, and they just want to join transactions and holdings,” it also lets us bridge the gap to the on-premise legacy without doing a big bang. We can use Dremio to abstract away the source of the data, so that our end users don’t break as we move data and databases from on-premise to AWS and into Iceberg and Kafka.
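A sketch of how that abstraction can work, assuming a view in Dremio’s semantic layer; the source, space, and table names are hypothetical, and the exact DDL may differ by Dremio version.

```python
# Consumers keep querying analytics.transactions no matter where the data
# physically lives; only the view definition changes when the source moves.
# Hypothetical paths; run the DDL through whatever SQL client you use.
TRANSACTIONS_VIEW = """
CREATE OR REPLACE VIEW analytics.transactions AS
SELECT portfolio_id, isin, quantity, trade_date
FROM onprem_appdb.dbo.transactions   -- later re-pointed at the Iceberg table in S3
"""
```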
So that’s why we chose Dremio as our sensible default for consuming the analytical plane. That doesn’t mean you’re forced to use it in your inner architecture: how you choose to publish into Iceberg, for example, is your own decision. But again, if you don’t know what you want, or you don’t have any special needs, then Dremio is the default choice. It also means that if some teams do have special needs, maybe they need Spark or something a bit more specialized, then they have that ability. Everything is in Iceberg, an open-source format that we know is going to stick around for a while. So if Dremio doesn’t suit your use case for your inner architecture, that’s perfectly fine; everything is available in an open format.
The People
So the final piece: we’ve talked about the governance, we’ve talked about the tech, and the last piece for us is the people. I don’t think any of this would really work if we weren’t organizing ourselves in value streams, which I’ve mentioned a few times already. If you haven’t read the book Team Topologies, I highly encourage you to go and do so; it’s a pretty easy read. The way that we organize ourselves is in end-to-end value streams. The value stream has ownership of the end product and can make decisions autonomously. That doesn’t mean they need to know everything: we have enablement teams that can embed into a value stream to upskill it. For example, maybe they need to know about Kubernetes, or about machine learning or AI, or how to work with Dremio or the data lakehouse. Those enablement resources can embed into the value stream, help them get upskilled, and then leave again when they’re ready.
So they become internal consultants that know the bounded context of the company, but are also available for more long-term support and general community building. And then finally, the platform team, again, each of these value streams shouldn’t have to know how to run Kubernetes or run Dremio. That’s the job of a dedicated platform team who provides a self-service platform that they’re able to consume out of the box.
Without this approach, you don’t get the independence needed to adopt the inner architecture, outer architecture model we talked about previously, and you don’t get the autonomy to adopt data mesh properly. But equally, if you don’t have the other two pieces, you’re just not going to be able to implement any of the technological aspects we mentioned either.
The Holy Triumvirate
So for us, it comes down to the holy trinity: you need to have your people strategy in order, you need to have your technology strategy in order, and you need to have your governance in order. Of the three, my experience is that governance is always the hardest, which is why I’m very thankful that we managed to crack that nut before we got really started. People come second; we’re still not there, and we still have a lot of work to do to make sure that everyone is aligned and everyone knows what the strategy is and how to execute on it. The technology comes last, because you can always buy what you need as long as you have a few people who can see the vision. But if the people don’t share that vision, that’s where you end up in trouble. So this is how we see the world, and so far we are very happy with our choices and our strategic direction. Hopefully we’ll get some interesting questions or experiences to share. So thank you very much for your time, and feel free to contact me on any platform where you can find me. I’m always happy to discuss. Thank you.