May 2, 2024
Replacing a Traditional Data Warehouse with Dremio for a Low Latency Lakehouse
In the dynamic world of energy trading, faster and better insights can provide a significant competitive advantage. In this session, we will discuss our data transformation journey, focusing on how we took the entire business through a 3-year transformation programme to empower analysts with quicker access to data, self-service insights, and analytics that enable simpler collaboration and data sharing across the firm. We will touch on the challenges of our traditional relational warehouse and review how we shifted to a low-latency data lakehouse approach leveraging Dremio and Spark streaming to meet evolving business needs.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Andy Guppy
Straight into it today. So it's a real pleasure to be here today to give this presentation on how we've utilized Dremio to enable a low-latency lakehouse within RWE Supply and Trading. Maybe just to start with a little intro to myself. I've been at RWEST for 13 years now, and I've worked in various different roles within the organization, from back office into the ETRM team, focused on Endur as our main ETRM application. But the last five years have been very much focused on data. So primarily bringing data from external sources and making that available internally for our traders, our analysts and support functions. But more recently, we've been on a journey to create a new data platform where we can analyze, share and visualize data easily via a centralized platform. Prior to RWEST, I was working at JPMorgan Chase in London, again mainly focused on trade execution and back office. So really excited to be here today to give you a lowdown of what we've been doing from a data perspective in the last two years. Pankaj, over to you.
Pankaj Choudhary
Yeah, thank you Andy. Hello everyone, I'm Pankaj. I'm a lead data engineer at RWE Supply and Trading. I've been in the IT industry for 15-plus years, and I've been working with big data, data warehouses, data lakes and ETL for the last 12-plus years. I have worked with various organizations in the past, primarily building data platforms for regulatory reporting, retail banking, API management, commodity trading and so on. That's pretty much about me.
Andy Guppy
About RWE
Cool, thank you Pankaj. So maybe the first question on a lot of people's minds is: who are RWE? RWE is a large energy company; ultimately, we've been one of the leading energy companies in Europe. We've been around for roughly 125 years, mainly in the power station area, but in the last 10 years our focus has shifted to green energy. So we've been focusing on green energy, building out our renewable assets within the organization and growing that part of the business worldwide, opening offices in Chicago and, more recently, Melbourne. But what we're going to be talking about today is primarily our commercial hub and our route to market: RWE Supply and Trading. Where we started on our data journey was a quite historic background with regards to an Oracle database. Twenty years ago, when the company started trading, we built an Oracle database, and as you can imagine, with a trading organization, that grew in size quite considerably over time. So around five years ago, the Leading Data journey began. This is where we wanted to bring a sustainable platform into our organization, which would allow us to utilize some of the modern cloud tools for working with data. Some of the common problems we had were not being able to find data within a large, complex organization, people not having access, and also the speed to market with regards to onboarding data from external sources and sharing it. Not only that, the scale of data has evolved immensely over the last few years, so the expensive databases we had before were becoming unmanageable.
So the Leading Data project was an initiative to solve a lot of these problems, cultivating a culture of code and data sharing and technology usage across the business; so not only upskilling IT, but also upskilling our business people as well. How we did that was to create a platform with three main components. First of all, we wanted to tackle the topic of users being able to onboard their own data. So we brought in a data wrangling tool a lot of people will be aware of, called Databricks. This is where we can analyze and work with data, using AWS S3, and we integrated that into our platform. We also wanted to make sure there's one unified way in which users could access data. This is where Dremio comes in. It's our key access layer and our single source of truth for data, sitting on top of a lot of the datasets that we have within the organization: primarily market data, but we've extended that out to trade data, commodity data and renewable assets as well.
As part of that, we also consume the data via various different means: Tableau, Excel, and also localized Arrow Flight and JDBC clients. Our final component was a data catalog. One of the biggest problems was finding data within the organization, so we wanted to bring that in to help harmonize data and make it more readily accessible for people within the organization today. So it's been quite a journey, but what we're going to do now is explore a little bit more of the technical components that come with it. Before we do a deep dive, and Pankaj is itching to talk about some of the technical areas of the low-latency lakehouse we've put in place, we wanted to share some of the key principles we wanted to follow as an organization as part of our journey.
Leading Data Principles
So we wanted to leverage cloud. That was obviously one of the key things, and we couldn't build that functionality ourselves. We wanted to bring third-party suppliers into our organization and bring in some tools which would give us that capability firsthand, so we could take that, customize it, and make it readily accessible for our users. We wanted to focus on one-click deployment: focusing on DevOps, making sure that our scripts are fit for purpose, so that if something does happen from a disaster recovery perspective, we have this one-click deployment mechanism in place to be able to recover quickly if those eventualities happen. And our preference is towards SaaS products. We have Dremio hosted ourselves, which Pankaj will cover, but we have other products within our tech stack where we are working in a more SaaS-orientated way, which works really well from a maintenance and upgrade perspective. I also touched on it on the previous slide, but data integrity is another key point for our organization as well: having a single source from which to go and get the data, and being able to find and access it there, while taking into consideration the license constraints that we've got across the organization.
We wanted to rapidly onboard data. There were silos before, where we weren't able to bring in the data we wanted from outside the organization in a quick way. So creating those templates and mechanisms to do that was an important aspect of the project as well. And again, I've mentioned the data lake approach: moving away from the traditional databases and leveraging S3 storage from that point of view.
The other key part of our data journey in the last few years is about users: upskilling our traders, our analysts and our support functions on working with Python and SQL, so it's not something that's just dealt with within the IT department. We built a governance office, which has really helped us instill a lot of these key principles across our organization, and also changed the culture around sharing of data. So making sure you don't keep that data siloed to yourself; you're sharing it with others within the organization who could benefit from it, and not only the data, but the code and the models and the logic that's associated with calculating that information as well. And yeah, as I've mentioned, one of the key aspects of the project was training, making sure the training was in place to develop our business people as well as the more technical staff that come along with that. So these were the key principles we followed as a project. A bit later on, I'll give you a little bit of an overview of where we are now, but for now I'll hand over to Pankaj for a little bit more of a deep dive into our platform.
Pankaj Choudhary
Components of the RWEST Platform
Sure, so yeah. So today, we're going to briefly discuss how we integrate Dremio in our lakehouse platform, and we're also going to deep dive a little bit into how we onboard a large volume of data from external data sources in a low-latency and self-service manner. So before starting, let's have a quick look at some of the key components of our platform. You can see we are on AWS Cloud, and S3 is our data lake where we store all our structured and unstructured data. For data wrangling and data transformation, we use Databricks Spark. And as you can see, Dremio is the star of the show: it's our business layer, which is kind of a unified data access layer and a query engine in our platform. Last but not least is the data catalog, which is there for users to discover what data is on the platform. And there are a few interfaces where users can get the data from Dremio: BI tools can directly connect to Dremio and populate BI workbooks, users can use RESTful APIs to access the data, there are of course direct connections to Dremio over JDBC or ODBC, and there are Spark jobs to do further processing.
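As a small illustration of the direct-connection route, here is a minimal sketch of how a user might query Dremio over Arrow Flight from Python using pyarrow; the endpoint, credentials, and dataset name are hypothetical placeholders rather than the actual RWEST configuration.

```python
# Minimal sketch: querying Dremio over Arrow Flight from Python.
# Endpoint, credentials, and dataset names below are hypothetical.
import pyarrow.flight as flight

# Connect to the Dremio coordinator's Flight port (32010 by default);
# use grpc+tcp:// instead of grpc+tls:// for a non-TLS deployment.
client = flight.FlightClient("grpc+tls://dremio.example.com:32010")

# Basic authentication returns a bearer-token header to attach to later calls.
token_pair = client.authenticate_basic_token("analyst_user", "analyst_password")
options = flight.FlightCallOptions(headers=[token_pair])

# Submit a SQL query and read the result back as an Arrow table / pandas DataFrame.
sql = 'SELECT * FROM marketdata.curves."power_de" LIMIT 100'
info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
reader = client.do_get(info.endpoints[0].ticket, options)
df = reader.read_all().to_pandas()
print(df.head())
```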
So now let's see how these components work together on the platform. We'll start with the data source integration, basically how we onboard the data. You can see we have two categories of data sources primarily: internal data sources and external data sources. The internal data sources are structured data that already exists in the organization with different teams. Let's say our US entity has some data in their own warehouse, and you want to onboard that data onto our platform. For these kinds of data sources, we use Dremio's built-in connectors, where we directly connect those sources to Dremio and apply a semantic layer on top of that data, where we manage all sorts of permissions so that only the users who have access to that data can see it, and then that's it. It's a very quick and easy way to onboard data from such structured data sources.
Now, the second category of data sources, which is the more complex one, is the external data sources. And when I say external sources, these are the ones where we are getting data from outside the organization. Let's say the data is coming from various commodity exchanges, from the US stock market or Europe and so on; data coming from all over the globe. There are data providers for all sorts of data that is required for commodity trading. And if you look at how many of those sources we have, there are nearly 2,500 data sources from which we pull data every day, and it keeps growing, so that's the big challenge for us. There is a variety of data sources, there is volume, and there is velocity: we extract around 500,000 files every day from these 2,500 sources. So you can see there is variety as well; each data source provides data in a different format or different schema and so on. So it was a big challenge to bring this data to Dremio in a timely manner, so that users have access to fresh data in a timely fashion.
So let's see how we solved this problem. It's a very typical ETL: extract, transform, and load. On the extraction side, we have an extraction module, where our users submit a request to our data onboarding team saying this is the source we want the data from. Our technical team then defines the source from which to extract the data, what the contract of this data will be, what schema it will have, and what the frequency of importing the data from outside should be. Now that metadata is the key. The extraction module uses that metadata to extract the data from external sources, and the extracted data is dumped in a very raw format into an S3 bucket. Once the extraction is complete, the extraction module publishes a message on a Kafka topic. In the next stage, where we do the transformation, there is an AWS Lambda function which continuously pulls messages from the Kafka topic, transforms those files in real time, and dumps them into the landing zone. So that happens very quickly, in real time. Now in this landing zone we have very clean, transformed, validated data, ready for further processing. The loading part is the more complex bit, where we use structured streaming to process all this variety of data. As you can see, there are 2,500 sources, and they populate nearly 1,800 tables. It means we have to spin up 1,800 streams in parallel. That's where the major challenge is, and that's where we built our Spark streaming job, where we spin up one stream for each table. So there are nearly 1,800 streams running in parallel, continuously and incrementally listening for the files coming from Lambda, processing them, and then updating the data in the refined zone. The refined zone is basically our Delta tables. So we use Delta tables, where we store all this data in real time. I would say not real time, but near real time: there is a latency of one to five minutes, so once the extraction is finished on the extract layer, it takes nearly one to five minutes to reach the refined zone.
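As a rough illustration of the loading stage described above, here is a simplified sketch of what one of the per-table streams might look like, assuming Databricks Auto Loader and a Delta merge in foreachBatch; the paths, file format, key columns, and trigger interval are hypothetical, not the actual RWEST pipeline code.

```python
# Simplified sketch of one per-table stream: Databricks Auto Loader picks up
# files landing in S3 and upserts them into a Delta table via foreachBatch.
# Paths, table names, and key columns are hypothetical; `spark` is the ambient
# Databricks SparkSession.
from delta.tables import DeltaTable

landing_path = "s3://lakehouse-landing/curve_prices/"
refined_path = "s3://lakehouse-refined/curve_prices/"

def upsert_to_delta(batch_df, batch_id):
    # Merge each micro-batch into the refined Delta table on the business key.
    target = DeltaTable.forPath(spark, refined_path)
    (target.alias("t")
           .merge(batch_df.alias("s"),
                  "t.curve_id = s.curve_id AND t.delivery_date = s.delivery_date")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(spark.readStream
      .format("cloudFiles")                        # Auto Loader
      .option("cloudFiles.format", "parquet")      # format of the landed files
      .option("cloudFiles.schemaLocation", refined_path + "_schema/")
      .load(landing_path)
      .writeStream
      .foreachBatch(upsert_to_delta)
      .option("checkpointLocation", refined_path + "_checkpoint/")
      .trigger(processingTime="1 minute")          # micro-batches of roughly one minute
      .start())
```

In a setup like the one described, a driver job would start one such stream per table, giving the roughly 1,800 parallel streams mentioned above.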
Now, once the data is in the refined zone, it doesn't automatically become visible to Dremio; we have to fire a metadata refresh as well. That's a bit of a challenge right now. But including the metadata refresh and everything, the end-to-end latency comes to around five minutes. So, yeah, that's pretty much the ETL part of this. Now, coming to the performance: even though we are using Delta tables, as we do more and more updates, we create a lot of small files and so on. For that, we do very frequent vacuuming, auto-optimization, partitioning, and so on, so that the tables stay performant on the Dremio side and users get consistent performance while querying them. Yeah, that's pretty much on the ETL part.
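A sketch of the kind of periodic maintenance described here, assuming it runs as a scheduled Databricks job; the table path, Z-order column, and retention window are hypothetical, and the Dremio statement is shown only as the command that would trigger the metadata refresh mentioned above.

```python
# Hypothetical maintenance job: compact small files and clean up old versions
# so the Delta tables stay performant for Dremio queries.
spark.sql("OPTIMIZE delta.`s3://lakehouse-refined/curve_prices/` ZORDER BY (curve_id)")
spark.sql("VACUUM delta.`s3://lakehouse-refined/curve_prices/` RETAIN 168 HOURS")

# On the Dremio side, newly written files only become visible after a metadata
# refresh, e.g. by submitting this statement through a Dremio client:
#   ALTER TABLE lakehouse.refined.curve_prices REFRESH METADATA
```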
Now, looking at some of the performance stats of this pipeline. As we talked about, the volume is around 500,000 files per day, across roughly 200,000 dimensions. Dimensions are basically identifiers for each extraction, which we call a curve in our analysis terminology. And if you look at the query performance from Dremio on these tables, it's about one second in the best case, and the worst case is 10 seconds. As we saw, the end-to-end latency is one to five minutes. In terms of scalability, every component on the platform is pretty much horizontally scalable. And to keep performance consistent at scale, we monitor each and every component on our platform, and as and when we see any kind of performance bottleneck, we scale the individual component horizontally. So that's our strategy.
Now a quick look at the stats for the overall platform. How many users do we have on the platform? There are 1,300-plus users, from 84 different teams. So you can see the number of users, and alongside the number of users you can see the data as well: there are 300 terabytes of data on the platform. Every day we receive 400,000 queries on Dremio, so it's a huge scale, and 15,000 Databricks jobs pull the data every day from the different sources, do all the transformation, and so on. So yeah, those are the stats on the platform, which should give you a bit of an idea of how big the platform is in terms of data and usage.
Design Considerations & Technology Selection
Now a very quick one on the technology selection and the design considerations. The very first is the storage format, which is very important for us to do all sorts of upserts and at the same time get the best query performance from Dremio. We chose Delta, obviously, because we have Databricks Spark, and it's very much compatible with Delta. But at the same time, Dremio is also very much compatible with the Delta format. One of the things we liked about Delta is its auto-optimization. That's where we struggled a bit when we did a proof of concept on Iceberg. Iceberg has some stored procedures; I mean, when we tried it two years ago, there were a few stored procedures, but they were not as stable as Delta was. So that was our consideration in going with Delta.
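On Databricks, the auto-optimization mentioned here is typically enabled through Delta table properties; a minimal sketch, with a hypothetical table name:

```python
# Hypothetical example: enable Databricks Delta auto-optimization on a table
# so small files are compacted automatically as the streams write into it.
spark.sql("""
    ALTER TABLE refined.curve_prices SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```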
Structured streaming, yes, that's the technology I would say best fits our use case, where we have 1,800 streams that we have to spin up in parallel. And one of the main components of structured streaming that we use here is Auto Loader. Auto Loader is what allows us to do the incremental listing of the files and process them in parallel streams, in micro-batches. Our micro-batching means we have batches of one to five minutes, where we pull the data from all the different sources. So that's why structured streaming is our selection. Now the very important one is Dremio. We use the Dremio AWS Edition offering, which is the deployment on AWS. Some of the key considerations here were: we wanted a horizontally scalable query engine, and Dremio does support a multi-coordinator, multi-engine setup. We also wanted workload management, where we should be able to distribute queries across the engines. That's the feature we use on the platform, where we set up different engines for different teams, and that's how we manage the workload. That's something that is very useful for us in scaling the platform. And obviously, monitoring and alerts: we use quite a lot of JMX stats from Dremio, which give us real-time monitoring of the platform, how each engine is performing, and whether there is any bottleneck on an engine or coordinator, and so on.
Query performance: we had a few issues with query performance, but we worked very closely with Dremio on that, and we're in a good position right now. In particular, we use runtime filtering quite a lot, because we have some tables with terabytes of data in them, and if you query such a large table directly, the query performance is really poor. So here we separated the data and metadata out into smaller tables, and with the runtime filtering feature of Dremio we were able to query those terabyte-scale tables within a second. So that's quite a good feature that we use here. And obviously, partition pruning and file skipping are the key features that give very good performance, particularly on these data tables.
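The query pattern that benefits from runtime filtering is roughly the following: a selective filter on the small metadata table is propagated at run time to prune partitions and files of the large data table it is joined to. The table and column names below are hypothetical, for illustration only.

```python
# Hypothetical illustration of the runtime-filtering-friendly query shape:
# the filter on the small metadata table is applied at run time to prune
# partitions/files of the large, partitioned data table.
sql = """
SELECT f.delivery_date, f.price
FROM   refined.curve_values   AS f   -- terabytes of data, partitioned by curve_id
JOIN   refined.curve_metadata AS d   -- small metadata/dimension table
       ON f.curve_id = d.curve_id
WHERE  d.curve_name = 'DE Power Baseload'
"""
# Submitted to Dremio via Arrow Flight or JDBC, as in the earlier client sketch.
```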
Kafka: Kafka is one of the best tools, I mean messaging channels, for getting event notifications when a file export has completed in the extraction layer, and that's where we use Kafka. And obviously AWS Lambda for all the file validation and file transfer from one stage to the other. So yeah, that's our tech stack, and these are some of the considerations that we made while doing the technology selection. Yeah, that's pretty much about the platform.
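As a rough sketch of the validation-and-transfer step, here is what such a Lambda function might look like when triggered by a Kafka event source mapping; the bucket names, message fields, and validation logic are purely illustrative assumptions, not the actual RWEST implementation.

```python
# Hypothetical Lambda handler: consume Kafka notifications about completed
# extractions, read the raw file from S3, validate it, and copy it into the
# landing zone for the streaming jobs to pick up.
import base64
import json

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Kafka event source mappings deliver records grouped by topic-partition,
    # with each message value base64-encoded.
    for records in event.get("records", {}).values():
        for record in records:
            msg = json.loads(base64.b64decode(record["value"]))
            raw_key = msg["raw_key"]  # e.g. "raw/curve_prices/2024-05-02.csv" (illustrative)

            obj = s3.get_object(Bucket="lakehouse-raw", Key=raw_key)
            body = obj["Body"].read()

            if not body:  # placeholder for real schema/content validation
                continue

            s3.put_object(
                Bucket="lakehouse-landing",
                Key=raw_key.replace("raw/", "landing/", 1),
                Body=body,
            )
```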
Andy Guppy
Thank you, Pankaj. I know Gabrielle wants to get to the questions quickly, so just to summarize where we are now as an organization. We mentioned earlier the culture and the scalable, cloud-native platform that we've put in place. We've also now got a dedicated function within our IT department that looks after the platform and also instills best practices across our organization. And our dedicated governance team is actually located within the business, to ensure the longevity of the platform within our organization and that data ownership lives there as well. Our next focus is on data quality and more real-time charting, which will hopefully enable us to cover a lot of the GenAI and AI topics that we have in the future. But yeah.