Gnarly Data Waves

Episode 27

August 1, 2023

How Maersk is Building A Next Gen Data Lakehouse with Dremio

Learn how Maersk is building the next generation data platform for unified analytics on Dremio’s Open Data Lakehouse

Maersk is a global leader in container shipping, logistics, and energy. With an extensive network of offices in 116 countries, over 900 vessels, hundreds of warehouses, and a modern fleet of aircraft, Maersk provides comprehensive shipping services across the globe with commitments to achieve decarbonization and reach net-zero emissions.

Join this live fireside chat with Mark Sear, Director of Data Analytics and AI/ML at Maersk, and Tomer Shiran, founder and chief product officer at Dremio, as they talk about Maersk’s journey in building a next-generation data platform for solution development using Dremio’s open data lakehouse and generative AI. In this episode, you will learn:

Common data platform challenges in the shipping and logistics industry
How Maersk uses Dremio’s open data lakehouse to empower their developers and end users to deliver agile and cost-effective solutions
A live demo of generative AI

Topics Covered

Data lakehouse


Speakers

Mark Sear

Mark is an accomplished executive and entrepreneur with broad-based experience ranging from bootstrapping startups through to major multinationals. His career highlights include achieving results as a visionary tech innovator and “blue sky” thinker as well as consistent delivery of both strategic and tactical solutions.

Tomer Shiran

Tomer Shiran is the CPO and founder of Dremio. Prior to Dremio, he was VP Product and employee #5 at MapR, where he was responsible for product strategy, roadmap, and new feature development. As a member of the executive team, Tomer helped grow the company from five employees to over 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. He holds a master’s degree in electrical and computer engineering from Carnegie Mellon University and a bachelor’s in computer science from Technion – Israel Institute of Technology, as well as five U.S. patents.

Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Introduction

Alex Merced:

And with no further ado, let's get on to our feature presentation. We're going to be talking about building the next-generation data platform for the supply chain with Dremio's data lakehouse and generative AI. With us today are Mark Sear, director of data analytics and AI/ML at Maersk, and Tomer Shiran, co-founder and chief product officer of Dremio. Mark, Tomer, the stage is yours.

About Maersk

Tomer Shiran:

Thank you, Alex. It's great to be on your show here. And of course Mark, thanks for joining. Maybe let's start with some background about yourself, about Maersk. I know many people, I'm sure, have seen their containers all over the place. So tell us more about the company, maybe.

Mark Sear:

Yeah, containers are part of Maersk. We're moving now from being 100% focused on shipping to full end-to-end logistics. So that's everything from what you might say, factory gate to the consumer. It's a change. It's a rapid change, but it means fundamentally offering more value for our customers end-to-end. It's what the customer wants. It's our strategy. And also we are launching a whole series of zero-carbon initiatives. Our first vessel, believe it or not, I think it set sail this morning, running on green methanol. So, zero carbon footprint, which, if you consider in the past, and indeed, largely today, ocean transport has been somewhat polluting. We are moving to make it zero-polluting. So, leaders in many areas.

The Scale of Maersk

Tomer Shiran:

And that's cool. Given the scale, right? I understand [you have] over 900 vessels, and offices in 100 different countries, and––

Mark Sear:

I think it's now 800 vessels, 120 countries, ports, and offices, yeah, it is colossal. Over 100,000 people work for the organization. So it's astonishing. You think of some of those big ships [that] transport 18,000 containers. So that's 18,000, I think in America you call them semi-trucks. 18,000 in one go, and there are about 11 people on board those ships. Although we're supposed to call them vessels, I keep calling them ships. But there we go. So yeah––

Tomer Shiran:

What do you call the semis in the UK?

Mark Sear:

We call them 'artics'––articulated lorries.

Opportunities for Data

Tomer Shiran:

Okay, okay, cool. So we're talking data here, and, of course, any operation at this scale, and especially in the area of supply chain and logistics, there have to be all sorts of opportunities when it comes to using data to make things more efficient, or reducing costs and increasing business velocity. So maybe you can just kind of give us an overview of some of the opportunities that you see and that you're pursuing.

Mark Sear:

I think, you know, data is everywhere. And it's right at the center of what we want to do now. And I think if you consider the life of a product, let's do one from China to the UK. Just a simple one. Think about everything that goes into getting that, well, let's take an example. This mouse. What goes into getting this mouse from China to the UK? You've got a production schedule, then you've got to work out how you get it from the factory to the port, the port to the next port, the port to the warehouse, and the warehouse to the consumer. So there are vast amounts of data points throughout that. And every single data point is pretty much an opportunity for improvement for both sides. We can improve our profitability, our knowledge, our ability to price correctly, and our ability to staff correctly, but also what's critical to us is the customer.

The customer is truly the king of our business. They need to know “When is this mouse going to be available to me?” Because there's no point [in] me listing it on Amazon or saying it's going to be in my store, or putting it on my website. If that ship turns up 6 weeks late, or the container gets to the port just as the ship leaves, because there's no way you're throwing it onto a ship from even one meter, let alone a kilometer after it leaves. So a lot of that is about visibility.

Much of it is about us sharing our data, and partnering with our customers to make sure they can see where the vessels are when they're turning up. Because again, you think about it, 18,000 containers turn up at a port...think of how long a convoy of 18,000 vehicles is to unload that ship now, if they all turn up at 9 a.m. on a Monday. You can imagine 18,000 semi-trucks in a line. Not a pretty sight.

So there's a lot of data to make sure they know it's going to be 2 days before your containers are unloaded, pick it up at gate 27, etcetera, etcetera. So all opportunities are as endless as data. And you will have heard of the phrase, of course, data is the new oil. It isn't. I've always said, data is the new solar, because it just goes on and on and on. Oil runs out. It's finite. Data isn't.

That's how we see it.

Data Isn't Finite

Tomer Shiran:

And it's probably growing, right? As you instrument more things. And you know, just everything is more and more online.

Mark Sear:

And the more things are instrumented, the more people want to know data. The consumers are a lot more savvy than they were 4 or 5 years ago. I've only been at Maersk for 3 and a half years, but even in that time, you know, consumers used to say: "When is my container arriving?", and now they kind of like to know: "When is it arriving, within what timeframe? How is it getting there? Is it going to be carbon neutral? Can I put it on a low carbon neutral route? Send it the lowest carbon footprint way?", etcetera. So it's a function of everybody maturing about data. Private lives––you know, we've all matured about data in our private lives as well as in our business life. Right?

Tomer Shiran:

Yeah. So your architecture, now, when it comes to data and data infrastructure, of course, is evolving. And that's why we're talking here. But maybe we can start by...just giving us a little detail on what it looked like before. What it's been like, how things were built.

Mark Sear:

Okay, so go back 3 and a half years. When I joined, our CTO, Navneet Kapoor, an awesome guy, came up with a new vision to move us forward with data, to move us forward infrastructure-wise, etcetera. We had a whole bunch of legacy stuff, we had Teradata. Big Teradata estate. I think it was one of the biggest Teradata estates in Europe. We did not have everything in the cloud. Lots of things were simply not in the cloud. We also had Palantir, and so he kicked off this journey of, let's make things simpler. Let's make things work better, and let's make the end-user experience better. So we started the journey towards that. Then, decomming all the old systems, [and] moving towards a data lake, initially constructed and still held, even to this day, on Microsoft Azure, which you know is very scalable, very accessible, etc., but still lacking that little bit of, I suppose we could call it user-friendliness, at that stage, for users. So then the question becomes, how do we make data ubiquitous? How do we deliver it into the hands of everybody? And how do we do that cost-effectively? You know, it's pretty crucial to do it cost-effectively, right?

Otherwise, you know, if the cost of your data balloons, you're probably not in a better situation than you were with Teradata, for example. So we did that, and then we waited and rather fortuitously came across Dremio. [We] piloted it, proof-of-concepted it. Yeah, it's cool. It's good.

Requirements

Tomer Shiran:

So you talked about some of the requirements being, you know, as being cost efficient; that the TCO has to enable a broad set of users to achieve, maybe, more self-service? Were those the key requirements? Were there other important requirements, when you were thinking about this new kind of architecture?

Mark Sear:

The number one key requirement is, let's give people access. Let's make sure that people can get use out of that, because there's no point in having a data lake if you, in effect, to use the analogy, restrict [it to] those people that can fish. If you have to be a technical genius to use it, or if you have to know the ins and outs of security to use it, it's not a data lake. It's just a walled garden, and you can talk about it, but you're not going to get a lot of use out of that. TCO is one thing. Also, for us, what's crucial is: let's build on open standards. Let's build on open source. Let's make sure everything we do is open. We don't want to be locked in. Probably many of the people on this podcast have used Teradata. I mean, it's been a great product, a great servant. But you kind of [are], I would contend, somewhat locked into it. It's not easy to move. By putting our data in open formats and plopping a query engine on top of it, if we wanted to, we could move fairly rapidly from Dremio to, I don't know, maybe Trino, with a security layer, or whatever we want to do. So for us, we mustn't box ourselves in and make those architectural decisions that mean we're locked in for a decade.

And, Dremio, you guys have been incredibly open about that. And you've adopted open source. You helped us on our journey toward Iceberg, which we think is spectacularly important.

And yeah, that's where we are.

Security

Tomer Shiran:

And is security an important kind of criterion? Maybe data governance? Is that something you care about in your world?

Mark Sear:

Me, personally? Ha-ha. Yeah, absolutely. It's essential. I mean, if you think about the people that we're moving goods for, you should not, and you cannot, break those Chinese walls. If you happen to be moving for Nike, Puma, and Adidas, your Adidas salesman shouldn't know what the Nike salesman knows, or whatever. So for us, that fine granular security is super important. And of course, those models, they're quite complex to build. So you've got this sort of equation of: do I build it in-house, or do I go to someone that's built it and proven it? And we chose to go down that route for the next period, and we'll [see what] it comes out with.

I think if you haven't got a highly secure environment these days, you're going to have really deep problems fairly quickly.

Data Sources

Tomer Shiran:

Yeah, yeah. And where's all this data coming from? Like before it gets to the lake. Like where is it––to what systems or what platforms?

Mark Sear:

It's coming from all over the place, some from SAP, some from Salesforce. I guess that doesn't surprise too many people, you hear those 2 big names out there. We've also got what we call platform teams, and that's not a platform in the sense of the way a lot of people think about platforms. A platform supports a specific business function. So very, very vertically focused, but tightly, horizontally integrated. So we're moving to an entirely event-driven, API-driven environment, so that people exchange messages from these things, moving away from the big monoliths that we've had in the past of a giant SQL database in the middle, and a million people trying to work out what to do with it. So it comes from all over the place, even from, you know, when you see containers on the back of trucks, and they've got those little freezer units on them, or the little freezer units themselves. Actually, of course, if you're shipping several tons of let's say meat from Argentina to, I don't know, to New York, whatever. I don't know if we do that, but let's say we do. It's got to be frozen. So even those freezers that are on board the vessels will be sending data to say what's happening.

So you've got a whole bunch of stuff. The data is coming from everywhere—cranes in dockyards. People walk around dockyards so they don't get squished by automated machines driving around 20, or 30 tons of material at a time. So yeah, data is, it's everywhere.

And I have to say I love it.

Use Cases

Tomer Shiran:

That seems important, making sure people don't get squished. That's a good use case. Cool. What kind of––on the analytics side, on the end user side––what kind of use cases? Are they primarily BI workloads, or are you starting to see, you know, we hear a lot of folks these days experimenting with Gen AI and LLMs on that same data once you have this kind of a platform? What are you seeing in the organization? Who's taking advantage of it? Where is that going?

Mark Sear:

I think it comes as no surprise that the bulk of workloads that you see at the moment are quasi-traditional BI. So we're moving away from old tooling, we've got a whole bunch of Power BI out there. And that would probably be, in the traditional pyramid, that would be the bulk. I'm sure that's no surprise to anybody, and I'm sure if you go to most organizations, that's the case.

But what Navneet has said is, he wants us to be leading lights in AI, LLMs––the whole box and dice of AI. So if you take AI as machine learning, AI, and LLMs, we're doing a whole bunch of work in that area. For example, we've built our own little––I guess you'd call it a generative AI-type product that allows people to type natural English questions against Dremio, and we'll return result sets to them for their questions. It'll also chart it, we've got that. We've got people now starting to get into rudimentary, what I would call business science, and that's really where you sit a data scientist with the business, and start to come up with [the] algorithms. So models will predict who is the best salesperson to take on this type of account. When is this vessel likely to arrive? How do we move things around? And you know, we've got a digital twin of a number of our ports now as well. That helps optimize how the ports work. So things are fast.

It's kind of like, I don't know if you follow the Tour de France, but I think if you imagine going up the [...], which is a pretty steep mountain, how you feel at the top if only you were allowed to turn around and go down at high speed instead of keep going, which, of course, the poor Tour de France guys do, but it now feels like we're not at the top of the mountain, but we can at least see the top, and we can imagine how great it's going to be when we come down the other side. So we're getting to that stage now, we're starting to see those exciting use cases crop up in the field of AI modeling and large language modeling. So yeah. It's an exciting place to be in.

Tomer Shiran:

Cool. No, I've been out to the [...]. It was in the winter, though, and it was mostly going downhill. I was getting some assists––I'm sure that was a good fit for my abilities. But I get the analogy. Very cool. So you know, right before we started this webinar you were showing me a screen of that internal application. I thought it would be cool, I don't know if that's something you could show us.

Mark Sear:

Now I can't show you, obviously, any of the answers to the questions, because that would be just showing people our business. But data query, we've got a few hundred users on it now. It's in pilot. It's going to go fully live about the start of September, hopefully, so the first of September.

These are the sort of questions that people can ask it. What have been the 5 most profitable routes in the last 6 months? Now, if you imagine, the SQL to build that can be pretty complex. But you don't have to be smart to type that in. Okay. So all the clever stuff is done under the hood, and it will bring back to you, obviously, a table of results there.

This same large language model, we're also using to show data that's internal data. So show me the Java code to extract a vessel. What is a truck? Believe it or not, for some people, that's a very specific definition, a truck versus a semi, versus, etc. What are the 20 most profitable routes? How many bookings were there by an operator in 2022? Show a total and broken down by reefer and non-reefer. Reefer is the containers with freezers, versus those without. So you can see there you're getting some pretty deep questions that you can ask without having to learn SQL.

And that to me means several things. I mean, the primary thing is, it means we can, not saying de-skill, but we can up-skill our users. Because why should they have to learn SQL to access their data? I don't think you should, right? I mean, if I want to use my vehicle, I don't have to know how to get oil out of the ground and then build a fraction tower, refine it, etc.

I just want to put it in my car and go. And if you think about what most business users want, they want to put the answer to those business questions in their hands and go. And that's what we're all about doing now, is doing that at that level and then bringing in the AI modeling on top of the lake, allowing the data to speak to the users without them having to know what's happening. So in the background, really working on those types of problems for them.

TCO

Tomer Shiran:

Yeah. So there's a theme here of democratizing, and you know, that was, as you mentioned, the reason or one of the reasons for deploying Dremio, and just kind of making data more self-service, putting it in the hands of more people. Obviously, going and enabling people to ask natural language questions, you know, continues with that theme.

Let me ask you about the total cost of ownership, TCO. In general, you went from a kind of legacy platform to a modern platform. What do you think about the cost savings, and what that does for the business, going from a data warehouse to a kind of lake or lakehouse?

Mark Sear:

I think that's a really interesting question, and we may not be far enough ahead to know the true TCO. But to me, and this is a personal opinion, TCO is almost the wrong calculation to use. Because how can you put a cost and say, what is my cost of ownership, for a query, for a solution that you couldn't do before, that you can do now, that may bring you 100 million dollars of business? So if I said to you in your position as––are you CTO? If I said to your CEO, I'm gonna charge you, you're gonna have a million dollars cost of ownership for this new mouse, but you're going to double your revenue, he's not going to worry about that TCO. He's going to worry about, “Wow! I just doubled my revenue because I bought Tomer a new mouse.” And that, I think, is what we need to do here, is kind of get people thinking about the right equation and thinking about, what am I delivering? And am I delivering it at optimal cost? So it's not about the total cost. It's about the optimal cost of ownership for me. It's a conversation, I think, that needs to happen in a very widespread way across our industry.

Lessons Learned

Tomer Shiran:

Yeah. And I'm sure there's also, you know, as you modernize the stack here, that people element is probably important as well, right? Not just talking about the people that are consuming data, but just how do you go through a transition in a large company? Right? What are some of the kinds of lessons learned? The best practices, maybe, for folks in the audience, just based on your own experience.

Mark Sear:

I think the best practice is to trust your people. When we started to pivot our technologies, we were very open that we were starting to move things. And there were a whole bunch of our guys down in Bangalore. And of course, yeah, UK, and Denmark as well.

And pretty much everybody said, “Yeah, cool. Let's change.” We trusted them to learn. We gave them the skills to learn, and I think they trusted that the tech would deliver, those decisions would be made, and like everything else in life, you got a good leader, and Navneet is superb at articulating what he wants, what the vision is. If you've got that good leadership, it makes things easier. And the key thing is you have to have a clearly articulated vision. If you just turn up with like, “Oh, yeah, here's this new tech over here, give it a go,” that's never going to cut it. Because people, how do they buy into that message?

You can only buy into it by being encouraged to try it. See it work, pick it up, start to move in that direction, and be on a clear, almost like an on-rails decision, where you can't just turn right and say, I'll get off the train, or I don't want to do it. You've got to have that clear articulated vision, that would be one of my lessons.

The second lesson, I would say, is if you sign up with a vendor, make sure that you can absolutely trust them, and don't waste their time or your time. Don't absolve yourself from your level of responsibility. When you're implementing, make sure you go to them only with questions that you actually need to have answered, not questions that you think you'd like to have answered. So I think there's a phrase in English that may not translate to American or any other language, which is “Physician, heal thyself.” And it basically means, take care of yourself, right? Don't come to me, just heal yourself, and that's my approach to my guys: if you get super stuck, ask for help. If not, figure it out. Take responsibility for yourself. Bootstrap yourself. Take advantage of all the things that are out there and collaborate. So I don't think there are necessarily any different lessons than with anything else. But be free and be open, because things are changing fast and they're going to continue to change fast.

Supply Chain Logistics

Tomer Shiran:

Talking about changing fast––and I see we have a bunch of good questions coming from the audience. So one last question from me, before we dive into that.

Where are things heading in your world of supply chain and logistics? There's so much change we see now in the world with AI and LLMs. And you know they're becoming pervasive, my kids are using them now, it seems every day, when it's not the summer vacation, for school stuff. Where is that?

Mark Sear:

What about the world? Right? People are not going to be able to, schools are not going to be able to set essays anymore. That's dead, right? If you're a school that's setting essays, you're in the wrong place already, because you're gonna get your answers from one LLM or another, or you'll get it from one, and you'll do page milling and put it into another to say, re-author it so you can't trace the types of phrases. A lot of news stories are written these days.

I think the differentiators that we have are, we have scale, we have an end-to-end vision. So people will want to know, more in real time, where things are. Now, if you think you've got ships doing one part, a second person doing a bit of trucking, and a third person doing warehousing, you've got yourself quite a big integration job there to know what is happening across this whole supply chain.

One of the things that we're going to do is stitch that together, and we are stitching it together, so people will know where everything is at one time, and that is a big challenge for us. And then, of course, as I said before, we've all seen what the temperature––well, I was going to say the temperatures around the world, apart from the UK where it's chucked down with rain also, which is depressing. Yeah, we've all seen what's happening to temperatures. With the climate situation, it is now becoming clear that people want change, and we're reacting very, very quickly to that by putting out our methanol ships. And so the first one, you'll see it, I think it arrives in Copenhagen in 6 weeks, so I'm sure that will make news headlines. Worth seeing, looks amazing, and zero pollution, it doesn't get any better. Right? So for us, the challenges [are to] help the climate, help the planet, help our customers. And you know, full steam ahead, so to speak. You know, our strapline is “All the Way,” and that is what we are doing, and everybody in tech, we are going all the way to deliver for our customers, both internal and external.

Audience Questions

Tomer Shiran:

Cool. Well, I'll pass this now to Alex. He can help us with some of these questions that are coming in.

Oh, you're on mute, Alex.

Alex Merced:

Oh, the mute button! But here we go. First off, I want to say thank you to both of you for being on the show this week. This was an absolutely fascinating conversation, and a very enlightening one, so thank you very much for that. But we got a lot of interest from the audience. And again, if anyone else has any other questions, do put them either in the Q&A box or in the chat, and I will start passing them along. Okay. So our first question is from someone who's kind of in the same position, trying to create something similar for their firm, as far as the chat application. They just want to know if there was any advice on where to begin and how to start looking into that.

Mark Sear:

Yeah, just pop me a message. We'll collaborate with you. We'll give you some code, why not? This is open source, right? Nobody, you know, we're not going to give you anything that's going to be of massive commercial sensitivity, but we'll collaborate with you. Get in touch with whoever your Dremio rep is, put them in touch, and we'll gladly have an hour long meeting with you, showing you how we approached it. Maybe even share some code. That's not a, you know. It's not obviously commercially sensitive.

So yeah, no worries. We'll do that for it. It's a pleasure.

Alex Merced:

Awesome. Thank you very much. And then the next question is, do you use Dremio for data engineering, like creating tables, etc., or primarily for querying?

Mark Sear:

I think it will be fair to say that initially, it started off as just the querying, because we had kind of a legacy cloud environment, which sounds crazy, but things were moving so quickly, we just did. So initially, it was just that. Now we're starting to, we've got a full dev, test, prod, promoted by GitHub events, that take you through all of that. We're now starting to see people create tables, create a whole bunch of––what's the correct term in Dremio––reflections, things like that to speed up queries. So we're now actually starting to see people, data engineers, use it as an environment and building pipelines in it as well, because a lot of pipelines, people think, I'll build them in Spark, or I'll build them in something complex. Most pipelines are actually relatively simple. So we're now starting to see that. Why? Because it's a lower cost of ownership for us, and that’s what we want. The lowest, most cost-effective environment that delivers the results that you need for our business. So yeah, that's the journey we've been on.

DBT

Alex Merced:

Awesome. And then our follow up to that question is, do you use something like DBT to manage the SQL for these pipelines?

Mark Sear:

I'm not gonna lie. I don't use it on a day-to-day basis. But again, if you squirt me an email, or whatever, you're welcome to have a chat with my team on that, no problem at all. As a separate spin out, we'll either email you or tell you. So I apologize. I can't answer that.

Alex Merced:

All good. Okay, here. This one's a bit of a longer one. So here we go.

We've been told you use Power BI as your main reporting tool. Could you tell us about your experience with PBI tabular models, a.k.a. data sets, that are based on a Dremio data source in DirectQuery mode? We've been struggling for over a year to get good performance on large data and non-trivial tabular models, even with the best reflections and sub-second Dremio response time to every query. PBI often generates a dozen queries for one PBI visual, and they are sent sequentially. There are some features that Microsoft is releasing to compensate for this: horizontal fusion, dynamic M parameters in DirectQuery. I guess, bottom line, is there any best practice when using something like Power BI to get a little bit more juice out of it?

Mark Sear:

I think it kind of depends on what your––I mean, you know, ‘how long is a piece of string?’ Without knowing the data sets, that's kind of a bit of an odd question to ask. I suppose, I would say, use less data. To be honest with you, without seeing the data––again, open offer, we're about collaboration. You can speak to my guys directly. No problem at all. We’ll do that for you, with the greatest of pleasure. So I know that sounds like a bit of deflection, but I don't want to sit here and say, if you index the third column, you'll find it all works, because I simply don't know if that's true. Though you could try it. Or the fourth. [laughs] Yeah.

Platform Architecture

Alex Merced:

The next question is just sort of like a confirmation of the architecture. So they're saying, Okay, it's an LLM, plus Dremio, plus ADLS. Is that the general sort of architecture?

Mark Sear:

Yeah. It’s Azure ADLS with Iceberg on top of it. And then, to the side of it, we've got an OpenAI model that's hosted internally for us, via Microsoft. And we bounce our queries off of that, because obviously we don’t want the queries or data or anything leaving our environment.

So we basically use that to generate the SQL, we apply the SQL, drop it into Dremio. It comes back with the results. Ba-da-bing, ba-da-bosh.
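Editor's illustration (not part of the recording): the flow Mark describes, where a question goes to a model, the model returns SQL, and the SQL is run against the query engine, can be sketched roughly as below. This is a minimal sketch under stated assumptions, not Maersk's implementation. The `bookings` table, the `fake_llm` stub, and the read-only check are all hypothetical, and Python's built-in `sqlite3` stands in for Dremio so the example runs on its own.

```python
import sqlite3
from typing import Callable, List, Tuple

# Hypothetical schema description handed to the model (not from the talk).
SCHEMA_HINT = "Table bookings(route TEXT, profit REAL)"


def build_prompt(question: str, schema: str = SCHEMA_HINT) -> str:
    """Combine the schema and the user's plain-English question."""
    return f"Schema: {schema}\nWrite one SQL SELECT statement answering: {question}"


def is_read_only(sql: str) -> bool:
    """Crude guard: accept a single statement that starts with SELECT."""
    stripped = sql.strip().rstrip(";")
    return stripped.lower().startswith("select") and ";" not in stripped


def answer(question: str, llm: Callable[[str], str],
           conn: sqlite3.Connection) -> List[Tuple]:
    """Ask the model for SQL, validate it, run it, return the rows."""
    sql = llm(build_prompt(question))
    if not is_read_only(sql):
        raise ValueError(f"refusing to run non-SELECT statement: {sql!r}")
    return conn.execute(sql).fetchall()


def fake_llm(prompt: str) -> str:
    """Stand-in for the privately hosted model; returns canned SQL."""
    return ("SELECT route, SUM(profit) AS total FROM bookings "
            "GROUP BY route ORDER BY total DESC LIMIT 2")


def demo() -> List[Tuple]:
    # In-memory SQLite database standing in for the lakehouse query engine.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bookings (route TEXT, profit REAL)")
    conn.executemany(
        "INSERT INTO bookings VALUES (?, ?)",
        [("A-B", 10.0), ("A-B", 5.0), ("C-D", 7.0), ("E-F", 1.0)],
    )
    return answer("What are the 2 most profitable routes?", fake_llm, conn)


if __name__ == "__main__":
    print(demo())
```

In a real deployment the stub would be replaced by a call to the hosted model and the SQLite connection by a connection to the query engine; the validation step is one simple way to keep generated SQL read-only before it is executed.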

Alex Merced:

Got it, and I think there was a follow up to that question, which is, again, if you can specify which LLM model of the OpenAI ones you are using. If that’s not––

Mark Sear:

We've got Turbo, 3.5 Turbo, and we're moving to 4. We're testing 4 at the moment.

Alex Merced:

Awesome. And let's see here. Oh, sorry. The next question is, is the platform on cloud or on-prem? If it's on cloud, is it self managed or fully managed, like a Dremio cloud?

Mark Sear:

We've got a really cool, very good cloud team, as part of the area that I work in, which is called TSE––technical service engineering. They manage it for us. We've built our own K-X clusters, so we spin up a K-X cluster with Dremio on it that's fully managed for us, and the rest of the lake is Azure, and we manage it ourselves. And to be really honest, that's not because we don't want to use external clouds, but our security requirements are extraordinarily tight, because we move goods around for so many people, and we don't want that to leave our own cloud. So nothing against the Dremio version, which I'm sure is pretty nice. Not the biggest compliment because I don’t know what it’s like.

Alex Merced:

The next one is, how are you exploding geographical data with Dremio?

Mark Sear:

How are we exploding geographical data?

Alex Merced:

I guess more like, do you have geospatial data in your data sets in Dremio and I guess how are you using them?

Mark Sear:

We tend not to have any geospatial data in it. We don't really have a lot of need for it, is my guess, but I haven't seen any, anyway. Put it that way. So we have coordinates data, obviously.

And we're looking at the moment, in one area, at what3words for really precise location, because, you know, what3words gives you––I think it's a 3-meter square for everywhere. And it's words, it's not coordinates, because I don't know about you guys, but I can't find my way to minus-fifty-point-zero-zero comma. Not any more. I could in the days of a military satnav, but not anymore. So what3words will be there. If there's geospatial data there, I don't know about it. I apologize.

Alternative Technologies

Alex Merced:

What were the alternative technologies which were evaluated? Are there any other open platforms on similar lines? Basically, what made you go with Dremio? What was the ‘it’ factor that sold you?

Mark Sear:

Honestly this is an interesting one. There's clearly a lot of technologies out there. There are technologies which do similar, but use proprietary formats and are hugely expensive and charge on a consumption basis.

We didn't want a consumption model because we don't want a model based on how much money you're going to make out of this use case. Because, you know, if I buy a car from Mercedes, I don't expect Mercedes to say, well, what are you doing with this car? And I say I'm going to transport the King, and they say, oh, well, in that case it's a million dollars, or I'm taking Jay-Z, well, it's 5 million dollars. I don't want that model, and we don't want a model where you pay by consumption, because again, the more successful we are, the more we will use. Why would I wish to pay a––I think it was Warren Buffett who says the best business in the world is a toll booth on a road, right? I don't want to pay that toll booth every time I run a query, so those are the dollar factors. The technical factors, we tested––it's super robust.

And then there are soft factors, and I think the soft factors people often forget, and they forget because they're quite hard to quantify. What is the company actually like to work with?

And I think that's a differentiating factor that not many people have. I’m very lucky––my account manager, whatever his title is, Jamie Allen, Jamie, he's the sort of guy I can text at 10 o'clock at night and say, I just had a user complain to me, what was going down. And at 11 o'clock at night, he'll have sobered up, and he'll text me the answer. And he'll go that extra mile, and that means a lot to me. You don't want to be doing business with 9 to 5 companies.

I also want access to the senior people in the company. This is a personal thing, because I like to know where the roadmap is, and I like to affect it, and I've been very lucky to have had several meetings with Tomer, and I think we've at least allowed them to shape-shift what they do slightly, to cater to the things we need. And those are all the soft things. But people forget that ultimately, humans don't do business with purchase orders or checks or whatever, you do business with other human beings.

And I would say that in my experience it's the exact opposite of gnarly, doing business with Dremio. It's pretty smooth. I used to surf a lot in Australia, and it's the equivalent of landing on a completely calm thing on your board, just knowing there's no waves coming along. But when a wave does come along, it’s very reactive. So there you go. That's the end-to-end journey.

Data Modeling Techniques

Alex Merced:

Awesome, love that. The next question is, how would the data modeling techniques, like Data Vault, etc., help replace Power BI modeling? How are these different?

Mark Sear:

I think Power BI modeling is not truly about modeling. It's about assembling large flat tables of data. Modeling in a more flexible manner is what we are all about. We want people to be able to ask any question, not the question that you've predetermined fits into the dimensional model. Yeah. So for me, as we spoke about democratizing data, it's actually more about empowering people. It's about putting everything in the hands of the business, not saying, “oh, you have to come back to us if you want another column put into that because only we know how to do it.” Or, conversely, “hey, let's let everybody create models, and we'll have a thousand duplicates around the place.” It's about just putting that little bit of thought in there. And it's a moving feast, right? You have to be aware of what people want all the time. Just slightly different.

Questions

Alex Merced:

Awesome.

“I'm a consultant at Stena and chose Dremio as a data management platform. I worked a lot with Maersk in a previous role. Thanks.” So I think it was just a thank you.

Mark Sear:

So that's great. If you work for Stena, I'm going to be driving to France, if you've got any discount codes for ferries. Nah, I’m only joking.

Alex Merced:

Okay, next question. Oh, this one's kind of out of scope. So right now, we're not doing any kind of performance benchmarks here. We'll probably have other episodes where we'll talk about things like different platforms versus Dremio and whatnot.

Can you explain the contributions of Dremio to data product and data mesh practice? Does it have every kind of native adapter for connecting to various databases?

Mark Sear:

It has the bulk of them. One particular one that we are missing, which we had a meeting on last week––this will give you an example, as well, of working with a company that's flexible––is Kafka. We're looking for a Kafka adapter. Kafka wasn't prioritized until the end of the year, but we want Kafka now. So we had a meeting last week to kick off the Kafka project. That's the only thing I can see that is missing from our environment. We've got pretty much every relational database. Obviously, you've got flat files and that sort of stuff. SAP HANA, I think, is on the roadmap. We don't need to look at that until Q4. But yeah, pretty much everything.

There's nothing I can think of that makes me jump up and down, and if there was I wouldn't hold back in telling Tomer so.

Best First Steps

Alex Merced:

Awesome. Okay. That's most of the questions. I do want to ask one last thing. People are listening, and they’re probably like, “okay, this is an exciting thing to kind of get started with.” If someone were looking to get in touch with Dremio, and start making those first steps, what do you think would be, sort of like from your experience, like the best first step to start evaluating Dremio and start thinking about how it fits into their data story?

Mark Sear:

Well, I think there are 2 ways to go. The classic way to go is to get hold of a test edition and just play with it yourself. For me––I prefer to do a little bit more research first, so I would go almost a traditional route. Get a non-disclosure agreement, sit down, and talk about what you actually want to achieve. And to be blunt, Dremio do seem to just treat that as a cost of sales. Right? Don't listen to everything, all that glitters is not gold, as they say. So be a bit cognizant that, of course, they're going to do that. Just follow the classic process, ask for 2 or 3 reference calls, make sure you understand it, and test it out. And if you get problems, ask them to solve those problems as part of your evaluation. So chuck something quite meaty at it, something you haven't done before. But before you do it––and this comes back to the classic wide-table problem––don't build a 200-column table and then say to whoever, Dremio or Databricks or anybody else, “hold on, this doesn't perform.”

Ask them upfront, “what is the best way of achieving this solution with my data set?” Just get that little bit of advice, because that way, you are learning all the time the best practice for the product, and often people don't do that. They dive headlong in, thinking they know everything, and say, well, I used to do it like this on Teradata. And now I'm going to do it like this on Dremio. Then they wonder why it doesn't translate.

There is, for every product, a learning process, like if you're switching from a diesel vehicle to an electric vehicle. You have to start thinking of different things, right? With the diesel, you can drive it until the very last minute and pull over, fill up in 3 minutes and drive off. With an electric car, you've constantly got to be thinking, where am I going to fill up? How far can I go when I get there? Do I need a coffee for an hour while it fills up, yada-yada-yada? So you need to think of it like that. As with any new technology, adapt yourself to the situation that you are likely to encounter with that new technology and leverage the experts. You guys have got some good––like at Dremio––I've got some really good technical people, so leverage them. Put in a bit of time upfront, then try it out. See what your users think. That's what we did. And at the end of the day, they liked it. It works. We're happy, -ish. We're never going to be really happy. You let them know that. Never let your vendor know you're happy.

How Results are Checked

Alex Merced:

A couple more questions came in! With no-code, prompt-based queries, and getting results from such queries, how are results checked to ensure that the AI is returning the right values and metrics? Is there any checking of what the response is?

Mark Sear:

Yeah, there's 2 things that happen. First of all, the way that we built it in Dremio is that when you submit your query it submits, with that query, a predefined metadata structure.

So we know that the answers that come back will work. There won't be a Cartesian product in there, etc. So we check that quite tightly. So by doing that, that eliminates a lot of it. Second check is always a really good one, which is the users, because the users are––I don't know about everybody else's users, but our guys are phenomenal. If you deliver them some wrong data, it's not going to be more than 4 or 5 nanoseconds before they say that data maybe doesn't work or whatever. So that's the second way. And the third way is we store every query. Then we actually look at those queries to see, “this is a complex group, why did that take so long to execute, etc?” So you start to get a good feeling from that.

But primarily you shouldn't get those problems because you're submitting the whole metadata of the underlying tables at the same time, as you do it. So it's no different than letting people do it themselves and see it, right?
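
The three checks mentioned above (metadata sent with every prompt, no Cartesian products, every query stored for review) might look something like the following. This is only a sketch under stated assumptions: the metadata layout, the table names, and the text-based join heuristic are all hypothetical, and the talk does not say how the real checks are implemented.

```python
import json
import re

# Hypothetical table metadata; the talk does not describe the real structure.
TABLE_METADATA = {
    "shipments": {"columns": ["shipment_id", "port_id", "weight_kg"]},
    "ports": {"columns": ["port_id", "port_name"]},
}


def build_prompt(question: str, metadata: dict) -> str:
    """Attach table metadata to the question so the model sees the known schema."""
    return (
        "Write one SQL SELECT statement using only these tables and columns:\n"
        f"{json.dumps(metadata, indent=2)}\n"
        f"Question: {question}"
    )


def looks_like_cartesian_product(sql: str) -> bool:
    """Crude text heuristic: CROSS JOIN, JOIN without ON/USING, or comma join without WHERE."""
    text = re.sub(r"\s+", " ", sql.lower())
    if " cross join " in text:
        return True
    joins = len(re.findall(r"\bjoin\b", text))
    conditions = len(re.findall(r"\b(on|using)\b", text))
    if joins > conditions:
        return True
    from_clause = re.search(
        r"\bfrom\b(.*?)(\bwhere\b|\bgroup by\b|\border by\b|$)", text
    )
    has_comma_join = bool(from_clause and "," in from_clause.group(1))
    return has_comma_join and " where " not in text


QUERY_LOG: list[dict] = []


def record_query(question: str, sql: str, seconds: float) -> None:
    """Keep every generated query and its runtime so slow or odd ones can be reviewed."""
    QUERY_LOG.append({"question": question, "sql": sql, "seconds": seconds})


prompt = build_prompt("Total weight per port?", TABLE_METADATA)
candidate = "SELECT * FROM shipments, ports"
print(looks_like_cartesian_product(candidate))
record_query("Total weight per port?", candidate, 0.0)
```

A string heuristic like this will miss cases that a proper SQL parser or the engine's query plan would catch; it is here only to make the idea of a pre-execution check concrete.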

Future State

Alex Merced:

Perfect. Okay. Final question––great data points! At what time capsules and years do you envision your future state? And where do you see Maersk's roadmap? How do you administer the moving of 900 vessels in terms of lost connectivity in the deep sea? How do you manage it in real-time?

Mark Sear:

Okay. So I'm clearly not a ship's master or vessel expert. It's not really my area of expertise. I'm quite nervous about the sea, because it's quite big. If you think about it, you can learn how to swim, but if your boat sinks in the Atlantic, you aren't swimming to Africa or America. You’re just fundamentally dead. So it's not my thing.

But I would suggest that maybe Elon Musk holds the answer to the data. Right? I mean, I've got a Starlink satellite that is portable, number one. I can take it when I go down to my house in Italy, I can turn it on there. I've got full connectivity. I think the connectivity issues will disappear over the next 3 years. I think data is just going to grow and grow and grow. But I think the way that it's exploited will change, and I think you'll see the rise of AI, and the roles of every data engineer on the planet will change very significantly over the next 3 years. The question I get asked a lot is, “am I going to lose my job to AI?”

And the answer, I think, is, if you don't embrace AI, what is happening and the new techniques that are out there, you will not lose your job to AI. But you will lose your job to a human that has embraced AI. And that's where I think the whole thing will go. I think you'll see data engineering changing radically. For those who don't believe me, have a look at a product called Noteable.

Just see how fast that generates Python code for you and the things you can do. It's mind-blowing. We're just looking to plug that in right now into Dremio. The role of a data scientist has already changed overnight with that one tool. So be aware of what's coming. I'm 59 in a couple of weeks, so who knows if I even have a future?

Tomer Shiran:

He looks 30, but it's all those outdoor climbs, you know.

Mark Sear:

I wish I did.

Alex Merced:

Well, that brings us to the end of our program. I want to say again, thank you both for being here this week. It was a wonderful conversation, and an early happy birthday. And again, everyone, come back next week. Next week will be Apache Iceberg office hour, so again, Apache Iceberg was part of this story today. So if you want to learn more about how you can implement Apache Iceberg and have any questions about doing so, come back next week, we got some answers for you. But again, thank you, Mark and Tomer, and I'll see everyone next week.

Mark Sear:

Cheers, thank you. Bye, bye.

