15 minute read · April 5, 2016
What are Dremio and Apache Arrow?
Transcript
Andrew Brust: Welcome to Big Data and Brews. I’m Andrew Brust with Datameer. We’re here at Strata + Hadoop World, live in San Jose, California, and we have with us today Jacques Nadeau of Dremio.

Jacques Nadeau: Hi there.

Andrew Brust: How are you?

Jacques Nadeau: Good, good.

Andrew Brust: Welcome, welcome.

Jacques Nadeau: Thank you.

Andrew Brust: Yeah. What time is it? About 11:00 in the morning, I think, so maybe we’re granted an exception today. You’re now CTO at a new company called Dremio. Dremio seems to have a close relationship with an interesting open source project called Apache Arrow.

Jacques Nadeau: That’s exactly right. We actually worked very closely with a bunch of different open source organizations, as well as a number of companies, to launch Apache Arrow last month.

Andrew Brust: Apache Arrow … Actually, we’ve had another guest here today and we were already discussing Apache Arrow with him. It seems to involve a number of contributors and also a number of vendors, even if they’re not necessarily acting on behalf of their companies. There’s a lot of de facto industry cooperation around this project, and as I understand it, it went straight to top-level status. For those who don’t know, most Apache Software Foundation projects go through what’s called the incubator for a while first, really just to see if the project has legs and to allow it to mature. You guys went straight to graduation, basically.

Jacques Nadeau: That’s exactly right, and it really has to do with what we’re doing and who’s involved. Right? So what happened was, we saw … Arrow, before it was known as Arrow, started last summer. We were talking to a bunch of people and seeing a lot of common patterns. Basically, people struggling with the same kinds of things, which is that … okay, we’ve got disks fast enough, we’re reading off of disks fast enough, but we want to do query execution, data analysis, data science, machine learning, those kinds of things.
The bottleneck starts to become the CPU. And so you’re like, “Okay, well, how can we make better use of the CPU?” That’s really the core of what Arrow is about. It’s saying, “Hey, let’s come up with a set of data structures which do a much better job of taking advantage of the CPU’s capabilities and use it more efficiently.”

That was really a bunch of different groups recognizing the same pain point. Whether you talk about the guys working on Spark and Tungsten saying, “Hey, we want to make these things go faster,” or you look at, say, the research paper that came out last year, I think it was, talking about how Impala was built, where one of the key notes at the bottom of the paper was, “Hey, we would really like to move to a canonical in-memory representation.” Right? So with the Drill stuff that I’ve been working on as well, we also saw-

Andrew Brust: Drill being another Apache open source project?

Jacques Nadeau: Yes, yes.

Andrew Brust: This one kind of focused on SQL on everything.

Jacques Nadeau: That’s exactly right.

Andrew Brust: Hadoop being one of those things, but not the only thing.

Jacques Nadeau: Yeah, so my main role at MapR was really driving the Apache Drill initiative and building that into a very powerful processing engine. As part of that, we started to experiment a lot with in-memory columnar. If you think about the world, it’s kind of like crawl, walk, run, and in the case of data, it’s typically … crawl is going to be your row-wise execution. Walk will be something along the lines of in-memory execution. And run is really about columnar in-memory execution, and that’s really where everybody wants to go.

Andrew Brust: Where maybe you can operate on whole batches … Not batch in the Hadoop, slow-batch sense …

Jacques Nadeau: Yeah, not that batch. Yeah.

Andrew Brust: But bunches, bunches of rows at once.

Jacques Nadeau: Yes, that’s exactly right.
So basically we break data down into chunks, and then we orient it so that data that’s similar is together, right? So for example, say you’ve got an integer column and a string column. Normally, the way data is held in memory is this row-wise representation, where it’s going to be integer, string, integer, string in memory. And so if the CPU wants to interact with only the integers, it has to step over the strings all the time.

Andrew Brust: Sure.

Jacques Nadeau: And that means you spend a lot more time retrieving data from main memory, bringing it to the CPU cache, and having the CPU step over all of that. You also always have to look at how long each string is before you know where the next integer is. Okay? So what we do is orient things differently, so that the integers are by themselves and the strings are by themselves. This is a simple example, but in that situation, if you just want to interact with the integers, the CPU can see just the integers, knows exactly how they’re offset from each other, and can be very, very efficient pulling that data from main memory into CPU cache. Right? That’s a huge benefit to end-user applications, and that’s one of the really amazing things about Arrow …

Andrew Brust: And cache could be … Most people don’t know, but cache is even faster than RAM.

Jacques Nadeau: Yes.

Andrew Brust: RAM is not the Holy Grail.

Jacques Nadeau: Yeah, no, actually that’s exactly it. What’s happening now is we’re getting more and more main memory, but main memory is not as fast as the CPU can work. So just like it’s way slower to have something on disk than in memory, the same is true between main memory and the CPU in terms of latency. Okay? And so if the CPU actually has to go to main memory to get something because it’s not in its cache, you can actually slow down the CPU substantially.
By 10 to 100 times.

Andrew Brust: So I think of those algorithmic trading guys who want their offices as close to the exchange as possible, because it lowers the number of wire miles between them and those machines and lets them execute milliseconds faster. Maybe it’s not quite at that split-hair level, but it’s sort of the same principle.

Jacques Nadeau: It’s exactly the same, in that the goal here is to be as close to the CPU as possible, right? And that’s what it’s all about. Everybody who’s doing analytics wants it to go faster. It doesn’t matter how fast you go, they’re always going to want it to go faster, and so recognizing what those opportunities are is the key to all of this. So that’s why, as you mentioned, there’s a bunch of different people involved. You see people from the Impala project, the Phoenix project, the Calcite project, the HBase project. There’s people from Hadoop itself. We’ve got people from Deeplearning4j, we’ve got Spark people, we’ve got Drill people involved. A bunch of different groups are all involved in this.

Andrew Brust: You mentioned Parquet.

Jacques Nadeau: I didn’t mention Parquet, but I should have. There are 16 or 17 different developers from different technologies involved now, so I no longer can …

Andrew Brust: And they feel each other’s pain, apparently.

Jacques Nadeau: Absolutely. Right? And that’s the key to all of this. This is the whole open source opportunity: people recognizing a common need and working together to solve it. And the reason we’ve been able to do what we do with Arrow is that it’s a foundational component. It’s something that can be embedded into each of these different technologies, and so it isn’t perceived as a threat to any particular technology.

Andrew Brust: Right.

Jacques Nadeau: So the exciting part for the end user, that’s … So there’s a bunch of dynamics, right, in terms of how you get this sort of thing off the ground.
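The row-wise versus columnar layout Nadeau describes can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the actual Arrow API: the `array` module gives a contiguous buffer of fixed-width integers, so a scan over the integer column never touches the variable-length strings.

```python
from array import array

# Row-wise: integers and strings interleaved. Scanning the integers
# means stepping over a variable-length string between each one.
rows = [(1, "alpha"), (2, "bravo"), (3, "charlie")]

# Columnar (Arrow-style): each column lives in its own contiguous buffer.
# 'q' = signed 64-bit ints, so every value sits at a fixed 8-byte offset.
ints = array("q", [r[0] for r in rows])
strs = [r[1] for r in rows]

# Scanning only the integer column touches no string data at all,
# which is what makes the CPU's cache and prefetcher effective.
total = sum(ints)
print(total)  # 6
```

Because every element of `ints` is exactly `ints.itemsize` bytes, the CPU (or a vectorized loop) can compute each value’s address directly, which is exactly the “knows how they’re offset from each other” property described above.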
The first set of dynamics is all around making sure these people are collaborating and have their goals aligned. But the second piece is that this obviously has to be very important to end users for everybody to spend time on it. So obviously, number one, faster performance is huge, right? But the second thing Arrow brings is the ability to move data between systems more efficiently. Okay? So I was just in a session yesterday with Wes McKinney from Cloudera, and what we were talking about there was being able to move data very efficiently between execution engines and machine learning languages like Python, as well as doing machine learning inside of Spark, right?

Andrew Brust: Because if everything has its own in-memory format, then as you go from one component to the other, you’ll need to serialize out and then load it back in in the second format, and so forth.

Jacques Nadeau: That’s exactly it, yeah. What happens is, you look at these workloads and you see a huge amount of time being spent on serialization and deserialization, which is basically just wasted work, right? So people say, “Well, I’m going to take a monolithic approach, where I just use one system, because I don’t want to pay that overhead,” but that means a bunch of the people inside the organization need to learn a new technology in order to do their job.
So the opportunity here is that you can move data between different systems at zero cost, which means end users can use the technology they’re comfortable with. And then the knock-on effect of all of this is that once you start to share memory between systems, you can actually work with larger data sets on the same hardware. The reason is that right now, people use multiple different systems to interact with the same common working data set, and …

Andrew Brust: So they’re getting copies.

Jacques Nadeau: Basically, yeah, you’re going to have copies of the data set in memory, one for each different application, right? And so if you have a shared representation, you can have one copy in memory and everybody can just use that, and therefore you can obviously handle a much larger data set.

Andrew Brust: Okay, I think I get it. We’re good. We’re getting technical here, but this is kind of a technical thing, so I think that’s all right.

Jacques Nadeau: Yeah.

Andrew Brust: So it seems like in-memory in general has been a very hot thing across the industry. There’s a lot of consensus around that broadly, but then, implementation-wise, pretty much every project and every product has been handling the in-memory representation of data differently. And if you can come up with a common representation, it seems like nobody’s really objecting to that. It’s low-level enough, it’s a common cause, such that pride of ownership isn’t a problem and people are happy to have a standard there.

Jacques Nadeau: Yeah, that’s exactly right. I think the way to understand it is, it’s not just the in-memory part. Actually, most of these engines already have the ability to do in-memory execution, right? The next step is in-memory columnar, and that’s where you really take advantage of CPUs at a greater level. So the problem, and why it hasn’t happened already, right? If everybody knows it’s a good idea, why hasn’t it happened? It’s because it’s very, very hard. Right?
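The single-shared-copy idea Nadeau describes a bit earlier can also be sketched with the standard library. Here two independently opened memory maps of the same file stand in for two applications sharing one data set: the operating system keeps one copy in its page cache, and both consumers read it without serialization or duplication. (The file path and helper are made up for the illustration.)

```python
import mmap
import os
import tempfile
from array import array

# A "producer" writes one copy of the working data set to a file.
path = os.path.join(tempfile.mkdtemp(), "dataset.bin")
with open(path, "wb") as f:
    f.write(array("q", range(100)).tobytes())

def open_view(p):
    """Map the file read-only and view it as 64-bit ints (no copy)."""
    with open(p, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return memoryview(mapped).cast("q")

# Two "applications" share the same physical pages.
a = open_view(path)
b = open_view(path)
print(sum(a), sum(b))  # 4950 4950
```

With per-application private formats, each consumer would instead materialize its own full copy, which is exactly why a shared representation lets the same hardware hold a larger working set.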
So we’re talking about in-memory columnar. We’re talking about in-memory, complex, shredded columnar, which means we’re not just dealing with rows and columns. We all know that data looks like JSON these days, right? It’s complex objects, it’s documents. So building a system that can do in-memory columnar is very difficult. Building an in-memory columnar system that can work with complex data is even more complicated. And so it’s not a lack of desire; it’s the fact that it’s a very difficult task, and that’s why everybody is attracted to this: we can work together to build something that’s never been built at this level of quality and capability before.

Andrew Brust: It’s interesting too, because columnar comes up in analytics all the time, but that’s more around just having all the values in a given column together so you can table-scan and aggregate easily. This is actually about making things more compact and being able to take advantage of modern CPU technologies …
Jacques Nadeau: Yeah. Yeah, yeah, as you said earlier, this gets a little bit technical. But the high level is that columnar storage, things sitting on disk, is something we’ve been talking about for years. One of the guys at Dremio, a guy named Julian, invented Parquet, which is basically the de facto standard for on-disk columnar representation. Right? And so people have been taking advantage of that capability for some time, and it makes it much more efficient how quickly you can get to the data on disk.

What happened historically was, the moment the data moved into memory, it moved from a columnar representation to a row-wise representation. So what we’re saying now is, you know what, the same kinds of benefits that exist on disk have similar counterparts when we’re working with the CPU in memory, so we should use a columnar representation in memory too. And so the fastest systems are going to be the ones that have columnar storage on disk, using something like Parquet, and then bring that into memory and use Arrow to process it in a columnar way in memory.

Andrew Brust: So the journalist in me wants to know what you guys at Dremio are up to. How much can you share today?

Jacques Nadeau: Yeah, well, we established Dremio last year. We’re about twenty people now, and we’ve got a bunch of people involved in things like Arrow and Parquet and Calcite, and we’ve got people who worked on the Oracle Flash Cache, and people who worked on Twitter’s big data pipeline. We’re bringing together a bunch of experts in this space to really make it a lot easier to get to data. Okay? And that’s pretty much all I can say right now; we haven’t launched yet. But one of the things I can say is that, as a new company, we’re obviously focused on the things that are very important to us, and Arrow we see as a huge opportunity.
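The on-disk-columnar-to-in-memory-columnar pipeline described above can be sketched with the standard library. This toy file format (the layout, field names, and footer scheme are all invented for the illustration; real Parquet is far richer) stores each column’s values contiguously with offsets in a footer, so a reader can pull just the column it needs straight into a contiguous in-memory buffer.

```python
import os
import struct
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cols.bin")
ids = list(range(1000))
names = [f"user{i}" for i in ids]

# Toy columnar file (the idea behind Parquet): each column's bytes are
# stored contiguously, with the chunk lengths recorded in a footer.
with open(path, "wb") as f:
    id_bytes = struct.pack(f"<{len(ids)}q", *ids)
    f.write(id_bytes)
    name_bytes = "\n".join(names).encode()
    f.write(name_bytes)
    f.write(struct.pack("<qq", len(id_bytes), len(name_bytes)))

# A reader that only needs the id column reads the footer, seeks to the
# column chunk, and loads it into one contiguous buffer (the Arrow idea),
# never parsing or even touching the string column.
with open(path, "rb") as f:
    f.seek(-16, os.SEEK_END)
    id_len, _name_len = struct.unpack("<qq", f.read(16))
    f.seek(0)
    in_memory_ids = struct.unpack(f"<{id_len // 8}q", f.read(id_len))

print(sum(in_memory_ids))  # 499500
```

The point of the sketch is the hand-off: because the bytes on disk are already column-contiguous, “bringing them into memory” is a straight copy into a layout the CPU can scan efficiently, rather than a row-by-row transpose.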
It’s something that’s going to be a core, foundational component of what we’re building.

Andrew Brust: Well, that makes sense. Certainly when I first learned about it, I thought it was pretty important. I raised the alarm bells here at Datameer and wanted everyone to read about it, so …

Jacques Nadeau: Yeah, yeah.

Andrew Brust: We’ll be looking forward to you guys coming out of your stealthy mode, to a place where we know what you’re up to and we see some things productized.

Jacques Nadeau: Yeah.

Andrew Brust: And in the meantime, thanks for being on Big Data and Brews.

Jacques Nadeau: Yeah, thanks for your time. Really appreciate it.