40-minute read · December 14, 2018
Starting Apache Arrow
Wes McKinney · Creator of the Python pandas project, Co-creator of Apache Arrow, CTO and Co-founder of Voltron Data
Transcript
Wes McKinney:
I started with the Pandas project back in 2008. It wasn't called that at first, but eventually it became the open source Pandas project. I'd had almost no contact with the big data world, or with the database development world, really. I had a pretty narrow view: Python programmers and the emerging field of data science. So I'd seen, essentially, Python, but didn't really have much of a view of the world outside of that.

So, when I arrived at Cloudera at the end of 2014, I started to look at how we can make Python plug in to all these big data frameworks and databases, and be more of a first-class citizen. One of the very first things that popped out was, oh, well, each of these systems has its own way of getting access, or making data available, or enabling you to put data into the system. So, I was like, "Oh my gosh, I'm going to have to write a dozen different data converters to marshal data between Pandas and each of these data processing systems, from Spark, to Impala, to different file formats and HDFS." It was basically this overwhelming problem.

Beyond that, within the data science world, there's a significant amount of fragmentation. Everyone talks about dataframes; everyone knows at a high level what a dataframe is. It's a table, and they have this API where they write code to manipulate tables that are in memory. But if you go under the hood, the dataframe is very different internally, and so there was no way for me to reuse code that the R community had built, or take code that I had built, and make that available to other data science communities.

I saw this problem all over the place. It's difficult for us to share data, and it's difficult for us to share code. I wondered if a technology could be created that would help us do both of those things: make my life simpler, but also make the world of big data and the lives of data scientists simpler, faster, and more productive.
Apache Arrow and the Idea of Table Middleware
Transcript
Wes McKinney:
And so I started kind of doodling around with some designs, and floating them with my colleagues at Cloudera. I think I called it table middleware. It's like, I've always wanted this: a C library or a C++ library that implements a table, and we can use that as the way to move data from point A to point B, a kind of standardized table interface.
How Wes and Jacques connected to help start Apache Arrow
Transcript
Wes McKinney:
At some point in 2015, my colleagues at Cloudera and I connected with Jacques and the folks from the Apache Drill community, and they'd just started [Dremio]. And they said, "Hey, we've also been thinking about this problem. Let's join forces and see if we can agree on the way to solve the problem." So that was how the initial collaboration got bootstrapped.
Explaining Apache Arrow
Transcript
Wes McKinney:
The way I explain the project to people is that, first of all, we are creating an open standard for tabular data in memory. We call it columnar data; that has to do with the way the data is arranged in memory. Column-oriented data is more efficient for some kinds of analytical processing.

Pandas is a columnar analytics tool. R's data frames are columnar. A lot of databases, not all, but some databases are column-oriented, and there are pros and cons to using that type of memory layout.

But there was no open standard for what a table looks like in memory. So we sought, as our initial project, to define an open standard, to build reference implementations of it, and to demonstrate that we could generate that memory format in multiple languages and share the data, let's say, from Java to C++ or to Python without any copying. That was how the project got started.

The way I describe the project to people now is that we're creating a development platform for building in-memory data processing systems and query engines. There are several parts to that. I think one of the most important things we focused on from the early days of the project is defining a language-independent open standard for in-memory tables that's optimized for analytical processing.

That alone isn't enough to be useful, so you need reusable libraries to do things with that format. So we're developing fast data access and data movement: tools for accessing file systems, accessing different storage formats and databases, writing data to shared memory and reading data from shared memory without copying on the way back, so you can treat data on disk as though it's in memory, and building an RPC framework for fast messaging of datasets from server to client.

And lastly, we're building vectorized processing libraries for fast in-memory computing natively on the Arrow format. An important thing to remember about the project is that it's front-end agnostic. It's not a new dataframe library, like Pandas. It's not a new database.

If you look at the internals of a database, or the internals of Pandas, we want to take the different components that you would use to build a database and make those into reusable libraries with a public API. So that if you want to replace a component of an existing system with something that's faster and more standardized, you have high-quality libraries to do that. If you want to build a new database or a new dataframe library, you have a robust set of tools to do that, and we'll give you a head start. And also, the system that you build will be immediately interoperable with the entire Arrow ecosystem. That's really exciting for me, having implemented many of the same kinds of things many times in many different projects.
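To make the idea of a standard in-memory table concrete, here is a minimal sketch using pyarrow, Arrow's Python implementation (assuming a reasonably recent release; the table contents and names are illustrative):

```python
import pyarrow as pa

# Build a columnar, in-memory table in the standard Arrow format.
table = pa.table({
    "city": ["Nashville", "New York", "Toronto"],
    "population": [690_000, 8_400_000, 2_900_000],
})

# Serialize to the Arrow IPC stream format. Another process or
# language (Java, C++, R, ...) can read these same bytes back into
# its own Arrow implementation without converting the column buffers.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)

# Reading the stream back yields an identical table.
reader = pa.ipc.open_stream(sink.getvalue())
assert reader.read_all().equals(table)
```

The same schema and buffer layout is what a Java or C++ process would see, which is what makes the zero-copy sharing Wes describes possible.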
Transcript
Jacques Nadeau:
One of the things that I've thought about is that, to me, there are two different phases here. Phase One of Arrow, from my perspective, is something that has happened, and we're heading towards the end of it and moving to Phase Two.

Phase One, to me, was this concept of table middleware. I like that. The concept of: what is the unified or universal table? When we started, one of the things we were struggling with was this concept that, if you're trying to drive towards something that's much more loosely coupled, right, the old world of data infrastructure is all, 'Okay, I stay in one system,' right? But if you say, 'Hey, you know what, there's a new case I want to solve, and I want to solve that with different technologies,' there are a lot of different technologies that are great at specific things. So we started to say, 'Hey, how do we move data between systems?'

One of the things that was, at least for me, a mental breakthrough was the concept, and I'm curious what you think about this, that you can't solve transport alone. If you want to move data between different systems with a single representation, and do that efficiently, the inefficiency at this point, networks are pretty fast, is primarily due to the data format itself: internal systems each have their own representation, and you pay serialization costs moving between separate systems.

The breakthrough for me is that in order to solve any kind of transport problem, you have to first solve the processing problem. You say, 'Hey, if we can come up with a representation that's very efficient for processing purposes, for a large portion of analytical workloads,' then people start to adopt that, and you start to build toolkits around it. Then that also allows them to interchange that data without a lot of overhead.

I don't know which one is the chicken and which is the egg, but I feel like with Arrow, it's been: you've got to solve the processing before the transport, but then, as we start to see now, people adopt first with the transport. I think that is Phase One, this processing-transport representation. Phase Two is actually giving people a bunch of tools to solve the processing side of things at a better level.
Wes McKinney:
Right, yeah, definitely. I think the interchange problem is important. I think solving that alone is definitely valuable, but it would be a shame if you created a universal table interchange format and all that any system ever did was convert its data into that format, hand it over to another system, and then that system converts it from the interchange format to its custom in-memory representation. Even that would be beneficial, because you'd have a standardized interchange format, but you would still be paying the cost of converting to and from that format.
Let’s Stop Reimplementing Algorithms
Transcript
Jacques Nadeau:
One of the things I've heard you talk about, and I think as someone else who works in open source I've experienced this a lot as well, is this 'Hey, everybody has to re-implement the same algorithm' problem. Right? And I think that neither of us is a huge fan of that, because we know that there are constrained resources for getting these open source initiatives built.

So what do you think the big opportunities are in the near term, in terms of starting to reduce that at different levels? Like I think right now, Arrow in its current state is best at interchange. Would you agree with that?
Wes McKinney:
We've focused on interchange.
Jacques Nadeau:
Yeah, yeah, that was the goal of Phase One. These are my internal phases, not project phases. But I think that we've done that well. And there's been some work on, for example, I think we have some algorithms for dictionary encoding things.
Wes McKinney:
Right.
Jacques Nadeau:
Right. And so, hey, if my data is in an Arrow format, how do I get it into a dictionary-encoded Arrow format very efficiently?
Wes McKinney:
Right.
Jacques Nadeau:
Right. What do you think's going to happen? As we try to add more processing stuff to Arrow, do you think the existing systems will start to adopt that, or do you think it's going to be the newer technologies that adopt those things, and it takes longer for the existing systems? I don't know if you've thought about that.
Wes McKinney:
Yeah. There's a lot to unpack there. I'll get to the last question at the end, but one of the things that really motivated me about the project, beyond defining the Arrow columnar format and coming up with a standard for interchange, is the re-implementation, the wheel-reinvention problem. One of the things that really drove that home is looking at what has developed in the Pandas project. That code base is nearly ten years old now, and we've essentially built our own in-memory query engine: we have our own CSV parser, we have interfaces to databases, we have our own data display and presentation system for Jupyter notebooks and the console.

The scope of the Pandas project, the code that we've developed and the code that we own, has become massive. It's a couple hundred thousand lines of code. But if you go look at a database, and I had some experience working with the Apache Impala team, the folks at Cloudera, it's like, gosh, they've implemented a lot of the same things. They've got their own CSV reader, and they've got their own I/O subsystem, and ways to access all the different places where data comes from. They have their own query engine, they have their own front end; Pandas has its own front end, which is Python code.

And so there's all this energy lost to people re-implementing the same things over and over again. I think it would be easy to naively say, well, let's stop re-implementing CSV readers, and let's stop re-implementing hash joins, and sorts, and all these things. But the problem is that all of these implementations go back to the memory, so the implementations are specialized, and they have to be highly specialized to where the data lands after you read it out of the CSV file, after you read it out of the Parquet file.

So I'd like to see the community work together to solve these problems really well, and then have libraries that can be used in many different projects over a long period of time. I think what will happen, and what's actually already happening as far as who's using these libraries, is that existing projects in some cases will use the Arrow libraries to get access to data, strictly use the Arrow ecosystem's high-performance data access layers, and then pay the cost of converting from Arrow to whatever format they're using. This is already the case with the Pandas community: a lot of Pandas users are reading Parquet files via Arrow. As we develop more processing libraries, I think that slowly these communities will begin to take advantage of Arrow-native processing.

But it will be difficult to displace a whole ecosystem of working code. It's like, if it ain't broke, don't fix it.
So if you have to pay this conversion cost to use Arrow, that conversion cost might outweigh the benefits of the faster Arrow processing. But I think the really interesting thing long term, and you can think on a ten- or twenty-year horizon, will be next-generation data processing systems that are designed from day one to be Arrow native, and are able to focus on much higher-order concerns, in terms of query optimization and distributed computing, and not have to re-implement all these really basic things, to be able to spend time focusing on the higher-order big data optimizations that end up taking a lot of time.
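As a concrete illustration of the access pattern Wes describes, this is roughly how Pandas users read Parquet through Arrow; a minimal sketch, with the file name as a placeholder:

```python
import pandas as pd
import pyarrow.parquet as pq

# pandas delegates the Parquet reading to pyarrow, then converts the
# result into its own in-memory representation (the conversion cost
# discussed above).
df = pd.read_parquet("example.parquet", engine="pyarrow")

# Alternatively, stay in the Arrow format and convert only at the end.
table = pq.read_table("example.parquet")
df2 = table.to_pandas()
```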
Can Apache Arrow Help My Legacy App
Transcript
Jacques Nadeau:
One of the things that you commented on is this cost: today, if you already have something, you might take Arrow and use it to get access to other kinds of data, but you have an internal representation of your data that existed before Arrow, and so you have to pay a cost to translate.

My experience, and sort of my observation, has been, and I'm curious if you've seen the same thing, that it's rare for that cost to actually be more expensive than the efficiency gained by using the Arrow access stuff. I think part of that has to do with the fact that, at least in my experience, a lot of existing access approaches or libraries, whether it's some of the stuff from databases or other systems, are cell-level interfaces, so for every record you have to interact with an API to get to every cell. So when you interact with any library that's been pre-built to try to solve a problem, you're going to be interacting with that cell-level access.

I feel like in many cases, even if there is a cost, if you're not Arrow native throughout, to moving from Arrow to your internal representation, that is frequently outweighed by the benefit of how highly optimized the code in the Arrow project is for accessing things. Do you think that's fair?
Wes McKinney:
Yeah, I think so. I think it will vary on a case-by-case basis, but for the data science use cases, we've spent a lot of energy making that jump between, let's say, Pandas-native, and we're starting to work on R-native, so it would be a similar story there. We've worked on making that cost of moving back and forth as low as possible, so there we do use multiple cores and try to get the cost to be as close to plain old memory copying as possible. You have to do a little more work than just a plain memory copy, but we do the best we can.

But if you have an algorithm where you get a 5 or 10x performance gain over the thing that you're currently doing, then the memory copying cost definitely could be insignificant.
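For reference, the boundary crossing being discussed looks like this in Python; a minimal sketch with illustrative column names:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# pandas -> Arrow: pyarrow converts the columns, using multiple
# threads where it can.
table = pa.Table.from_pandas(df)

# Arrow -> pandas: the "pay once at the boundary" cost; for simple
# numeric columns this is close to a plain memory copy.
df_back = table.to_pandas()
```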
How Apache Arrow is Being Used in Apache Spark
Transcript
Wes McKinney:
One of the big projects in the last couple of years is getting … was building Arrow integration into Apache Spark for use in pyspark. And so this is when I was working with some of my former colleagues at Two Sigma, heavy Spark users. So we looked at how can we get Spark sequel to convert batches of tables into Arrow format, send that data to Python for in Pyspark and enable you to run Pandas code on, custom Pandas code, on that data to write user defined functions in a way that is more natural for Pandas users.We found that just the efficiency gains of using a better interchange format, what was being used before was a cell by cell kind of evaluation model and so going from cell by cell to vectorized gave almost 100x speed up in some cases. And so, I think, for people to see that kind of a benefit, just kind of moving to the kind of batch based vectorized execution model is really eye opening. I think we’ll see a lot more examples like that as people begin to adopt the project as like an add-on library or accelerating certain parts of existing systems.
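For readers who want to try this, here is a hedged sketch of the pandas UDF feature that shipped in Spark 2.3; the Arrow configuration key below is the Spark 2.x name (later Spark versions renamed it to spark.sql.execution.arrow.pyspark.enabled):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# Ask Spark to use Arrow when moving data between the JVM and Python.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# The function receives whole pandas Series backed by Arrow batches,
# not one cell at a time.
@pandas_udf("double")
def plus_one(v):
    return v + 1.0

df = spark.createDataFrame(pd.DataFrame({"x": [1.0, 2.0, 3.0]}))
df.select(plus_one(df["x"])).show()
```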
How We Use Apache Arrow in Pandas
Transcript
Wes McKinney:
So there are a couple of ways I'm using the project and spending my time. One is building faster data access, and in some cases faster computing, for the Pandas project, so we're replacing some current things that we do in Pandas with new and faster Arrow-based code. The second is building essentially a next-generation Pandas-type dataframe library for the data science community.

My goal is for that to not be focused on Python, but to be something whose underlying code could be used in R, could be used in Julia; it's not tied to Python. My vision for it is to be a companion project to the existing Pandas project, not a new thing intended to replace Pandas, and to focus on faster, more scalable data processing: to be able to go ten times faster on 10 or 100 times more data, and to be able to do that on a laptop, without having to spin up a giant cluster to work with very large on-disk data sets.
“Apache Arrow in 60 seconds”
Transcript
Jacques Nadeau:
When people ask me what Arrow is, I usually say it starts with a format. It's a format that's designed for very efficient processing. The format is memory-centric, and there are a bunch of different libraries; I think there are close to at least 10 languages with Arrow bindings now that you can download and start using today. And then that format opens up a set of possibilities that otherwise you'd have to have a lot of engineers spend a lot of time on before you could do those things.

So rather than everybody having to re-implement how they do processing with data, they can use Arrow to do that. I kind of think of it as the back end of a database, but in a lot of little pieces. And as I think you pointed out, a database is just one form of a processing engine. It happens to be the one that I'm closest to. But a lot of the same problems, I think, are being solved in all of them.
How We Use Apache Arrow in Dremio’s SQL Execution Engine
Transcript
Jacques Nadeau:
In terms of how we use Arrow: at Dremio, we're trying to let people abstract away some of the physical concerns of data, no matter where the data exists, and to improve the performance of data access. That's whether it's an analyst use case, a data science use case, or something else. What we decided to do early on, because we believe very much in this concept of a shared data processing system, was to build our product entirely in an Arrow-native way. Whenever we're trying to interact with data, we say, hey, step one, convert it into Arrow. Then everything we do internally uses Arrow to process those things efficiently, and it's both about processing efficiently and also about the efficient use of memory. Then we try to communicate those results out.
How to Get Started With Apache Arrow
Transcript
Jacques Nadeau:
What do I get excited about next? I think getting the format in place and getting the tools in the different languages in place is the foundational level to me. What I get excited about is that, for example, you and I are working on Arrow Flight with other people, and that's this vision of a parallel communication set of libraries for different languages to move data between systems. Right now, the format is the backbone, but how you move data around between systems is entirely up to the implementers, so it's a lot more work when someone wants to adopt Arrow for that purpose. So I'm excited about that.

And I'm also excited about some of the Gandiva stuff that we're working on, in terms of what's the fastest way you can possibly process Arrow for arbitrary expressions, because it's a very common pattern that people want to pass in these expressions and process them efficiently. But right now, that's so custom, and it's a lot of work to do over and over again.
How to Contribute to Apache Arrow
Transcript
Wes McKinney:
People often ask me how they can help or how they can get involved. The Apache Arrow open source project is in the Apache Software Foundation, which means it's a community-led, open-process open source project. We do all of our work on public channels and the mailing list. One of the biggest ways that people can help is by following the development process and sharing their use cases: if you're building a data processing system, or you're solving some of the problems that we're also working on in Arrow, or you have ideas about where the project could go, share them with us so that we have a dialogue about it.

The road map is sort of open and can be shaped by the people who share their ideas about the project. Obviously, if you or anyone you work with wants to contribute code or documentation, or review code, or help refine the road map, we'd love to have you involved in the actual development process in the open source project. You can try to use the libraries to solve problems that you have, whether it's improving the performance of data access or improving data processing performance, and report your experience or findings to help us understand what it's like to be a user of these libraries and to find the rough edges. If you find missing features or you run into problems, the more we know, the more we can help you, and the stronger the community.
Jacques Nadeau:
That actually reminds me of one of the things I say about contribution. I believe a couple of things are important there. One is that people sometimes think the only way they can contribute is code, and I think you hit on the fact that there are a lot of other ways people can add huge amounts of value to this project, whether that's writing documentation, providing feedback and bug reports, or sharing what the missing pieces are that [inaudible] for them to engage with it in their use case. There are lots of different ways for people to contribute. One of the things that is key to me is that the most prolific contributors are frequently people who use things a lot. They learn the challenges from those things.

Then it goes to, hey, what are the different ways that you can find use and value in Arrow? There are lots of different ways that you can pick it up and start using it. It is a library rather than a server, and so it's going to be a development choice to incorporate it into some of the things that you're doing.
Wes McKinney:
Some people ask me, can we join the development calls? I think sometimes people wonder, is this project open to everyone? Do you want to hear about my use case? And my answer is always yes. Come, we want to hear from everyone. Arrow's not part of some kind of vendor initiative. We're not competing with anyone. Our goal in building the project is to work together as a community to define standards, solve problems, and help reduce the amount of fragmentation that exists in the big data world, just to make the whole ecosystem more efficient and more productive.
Jacques Nadeau:
One of the things that I've seen with Arrow is that we have very strong diversity. Part of that may come from the different languages: people typically have their [inaudible] languages, and so the people building different kinds of front ends for interacting with Arrow are going to be coming from different communities. I think that makes it a much more welcoming situation. And as you mentioned, we do sync-ups regularly, every few weeks, and those are announced on the mailing list, and anybody is welcome at those. People can just join the mailing list, because basically that's where everything happens on Arrow. I think that's important to know. Yeah, I think it's a very welcoming community, and there's a desire to incorporate lots of new ideas.
Where Should I Use Apache Arrow
Transcript
Jacques Nadeau:
There are some really easy wins that Arrow is very, very good at, that people have found a lot of value from. One is that if you're writing some kind of application where you need to access data in some of the most common big data formats, odds are that if you write something that interacts with Arrow, you're going to be able to do that, and you're going to be able to access multiple data formats very efficiently. That's one case where it's an easy win.

I think another one a lot of people have is when they're trying to move data between native applications in different languages, and/or between native applications and JVM applications. We've seen a lot of people have very quick successes with that. And with some of the new things we're working on, there's this concept of what I call data microservices, which is the ability for you to build up individual services inside your organization and make them efficient for high-speed throughput of analytical data.
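On the first of those wins, multi-format access through one library, here is a minimal sketch with pyarrow (file names are placeholders; the CSV and Feather readers assume a reasonably recent pyarrow release):

```python
import pyarrow.csv as csv
import pyarrow.feather as feather
import pyarrow.parquet as pq

# Different storage formats, one common result: an Arrow table.
t1 = csv.read_csv("events.csv")
t2 = pq.read_table("events.parquet")
t3 = feather.read_table("events.feather")
```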
What is the Future of Apache Arrow
Transcript
Jacques Nadeau:
I feel like I saw numbers, like half a million downloads a month, for some of the Python stuff?
Wes McKinney:
Yeah, I think just the Python wheels, using the pip package installer, there were more than half a million installs in August. So the uptake is great, and there's a handful of use cases driving that. I think shared memory IPC is driving a lot of it. Reading and writing Parquet files is driving some of it. And people are using Arrow in different processing frameworks like Pandas and Dask.

But I think where we're going from here, there are maybe three major directions. One direction is increased language diversity, and within each language, more support. We're just starting to work on an R integration with Arrow, so there's a lot of work to do there. We still have a lot of work to do in Python, JavaScript, Java, Go, Rust. There are different libraries and different binding layers, and those are all co-developing and at different stages of maturity.

Across the different languages, I think the two major themes of development are, one, I/O and data access: supporting more kinds of storage formats and more kinds of databases for getting data efficiently into Arrow format.

And then probably the biggest project is processing, so in-memory computing on Arrow data. One project that's going on right now is the Gandiva initiative, which is an LLVM compiler for efficient evaluation of expressions on Arrow data. I think that code generation and LLVM are going to be a really important part of the project going forward. But I think we'll see in-memory processing for Arrow develop in all of the libraries, so that every native implementation of Arrow has some level of native manipulation of, effectively, Arrow tables or Arrow data frames, however you think of them.
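The in-memory computing direction Wes describes later grew into, among other things, the pyarrow.compute kernel library. As a hedged illustration of the idea (this module postdates this 2018 conversation and is distinct from the Gandiva LLVM work mentioned above):

```python
import pyarrow as pa
import pyarrow.compute as pc

prices = pa.array([10.0, 12.5, 9.0, 14.0])

# Vectorized kernels evaluate expressions over whole columns at once,
# operating natively on the Arrow memory format.
discounted = pc.multiply(prices, 0.9)
over_ten = discounted.filter(pc.greater(discounted, 10.0))
```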
Jacques Nadeau:
Yeah. No, I agree. When I think about where it's going, I think more and more systems are going to incorporate it. You mentioned earlier that Spark now supports it for moving data back and forth with Pandas. We saw that Nvidia adopted it as the representation for GPU analytics. There are dozens of projects and products using it.

So I think we'll continue to see an expansion of adoption. And absolutely, the processing will continue to grow; I think we're just scratching the surface there of what we really want to do. I'm super excited about Flight, because I really do believe it's going to give people this ability to move between different systems, so making that as easy as possible is going to be very valuable.

I think one of the things we're only starting on is the applications people can build on top of Arrow. We have end-user applications, but I think we're going to start to see development applications, additional libraries built on top of Arrow that allow you to do even more work to complete your task. Like the Apache [inaudible] you guys were talking about: how do we build an Arrow-native [inaudible] interface. I think we'll see more and more of that, which will be additional foundation components one layer up from the representation and processing, which I think is kind of cool.
Arrow Flight and Data Microservices
Transcript
Wes McKinney:
I'm really excited about the Flight project, which is our RPC system. I think one of the biggest growth areas will be people building... Earlier you mentioned data microservices. If you're a large organization, you could imagine that rather than having custom server protocols, many systems standardize on the Arrow Flight RPC protocol, and then you can just say, "For this system, here's my Arrow port." You can get a list of the datasets that are available from that server, and you can request them at high performance.

Clients benefit in that the protocol is very efficient because of all of the benefits of Arrow: not copying memory, not deserializing. Some systems will process the data that they get from those servers natively in Arrow format, or they'll convert it to some other representation. In either case, the performance gains in real-world applications will be significant. I think that will drive a lot of adoption of the project over the next couple of years.
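Flight was still being designed when this conversation took place; in today's pyarrow it looks roughly like the following sketch, where the host, port, and ticket contents are placeholders:

```python
import pyarrow.flight as flight

# Connect to a Flight endpoint ("here's my Arrow port").
client = flight.connect("grpc://arrow-server.example.com:8815")

# List the datasets the server exposes.
for info in client.list_flights():
    print(info.descriptor)

# Fetch one dataset as Arrow record batches; no row-by-row
# deserialization happens on the way in.
reader = client.do_get(flight.Ticket(b"my-dataset"))
table = reader.read_all()
```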
Jacques Nadeau:
Yeah, people always like to be faster.
Wes McKinney:
Faster is pretty much always better. Faster and more resource-efficient.