Summary
The data ecosystem has been growing rapidly, with new communities joining and bringing their preferred programming languages to the mix. This has led to inefficiencies in how data is stored, accessed, and shared across process and system boundaries. The Arrow project is designed to eliminate wasted effort in translating between languages, and Voltron Data was created to help grow and support its technology and community. In this episode Wes McKinney shares the ways that Arrow and its related projects are improving the efficiency of data systems and driving their next stage of evolution.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
- Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
- Your host is Tobias Macey and today I’m interviewing Wes McKinney about his work at Voltron Data and on the Arrow ecosystem
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what you are building at Voltron Data and the story behind it?
- What is the vision for the broader data ecosystem that you are trying to realize through your investment in Arrow and related projects?
- How does your work at Voltron Data contribute to the realization of that vision?
- What is the impact on engineer productivity and compute efficiency that gets introduced by the impedance mismatches between language and framework representations of data?
- The scope and capabilities of the Arrow project have grown substantially since it was first introduced. Can you give an overview of the current features and extensions to the project?
- What are some of the ways that Arrow and its related projects can be integrated with or replace the different elements of a data platform?
- Can you describe how Arrow is implemented?
- What are the most complex/challenging aspects of the engineering needed to support interoperable data interchange between language runtimes?
- How are you balancing the desire to move quickly and improve the Arrow protocol and implementations, with the need to wait for other players in the ecosystem (e.g. database engines, compute frameworks, etc.) to add support?
- With the growing application of data formats such as graphs and vectors, what do you see as the role of Arrow and its ideas in those use cases?
- For workflows that rely on integrating structured and unstructured data, what are the options for interaction with non-tabular data? (e.g. images, documents, etc.)
- With your support-focused business model, how are you approaching marketing and customer education to make it viable and scalable?
- What are the most interesting, innovative, or unexpected ways that you have seen Arrow used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Arrow and its ecosystem?
- When is Arrow the wrong choice?
- What do you have planned for the future of Arrow?
Contact Info
- Website
- wesm on GitHub
- @wesmckinn on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Voltron Data
- Pandas
- Apache Arrow
- Partial Differential Equation
- FPGA == Field-Programmable Gate Array
- GPU == Graphics Processing Unit
- Ursa Labs
- Voltron (cartoon)
- Feature Engineering
- PySpark
- Substrait
- Arrow Flight
- Acero
- Arrow DataFusion
- Velox
- Ibis
- SIMD == Single Instruction, Multiple Data
- Lance
- DuckDB
- Data Threads Conference
- Nano-Arrow
- Arrow ADBC Protocol
- Apache Iceberg
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Atlan: ![Atlan](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/ys762EJx.png) Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos themselves, and started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to [dataengineeringpodcast.com/atlan](https://www.dataengineeringpodcast.com/atlan) and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- MonteCarlo: ![Monte Carlo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/Qy25USZ9.png) Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit [dataengineeringpodcast.com/montecarlo](https://www.dataengineeringpodcast.com/montecarlo) to learn more.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a t l a n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode.
With their new managed database service, you can launch a production ready MySQL, Postgres or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Wes McKinney about his work at Voltron Data and on the Arrow project and its surrounding ecosystem. So, Wes, can you start by introducing yourself?
[00:01:37] Unknown:
Yeah. Sure. Thanks for having me. I'm Wes McKinney. Many people know me as the creator of the Python pandas project, which I started almost 15 years ago, but over the last 7 years, I've been primarily focused on the Apache Arrow project and the surrounding open source ecosystem. More recently, I'm the CTO and cofounder of Voltron Data, a data analytics startup where we are offering enterprise support and services around Apache Arrow and doing a substantial amount of open source development in the ecosystem. And do you remember how you first got started working in data? I've told the story many times, but I was working in quantitative finance right out of college. I had a math degree, and I thought I was gonna be doing math and solving partial differential equations and that sort of thing, but it turned out that I was mostly doing data analysis and writing SQL queries and using data frames and things like that.
And so I started to get interested in the tools for doing data analysis because I wanted to make myself more productive because I found my job to be quite tedious and working with data to be much more difficult than I thought it should be. And so for me, it started out as a personal challenge to see if I could create tools for myself to enhance my productivity. And I found that I enjoyed building tools, and I, you know, became very passionate about open source software. And, you know, I love working in the community and building projects and helping progress happen faster.
[00:03:09] Unknown:
In terms of the Voltron Data business and the kind of focus of it, I'm wondering if you can give some of the overview and some of the story behind how it came to be.
[00:03:21] Unknown:
The Apache Arrow project started we got the initial group of developers together in 2015 to start the project, formally launched it as a top level project in the Apache Software Foundation in 2016. And we set about, you know, growing the different layers of the stack. And as time went by, we started to observe, you know, more general trends in the interplay between programming languages, you know, data storage, data access, and kind of the the data analytics stack, and the role of, you know, the evolution of hardware and computing hardware. So in particular, things like graphics cards, you know, GPUs, FPGAs, and custom silicon.
There were many different groups of developers working on different layers of the stack in and around the Apache Arrow ecosystem. We saw an opportunity to build a unified computing company, bringing together, you know, several of those groups of people. So, respectively, you know, myself and my team from Ursa Labs, which became Ursa Computing, a group of leadership from the RAPIDS project, which had been started at NVIDIA, and the BlazingSQL project, which is a SQL engine built on top of RAPIDS. So we reasoned that we could build a more integrated and more successful company, you know, working together under 1 roof than pursuing, trying to grow our different slices of the pie, so to speak. So that's how the company came together at the beginning of last year, you know, to build a large team.
Thankfully, we were able to assemble quite a bit of investor capital before the market turned south earlier this year. You know, we've been really just heads down building for the last year and a half, which has been really, really exciting. 1 of the, I guess, kind of meta notes that I'm curious about is how you settled on the name of Voltron as the company name, and how often people wonder what that is in reference to? You know, we like the name Voltron Data. You know, we wanted to evoke, you know, the feeling of, you know, what we're building being something where, you know, the whole is greater than the sum of its parts. And, you know, I think the mission of the company, kind of the, you know, the heart or the soul of the company is making the modern data analytics stack more modular and composable to make it easier for developers and users of analytics or data engineering tools to unlock the value of modern hardware and to take advantage of advances in computing capabilities, you know, as they become available.
And so I think we've seen, you know, in the world of machine learning and AI, you know, deep learning training, machine learning training, that sort of thing, we've seen, you know, significant change to the technology landscape through use of hardware acceleration, you know, through GPUs, and now we're seeing TPUs and custom chips for accelerating machine learning. The same kind of innovation and improvement in computing efficiency can be brought to the other layers of the data processing stack, analytics, machine learning preprocessing, you know, ETL.
You know, we are, you know, we're really focused on improving the, you know, protocols and standards, like the fundamental technologies that enable that kind of modularity and composability at the kind of, you know, language, data, and hardware level. And so if, you know, developers observe, you know, the work that we're doing not only in Apache Arrow, but in some of the surrounding projects, like Substrait and Ibis, which we can dig more into in this podcast, you can see how we are working on, you know, really hardening, like, these interfaces and protocols between the different layers of the stack to make it easier for developers to, you know, swap out components and develop in a more kind of framework agnostic or, you know, engine agnostic fashion, if that makes sense.
[00:07:28] Unknown:
As far as the broader vision of Arrow, you know, it has these immediate benefits of being able to operate as an interchange format between different languages and run times and frameworks, and it has been growing in terms of its scope and its capabilities. And I'm curious if you have any overarching vision for Arrow and its potential impact on the broader data ecosystem and some of the ways that the work that you're doing at Voltron is aimed at helping to bring forth the realization of that vision.
[00:08:06] Unknown:
You know, going back 6, 7 years, when we started the Arrow project, I did always have the aspiration of building a more modern computing foundation for data frames and tabular data processing. And so for me, like, expanding the scope of, you know, what we call Apache Arrow has always been, you know, something that I've been really motivated to do. But when we started the project, we had to start small. Like, can we, as a community, come to an agreement around how we represent tabular data in a framework and language agnostic fashion, such that we can achieve this concept of, like, a universal data frame, which can be used portably across computing frameworks, programming languages, different processing environments so that we can have a basis for beginning to think about that kind of, you know, frictionless modularity and composability at the data level. Once we did that, we had to move on to building the other layers of the stack, which are necessary to build Arrow native applications. And so that's, you know, the data serialization, building RPC for moving around data efficiently in a distributed system.
More recently, you know, we've been looking at the protocols and interfaces for interacting with databases in an Arrow native way. And so we've got subprojects which are specifically focused on integrating Arrow more natively into database systems. So we make it easier to push Arrow based datasets in and out of databases. And so the other dimension is not only having a universal data format, protocols and interfaces for moving it around, and protocols for connecting systems together in an Arrow native way. But we also needed to build computing engines to process Arrow data so that we can embed them into different systems to do, you know, data cleaning, data preparation, feature engineering for machine learning, analytics, all those things that you would do with a, you know, SQL engine or a data frame library, that sort of thing. And so as time has passed, the work in the Arrow project has moved away from building these fundamental protocols and interfaces to more of the, you know, modular, embeddable compute engine development, which has been really, really exciting to see.
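As a minimal sketch of the "universal data frame" idea discussed here, the following uses the pyarrow library to build a columnar table and serialize it with the Arrow IPC stream format, so that any Arrow-aware runtime (Python, Java, Rust, and so on) can read the same bytes without per-language conversion. The column names and values are purely illustrative.

```python
import pyarrow as pa

# Columnar, language-agnostic in-memory table
table = pa.table({
    "user_id": pa.array([1, 2, 3], type=pa.int64()),
    "score": pa.array([0.5, 0.9, 0.1], type=pa.float64()),
})

# Serialize to the Arrow IPC stream format (the same wire representation
# used by Arrow-native transports such as Flight)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Any Arrow implementation can reconstruct an identical table from the bytes
reader = pa.ipc.open_stream(buf)
roundtripped = reader.read_all()
assert roundtripped.equals(table)
```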
[00:10:23] Unknown:
1 of the initial motivations for Arrow was to cut down on some of the inefficiencies of that data interchange. I think 1 of the most notable examples is using the PySpark library to interact with the Spark runtime and having to serialize and deserialize the data in between those interfaces, as well as having to translate between the representations of information between Java and Python. And I'm wondering if you can give an overview of the kind of types and scope of impact on engineering productivity and compute efficiency that the Arrow project and the kind of growth thereof is intended to address.
[00:11:09] Unknown:
I think the Spark example is a really motivating 1 because that was 1 of the first problems, practical problems that we focused on solving with the Arrow project was the problem of making Python on Spark a lot faster. And so if you look at, you know, as a user using PySpark versus using the Spark Java or Spark Scala APIs, there was a significant performance penalty in using Python whenever you wanted to extend Spark with custom Python code that might use Pandas or might use scikit learn or, you know, something else in Python ecosystem. So by defining a column oriented, you know, data format, which could be constructed on the JVM side inside the Spark runtime and then sent over to the Python side for executing custom code by having that not only a more efficient data format to move across, but also something that could be interacted with very cheaply on both sides without having additional conversion or serialization.
This was with my colleagues at Two Sigma and collaborators at IBM. We were able to make custom code running in PySpark 10 to a 100 times faster in some cases. Now, of course, like, you know, there are many workloads in Spark which have shifted to use the Spark DataFrame API where under the hood, you know, Python code, Java code, Scala code, it gets translated into effectively a SQL query, which gets run by Spark SQL. And so there's no need for data to ever be transferred into Python. But there still are plenty of use cases where it's necessary to run custom code, and Spark is used in many cases as a convenience layer for doing parallel and distributed computing with Python. But users shouldn't have to pay an enormous penalty to have that privilege. And so Arrow has really helped in reducing the overhead, the impedance, between those systems in those cases.
That being said, you know, Spark and Spark SQL are, you know, systems that have been around for a long time, but Spark SQL was built before Arrow existed. And so, internally, it is not, you know, an Arrow native system, so to speak. So, like, it represents the data that flows around Spark or inside Spark SQL in a data format that is not the same as the Arrow format. And so I think what's really interesting for thinking about the future is having Spark-like distributed systems for large scale tabular data processing that are fully Arrow native end to end. And so you have the ability to extend those systems with custom code written in principle in any programming language that knows about Arrow. And so we enable a much more, you know, kind of fair and consistent polyglot experience across the stack where no programming language is being unfairly penalized as a result of having to, you know, do expensive data serialization at the programming language boundary.
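A hedged sketch of the PySpark pattern described above: Arrow moves columnar batches between the JVM and Python so that pandas UDFs avoid row-by-row (de)serialization. The column names, session setup, and rescaling logic are assumptions for illustration only.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("arrow-udf-demo").getOrCreate()
# Use Arrow-based transfer for conversions between Spark and pandas
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.createDataFrame([(1, 0.5), (2, 0.9), (3, 0.1)], ["id", "score"])

@pandas_udf("double")
def rescale(scores: pd.Series) -> pd.Series:
    # Operates on whole Arrow-backed pandas batches, not Python rows
    return (scores - scores.mean()) / scores.std()

df.withColumn("scaled", rescale("score")).show()
```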
[00:14:18] Unknown:
Because of the fact that you aren't paying a penalty by virtue of the language or the runtime that you're choosing, I'm curious how you see that influence the decisions that engineering teams make as to how they want to compose their stack and compose their analyses and some of the ways that that reflects in terms of the skill sets that are necessary to be able to build and maintain these analytical systems.
[00:14:46] Unknown:
For me, what's exciting and motivating is for the users to have the choice and be able to choose the programming language and the types of APIs and user interfaces that make most sense for the systems that they are building and to have a more natural, you know, let's call it language integrated querying capability. So I think part of the challenge that we have as system developers is to make it easier for the programming language interfaces to evolve and innovate independently from the back end compute engines. And so, you know, our goal with what we've been building in Arrow is to take, you know, very fast Arrow native data processing and make that available in a form factor where it can go everywhere. So it can be, you know, deployed in, you know, heavily resource constrained environments where having, you know, very low latency, efficient tabular data processing in process is highly desirable.
But also that, using the same APIs and user interfaces that we use to do local small scale computing at the edge, so to speak, we can build descriptions of our workloads or our data transformations in a form where they can be serialized and sent into, you know, large clusters for, you know, doing larger scale data processing. And so that's 1 of the reasons that we've been investing pretty heavily in this new project called Substrait, which is building an intermediate representation for data analytics operations that can be used to connect user interfaces and computing engines on the back end.
So you can think about Substrait as being like something that's, you know, lower level than SQL and can be used to represent, you know, tabular data or data frame operations that go outside of what is expressible in SQL. And so it's our hope that by hardening the interface and making it straightforward for engine builders, you know, compute engine builders to focus on building a Substrait interface to their engine. And so then from the API developer standpoint, the user interface developer building Python libraries or Go libraries or Java libraries or Rust libraries.
At the user interface layer, we can just focus on generating Substrait rather than having to think about, well, how do I build an interface or an integration with a particular computing engine? Because then whenever there's a new computing engine that you wanna take advantage of, maybe to accelerate some part of your data processing workload, you've gotta build a new interface to that engine. And so by reducing the surface area of the problem to just, let's just think about the world in terms of this Substrait intermediate representation, that makes it so much easier for us as API developers to build the user experience because we just have this, like, 1 intermediate representation to generate. And then on the back end, you know, the engines can decide how to most efficiently execute the Substrait plan.
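A small sketch of the engine-agnostic idea discussed here, using the Ibis dataframe API with an embedded DuckDB backend; the table name and the local Parquet file path are assumptions. The same expression could be handed to a different backend, or lowered to a Substrait plan (for example via the ibis-substrait package) for any engine that consumes Substrait, without rewriting the query.

```python
import ibis

con = ibis.duckdb.connect()                  # embedded DuckDB backend, no server needed
events = con.read_parquet("events.parquet")  # hypothetical local file

expr = (
    events.filter(events.status == "ok")
    .group_by("user_id")
    .aggregate(total=events.amount.sum())
)

# The backend decides how to execute the plan; the API layer only describes it
result = expr.execute()  # returns a pandas DataFrame
print(result.head())
```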
[00:17:54] Unknown:
Are you struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end to end data observability platform. Trusted by the teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes.
Monte Carlo also gives you a holistic picture of data health with automatic end to end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more. As far as the overall scope of the Arrow project itself and what is actually contained within the code repository and also the growth of the broader ecosystem around it, they have definitely grown substantially, and I think the most recent release of Arrow right now is version 10. So definitely a lot of development happening there. And I'm wondering if you can give an overview of the current set of capabilities and the, I guess, features that are targeted with Arrow and the related projects beyond just that in memory columnar data representation.
[00:19:26] Unknown:
I mean, the way that we describe the project these days is we describe the Arrow project as being a multi language toolbox for building analytical data processing systems. And 1 important part of the project is the Arrow columnar data format. From there, there's, you know, a whole set of different software components, which enable you to do things with the Arrow data format. That includes data serialization, inter process communication, remote procedure calls, so building services that need to send and receive Arrow data. So there's a framework called Flight for building Arrow native data services.
We've started building some database middlewares for integrating Arrow into database systems. So there's a SQL database protocol project called Flight SQL, which provides basically a wire protocol for talking to SQL databases over gRPC using the Flight framework. Another project called ADBC, which is a standardized API for database drivers to provide Arrow native data access. So it's kind of orthogonal to Flight SQL, so it has nothing to do with the data protocol or the wire protocol. It's more about having a standardized API for inserting and selecting Arrow datasets from SQL based systems.
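A minimal Arrow Flight client sketch, assuming a Flight service is already running at the given location and understands the opaque ticket below (both are illustrative). Flight streams Arrow record batches over gRPC with no row-level serialization step.

```python
import pyarrow.flight as flight

client = flight.connect("grpc://localhost:8815")  # hypothetical endpoint

# Tickets are opaque byte strings the server hands out (e.g. via get_flight_info)
ticket = flight.Ticket(b"sales_2023")
reader = client.do_get(ticket)

table = reader.read_all()  # an Arrow table, streamed batch by batch
print(table.schema)
```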
We've got compute engines, so there's multiple computing engine projects. There's Acero, the C++ compute engine. There's the Rust DataFusion compute engine. So there's, you know, kind of embeddable compute engines in multiple programming languages designed for different use cases. You know, it's an increasingly, you know, diverse and federated community of subprojects. So not just the core Arrow 10.0. You know, probably, if you go to apache/arrow on GitHub, you'll see a, you know, very large polyglot Git repository.
But we've grown several other repositories that house a number of the Rust subprojects, and the Julia Arrow project lives in its own repository nowadays as well. And so we have some support for around a dozen programming languages. And within each programming language, we have, you know, a stack of libraries, which are there to make it possible for you to build systems that use the Arrow format or connect to other systems that use Arrow. Of course, some of those libraries are at different levels of maturity. So the Rust libraries and the C++ libraries are generally the most featureful and mature, but we're growing an increasing, you know, amount of support in Go and Java. You know, initially, the project started out as just C++ and Java, but it's expanded significantly since then. It can be a little bit difficult for a newcomer to navigate, but I think the community has around a 1000 developers. Maybe around a 1000 different people have contributed to the project over the last 7 years. So the developer community, we've done a good job or we've put in a lot of effort, I should say, to make the project accessible to new contributors. So that's through, you know, developer documentation and, you know, efforts to, you know, engage and grow the open source community around it. So it's not just a small, you know, insular group of developers building all of these things, but that we're actively trying to make the developer community larger to share the burden of maintaining all of these different software components that have to have, you know, bug fixes and security fixes and make releases periodically.
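A hedged illustration of the embeddable compute layer mentioned above: the pyarrow compute functions, backed by the C++ kernels, operate directly on Arrow columnar data in process, with no external engine required. The data is made up for illustration.

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({
    "region": ["east", "west", "east", "west"],
    "sales": [100, 250, 75, 300],
})

# Vectorized filter and aggregation executed by the Arrow C++ kernels
east = table.filter(pc.equal(table["region"], "east"))
totals = table.group_by("region").aggregate([("sales", "sum")])
print(east)
print(totals)
```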
[00:23:02] Unknown:
Digging more into the implementation of the Arrow project, I'm wondering if you can talk through the actual architecture of the code and some of the ways that you're able to validate conformance with the specification across those multiple different languages that are implementing an interface to it.
[00:23:25] Unknown:
As I was saying, the project is fairly federated in the sense that the different subprojects, their set of features, and, you know, what the developers are focusing on at any given time, they evolve somewhat independently of each other. But there is the commonality of the Arrow memory format and the protocols and interfaces for interoperability. So 1 of the first things we did when we started the project was establish a harness for integration testing between implementations. So, for example, in C++ we have a set of C++ classes and different tools for dealing with memory allocation and construction of Arrow tabular in-memory data.
And we have a similar set of classes and interfaces in Java, and we needed to define a procedure for a Java application and a C++ application to determine that they understand each other's data. That you say, here's my Arrow data. Do you agree that what I think the data is is the same as what you think it is? And so we devised a test harness where we generate a point of truth version of the dataset in JSON for both applications. You know, the integration test harness parses that JSON, constructs the corresponding Arrow version of that, and then compares that to the binary kind of serialized representation of the Arrow dataset to determine whether it is identical.
So that enabled us to show compatibility between different implementations. So that integration test harness has grown, you know, to encompass several implementations. And so for a new implementation of Arrow, the first target for showing compatibility is participating in those integration tests. But, you know, the integration tests have expanded to include other things like Flight, which is the RPC framework built on top of gRPC for building data services that send and receive Arrow data. So there's a set of integration tests for those.
That's the main way that we verify interoperability or that implementations are doing the same thing. But within the different programming languages, like the architecture of the projects has evolved fairly independently, there's a different extent to which the implementations rely on external libraries for solving certain problems. So, for example, in C++ we've been developing a subproject called the datasets API or datasets framework for a number of years where we enable users to interact with large datasets that are spread across, for example, partitioned directories of Parquet files in S3. And within the project, like, we've had to build a C++ interface to the S3 API.
You know, we developed the Parquet C++ interface. There's a lot of supporting code for dealing with, you know, asynchronous interactions with remote datasets that we've had to develop. But if you go and look at Rust or Java, the libraries for doing these same sorts of things, there's a different level of, you know, reliance on external libraries. Rust uses more external libraries for some of the things where we've had to build, you know, develop, you know, homegrow some libraries and tools within C++ because there weren't off the shelf libraries available.
So in general, like, our mantra is providing a batteries included experience for developers. And so we think about, like, you know, just thinking from the mindset of somebody doing data engineering or building, you know, building an analytics stack, and we think about, like, what problems is that developer or that user going to need to solve. And so rather than, you know, leaving developers to cobble together solutions, we would rather that there be, like, a good out of the box solution for some of these, like, you know, run of the mill, like, everyday, you know, data engineering workflows. So, like, for example, anything having to do with, like, interacting with a large Parquet dataset in cloud storage. Like, we wanna make that really easy for a developer and for using Arrow to be the fastest, like, most efficient way to build their system.
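A sketch of the datasets API described above: point pyarrow at a directory of Hive-partitioned Parquet files (the S3 bucket, partition keys, and column names here are hypothetical, and S3 access assumes credentials are configured) and scan it lazily, pushing down filters and column selection.

```python
import pyarrow.dataset as ds

dataset = ds.dataset(
    "s3://my-bucket/events/",   # partitioned directory of Parquet files
    format="parquet",
    partitioning="hive",        # e.g. .../year=2023/month=5/part-0.parquet
)

# Only the matching partitions/row groups and requested columns are read
table = dataset.to_table(
    columns=["user_id", "amount"],
    filter=(ds.field("year") == 2023) & (ds.field("month") == 5),
)
print(table.num_rows)
```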
[00:27:41] Unknown:
And you mentioned with the Substrait project, 1 of your efforts to reduce the level of effort necessary to be able to add Arrow's capabilities and benefits to the end user experience and being able to integrate with different components of a potential data stack. And I'm curious with that in mind, but also with some of the other projects that you mentioned, how you think about balancing the desire to be able to move fast and expand the reach and capabilities of Arrow with the need to wait for some of these other products and frameworks and projects to actually do the work of integrating with Arrow and maybe the Substrait project and some of the ways that you're
[00:28:26] Unknown:
collaborating with and maybe incentivizing some of these other engines and run times to do that work? You know, as a company, Voltron Data, I think we're lucky to be in a position where we can help accelerate some of this work, like the adoption and integration work. We've been pursuing that through our enterprise subscription program, which is basically an Arrow development and support partnership. So working with companies that are building on Arrow. We've also made some strategic partnerships with projects that are outside of the Arrow ecosystem where we want to invest in Arrow ecosystem integration. So 2 examples are DuckDB and Velox, which is a new project started by Meta. And so we reasoned that, you know, there are other developers working on other projects that are, you know, part of this greater vision of building a more modular and composable data analytics stack, and Arrow has become a central, you know, key part of that by providing this language agnostic, you know, sort of let's call it a data fabric for connecting systems and computing on.
And so by making it, you know, straightforward for, you know, computing engines like DuckDB or Velox to connect to other systems which use Arrow, that's in everybody's best interest, making it easier for user interfaces. So for example, you know, for many years, like, we've been building this project, Ibis, which provides a scale independent data frame API, engine independent data frame API for Python, building an interface between that and Substrait, and then working on a Substrait interface to these different compute engines. So we make it easy to, you know, generate Substrait once and then execute it in any of these different computing back ends. So we enable engine choice for the developer.
You know, we've been making, you know, over the last year and a half since we started the company, you know, we've been growing our support program, working with customers there, working on partnerships around, you know, areas of mutual interest in hardening and growing the Arrow ecosystem and improving these, you know, standards and protocols for interoperability so that we, you know, can accelerate towards this more kind of modular computing stack for building large scale data analytics systems.
[00:30:50] Unknown:
Another interesting aspect of what you're offering with Arrow is that it is very optimized for tabular data, which is a substantial portion of what people are trying to perform analysis on. But with the growth of machine learning and more scalable and capable compute frameworks, there has been an increase in usage of other formats of data, whether that's unstructured data such as binaries or images or videos or semi structured or document style data or even multidimensional data. And I'm wondering how you see the role of Arrow in that avenue of either being able to accommodate some of those of the different data types or being able to cooperate with run times that are trying to maybe enrich either those unstructured data assets with tabular information or, you know, enrich tabular information with metadata from those unstructured assets?
[00:31:51] Unknown:
It's a great question. I mean, it's true that, you know, columnar tabular data processing is the bread and butter of many systems, and a lot of the advances in computing efficiency have come through, you know, better use of SIMD instructions, you know, just just better, you know, utilization of modern CPUs, and Arrow definitely was designed to enable that, enable, you know, more efficient vectorized data processing. It's also facilitated use of, you know, GPUs and has been used productively to do accelerated processing on GPUs and FPGAs. But there are these other types of data that are non tabular.
And so 1 thing that we've seen is embedding unstructured data in Arrow data structures, so images or text, and kind of building what you could describe as a hybrid structure. It's like a table that contains unstructured data. Just to give an example, so, you know, my former cofounder and early pandas developer, Chang She, has got a new project called Lance, which is a computer vision stack that is Arrow native and enables, you know, training and model scoring on image datasets that are all represented in an Arrow native fashion and then stored out in storage as Parquet files.
You know, you can use DuckDB as an engine to deal with large image datasets. And so the, you know, image scoring functions, you know, training and scoring have to be represented as user defined functions, which get run against these images, which are embedded in Arrow tabular data structures. So we've seen, you know, successful use of, you know, hybrid structured and unstructured datasets. That being said, like, Arrow is not gonna be a fit for everything. Like, there's workloads where, you know, fundamentally, you're dealing with a, you know, very large tensor. So you could have an Arrow dataset that has a column where every cell in the column is a tensor, and we've seen people do that.
But, like, Arrow is not gonna be a fit for a 100% of use cases. That's just the nature of the beast. Like, not everything is a table.
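A hedged sketch of the hybrid "table that contains unstructured data" pattern described above: encoded image bytes live in a binary column alongside ordinary tabular metadata, so the whole dataset can be filtered, handed to UDFs, and round-tripped through Parquet. The file name, labels, and byte contents are invented for illustration.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "image_id": pa.array(["img_001", "img_002"]),
    "label": pa.array(["cat", "dog"]),
    # Raw encoded image bytes embedded directly in an Arrow binary column
    "image_bytes": pa.array([b"\x89PNG...", b"\xff\xd8\xff..."], type=pa.binary()),
})

# The hybrid table round-trips through Parquet like any other tabular dataset
pq.write_table(table, "images.parquet")
print(pq.read_table("images.parquet").schema)
```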
[00:34:06] Unknown:
Data engineers don't enjoy writing, maintaining, and modifying ETL pipelines all day every day, especially once they realize that 90% of all major data sources like Google Analytics, Salesforce, AdWords, Facebook, and Spreadsheets are already available as plug and play connectors with reliable intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from over 40 countries to set up and run low latency ELT pipelines with 0 maintenance. Boasting more than 150 out of the box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get real time data flow visibility with fail safe mechanisms and alerts if anything breaks, preload transformations and auto schema mapping precisely control how data lands in your destination, models and workflows to transform data for analytics, and reverse ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24/7 live support makes it consistently voted by users as the leader in the data pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata today and sign up for a free 14 day trial that also comes with 24/7 support.
As far as the business model of Voltron Data, you mentioned that you are support focused, and I'm curious if you can just talk through some of the ways that you think about working with and on the Arrow ecosystem and being able to build a viable and scalable business, and also because of the fact that a significant portion of your work is focused on Arrow, how that translates into the way that you think about the marketing and customer education about what you're doing?
[00:35:52] Unknown:
We have been spending a lot of time and energy on customer education and developer relations, documentation, generating, you know, if you look at our website, voltrondata.com, like, we've been putting out a pretty steady cadence of, you know, content to help with educating the customer base, or, I guess, to be more accurate, the user base, about different layers of Arrow and other projects in the, quote, unquote, you know, Arrow Cinematic Universe, I would say. We've also been running an Arrow oriented conference called The Data Thread. We had our first back in June. We had an amazing, you know, set of talks showcasing the work of the community, you know, Arrow users and the kinds of things they're building, the systems that they're building, and solutions.
We have another edition, The Data Thread 2023, in February. You know, very excited about that. As far as our business model and the company, you know, in the short term, we're really focused on hardening some of these fundamental technologies, which, you know, we think are going to power the data analytics, you know, machine learning preprocessing, and data engineering stack for the coming decade or 2. So things like, you know, as we've been discussing, things like Substrait, core Apache Arrow itself, and some of the user interface layer projects like Ibis.
You know, those are, you know, essential building blocks for, you know, enterprises globally to become more Arrow native. That's been our, you know, primary focus, educating the world about how to take advantage of the work that's happening in the Arrow ecosystem, hardening, you know, these fundamental projects, working with partners to accelerate adoption and integration of Arrow, and supporting large enterprises that are building on Arrow through our, you know, our enterprise subscription program. So in the short term, that's our focus. You know, we have raised a significant amount of venture capital, and so we need to build a, you know, a large scalable business.
And, you know, we look forward to doing that over the coming years. But we are surfing on a a very big wave that is, you know, disrupting and changing the landscape of data systems. And so, you know, our strategy right now is oriented at accelerating, you know, the growth and the size and speed of of that wave.
[00:38:12] Unknown:
In your work of helping to create the Arrow project, and now that you're focused on growing and expanding its capabilities and the surrounding ecosystem and working with your customers on these support capabilities, what are some of the most interesting or innovative or unexpected ways that you've seen the Arrow project and its related ecosystem applied?
[00:38:34] Unknown:
I would say there hasn't been anything super surprising, but I think what's been really interesting is seeing, like, how the early adopters you know, I think there's plenty of companies and developers and users who are in the mode of being Arrow curious. Like, they've learned about the project. They've seen content over the last several years. They've seen the, you know, growing trends in use and people talking about the project. But then you have, you know, companies that have essentially already adopted the, you know, Arrow religion, so to speak, and have spent, you know, 2, 3 years or more building systems that are Arrow native.
And to see the business impact of that in terms of, you know, lower resource utilization, you know, systems that are more interoperable, have just better efficiency, better performance, lower latency. And I think there's this system turnover problem where companies are replacing their last generation of internal systems that they've built with new systems that are using Arrow. And so there's a certain sense of loss. Like, you know, there was many, you know, developer years of time spent building systems, you know, 7 years ago or 10 years ago. And so now there's this activation energy of building new systems, which are built with this new, you know, more efficient computing stack.
But once these systems start to come together and the business starts seeing, you know, a return on that investment, it's really very exciting just to see people's, you know, computing or data platforms, data infrastructure become great deal simpler and more efficient, it's just really validating to me having spent such a large, you know, fraction of my life working on this project to see the dream of the Arrow project and its potential in large scale data platforms become a reality.
[00:40:25] Unknown:
In your own work of building this business and helping to create this project and grow it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:40:36] Unknown:
I mean, I would say that, you know, building large open source projects that become depended on by a large fraction of the ecosystem is always very difficult. Yeah, I think the early bootstrapping of the Arrow project was definitely not easy. It required, you know, a lot of, you know, personal and professional sacrifices, you know, on my part. And so I was lucky to have some passionate, you know, true believers supporting the work. So, for example, folks over at Two Sigma, you know, who employed me and then also were supporters of Ursa Labs, you know, for a couple of years before we, you know, became Voltron Data. Folks at RStudio, which is now Posit, who saw the potential for building a more polyglot data science stack.
And so I think the lessons learned are that relationships really matter. It's not just about writing code and pushing code to GitHub. Like, the social dimension of building these types of projects is the most difficult, but also the most important if your goal is to build something that has, you know, large scope and that you want to be sustainable over a long period of time. We still have, you know, social and sustainability challenges in the Arrow project. Like, at Voltron Data, we're taking on a large amount of maintenance and systems burden supporting the Arrow project, which has been, you know, great in the sense that we're pumping out releases. Like, we're improving the CI/CD infrastructure.
You know, our testing and continuous delivery for the project is better than ever. But then, you know, other people in the open source project would be justifiably concerned about, you know, can we count on Voltron Data to be around and providing this level of support forever? So there's a justifiable suspicion about companies being involved and large companies being involved in open source projects. But I think, you know, our goal is to always do right by the community. You know, I've always been very, very community minded in thinking about building these projects. So it's been interesting and challenging and stressful at times, but it's also very rewarding. So, you know, ultimately, like, we see the Arrow project as being an agent of change and progress in the open source ecosystem. So we're excited to keep rolling the ball forward and supporting growth of the ecosystem and making sure that, you know, the developers and the users can be successful building on this new computing stack.
[00:43:07] Unknown:
Given the fact that Arrow isn't even necessarily the kind of end user selected utility, this question might be nonsensical. But what are the cases where Arrow or its related projects might be the wrong choice?
[00:43:21] Unknown:
I think a question that we answer a lot is whether, you know, Arrow is a storage format or a format for data warehousing. And Arrow is not designed to be a competitor with or a replacement for Parquet, for example. And so, you know, sometimes people do come to the project thinking like, oh, I've heard about Arrow. It's a data format. Right? So can I, you know, use it to build my data warehouse or build my data lake? And so, you know, occasionally, there's, you know, some confusion around the purpose of the project. But I think as we've improved, you know, our developer content and helping folks understand about how, you know, we're building this companion technology to storage, you know, storage systems like, you know, file formats like Parquet and, you know, large scale metadata management, you know, large scale dataset systems like Iceberg.
I think that's becoming more clear to users. And, certainly, like, there's people doing data engineering or, you know, machine learning that is primarily dealing with, you know, text or unstructured data. In some of those instances, you know, Arrow may not provide a lot of value depending on the nature of the work. But fortunately, a lot of the data that's processed in the world is fundamentally tabular or at least representable in a tabular format. Most, you know, data generated by modern web applications, mobile applications can be, you know, represented and processed in a tabular format.
And so even though, you know, we don't strive to be all things to all people, there's a large fraction of, you know, data analytics or data engineering where Arrow is a relevant technology that can make things, you know, faster, simpler, more efficient.
[00:45:05] Unknown:
As you continue to build and iterate on the Arrow project and invest in that ecosystem and help to grow the degree of integrations that are available, what are some of the things you have planned for the near to medium term or any projects you're excited to dig into?
[00:45:20] Unknown:
Right now, you know, we talked a lot about Substrait. I'm very pumped about that. Another, you know, project that I'm really excited about is we've got this effort in Arrow called Nano Arrow, which is building a small implementation of the Arrow format and protocol for embedded use. So if you have a system like a database or, you know, like a microservice or, you know, it could be really anything where you want to add the ability to send and receive Arrow data, but you don't want to take on new library dependencies. This is a project that can be, you know, dropped in and copied into a project in principle in any programming language as long as you have C, you know, the ability to call C code. And so we think that that will help expand the adoption of Arrow into places where it has not reached yet. We're pretty excited about that project, Nano Arrow. Also, really excited about ADBC, like, the standardized database API to be used alongside existing JDBC and ODBC interfaces for talking to databases.
But I think I've always had the desire to make it easier to talk to databases and for applications and users to not have to write so much custom code to just get data in and out of SQL databases. And so I think that the ADBC effort gives us a path to, you know, making that a reality so that we can just think about tables and data frames and not so much about, you know, how do I, you know, translate between this database's wire protocol and my, you know, data frame, data structure. Because god knows, you know, I've written and, you know, folks in all of these ecosystems, you know, we've all written a ton of code just dealing with converting between data formats. And so I'm looking forward to a day when we won't have to think about that. We'll have written some of our last data connectors, and we can just think about Arrow, and that will make our lives a lot easier. It's such a great experience for new programmers to have to figure out how to reconstitute the data that they get back from the database with the column names and make sure that they're matched up properly. You're gonna rob them of that experience? That's right. I get that it's, you know, it's like a code kata. It's like a almost a rite of passage to have to write converters between, you know, the data that comes out of the database in your application. But, you know, I think we've reached a point where I think our efforts as programmers would be best reserved for other for other challenges. Absolutely. Yeah. No. I
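A hedged sketch of the ADBC goal described here: a standard DBAPI-style driver interface that hands results back as Arrow data, so no hand-written row-to-dataframe conversion code is needed. This uses the SQLite driver package (adbc_driver_sqlite) purely for illustration; other drivers are assumed to follow the same API, and the table and values are made up.

```python
import adbc_driver_sqlite.dbapi

# Connect to an in-memory SQLite database via the ADBC driver
with adbc_driver_sqlite.dbapi.connect() as conn:
    cur = conn.cursor()
    cur.execute("CREATE TABLE t (id INTEGER, name TEXT)")
    cur.execute("INSERT INTO t VALUES (1, 'arrow'), (2, 'adbc')")
    cur.execute("SELECT * FROM t")
    table = cur.fetch_arrow_table()  # results come back as an Arrow table
    print(table)
```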
[00:47:46] Unknown:
I don't think anyone will miss having to go through that exercise. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:05] Unknown:
I'm not totally sure about this. But, I mean, I think, you know, we're still in a state of learning and change when it comes to how best to build and manage very large data lakes. I'm really hopeful about Apache Iceberg, which came out of Netflix and is 1 of the next generation approaches for large scale dataset management. I think that the sooner we can settle on standards for scalable open data warehousing, so to speak, the more that helps someone like me who's more focused on computing engines and user interfaces.
You know, how we get access to the data, how we store and manage the data becomes less of a moving target. And so as the world becomes increasingly standardized on Iceberg, for example, and file formats like Parquet, that simplifies the problem for the engine and user interface developers to make an end to end stack for developers, you know, where the choices are much more straightforward and there's less fragmentation and waste in the stack.
[00:49:10] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Voltron Data and your experience working on the Arrow project and helping to illuminate some of the ways that it is being used and the surrounding projects and its growth in the ecosystem. So I appreciate all the time and energy that you and the other members of the Voltron Data and Arrow teams are putting in. So thank you again for your efforts there, and I hope you enjoy the rest of your day. Thanks. Thanks for having me.
[00:49:44] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Wes McKinney's Journey in Data
The Birth of Voltron Data
Vision and Impact of Apache Arrow
Engineering Productivity and Compute Efficiency
Current Capabilities and Ecosystem of Arrow
Architecture and Interoperability of Arrow
Balancing Expansion and Integration
Arrow's Role in Unstructured Data
Voltron Data's Business Model
Innovative Applications of Arrow
Lessons Learned in Building Arrow
When Arrow Might Not Be the Right Choice
Future Plans for Arrow and Voltron Data
Biggest Gaps in Data Management Tooling