Summary
The data ecosystem has been growing rapidly, with new communities joining and bringing their preferred programming languages to the mix. This has led to inefficiencies in how data is stored, accessed, and shared across process and system boundaries. The Arrow project is designed to eliminate wasted effort in translating between languages, and Voltron Data was created to help grow and support its technology and community. In this episode Wes McKinney shares the ways that Arrow and its related projects are improving the efficiency of data systems and driving their next stage of evolution.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
- Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
- Your host is Tobias Macey and today I’m interviewing Wes McKinney about his work at Voltron Data and on the Arrow ecosystem
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what you are building at Voltron Data and the story behind it?
- What is the vision for the broader data ecosystem that you are trying to realize through your investment in Arrow and related projects?
- How does your work at Voltron Data contribute to the realization of that vision?
- What is the impact on engineer productivity and compute efficiency that gets introduced by the impedance mismatches between language and framework representations of data?
- The scope and capabilities of the Arrow project have grown substantially since it was first introduced. Can you give an overview of the current features and extensions to the project?
- What are some of the ways that Arrow and its related projects can be integrated with or replace the different elements of a data platform?
- Can you describe how Arrow is implemented?
- What are the most complex/challenging aspects of the engineering needed to support interoperable data interchange between language runtimes?
- How are you balancing the desire to move quickly and improve the Arrow protocol and implementations, with the need to wait for other players in the ecosystem (e.g. database engines, compute frameworks, etc.) to add support?
- With the growing application of data formats such as graphs and vectors, what do you see as the role of Arrow and its ideas in those use cases?
- For workflows that rely on integrating structured and unstructured data, what are the options for interaction with non-tabular data? (e.g. images, documents, etc.)
- With your support-focused business model, how are you approaching marketing and customer education to make it viable and scalable?
- What are the most interesting, innovative, or unexpected ways that you have seen Arrow used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Arrow and its ecosystem?
- When is Arrow the wrong choice?
- What do you have planned for the future of Arrow?
Contact Info
- Website
- wesm on GitHub
- @wesmckinn on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Voltron Data
- Pandas
- Apache Arrow
- Partial Differential Equation
- FPGA == Field-Programmable Gate Array
- GPU == Graphics Processing Unit
- Ursa Labs
- Voltron (cartoon)
- Feature Engineering
- PySpark
- Substrait
- Arrow Flight
- Acero
- Arrow DataFusion
- Velox
- Ibis
- SIMD == Single Instruction, Multiple Data
- Lance
- DuckDB
- Data Threads Conference
- Nano-Arrow
- Arrow ADBC Protocol
- Apache Iceberg
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Atlan: ![Atlan](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/ys762EJx.png) Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating slack messages trying to find the right data set? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos themselves, and started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to [dataengineeringpodcast.com/atlan](https://www.dataengineeringpodcast.com/atlan) and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- MonteCarlo: ![Monte Carlo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/Qy25USZ9.png) Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit [dataengineeringpodcast.com/montecarlo](https://www.dataengineeringpodcast.com/montecarlo) to learn more.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a t l a n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode.
With their new managed database service, you can launch a production ready MySQL, Postgres or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Wes McKinney about his work at Voltron Data and on the Arrow project and its surrounding ecosystem. So, Wes, can you start by introducing yourself?
[00:01:37] Unknown:
Yeah. Sure. Thanks for having me. I'm Wes McKinney. Many people know me as the creator of the Python pandas project, which I started almost 15 years ago, but over the last 7 years, I've been primarily focused on the Apache Arrow project and the surrounding open source ecosystem. More recently, I'm the CTO and cofounder of Voltron Data, a data analytics startup where we are offering enterprise support and services around Apache Arrow and doing a substantial amount of open source development in the ecosystem. And do you remember how you first got started working in data? I've told the story many times, but I was working in quantitative finance right out of college. I had a math degree, and I thought I was gonna be doing math and solving partial differential equations and that sort of thing, but it turned out that I was mostly doing data analysis and writing SQL queries and using data frames and things like that.
And so I started to get interested in the tools for doing data analysis because I wanted to make myself more productive because I found my job to be quite tedious and working with data to be much more difficult than I thought it should be. And so for me, it started out as a personal challenge to see if I could create tools for myself to enhance my productivity. And I found that I enjoyed building tools, and I, you know, became very passionate about open source software. And, you know, I love working in the community and building projects and helping progress happen faster.
[00:03:09] Unknown:
In terms of the Voltron Data business and the kind of focus of it, I'm wondering if you can give some of the overview and some of the story behind how it came to be.
[00:03:21] Unknown:
The Apache Arrow project started we got the initial group of developers together in 2015 to start the project, formally launched it as a top level project in the Apache Software Foundation in 2016. And we set about, you know, growing the different layers of the stack. And as time went by, we started to observe, you know, more general trends in the interplay between programming languages, you know, data storage, data access, and kind of the the data analytics stack, and the role of, you know, the evolution of hardware and computing hardware. So in particular, things like graphics cards, you know, GPUs, FPGAs, and custom silicon.
There were many different groups of developers working on different layers of the stack in and around the Apache Arrow ecosystem. We saw an opportunity to build a unified computing company, bringing together, you know, several of those groups of people. So, respectively, you know, myself and my team from Ursa Labs, which became Ursa Computing, a group of leadership from the RAPIDS project, which had been started at NVIDIA, and the BlazingSQL project, which is a SQL engine built on top of RAPIDS. So we reasoned that we could build a more integrated and more successful company, you know, working together under 1 roof than pursuing, trying to grow our different slices of the pie, so to speak. So that's how the company came together at the beginning of last year, you know, to build a large team.
Thankfully, we were able to assemble quite a bit of investor capital before the market turned south earlier this year. You know, we've been really just heads down building for the last year and a half, which has been really, really exciting. 1 of the, I guess, kind of meta notes that I'm curious about is how you settled on the name of Voltron as the company name, and how often people wonder what that is in reference to? You know, we like the name Voltron Data. You know, we wanted to evoke, you know, the feeling of, you know, what we're building being something where, you know, the whole is greater than the sum of its parts. And, you know, I think the mission of the company, kind of the, you know, the heart or the soul of the company is making the modern data analytics stack more modular and composable to make it easier for developers and users of analytics or data engineering tools to unlock the value of modern hardware and to take advantage of advances in computing capabilities, you know, as they become available.
And so I think we've seen, you know, in the world of machine learning and AI, you know, deep learning training, machine learning training, that sort of thing, we've seen, you know, significant change to the technology landscape through use of hardware acceleration, you know, through GPUs, and now we're seeing TPUs and custom chips for accelerating machine learning. The same kind of innovation and improvement in computing efficiency can be brought to the other layers of the data processing stack, analytics, machine learning preprocessing, you know, ETL.
You know, we are, you know, we're really focused on improving the, you know, protocols and standards, like the fundamental technologies that enable that kind of modularity and composability at the kind of, you know, language, data, and hardware level. And so if, you know, developers observe, you know, the work that we're doing not only in Apache Arrow, but in some of the surrounding projects, like Substrait and Ibis, which we can dig more into in this podcast, you can see how we are working on, you know, really hardening, like, these interfaces and protocols between the different layers of the stack to make it easier for developers to, you know, swap out components and develop in a more kind of framework agnostic or, you know, engine agnostic fashion, if that makes sense.
[00:07:28] Unknown:
As far as the broader vision of Arrow, you know, it has these immediate benefits of being able to operate as an interchange format between different languages and run times and frameworks, and it has been growing in terms of its scope and its capabilities. And I'm curious if you have any overarching vision for Arrow and its potential impact on the broader data ecosystem and some of the ways that the work that you're doing at Voltron is aimed at helping to bring forth the realization of that vision.
[00:08:06] Unknown:
You know, going back 6, 7 years, when we started the Arrow project, I did always have the aspiration of building a more modern computing foundation for data frames and tabular data processing. And so for me, like, expanding the scope of, you know, what we call Apache Arrow has always been, you know, something that I've been really motivated to do. But when we started the project, we had to start small. Like, can we, as a community, come to an agreement around how we represent tabular data in a framework and language agnostic fashion, such that we can achieve this concept of, like, a universal data frame, which can be used portably across computing frameworks, programming languages, different processing environments so that we can have a basis for beginning to think about that kind of, you know, frictionless modularity and composability at the data level. Once we did that, we had to move on to building the other layers of the stack, which are necessary to build Arrow native applications. And so that's, you know, the data serialization, building RPC for moving around data efficiently in a distributed system.
More recently, you know, we've been looking at the protocols and interfaces for interacting with databases in an Arrow native way. And so we've got subprojects which are specifically focused on integrating Arrow more natively into database systems. So we make it easier to push Arrow based datasets in and out of databases. And so the other dimension is not only having a universal data format, protocols and interfaces for moving it around, and protocols for connecting systems together in an Arrow native way. But we also needed to build computing engines to process Arrow data so that we can embed them into different systems to do, you know, data cleaning, data preparation, feature engineering for machine learning, analytics, all those things that you would do with a, you know, SQL engine or a data frame library, that sort of thing. And so as time has passed, the work in the Arrow project has moved away from building these fundamental protocols and interfaces to more of the, you know, modular, embeddable compute engine development, which has been really, really exciting to see.
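As a minimal sketch of the "universal data frame" idea discussed here, the following uses the pyarrow library to build a columnar table and serialize it with the Arrow IPC stream format, so that any Arrow-aware runtime (Python, Java, Rust, and so on) can read the same bytes without per-language conversion. The column names and values are purely illustrative.

```python
import pyarrow as pa

# Columnar, language-agnostic in-memory table
table = pa.table({
    "user_id": pa.array([1, 2, 3], type=pa.int64()),
    "score": pa.array([0.5, 0.9, 0.1], type=pa.float64()),
})

# Serialize to the Arrow IPC stream format (the same wire representation
# used by Arrow-native transports such as Flight)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Any Arrow implementation can reconstruct an identical table from the bytes
reader = pa.ipc.open_stream(buf)
roundtripped = reader.read_all()
assert roundtripped.equals(table)
```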
[00:10:23] Unknown:
1 of the initial motivations for Arrow was to cut down on some of the inefficiencies of that data interchange. I think 1 of the most notable examples is using the PySpark library to interact with the Spark runtime and having to serialize and deserialize the data in between those interfaces, as well as having to translate between the representations of information between Java and Python. And I'm wondering if you can give an overview of the kind of types and scope of impact on engineering productivity and compute efficiency that the Arrow project and the kind of growth thereof is intended to address.
[00:11:09] Unknown:
I think the Spark example is a really motivating 1 because that was 1 of the first problems, practical problems that we focused on solving with the Arrow project was the problem of making Python on Spark a lot faster. And so if you look at, you know, as a user using PySpark versus using the Spark Java or Spark Scala APIs, there was a significant performance penalty in using Python whenever you wanted to extend Spark with custom Python code that might use Pandas or might use scikit learn or, you know, something else in Python ecosystem. So by defining a column oriented, you know, data format, which could be constructed on the JVM side inside the Spark runtime and then sent over to the Python side for executing custom code by having that not only a more efficient data format to move across, but also something that could be interacted with very cheaply on both sides without having additional conversion or serialization.
This was with my colleagues at Two Sigma and collaborators at IBM. We were able to make custom code running in PySpark 10 to a 100 times faster in some cases. Now, of course, like, you know, there are many workloads in Spark which have shifted to use the Spark DataFrame API where under the hood, you know, Python code, Java code, Scala code, it gets translated into effectively a SQL query, which gets run by Spark SQL. And so there's no need for data to ever be transferred into Python. But there still are plenty of use cases where it's necessary to run custom code, and Spark is used in many cases as a convenience layer for doing parallel and distributed computing with Python. But users shouldn't have to pay an enormous penalty to have that privilege. And so Arrow has really helped in reducing the overhead, the impedance, between those systems in those cases.
That being said, you know, Spark and Spark SQL are, you know, systems that have been around for a long time, but Spark SQL was built before Arrow existed. And so, internally, it is not, you know, an Arrow native system, so to speak. So, like, it represents the data that flows around Spark or inside Spark SQL in a data format that is not the same as the Arrow format. And so I think what's really interesting for thinking about the future is having Spark-like distributed systems for large scale tabular data processing that are fully Arrow native end to end. And so you have the ability to extend those systems with custom code written in principle in any programming language that knows about Arrow. And so we enable a much more, you know, kind of fair and consistent polyglot experience across the stack where no programming language is being unfairly penalized as a result of having to, you know, do expensive data serialization at the programming language boundary.
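A hedged sketch of the PySpark pattern described above: Arrow moves columnar batches between the JVM and Python so that pandas UDFs avoid row-by-row (de)serialization. The column names, session setup, and rescaling logic are assumptions for illustration only.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("arrow-udf-demo").getOrCreate()
# Use Arrow-based transfer for conversions between Spark and pandas
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.createDataFrame([(1, 0.5), (2, 0.9), (3, 0.1)], ["id", "score"])

@pandas_udf("double")
def rescale(scores: pd.Series) -> pd.Series:
    # Operates on whole Arrow-backed pandas batches, not Python rows
    return (scores - scores.mean()) / scores.std()

df.withColumn("scaled", rescale("score")).show()
```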
[00:14:18] Unknown:
Because of the fact that you aren't paying a penalty by virtue of the language or the runtime that you're choosing, I'm curious how you see that influence the decisions that engineering teams make as to how they want to compose their stack and compose their analyses and some of the ways that that reflects in terms of the skill sets that are necessary to be able to build and maintain these analytical systems.
[00:14:46] Unknown:
For me, what's exciting and motivating is for the users to have the choice and be able to choose the programming language and the types of APIs and user interfaces that make most sense for the systems that they are building and to have a more natural, you know, let's call it language integrated querying capability. So I think part of the challenge that we have as system developers is to make it easier for the programming language interfaces to evolve and innovate independently from the back end compute engines. And so, you know, our goal with what we've been building in Arrow is to take, you know, very fast Arrow native data processing and make that available in a form factor where it can go everywhere. So it can be, you know, deployed in, you know, heavily resource constrained environments where having, you know, very low latency, efficient tabular data processing in process is highly desirable.
But also that, using the same APIs and user interfaces that we use to do local small scale computing at the edge, so to speak, we can build descriptions of our workloads or our data transformations in a form where they can be serialized and sent into, you know, large clusters for, you know, doing larger scale data processing. And so that's 1 of the reasons that we've been investing pretty heavily in this new project called Substrait, which is building an intermediate representation for data analytics operations that can be used to connect user interfaces and computing engines on the back end.
So you can think about Substrait as being like something that's, you know, lower level than SQL and can be used to represent, you know, tabular data or data frame operations that go outside of what is expressible in SQL. And so it's our hope that by hardening the interface and making it straightforward for engine builders, you know, compute engine builders to focus on building a Substrait interface to their engine. And so then from the API developer standpoint, the user interface developer building Python libraries or Go libraries or Java libraries or Rust libraries.
At the user interface layer, we can just focus on generating Substrait rather than having to think about, well, how do I build an interface or an integration with a particular computing engine? Because then whenever there's a new computing engine that you wanna take advantage of, maybe to accelerate some part of your data processing workload, you've gotta build a new interface to that engine. And so by reducing the surface area of the problem to just, let's just think about the world in terms of this Substrait intermediate representation, that makes it so much easier for us as API developers to build the user experience because we just have this, like, 1 intermediate representation to generate. And then on the back end, you know, the engines can decide how to most efficiently execute the Substrait plan.
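A small sketch of the engine-agnostic idea discussed here, using the Ibis dataframe API with an embedded DuckDB backend; the table name and the local Parquet file path are assumptions. The same expression could be handed to a different backend, or lowered to a Substrait plan (for example via the ibis-substrait package) for any engine that consumes Substrait, without rewriting the query.

```python
import ibis

con = ibis.duckdb.connect()                  # embedded DuckDB backend, no server needed
events = con.read_parquet("events.parquet")  # hypothetical local file

expr = (
    events.filter(events.status == "ok")
    .group_by("user_id")
    .aggregate(total=events.amount.sum())
)

# The backend decides how to execute the plan; the API layer only describes it
result = expr.execute()  # returns a pandas DataFrame
print(result.head())
```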
[00:17:54] Unknown:
Are you struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end to end data observability platform. Trusted by the teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes.
Monte Carlo also gives you a holistic picture of data health with automatic end to end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more. As far as the overall scope of the Arrow project itself and what is actually contained within the code repository and also the growth of the broader ecosystem around it, they have definitely grown substantially, and I think the most recent release of Arrow right now is version 10. So definitely a lot of development happening there. And I'm wondering if you can give an overview of the current set of capabilities and the, I guess, features that are targeted with Arrow and the related projects beyond just that in memory columnar data representation.
[00:19:26] Unknown:
I mean, the way that we describe the project these days is we describe the Arrow project as being a multi language toolbox for building analytical data processing systems. And 1 important part of the project is the Arrow columnar data format. From there, there's, you know, a whole set of different software components, which enable you to do things with the Arrow data format. That includes data serialization, inter process communication, remote procedure calls, so building services that need to send and receive Arrow data. So there's a framework called Flight for building Arrow native data services.
We've started building some database middlewares for integrating Arrow into database systems. So there's a SQL database protocol project called Flight SQL, which provides basically a wire protocol for talking to SQL databases over gRPC using the Flight framework. Another project called ADBC, which is a standardized API for database drivers to provide Arrow native data access. So it's kind of orthogonal to Flight SQL, so it has nothing to do with the data protocol or the wire protocol. It's more about having a standardized API for inserting and selecting Arrow datasets from SQL based systems.
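A minimal Arrow Flight client sketch, assuming a Flight service is already running at the given location and understands the opaque ticket below (both are illustrative). Flight streams Arrow record batches over gRPC with no row-level serialization step.

```python
import pyarrow.flight as flight

client = flight.connect("grpc://localhost:8815")  # hypothetical endpoint

# Tickets are opaque byte strings the server hands out (e.g. via get_flight_info)
ticket = flight.Ticket(b"sales_2023")
reader = client.do_get(ticket)

table = reader.read_all()  # an Arrow table, streamed batch by batch
print(table.schema)
```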
We've got compute engines, so there's multiple computing engine projects. There's Acero, the C++ compute engine. There's the Rust DataFusion compute engine. So there's, you know, kind of embeddable compute engines in multiple programming languages designed for different use cases. You know, it's an increasingly, you know, diverse and federated community of subprojects. So not just the core Arrow 10.0. You know, probably, if you go to apache/arrow on GitHub, you'll see a, you know, very large polyglot Git repository.
But we've grown several other repositories that house a number of the Rust subprojects, and the Julia Arrow project lives in its own repository nowadays as well. And so we have some support for around a dozen programming languages. And within each programming language, we have, you know, a stack of libraries, which are there to make it possible for you to build systems that use the Arrow format or connect to other systems that use Arrow. Of course, some of those libraries are at different levels of maturity. So the Rust libraries and the C++ libraries are generally the most featureful and mature, but we're growing an increasing, you know, amount of support in Go and Java. You know, initially, the project started out as just C++ and Java, but it's expanded significantly since then. It can be a little bit difficult for a newcomer to navigate, but I think the community has around a 1000 developers. Maybe around a 1000 different people have contributed to the project over the last 7 years. So the developer community, we've done a good job or we've put in a lot of effort, I should say, to make the project accessible to new contributors. So that's through, you know, developer documentation and, you know, efforts to, you know, engage and grow the open source community around it. So it's not just a small, you know, insular group of developers building all of these things, but that we're actively trying to make the developer community larger to share the burden of maintaining all of these different software components that have to have, you know, bug fixes and security fixes and make releases periodically.
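A hedged illustration of the embeddable compute layer mentioned above: the pyarrow compute functions, backed by the C++ kernels, operate directly on Arrow columnar data in process, with no external engine required. The data is made up for illustration.

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({
    "region": ["east", "west", "east", "west"],
    "sales": [100, 250, 75, 300],
})

# Vectorized filter and aggregation executed by the Arrow C++ kernels
east = table.filter(pc.equal(table["region"], "east"))
totals = table.group_by("region").aggregate([("sales", "sum")])
print(east)
print(totals)
```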
[00:23:02] Unknown:
Digging more into the implementation of the Arrow project, I'm wondering if you can talk through the actual architecture of the code and some of the ways that you're able to validate conformance with the specification across those multiple different languages that are implementing an interface to it.
[00:23:25] Unknown:
As I was saying, the project is fairly federated in the sense that the different subprojects, their set of features, and, you know, what the developers are focusing on at any given time, they evolve somewhat independently of each other. But there is the commonality of the Arrow memory format and the protocols and interfaces for interoperability. So 1 of the first things we did when we started the project was establish a harness for integration testing between implementations. So, for example, in C++ we have a set of C++ classes and different tools for dealing with memory allocation and construction of Arrow tabular in-memory data.
And we have a similar set of classes and interfaces in Java, and we needed to define a procedure for a Java application and a C++ application to determine that they understand each other's data. That you say, here's my Arrow data. Do you agree that what I think the data is is the same as what you think it is? And so we devised a test harness where we generate a point of truth version of the dataset in JSON for both applications. You know, the integration test harness parses that JSON, constructs the corresponding Arrow version of that, and then compares that to the binary kind of serialized representation of the Arrow dataset to determine whether it is identical.
So that enabled us to show compatibility between different implementations. So that integration test harness has grown, you know, to encompass several implementations. And so for a new implementation of Arrow, the first target for showing compatibility is participating in those integration tests. But, you know, the integration tests have expanded to include other things like Flight, which is the RPC framework built on top of gRPC for building data services that send and receive Arrow data. So there's a set of integration tests for those.
That's the main way that we verify interoperability or that implementations are doing the same thing. But within the different programming languages, like the architecture of the projects has evolved fairly independently, there's a different extent to which the implementations rely on external libraries for solving certain problems. So, for example, in C++ we've been developing a subproject called the datasets API or datasets framework for a number of years where we enable users to interact with large datasets that are spread across, for example, partitioned directories of Parquet files in S3. And within the project, like, we've had to build a C++ interface to the S3 API.
You know, we developed the Parquet C++ interface. There's a lot of supporting code for dealing with, you know, asynchronous interactions with remote datasets that we've had to develop. But if you go and look at Rust or Java, the libraries for doing these same sorts of things, there's a different level of, you know, reliance on external libraries. Rust uses more external libraries for some of the things where we've had to build, you know, develop, you know, homegrow some libraries and tools within C++ because there weren't off the shelf libraries available.
So in general, like, our mantra is providing a batteries included experience for developers. And so we think about, like, you know, just thinking from the mindset of somebody doing data engineering or building, you know, building an analytics stack, and we think about, like, what problems is that developer or that user going to need to solve. And so rather than, you know, leaving developers to cobble together solutions, we would rather that there be, like, a good out of the box solution for some of these, like, you know, run of the mill, like, everyday, you know, data engineering workflows. So, like, for example, anything having to do with, like, interacting with a large Parquet dataset in cloud storage. Like, we wanna make that really easy for a developer and for using Arrow to be the fastest, like, most efficient way to build their system.
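A sketch of the datasets API described above: point pyarrow at a directory of Hive-partitioned Parquet files (the S3 bucket, partition keys, and column names here are hypothetical, and S3 access assumes credentials are configured) and scan it lazily, pushing down filters and column selection.

```python
import pyarrow.dataset as ds

dataset = ds.dataset(
    "s3://my-bucket/events/",   # partitioned directory of Parquet files
    format="parquet",
    partitioning="hive",        # e.g. .../year=2023/month=5/part-0.parquet
)

# Only the matching partitions/row groups and requested columns are read
table = dataset.to_table(
    columns=["user_id", "amount"],
    filter=(ds.field("year") == 2023) & (ds.field("month") == 5),
)
print(table.num_rows)
```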
[00:27:41] Unknown:
And you mentioned with the Substrait project, 1 of your efforts to reduce the level of effort necessary to be able to add Arrow's capabilities and benefits to the end user experience and being able to integrate with different components of a potential data stack. And I'm curious with that in mind, but also with some of the other projects that you mentioned, how you think about balancing the desire to be able to move fast and expand the reach and capabilities of Arrow with the need to wait for some of these other products and frameworks and projects to actually do the work of integrating with Arrow and maybe the Substrait project and some of the ways that you're
[00:28:26] Unknown:
collaborating with and maybe incentivizing some of these other engines and run times to do that work? You know, as a company, Voltron Data, I think we're lucky to be in a position where we can help accelerate some of this work, like the adoption and integration work. We've been pursuing that through our enterprise subscription program, which is basically an Arrow development and support partnership. So working with companies that are building on Arrow. We've also made some strategic partnerships with projects that are outside of the Arrow ecosystem where we want to invest in Arrow ecosystem integration. So 2 examples are DuckDB and Velox, which is a new project started by Meta. And so we reasoned that, you know, there are other developers working on other projects that are, you know, part of this greater vision of building a more modular and composable data analytics stack, and Arrow has become a central, you know, key part of that by providing this language agnostic, you know, sort of let's call it a data fabric for connecting systems and computing on.
And so by making it, you know, straightforward for, you know, computing engines like DuckDB or Velox to connect to other systems which use Arrow, that's in everybody's best interest, making it easier for user interfaces. So for example, you know, for many years, like, we've been building this project, Ibis, which provides a scale independent data frame API, engine independent data frame API for Python, building an interface between that and Substrait, and then working on a Substrait interface to these different compute engines. So we make it easy to, you know, generate Substrait once and then execute it in any of these different computing back ends. So we enable engine choice for the developer.
You know, we've been making, you know, over the last year and a half since we started the company, you know, we've been growing our support program, working with customers there, working on partnerships around, you know, areas of mutual interest in hardening and growing the Arrow ecosystem and improving these, you know, standards and protocols for interoperability so that we, you know, can accelerate towards this more kind of modular computing stack for building large scale data analytics systems.
[00:30:50] Unknown:
Another interesting aspect of what you're offering with Arrow is that it is very optimized for tabular data, which is a substantial portion of what people are trying to perform analysis on. But with the growth of machine learning and more scalable and capable compute frameworks, there has been an increase in usage of other formats of data, whether that's unstructured data such as binaries or images or videos or semi structured or document style data or even multidimensional data. And I'm wondering how you see the role of Arrow in that avenue of either being able to accommodate some of those of the different data types or being able to cooperate with run times that are trying to maybe enrich either those unstructured data assets with tabular information or, you know, enrich tabular information with metadata from those unstructured assets?
[00:31:51] Unknown:
It's a great question. I mean, it's true that, you know, columnar tabular data processing is the bread and butter of many systems, and a lot of the advances in computing efficiency have come through, you know, better use of SIMD instructions, you know, just just better, you know, utilization of modern CPUs, and Arrow definitely was designed to enable that, enable, you know, more efficient vectorized data processing. It's also facilitated use of, you know, GPUs and has been used productively to do accelerated processing on GPUs and FPGAs. But there are these other types of data that are non tabular.
And so 1 thing that we've seen is embedding unstructured data in Arrow data structures, so images or text, and kind of building what you could describe as a hybrid structure. It's like a table that contains unstructured data. Just to give an example, so, you know, my former cofounder and early pandas developer, Chang She, has got a new project called Lance, which is a computer vision stack that is Arrow native and enables, you know, training and model scoring on image datasets that are all represented in an Arrow native fashion and then stored out in storage as Parquet files.
You know, you can use DuckDB as an engine to deal with large image datasets. And so the, you know, image scoring functions, you know, training and scoring have to be represented as user defined functions, which get run against these images, which are embedded in Arrow tabular data structures. So we've seen, you know, successful use of, you know, hybrid structured and unstructured datasets. That being said, like, Arrow is not gonna be a fit for everything. Like, there's workloads where, you know, fundamentally, you're dealing with a, you know, very large tensor. So you could have an Arrow dataset that has a column where every cell in the column is a tensor, and we've seen people do that.
But, like, Arrow is not gonna be a fit for a 100% of use cases. That's just the nature of the beast. Like, not everything is a table.
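A hedged sketch of the hybrid "table that contains unstructured data" pattern described above: encoded image bytes live in a binary column alongside ordinary tabular metadata, so the whole dataset can be filtered, handed to UDFs, and round-tripped through Parquet. The file name, labels, and byte contents are invented for illustration.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "image_id": pa.array(["img_001", "img_002"]),
    "label": pa.array(["cat", "dog"]),
    # Raw encoded image bytes embedded directly in an Arrow binary column
    "image_bytes": pa.array([b"\x89PNG...", b"\xff\xd8\xff..."], type=pa.binary()),
})

# The hybrid table round-trips through Parquet like any other tabular dataset
pq.write_table(table, "images.parquet")
print(pq.read_table("images.parquet").schema)
```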
[00:34:06] Unknown:
Data engineers don't enjoy writing, maintaining, and modifying ETL pipelines all day every day, especially once they realize that 90% of all major data sources like Google Analytics, Salesforce, AdWords, Facebook, and Spreadsheets are already available as plug and play connectors with reliable intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from over 40 countries to set up and run low latency ELT pipelines with 0 maintenance. Boasting more than 150 out of the box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get real time data flow visibility with fail safe mechanisms and alerts if anything breaks, preload transformations and auto schema mapping precisely control how data lands in your destination, models and workflows to transform data for analytics, and reverse ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24/7 live support makes it consistently voted by users as the leader in the data pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata today and sign up for a free 14 day trial that also comes with 24/7 support.
As far as the business model of Voltron Data, you mentioned that you are support focused, and I'm curious if you can just talk through some of the ways that you think about working with and on the Arrow ecosystem and being able to build a viable and scalable business, and also because of the fact that a significant portion of your work is focused on Arrow, how that translates into the way that you think about the marketing and customer education about what you're doing?
[00:35:52] Unknown:
We have been spending a lot of time and energy on customer education and developer relations, documentation, generating, you know, if you look at our website, voltrondata.com, like, we've been putting out a pretty steady cadence of, you know, content to help with educating the customer base, or, I guess, to be more accurate, the user base, about different layers of Arrow and other projects in the, quote, unquote, you know, Arrow Cinematic Universe, I would say. We've also been running an Arrow oriented conference called The Data Thread. We had our first back in June. We had an amazing, you know, set of talks showcasing the work of the community, you know, Arrow users and the kinds of things they're building, the systems that they're building, and solutions.
We have another edition, The Data Thread 2023, in February. You know, very excited about that. As far as our business model and the company, you know, in the short term, we're really focused on hardening some of these fundamental technologies, which, you know, we think are going to power the data analytics, you know, machine learning preprocessing, and data engineering stack for the coming decade or 2. So things like, you know, as we've been discussing, things like Substrait, core Apache Arrow itself, and some of the user interface layer projects like Ibis.
You know, those are, you know, essential building blocks for, you know, enterprises globally to become more Arrow native. That's been our, you know, primary focus, educating the world about how to take advantage of the work that's happening in the Arrow ecosystem, hardening, you know, these fundamental projects, working with partners to accelerate adoption and integration of Arrow, and supporting large enterprises that are building on Arrow through our, you know, our enterprise subscription program. So in the short term, that's our focus. You know, we have raised a significant amount of venture capital, and so we need to build a, you know, a large scalable business.
And, you know, we look forward to doing that over the coming years. But we are surfing on a a very big wave that is, you know, disrupting and changing the landscape of data systems. And so, you know, our strategy right now is oriented at accelerating, you know, the growth and the size and speed of of that wave.
[00:38:12] Unknown:
In your work of helping to create the Arrow project, and now that you're focused on growing and expanding its capabilities and the surrounding ecosystem and working with your customers on these support capabilities, what are some of the most interesting or innovative or unexpected ways that you've seen the Arrow project and its related ecosystem applied?
[00:38:34] Unknown:
I would say there hasn't been anything super surprising, but I think what's been really interesting is seeing, like, how the early adopters you know, I think there's plenty of companies and developers and users who are in the mode of being Arrow curious. Like, they've learned about the project. They've seen content over the last several years. They've seen the, you know, growing trends in use and people talking about the project. But then you have, you know, companies that have essentially already adopted the, you know, Arrow religion, so to speak, and have spent, you know, 2, 3 years or more building systems that are Arrow native.
And to see the business impact of that in terms of, you know, lower resource utilization, you know, systems that are more interoperable, have just better efficiency, better performance, lower latency. And I think there's this system turnover problem where companies are replacing their last generation of internal systems that they've built with new systems that are using Arrow. And so there's a certain sense of loss. Like, you know, there was many, you know, developer years of time spent building systems, you know, 7 years ago or 10 years ago. And so now there's this activation energy of building new systems, which are built with this new, you know, more efficient computing stack.
But once these systems start to come together and the business starts seeing, you know, a return on that investment, it's really very exciting just to see people's, you know, computing or data platforms, data infrastructure become great deal simpler and more efficient, it's just really validating to me having spent such a large, you know, fraction of my life working on this project to see the dream of the Arrow project and its potential in large scale data platforms become a reality.
[00:40:25] Unknown:
In your own work of building this business and helping to create this project and grow it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:40:36] Unknown:
I mean, I would say that, you know, building large open source projects that become depended on by a large fraction of the ecosystem is always very difficult. Yeah, I think the early bootstrapping of the Arrow project was definitely not easy. It required, you know, a lot of, you know, personal and professional sacrifices, you know, on my part. And so I was lucky to have some passionate, you know, true believers supporting the work. So, for example, folks over at Two Sigma, you know, who employed me and then also were supporters of Ursa Labs, you know, for a couple of years before we, you know, became Voltron Data. Folks at RStudio, which is now Posit, who saw the potential for building a more polyglot data science stack.
And so I think the lessons learned are that relationships really matter. It's not just about writing code and pushing code to GitHub. Like, the social dimension of building these types of projects is the most difficult, but also the most important if your goal is to build something that has, you know, large scope and that you want to be sustainable over a long period of time. We still have, you know, social and sustainability challenges in the Arrow project. Like, at Voltron Data, we're taking on a large amount of maintenance and systems burden supporting the Arrow project, which has been, you know, great in the sense that we're pumping out releases. Like, we're improving the CI/CD infrastructure.
You know, our testing and continuous delivery for the project is better than ever. But then, you know, other people in the open source project would be justifiably concerned about, you know, can we count on Voltron Data to be around and providing this level of support forever? So there's a justifiable suspicion about companies being involved and large companies being involved in open source projects. But I think, you know, our goal is to always do right by the community. You know, I've always been very, very community minded in thinking about building these projects. So it's been interesting and challenging and stressful at times, but it's also very rewarding. So, you know, ultimately, like, we see the Arrow project as being an agent of change and progress in the open source ecosystem. So we're excited to keep rolling the ball forward and supporting growth of the ecosystem and making sure that, you know, the developers and the users can be successful building on this new computing stack.
[00:43:07] Unknown:
Given the fact that Arrow isn't even necessarily the kind of end user selected utility, this question might be nonsensical. But what are the cases where Arrow or its related projects might be the wrong choice?
[00:43:21] Unknown:
I think a question that we answer a lot is whether, you know, Arrow is a storage format or a format for data warehousing. And Arrow is not designed to be a competitor with or a replacement for Parquet, for example. And so, you know, sometimes people do come to the project thinking like, oh, I've heard about Arrow. It's a data format. Right? So can I, you know, use it to build my data warehouse or build my data lake? And so, you know, occasionally, there's, you know, some confusion around the purpose of the project. But I think as we've improved, you know, our developer content and helping folks understand about how, you know, we're building this companion technology to storage, you know, storage systems like, you know, file formats like Parquet and, you know, large scale metadata management, you know, large scale dataset systems like Iceberg.
I think that's becoming more clear to users. And, certainly, like, there's people doing data engineering or, you know, machine learning that is primarily dealing with, you know, text or unstructured data. In some of those instances, you know, Arrow may not provide a lot of value depending on the nature of the work. But fortunately, a lot of the data that's processed in the world is fundamentally tabular or at least representable in a tabular format. Most, you know, data generated by modern web applications, mobile applications can be, you know, represented and processed in a tabular format.
And so even though, you know, we don't strive to be all things to all people, there's a large fraction of, you know, data analytics or data engineering where Arrow is a relevant technology that can make things, you know, faster, simpler, more efficient.
[00:45:05] Unknown:
As you continue to build and iterate on the Arrow project and invest in that ecosystem and help to grow the degree of integrations that are available, what are some of the things you have planned for the near to medium term or any projects you're excited to dig into?
[00:45:20] Unknown:
Right now, you know, we talked a lot about Substrait. I'm very pumped about that. Another, you know, project that I'm really excited about is we've got this effort in Arrow called Nano Arrow, which is building a small implementation of the Arrow format and protocol for embedded use. So if you have a system like a database or, you know, like a microservice or, you know, it could be really anything where you want to add the ability to send and receive Arrow data, but you don't want to take on new library dependencies. This is a project that can be, you know, dropped in and copied into a project in principle in any programming language as long as you have C, you know, the ability to call C code. And so we think that that will help expand the adoption of Arrow into places where it has not reached yet. We're pretty excited about that project, Nano Arrow. Also, really excited about ADBC, like, the standardized database API to be used alongside existing JDBC and ODBC interfaces for talking to databases.
But I think I've always had the desire to make it easier to talk to databases and for applications and users to not have to write so much custom code to just get data in and out of SQL databases. And so I think that the ADBC effort gives us a path to, you know, making that a reality so that we can just think about tables and data frames and not so much about, you know, how do I, you know, translate between this database's wire protocol and my, you know, data frame, data structure. Because god knows, you know, I've written and, you know, folks in all of these ecosystems, you know, we've all written a ton of code just dealing with converting between data formats. And so I'm looking forward to a day when we won't have to think about that. We'll have written some of our last data connectors, and we can just think about Arrow, and that will make our lives a lot easier. It's such a great experience for new programmers to have to figure out how to reconstitute the data that they get back from the database with the column names and make sure that they're matched up properly. You're gonna rob them of that experience? That's right. I get that it's, you know, it's like a code kata. It's like a almost a rite of passage to have to write converters between, you know, the data that comes out of the database in your application. But, you know, I think we've reached a point where I think our efforts as programmers would be best reserved for other for other challenges. Absolutely. Yeah. No. I
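A hedged sketch of the ADBC goal described here: a standard DBAPI-style driver interface that hands results back as Arrow data, so no hand-written row-to-dataframe conversion code is needed. This uses the SQLite driver package (adbc_driver_sqlite) purely for illustration; other drivers are assumed to follow the same API, and the table and values are made up.

```python
import adbc_driver_sqlite.dbapi

# Connect to an in-memory SQLite database via the ADBC driver
with adbc_driver_sqlite.dbapi.connect() as conn:
    cur = conn.cursor()
    cur.execute("CREATE TABLE t (id INTEGER, name TEXT)")
    cur.execute("INSERT INTO t VALUES (1, 'arrow'), (2, 'adbc')")
    cur.execute("SELECT * FROM t")
    table = cur.fetch_arrow_table()  # results come back as an Arrow table
    print(table)
```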
[00:47:46] Unknown:
I don't think anyone will miss having to go through that exercise. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:05] Unknown:
I'm not totally sure about this. But, I mean, I think, you know, we're still in a state of learning and change when it comes to how best to build and manage very large data lakes. I'm really hopeful about Apache Iceberg, which came out of Netflix and is 1 of the next generation approaches for large scale dataset management. I think that the sooner we can settle on standards for scalable open data warehousing, so to speak, the more that helps someone like me who's more focused on computing engines and user interfaces.
You know, how we get access to the data, how we store and manage the data becomes less of a moving target. And so as the world becomes increasingly standardized on Iceberg, for example, and file formats like Parquet, that simplifies the problem for the engine and user interface developers to make an end to end stack for developers, you know, where the choices are much more straightforward and there's less fragmentation and waste in the stack.
[00:49:10] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Voltron Data and your experience working on the Arrow project and helping to illuminate some of the ways that it is being used and the surrounding projects and its growth in the ecosystem. So I appreciate all the time and energy that you and the other members of the Voltron Data and Arrow teams are putting in. So thank you again for your efforts there, and I hope you enjoy the rest of your day. Thanks. Thanks for having me.
[00:49:44] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Wes McKinney's Journey in Data
The Birth of Voltron Data
Vision and Impact of Apache Arrow
Engineering Productivity and Compute Efficiency
Current Capabilities and Ecosystem of Arrow
Architecture and Interoperability of Arrow
Balancing Expansion and Integration
Arrow's Role in Unstructured Data
Voltron Data's Business Model
Innovative Applications of Arrow
Lessons Learned in Building Arrow
When Arrow Might Not Be the Right Choice
Future Plans for Arrow and Voltron Data
Biggest Gaps in Data Management Tooling