Find Out About The Technology Behind The Latest PFAD In Analytical Database Development

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

A new approach to building and running data platforms and data pipelines.

It is an open source, cloud native orchestrator for the whole development life cycle with integrated lineage and observability, a declarative programming model, and best in class testability.

Your team can get up and running in minutes thanks to DAXTER Cloud, an enterprise class hosted solution that offers serverless and hybrid deployments, enhanced security, and on demand ephemeral test deployments.

Go to data engineering podcast.com/daxter

today to get started, and your first 30 days are free. Data lakes are notoriously complex.

For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics.

Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises.

And Starburst does all of this on an open architecture

with first class support for Apache Iceberg, Delta Lake and Hoody, so you always maintain ownership of your

data. Want to see Starburst in action? Go to dataengineeringpodcast.com

slash starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.

Your host is Tobias Macy, and today I'm interviewing Paul Dix to talk about his investment in the Apache Arrow ecosystem and how it led him to create the latest FAD and database design. So, Paul, can you start by introducing yourself?

Sure. I'm Paul Dix. I'm the founder and CTO of, InfluxData.

We are the makers of InfluxDB, which is an open source time series database.

Prior to that, I have a lot of experience in industry. I'm obviously a computer programmer by training, and I've worked in a lot of large companies, small companies

all over. So

And for folks who haven't listened to your previous appearance on this show, we were where we were talking about the

influx product suite and your experience there, where you actually hinted at the work that you've been doing, where we're bringing you back to talk about. Can you just give a refresher on how you first got started working in data?

So as I mentioned, InfluxDB is a time series database. Now how I got interested in this topic.

I mean, generally, like, when I was in school, I was interested in information retrieval, database systems, that kind of stuff.

But in 2010, I was working at a Fintech startup here in New York City, and we had to build a solution for working with a lot of time series data. Later, when I started this company,

initially, we were building a product for doing server monitoring and real time application metrics and that kind of thing. And to build a back end for that, I had to build a solution that was very similar to

the back end I had built for the Fintech company. So I saw

2 different use cases. 1 was in financial market data and the other in, like, server monitoring and application performance monitoring data. But the back end solution for both was basically the same thing. And

at that point, I realized

building

a database that could work with time series data at scale and make it easy for the user was a more interesting problem to solve.

So, you know, we pivoted the company to focus on that, became InfluxDB,

and we've been building for that ever since.

So initially, we had, you know, version 1.0. The initial announcement of InfluxDB

was in the fall of 20

13. We released version 1.0 of InfluxDB

in September of 2016.

We released 2.0

in basically late 2019, early 20

20. And then just this last year, we released version 3.0 of the database, which is the

the significant,

rewrite that you were hinting at that, basically

caused us to adopt all these new technologies and start investing heavily in the Apache Arrow ecosystem.

Now bringing us through to this part of the conversation,

I,

made a little bit of a play on the acronym with the introduction, but the the different letters of it are f d a p. And I'm wondering if you could just start by describing the overall context of that stack, how the what the different components are, and how they combine to provide a foundational architecture for database engines.

Yeah.

So the FDAP stack,

is an acronym for the different pieces.

F stands for flight, which is Apache Arrow Flight or Apache Arrow Flight SQL.

A is actually Apache Arrow, which is essentially the foundational project

under which all these components reside.

So Arrow is like the umbrella project for everything.

So Apache Arrow

is an in memory columnar specification. So basically, it's a format for in memory columnar data so that you can do quick analytics on it. D,

which is Data Fusion, which is a SQL

Processor,

like, it's a query parser,

planner, optimizer, and execution engine for SQL. Specifically, it also follows the Postgres dialect of SQL

and parquet, which is

a file format for persisting

columnar data, but also structured data. So you can have nested structures.

It's essentially an open source implementation

of

the Google Dremel research paper that came out in the early aughts.

I'm wondering if you can talk to the design goals and constraints

that you were focused on in the reimplementation

of InfluxDB

and how that led you to

the selection of this composition of tools to execute on that vision?

Yeah. So for Inflectivity 3.0,

as I mentioned, we we basically did a ground up rewrite at the database, which generally speaking is not something you'd ever want to do, but there are a number of problems we wanted to solve for. So

first is this idea of infinite cardinality.

Right? Within

time series databases,

generally,

there's this idea of the cardinality problem where

cardinality comes into in, like, dimensions that you describe your data on. Right? So these could be, like, a server name or a region or a a sensor ID, but you can also have other dimensions like what user made this request or what security token made the request. And, really, when you think about it, the dimensional data is basically just data that

describes

different observations that you're making.

So

when people want infinite cardinality, they basically just wanna be able to say they wanna capture as much precision and information about these observations that they're making. Now

traditional time series databases like InfluxDB versions 12 and others

have a problem essentially when this cardinality

gets super, super high.

And we had a bunch of, you know, customers and users who were saying they wanted to record this and use Influx DB for it, but we didn't have a solution. It was basically like a fundamental limitation of the architecture of the database. So how do we achieve infinite cardinality?

How do we achieve cheaper storage? Right? People wanted to decouple

the query processing and the ingestion processing and indexing from the actual storage of the data and they wanted to be able to ship historical data off to cheaper object storage that could be backed by spinning disk

while still making it so that queries against recent data are super fast. Right? So again, you're talking about a very fundamental

shift in the architecture of the database to be able to enable, you know, keeping everything in object storage while processing

recent data and memory and all this other stuff. So is that.

And then the other big piece is essentially, like, we wanted broader ecosystem compatibility.

Right? In versions Influx DB versions 12

have their own query languages, their own data formats.

Right? We wanted to be able to integrate with a much broader set of third party tools.

So, specifically, we wanted to support SQL as a query language

in addition to InfluxQL,

our our older query language.

We wanted

persistence formats that could be read and used

in tools outside of Influx DB. Right? So

and we wanted all of this essentially to be super performance. And, basically when we looked at this, we're like, okay, there are fundamental architecture changes of the database, which means we're essentially gonna have to rewrite most of it.

And this was at the beginning of 2020.

And at that time,

I thought, well,

1, older versions of InfluxDB are written in Go, that's That's kind of an artifact of when we created the project back in 2013.

Right? Go was very is starting to become

hot then. Right? The Go 1.0 release was in March of 2012.

But in 2020, the beginning of 2020,

I was very interested in Rust, and I felt that Rust as a programming language would be essentially

the best way to implement this kind of, like, high performance

server side software.

And I also thought that we could bring in other open source

tools and libraries that would help us get there faster.

Specifically, like, we didn't want to create our own

SQL execution engine from scratch. Right? That's a very, very big investment

and there are other systems out there that can do it. And initially, we thought that we might be pulling in something that was written in either C or C plus plus

which meant, like, bringing that code into a Rust project is actually fairly straightforward and you have, like, 0 cost abstractions and basically a very clean way to integrate it.

But when we started looking around, we saw that there were actually some Rust projects that were super interesting, right, that would enable us to do this. So 1,

persistence format. Right? We wanted a format that was more broadly

addressable, right, from other tools.

And in 2020,

the most obvious choice at least to us was Parquet.

It was still, like, Parquet came out I think in, like, 2016,

so

it

was beyond, like, the early, early adopter phase and was getting more usage

starting to get more usage in, like, other big data processing systems, data warehouses.

And we felt that if we use that as the persistence format,

we'd 1,

get the amount of compression we needed for our data to make it, like, you know, compact at scale.

But the other is, like, make it so we could share it with other third party systems. So that was kind of an obvious choice.

Then we knew, like, we needed fast analytics

on the data,

right, so that's when we started looking at Arrow as the, like, in memory

columnar data structure.

Right? 1 of the things I mentioned is, you know, this need for supporting high cardinality data, but then the other need is essentially, like, doing analytic style queries on time series data so that you can do analysis.

Versions 1 and 2 of InfluxDB,

those kind of analytics queries were like slow because of the way the system was

architected under the hood, and we thought if we're going to be able to do fast analytical queries on time series data,

it's going to have to be in this columnar format. So we kind of adopted Arrow as the in memory format for this data,

which then led to, you know, these other pieces.

And then

in early 2020, we looked at, a number of different query engines we could potentially

use. We looked at DuckDV,

which was still very nascent at that time. We looked at ClickHouse's engine, which again was nascent compared to where it is now, and we also looked at data fusion.

And at the end of the day, we decided that data fusion would be our choice because,

you know, it was written in Rust. And the thing is, like, all 3 of those projects that we evaluated, we realized there was gonna be a lot of work that we would have to do

to be able to support the time series use cases that we were aiming for.

And we felt that if we're gonna have to do a lot of work and end up contributing heavily to this query engine,

we might as well do it in a language that we intend to use, which is Rust, Right? DuckDeeBee and ClickHouse are both implemented in C plus plus

And we also felt that Data Fusion being part of the Apache Foundation and being part of the Arrow project,

we're making a bet that it would essentially, like, start to gather momentum and pick up steam, and there'd be other people who would contribute to it over time.

And over the last, you know, 3 and a half years that we've been heavily developing with it and contributing to it, we've certainly found that to be the case as more people

have been adopting Parquet, more people have been adopting Arrow.

They've been contributing to those 2 and Data Fusion,

and flight and flight SQL are also becoming kind of a standard

RPC mechanism essentially for

exchanging, you know, analytic datasets or, you know, millions of rows quickly in a high performance way.

And each of those pieces of the stack are definitely

well engineered. They've been gaining a lot of momentum. There's been a lot of investment in that overall ecosystem,

but they are all

I guess, they're not as narrowly scoped in particular Arrow as when they first started, but they are all focused on a particular

portion of the problem.

And in order to build them into a cohesive experience, I'm curious what was the engineering effort that's necessary to actually build a

fully

executable

database engine and platform experience on top of those disparate parts?

Yeah. I mean, it's certainly true that when Arrow first started, it essentially was like an in memory specification, and the the dream there was essentially that,

you know, you have data scientists

who are trying to do analysis in either Python or R.

Right? And the thing is they almost always have to get their data from 1 place and bring it in and exchange it to another thing. So the vision there was essentially how do you do

data interchange between these different data science tools and systems

that is 0 copy,

0 cost serialization, deserialization,

right, super, super fast.

And Wes and his team started with that, and then they saw saw, like, okay, wait a second. Now people also have these needs to, like, persist the data. So we need a persistence format. He brought in Parquet because he also helped define Parquet when it was first created,

but that became an obvious add on. And then,

you know, the RPC mechanism, they're like, okay. That well, now you have servers that are running things, so you need a way to exchange the data. Again, an obvious add on.

And data fusion, again, like, you need if you're working with this data, like, in Python, you have, like, pandas and r, you have, like, these, you know, different things. You have, like, either data frames, libraries, or whatever. But a lot of time, people just wanna execute a SQL query and you need an execution engine

that can work with

this arrow format natively that's gonna be super fast. Right? Anything that's fast in Python isn't actually written in Python, it's written in c or c plus plus

and then wrapped.

So that's what they realized from the data science perspective. Now from the perspective of people creating

a data platform, like, an entire data platform or a database server or something like that,

the thing that's tricky about it is

a lot of these formats are actually they're designed for exchanging, like, a a set chunk of data. Right? Like, parquet is an immutable format. Right? It's not meant to be updated. You write a parquet file, and that's that.

Arrow, again, like, you don't append to arrow buffers,

on the fly. Like, you create an arrow buffer, it's well defined, and then you can hand it off.

So

having a system that's basically able to ingest data

live, right, like individual rights, individual rows that you're writing in, and being able to combine that with this historical dataset that's represented either as arrow buffers in memory or parquet files on disk. Right? Moving all that data around, that becomes the really, like, the trickiest part of creating, like,

a larger scale data platform. It's like, how do you move that data around? How do you combine the real time data with the historical data? And how do you make that all fast,

and how do you make it easy to use.

All of that work is basically non trivial

amount of effort,

but it's certainly made easier by the fact that you no longer have to create the lower level primitives,

right, to build that data platform. You don't have to create the query engine. You don't have to create the file format. Right? Those things basically just exist

and they're

you know, I've

heard Wes refer to it as basically the composable data stack. Right? Which is you can kind of pick and choose these pieces

that you want to work with. Right? You can use the Data Fusion query engine, but not use Parquet at all and, you know, not use flight if you don't want to. It uses Arrow under the hood, so that kinda, like, comes along for the ride.

But, yeah, like, all of these different pieces are kind of, like,

you know, they're designed to be modular so that you can pick a different persistence format if you want that. You can pick a different execution engine. Right? Within the Arrow ecosystem,

1 of the things that,

Voltron Data, the company that

Wes ended up starting with some other people,

that backs a lot of the Arrow stuff as well.

1 of the things they created was this project called,

I don't know how to pronounce it, Velox,

basically, velox,

which is basically like this execution

engine that was created in conjunction with some work at Facebook

to do stuff. Right? So the idea is you can pick and choose these components and kind of tie them all together

into a larger, like, operational system where you're essentially solving problems around

data warehousing,

real time analytics,

and essentially just, like, working with

what I would say observational data at scale. Right?

Where observational data could be

data from your servers, applications,

sensors,

logs, whatever it is.

Are you sick and tired of sales y data conferences?

You know, the ones run by large tech companies and cloud vendors?

Well, so am I. And that's why I started Data Council, the best vendor neutral, no BS data conference around.

I'm Pete Soderling, and I'd like to personally invite you to Austin this March 26th to 28th, where I'll play host to 100 of attendees,

100 plus top speakers, and dozens of hot start ups on the cutting edge of data science, engineering, and AI.

The community that attends data council are some of the smartest founders,

data scientists, lead engineers, CTOs, heads of data, investors, and community organizers

who are all working together to build the future of data and AI.

And as a listener to the data engineering podcast, you can join us. Get a special discount off tickets by using the promo code depod20.

That's depod20.

I guarantee that you'll be inspired by the folks at the event, and I can't wait to see you there.

Another interesting element

of building

your platform on top of all of these open source components

is that

by virtue of it being a layered stack,

you can have additional integrations

that can come in at each of those different layers rather than having the main

interface be the only way of accessing the data that it contains.

It also gives you the benefit of being able to

capitalize on the overall ecosystem of investment and the network effects that you get from those different open source projects.

So I'm wondering if you can comment on some of the ways that you've seen that benefit materialize in your work of building this data platform on top of these different components.

Yeah. So this is actually, like, 1 of the things I'm most excited about

for these different pieces and for, you know, the work we're doing,

which is

there's, I I think we actually need to add another letter to the acronym, the FDIPDAP

acronym and maybe, like, jumble them up. But basically, the the other letter is I for for Apache Iceberg.

So Iceberg is essentially a catalog a standard for creating a data catalog

of essentially parquet files in object storage.

Right? And

we're we're basically building first class support for that in InfluxDB 3.0 where

all of the data that's ingested into an Influx DB 3 dot o server can be exposed

essentially as

iceberg catalogs, which is awesome because

that so that's a standard that was originally developed at Netflix and that was open sourced out into the Apache Foundation,

and it's quickly being adopted

by other by other companies. Right? So Snowflake just added support for Iceberg as a format. Even

Databricks is adding support for it, even though they have a a competing standard called Delta Lake,

and a lot within

Amazon, the Amazon Web Services, for example, they're adding first class support for Iceberg

so that if you have data that's exposed as an Iceberg catalog, you know, in s 3,

you can then query that data using any of the

Amazon, you know, query services like Athena

or Redshift

or all these different pieces.

So

that I think is, like, a really interesting integration because it makes us so that you can

access

this data in bulk. Right? So if you want need to, like, train a machine learning model or whatever or query against this data for doing large scale analytical queries and be totally outside

with for Influx DB 3, for example, totally outside the operational envelope of the system that's kind of, like, managing all this real time data movement, being able to query in real time,

you you can basically do all these analytics tasks completely disconnected from that. And,

again, like,

you could use Data Fusion for that, but you could also use Athena, right, which is, you know, based on,

a Java

Java

query engine called Trino or Presto or whatever it is now.

Or you could use DuckDV or ClickHouse or any 1 of these other systems

to do your query processing and analytics against that data.

So

that integration, I think, is super interesting. The other 1 that I think is interesting is

within

the Arrow project. So they have

they have flight SQL is basically like an RPC mechanism for essentially sending SQL queries to a server and getting back millions of rows really, really quickly.

And they have basically a new standard that they've created,

that's kind of like competing with ODBC.

So ODBC is obviously the database connection standard. Was for essentially transactional databases and relational databases.

The Arrow 1, once

that becomes a thing, I think it'll be a really, like, a standard way

to connect to

analytical data stores of any kind, whether it's data warehouses or real time data systems or whatever.

And I think

those like,

having those things be standards and have them, you know, contributed to by many different companies, not just supported by a single vendor,

I think

will make it the the pace of innovation in this space

for these, you know, large scale

data data use cases, which are only gonna continue to, like, increase and multiply.

I think

it makes it so that we have we can have basically many more tools

that can integrate with each

other. Whereas, like, if you look at data warehousing,

you know,

for the last 20 years,

it's largely

been, like, your day you know, it's data data warehouses are basically kind of like data roach motels.

Like, your data goes in and

you have to get all the data in the data warehouse, but then if you wanna do anything with it, you have to send the query to the data warehouse and, like, all this other stuff. Right? And there's just not

there's not this really good integration, like the data warehouse just becomes this 1 place.

So being able to access it from a bunch of different tools,

without having

1 piece of software be the arbiter of the entire thing, I think is is really interesting.

Absolutely. And

to your point of,

flight SQL being a new RPC mechanism to unlock a lot of potential and reduce a lot of the pains, it it just, makes me sad that I obtained all of that scar tissue around ODBC for nothing.

I mean, I think I think ODBC is gonna be around for a very long time. I don't think it's going away. Yeah.

Absolutely.

And

the counterpoint

to the benefits that you get building on top of open source is that particularly

when you have a business that is being powered by these components, you

adopt some measure of platform risk because

you're not the only person who has a vision for the future direction of these technologies. And

some of that future

may or may not be compatible with the vision that you have for it. And I'm curious how you think about that platform risk and the mitigating factors that you have in the engineering that you're doing to account for any potential future shift in the kind of vision and direction of those products?

Yeah. I mean,

you know, you can wrap the libraries with your own abstractions,

but the problem is that comes with a high price, a high cost.

And the truth is even if you wrap it with your own abstractions,

if the libraries end up changing significantly and you're like, okay. We need to replace it with something else,

It's not good. This is gonna be like a nontrivial task. Right? The best insurance

is essentially to have enough people contributing to the core of the thing to be able to have

some level of influence on the direction of the project. Right? Like,

ultimately, there's gonna be platform risk,

but

I think,

you know, take it from the other side, which is we decide to develop all this stuff ourselves, right, and keep it closed source and just, you know, whatever. Well,

the the risk there is, like, I mean, that's just an absolute mountain of work to do. Right? And to and the thing is, like,

as

as these projects have matured, like I said, like, we've seen other people contributing to them. So now

we regularly get, like, performance improvements in the query engine or new functions in the query language and all this stuff. Like, we help manage the project. Like, we have people contributing to, you know, we make significant

investments into the open source pieces,

but, you know, those are things that we kind of get for free,

as a result. Like, essentially, like, it means that

the the risk we have if we kept it all closed source is that our pace of development

would be outpaced

outmatched by the set of people contributing to this open thing.

Right? We may be able to

to get,

you know, somewhere initially, but, like, eventually, like, the open source people are gonna, like, outpace

a small team of proprietary developers. Now if you have unlimited resources and you can basically just, like,

you know, create,

you know, a long lived team of people that you're able to fund forever, then the situation changes.

But I think for for startups

in the technology space, like,

their best bet is to adopt

platform pieces that are not that, you know, that you can contribute to that can form the basis of the things you're building. Right? Like and this is you know, you you don't create your own operating system. Right? You use Linux, and you don't create your own programming language. You use whatever language you're gonna use there. And I think

all that stuff happens, you know,

it happens higher and higher. You all these pieces kinda, like, build on each other. In this case, like, we're talking about the FDAP stack and all these different components.

They're essentially the toolkit that you would use to build a database, an analytical database or a data warehouse.

Right? So

why create those things from scratch? Right? Your ultimate goal is not really to create a data warehouse. It's to

deliver, you know, value for your customers who are actually paying for the solution. And they don't really care about a data warehouse per se. They care about solving their data problem for their

customers. So

as much as you can, like, adopt to say, like, okay,

this isn't gonna be our thing that we innovate on. This is gonna be the you know, that's not how we actually, like, add value to this market, to this thing that we're selling. This is basically just like a barrier to entry. And if you can adopt an open source thing that, like, reduces the barrier, then great.

Absolutely. And by virtue of being

involved

with and participating in the open source projects that you're relying on, you also get the benefit of early warning of knowing that,

okay, this is the future direction that the community would like to see. And so now I can proactively

plan for those shifts in the underlying technology so that I can accommodate them in the end result that I am building on top of it.

Yeah. Yeah. 1 and, ultimately, like, the absolute worst case scenario, right, is, like, the community is gonna make some weirdo changes. They're just completely incompatible

with what we need to do.

Great. Then you could we can just fork the project from whatever that last point was.

It's from permissively licensed open source. We can fork the project and then we have 2 options. Do we make our fork

closed source,

or do we make our fork something publicly available and you just continue on from there. Right?

And at that point, you haven't adopted

any more risk

than you would have had anyways, you

know, your closed source thing. Although, I will say, like

like I mentioned, we we spend a lot of time

contributing to these community projects.

So there's a there's a good amount of effort that we put forward that essentially doesn't benefit us directly. Right? It's not the we're doing this community thing or managing these, like, efforts of different people contributing or whatever

because it's something we need specifically for our product.

But, again, the bet is that, you know, like,

okay. There are a bunch of things we'll do. They're not direct benefit to us, but there are other things coming in from the community that are, so it all kind of, like,

evens out. And actually, in our you know, in my experience, it doesn't even out. Like, we get far more out of it than we give than we put in. Even though we, like like I said, we try to put in as much as we possibly can.

It's just that

when you have, you know,

dozens of developers

from around the world and different

companies contributing to this thing, like,

the sum is gonna be greater than what any 1 individual or 1 company produces and puts into it. And so looking at the

component pieces of this stack and

the overall architecture and system requirements

for a database engine?

What are

the additional pieces that you had to build custom? What is the work involved in building a polished user experience on top of these different components

and some of the ways that you're thinking about what what are the appropriate abstraction layers or what are the appropriate system boundaries for

what these 4 pieces of the stack do and the eventual inclusion of Iceberg,

and what is the responsibility

of Influx

as the database experience that needs to be built on top of it? Yeah. So, I mean, basically, like, these components are really just libraries. Right? They're just programming libraries that we use.

So they're not actually a piece of running software that will do anything on its own. I mean, Data Fusion

data fusion does have, like, a command line tool where you can say, like, point it at, you know, a file and execute a query against it if it's CSV or JSON or parquet. Right?

But beyond that, it's not like a sir a process that'll run on a server that will respond to requests and all this other stuff. So you kinda have to build

all that scaffolding around it. Right? You have to build a server process and you have to

decide

what your API is gonna be. Right? For writing data in, most people are not gonna wanna write,

you know, arrow record batches or parquet files in because those 2 formats actually aren't

super easy to create

yourself. Like usually when people create those formats,

they do it as a transform from some other

data that's easier to work with, like CSV or JSON or whatever. Right? So

you have to decide, like, how do you write data in, what's that format, how do you translate it to

arrow or parquet.

You need to decide, like, for the query interface, like, SQL is the language,

but then how are they gonna make the request? Right? Is it gonna be HTTP, Jira PC, whatever?

And then what is the response format gonna be?

Do you want to give them Arrow? Do you wanna give them Parquet? Do you wanna give them CSV, JSON,

something else? Right?

So all those pieces you kinda have to decide on and

create.

Right? Basically, the the entire, like,

piece of server software.

And then there's,

you know, all the operational pieces, which is if you have to run this

in a Kubernetes cluster, if you have to run this in the cloud or whatever.

And also for for us, for Influx DB 3,

you know, we have currently, what we have is a distributed version of the database where we've

it's comprised of a number of different services that run inside a Kubernetes cluster. Right? And we separated out the ingestion tier

from the query tier, from

compaction,

from a catalog that runs. Right?

So

we basically had to create services for each of those and APIs for how they interact with each other and then a bunch of,

like, tooling and stuff like that to actually monitor, you know, spin this up on the fly and monitor it, run it,

all that separate stuff. So, I mean, there's still like if you're gonna adopt these components to build,

you know, a data system,

there's still a lot of work to do,

but

but yeah.

For people who are

interested in building some database engine or they are interested in the functionality of any of these different pieces. I'm curious

what you see as some of the other

types of projects

that would benefit from the capabilities

of any or all of those pieces of the stack and and maybe some of the other

elements that could be built up and added to that ecosystem

to maybe reduce the barrier to entry that you've had to pay?

Yeah. I mean, so

what I've seen, like,

a bunch of different kinds of projects are starting to adopt and companies are starting to adopt these pieces of the stack. So,

you know, I just saw 1,

yesterday. There was basically, like, a new stream processing engine that essentially is using

data fusion and thus also Arrow

as the the way to do, you know, processing within the the stream processing engine. Right? So you can execute SQL queries against, like, data coming in a stream, whatever. So there's that.

There are

different kinds of database systems, either time series database or document database

or data warehouse or whatever. Like, I've seen a number of projects

in either open source or in companies that are starting that to to use those components.

There's

another project right now where contributors from Apple are basically putting in

a essentially a Spark,

execution engine, which is based on data fusion.

Essentially, this is, you know, a replacement for the open source Java Spark implementation that's supposed to be faster and stuff like that. So basically you see like 1 component within Spark is being replaced with

Data Fusion as part of this. And actually the creator of Data Fusion,

Andy Grove, was originally doing

creating Data Fusion for that use case

inside of NVIDIA.

So you see, like, all these different companies, like, creating those different pieces.

I think it's still early

in the for the the Rust ecosystem of tools

to see what's gonna happen, like, what open source projects are going to become kind of big.

Right? Right now, when you think of, like, big data processing tools, most of that environment is in Java. Right? It started with Hadoop

and then continued with,

Spark and, like, all the different components there and writing Kafka's written in Java and Flink's written in Java. Right? So you have different stream processing systems and all these things have to integrate together.

What I anticipate

is that,

you know, over the next 10 years,

you'll see a lot of those systems

rewritten, recreated

using Rust and using Data Fusion and Arrow and Parquet

as the underlying primitives

and ideally they wouldn't just recreate the exact same thing, you know, but instead of Java, it's in Rust. There will certainly be some of that,

but ideally, what they will do is they will take, you know, a lot of lessons learned from those previous versions of those of those pieces of software.

Does that like, okay. How can we make

the user experience better? Right? So it's easier to express the kind of things we wanna express. Or how do we make operations better so it's easier to, like, operate these systems at scale. So

I think

I think it's really early yet, though. I it's not clear to me, like,

from an open source perspective, what projects are gonna be the winners here that that eventually, like,

you know, supersede the the previous Java

systems. Absolutely. And I've definitely been seeing a little bit of that as well even

3 to 5 years ago of c plus plus being the implementation target, particularly built around the c star framework for being able to take advantage of multi CPU architectures,

most notably the Cilla DB project as a target to reimplement

Cassandra

and then Red Panda taking on the Kafka ecosystem.

Yep. And

another interesting aspect of this space is

Aero as the focal point of that data interchange

has been gaining a lot of ground. It started off as a very nascent project. There's been a lot of effort put into

making that more of the

first

target rather than being a second consideration,

and it's been working on integrating with the majority of the components of the data ecosystem.

I'm wondering what you see as some of the remaining

gaps in coverage or some of the white spaces in the overall Arrow ecosystem

that are either

immature or completely absent and, spaces that you would like to see the overarching data community invest in

building out more capabilities and capacity?

So I think there's still probably some work to be done within Arrow as a specification itself

for representing data in a more compact form.

Right? For for some kinds of, like, columnar data, it's just not as efficient as as I think it could be. But that was

originally, I think that was a

a result of 1 of the design goals, which was essentially o of 1 lookup for any individual element within the

the set.

I think if that constraint is loosened, that opens up the possibility for other kinds of

compression

techniques and stuff like that that will make it a better format for,

compressed data and memory, which I think is something that would be potentially interesting.

I think

there's still there's still a question of like, okay, if we're gonna have a stream processing system

in, you know, that

that uses these tools, what does that look like? Because Arrow

as a format actually is not is not well suited for stream processing. Right? Because it's a columnar format, so they, you know, the the conceit there is that you are you are sending in, you know, many, many rows at the same time. Whereas when you think of stream processing,

you think of either micro batching or individual rows, like, 1 by 1. Right? So there's

no there's no there's no good, like, I think, translation layer between okay. If you're moving if you care about doing stream processing and you wanna move to Aero or, like, batch processing

or larger scale data processing, how do you make that trend you know, the transition,

and what do the tools

look like for that? I think that's still very difficult. Right? And it's certainly, like, something we've done in influx of v 3, which is, like,

translating to, you know, line protocol, individual rows being written into

the Arrow stuff.

I think

the distributed query processing

is something that is probably gonna, you know, get

more

work. It's definitely something that needs more work within the data fusion

piece itself.

I think later this year, I think, in a couple of months, hopefully, they're gonna vote on whether Data Fusion becomes its own top level Apache project

outside of Arrow.

My my best guess is that's gonna happen. And then what we'll probably see is, like, Data Fusion will then have some subprojects, 1 of which I think will be around distributed query processing,

which I think will be important for for it really to become

a contender and a competitor in the larger scale data warehousing space.

What else?

I don't know. Like, Parquet has gotten some interesting improvements along the way. I think I don't know. There was, like, GeoParquet for representing geospatial data. I think that's gonna be super important.

So yeah.

This might be a little bit too far afield or too deep in the weeds, but there was also for a little while, a bit of contest between Parquet and ORC as the preferred

columnar serialization

format. I'm wondering if you have seen that the dust settle around that, and there has been a general consensus around 1 or the other, or if those are still kind of a case by case basis do what you think think is right for a different use cases?

I I may just be biased because, you know, I'm

I'm looking for parquet,

but I don't I I remember that being a thing, and I remember looking at both formats, you know, from a high level,

back in the day.

But I don't really see ORC as a format

coming up nearly as much. Right? It seems to me that

Parquet has kind of won

won the,

you know, the mind share largely, and that's what people have kind of coalesced around. Now, of course,

you know, because we're talking about data at scale, there's probably, like, mountains of data in people's, like, data lakes and data warehouses that is represented as as ORC, so that's not gonna go away.

But

by and large, what I see is that parquet seems to be

the standard format that all the big data vendors are

are coalescing around.

I've I've been seeing a similar thing. And then to the point of streaming

and record based digestive data versus the columnar approach for Parquet and Arrow,

I know that Avro and Parquet have a defined

kind of translation

method of being able to compact multiple Avro records into a pic parquet

file.

And I'm curious if you're seeing anything analogous for the Arrow ecosystem of being able to maybe manage that translation of multiple Avro records batched into

an Arrow buffer that can then subsequently be persisted into parquet

or using that Avro to parquet translation as the intermediary to then get loaded into an arrow buffer?

I mean, I haven't really seen that. I mean, there's because

I mean, it's it's pretty easy to go from arrow to parquet or parquet to arrow, right, because they're,

you know, parquet's within the arrow umbrella. So people

people at the product you know, in the various projects have created a bunch of, like, translation layers to do that.

I haven't seen I really haven't seen any, like, rise of, like, oh,

these, like, row based formats

into either arrow or

parquet. It just seems to be, like,

kind of 1 off. I I honestly, I don't see Avro come up that

much.

So,

mainly, I think what I see the most, what people care about is, like, JSON data just because it's so easy,

you know, to change between different languages and different services.

And, honestly, I think protobuf

more than

more than Avro or anything else. I think that's maybe

because of, you know, the popularity of gRPC.

And as you have been

investing in this ecosystem,

building on top of the different components, I'm wondering what are some of the most interesting or innovative or unexpected ways that you have seen some or all of those pieces used together?

So, honestly, stream processing was a surprise for me because I like, I didn't when I think of, like, Aero and

and Data Fusion, like, I wasn't originally thinking that people would use these things for stream processing systems.

Right? I think more

like it's they're around, like, batch processing and do it. You know, I execute a query against this data, whatever.

So having people seeing people pull that stuff into the stream processing systems

has been very surprising.

Elsewhere, I'm I'm not sure. Like, I think

so

I've seen a few, like, observability

solutions start to look seriously using Parquet as the persistence format. That's a little surprising too. Mainly because, like, when I think about observability, it's largely like, oh, you think of, like, matrix log traces. Right? You and, generally, what people have done is they've created specialized,

you know, formats and back ends for each of those individual use cases.

So,

I've seen, you you know, some people start to look at

seriously at having Parquet represent, like, any of that kind of data, which I think

to me, that's like I that's definitely, like, 1 of our visions long term

is that being able to store any kind of observational data in influx and thus in Parquet.

But to see more observability vendors start to look at that seriously, has been a bit of a surprise too.

And

in your experience of working in the space, rebuilding the influx database, and

investing more into the Arrow ecosystem, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?

I mean,

1 of the lessons which is some somehow a lesson I always real need to relearn as a software developer is things always take longer than you expect them to take.

So

this project, like I said, you know, we I started seriously thinking about it about 4 years ago.

Really serious development on it for the last 3 and a half. It's basically just a long a long road to to create this kind of system.

So there's that.

I've been pleasantly surprised by the adoption,

by by actually, the level the level of contribution

from outside,

you know, people

at at actually, you know, companies of a very significant size,

has been also a bit of a surprise. Like, I think

I think they're,

you know, they're

for for companies that reach, like, you know, crazy scale,

which are, you know, companies that you know the names of. Like,

I think many of them,

are contributing to these projects because they kinda have to, like,

create their own things because literally nobody on earth has the kind of scale problems they have except for maybe, like, 10 or 20 different companies.

So they end up having to roll their own solution.

And, again, I think the

the fact that these companies are contributing is something I didn't expect

particularly this early on.

And I think that speaks to,

you know, the thing we were talking about earlier, which is, like, what kind of platform risk

is there to adopting this code? And it's like, well, the alternative is

you create all this closed source software that is really, like, this is not the problem you're trying to solve. This is just, like, the problem you have to solve to get to the problem you're trying to solve. So,

that's that's been, like, I think a pleasant surprise

seeing seeing this,

you know,

mature over the last few years.

And for people who are looking to build

data systems, data processing engines, what are the cases where the FDAP stack is the wrong choice?

So I

I don't think it's particularly designed for OLTP workloads. Right? So, you know, traditional relational databases and stuff like that. Like,

there are places where, you know, you can

it it would make sense to have it as, like,

essentially, like, an interface point.

But,

I mean, you could certainly use, like, data fusion as your query engine in an OLTP workload.

But to me, it wouldn't make sense to use, like, Arrow as a way to ingest data or Parquet.

Because really when you think about OLTP workloads, you think about individual requests with individual record updates and stuff like that. So I really do think these tools are more geared towards

larger scale analytical workloads

against,

you know, data that you can largely view as immutable. Right? This is like observational data and stuff like that. So

yeah.

And as you continue to build and iterate on the new version of Influx DB and invest in the Arrow ecosystem and the components we've been discussing, what are some of the things you have planned for the near to medium term or any particular projects or problem areas you're excited to dig into?

So as I mentioned,

the thing I'm most excited about is essentially, like, more integration

adding support for Apache Iceberg.

So

what that's gonna there so there's already, like, a Rust project to do Apache Iceberg, but it's not, like, fully baked yet. So we may need to contribute to that, or maybe the people who are working on it will get it fully baked before we actually get to the point where we're pulling it in.

So

Apache Iceberg is a big thing. I think

in the medium term, the distributed processing stuff and data fusion

is gonna be super interesting.

And then from

InfluxDB's

perspective, as I mentioned, like, we have right now

our commercial distributed

version of the database.

But this year, we're coming out with, you know, the open source version of the monolithic single

single server version of the database

and getting that open source piece out there with, like, a new version 3 API

that kind of represents a much richer data model than previous versions of InfluxDB

that takes advantage of what you can do with Arrow and Parquet

as the as the formats,

that I'm actually really, really excited about because then I really think that

from a technology perspective, InfluxDB

will actually be able to fulfill the, like, vision that we've had all along, which is that essentially it is

useful for any kind of observational data you could think of, not just, like, metrics data from your servers or networks or your apps. Right?

Are there any other aspects of the work that you've been doing on the Influx DB engine,

the work that you've been doing investing in and building on top of the Arrow ecosystem

or the overall space of how the Arrow ecosystem might influence the future direction of the data processing ecosystem that we didn't discuss yet that you'd like to cover before we close out the show?

I don't think so. Like, I think we kinda I I mean, I guess, like,

more broadly, like, the the way the way I view, like, the the data space right now when you're talking about these, like, analytical data

is there's there's this kind of, like, distinct,

separation between, like, data warehousing on 1 side, which is these large scale analytical queries and stuff like this, and, like, stream processing

on the other, which is more about, like, real time data as it arrives.

I think

the trend, like, really when I think about those 2 things, like, ultimately, like, what developer wants and what users want is basically some magical oracle in the sky that they can, like,

send a query to the where the result will come back in, you know, sub 50 milliseconds.

And we have that. We wouldn't need stream processing. We wouldn't need, like, all these different things. But I think

as the technology

improves and things get better and better,

data warehousing is gonna become more real time and the real time pieces are gonna, you know, move more towards, like, data warehousing because ultimately, like, people don't wanna think about separating stream from data warehousing, whatever.

And

1 of the things I'm excited about is essentially the idea that these different

building blocks

could potentially be the things that people use to kind of close that gap

and create, you know,

a big data solution that works either for real time data or for,

you know, big scale data warehousing.

But I thought people liked reinventing the Lambda architecture.

Oh, no. Yeah, they do. They do.

They just like to call it something new. Maybe it's the kappa architecture.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

Biggest gap?

Oh,

I I don't know.

I don't know, actually.

I mean, obviously, like,

I think the most interesting side of this is essentially, like, you know, time series data and basically being able to represent being able to do analysis on data as time series. So that's our focus.

That that's what I think is the most interesting thing right now. But,

yeah,

I still I still think that's an unsolved problem by us or anybody else. So that's what we're working towards.

Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing, both contributing to and building on top of the Arrow ecosystem and the components thereof. It's definitely a very

interesting area of effort. It's great to see the work that you and your team are doing to help bring all of us forward in that space. I appreciate the time and energy you're putting into that, and I hope you enjoy the rest of your day.

Cool. Thank you.

Thank you for listening. Don't forget to check out our other shows, podcast.init,

which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning podcast,

which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast

dotcom to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts atdataengineeringpodcast.com

with your story.

And to help other people find the show, please leave a review on Apple Podcasts and just tell your friends and coworkers.

Data Engineering Podcast

Summary

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links