Summary
Building a database engine requires a substantial amount of engineering effort and time. Over decades of research and development into these software systems, a number of common components have emerged that are shared across implementations. When Paul Dix decided to rewrite the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, DataFusion, and Parquet to lay the foundation of the newest version of his time-series database.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today!
- Your host is Tobias Macey and today I'm interviewing Paul Dix about his investment in the Apache Arrow ecosystem and how it led him to create the latest FDAP in database design
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing the FDAP stack and how the components combine to provide a foundational architecture for database engines?
- This was the core of your recent re-write of the InfluxDB engine. What were the design goals and constraints that led you to this architecture?
- Each of the architectural components are well engineered for their particular scope. What is the engineering work that is involved in building a cohesive platform from those components?
- One of the major benefits of using open source components is the network effect of ecosystem integrations. That can also be a risk when the community vision for the project doesn't align with your own goals. How have you worked to mitigate that risk in your specific platform?
- Can you describe the operational/architectural aspects of building a full data engine on top of the FDAP stack?
- What are the elements of the overall product/user experience that you had to build to create a cohesive platform?
- What are some of the other tools/technologies that can benefit from some or all of the pieces of the FDAP stack?
- What are the pieces of the Arrow ecosystem that are still immature or need further investment from the community?
- What are the most interesting, innovative, or unexpected ways that you have seen parts or all of the FDAP stack used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on/with the FDAP stack?
- When is the FDAP stack the wrong choice?
- What do you have planned for the future of the InfluxDB IOx engine and the FDAP stack?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- FDAP Stack Blog Post
- Apache Arrow
- DataFusion
- Arrow Flight
- Apache Parquet
- InfluxDB
- InfluxData
- Rust Language
- DuckDB
- ClickHouse
- Voltron Data
- Velox
- Iceberg
- Trino
- ODBC == Open DataBase Connectivity
- GeoParquet
- ORC == Optimized Row Columnar
- Avro
- Protocol Buffers
- gRPC
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Data Council: ![Data Council Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/3WD2in1j.png) Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit [dataengineeringpodcast.com/data-council](https://www.dataengineeringpodcast.com/data-council) and use code **dataengpod20** to register today! Promo Code: dataengpod20
- Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png) Data teams are tasked with helping organizations deliver on the promise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. Dagster is an open-source orchestration solution that helps data teams rein in this complexity and build data platforms that provide unparalleled observability and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to [dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast) today to get your first 30 days free!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open source, cloud native orchestrator for the whole development life cycle with integrated lineage and observability, a declarative programming model, and best in class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise class hosted solution that offers serverless and hybrid deployments, enhanced security, and on demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started, and your first 30 days are free. Data lakes are notoriously complex.
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey, and today I'm interviewing Paul Dix to talk about his investment in the Apache Arrow ecosystem and how it led him to create the latest FDAP in database design. So, Paul, can you start by introducing yourself?
[00:01:52] Unknown:
Sure. I'm Paul Dix. I'm the founder and CTO of InfluxData. We are the makers of InfluxDB, which is an open source time series database. Prior to that, I have a lot of experience in industry. I'm obviously a computer programmer by training, and I've worked in a lot of large companies, small companies all over. So
[00:02:10] Unknown:
And for folks who haven't listened to your previous appearance on this show, where we were talking about the Influx product suite and your experience there, you actually hinted at the work that you've been doing, which is what we're bringing you back to talk about. Can you just give a refresher on how you first got started working in data?
[00:02:29] Unknown:
So as I mentioned, InfluxDB is a time series database. Now how I got interested in this topic. I mean, generally, like, when I was in school, I was interested in information retrieval, database systems, that kind of stuff. But in 2010, I was working at a Fintech startup here in New York City, and we had to build a solution for working with a lot of time series data. Later, when I started this company, initially, we were building a product for doing server monitoring and real time application metrics and that kind of thing. And to build a back end for that, I had to build a solution that was very similar to the back end I had built for the Fintech company. So I saw 2 different use cases. 1 was in financial market data and the other in, like, server monitoring and application performance monitoring data. But the back end solution for both was basically the same thing. And at that point, I realized building a database that could work with time series data at scale and make it easy for the user was a more interesting problem to solve.
So, you know, we pivoted the company to focus on that, became InfluxDB, and we've been building for that ever since. So initially, we had, you know, version 1.0. The initial announcement of InfluxDB was in the fall of 2013. We released version 1.0 of InfluxDB in September of 2016. We released 2.0 in basically late 2019, early 2020. And then just this last year, we released version 3.0 of the database, which is the significant rewrite that you were hinting at that basically caused us to adopt all these new technologies and start investing heavily in the Apache Arrow ecosystem.
[00:04:11] Unknown:
Now bringing us through to this part of the conversation, I made a little bit of a play on the acronym with the introduction, but the different letters of it are F, D, A, P. And I'm wondering if you could just start by describing the overall context of that stack, what the different components are, and how they combine to provide a foundational architecture for database engines.
[00:04:35] Unknown:
Yeah. So the FDAP stack is an acronym for the different pieces. F stands for Flight, which is Apache Arrow Flight or Apache Arrow Flight SQL. A is actually Apache Arrow, which is essentially the foundational project under which all these components reside. So Arrow is like the umbrella project for everything. So Apache Arrow is an in memory columnar specification. So basically, it's a format for in memory columnar data so that you can do quick analytics on it. D is DataFusion, which is a SQL processor, like, it's a query parser, planner, optimizer, and execution engine for SQL. Specifically, it also follows the Postgres dialect of SQL. And P is Parquet, which is a file format for persisting columnar data, but also structured data, so you can have nested structures.
It's essentially an open source implementation of the Google Dremel research paper that came out in the early aughts.
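To make that concrete, here is a minimal sketch of the D, A, and P pieces working together: DataFusion reads a Parquet file into Arrow record batches and executes SQL over it. The file path, table name, and query are hypothetical, and the API shown assumes a recent release of the datafusion crate.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // DataFusion session: SQL parser, planner, optimizer, and execution engine.
    let ctx = SessionContext::new();

    // Register a Parquet file (hypothetical path) as a queryable table.
    ctx.register_parquet("metrics", "data/metrics.parquet", ParquetReadOptions::default())
        .await?;

    // Plan and execute the query; results come back as Arrow record batches.
    let df = ctx
        .sql("SELECT host, avg(cpu) AS avg_cpu FROM metrics GROUP BY host")
        .await?;
    df.show().await?;

    Ok(())
}
```

Flight, the F, only enters the picture when those record batches need to cross the network to a client.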
[00:05:41] Unknown:
I'm wondering if you can talk to the design goals and constraints that you were focused on in the reimplementation of InfluxDB and how that led you to the selection of this composition of tools to execute on that vision?
[00:05:57] Unknown:
Yeah. So for InfluxDB 3.0, as I mentioned, we basically did a ground up rewrite of the database, which generally speaking is not something you'd ever want to do, but there are a number of problems we wanted to solve for. So first is this idea of infinite cardinality. Right? Within time series databases, generally, there's this idea of the cardinality problem, where cardinality comes in, like, dimensions that you describe your data on. Right? So these could be, like, a server name or a region or a sensor ID, but you can also have other dimensions like what user made this request or what security token made the request. And, really, when you think about it, the dimensional data is basically just data that describes different observations that you're making.
So when people want infinite cardinality, they basically just wanna be able to say they wanna capture as much precision and information about these observations that they're making. Now traditional time series databases like InfluxDB versions 1 and 2 and others have a problem essentially when this cardinality gets super, super high. And we had a bunch of, you know, customers and users who were saying they wanted to record this and use InfluxDB for it, but we didn't have a solution. It was basically like a fundamental limitation of the architecture of the database. So how do we achieve infinite cardinality? How do we achieve cheaper storage? Right? People wanted to decouple the query processing and the ingestion processing and indexing from the actual storage of the data, and they wanted to be able to ship historical data off to cheaper object storage that could be backed by spinning disk while still making it so that queries against recent data are super fast. Right? So again, you're talking about a very fundamental shift in the architecture of the database to be able to enable, you know, keeping everything in object storage while processing recent data in memory and all this other stuff. So there's that.
And then the other big piece is essentially, like, we wanted broader ecosystem compatibility. Right? InfluxDB versions 1 and 2 have their own query languages, their own data formats. Right? We wanted to be able to integrate with a much broader set of third party tools. So, specifically, we wanted to support SQL as a query language in addition to InfluxQL, our older query language. We wanted persistence formats that could be read and used in tools outside of InfluxDB. Right? And we wanted all of this essentially to be super performant. And, basically, when we looked at this, we're like, okay, there are fundamental architecture changes of the database, which means we're essentially gonna have to rewrite most of it. And this was at the beginning of 2020.
And at that time, I thought, well, 1, older versions of InfluxDB are written in Go. That's kind of an artifact of when we created the project back in 2013. Right? Go was starting to become hot then. Right? The Go 1.0 release was in March of 2012. But at the beginning of 2020, I was very interested in Rust, and I felt that Rust as a programming language would be essentially the best way to implement this kind of, like, high performance server side software. And I also thought that we could bring in other open source tools and libraries that would help us get there faster.
Specifically, like, we didn't want to create our own SQL execution engine from scratch. Right? That's a very, very big investment and there are other systems out there that can do it. And initially, we thought that we might be pulling in something that was written in either C or C++, which meant, like, bringing that code into a Rust project is actually fairly straightforward and you have, like, zero cost abstractions and basically a very clean way to integrate it. But when we started looking around, we saw that there were actually some Rust projects that were super interesting, right, that would enable us to do this. So 1, persistence format. Right? We wanted a format that was more broadly addressable, right, from other tools.
And in 2020, the most obvious choice, at least to us, was Parquet. Parquet came out I think in, like, 2016, so it was beyond, like, the early, early adopter phase and was starting to get more usage in, like, other big data processing systems, data warehouses. And we felt that if we used that as the persistence format, we'd, 1, get the amount of compression we needed for our data to make it, like, you know, compact at scale. But the other is, like, make it so we could share it with other third party systems. So that was kind of an obvious choice. Then we knew, like, we needed fast analytics on the data, right, so that's when we started looking at Arrow as the, like, in memory columnar data structure.
Right? 1 of the things I mentioned is, you know, this need for supporting high cardinality data, but then the other need is essentially, like, doing analytic style queries on time series data so that you can do analysis. Versions 1 and 2 of InfluxDB, those kind of analytics queries were like slow because of the way the system was architected under the hood, and we thought if we're going to be able to do fast analytical queries on time series data, it's going to have to be in this columnar format. So we kind of adopted Arrow as the in memory format for this data, which then led to, you know, these other pieces.
And then in early 2020, we looked at a number of different query engines we could potentially use. We looked at DuckDB, which was still very nascent at that time. We looked at ClickHouse's engine, which again was nascent compared to where it is now, and we also looked at DataFusion. And at the end of the day, we decided that DataFusion would be our choice because, you know, it was written in Rust. And the thing is, like, all 3 of those projects that we evaluated, we realized there was gonna be a lot of work that we would have to do to be able to support the time series use cases that we were aiming for.
And we felt that if we're gonna have to do a lot of work and end up contributing heavily to this query engine, we might as well do it in a language that we intend to use, which is Rust. Right? DuckDB and ClickHouse are both implemented in C++. And we also felt that, with DataFusion being part of the Apache Foundation and being part of the Arrow project, we were making a bet that it would essentially, like, start to gather momentum and pick up steam, and there'd be other people who would contribute to it over time. And over the last, you know, 3 and a half years that we've been heavily developing with it and contributing to it, we've certainly found that to be the case as more people have been adopting Parquet, more people have been adopting Arrow.
They've been contributing to those 2 and DataFusion, and Flight and Flight SQL are also becoming kind of a standard RPC mechanism essentially for exchanging, you know, analytic datasets or, you know, millions of rows quickly in a high performance way.
[00:13:08] Unknown:
And each of those pieces of the stack is definitely well engineered. They've been gaining a lot of momentum. There's been a lot of investment in that overall ecosystem, but they are all, I guess, they're not as narrowly scoped, Arrow in particular, as when they first started, but they are all focused on a particular portion of the problem. And in order to build them into a cohesive experience, I'm curious what was the engineering effort that's necessary to actually build a fully executable database engine and platform experience on top of those disparate parts?
[00:13:47] Unknown:
Yeah. I mean, it's certainly true that when Arrow first started, it essentially was like an in memory specification, and the dream there was essentially that, you know, you have data scientists who are trying to do analysis in either Python or R. Right? And the thing is they almost always have to get their data from 1 place and bring it in and exchange it to another thing. So the vision there was essentially how do you do data interchange between these different data science tools and systems that is zero copy, zero cost serialization, deserialization, right, super, super fast.
And Wes and his team started with that, and then they saw, like, okay, wait a second. Now people also have these needs to, like, persist the data. So we need a persistence format. He brought in Parquet because he also helped define Parquet when it was first created, but that became an obvious add on. And then, you know, the RPC mechanism, they're like, okay. Well, now you have servers that are running things, so you need a way to exchange the data. Again, an obvious add on. And DataFusion, again, like, if you're working with this data, like, in Python, you have, like, pandas, and in R, you have, like, these, you know, different things. You have, like, either data frame libraries or whatever. But a lot of the time, people just wanna execute a SQL query, and you need an execution engine that can work with this Arrow format natively that's gonna be super fast. Right? Anything that's fast in Python isn't actually written in Python, it's written in C or C++ and then wrapped.
So that's what they realized from the data science perspective. Now from the perspective of people creating a data platform, like, an entire data platform or a database server or something like that, the thing that's tricky about it is a lot of these formats are actually designed for exchanging, like, a set chunk of data. Right? Like, Parquet is an immutable format. Right? It's not meant to be updated. You write a Parquet file, and that's that. Arrow, again, like, you don't append to Arrow buffers on the fly. Like, you create an Arrow buffer, it's well defined, and then you can hand it off.
So having a system that's basically able to ingest data live, right, like individual writes, individual rows that you're writing in, and being able to combine that with this historical dataset that's represented either as Arrow buffers in memory or Parquet files on disk. Right? Moving all that data around, that becomes, really, like, the trickiest part of creating, like, a larger scale data platform. It's like, how do you move that data around? How do you combine the real time data with the historical data? And how do you make that all fast, and how do you make it easy to use?
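As a rough illustration of that combination, here is a minimal sketch, under an assumed schema, assumed table names, and a hypothetical file path, of querying recent in-memory data alongside historical Parquet with the same libraries: recent rows are buffered as an Arrow record batch and registered as a DataFusion MemTable next to a Parquet file, and one SQL statement spans both. A real engine would add the buffering, compaction, and persistence machinery around this.

```rust
use std::sync::Arc;

use datafusion::arrow::array::{Float64Array, StringArray, TimestampNanosecondArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Hypothetical time series schema shared by the recent and historical data.
    let schema = Arc::new(Schema::new(vec![
        Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), false),
        Field::new("host", DataType::Utf8, false),
        Field::new("cpu", DataType::Float64, false),
    ]));

    // Recent writes buffered in memory as an Arrow record batch.
    let recent = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(TimestampNanosecondArray::from(vec![1_700_000_000_000_000_000_i64])),
            Arc::new(StringArray::from(vec!["host-a"])),
            Arc::new(Float64Array::from(vec![0.42])),
        ],
    )?;

    let ctx = SessionContext::new();
    // Recent data: an in-memory table over Arrow batches.
    ctx.register_table(
        "recent_data",
        Arc::new(MemTable::try_new(schema, vec![vec![recent]])?),
    )?;
    // Historical data: an immutable Parquet file (hypothetical path).
    ctx.register_parquet(
        "historical_data",
        "data/cpu_history.parquet",
        ParquetReadOptions::default(),
    )
    .await?;

    // One query spanning the live buffer and the historical file.
    let df = ctx
        .sql(
            "SELECT host, avg(cpu) AS avg_cpu \
             FROM (SELECT * FROM recent_data UNION ALL SELECT * FROM historical_data) AS cpu_all \
             GROUP BY host",
        )
        .await?;
    df.show().await?;

    Ok(())
}
```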
All of that work is basically a non trivial amount of effort, but it's certainly made easier by the fact that you no longer have to create the lower level primitives, right, to build that data platform. You don't have to create the query engine. You don't have to create the file format. Right? Those things basically just exist, and, you know, I've heard Wes refer to it as basically the composable data stack. Right? Which is you can kind of pick and choose these pieces that you want to work with. Right? You can use the DataFusion query engine, but not use Parquet at all and, you know, not use Flight if you don't want to. It uses Arrow under the hood, so that kinda, like, comes along for the ride. But, yeah, like, all of these different pieces are kind of, like, you know, they're designed to be modular so that you can pick a different persistence format if you want that. You can pick a different execution engine. Right? Within the Arrow ecosystem, Voltron Data, the company that Wes ended up starting with some other people, backs a lot of the Arrow stuff as well.
1 of the things they created was this project called, I don't know how to pronounce it, Velox, which is basically this execution engine that was created in conjunction with some work at Facebook. Right? So the idea is you can pick and choose these components and kind of tie them all together into a larger, like, operational system where you're essentially solving problems around data warehousing, real time analytics, and essentially just, like, working with what I would say is observational data at scale. Right? Where observational data could be data from your servers, applications, sensors, logs, whatever it is.
[00:18:18] Unknown:
Are you sick and tired of salesy data conferences? You know, the ones run by large tech companies and cloud vendors? Well, so am I. And that's why I started Data Council, the best vendor neutral, no BS data conference around. I'm Pete Soderling, and I'd like to personally invite you to Austin this March 26th to 28th, where I'll play host to hundreds of attendees, 100 plus top speakers, and dozens of hot startups on the cutting edge of data science, engineering, and AI. The community that attends Data Council are some of the smartest founders, data scientists, lead engineers, CTOs, heads of data, investors, and community organizers who are all working together to build the future of data and AI.
And as a listener to the data engineering podcast, you can join us. Get a special discount off tickets by using the promo code depod20. That's depod20. I guarantee that you'll be inspired by the folks at the event, and I can't wait to see you there.
[00:19:18] Unknown:
Another interesting element of building your platform on top of all of these open source components is that by virtue of it being a layered stack, you can have additional integrations that can come in at each of those different layers rather than having the main interface be the only way of accessing the data that it contains. It also gives you the benefit of being able to capitalize on the overall ecosystem of investment and the network effects that you get from those different open source projects. So I'm wondering if you can comment on some of the ways that you've seen that benefit materialize in your work of building this data platform on top of these different components.
[00:20:03] Unknown:
Yeah. So this is actually, like, 1 of the things I'm most excited about for these different pieces and for, you know, the work we're doing. I think we actually need to add another letter to the FDAP acronym and maybe, like, jumble them up. But basically, the other letter is I, for Apache Iceberg. So Iceberg is essentially a standard for creating a data catalog of essentially Parquet files in object storage. Right? And we're basically building first class support for that in InfluxDB 3.0, where all of the data that's ingested into an InfluxDB 3.0 server can be exposed essentially as Iceberg catalogs, which is awesome because that's a standard that was originally developed at Netflix and that was open sourced out into the Apache Foundation, and it's quickly being adopted by other companies. Right? So Snowflake just added support for Iceberg as a format. Even Databricks is adding support for it, even though they have a competing standard called Delta Lake. And within Amazon Web Services, for example, they're adding first class support for Iceberg so that if you have data that's exposed as an Iceberg catalog, you know, in S3, you can then query that data using any of the Amazon, you know, query services like Athena or Redshift or all these different pieces.
So that I think is, like, a really interesting integration because it makes it so that you can access this data in bulk. Right? So if you need to, like, train a machine learning model or whatever, or query against this data for doing large scale analytical queries, and be, for InfluxDB 3, for example, totally outside the operational envelope of the system that's kind of, like, managing all this real time data movement and being able to query in real time, you can basically do all these analytics tasks completely disconnected from that. And, again, like, you could use DataFusion for that, but you could also use Athena, right, which is, you know, based on a Java query engine called Trino or Presto or whatever it is now.
Or you could use DuckDB or ClickHouse or any 1 of these other systems to do your query processing and analytics against that data. So that integration, I think, is super interesting. The other 1 that I think is interesting is within the Arrow project. So they have Flight SQL, which is basically like an RPC mechanism for essentially sending SQL queries to a server and getting back millions of rows really, really quickly. And they have basically a new standard that they've created that's kind of like competing with ODBC. So ODBC is obviously the database connection standard. It was for essentially transactional databases and relational databases.
The Arrow 1, once that becomes a thing, I think it'll be a really, like, a standard way to connect to analytical data stores of any kind, whether it's data warehouses or real time data systems or whatever. And I think having those things be standards, and having them, you know, contributed to by many different companies, not just supported by a single vendor, will improve the pace of innovation in this space for these, you know, large scale data use cases, which are only gonna continue to, like, increase and multiply. I think it makes it so that we can have basically many more tools that can integrate with each other. Whereas, like, if you look at data warehousing, you know, for the last 20 years, it's largely been, like, you know, data warehouses are basically kind of like data roach motels.
Like, your data goes in and you have to get all the data in the data warehouse, but then if you wanna do anything with it, you have to send the query to the data warehouse and, like, all this other stuff. Right? And there's just not this really good integration, like the data warehouse just becomes this 1 place. So being able to access it from a bunch of different tools, without having 1 piece of software be the arbiter of the entire thing, I think is really interesting.
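For a sense of what that Flight SQL interaction looks like from the client side, here is a rough sketch using the Rust arrow-flight crate with its Flight SQL support enabled. The endpoint URL, table, and query are hypothetical, and the method names are assumptions based on recent crate versions (this client API has shifted between releases), so treat it as the shape of the exchange rather than a definitive implementation: the client submits SQL, gets back a FlightInfo describing where the results live, and then redeems each endpoint's ticket as a stream of Arrow record batches.

```rust
use arrow_flight::sql::client::FlightSqlServiceClient;
use futures::TryStreamExt;
use tonic::transport::Endpoint;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to a Flight SQL server (hypothetical local endpoint).
    let channel = Endpoint::from_static("http://localhost:8082").connect().await?;
    let mut client = FlightSqlServiceClient::new(channel);

    // Submit the query; the server answers with a FlightInfo describing result endpoints.
    let info = client
        .execute("SELECT host, avg(cpu) FROM metrics GROUP BY host".to_string(), None)
        .await?;

    // Redeem each endpoint's ticket for a stream of Arrow record batches.
    for endpoint in info.endpoint {
        if let Some(ticket) = endpoint.ticket {
            let batches: Vec<_> = client.do_get(ticket).await?.try_collect().await?;
            println!("fetched {} record batches", batches.len());
        }
    }

    Ok(())
}
```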
[00:24:32] Unknown:
Absolutely. And to your point of Flight SQL being a new RPC mechanism to unlock a lot of potential and reduce a lot of the pains, it just makes me sad that I obtained all of that scar tissue around ODBC for nothing.
[00:24:50] Unknown:
I mean, I think ODBC is gonna be around for a very long time. I don't think it's going away. Yeah.
[00:24:56] Unknown:
Absolutely. And the counterpoint to the benefits that you get building on top of open source is that particularly when you have a business that is being powered by these components, you adopt some measure of platform risk because you're not the only person who has a vision for the future direction of these technologies. And some of that future may or may not be compatible with the vision that you have for it. And I'm curious how you think about that platform risk and the mitigating factors that you have in the engineering that you're doing to account for any potential future shift in the kind of vision and direction of those products?
[00:25:38] Unknown:
Yeah. I mean, you know, you can wrap the libraries with your own abstractions, but the problem is that comes with a high price, a high cost. And the truth is, even if you wrap it with your own abstractions, if the libraries end up changing significantly and you're like, okay, we need to replace it with something else, it's not good. That's gonna be, like, a nontrivial task. Right? The best insurance is essentially to have enough people contributing to the core of the thing to be able to have some level of influence on the direction of the project. Right? Like, ultimately, there's gonna be platform risk, but, you know, take it from the other side, which is we decide to develop all this stuff ourselves, right, and keep it closed source and just, you know, whatever. Well, the risk there is, like, I mean, that's just an absolute mountain of work to do. Right? And the thing is, like, as these projects have matured, like I said, we've seen other people contributing to them. So now we regularly get, like, performance improvements in the query engine or new functions in the query language and all this stuff. Like, we help manage the project. Like, we have people contributing, you know, we make significant investments into the open source pieces, but, you know, those are things that we kind of get for free as a result. Like, essentially, it means that the risk we'd have if we kept it all closed source is that our pace of development would be outmatched by the set of people contributing to this open thing.
Right? We may be able to get, you know, somewhere initially, but, like, eventually, the open source people are gonna, like, outpace a small team of proprietary developers. Now if you have unlimited resources and you can basically just, like, you know, create a long lived team of people that you're able to fund forever, then the situation changes. But I think for startups in the technology space, like, their best bet is to adopt platform pieces that, you know, you can contribute to, that can form the basis of the things you're building. Right? Like, and this is, you know, you don't create your own operating system. Right? You use Linux, and you don't create your own programming language. You use whatever language you're gonna use there. And I think all that stuff happens, you know, higher and higher. All these pieces kinda, like, build on each other. In this case, like, we're talking about the FDAP stack and all these different components.
They're essentially the toolkit that you would use to build a database, an analytical database or a data warehouse. Right? So why create those things from scratch? Right? Your ultimate goal is not really to create a data warehouse. It's to deliver, you know, value for your customers who are actually paying for the solution. And they don't really care about a data warehouse per se. They care about solving their data problem for their customers. So as much as you can, like, adopt and say, like, okay, this isn't gonna be our thing that we innovate on. You know, that's not how we actually, like, add value to this market, to this thing that we're selling. This is basically just like a barrier to entry. And if you can adopt an open source thing that, like, reduces the barrier, then great.
[00:28:59] Unknown:
Absolutely. And by virtue of being involved with and participating in the open source projects that you're relying on, you also get the benefit of early warning of knowing that, okay, this is the future direction that the community would like to see. And so now I can proactively plan for those shifts in the underlying technology so that I can accommodate them in the end result that I am building on top of it.
[00:29:26] Unknown:
Yeah. And, ultimately, like, the absolute worst case scenario, right, is, like, the community is gonna make some weirdo changes that are just completely incompatible with what we need to do. Great. Then we can just fork the project from whatever that last point was, since it's permissively licensed open source. We can fork the project, and then we have 2 options. Do we make our fork closed source, or do we make our fork something publicly available, and we just continue on from there. Right? And at that point, you haven't adopted any more risk than you would have had anyways with, you know, your closed source thing. Although, I will say, like I mentioned, we spend a lot of time contributing to these community projects.
So there's a good amount of effort that we put forward that essentially doesn't benefit us directly. Right? It's not that we're doing this community thing or managing these, like, efforts of different people contributing or whatever because it's something we need specifically for our product. But, again, the bet is that, you know, like, okay, there are a bunch of things we'll do that are not a direct benefit to us, but there are other things coming in from the community that are, so it all kind of, like, evens out. And actually, in my experience, it doesn't even out. Like, we get far more out of it than we put in. Even though, like I said, we try to put in as much as we possibly can.
It's just that when you have, you know, dozens of developers from around the world and different companies contributing to this thing, like, the sum is gonna be greater than what any 1 individual or 1 company produces and puts into it.
[00:31:15] Unknown:
And so looking at the component pieces of this stack and the overall architecture and system requirements for a database engine, what are the additional pieces that you had to build custom? What is the work involved in building a polished user experience on top of these different components, and some of the ways that you're thinking about what are the appropriate abstraction layers or the appropriate system boundaries for what these 4 pieces of the stack do and the eventual inclusion of Iceberg, and what is the responsibility of InfluxDB as the database experience that needs to be built on top of it?
[00:31:49] Unknown:
Yeah. So, I mean, basically, like, these components are really just libraries. Right? They're just programming libraries that we use. So they're not actually a piece of running software that will do anything on its own. I mean, DataFusion does have, like, a command line tool where you can say, like, point it at, you know, a file and execute a query against it if it's CSV or JSON or Parquet. Right? But beyond that, it's not like a process that'll run on a server that will respond to requests and all this other stuff. So you kinda have to build all that scaffolding around it. Right? You have to build a server process, and you have to decide what your API is gonna be. Right? For writing data in, most people are not gonna wanna write, you know, Arrow record batches or Parquet files in, because those 2 formats actually aren't super easy to create yourself. Like, usually when people create those formats, they do it as a transform from some other data that's easier to work with, like CSV or JSON or whatever. Right? So you have to decide, like, how do you write data in, what's that format, how do you translate it to Arrow or Parquet.
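As a small sketch of that write-path translation, here is roughly what turning a few ingested rows into an Arrow record batch and persisting them as a Parquet file looks like with the Rust arrow and parquet crates. The schema, values, and file name are hypothetical, and a real ingest path would accumulate many writes, handle schema changes, and decide when to flush.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{Float64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical schema for incoming rows (parsed from JSON, CSV, line protocol, etc.).
    let schema = Arc::new(Schema::new(vec![
        Field::new("host", DataType::Utf8, false),
        Field::new("cpu", DataType::Float64, false),
    ]));

    // A handful of ingested rows pivoted into columnar Arrow arrays.
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(StringArray::from(vec!["host-a", "host-b"])),
            Arc::new(Float64Array::from(vec![0.42, 0.87])),
        ],
    )?;

    // Persist the batch as an immutable Parquet file using default writer properties.
    let file = File::create("cpu_batch.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    Ok(())
}
```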
You need to decide, like, for the query interface, like, SQL is the language, but then how are they gonna make the request? Right? Is it gonna be HTTP, gRPC, whatever? And then what is the response format gonna be? Do you wanna give them Arrow? Do you wanna give them Parquet? Do you wanna give them CSV, JSON, something else? Right? So all those pieces you kinda have to decide on and create. Right? Basically, the entire, like, piece of server software. And then there's, you know, all the operational pieces, which is if you have to run this in a Kubernetes cluster, if you have to run this in the cloud or whatever.
And also, for us, for InfluxDB 3, you know, what we have currently is a distributed version of the database that's comprised of a number of different services that run inside a Kubernetes cluster. Right? And we separated out the ingestion tier from the query tier, from compaction, from a catalog that runs. Right? So we basically had to create services for each of those and APIs for how they interact with each other, and then a bunch of, like, tooling and stuff like that to actually, you know, spin this up on the fly and monitor it, run it, all that separate stuff. So, I mean, if you're gonna adopt these components to build, you know, a data system, there's still a lot of work to do, but yeah.
[00:34:34] Unknown:
For people who are interested in building some database engine, or they are interested in the functionality of any of these different pieces, I'm curious what you see as some of the other types of projects that would benefit from the capabilities of any or all of those pieces of the stack, and maybe some of the other elements that could be built up and added to that ecosystem to maybe reduce the barrier to entry that you've had to pay?
[00:35:04] Unknown:
Yeah. I mean, so what I've seen is a bunch of different kinds of projects and companies are starting to adopt these pieces of the stack. So, you know, I just saw 1 yesterday. There was basically, like, a new stream processing engine that essentially is using DataFusion, and thus also Arrow, as the way to do, you know, processing within the stream processing engine. Right? So you can execute SQL queries against, like, data coming in a stream, whatever. So there's that. There are different kinds of database systems, either time series database or document database or data warehouse or whatever. Like, I've seen a number of projects, either open source or in companies, that are starting to use those components.
There's another project right now where contributors from Apple are basically putting in essentially a Spark execution engine, which is based on DataFusion. Essentially, this is, you know, a replacement for the open source Java Spark implementation that's supposed to be faster and stuff like that. So basically, you see, like, 1 component within Spark is being replaced with DataFusion as part of this. And actually, the creator of DataFusion, Andy Grove, was originally creating DataFusion for that use case inside of NVIDIA.
So you see, like, all these different companies, like, creating those different pieces. I think it's still early in the Rust ecosystem of tools to see what's gonna happen, like, what open source projects are going to become kind of big. Right? Right now, when you think of, like, big data processing tools, most of that environment is in Java. Right? It started with Hadoop and then continued with Spark and, like, all the different components there, and Kafka's written in Java and Flink's written in Java. Right? So you have different stream processing systems, and all these things have to integrate together. What I anticipate is that, you know, over the next 10 years, you'll see a lot of those systems rewritten, recreated using Rust and using DataFusion and Arrow and Parquet as the underlying primitives, and ideally, they wouldn't just recreate the exact same thing, you know, but instead of Java, it's in Rust. There will certainly be some of that, but ideally, what they will do is they will take, you know, a lot of lessons learned from those previous versions of those pieces of software.
Like, okay, how can we make the user experience better? Right? So it's easier to express the kind of things we wanna express. Or how do we make operations better so it's easier to, like, operate these systems at scale. So I think it's really early yet, though. It's not clear to me, like, from an open source perspective, what projects are gonna be the winners here that eventually, like, you know, supersede the previous Java systems.
[00:38:06] Unknown:
Absolutely. And I've definitely been seeing a little bit of that as well, even 3 to 5 years ago, of C++ being the implementation target, particularly built around the Seastar framework for being able to take advantage of multi CPU architectures, most notably the ScyllaDB project as a target to reimplement Cassandra, and then Redpanda taking on the Kafka ecosystem. Yep. And another interesting aspect of this space is Arrow as the focal point of that data interchange has been gaining a lot of ground. It started off as a very nascent project. There's been a lot of effort put into making that more of the first target rather than being a second consideration, and it's been working on integrating with the majority of the components of the data ecosystem.
I'm wondering what you see as some of the remaining gaps in coverage or some of the white spaces in the overall Arrow ecosystem that are either immature or completely absent, and spaces that you would like to see the overarching data community invest in building out more capabilities and capacity?
[00:39:23] Unknown:
So I think there's still probably some work to be done within Arrow as a specification itself for representing data in a more compact form. Right? For some kinds of, like, columnar data, it's just not as efficient as I think it could be. But I think that was a result of 1 of the design goals, which was essentially O(1) lookup for any individual element within the set. I think if that constraint is loosened, that opens up the possibility for other kinds of compression techniques and stuff like that that will make it a better format for compressed data in memory, which I think is something that would be potentially interesting.
I think there's still a question of, like, okay, if we're gonna have a stream processing system that uses these tools, what does that look like? Because Arrow as a format actually is not well suited for stream processing. Right? Because it's a columnar format, so, you know, the conceit there is that you are sending in, you know, many, many rows at the same time. Whereas when you think of stream processing, you think of either micro batching or individual rows, like, 1 by 1. Right? So there's no good, like, I think, translation layer between, okay, if you care about doing stream processing and you wanna move to Arrow or, like, batch processing or larger scale data processing, how do you make that transition, and what do the tools look like for that? I think that's still very difficult. Right? And it's certainly, like, something we've done in InfluxDB 3, which is, like, translating, you know, line protocol, individual rows being written in, into the Arrow stuff.
I think the distributed query processing is something that is probably gonna, you know, get more work. It's definitely something that needs more work within the DataFusion piece itself. I think later this year, in a couple of months, hopefully, they're gonna vote on whether DataFusion becomes its own top level Apache project outside of Arrow. My best guess is that's gonna happen. And then what we'll probably see is, like, DataFusion will then have some subprojects, 1 of which I think will be around distributed query processing, which I think will be important for it really to become a contender and a competitor in the larger scale data warehousing space.
What else? I don't know. Like, Parquet has gotten some interesting improvements along the way. I think I don't know. There was, like, GeoParquet for representing geospatial data. I think that's gonna be super important. So yeah.
[00:42:13] Unknown:
This might be a little bit too far afield or too deep in the weeds, but there was also, for a little while, a bit of a contest between Parquet and ORC as the preferred columnar serialization format. I'm wondering if you have seen the dust settle around that, and there has been a general consensus around 1 or the other, or if those are still kind of a case by case basis, do what you think is right for different use cases?
[00:42:39] Unknown:
I may just be biased because, you know, I'm rooting for Parquet, but I remember that being a thing, and I remember looking at both formats, you know, from a high level, back in the day. But I don't really see ORC as a format coming up nearly as much. Right? It seems to me that Parquet has kind of won the, you know, the mind share largely, and that's what people have kind of coalesced around. Now, of course, you know, because we're talking about data at scale, there's probably, like, mountains of data in people's, like, data lakes and data warehouses that is represented as ORC, so that's not gonna go away. But by and large, what I see is that Parquet seems to be the standard format that all the big data vendors are coalescing around.
[00:43:30] Unknown:
I've been seeing a similar thing. And then to the point of streaming and record based ingest of data versus the columnar approach for Parquet and Arrow, I know that Avro and Parquet have a defined kind of translation method of being able to compact multiple Avro records into a Parquet file. And I'm curious if you're seeing anything analogous for the Arrow ecosystem of being able to maybe manage that translation of multiple Avro records batched into an Arrow buffer that can then subsequently be persisted into Parquet, or using that Avro to Parquet translation as the intermediary to then get loaded into an Arrow buffer?
[00:44:16] Unknown:
I mean, I haven't really seen that. I mean, it's pretty easy to go from Arrow to Parquet or Parquet to Arrow, right, because, you know, Parquet's within the Arrow umbrella. So people in the various projects have created a bunch of, like, translation layers to do that. But I really haven't seen any, like, rise of, like, oh, these, like, row based formats into either Arrow or Parquet. It just seems to be, like, kind of 1 off. Honestly, I don't see Avro come up that much.
So, mainly, I think what I see the most, what people care about, is, like, JSON data, just because it's so easy, you know, to exchange between different languages and different services. And, honestly, I think protobuf more than Avro or anything else. I think that's maybe because of, you know, the popularity of gRPC.
[00:45:15] Unknown:
And as you have been investing in this ecosystem, building on top of the different components, I'm wondering what are some of the most interesting or innovative or unexpected ways that you have seen some or all of those pieces used together?
[00:45:30] Unknown:
So, honestly, stream processing was a surprise for me, because, like, when I think of Arrow and DataFusion, I wasn't originally thinking that people would use these things for stream processing systems. Right? I think of them more around, like, batch processing: you know, I execute a query against this data, whatever. So seeing people pull that stuff into stream processing systems has been very surprising. Elsewhere, I'm not sure. Like, I've seen a few, like, observability solutions start to look seriously at using Parquet as the persistence format. That's a little surprising too. Mainly because, like, when I think about observability, you think of, like, metrics, logs, traces. Right? And, generally, what people have done is they've created specialized, you know, formats and back ends for each of those individual use cases.
So I've seen, you know, some people start to look seriously at having Parquet represent, like, any of that kind of data, which, to me, that's definitely, like, 1 of our visions long term, being able to store any kind of observational data in Influx and thus in Parquet. But to see more observability vendors start to look at that seriously has been a bit of a surprise too.
[00:46:50] Unknown:
And in your experience of working in the space, rebuilding the influx database, and investing more into the Arrow ecosystem, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:47:05] Unknown:
I mean, 1 of the lessons, which is somehow a lesson I always need to relearn as a software developer, is things always take longer than you expect them to take. So this project, like I said, you know, I started seriously thinking about it about 4 years ago. Really serious development on it for the last 3 and a half. It's basically just a long road to create this kind of system. So there's that. I've been pleasantly surprised by the adoption, and actually, the level of contribution from outside, you know, people at, actually, you know, companies of a very significant size, has been also a bit of a surprise. Like, I think, for companies that reach, like, you know, crazy scale, which are, you know, companies that you know the names of, like, I think many of them are contributing to these projects because they kinda have to, like, create their own things, because literally nobody on earth has the kind of scale problems they have except for maybe, like, 10 or 20 different companies.
So they end up having to roll their own solution. And, again, I think the fact that these companies are contributing is something I didn't expect, particularly this early on. And I think that speaks to, you know, the thing we were talking about earlier, which is, like, what kind of platform risk is there to adopting this code? And it's like, well, the alternative is you create all this closed source software that is really, like, not the problem you're trying to solve. This is just, like, the problem you have to solve to get to the problem you're trying to solve. So that's been, like, I think, a pleasant surprise, seeing this, you know, mature over the last few years.
[00:48:55] Unknown:
And for people who are looking to build data systems, data processing engines, what are the cases where the FDAP stack is the wrong choice?
[00:49:07] Unknown:
So I don't think it's particularly designed for OLTP workloads. Right? So, you know, traditional relational databases and stuff like that. Like, there are places where, you know, it would make sense to have it as, like, essentially, like, an interface point. But, I mean, you could certainly use, like, DataFusion as your query engine in an OLTP workload. But to me, it wouldn't make sense to use, like, Arrow as a way to ingest data, or Parquet. Because really, when you think about OLTP workloads, you think about individual requests with individual record updates and stuff like that. So I really do think these tools are more geared towards larger scale analytical workloads against, you know, data that you can largely view as immutable. Right? This is like observational data and stuff like that. So yeah.
[00:50:00] Unknown:
And as you continue to build and iterate on the new version of InfluxDB and invest in the Arrow ecosystem and the components we've been discussing, what are some of the things you have planned for the near to medium term, or any particular projects or problem areas you're excited to dig into?
[00:50:16] Unknown:
So as I mentioned, the thing I'm most excited about is essentially more integration: adding support for Apache Iceberg. There's already a Rust project to do Apache Iceberg, but it's not fully baked yet, so we may need to contribute to that, or maybe the people who are working on it will get it fully baked before we actually get to the point where we're pulling it in. So Apache Iceberg is a big thing. I think in the medium term, the distributed processing stuff in DataFusion is gonna be super interesting.
And then from InfluxDB's perspective, as I mentioned, we have our commercial distributed version of the database right now. But this year, we're coming out with the open source version of the monolithic, single server version of the database, and getting that open source piece out there with a new version 3 API that represents a much richer data model than previous versions of InfluxDB and takes advantage of what you can do with Arrow and Parquet as the formats. That's something I'm really, really excited about, because then I think that, from a technology perspective, InfluxDB will actually be able to fulfill the vision that we've had all along, which is that it's useful for any kind of observational data you can think of, not just metrics data from your servers or networks or your apps. Right?
[00:51:47] Unknown:
Are there any other aspects of the work that you've been doing on the InfluxDB engine, the work you've been doing investing in and building on top of the Arrow ecosystem, or the overall question of how the Arrow ecosystem might influence the future direction of the data processing ecosystem, that we didn't discuss yet and that you'd like to cover before we close out the show?
[00:52:11] Unknown:
I don't think so. I guess, more broadly, the way I view the data space right now, when you're talking about analytical data, is that there's this distinct separation between data warehousing on one side, which is these large scale analytical queries and stuff like that, and stream processing on the other, which is more about real time data as it arrives. Really, when I think about those two things, ultimately what developers and users want is basically some magical oracle in the sky that they can send a query to, where the result will come back in sub 50 milliseconds.
If we had that, we wouldn't need stream processing; we wouldn't need all these different things. But I think as the technology improves and things get better and better, data warehousing is gonna become more real time, and the real time pieces are gonna move more towards data warehousing, because ultimately people don't wanna think about separating stream processing from data warehousing. And one of the things I'm excited about is essentially the idea that these different building blocks could potentially be the things that people use to close that gap and create a big data solution that works either for real time data or for big scale data warehousing.
[00:53:40] Unknown:
But I thought people liked reinventing the Lambda architecture.
[00:53:45] Unknown:
Oh, no. Yeah, they do. They do.
[00:53:49] Unknown:
They just like to call it something new. Maybe it's the Kappa architecture. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Biggest gap? Oh,
[00:54:15] Unknown:
I don't know, actually. I mean, obviously, I think the most interesting side of this is essentially time series data: being able to represent it and do analysis on data as time series. So that's our focus. That's what I think is the most interesting thing right now. But, yeah, I still think that's an unsolved problem, by us or anybody else. So that's what we're working towards.
[00:54:48] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing, both contributing to and building on top of the Arrow ecosystem and the components thereof. It's definitely a very interesting area of effort. It's great to see the work that you and your team are doing to help bring all of us forward in that space. I appreciate the time and energy you're putting into that, and I hope you enjoy the rest of your day.
[00:55:12] Unknown:
Cool. Thank you.
[00:55:19] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and just tell your friends and coworkers.
Chapters
Introduction and Overview of Data Engineering Podcast
Interview with Paul Dix: Introduction and Background
Paul Dix's Journey into Data and InfluxDB
The FDAP Stack: Components and Architecture
Design Goals and Constraints of InfluxDB 3.0
Engineering Effort and Integration of Open Source Components
Benefits of Building on Open Source Components
Platform Risk and Mitigation Strategies
Building a Polished User Experience
Adoption and Future of the FDAP Stack
Lessons Learned and Contributions to Open Source
Future Plans and Exciting Projects
Closing Thoughts and Final Questions