Summary
When you think about selecting a database engine for your project you typically consider options focused on serving multiple concurrent users. Sometimes what you really need is an embedded database that is blazing fast for single user workloads. DuckDB is an in-process database engine optimized for OLAP applications that meets you where you are, whether that's Python, R, Java, or even the web, to speed up your analytical queries. In this episode, Hannes Mühleisen, co-creator and CEO of DuckDB Labs, shares the motivations for creating the project, the myriad ways that it can be used to speed up your data projects, and the detailed engineering efforts that go into making it adaptable to any environment. This is a fascinating and humorous exploration of a truly useful piece of technology.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Your host is Tobias Macey and today I’m interviewing Hannes Mühleisen about DuckDB, an in-process embedded database engine for columnar analytics
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what DuckDB is and the story behind it?
- Where did the name come from?
- What are some of the use cases that DuckDB is designed to support?
- The interface for DuckDB is similar (at least in spirit) to SQLite. What are the deciding factors for when to use one vs. the other?
- How might they be used in concert to take advantage of their relative strengths?
- What are some of the ways that DuckDB can be used to better effect than options provided by different language ecosystems?
- Can you describe how DuckDB is implemented?
- How has the design and goals of the project changed or evolved since you began working on it?
- What are some of the optimizations that you have had to make in order to support performant access to data that exceeds available memory?
- Can you describe a typical workflow of incorporating DuckDB into an analytical project?
- What are some of the libraries/tools/systems that DuckDB might replace in the scope of a project or team?
- What are some of the overlooked/misunderstood/under-utilized features of DuckDB that you would like to highlight?
- What is the governance model and plan for long-term sustainability of the project?
- What are the most interesting, innovative, or unexpected ways that you have seen DuckDB used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on DuckDB?
- When is DuckDB the wrong choice?
- What do you have planned for the future of DuckDB?
Contact Info
- Hannes Mühleisen
- @hfmuehleisen on Twitter
- Website
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- DuckDB
- CWI
- SQLite
- OLAP == Online Analytical Processing
- Duck Typing
- ZODB
- Teradata
- HTAP == Hybrid Transactional/Analytical Processing
- Pandas
- Apache Arrow
- Julia Language
- Voltron Data
- Parquet
- Thrift
- Protobuf
- Vectorized Query Processor
- LLVM
- DuckDB Labs
- DuckDB Foundation
- MIT Open Courseware (OCW)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Hannes Mühleisen about DuckDB, an in-process embedded database engine for columnar analytics. So, Hannes, can you start by introducing yourself?
[00:02:07] Hannes Mühleisen:
Thanks for having me. I'm really happy to be here. My name is Hannes Mühleisen, indeed. I am the cofounder and CEO of DuckDB Labs, a small artisanal data management systems company in Amsterdam, the Netherlands. I am also a senior researcher in the database architectures group at CWI. CWI is the national research center for computer science and mathematics in the Netherlands. It's mostly famous for inventing Python, or being the place where Python was invented. Very different things. Yeah. So I spent, like, the last 10 years of my life working on data management systems, and in the last 3 or 4 years on a system called DuckDB. And, yeah, I'm really excited to tell you more about that.
[00:02:43] Tobias Macey:
And do you remember how you first got involved in the area of data and what it is about that space that has drawn your attention?
[00:02:54] Hannes Mühleisen:
Yeah. It's really fascinating because, you know, cast your mind back to the year, I don't know, 1998 or something like that. I was, you know, messing around with PHP like many kids back then, and there was something called MySQL. I learned that you had to talk to this MySQL in order to do something with data, and it was all very fussy. I wanted to write some sort of application and then had to write SQL queries. And, obviously, I didn't know anything about PHP, I didn't know anything about SQL, so it was really a mess. And, yeah, in fact, you know, you grow as a user of something like MySQL. And then when I went to study computer science, I did a bit of a detour into distributed systems.
I really learned to appreciate the field of databases as a very no-bullshit zone, as I like to call it, because you have very clear metrics of who's right and who's wrong. I mean, obviously, now that I've been in the field for longer, I've also realized that there's more to this than I maybe initially thought. But I still appreciate this, you know, focus on metrics, benchmarks, on measurable effects that is really prevalent in the community. I feel it makes it a very scientific discipline of practical computer science specifically.
After getting my PhD, I joined this database lab, the database architectures group here in Amsterdam. Yeah. And I've been here ever since.
[00:04:20] Tobias Macey:
And as you mentioned, for the past few years, you've been working on this DuckDB project, which you have now built a business around. I'm wondering if you can give a bit of an overview of what it is that you've built there, some of the story behind how it came to be, why you thought it was a project worth spending time on, and some of the areas of focus that you have for the DuckDB platform?
[00:04:42] Hannes Mühleisen:
Okay. So DuckDB is an in-process SQL data management system and query engine, specifically for analytical data management. So OLAP, it's called. Not sure the term is familiar to the listeners, but data management systems come in many flavors. Not one size fits all. And you have this traditional divide between systems that are built for transactions and systems that are built for analytics. DuckDB is built for analytics, which means, you know, it can do large aggregations, large joins, these kinds of things. Big data, some people like to call it. So DuckDB is special because it's in process, which means it doesn't actually run as a separate database server like you know from literally everything else, but it runs inside, you know, whatever application you wanna run it inside of, typically Python, for example.
In that regard, it's very similar to SQLite, the world's most widely deployed database system. But, again, SQLite, while an excellent system, is made for transactional use cases, and DuckDB is focusing very hard on these analytical use cases with, you know, high-throughput query processing. And this is kind of funny because the background story there was that we literally talked to people, which is very rare for computer scientists. We like to talk to the computers, but maybe not so much to the people. We talked to statisticians, and they complained to us that they wanted to do this analysis, and it didn't work.
And they wanted to see whether we could do something about it. At the time, we were working on a different data management system, also built at the research group at CWI, and we were working on integrations with R, the R environment for statistical programming. There were a couple of iterations. This took years, you know, of trying to figure out what the architecture for a new data management system could be that would support these use cases. And these use cases dictated things like, for example: there's always a single user.
They don't really wanna spend time on setting up stuff. They don't wanna spend time on, you know, moving data around, importing, exporting. So this should all just work. But at the same time, of course, state-of-the-art data processing capabilities were required to deal with the large datasets. So that kind of gave rise to this concept that, hey, we need to make something that runs in process, that doesn't require a separate setup for servers and things like that. And we need to make something that is actually fast. So we were in a good position as an academic research lab in the field of high-performance data management systems.
We kind of had an idea what the architecture of the system should be. But what was new, and is still new: I don't think there's anything else out there that has this in-process OLAP focus. And there's also nothing else that is custom built for this specific use case. And it's quite a funky world to be in, to be honest. If you're trying to run in somebody else's process, you have to deal with a lot of special things like, oh, they overwrote a signal handler, therefore you cannot use a signal handler anymore, things like that. But yeah. So that's kind of how it came along. And one of the cool things about CWI, the institute that I still work at, is that they kind of leave you alone. So you have very nice offices, and they treat us well. But there's this understanding that some things just take time, that this is, you know, still worthwhile to do. We were able to basically disappear for a couple of years and start building our data management system from scratch, which I can't imagine other places would have allowed us to do.
[00:08:08] Tobias Macey:
It's definitely great to be given that freedom, and it's always interesting to see what people will come up with given the time and space to really deeply understand the problem space. One of the interesting aspects of the project is obviously the name. So I'm wondering if you can talk to where that inspiration came from and why it's called DuckDB.
[00:08:33] Hannes Mühleisen:
I live in Amsterdam, and we live on a ship. And we're constantly surrounded by waterfowl. So at some point, I thought it was a great idea to have a duck as a pet. The duck was called Wilbur. You know, I got him when he was a little duckling, and he was very cute because he was running around my kitchen all day long. Eventually, he grew up, and he learned to swim and to fly and all these things. It was very cute. At some point, he just left, which is fair enough, I guess, for a teenage duck. But yeah. So DuckDB, obviously, is named after little Wilbur.
We also think that, you know, database systems tend to have completely ridiculous names, so we are in very good tradition there. Plus, they tend to have names like shaving cream, like super performance power, you know? And I think this is just completely ridiculous and also non-Googleable, whereas DuckDB you can find again; it's not a problem. So we are quite happy, and, of course, the logo designs itself, which is great. Absolutely.
[00:09:29] Tobias Macey:
When I first came across it, I just assumed it had something to do with duck typing and sort of being dynamic and adaptable. But
[00:09:38] Hannes Mühleisen:
No. It's literally ducks. And, I mean, we have some post hoc justification for the name, because ducks are versatile. Right? They can swim, fly, and walk. They're also quite resilient. You know, they're hard to kill. So in that sense, it's a perfect role model for our data management system: we can do a lot of things, and it's hard to kill. This is something that we've actually spent quite a lot of time thinking about for DuckDB, how to, for example, tolerate broken hardware. Because if you're one of these cloud guys, you know, you run the hardware that your database runs on. Great. You control the entire environment. But if you're gonna be installed on everyone's laptop out there, which is kind of what DuckDB is aiming for, of course, to basically become something like SQLite, but then for analytical use cases.
If you're on everyone's laptop, you're gonna have thousands of broken machines that you're running on. Right? So how can you actually deal with that? It's kind of a fascinating problem, and we're still working on that.
[00:10:30] Tobias Macey:
Absolutely. And I'm probably jumping ahead a little bit here, but one of the things that I've run into personally as far as broken hardware and database systems is that if you have just a large binary object and some portion of the hard disk gets corrupted, then the entire database becomes unusable. And in the big situation I'm thinking of, it was just one single binary blob, and, you know, there was no sort of separation of individual files for individual records. So I'm curious how that manifests in DuckDB, because I know that with SQLite, at least, it is just a binary blob, and you have one database file that is the whole database. So I'm curious how that manifests in terms of being resilient to broken hardware, but still being portable and adaptable.
[00:11:11] Hannes Mühleisen:
Yeah. That's a very good question. So we did adopt the single-file database, because in our interactions with the people, we have learned that they love to email databases around. And I'm not making this up. Because of this user requirement, we decided to also have a single-file data format, which basically means all the tables that you store in a specific DuckDB database are all in one file. The way it's resilient to corruption is that the file is separated into blocks, and each block has a checksum. So we can detect early that something is wrong, and you can also find out which blocks are still valid, for example, without too much trouble. SQLite also has logical blocks within the file, and we have that too, but we have a checksum on top of it, which means that we know if a block is corrupt, and we will say, no, this hard disk is no longer to be trusted, because that's kind of what you have to do. Right? It's also a bit of a departure from traditional systems design, where we have to start distrusting the hardware. We wrote a research paper some years ago about this topic, for people who are interested, on how you would design a data management system when you cannot trust the hardware. And, again, it's not like some weird out-there idea that we, you know, came up with on a Friday afternoon on our ship. Sorry, in our nice research institute.
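To make the per-block checksum idea concrete for readers, here is a toy Python sketch. The 4 KiB block size and CRC32 checksum are illustrative assumptions, not DuckDB's actual on-disk format; the point is only that corruption can be detected and localized per block rather than discovered only when the whole file fails to parse.

```python
# Toy illustration of per-block checksums in a single-file database.
# The block size and CRC32 are assumptions for the sketch, not DuckDB's
# actual on-disk layout.
import zlib

BLOCK_SIZE = 4096  # hypothetical block size


def write_blocks(path: str, payload: bytes) -> None:
    """Split payload into fixed-size blocks and prefix each with a CRC32."""
    with open(path, "wb") as f:
        for i in range(0, len(payload), BLOCK_SIZE):
            block = payload[i:i + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")
            f.write(zlib.crc32(block).to_bytes(4, "little"))
            f.write(block)


def scan_blocks(path: str):
    """Yield (block_index, ok) so corruption is reported per block, not per file."""
    with open(path, "rb") as f:
        idx = 0
        while True:
            header = f.read(4)
            if not header:
                break
            block = f.read(BLOCK_SIZE)
            yield idx, zlib.crc32(block) == int.from_bytes(header, "little")
            idx += 1


write_blocks("demo.blocks", b"some table data" * 10_000)
print(all(ok for _, ok in scan_blocks("demo.blocks")))
```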
But it's something that we actually experienced. There was this one guy with, like, a 32-bit Brazilian laptop with Windows that just had broken RAM some years ago. And he would keep sending us these bug reports that made no sense. And it takes a while until you tell somebody, look, your RAM is broken, throw your computer away. You know, it's difficult to tell somebody that you think their machine has issues. So this is when we decided, no, no, no, we're gonna defend against this kind of thing. You know? But this is indeed a bit of a far-out discussion already.
[00:12:58] Tobias Macey:
That's dedication. Yeah. The specific database that I was dealing with was actually a Zope database, which is just a sequence of pickle files that have some vague correlation between the individual records, but very opaque or completely nonexistent documentation about how to repair corruption and that sort of thing.
[00:13:18] Hannes Mühleisen:
From pickles. That's not pretty. No, I agree. I mean, for SQLite, I've actually written a third-party reader for the format that they use, because the way I understand things is to write a parser for them. So I've written this SQLite parser, and the format is actually something you can totally parse. It's, like, 2-300 lines of C code to parse it. It's not crazy. One of the reasons why SQLite's own code for this is so much bigger is because there's a lot of error checking in it, for airplane purposes.
[00:13:47] Tobias Macey:
And so given this similarity in terms of the way that SQLite and DuckDB are used, being in process and potentially being very ubiquitous, it can be easy to maybe confuse the use cases between them. And so I'm wondering what you see as some useful heuristics for deciding when to use which, and maybe some of the ways that they're able to interoperate, either in terms of just application design or being able to convert or import back and forth between them?
[00:14:20] Hannes Mühleisen:
Yeah. So it comes back to this OLTP versus OLAP discussion. If you have transactional use cases, SQLite is the obvious choice. And what is a transactional use case? It is when your queries and your changes to the data are focused on a few rows at a time. Right? So you're updating a single row. You're tracking some sort of order or user data or whatever. The queries are usually something like UPDATE users SET email = ... WHERE id = 42. Right? This is a very typical OLTP query. We call this a point query because it points to a single row. And for this kind of stuff, SQLite is really excellent. On the other hand, if you have OLAP-style queries that look like SELECT group_name, avg(price) FROM a table with a billion records, then you have more of an OLAP-style use case. And for that, DuckDB would be the far superior choice because of its internals.
It's more of a columnar design. We can talk about that later, I guess. But it's sometimes funny because, like you said, we've copied a lot of good ideas from SQLite in terms of packaging, in terms of how it behaves. You know, it's just a library. You do pip install, and you have it. But, okay, a lot of stuff is pip installed these days. In our case, it really is just a module that is linked into the process and loaded. And what is very different, and I think SQLite users realize this, but for OLAP it's pretty revolutionary, is that the OLAP system is suddenly wherever you want it to be. Because, traditionally, you know, these things are kind of bound to the machines they live on. You know, you have your Teradata somewhere in your company, and that's where the data lives in your data warehouse. Or you have your Snowflake instance these days somewhere, and that's where your data lives. But you wouldn't ever dream of saying, hey, I have this software that I wanna, you know, install on somebody's computer, and I kind of need to do some data wrangling, and it's actually quite serious, so let me just drop a DuckDB instance there. Because you would just not be able to install a Teradata on every single one of your customers' computers. You would not be able to install a Spark, or you would be able to, but you would not want to install a Spark on every single one of the, you know, phones that run your app. It's not possible. But with DuckDB, you can simply pull pretty serious data crunching capabilities in, you know, wherever you want them to be. I think that's the kind of fundamental architectural difference this enables: you can pull this query processing capability wherever you want. But, yeah, to come back to your question, the heuristic is gonna be what your queries look like and how you interact with this data. And you also had this question about how you could possibly combine the two, and I can happily report that we thought of this. We're gonna soon release something we call the SQLite scanner, which is gonna be a functionality so DuckDB can directly read SQLite databases as if they were local tables. And the cool thing is we can do it faster than SQLite itself, because we parallelize the read. So not only can we read this, we can also do it faster, which is kind of funny.
Yeah. The other way around, if you wanted to read a DuckDB file from SQLite, I don't think that is super easy at the moment.
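To make the distinction concrete, here is a small Python sketch of the two query shapes described above, plus reading a SQLite file from DuckDB through the SQLite scanner extension. The file names and columns are made up, and the extension calls reflect how the scanner later shipped, so treat the exact names as assumptions.

```python
import duckdb

con = duckdb.connect()

# OLAP-shaped query: a big aggregation over many rows, DuckDB's sweet spot.
# 'events.parquet' and its columns are made-up names for illustration.
con.execute("""
    SELECT group_name, avg(price) AS avg_price
    FROM 'events.parquet'
    GROUP BY group_name
""").fetchall()

# The OLTP-shaped "point query" described above is what SQLite is built for:
#   UPDATE users SET email = 'new@example.com' WHERE id = 42;

# Reading an existing SQLite file from DuckDB via the SQLite scanner extension
# (described in the episode as about to be released; these calls reflect how
# the extension later shipped).
con.execute("INSTALL sqlite_scanner")
con.execute("LOAD sqlite_scanner")
print(con.execute("SELECT count(*) FROM sqlite_scan('app.db', 'users')").fetchall())
```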
[00:17:36] Tobias Macey:
Well, I mean, it's kind of in keeping with the general flow of data in analytical systems anyway, where you have your transactional application database and data moves from there into your columnar analytical warehouse and then from there to some other downstream system, but usually not in the reverse.
[00:17:54] Hannes Mühleisen:
Yeah. This is true. And this is also, I think, why this is super interesting. This even works while you run other workloads on your SQLite. Right? It doesn't really interfere. We're also working on some similar functionality for Postgres, where DuckDB can directly read Postgres databases, because we also think that, you know, too often big data is just, you know, the same fairly small data copied a million times, because people have no other choice than to replicate everything from, I don't know, the Postgres where they keep their operational stuff to whatever system they feel they need to use in order to do analytical queries. Which, you know, it's unfair to benchmark something like Postgres on TPC-H. It's just not built for that. Right? Yet people still do it, of course. I love it specifically when academics, you know, in my field propose some new OLAP architecture, and they just, in the paper, have a benchmark against Postgres, and you just think, like, come on, people. This is just mean. You know?
You don't do that. But still, I think the combo is more interesting, in my opinion, than the HTAP world. You know, you also have these HTAP systems that attempt to fulfill both use cases, to be both a crack transactional system as well as being good at OLAP. I am not entirely convinced this is, you know, the way forward either, because you're gonna be forced to compromise somewhere. So I think having a bit of a closer bond between two systems that each focus on individual aspects of this, for example, something like Postgres or SQLite paired with something like DuckDB or something else, is probably a good idea.
[00:19:34] Tobias Macey:
In the other direction, being at the point where you're trying to decide, do I want to use DuckDB versus, you know, some beefy data warehouse, whether it's a, you know, cloud provider or something like a Vertica or a Teradata that you're running on premise? What are the deciding factors of when you might want to use one versus the other, or possibly use them in conjunction?
[00:19:54] Hannes Mühleisen:
That's an interesting question with the conjunction. I'll come back to that. I think the traditional data warehouse is your obvious choice if you have, you know, thousands of analysts hitting the same data all the time, because DuckDB is in process. Right? So we don't do client-server, which means you would have to replicate your data to all your analysts or have them all work on the same cloud box with all the data in order to, you know, make use of DuckDB properly. I think for these use cases, you would be better off using something like Teradata or Snowflake.
What I think is a bit underrated, though, is this capability of everybody being able to run their own thing. And I think as an analyst, you will not be allowed to just wildly create tables in your Teradata instance that, you know, are just small variations of the giant fact table that sits there. Whereas this would be perfectly fine if you just have your own DuckDB instance and you can do whatever you want on your local machine with it. What we see as a use case sometimes is that people use them in conjunction, which is quite clever.
I think where they say, okay, we're gonna make an extract of our data warehouse for one particular analysis question. We're gonna write that to a Parquet file, for example, and then we're gonna give this to the analyst. And the analyst, or the data scientist, whatever you wanna call them, can then, on their own, on their local machine, without interfering with anyone, without, you know, producing extra costs on, what, Amazon, whatever, play with this data as much as they want using standard SQL. You know, you don't have to learn strange Python package syntax to do data analysis. You can just use normal SQL queries, and we have actually quite powerful SQL support. So this is really the same kind of SQL that you would run on something like Teradata.
And then they can play and explore and really get a faster turnaround time in their analysis. And I think that's a really promising sort of concept. And because DuckDB natively reads Parquet files, you already have a data format to do so. But in general, I would say if you have a thousand people that are supposed to do analysis on a dataset that is central and doesn't change, you probably don't wanna use DuckDB for that.
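A minimal sketch of the analyst-side half of that workflow, assuming the warehouse team has already written an extract to a Parquet file; the file name and columns are hypothetical.

```python
import duckdb

# The analyst explores the warehouse extract locally with plain SQL.
# 'extract.parquet' and its columns are hypothetical example names.
con = duckdb.connect()  # in-memory database; nothing to set up or administer

result = con.execute("""
    SELECT region,
           date_trunc('month', order_date) AS month,
           sum(revenue) AS revenue
    FROM 'extract.parquet'
    WHERE order_date >= DATE '2021-01-01'
    GROUP BY region, month
    ORDER BY region, month
""").fetchdf()  # comes back as a pandas DataFrame for further local work

print(result.head())
```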
[00:22:08] Tobias Macey:
One of the ways that DuckDB is strongest is obviously for this in-process use case where you have a language environment, you have data that you're working with, and you want to be able to take advantage of the acceleration and performance benefits of DuckDB. One of the ways I've seen that is in the Python ecosystem, where maybe you have a pandas data frame, and then you read that into a DuckDB file, you can use SQL to perform some analysis on that, and then you can transform that back into a pandas data frame. And I know that there have also been some optimizations to be able to use the Arrow format to make that a, you know, seamless or completely cost-free process. I'm wondering if you can just talk to some of the benefits that DuckDB provides over and above some of the capabilities that are present in these different language ecosystems, or some of the ways that you might want to use DuckDB to be agnostic to the language entirely?
[00:23:01] Hannes Mühleisen:
We have, indeed, as you mentioned, we have integrations with many, many languages. I've kind of lost count at this point. But there are two languages, I would say, that are special, and those are Python and R. And that's because I think a lot of our data analysis, data science workflows start in Python or R and end in Python or R. So this is why we spend some extra time with the integrations there. And, yeah, one of the things that is super powerful with DuckDB if you're running in process is that we can read the memory of the process we are running in. It's a very strange concept.
But, yeah, it's just, you know, within the same address space, we can just look at things. And the internal layout of things like pandas or data frames in R is not really a secret. So you can just look at these things. Right? And that's also something we've done in DuckDB. So, indeed, as you said, you can import a data frame directly into a database file in DuckDB. But you actually don't need to, because we can run SQL queries directly on pandas and R data frames by simply saying, hey, query, instead of reading from a file, please scan (we call this a scan operator) this data frame. This is even zero-copy for some data types, so we don't even copy anything. And because DuckDB is a streaming engine, it does not actually materialize your data twice in memory just to query it. If you're just running a simple aggregation on top of 10 billion rows of a data frame, you will use constant memory for that, which is pretty cool. We also have this on the other side, where in Python, if you have a query result from DuckDB, let's say you run a query from whichever source, you have a result that we can efficiently transform back into data frames in Python and R, and Julia as of this week, I believe. And this is also super powerful because, again, we are in the same process, which means in order to create a data frame, we can already set up all these in-memory structures, fill them with stuff, and then just hand them over and say, hey, pandas, here's a data frame that exists in memory, please use it. And that totally works. Because it's so fast, because it doesn't have to go through a socket or something like that, it allows you to quickly iterate over queries. You know, you can do some fixed-point iteration if you wanted to and just read a table from the database, iterate, write it back, and just iterate. And that's not a problem at all.
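A minimal sketch of that round trip in Python: a pandas data frame is queried in place by name (DuckDB's replacement scan picks up the local variable) and the result comes back as a new data frame. The data here is made up.

```python
import duckdb
import pandas as pd

# A pandas DataFrame living in this process...
df = pd.DataFrame({
    "grp": ["a", "b", "a", "c"] * 250_000,
    "price": range(1_000_000),
})

# ...can be queried by name: DuckDB finds the local variable 'df' and scans it
# in place instead of importing it into a database first.
agg = duckdb.query("""
    SELECT grp, avg(price) AS avg_price, count(*) AS n
    FROM df
    GROUP BY grp
""").df()  # and the result comes back as a new pandas DataFrame

print(agg)
```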
With Arrow, yes, indeed, of course, we want to be good citizens in the ecosystem, and we've actually been working with Voltron Data on improving the integration between DuckDB and Arrow, and the same applies there. You can run SQL queries directly on Arrow structures, specifically on tables and record batches. We can also, again, export query results directly as Arrow batches, basically. So, yeah, you can also, as I said, do this iteration productively on Arrow, and it just goes in an interesting circle. There was one interesting bug that we had there a couple of weeks ago that I'd like to talk about, because it's not something that you think about when you invent this kind of thing. So we have zero-copy, yes, which means we can use memory that we, you know, see on the input of a query, like in the FROM part.
We can use, for example, an Arrow structure and directly put it into our engine without copying it. Now that's very powerful because, as I said, no copy, better performance. And this can go all the way through the query engine without being copied, because for some queries, we just keep referencing the same memory pointer. You know, imagine we only do a projection on something; the column data is not being touched. And now this goes all the way through the engine. It comes back into the result. We do another zero-copy step where, this time, we make a new Arrow structure that references DuckDB memory. Okay? But we forgot to think about the fact that this memory that DuckDB thinks is its own is actually not its own memory, but something we borrowed from Arrow all the way back there. And now this can go in circles. So you end up having this chain of references to memory that is kind of arbitrarily long, because you keep referencing the same memory. And now the question is, who calls free? And remember, you can only call free once, because otherwise you crash. So, first of all, how do you keep it from being freed too early? Okay, that's usually not so crazy. And second of all, how do you free it up at the end? Because in order to free this correctly, you have to circle back all the way through the multiple, you know, recursive layers of the DuckDB-Arrow-DuckDB-Arrow interactions.
And, yeah, this was kind of funny, because it took our people and their people some time to figure this one out. But now it works. You just don't imagine this kind of problem, because, you know, traditionally this problem didn't exist: zero-copy between the database and your application? What? You know? But, also, something I wanna mention there is that, yes, we have these special scanners for data frames and for Parquet files and for Arrow and for CSV files and for a bunch of things. For SQLite, I mentioned this. But you can also plug in your own. In DuckDB, we are kind of data communists in that sense; we treat all data sources equally. And our own table scan operator is actually just yet another one in this sort of zoo of table scans that we support.
And you can bring your own as well. So these are all just functions, and you can have a user defined function. There's a table producing user defined function that reads whatever data you have in whichever format it may be and makes it look like a table, and then you can run whichever SQL query you want on it. And that's actually something that we've been doing quite a lot in our spin off company to work with people to build scanners for data that they have. For example, they might have a custom application that stores a lot of data internally. They wanna add SQL support to that. Fine. We will be able to build 1 of the scan functions for them, and they have the exact same capabilities as their own scan functions that they can do. Projection push down, they can do filter push down, they can do all these things, they can do parallel reads, And this is, I think, quite a unique capability in a data management system that you have this sort of arbitrary sources of where the table data can come from. So today, for example, I was running some benchmarks and aggregates because I was working on that some time ago, and I want to finally write a blog post up on it. And then, yeah, in this case, we are just running directly from pandas, and it's
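Here is a small sketch of the Arrow side of this in Python. The full bring-your-own-scanner extension point described above lives in the C++ API; this example only uses the built-in Arrow support by registering an in-memory pyarrow table, with made-up data.

```python
import duckdb
import pyarrow as pa

# An Arrow table in memory can be scanned directly, same as a data frame.
tbl = pa.table({"sensor": ["t1", "t2", "t1"], "reading": [0.5, 0.7, 0.9]})

con = duckdb.connect()
con.register("readings", tbl)  # expose the in-memory source to SQL as a view

# Query it with SQL and pull the result back out as Arrow, closing the circle
# described above (Arrow in, DuckDB engine, Arrow out).
out = con.execute("""
    SELECT sensor, avg(reading) AS avg_reading
    FROM readings
    GROUP BY sensor
""").fetch_arrow_table()

print(out)
```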
[00:29:18] Tobias Macey:
That's definitely pretty amazing. And if I were more of a masochist and actually had free time, I would be interested in writing one of these plugins to be able to use DuckDB to scan a ZODB and run analysis across that.
[00:29:38] Hannes Mühleisen:
Yeah. Yeah. Exactly. I mean, that would be exactly one of these examples: yeah, if you have terabytes of ZODB lying around, you could write your own scan function and just treat them like SQL tables.
[00:29:48] Tobias Macey:
Yeah. In this case, it's about a 65 gigabyte Zope database file, which is enough of a pain to deal with on its own.
[00:29:58] Hannes Mühleisen:
That sounds horrible. I'm sorry.
[00:30:01] Tobias Macey:
Thankfully, we're in the finishing stages of making sure that nobody ever has to think about that again.
[00:30:06] Hannes Mühleisen:
What is it used for, if I might ask?
[00:30:10] Tobias Macey:
So this is actually for the OpenCourseWare project from MIT, because that's where I work for my day job, and we have been using Plone as the content management system for that for a long time. And so it's 20 years worth of content data for publishing the OpenCourseWare site.
[00:30:26] Hannes Mühleisen:
Oh, yeah. That's not something you wanna lose either, is it? Yeah.
[00:30:31] Tobias Macey:
Wow. And we had a SAN failure where some of the drives failed and other ones got corrupted. And so, you know, the middle portion of the 65 gigabyte pickle object was corrupted, and we had to spend about 3 months and many man hours figuring out how to reclaim it.
[00:30:47] Hannes Mühleisen:
Couldn't you brute force it? Like, flip the bits until it parses again or something?
[00:30:52] Tobias Macey:
At this point, I have forgotten all of the details, but, eventually, we were able to reconstitute the database, and all was saved.
[00:31:01] Hannes Mühleisen:
Yeah. This is actually quite fascinating, because this is a bit of a weird problem, I admit, that you just described, you know, having all these pickles. But the big data, of course, these days lives in Parquet files. Right? And Parquet files use something called Thrift to encode the metadata inside the Parquet file. And Thrift is also a very, very unforgiving encoding. So if a single bit in this Thrift encoding goes, you know, good luck getting anything out of that Parquet file, because Thrift will say no. Without the Thrift metadata, you have absolutely no chance of knowing anything about your Parquet file. Of course, you can guess, and sometimes you can reconstruct things, but it's super fascinating, because I wrote the DuckDB Parquet reader myself. I was so annoyed about this. I also wrote a Thrift reader.
It is really fascinating how people prioritized the design of these formats; the same also holds for Protobuf, for example. They really prioritized saving every little last bit, and they didn't think about the fact that maybe some day, somebody needs to recover this in a hex editor, and maybe we shouldn't save every single bit. Maybe we should throw in, like, a marker here and there, you know? Just put an end marker or something. Or, for example, in Thrift, they conflate, what is it, the field ID and the type into one byte, because, obviously, we can't have two bytes, we have to make it work in one byte. And, okay, correct, yes, the field ID is sometimes short and the type ID is always short, so, hey, let's put them in one byte. But what this means is that you just get this crazy opaque bit stream that you have really little chance of recovering. I think there's actually a startup idea there. This one is for free: Parquet, you know, forensics.
This is really, I mean, I have some experience, but I think you can ask arbitrary amounts of money for these kinds of services.
[00:32:48] Tobias Macey:
That's hilarious. Yeah. It also puts me in mind of early in my career, working at a startup that was dealing with some IoT equipment that had its own custom hardware and custom packet format. And the whole hardware and packet format was designed and built by a vendor, and we were trying to build the next generation of hardware, remember what the actual packet structure was, and start to pick apart the individual bit fields of this packet to be able to understand what the messages were supposed to be so we could throw them into a database.
[00:33:24] Hannes Mühleisen:
That's juicy. Yeah.
[00:33:27] Tobias Macey:
An interesting time.
[00:33:29] Hannes Mühleisen:
Yeah. Yeah. That is the thing with formats and encodings. I do love it, though, I have to admit. It's like digging around in the bits until you figure it out.
[00:33:44] Tobias Macey:
On that note, I'm wondering if you can describe a bit about how DuckDB itself is implemented, some of the architectural details, and some of the ways that the design and goals of the project have changed or evolved from when you first started working on it?
[00:33:53] Hannes Mühleisen:
Yeah. DuckDB uses something called a vectorized query processor. That is one of the leading sort of designs right now for OLAP engines. There are two camps: there's the just-in-time compilation camp, and then there's the vectorized query processing camp. Briefly, what the JIT people do is they take the query, they throw it into LLVM, out comes a binary, and they run the binary. What we do in DuckDB is we have an interpreter and not a compiler, which means that we just generate a query plan, which is then interpreted. But the query plan is not interpreted for every single row, as it is in things like Postgres or SQLite. It is interpreted for what we call vectors of data, which are bigger chunks of data. For example, right now, the default is 1024 rows per chunk.
And what that does is it amortizes the interpretation overhead against the execution time of every query operator. And what that leads to is that you decrease the cycles per row processed, which is a metric that we use a lot, and that leads to good OLAP performance. You have to imagine it as a sort of stream of these vectorized operators. So if you, for example, were to write a query that says select some columns from a table where some condition holds, you would get these vectorized operators. The first, for example, is a table scan; we talked about how table scans can be anything, but in this case, let's assume it runs on our own table format. Then you can have a filter operator: an operator that gets one chunk of the table at a time, runs its filter, and then moves the resulting vectors of data, minus what was filtered out, upstream to the next operator, which might then be a projection, which then, you know, throws away some of the columns that are not needed anymore. This is a simplification.
In reality, things are more complex. For example, the filter will be pushed down into storage so we can skip data that contains things we don't want, but that's in principle how it works. We have operators that operate on vectors, and that complicates things a bit sometimes. While this is pretty trivial for things like filters, you know, whether you're running the filter on a single row or a thousand rows really doesn't change a lot for the implementation. But if you think about it, if you wanted, for example, to design an aggregate on a vectorized system, how would you actually do that? And what DuckDB does there is it has this concept of pipelines.
It takes a SQL query and breaks it up into pipelines. And pipelines are things that can be run as a sort of stream of data without having to materialize any intermediate step. And, obviously, there are things like pipeline breakers, as we call them, things like aggregations or joins, where you have to wait for one side to be fully materialized in the pipeline breaker before you can start streaming the other side. And this all happens automatically, where we basically take your query, separate it into pipelines, and then the execution engine will start pushing data through it. The individual operators, yeah, they have to deal with the fact that they see data a vector at a time, and they also have to avoid the temptation of simply running a scalar, row-wise interpretation over the vectors. Everything we do has to be vectorized, and that's sometimes a bit complicated from an implementation perspective.
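As a toy illustration of the idea (not DuckDB's actual C++ engine), here is a vectorized pipeline in Python: operators consume and produce chunks of 1024 rows, so the interpretation overhead is paid once per chunk instead of once per row, and the aggregate acts as the pipeline breaker.

```python
# Toy sketch of vectorized interpretation: operators work on chunks of
# (here) 1024 rows at a time. This is only the concept, not DuckDB's engine.
import numpy as np

VECTOR_SIZE = 1024


def table_scan(column: np.ndarray):
    """Source operator: emit the column one vector at a time."""
    for i in range(0, len(column), VECTOR_SIZE):
        yield column[i:i + VECTOR_SIZE]


def filter_op(chunks, predicate):
    """Streaming operator: keep only the qualifying rows of each vector."""
    for chunk in chunks:
        yield chunk[predicate(chunk)]


def sum_aggregate(chunks) -> float:
    """Pipeline breaker: must consume every chunk before producing a result."""
    total = 0.0
    for chunk in chunks:
        total += chunk.sum()
    return total


prices = np.random.rand(1_000_000)
pipeline = filter_op(table_scan(prices), lambda v: v > 0.5)
print(sum_aggregate(pipeline))
```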
But yeah. So this is the basic design. This is also what makes it fast. This concept of vectorized query processing was incidentally also something that was developed at CWI by our colleagues, former colleagues, some still there. But this is what we use in DuckDB. And the reason we use it, actually, is because DuckDB is supposed to be lightweight. We are, like, a 10 megabyte binary or something like that. And you cannot do that, in my opinion, with a JIT engine, because in order to get all the compiler infrastructure in there, you're already at, like, 200 megabytes plus or something like that. So in order to keep the footprint small, we had to go for the vectorized one. But we also think in general that it is a more approachable concept, because when you start to auto-generate executable code, you also complicate things like debugging a lot. And that's also a factor, of course: we want to be an approachable project where it's easy to contribute.
And that means that we also try to keep our engine understandable, which is something we spend a lot of time on. So the goals, I don't think the goals have changed a lot. You know, we started out revolutionizing data management; we'll see where we get. The basic design of the vectorized engine has also not changed, but what we constantly do is rethink pretty significant chunks of the internals. So my cofounder and long-time collaborator, Mark Raasveldt, has, for example, rewritten the execution engine 3 or 4 times at this point, because we've written one, and then we realized, oh, this is not gonna work because we need parallelism now. And then we realized, oh, but now we actually don't wanna use pull-based operators, we're gonna move to push-based operators, so it was rewritten again. And this can be done transparently from the outside, because the cool thing, of course, about having a SQL engine is that the user gives us a query. The user doesn't give us an execution plan. So we can do with that query whatever we want.
And one way we keep ourselves sane is by having a monstrous testing suite. So we have millions of queries in our CI, and that is how we can confidently innovate on individual components inside the engine, where the design of individual operators constantly changes. Right? For example, one of our colleagues has recently rewritten the sorting operator to be able to go out of core, you know, for when you're sorting more data than fits in memory. How do you deal with that? So he just ripped out the entire sorting operator, made something new, and that can get done completely transparently from the outside, because at the end, it's a SQL query. The result is correct. CI passes. Everyone's happy.
This was something that we realized was actually crucial to also enable contributions, you know, because if the CI passes, we know that it can't be that bad. Right? So when we're reviewing pull requests and we see that the tests are green, we can be quite confident that it doesn't produce anything horrible, which allows us to be a bit more open to pull requests than maybe other projects. Yeah. But I think it's also something that we're gonna keep iterating on, yeah, operators and details of the internals. But it's really a relief that our outside API is mostly fixed, you know: SQL query in, table out.
That takes a lot of the design space out of the equation and allows you to focus on, you know, the details, which, if you're writing something from scratch with a new API, you constantly have to deal with. Also, the changes from the outside, the, you know, demands from customers: SQL engines are pretty straightforward in that sense. You don't have to innovate on that level so much. We recently merged a groundbreaking feature where you can have commas at the end of lists of expressions in SQL, because people complained.
[00:40:41] Tobias Macey:
Yeah. I was gonna ask you a little bit about the specific SQL dialect and some of the ways that you've thought about how you interpret the SQL specification and any extensions that you want to provide given the fact that you are looking to be embedded in other language run times and just some of the capabilities that you want to support as far as any sort of customized functions versus pushing that into the language runtime and being able to handle complex data types and things like that?
[00:41:09] Hannes Mühleisen:
We started with a fork of the Postgres parser for DuckDB. So we took something called libpg_query, which is somebody taking the Postgres parser and making it a separate project. We took that thing and heavily modified it again. But the lineage, if you want, is the Postgres parser. I've forgotten which version it was, but not that old. Which means that we have, like, exact compatibility with that parser, which is pretty great, because lots of scripts generate Postgres-compatible SQL queries. That is, like, the one side: the dialect is Postgres. Right? Although, since we've, you know, modified the parser anyway and refactored it and made it compile in C++ and use exceptions and all these things, we've also added things to it. So, for example, you know, I mentioned this joke feature of allowing commas at the end of expression lists; that is something that we can do because, you know, we've been through this parser so many times that we know we can do it. The language interface with functions, indeed, that is also something that is extremely fascinating in in-process database engines, because a user-defined function, traditionally, is something that you define in some horrid PL/SQL dialect or something like that. Or maybe, if you're lucky, the thing will let you write JavaScript or something like that.
But in our case, for example, in the C/C++ API, the UDF is just a function pointer. Right? Because we're in the same memory space, you can give us a function pointer, you use that name in your query, and it will call your function with the data. Very fascinating concept. And, basically, it will run, you know, arbitrary code, whatever you want. I would say this is still something that we're working on a lot, so I expect to see some more capabilities in the future. But we're working on mapping that to the target languages. I think for Python, we have some initial UDF support for map queries, where, you know, it's like, here's a table, dot map, now apply this function to every row, and that will create one of these user-defined functions from whatever Python function you give it. We are also working on something similar for R, where we are working on scalar functions.
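A hedged sketch of what that chunk-wise Python UDF support looks like, using the relational map call mentioned above. The exact API has shifted across DuckDB releases, so treat the signature as illustrative rather than definitive.

```python
import duckdb
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]})


def add_tax(chunk: pd.DataFrame) -> pd.DataFrame:
    # Arbitrary Python code runs here, inside the same process as the engine;
    # the relation is passed in and returned as pandas DataFrame chunks.
    chunk["value_with_tax"] = chunk["value"] * 1.21
    return chunk


# map() applies the Python function to the relation, chunk by chunk.
rel = duckdb.from_df(df).map(add_tax)
print(rel.df())
```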
We're also working on UDF support for scalar functions in the WebAssembly version of DuckDB and also in the Node version. By the way, the WebAssembly version is wild. I'm not sure, have you seen this?
[00:43:33] Tobias Macey:
I think I had noticed that there was a WebAssembly option for using DuckDB, but I didn't dig too much into it.
[00:43:36] Hannes Mühleisen:
Yeah. This is something that was super surprising to us, because it's something that somebody else, in this case André Kohn from Munich, came up with. Because DuckDB is just written in C++ and has no external dependencies, you can totally throw it into this Emscripten thing and run it in the browser. And that is also one of these things that, when I first saw it, was really unbelievable, because I thought, no way you've just taken our, what, hundreds of thousands of lines of C++ code and made it one of these browser things. But this he did, and we actually have a demo. I think it's shell.duckdb.org, where you can kind of just play with this in the browser, and it's the entire thing, you know, running in the browser, including, like, Parquet support and the whole thing. And you think, why would somebody wanna do that? But then I saw a demo that completely blew my mind, where, you know, the usual lag was just gone. You know, you had, like, a dashboard, and you clicked around. And, you know, your brain expects that there's a lag, because it has to go back to the server and do some callback thing and then return the result and yada yada yada.
But that was just not there. It was just click, and before your finger came up again, it was already re-rendered. It was totally mind-blowing. So this is something where I hope we're gonna see a bit more usage of this WebAssembly version, because it's literally mind-blowing.
[00:44:55] Tobias Macey:
That's definitely pretty wild. And on that note, because of the fact that it is in process, because it does unlock these additional performance capabilities and the SQL dialect for data analysis in all these different contexts, I'm wondering what you see as some of the common workflows or applications of DuckDB in analytical projects, or its use in data pipelines that data engineers might build, and just some of the types of libraries or tools or systems that it might replace in the context of these workflows or projects or team environments?
[00:45:31] Hannes Mühleisen:
Yeah. What's the workflow of incorporating DuckDB? I mean, I've briefly said it. You can do pip install duckdb, and you have it. I think very often, it is literally two lines of code, and you're up and running SQL queries on your data frames or whatever you have. And this is also, I think, powered by the fact that we do read a lot of stuff natively already, and we try, for every target language, to be able to read the most common table formats that exist in that target environment. So if you're, like, in Python or in R, you just include the package, which is a prebuilt version of DuckDB that contains all of it, from SQL down to storage, and start using it, for example, to run SQL queries directly on your data frames. Although we also have a way of building queries programmatically; we call this the relational API, where you can do something like dot filter, dot project, dot aggregate, and so on and so forth. Right? If you're programming, this is nicer than concatenating strings to make SQL queries, in my opinion.
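A small sketch of that relational API in Python, chaining filter, project, and aggregate instead of building a SQL string; the data frame and column names are made up.

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"grp": ["a", "a", "b"], "price": [1.0, 2.0, 10.0]})

# Build the query by chaining relational operators instead of pasting strings
# together, using the .filter/.project/.aggregate methods mentioned above.
rel = (
    duckdb.from_df(df)
          .filter("price > 1.5")
          .project("grp, price")
          .aggregate("grp, avg(price) AS avg_price", "grp")
)
print(rel.df())

# Roughly equivalent SQL, for comparison:
# SELECT grp, avg(price) AS avg_price FROM df WHERE price > 1.5 GROUP BY grp
```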
But we also see people building bigger things around DuckDB that use parts of it. Right? Because nobody forces you to, for example, use DuckDB's front end. You can also directly construct DuckDB query plans and execute them. Some people do that. Other people only use our front end and have us take the query in SQL, transform that into a query plan, and then execute that themselves. There's people that only use the Parquet reader. There's people that only use the parser. There's people that only use the storage. It's kind of an interesting thing: yeah, there's this standard use case of saying, here, just take the prebuilt package as it comes and use it basically as packaged.
But there are also a lot of use cases, for bigger deployments, where people just mix and match, you know, to say, okay, I'm actually gonna use Arrow to do this, and I'm gonna use DuckDB to do that, and then I'm gonna pipe this all to TensorFlow or, you know, something like that. It's pretty fascinating. So I think the easiest way to start is just with the Python package or the R package or any other of the packages. Put it in the project, start using the existing APIs. If you think that you need some deeper way of digging in, or you only need parts of the thing and you want to hook into one of the APIs directly, then make use of some of the data structures directly.
That is very possible. And if it's not possible, we wanna hear about it, because it's one of our stated design goals. And it's also something that we do work with people on. Right? For example, there is this company called Stoic that is building a data analysis product on top of DuckDB, also using the Wasm version, for example. We have been working a lot with them on, yeah, basically extending things that weren't there yet, or, you know, they realized they wanted to have a hook here, so we added a hook there. Something we do sometimes. And what other tools might we replace?
So I'm teaching some courses on big data every now and then, actually tomorrow again. And 1 thing I've always kind of found very strange is how hard it was to use these tools like Spark.
[00:48:41] Tobias Macey:
Have you ever tried to install Spark? I have not had to do that myself, but I've definitely heard accounts of some of the challenges that it poses, and actually had an interview with somebody whose whole business was around making Spark run in Kubernetes.
[00:48:58] Hannes Mühleisen:
Hooray. Yet another layer. But I don't know. I feel that we are using a lot of infrastructure. It's as if you drive a dump truck around as your daily driver. You know, that makes no sense. While Spark has its place, I don't doubt that, I also think it's complete overkill in, like, 99 percent of the use cases it's used for. We had this fun experiment where we ran some query on the New York taxi dataset, you know, the standard whatever million taxi rides, yada yada yada. And we ran it on a single node on Amazon, and then we ran the same queries with Spark on the same node type, but then we measured how many nodes we have to add in order to beat the performance of DuckDB with that. Any guesses?
[00:49:42] Tobias Macey:
So 1 node of DuckDB, I'm gonna say 32 nodes of Spark.
[00:49:47] Hannes Mühleisen:
Wow, did you look at my slides? I did not. That's a very good guess. I think it was 32, actually. And
[00:49:55] Tobias Macey:
That is a good guess.
[00:49:57] Hannes Mühleisen:
Yeah. I mean, I went up in powers of 2, so you probably guessed that one, but this was completely crazy. And so also, if we have a planet to protect, I mean, you cannot really justify using 32 boxes when you don't need to. Right? It's my hope that we will stop using these infrastructures that were built for something else, for another dimension of data problems, on really small scale stuff. Another example that is really annoying: you have, you know, company X installs a Spark cluster. Right? It has 100 machines. They have 100 employees. Now they have to start doing some sort of load sharing scheme on their Spark cluster, which basically means everybody gets 1 box on average. Why not just give every employee a virtual machine that runs DuckDB and save yourself a whole lot of hassle?
I think, I mean, it's my hope that DuckDB will replace a lot of these needlessly distributed infrastructures. I know this is not the most, let's say, hardware vendor friendly thing to say. It's also not the most cloud provider friendly thing to say, but I don't know. This would be one thing that I would hope it would replace. On the lower end, I also would like to see OLAP use cases. People use SQLite, in fact, also for OLAP use cases. I think it will eventually get to the point where that is something that happens in DuckDB, because it's just better suited for it. Fun fact, the creator of SQLite has been sending people away for a while now when somebody pesters him too much about OLAP performance. Apparently, he says, just go use DuckDB.
That's great. Thank you. So that's, I think, where I see this going from both sides. Like, we take the tasks that don't belong on Spark or other distributed systems, and we will also take the tasks that don't belong on transactional systems on a single node. And that'll just supercharge, basically, the capabilities of a single node, you know, because I think what people don't realize is how much you can actually do on a modern laptop. You know? Because when do people start using Spark? It's when pandas stops working for them, usually. Right? Yeah. But there is a huge amount of headroom between where pandas stops working, or where R stops working, and where you actually need to use, you know, something else. I would say that's at least 2 or 3 orders of magnitude on top of that, probably more.
And I think that's also where we wanna be. Absolutely.
[00:52:43] Tobias Macey:
In terms of the project itself, I know that it's open source and also that you recently formed DuckDB Labs as a corporate entity to help support its ongoing development and provide some additional capabilities around the project. So I'm wondering if you can talk to the governance model of the open source project, some of the long term sustainability goals, and some of the ways that you think about the boundaries between the open source project and the commercial entity that is helping to support it. That is an excellent question. It's something we spend quite a lot of time on as well.
[00:53:54] Hannes Mühleisen:
So there's the DuckDB project, the open source project, and there's DuckDB Labs, the company, as you said. We have similar logos, but different colors. Let me start with the project. The project itself is governed by the DuckDB Foundation, which is a nonprofit foundation here in the Netherlands that holds the intellectual property in the project, which is MIT licensed. It also has all the rights in it, and it is merely there to basically perpetuate DuckDB. The foundation collects donations from commercial users of DuckDB, and we're very happy that we've already managed to attract a couple of, you know, sponsors.
But with this money that we get from our sponsors, we basically help fund the development of the open source DuckDB, in the sense that, you know, there needs to be maintenance. Bugs need to be fixed. And there is also a global roadmap that we discuss with our sponsors. Basically, if you have become a donor to the DuckDB Foundation, you know, we discuss our roadmap for the open source DuckDB with you, and your input is heard. And the money that is given is also used to fund and execute this roadmap. So that's really purposefully disconnected from the company.
We have no way of interfering with this even if we wanted to. It's in the statutes of the foundation that the DuckDB project needs to stay open source, MIT licensed. You know? So we cannot just change that. And that was done, yeah, to protect the project from the company, from us, which is weird, but I feel it's necessary. So DuckDB Labs is a commercial company that provides, let's say, in-depth, you know, professional support for DuckDB itself. So if you're betting the farm on DuckDB, you might want to have a way of getting stuff fixed quickly. So that's kind of how that works. Right? And we have a couple of strategic partners for DuckDB Labs. Not very many, but people that are really invested in using DuckDB for their, you know, purposes, that basically contract with us, sometimes for custom development around DuckDB. Sometimes they want us to prioritize things.
Sometimes they want us to add new features. So far, this has been extremely successful in sort of marrying the goals of the open source project together with the things, you know, our customers want. So we haven't had to make evil trade offs yet. So I'm quite happy about that. These are separate entities, and they have separate goals. And I think it's important, yeah, to separate those because the company can go anywhere. You know, the company might do anything, but we would like the project to be independent of the fate of the company.
[00:56:51] Tobias Macey:
In terms of the applications of DuckDB, you've already shared some pretty interesting and wild use cases for it. I'm wondering what are some of the other interesting or innovative or unexpected ways that you've seen it applied?
[00:57:05] Hannes Mühleisen:
We've seen a lot of funky things. So there was somebody that built a big data cluster using iPhones, which was okay. Yeah. So, basically, they just glued together a bunch of iPhones that run DuckDB and used that to do large scale data processing. Why? I'm not entirely sure, but I thought it was very funny. There was one project that I was really impressed by, where somebody actually worked with us. He managed to port DuckDB to run on the Intel SGX trusted compute thing. So you could actually run SQL queries inside this hardware enclave thing. As I've mentioned, the Wasm module was also something where I'm just blown away by what people have done with DuckDB.
On the other edge of the scale, you know, we're talking with people that are building giant, you know, cloud infrastructure using our, you know, puny little project as their building block. And that's also sometimes very interesting to see. Yeah. We should probably be better at collecting these stories. But what really makes me happy is, remember I said that DuckDB was kind of born out of these discussions with data scientists not being able to do what they wanted to do. And what really makes me happy is when I see somebody on Twitter saying, wow, this was not possible before, but thanks to DuckDB, now it is. And this is usually some totally random thing like, oh, here's this giant archive file, or here is this, you know, some CSV thing, or just some capability they were lacking before that they now have.
That really makes me happy to see. I wanna say Lambdas. So people have, you know, Lambda functions, right, in cloud Lambdas. And it's another one of these things. If you realize that the SQL engine is no longer bound anywhere, because you can just spin it up wherever you want, you totally can run DuckDB in Lambdas, and people do. So you can say, hey, I have a Lambda that reacts to some Parquet file being uploaded to a folder, that then runs a query on this Parquet file and writes another Parquet file to this other location. You know, you can actually build an entire query tree out of that. And so we've seen people that have been building Lambda based query processing engines, which I think was also pretty cool.
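A rough sketch of that Lambda pattern under stated assumptions: the bucket names, object keys, and columns are hypothetical, and a real deployment would also need S3 credentials configured for DuckDB's httpfs extension.

```python
import duckdb

def handler(event, context):
    src = f"s3://raw-bucket/{event['key']}"   # hypothetical uploaded Parquet file
    dst = f"s3://agg-bucket/{event['key']}"   # hypothetical output location

    con = duckdb.connect()
    con.execute("INSTALL httpfs")             # enable reading/writing S3 paths
    con.execute("LOAD httpfs")
    # (credential and region settings omitted in this sketch)

    # Aggregate the uploaded file and write the result back as Parquet.
    con.execute(f"""
        COPY (
            SELECT some_key, count(*) AS n, sum(amount) AS total
            FROM read_parquet('{src}')
            GROUP BY some_key
        ) TO '{dst}' (FORMAT PARQUET)
    """)
    return {"written": dst}
```

Chaining several such functions, each triggered by the previous one's output file, is essentially the "query tree out of Lambdas" idea described above.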
It's just really impressive. GitHub shows you sometimes which projects depend on your project. Sometimes I look at that, also just really curious. And I also sometimes look at the people that star DuckDB, just to see which companies are using DuckDB, and, you know, you wanna know. And it's also really crazy, because I think at this point, we have seen everyone. It's really crazy. It's also a bit because it's so unobtrusive. You know? You can just start with a single person, even in a company, and then slowly, slowly grow.
You don't have to go golfing with the CEO yet. Absolutely.
[00:59:52] Tobias Macey:
In your own experience of working on the DuckDB project and working with the community and helping to run this business around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[01:00:05] Hannes Mühleisen:
I'm an academic, as you know, from my background. And, obviously, we are used to knowing everything. And I thought I had a pretty good grasp of SQL, let's say, before we started DuckDB. It turns out I was wrong. I think it's extremely humbling to see what you can actually do with SQL. That's one thing. That was challenging sometimes. For example, Mark spent an enormous amount of time on subquery folding, you know, where SQL subqueries are one of the most challenging things in the world. I have spent an enormous amount of time on window functions, which is something that most people have probably not heard of.
If, in your SQL query, you wanna have, for example, an aggregate value, but you don't wanna actually do an aggregation, you wanna have a scalar expression in terms of the aggregate. Okay, that sounds weird. It is weird. It's called a window function, and it's usually recognizable by this OVER clause: it's OVER and then something in brackets. Okay, that sounds pretty benign, doesn't it? But it turns out evaluating window functions is this endless well of complexity. Basically, it's not only a pipeline breaker, which means it needs to fully materialize the input and then do weird things to it, but it's also the only SQL operator that has a notion of locality, which means that you can look up and down in the table, and that breaks all your previous assumptions about how you can rethink the execution of a SQL operator, because the window operator is just different. You have to make special considerations for it. Making the thing fast is also really a nightmare because, again, there are so many ways of executing those window functions that the traditional SQL way of saying there's 3 ways of doing it, we'll make all those fast, and we're done, simply doesn't work.

So that was also something that I was really surprised by: the complexity of seemingly benign things in SQL, and also the technical complexity of making these things run and making them run fast, of course, which is our added problem. Right? We can't just pull a Postgres and implement it in whatever way; we have to actually think about how we can execute this quickly. And we have to make it work in a vectorized engine, which, as I mentioned before, sometimes throws additional architectural complexities at you. So that was really, I would say, humbling, how much we didn't know, and also to see where the real problems are. And I think that also is something that sets us apart: because we're an open source project, we get a lot of feedback, and we are super thankful, because we learn so much about how people actually use databases that you will not find in any textbook. You know, the Oracle people would have never heard of this, because the people that do that don't have 10,000,000 sitting around to pay them.
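A small, self-contained illustration of the window function pattern described above, an aggregate attached to every row via OVER without collapsing the rows; the table and its columns are made up for the example:

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE rides AS
    SELECT * FROM (VALUES (1, 6.0), (1, 8.0), (2, 20.0)) AS t(passengers, fare)
""")

# Each row keeps its own fare, but also carries the average fare of its
# partition and its deviation from that average.
print(con.execute("""
    SELECT passengers,
           fare,
           avg(fare) OVER (PARTITION BY passengers) AS group_avg_fare,
           fare - avg(fare) OVER (PARTITION BY passengers) AS diff_from_avg
    FROM rides
    ORDER BY passengers, fare
""").df())
```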
So that was also super interesting and also super humbling, to see how people use databases and then how things go wrong, you know, because they're holding it wrong, but obviously, they're not holding it wrong. Your assumption of how they should hold it is wrong. So you have to basically make the handle, like, pink so they can see it and hold it correctly. But this is really difficult if your interface is a SQL API. Right? I mean, how do you give people hints in SQL queries? Difficult. So this was really interesting to see. And then every day, you know, as I said, every day we see interesting things people do with it. And what I generally think was the most unexpected thing is that people cared about this thing in the first place. That's something I still can scarcely believe, you know, that we made this thing.
And, yes, as I mentioned, we did talk to some people, and we thought, okay, if they care about this, then probably other people care about this. But I am quite impressed by the adoption, and this carries a big responsibility. You know, whenever we make a release, we have to actually be sure. But that people cared really surprised me a lot. So I would like to say thank you.
[01:04:04] Tobias Macey:
And for people who are interested in the capabilities of DuckDB and want to have this easy on-ramp, quick access to performant SQL queries, what are the cases where it's the wrong choice, and maybe they are better suited with running a 32 node Spark cluster, or putting it into a Snowflake or a Teradata, or using some other approach to managing this data analysis and manipulation?
[01:04:31] Hannes Mühleisen:
Yeah. I think we've briefly had this. If you have many, many concurrent users on a dataset that is, you know, centrally managed, you probably want to use a traditional data warehouse. When would you go more the Spark sort of way? Yeah. You know? The big question, of course, is how big should the data be? If you have petabytes of data sitting around, I would say you probably don't wanna use DuckDB on that yet. But if you have the 32 node Spark cluster, I mean, actually, you should never use a 32 node Spark cluster. You should use a 1,000 node Spark cluster, or don't use one at all. You know? I think there's no good reason to have these mini, mini clusters around, because you're getting, like, all the pain of running clusters.
And, basically, as I explained with this example that we did, all the compute is basically eaten up by the coordination that is designed to deal with a 1,000 node cluster. Right?
[01:05:29] Tobias Macey:
So why anybody would not run a 1,000 node cluster is beyond me. Does it make sense to you? I mean, why would you have the small cluster? Yeah. That makes perfect sense. And, yeah, there's definitely a lot of just kind of cargo culting of, oh, we use Spark because everybody uses Spark, and that's the thing that you use Spark for.
[01:05:46] Hannes Mühleisen:
Yeah. And then, you know, they're in Python, and they make a computation with Spark, and then they say, but now I wanna use matplotlib. And it's like, yeah, no, forget about it. Because in order to pull this from Spark to your Python shell, Spark, in its infinite wisdom, serializes this data using the Hive protocol, which I've written a paper about. And the Hive protocol uses Thrift to encode column data, which is probably not a good idea. Which means that, to pull something from your 1,000 node Spark cluster that is not that crazy big, let's say 2 gigabytes or something like that, you pull that into your Python. Maybe you even wanna run machine learning on that. Who knows, these days. In order to do that, you're gonna spend the vast majority of the time that this query takes end to end, because, okay, the 1,000 nodes were done quickly running your extract or whatever it is on some Parquet files, but then pulling the result into your Python shell is gonna take, you know, hours.
But you could also say, hey, if I run this query in DuckDB, it's already in the right memory space, so I can just go whoop, and it's there. I think there was a tweet some time ago where somebody said, okay, time to produce the query result in Postgres: 0.1 milliseconds; time to construct the SQLAlchemy result objects: 2 seconds. You know, I think that's more true than people want to believe, how much performance we leave on the table. Absolutely.
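A sketch of the point above: because the engine runs in the same process, the query result materializes directly as a pandas DataFrame, with no wire protocol or driver object construction in between. The file name and columns are hypothetical.

```python
import duckdb

# Aggregate a (hypothetical) Parquet file and land the result straight
# in the Python process as a DataFrame.
df = duckdb.query("""
    SELECT pickup_zone, count(*) AS rides, avg(fare) AS avg_fare
    FROM 'trips.parquet'
    GROUP BY pickup_zone
""").df()

# df is an ordinary DataFrame, ready for matplotlib, scikit-learn, and friends.
print(df.head())
```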
[01:07:18] Tobias Macey:
I think we've touched on this a little bit, but in terms of the near to medium term future for DuckDB, what are some of the projects that you're focused on, or new capabilities that you're looking to add, or areas that you're excited to dig into? And given that it's open source, any areas of contribution that you're looking for help with? We do love contributions. We really do. And I think we've also tried to make a welcoming
[01:07:41] Hannes Mühleisen:
community on our Discord and our GitHub. I can maybe go through the strategic directions for the project itself so that people get an idea of what contributions we would value. Of course, we value all contributions. We have a lot of open tickets on GitHub, and some of them are tagged with good first issue. If people want to get started, that's a great place to start. But in terms of the more general directions that we wanna go into, I think the big one right now is the out-of-core capability for all operators. I think this is important.
We call this the Galaxy Quest principle, never give up, never surrender, because we wanna be able to complete queries if at all possible. Right? And that sometimes means you have to go out of memory, in the sense that I mentioned before: you have these pipelines, and sometimes you have pipeline-breaking operators that need to materialize some input. Like a sort needs to see all the input before it can sort. Okay? And these operators, they might need a lot of memory. If you build a join hash table on a table that has, you know, a terabyte, then this might take a terabyte of hash table. That's not great. You probably don't have a terabyte of memory. So we're focusing a lot on teaching every single one of our operators, and we have, like, 20 of them, how to gracefully start using disk space instead of memory if we run out of memory. And this is only so we can make sure we can always complete queries. So this is a big focus, and, yeah, we're working on that. We've been looking at a bunch of operators already. Much to be done. It's also really fun to think about how to do this in a dynamic way.
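A hedged sketch of nudging DuckDB toward that out-of-core behaviour from the user's side: cap its memory and point it at a spill directory. The table name is hypothetical, and the exact settings should be checked against the DuckDB documentation for your version.

```python
import duckdb

con = duckdb.connect("big.duckdb")
con.execute("PRAGMA memory_limit='2GB'")                  # cap working memory
con.execute("PRAGMA temp_directory='/tmp/duckdb_spill'")  # where to spill

# A sort is a pipeline breaker: it must see all its input before emitting
# anything. With the cap above, a large sort can use disk instead of
# failing outright when 2GB is not enough.
con.execute("""
    CREATE TABLE big_sorted AS
    SELECT * FROM big_table ORDER BY some_column
""")
```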
That's one thing. Another thing is the storage format. Remember we talked about the hashes of the blocks? Hashes of blocks, that sounds familiar, doesn't it? The hashes of blocks that protect against bit flips on disk or disk corruption. The storage format is something we want to finalize this year if possible, because it's kind of what's keeping us from releasing a 1.0: we don't want to make a backwards incompatible change to the on-disk storage format again after releasing the 1.0. That's our goal. We'll see whether we manage. And right now, there's just too much going on to guarantee that. So we wanna stabilize the storage format so we don't break backwards compatibility with existing database files anymore. Mind you, Postgres doesn't guarantee that. Right? This is something that normal database systems don't do. In Postgres, I think they guarantee compatibility within the same major version, but that's it. What we try to do is something more akin to SQLite, where they say anything that has been created with any SQLite version greater than 3.0 is still readable and writable by current SQLite versions.
I think that's also a bit far, but we would at least wanna have the backwards compatibility going. So that's a big topic. We're also actively working on compressed storage. So DuckDB will automatically analyze the tables on disk during checkpointing to see which compression method would be most applicable for each row group. Row groups are like subsets of tables that we store individually. We run this analyze phase for every column during checkpointing to see which compression method is most applicable and then apply that method. We have built a framework, we've built a bunch of compression methods, but we have to do more there. So we wanna finalize that, and we wanna add more compression methods. Also, maybe a good way of getting started: implement a compression method, because these are also user-pluggable. It's pretty cool. If you know better than us how to compress your data, sure, you know, write your own compressor. Why not? We are generally also working on performance engineering, of course.
While DuckDB is already pretty fast, there are always some edges where we need to improve. And for anyone interested in low-level C++ hacking to make things go fast, that is also something that we enjoy, and if you enjoy that kind of thing, maybe we should work together. These are, I think, some main directions. And then we also have the whole testing area. We wanna do more in terms of destructive testing as well, in the sense of, you know, defending against misbehaving disks, misbehaving memory, and then integrating all that into the testing framework, so that every time we commit, we will run all our tests also against some simulated misbehaving hardware. That might also be fun for people to look into. So these kinds of things. But I would say those are the main directions that we wanna go into.
That's where we wanna continue for the main project. And, obviously, there are some more sort of domain specific things. Some people are interested in working on geographical data type support for DuckDB. We haven't done anything there yet. Some people are working on making more graph-like use cases work on DuckDB. We'll see what comes out of there. But I'm really excited to see, like, a lot of things springing up left and right. And we also have worked really hard to make an extension API. So DuckDB has this plug-in support where you can change a lot of things about, you know, which types, which functions, and so on are available in the system, simply by loading plug-ins. It's very similar to SQLite, which is where we copied it from. And that, I think, would also allow for lots of these things to live on their own. You know? For example, imagine, like, DuckGIS or something like that, similar to PostGIS, which is just a plug-in that doesn't actually have to change anything in the base system. I think that would be really interesting.
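A brief sketch of that plug-in mechanism from the user's side, using the httpfs extension as the example; extension names and availability vary by DuckDB version, so treat this as illustrative, and the URL is a placeholder.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")   # download the extension
con.execute("LOAD httpfs")      # register its functions, types, and file systems

# With the extension loaded, remote files become queryable like local ones.
con.execute("SELECT count(*) FROM 'https://example.com/some.parquet'").fetchall()
```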
[01:13:16] Tobias Macey:
Are there any other aspects of the DuckDB project, or the applications of OLAP capabilities in process in language ecosystems, and the work that you're doing at DuckDB Labs, that we didn't discuss yet that you would like to cover before we close out the show? It's been a privilege to work with everyone
[01:13:34] Hannes Mühleisen:
on DuckDB because, yeah, it's a big project. Let's say, a big undertaking. And, yeah, I'm really, really thankful for everyone that has contributed and is contributing
[01:13:43] Tobias Macey:
because it would not be possible without them. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as my final question, I'd be interested to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. You know, it's very interesting because
[01:14:06] Hannes Mühleisen:
I think the design of DuckDB was kind of motivated by one of these gaps. Right? So, okay, there was no in-process OLAP system yet, so we made one. And there are many of these things. For example, we can throw out some startup ideas. Right? For example, I think strings in data management systems, like text data, are still one of these huge unsolved things where, yes, they work in principle, but it's just painful. It's bad performance, it's inefficient storage, the whole thing. You know, we have done this experiment where we took a benchmark like TPC-H, right, converted everything to strings, and ran it on different database systems.
You get your query result, you know, orders of magnitude later. That's not great. Somebody should work on that. I think, generally, the debuggability of data pipelines is pretty bad. Observing running query engines is something that hasn't mattered traditionally, and in DuckDB, too, it's not something that I'm particularly proud of. But seeing what's going on while it's going on, I think that's also something that's missing: so often we just hit a button and then we have to kind of wait and pray that it will finish eventually. We have no concept of how long it will take. Is there progress? Who knows? So I think there's a lot there. But I think, similar to the gap that we found, there are a lot of these blatantly obvious, let's say, tooling nightmares that are just waiting for somebody to tackle. I would say strings.
[01:15:41] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your adventures with DuckDB. It's definitely a very interesting project and, obviously, a very important and complex undertaking. So I appreciate the time that you and your colleagues have taken to bring this to the world and continue to help improve it and add new capabilities and functionality. So I appreciate you taking the time today, and I hope you enjoy the rest of your day. Thanks so much for having me. I'm really thankful. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Hannes Mühleisen: Introduction and Background
Overview of DuckDB and Its Unique Features
Naming and Inspiration Behind DuckDB
Technical Challenges and Resilience in DuckDB
Use Cases and Integration with Other Systems
Implementation and Architectural Details of DuckDB
Common Workflows and Applications of DuckDB
Governance and Sustainability of DuckDB
Interesting Use Cases and Applications of DuckDB
Lessons Learned and Challenges Faced
Future Directions and Contributions
Closing Remarks and Contact Information