Move Your Database To The Data And Speed Up Your Analytics With DuckDB

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Have you ever woken up to a crisis because a number on a dashboard is broken and no 1 knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means?

Our friends at Outland started out as a data team themselves and faced all this collaboration chaos.

They started building Outland as an internal tool for themselves.

Outland is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams.

By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create single source of truth for all of their data assets

and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.

Go to dataengineeringpodcast.com/outland

today. That's a t l a n, and sign up for a free trial.

If you're a data engineering podcast listener, you get credits worth

$3, 000 on an annual subscription.

When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode.

With our managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pacaderm, and Dagster.

With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.

Go to data engineering podcast.com/linode

today. That's l I n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macy. And today, I'm interviewing Hannes Muellerisen about DuckDb, an in process embedded database engine for columnar analytics. So, Hannes, can you start by introducing yourself?

Thanks for having me. I'm really happy to be here. My name is Hannes Mullerisen, indeed.

I am the cofounder and CEO of DuckDV Labs, a small artisanal data management systems company in Amsterdam, the Netherlands.

I am also a senior researcher at the database architectures group of the CWI.

CWI is the national research Center for Computer Science and Mathematics in the Netherlands. It's mostly famous for inventing Python or being the place where Python was invented.

Very different things. Yeah. So I spent, like, the last 10 years of my life working on data management systems

and recently added in the last 3 or 4

years on the system called DuckDB. And, yeah, really excited to tell you more about that. And do you remember how you first got involved in the area of data and what it is about that space that has drawn your attention?

Yeah. It's really fascinating because, you know, it cast your mind back to the year, I don't know,

1998 or something like that.

I was, you know, messing out with PHP

like many kids back then, and there was something called MySQL. I I learned that you had to talk to this MySQL in order to

just to do something with data, and it was really all very fussy. And

I wanted to write some sort of application, and then had to write SQL queries. And, obviously, I didn't know anything about PHP. I didn't know anything about SQL. So it was really a mess.

And, yeah, in fact, you, you know, you grow as a user of something

like MySQL. And then when I went to study computer science, did a bit of a detour in distributed

systems.

I really learned to appreciate the sort of the field of databases

as a very no bullshit zone, I like to call

it, because

you have very clear metrics of who's right and who's wrong. I mean, obviously, as now I'm, you know, in the field for longer, I've also realized that there's more to this than than I maybe initially thought. But I still appreciate this, you know, focus on metrics,

benchmarks,

on measurable effects

that is really, you know, prevalent in the community.

I feel it's makes it a very scientific

discipline of, practical computer science specifically.

After getting my PhD, I joined this database lab, the database architectures group here in Amsterdam.

Yeah. And I've been here ever since.

And as you mentioned, for the past few years, you've been working on this DuckDB project, which you have now built a business around. I'm wondering if you can just give a bit of an overview about what it is that you've built there and some of the story behind

how it came to be, why you thought it was a project that was worth spending time on, and sort of some of the areas of focus that you have for the DuckDV

platform?

Okay. So DuckDV in process SQL data management system and query engine,

specifically for analytical data management. So OLAP, it's called. Not sure the term is familiar to the listeners, but

data management systems come in many flavors. Not 1 size fits all.

And you have this traditional divide between systems that are built for transactions and systems that are built for analytics. DB is built for analytics,

which means, you know, it can do large aggregations, large joints, these kind of things.

Big data, some people like to call it. So DuckDee is special because it's

in process, which means it doesn't actually run as a separate database server like you know from literally everything else,

but it runs inside, you know, whatever application you wanna run inside of, typically,

Python, for example.

In that regard, it's very similar to SQLite, the world's most widely deployed database system. But, again, SQLite is, widely an excellent system. It's made for transactional use cases,

and DuckDV is focusing very hard on these analytical use cases with, you know, high throughput query processing.

And this is kind of funny because the background story there

was that

we

literally talk to people, which is very rare for computer scientists.

We like to talk to the computers, but maybe not so much to the people.

We talk to statisticians,

and they complained to us that they wanted to do this analysis, and it didn't work.

And they wanted to see whether we could do something about it. And at the time, we were working on a different data management system. It's also built the research group at CWI, and we were working on integrations with R, the R environment for statistical programming.

There was a couple of iterations. This took years years of, you know, trying to figure out what the architecture

for data management is for for, like, a a new data management systems could be that would support

these use cases. And these use cases

didn't like, for example, we're always like, okay. There's always a single user.

They don't really wanna spend time on setting up stuff. They don't wanna spend time on, you know, moving data around, importing, exporting. So this should all just work.

And, yeah, but at the same time, of course, state of the art data processing capabilities were required to deal with the large datasets. So that kind of gave rise to this concept that, hey. We need to make something that runs in process, that doesn't require us to do separate setups for servers and things like that. And we need to make something that is actually fast.

So we we're in a good position as an academic research

lab in the field of high performance data management systems.

Kind of had an idea

what the architecture of the system should be, But what is new, and it was still new, I don't think there's anything else

out there that does this in process OLAP has this in process OLAP focus.

And there's also nothing else that is custom built for this specific use case. And it's quite a it's quite a funky world to be in, to be honest. If you're trying to run-in somebody else's process, you have to deal with a lot of special things like,

oh, they overwrote a signal handler. Therefore, you cannot use a signal handler anymore, things like that. But yeah.

So so that's kind of how it came along. And and 1 of the cool things about CWI,

the institute

that I still work at, is that they kind of leave you alone. So you have very nice offices, and they treat us well. But there's this understanding that some things just take time, that this is, you know,

still worthwhile to do. We were able to basically disappear for a couple of years

and start building our data management system from scratch, which I can't imagine other places that would have allowed us to do this. It's definitely great to be able to be given that freedom, and it's always interesting to see what people will come up come up with given the time and space to really deeply understand the problem space. Mhmm. 1 of the interesting aspects of the project is obviously the name. So wondering if you can talk to where that inspiration came from and why it's called DuckDB.

I live in Amsterdam, and we live on a ship. And we're constantly surrounded with waterfowl. So at some point, I thought it was a great idea to have a duck as a pet. The duck was called Wilbur. You know, I got him when he was a little a little duckling and he's very cute because he was running around my kitchen all day long.

Eventually, he grew up, and he learned to swim and to fly and all these things. It was very cute. At some point, he just left, which is fair enough, I guess, for a teenage duck.

But yeah. So DuckTV, obviously,

is named after little Wilbur.

We also think that, you know, database systems tend to have completely ridiculous names, so we are in very good tradition there. Plus, they tend to have, like, names like shaving cream, like super performance power. You know? And I think this is just completely ridiculous

and also non Googleable, whereas StackDB,

once you can find it again, it's not a problem. So we are quite happy, and, of course, the logo designs itself, which is which is great. Absolutely.

When I first came across it, I just assumed it had something to do with duck typing and sort of being dynamic

and adaptable. But

No. It's literally ducks. And, I mean, we have some post talk justification for the names because duck are versatile. Right? They can swim, fly, and walk.

Also quite resilient. You know, they're hard to kill. So in that sense,

it's a perfect role model for our data management system. We can do a lot of things, and it's hard to kill.

This is something that we've actually spent quite a lot of time on thinking about for that DB is how to,

for example, tolerate broken hardware. Because if you're 1 of these cloud guys, you know, you run the hardware that your database runs on. Great. You control the entire environment.

If you're gonna be installed on everyone's laptop out there, which is what kind of activity is aiming to, of course, to basically become

something like SQLite, but then for analytical use cases.

If you're on everyone's laptop,

you're gonna have thousands of broken machines that you're running on. Right? So how can you actually deal with that? It's kind of a fascinating problem, and we're still working on that.

Absolutely.

And I'm probably jumping ahead a little bit here, but 1 of the things that I've run into personally as far as broken hardware and database systems is that if you have just a large binary object and some portion of the hard disk gets corrupted, then the entire database becomes unusable. And because of the fact that in the big situation I'm thinking of, it was just 1 single binary blob and, know, there was no sort of separation of individual files for individual records. So I'm curious how that manifests

in DuckDV because I know that with SQLite, at least, it is just a binary blob, and you have 1 database file that is the whole database. So I'm curious how that manifests in terms of being resilient to broken hardware, but still being portable and adaptable.

Yeah. That's a very good question. So we did adapt the single file database because in our interactions with the people, we have learned that they love to email databases around. And I'm not making this up. Because of this user requirement,

we decided to also have a single file data format, which basically means all the tables that you store in a specific Dactyb database, they're all in 1 file. The way it's resilient to corruption is that the file is separated into blocks, and each blocks has a checksum.

And so we can kind of early detect that something is wrong, and

you can also find out which blocks are still valid, for example, without too much trouble. This is something that SQLite also has, but they don't have a checksum. They don't they have the blocks in the file,

the, like, logical blocks within the file, but we have that too. But we have a checksum on top of it, which means that we know

if a block is corrupt, and we will say, no. This

this hard disk is no longer to be trusted because that's kind of what you have to do. Right? It's also pretty of a departure from traditional systems design where where we have to start distrusting the hardware. We've wrote a research paper some years ago about this topic for people who are interested

is how would you design data management system when you cannot trust hardware. And, again, it's not like some weird out there idea that we, you know, came up with on a Friday afternoon in our pushy clarship.

Sorry. In our NICE Research Institute. But it's something that we actually experienced. There was this 1 guy

on, like, a 32 bit

Brazilian laptop with Windows

that just had broken RAM some years ago. And he would keep sending us these bug reports that made no sense.

And it takes a while till you tell somebody,

look.

Your RAM is broken.

Throw your computer away. You know, this is difficult to tell somebody you think it have issues. So this is when we decided, no. No. No. No. We're gonna defend again this kind of thing. You know?

But this is indeed a bit of a far out discussion already. That's dedication. Yeah. The specific database that I was dealing with was actually a Zope database, which is just a sequence of pickle files that have some vague correlation between the individual records, but very opaque or completely nonexistent documentation about how to repair corruption and such a thing.

From pickles. That's not pretty. No. I I agree. I mean, for SQLite, which I've actually written a third party reader for this format that they use because it's the way I understand things

is to write a parser for it. So I've written this SQLite parser, and the format is actually you can totally.

It's, like, 2, 300 lines of c code to to parse this. It's not it's not crazy. 1 of the reasons why the SQLite's own code for this is so much bigger because there's a lot of error checking it for airplane purposes.

And so given this similarity in terms of the ways that

SQLite and DuckDV

are used of being in process and potentially being very ubiquitous. It can be easy to,

you know, maybe confuse the use cases between them. And so I'm wondering

what you see as some useful heuristics for deciding when to use which format and maybe some of the ways that they're able to interoperate either in terms of just application design or being able to convert or import back and forth between them?

Yeah. So it comes back to this OLTP versus OLAP discussion. If you have transactional use cases, SQLite is the obvious choice. And what is a

transactional use case?

It is when your

queries and your changes to the data are focused on few rows at a time. Right? So you're updating

a single row. You're tracking some sort of order or user data or whatever. So you're updating

the queries, usually

update users,

set email is new thing where ID is 42. Right? This is a very typical OLTP query. Yeah. We call this a point query because it points to a single row. And for this kind of stuff, SQLite is really excellent.

On the other hand, have overlap style queries

that look like

select

group

name, comma, average,

I don't know, price from

table with a 1000000000 records,

then you have more of an overlap style use case. And for that, DuckDV would be the far superior choice because of its internals.

It's more a columnar design. We can talk about that later,

I guess. But it's sometimes funny

because,

like you said, what we've copied a lot of good ideas from SQLite in terms of packaging,

in terms of how it behaves. You know, it's just a library. You do pip install,

and you have it.

But, okay, a lot of stuff is pip installed

x these days. But in our case, it really is just a module that is linked into the process and loaded. And

what is very different, and I think I think SQLite users realize this, but for OLAP, it's pretty revolutionary that the OLAP system is suddenly wherever you want it to be because, traditionally,

you know, these things are kind of bound to the machines they live on. You know, you have your teradata

somewhere in your company,

and that's where the data lives in your data warehouse. Or you have your Snowflake instance these days somewhere, That's where your data lives. But you wouldn't ever dream of saying, hey. I have this

software that I wanna, you know, install on somebody's computer, and I I kind of need to do some data wrangling, and it's actually quite serious. So let me just drop a duck DB instance there because you would just not be able to install a Teradata

on every single 1 of your customer's computer. You would not be able to install a Spark or you would be able, but you would want to install a Spark

on every single 1 of your, you know, phones that run your app. You know, it's not possible. But with that, you can just simply pull pretty serious data crunching capabilities inside,

you know, wherever you want them to be. I think this is the kind of fundamental architectural

difference that this enables is that you can pull this query processing capability wherever you want. But, yeah, to come back to your question, the heuristic is gonna be

how do your queries look like,

how do you interact with this data. And there's also you also had this question about how could you possibly combine the 2, and I can happily report that we thought of this. We're gonna

soon release something we call the SQLite scanner,

which is gonna be a functionality so DuckDuty can directly read SQLite databases as if it were local files. And the cool thing is we can do it faster than SQLite itself because we parallelize the read. So it's like

not only can we read this, we can also do it faster, which is kind of funny.

Yeah. The other way around, if you wanted to read

a DactDB file from SQLite,

I don't think that is super easy at the moment. Well, I mean, it's kind of in keeping with the general flow of data and analytical systems anyway, where you have your transactional

application database and data moves from there into your columnar analytical warehouse and then from there to some other

downstream system, but usually not in the reverse. So Yeah. This is true. And this is also, I think, why this is super interesting. This even works with SQLite while you run other on your SQLite. Right? It doesn't really interfere. We're also working on some similar functionality for Postgres,

where DuckDuty can directly read Postgres'

databases because we also think

that, you know, too often big data is just, you know, the same fairly small data copied a 1000000 times because people have no other choice than to replicate

everything from, I don't know, the Postgres where they keep their operational stuff to

whatever system they feel they need to use in order to do analytical queries, which, you know, it's unfair to benchmark something like Postgres on TPCH. It's just not built for that. Right? Yet people still do it, of course. I love it specifically when academics, you know, in my field

propose some new OLAP architecture, and they just, in the paper, have a benchmark against postgres, and you just think, like, come on, people. This is just mean. You know?

You don't do that.

But still, I think the combo

is more interesting, in my opinion, than the HTAP world. You know, you also have these HTAP systems that attempt

to fulfill both use cases, to be both a

crack transactional system as well as being good at OLAP.

I am not entirely convinced this is, you know, the way forward either because

you're gonna be forced to compromise somewhere.

So I think having

a bit of a closer bond between 2 systems that focus on, you know, individual aspects of this.

For example, something like Postgres or SQLite

paired with something like DuckTV

or something else is probably a good idea. In the other direction

of

being in the point where you're trying to decide, do I want to use DuckDV

versus,

you know, some beefy data warehouse, whether it's a, you know, cloud provider or something like a Vertica or a Teradata that you're running on premise?

What are the deciding factors of when you might want to use 1 versus the other or possibly in conjunction?

There's an interesting question with the conjunction. I'll come back to that. I think

the traditional data warehouse

is your obvious choice if you have, you know, thousands of analysts hitting the same data

all the time because

DirectDB is in process. Right? So we don't do client server,

which means you would have to replicate your data to all your analysts or have them all work on the same cloud box with all the data

in order to, you know, make use of DuckDuty properly. I think for these use cases, you would be better off using something like Teradata

or Snowflake.

What I think is a bit underrated, though, is this capability of

everybody being able to run their own thing. And I think as an analyst, you will not be able allowed to just

wildly create tables in your Teradata instance that, you know, are just like small variations of the giant fact table that sits there. Whereas this would be perfectly allowed if you just have your own documents instance and you can do whatever you want on your local machine with it. What we see as a use case sometimes

is that people use

the use them in conjunction, which is quite clever.

I think where they say, okay. We're gonna make an extract

of our data warehouse for 1 particular analysis question.

We're gonna write that to a parquet file, for example,

and then we're gonna give this to the analyst.

And the analyst can or the data scientists, whatever you wanna call them, can then, on their own, on their local machine, without interfering with anyone, without, you know, producing extra costs on what on Amazon, whatever, they can then play with this data as much as they want using

the standard SQL. You know, you don't have to learn

strange Python package syntax to do data analysis. You can just use

normal SQL queries, and we have actually quite powerful

SQL support. So this is

really the same kind of SQL that you would run on something like Teradata.

And then they can they can play and can explore and really can get a faster

turnaround time in further analysis. And I think that's really promising

sort of concept. And because like to be natively reads parquet files,

you already have a data format to do so. But in general, I would say

if you have a 1, 000 people that are supposed to do analysis on a dataset

that is central and doesn't change, you probably don't wanna use DuckDV for that.

1 of the ways that DuckDV

is strongest is obviously for this in process use case where you have a language environment. You have data that you're working with. You want to be able to take advantage of the acceleration and performance benefits of DuckDV.

1 of the ways I've seen that is in the Python ecosystem where maybe you have a pandas data frame, and then you read that into a duckdb file. You can use SQL to be able to perform some analysis on that and then

be able to transform that back into a pandas data frame. And I know that there have also been some optimizations to be able to use the arrow format to make that a, you know, seamless or completely

cost free process. I'm wondering if you can just talk to some of the benefits that DuckDV provides over and above some of the capabilities that are present in these different language ecosystems or some of the ways that you might want to use DuckDV to be agnostic to the language entirely?

We have, indeed, as you mentioned, we have integrations with many, many languages. I've kind of lost count at this point. But there are 2 languages, I would say, that are special, and this is Python and R. And that's because I think a lot

of our data analysis,

data science workflows

start in Python or R and end in Python or R. So this is why we spend some extra time

with the integrations there. And, yeah, so 1 of the things that is super powerful with DuckDuty if you're running in process

is that we can

read the memory of the process we are running in. It's a very strange concept.

But, yeah, it's just a you know, within the same address space, we can just look at things. And the internal layout of things like Pandas or data frames in R is not really a secret. So you can just look at these things. Right? And that's also something we've done in Dactyv. So, indeed, as you said, you can import the data frame directly in a database file in Dactyv.

But you actually don't need to because we can run SQL queries directly

on pandas and r data frames by simply saying, hey,

query. Instead of reading from the file,

please scan, we call this a scan operator,

scan this data frame. This is even 0 copy for some data types, so we don't even copy anything. And because DuckDV is a streaming engine,

it means that it does not actually materialize your data twice in memory just to query it. If you're just running a simple aggregation on top of 10, 000, 000, 000 rows of a data frame, you will use constant memory for that, which is pretty cool. We also have this on the other side where you in our in Python, if you have a query result from DuckDV, let's say you run a query

from whichever source, you have a result

that we can efficiently then transform again back into data frames in Python

and R, Julia as of this week, I believe. And this is also super powerful because,

again, we are in the same process, which means in order to create a data frame, we can already set up all these in memory structures,

fill it with stuff, and then just hand over and say, hey, pandas. Here's the data frame that

exists in memory. Please use it. And that totally works. What it allows you to do, for example,

because it's so fast, because it doesn't have to go through a socket or something like that, allows you to quickly do an iteration over queries. You know, you can do some fixed point iteration if you wanted to and just read a table from database, iterate, write it back, and just iterate. And that's not a problem at all.

With Arrow, yes, indeed, of course, we want to be good citizens in the ecosystem, and we've actually been working with Fulcrum Data on

improving the integration between DuckDb and Arrow, and the same applies there. You can run SQL queries directly on Arrow

structures,

specifically on tables and record batches. We can also, again, export query results

directly as arrow

patches, basically.

So, yeah, you can also, as I said, move this iteration productively on arrow, and it just goes in, interesting circle. There was 1 interesting bug that we had there a couple of weeks ago that I'd like to talk about because it's not something that you think about when you invent this kind of thing. So we have 0 copy, yes, which means we can use

memory

that we, you know, see on the input of a query like in the from part.

We can use, for example, from an arrow structure,

we can use that and directly put it into our engine without copying it. Now that's very powerful because, as I said, no copy,

better performance.

And now this can go all the way through the query engine without being copied because

for some queries, we just keep referencing the same memory pointer because,

you know, imagine

we only do a projection on something, the column data that is not being touched.

And now this goes all the way through the engine. It comes back into the result. We do another 0 copy where this time, we make a new arrow structure that references duckdv memory. Okay?

But we forgot to think about the fact that this memory that DuckDuty thinks it's his own is actually not his own memory, but it's something we borrowed from Arrow all the way back there. And now this can go in circles. So you end up having this

circle of references

to memory

that is kind of arbitrarily

long because you keep referencing the same memory. And now the question is, who calls free? And remember, you can only call free once because, otherwise, you

crash. It's really like okay. No. Okay. First of all, how do you keep it from being freed? Okay. That's usually not so crazy. And second of all, how do you free it up? And then because in order to free this correctly, you have to circle back all this way through the multiple, you know, recursive layers of the Dactybe Aero Dactybe Aero

interactions.

And, yeah, this was kind of funny because it took our people and their people some time to figure this 1 out. But now it works.

But you don't imagine this kind of problem because, you know, traditionally, this problem didn't exist, like, 0 copy between the database and your your application.

What? You know? But, also, something I wanna mention there is that, yes, we have these special scanners for data frames and for

archive files and for arrow and for CSV files and for a bunch of things. For SQLite, I mentioned this, But you can also plug your own. DuckDive, we are kind of data communists in that sense that we treat all data sources equally.

And our own table scan operator is actually just

yet another 1 of these sort of zoo of table scan that we support.

And you can bring your own as well. So these are all just functions, and you can have a user defined function.

There's a table producing user defined function

that reads whatever data you have in whichever format it may be and makes it look like a table, and then you can run whichever SQL query you want on it. And that's actually something that we've been doing quite a lot in our spin off company to work with people to build scanners for data that they have. For example, they might have a custom

application that stores a lot of data internally.

They wanna add SQL support to that. Fine. We will be able to build 1 of the scan functions for them, and they have the exact same capabilities as their own scan functions that they can do. Projection push down, they can do filter push down, they can do all these things, they can do parallel reads,

And this is, I think, quite a unique capability

in a data management system that you have this sort of arbitrary

sources of where the table data can come from. So today, for example, I was running some benchmarks and aggregates because I was working on that some time ago, and I want to finally write a blog post up on it. And then, yeah, in this case, we are just running directly from pandas, and it's

actually as fast as running from our internal storage, which I think is pretty unique. That's definitely pretty amazing. And if I was more of a masochist and actually had free time, I would be interested in actually writing 1 of these plugins to be able to use the DuckDB to scan a ZopeDB

to be able to run analysis across that.

Yeah. Yeah. Exactly. I mean, that would be exactly 1 of these examples

to say, yeah, if you have terabytes of soap DB lying around, you could write your own scan function and

just treat them like SQL tables. Yeah. In this case, it's about a 65 gigabyte Zope database file, which is enough of a pain to deal with on its own. That sounds horrible. Yes.

I'm sorry.

Thankfully, we're in the finishing stages of making sure that nobody ever has to think about that again.

What is new speakers, if I might ask? So this is actually for the OpenCourseWare

project from MIT because that's where I work for my day job, and we have been using Plone as the content management system for that for a long time. And so it's 20 years worth of content data for publishing the OpenCourseWare site.

Oh, yeah. That's not something you wanna lose either, isn't it? Yeah.

Wow. And we had a SAN failure where some of the drives failed and other ones got corrupted. And so, you know, the middle portion of the 65 gigabyte pickle object was corrupted, and we had to spend about 3 months and many man hours figuring out how to reclaim it.

Couldn't you brute force it? Like, flip the bits until it parses again or something?

At this point, I have forgotten all of the details, but, eventually, we were able to reconstitute

the database, and all was saved.

Yeah. This is actually quite fascinating because this problem this is a bit of a weird problem, I admit, that you just described, you know, having all these pickles.

But the big data, of course, these days lives in parquet files. Right? And parquet files, they use something called Thrift to encode the metadata inside the parquet file. And Thrift

is also a very, very unforgiving encoding. So if a single bit in this Thrift encoding goes, you know, good luck getting anything out of that parquet file because Thrift will say no. Without the Thrift metadata,

you have absolutely no chance of knowing anything about your parquet file. Of course, you can

guess, and sometimes you can reconstruct things from but it's super fascinating because I've wrote the duckdv parquet reader myself. I was so annoyed about this. I also wrote a Thrift reader.

It is really fascinating

how people prioritized

this design of these formats, like Thrift the same also helps for protobuf, for example. And they really prioritized

saving every little last bit,

and they did think about the fact that maybe at some day, somebody needs to recover this in the HEX editor,

and maybe we shouldn't save every single bit. Maybe we should throw in, like, a marker here and there. You know? Just put the end marker at something. Or

they they they kind of conflate, for example, in Thrift,

they conflate, what is it, the field ID and the type in the field ID is in 1 byte because, obviously, we can't have 2 bytes. We have to work 1 byte. And, obviously,

correctly, yes, the field ID is sometimes short and sometimes the type ID is always short. So, hey, let's put them in 1 byte. But what this means is that you just get this crazy opaque bit stream

that you have really little chances of recovering. I think there's actually a start up there. This 1 is for free, like Parquet, you know, forensics.

This is really

I mean, I have some experience,

but I think you can ask arbitrary amounts of money for these kind of services. That's hilarious. Yeah. It also puts me in mind of early in my career working at a startup that was dealing with some IoT equipment that had its own custom hardware and custom packet format. And the whole hardware and packet format was designed and built by a vendor, and we were trying to build the next generation of hardware and remember what the actual packet structure was and starting to pick apart the individual bit fields of this packet to be able to understand what the messages are supposed to be so we can throw it into a database.

That's juicy. Yeah.

An interesting time.

Yeah. Yeah. That is the format's encodings. I do love it, though. I have to admit. This is like digging around on the bits until they figured it out. Yeah. On that note, I'm wondering if you can describe a bit about how DuckDV itself is implemented and some of the architectural details

and some of the ways that the design and goals of the project have changed or evolved from when you first started working on it? Yeah. It actually uses something called a vectorized square processor.

That is 1 of the

leading sort of designs right now for OLAP engines. There's 2. There's the just in time compilation camp, and then there's the vectorized query processing camp.

Briefly, what the JIT people do is they take the query,

they throw it into LLVM, out comes a binary, they run the binary.

We do in DuckDV

is we have an interpreter and not a compiler, which means that we just generate a query plan, which is then interpreted.

But

the query plan is not interpreted for every single row as it is in things like Postgres or SQLite.

But it is interpreted for what we call vectors of data, which are bigger chunks of data. Something like, for example, right now, the default is a 1024

rows per chunk.

And what that does is it amortizes

the interpretation overhead versus the execution

time of every query operator.

And what's

that leads to is that you decrease the

cycles per sort of row processed. That is the metric that we use a lot and thereby leads to good pull up performance.

You have to imagine that to be as a sort of stream

of these vector operators. So if you, for example,

were to write a query that says

select some columns from a table

where, you know, some condition,

you would get these vectorized operators that, for example, first is a tables can be talked about tables can can be anything. But in this case, let's assume it run it runs on our own table format. Tables can have a filter operator. There's an operator that gets

1 chunk of table at a time,

runs its filter,

and

then moves the resulting vectors of data minus what was filtered out back upstream to the next operator,

which might then be a projection,

which then, you know, throws away some of the columns that are not needed anymore. This is a simplification.

In reality, things are more complex. For example, the filter will be pushed into storage that contain things we don't want, but that's in principle how it works. We have operators that operate on vectors

that complicates things a bit sometimes. While this is pretty trivial for things like filters, you know, whether you're running the filter on a single row or a 1, 000 rows, really doesn't change a lot for implementation.

But if you think about it, if you wanted, for example, to design

a aggregate,

you know, on a vectorized system, how would you actually do that?

And what DuckDuty does there is that it has

this concept of pipeline.

It takes a SQL query, breaks it up into pipelines, And pipelines are things that can be run-in sort of a stream of data without having to materialize any intermediate step. And, obviously,

there are things like pipeline breakers. We call them things like aggregations or joints where you have to wait for 1 side to be fully materialized in the pipeline breaker before you can start streaming the other side. And so this happens all automatically where we, yeah, we basically take your query,

separate into pipelines,

and then the execution engine will start pushing data through it. The individual operators,

yeah, they have to deal with the fact that they see data as a vector at a time, and they also have to avoid the temptation

of simply

running a scalar row wise interpretation under the vectors. Everything we do has to be vectorized, and that's sometimes a bit complicated

from an implementation perspective.

But yeah. So this is the basic design. This is also what makes it fast. This concept of vectorized query processing

was incidentally also something that was developed at CWI

by our colleagues, former colleagues, still some colleagues still there. But this is what we use in DuckDee.

And the reason we use it actually

is because DuckDee is supposed to be lightweight. We are, like, 10 megabyte binary or something like that. And you cannot do that, in my opinion, with JIT engine because in order to get all the compiler infrastructure there, you're already like it 200 megabyte plus or something like that. So in order to keep

the footprint small, we had to go for the vectorized 1, but we also think in general that is a more

approachable concept because

when you start to auto generate

executable code, you also complicate things like debugging a lot. And that's something that,

of course, it also is a factor we want to be an approachable project where

it's easy to contribute to.

And that means that we also have try to keep our engine understandable, which is something we spend a lot of time on. So the goals I don't think the goals have changed a lot. You know, we started out

revolutionizing data management.

You know, we'll see where we get.

But the basic design of the vectorized engine

has also not changed, but what we constantly do is we rethink

pretty significant chunks of the internals.

So my cofounder and long time collaborator, Mark Asfeld, he's, for example, rewritten the execution engine

3 or 4 times at this point because we've written 1, And then we realized, oh, this is not gonna work

because we you need parallelism now. And then we realized, oh, but now we actually don't wanna use pool based operation, and we're gonna move to push based operator service rewritten

again. So this can be

done transparently from the outside because the cool thing, of course, about having a SQL engine is that the user gives us a query. The user doesn't give us execution plan. So we can do with that query whatever we want.

And 1 way we keep ourselves sane

is by having a monstrous testing suite. So we have millions of queries in our CI,

and that is how we can confidently

innovate

on individual components

inside the engine where the design constantly changes of individual operators. Right? I mean,

somebody, for example, 1 of our colleagues has recently rewritten the sorting operator

to be able to go out of core, you know, when you're sorting more data than fits the memory. You know, how do you deal with that? So he just ripped out the entire sorting operator, made something new, and that can get done completely transparently from the outside because at the end, it's a SQL query. The result is correct.

CI passes. Everyone's happy.

This was something that we realized

was actually crucial to also enable contributions, you know, because if the CI passes, we know that it can't be that bad. Right?

So when we're reviewing pull requests and we see how the tests are green, we also can be quite confident that it doesn't produce anything horrible,

which allows us to be a bit more open to pull requests than maybe other projects.

Yeah. But I think it's also something that we're gonna

keep iterating on, yeah, operators and

details of the internals.

And it's but it's really a relief

that our outside API

is mostly fixed, you know, SQL Korean table out.

That does take a lot of kind of design space out of the equation

and allows you to focus on,

you know, the details, which if you're writing something from scratch with a new API,

you constantly

have to deal with this. Also, the changes from the outside, the changes from, you know, demands from customers. SQL engines are pretty straightforward in that sense. You don't have to innovate on that level so much. We recently emerged a groundbreaking feature that you can

have commas at the end of lists of expressions in SQL because people complain.

Yeah. I was gonna ask you a little bit about the specific SQL dialect and some of the ways that you've thought about

how you interpret the SQL specification

and any extensions that you want to provide given the fact that you are looking to be embedded in other language run times and just some of the

capabilities that you want to support as far as any sort of customized functions versus pushing that into the language runtime and being able to handle

complex data types and things like that?

We have started with a fork

of the Postgres parser for Ducktb.

So we took something called libpqquery,

which is somebody took Postgres parser, made a separate project. We took that thing

and heavily modified it again.

But the lineage, if you want, is Postgres parser. I've forgotten which version it was, but not that old. Which means that we have, like,

exact compatibility with that parser, which is pretty great because lots of scripts generate Postgres compatible

SQL queries.

That is, like, the 1 side of which dialects Postgres. Right?

Although,

since we've, you know, modified the parser anyway and and refracted it and made it compile in c plus plus and use exceptions and all these things, We've also added things to it. So, for example, you know, I've mentioned this joke feature kind of allowing commas at the end of expression lists

is something that we can do

because, you know, we've been through this process so many times. We can't do it. The language interface

with functions, indeed, that is also something that is extremely fascinating

in in process database engines

is because a function a user defined function,

traditionally, is something that you define in

some horrid PLC call dialect or something like that. Or

maybe if you're lucky, the thing will let you write JavaScript or something like that.

But in our case is, for example, in the CC plus plus AVI,

the UDF is just a function pointer. Right? Because we're in the same memory space, you can give us a function pointer.

You use that name in your query,

and it will call your function with the data. Very fascinating concept. And, basically,

it will, you know, arbitrary code, whatever you want. I would say this is still something that we're working on a lot. Do I expect to see some more capabilities in the future? But we're working on mapping that to the target languages. I think for Python, we have some initial UDF support for map queries where, you know, it's like, here's a table dot map now this function to every row, and that will create 1 of these user defined functions from whatever Python function you give it. We are also working on something similar for r,

where

we are working on scalar functions.

UDF support for scalar functions also in the web assembly version

of DuckDV

and, also in the node version.

By the way, the web assembly version is wild. I'm not sure if have you seen this? I think I had noticed that there was a WebAssembly

option for using DuckDV, but I didn't dig too much into it.

Yeah. This is something that was super surprising to us because it's something that

somebody else, in this case, Andre Cohen from Munich, came up with. Because DuckDV is just written in c plus plus and has no external dependencies,

you can totally throw it into the same scripting thing

and run it in the browser.

And

that is also 1 of these things that when I first saw it, it was really unbelievable because

I thought, no way you've just taken our many how 100 of thousands of lines of c plus plus code and made it 1 of these browser things. But this he did, and we actually have a demo. I think it's consult.ctv.org,

where you can kind of just play with this in the browser, and it's the entire thing, you know, running browser, including, like, Parquet support and the whole thing. And you think, why would somebody wanna do that? But then I saw a demo that completely blew my mind where, you know, the

the usual lag was just gone. You know? You had, like, a dashboard, and you clicked around. And, you know, your brain expects that there's a lag because it has to go back to the server and do some callback thing and then return the result and yada yada yada.

But that was just not there. It was just like click, and before you kind of your mouse your finger came up again, it was already rerendered. It was totally mind blowing. So I think this is something that I hope you're gonna see a bit

more sort of usage of this this this WebAssembly version because it's literally mind blowing.

That's definitely pretty wild.

And on that note, because of the fact that it is in process, because it does unlock these additional performance capabilities and the SQL dialect on data analysis in all these different contexts,

I'm wondering what you see as some of the common workflows or applications of DuckDV

in analytical projects or its use in data pipelines

that data engineers might build and just some of the types

of libraries or tools or systems that it might replace in the context of these workflows

or projects or team environments?

Yeah. What's the workflow of incorporating doc DB? I mean, I've briefly said it. You can do pip install doc DB, and you have

it. I think very often, it is literally 2 lines of code, and you're up and running SQL queries on your data frames or whatever you have.

And this also is, I think, powered by the fact that we do read a lot of stuff natively already, and we try for every target language,

basically, be able to read the most common

table formats that exist in the target environment.

So if you're, like, in Python or in R,

you would should be you just include the package

that is a prebuilt version of FactDB contains all of it, like, from SQL down to storage

and to start using it, for example, to run on your data frames directly

with SQL queries. Although we also have a way of you building plans directly, we call this the relational API where you can

do something like dot filter, dot project, dot aggregate, and so on and so forth. Right? If you program, this is nicer than concatenating strings to make SQL queries, in my opinion.

But we also see there's people building bigger things around DuckDV

that are using parts of it. Right? Because nobody forces you to, for example,

use DuckDV's front end. You can also directly construct

DuckDV query plan and execute them. Some people do that. Other people

do only use your do only use our front end and have us take the query

in SQL then transform that to a query plan

and then execute that themselves.

There's people that only use the Parquet reader. There's people that only use the parser. There's people that only use storage.

It's kind of an interesting thing

that, yeah, there's this kind of standard use case of saying here, just take the prebuilt kind of package as it comes

and use it basically as packaged.

But there's also a lot of use cases where we see that

for bigger deployments

where people just mix and match, you know, to say, okay, I'm actually gonna use Arrow to do this, and I'm gonna use StackDB to do that. And then I'm gonna pipe this all to

some other TensorFlow,

you know, something like that. It's pretty fascinating.

So I think the easiest way to start

is just with these Python packages and the R package

or the R package or any other of the packages.

Put it in the project,

start using the existing APIs.

If you think that you need some sort of deeper

sort of way into digging into or you only need parts of the thing and you want to hook into 1 of the APIs

directly, then make use of some of the data structures directly.

That is very possible. And if it's not possible, we wanna hear about it because it's 1 of our stated design goals.

And it's also something that we do work with people with. Right? For example, there is this company called stoic

that is building

a data analysis back on top of DactDB,

also using the Wasm version, for example.

We have been working a lot with them on, yeah, basically extending things that weren't there yet or, you know, they realized they wanted to have a hook here, so we had to hook there. Something we do sometimes. And what other tools

that that we might replace?

So

I'm teaching some courses on big data every now and then, actually tomorrow again.

And

1 thing I've always

kind of found very strange

is how hard it was to use these tools like Spark.

Have you ever tried to install Spark? I have not had to do that myself, but I've definitely heard accounts of some of the challenges that it poses

and actually had an interview with somebody who their whole business was around making Spark run-in Kubernetes.

Hooray. Yet another layer.

But I don't know. I feel that

we are using a lot of infrastructure.

It's as if you drive a dump truck around as your daily driver. You know, that does make no sense. While Spark has its place, I don't doubt that. I also think it's complete overkill in, like, 99 of the percent of the use cases it's used for. We had this fun experiment where we

ran some query in your taxi dataset, you know, the standard

whatever 1, 000, 000 taxi rides, yada yada yada.

And we run it on a single note on Amazon,

and then we ran the same queries with Spark

on the same note type, but then we measured how many nodes do we have to add in order to beat the performance of DuckDV with that. Any guesses?

So 1 node of DuckDV, I'm gonna say

32 nodes of Spark.

Very did you look at my slides? I did not.

That's a very good guess. I think it was 32, actually.

And

That is a good guess.

Yeah. I mean, I went up in powers of 2. You could have you probably guessed that 1, but this was completely crazy.

And so

also, if we have a planet to protect, I mean, you cannot really justify

using 32 boxes when you don't need to. Right? I think it's my hope that we will

stop using these

infrastructures that are built for something else, for another dimension

of data problems

on really small scale stuff. Another example that is really annoying is you have these, you know, company x install Spark cluster. Right? Has a 100 machines.

They have a 100 employees. Now they have to start

doing some sort of load sharing scheme on their Spark cluster,

which

basically means everybody gets 1 box on average,

why not just give every employee a virtual machine that runs DuckTV and save yourself a whole lot of hassle?

I think I mean, it's my hope

that DuckDB would replace

a lot of these

needlessly distributed infrastructures.

I know this is not the most, let's say, hardware vendor friendly thing to say. It's also not the most cloud provider friendly thing to say,

but I don't know. This would be 1 thing that I would hope it would replace.

On the lower end, I think I also would like to see

use cases that

all app use cases. People use SQLite, in fact, also for all app use cases.

I think it will eventually get to the point where

that is something

that happens in DuckDV because it's just better suited for it. Fun fact,

the creator of SQLite

has been sending people away for a while now when somebody pestering him too much about on a little performance. Apparently,

he says, go just go use StackDB.

That's just kind of

That's great. Thank you.

So that's, I think, the

yeah. Yeah. Where I see this

going from both sides. Like, we take the the tasks that don't belong

on Spark or

other distributed

systems, and we will also take the tasks that don't belong on transactional systems on single node. And they'll just supercharge

basically the

capabilities of single node, you know, because I think people don't realize

is how much you can actually do on a modern laptop. You know? Because when do people start using Spark? It's when Panda stops working for them usually. Right? Yeah.

But there is a huge amount of headroom between

where Panda stops working or where r stops working,

but you actually need to use, you know, something else. I would say that's at least

2, 3 order of magnitude

on top of that, probably more.

And I think that's also where we wanna be. Absolutely.

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control.

Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool.

Sign up for free or just get the free t shirt for being a listener of the data engineering podcast at dataengineeringpodcast.com/rudder.

In terms of the

project itself, I know that it's open source and also that you

recently formed DuckDBLabs

as a corporate entity to help support its ongoing development

and provide some additional capabilities around the project. So I'm wondering if you can talk to

the governance model of the open source project, some of the long term sustainability

goals, and some of the ways that you think about the boundaries between the open source project and the commercial entity that is helping to support it. That is an excellent question. It's something we spend quite a lot of time on as well.

So if there's DuckDuty

project, the open source project, and there's DuckDuty Lab, the company, as you said, We have similar logos, but different colors.

Let me start with the project. Project itself is governed by the DactoB Foundation,

which is in in the Netherlands, which is a nonprofit

foundation here

that holds the, you know, the IP

that makes for an MIT licensed project, but it holds the intellectual property in the project. It also has all the rights in it,

and it is merely there to basically perpetuate DuckDV.

The foundation

collects donations from commercial users of DuckDV,

and we've we're very happy that we've already,

managed to attract a couple of, you know, sponsors.

But this money that we get from our sponsors, we basically

help fund the

development of the open source, DuckDV, in sense of that, you know, there needs to be maintenance.

Bugs need to be fixed. And there is also

a global roadmap

that we that we discuss

with our sponsors. Basically, if you have become a donor to the

Dactyv Foundation, you know, we discuss our roadmap for for the open source Dactyv

with you, and your input is heard.

And

the money that is given

is also used to fund

this road map to execute this road map. So that's really

purposefully

disconnected from the company.

We have no way of interfering with this even if we wanted to. It's in the statutes of the foundation that the DuckDV project needs to be stay out open source, MIT license.

You know? So we cannot just change that. And that was

done, yeah, to protect the project

from the

company, from us, which is weird, but I feel it's necessary.

So Doctor. B Labs

is a commercial company that

that provides,

let's say,

in-depth,

you know, professional

support for DuckDuty itself. So if you're betting the farm on DuckDuty, you might want

to have a way of getting

stuff fixed quickly. So that's kind of how that works. Right? And we have a couple of strategic partners for DuckDV Labs. Not very many,

but people that are really invested in using DuckDV

for their, you know, purposes

that basically contract with us

sometimes for custom development

around DuckDV. Sometimes they want us to prioritize things.

Sometimes they want us to add new features.

So far, this has been

extremely successful in

sort of marrying the goals of the open source project

together with the things, you know, our customers want. So we haven't

had to

make evil trade offs yet. So I'm quite happy about that. These are separate entities, and they have separate goals. And I think it's important,

yeah, to separate those because the company can go

anywhere. You know, the company might

do anything,

but we would like the project to be independent of the fate of the company.

In terms of the applications of DuckDB, you've already shared some pretty interesting and wild use cases for it. I'm wondering what are some of the other

interesting or innovative or unexpected ways that you've seen it applied?

We've seen a lot of funky things. So there was somebody that

built a big data cluster using iPhones,

which was

okay.

Yeah. So, basically, just glued together a bunch of iPhones that run DuckDB

and use that to do large scale data processing.

Why? I'm not entirely sure, but I thought it was very funny.

There was 1 project that I was really impressed

by where, actually, somebody works with us. He managed to port DuckDV to run on the Intel XGX

trusted compute thing.

So you could actually run part SQL queries

inside this hardware enclave thing. As I've mentioned, the Wasm module was also something

where I'm just blown away what people have done with DuckDee.

On the other edge of the scale,

you know, we're talking

with people that are building

giant,

you know, cloud infrastructure

using our, you know, puny

little project as their building blocks.

And that's also sometimes very interesting to see.

Yeah. We should probably be better at collecting these stories. But what really makes me happy is remember I said that DuckDuty was kind of born out of these discussions with data scientists

not being able to do what they wanted to do. And what really makes me happy when I see somebody on Twitter saying, wow. This was not possible before,

but thanks to, it's not possible. And this is usually some totally random thing like, oh, here's this giant archive file or here is this,

you know, some CSV thing or

just some capability they were lacking before that they now have.

That really makes me happy to see that. I wanna say Lambdas.

So people have, you know, Lambda functions, right, in in cloud Lambdas.

And it's another 1 of these things. If you realize that the SQL engine is no longer bound

anywhere because you can just pull up wherever you want, you totally can pull up duct DB Lambdas, and people do. So you can say,

hey. I have a Lambda

that's reacting on some

parquet file uploaded to folder

that then runs a query on this parquet file and writes another parquet file to this other location.

You know, you can actually build a entire query tree out of that. And so we've seen people that

have been building

Lambda based

query processing engines, which I think was also pretty cool.

It's just really impressive.

GitHub shows you sometimes which projects depend on your project. Sometimes I look at that and also just really curious. And I also sometimes look at the people that start Dactyv

just to see

which companies are using Dactyv, and, you know, you wanna know. And it's also really crazy because I think at this point,

we have seen everyone. It's really crazy.

It's also a bit because it's so unobtrusive. You know? You can just start with a single person even in in a company and then

slowly, slowly grow.

You don't have to go golfing with the CEO yet. Absolutely.

In your own experience

of working on the DuckDB project and working with the community and helping to run this business around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?

I'm an academic, as you know, from my background.

And, obviously, we are used to know everything.

And I thought I had a pretty good grasp of SQL,

let's say, before we started DuckDV.

It turns out I was wrong. I think that it's extremely humbling to see what you can actually do with SQL. That's 1 thing. That was challenging sometimes. For example,

Mark spent enormous amount of time in subquery folding,

you know, where, you know, SQL subqueries is 1 of these most challenging things in the world.

I have spent enormous amount of time on window functions,

which is something that

most people have probably not heard of.

If in your SQL query, you wanna have, for example, an aggregate value,

but you don't wanna actually do an aggregation, but you wanna have a scalar function in terms of the aggregate. Okay. That sounds weird. It is weird. It's called a window function, and it usually recognizable by this over clause. It's over and then something in a bracket.

Okay. That sounds pretty benign, doesn't it? But it turns out evaluating window functions

is this endless well of complexity

that,

you know, basically, it's not only a pipeline breaker, which means it needs to fully materialize the input and then do weird things to it, but it's also

the only SQL operator that has a notion of locality,

which means that you can look up and down in the table,

and that breaks all your previous assumptions about, you know, how you can actually

rethink the execution of a SQL operator

because the window operator is just different. I have to make special considerations for it. Making the thing fast

is also

really a nightmare because, again, they have so many ways of executing those window functions that,

you know, the traditional SQL way of saying there's 3 ways of doing it, we'll make all those fast, and we're done. Simply doesn't work. So that was also something

that I was really surprised by is the complexity

of seemingly benign things in SQL

that

also also the technical complexity of making these

things run and making them run fast, of course, which is our added problem. Right? We can't just

pull a Postgres and implement it in whatever way, but we have to actually

think about it how we can execute this quickly. And we have to make it work in a vectorized engine, which, as I mentioned before,

sometimes throws additional, like, architectural

complexities at you. So that was really, I would say, humbling, how much we didn't know

and also to see,

yeah, where the real problems are. You know? And I think that also is something that sets us apart because we're an open source project. We get a lot of feedback, and we are super thankful because

we learn so much about how people actually use databases

that you will not find in any textbook, you know, that you will not, you know the Oracle people would have never heard of this because the people that do that don't have 10, 000, 000 sitting around to pay them.

So that was also super interesting and also super humbling to see

how people use databases and then

how things go wrong,

you know, because they're holding it wrong, but obviously, not holding it wrong.

Your assumption of how they should hold it is wrong. So you have to basically make the handle, like, pink so they kind of see it and held it correctly. But this is really difficult if your interface is a SQL API. Right? I mean, how do you give people hints in SQL queries? Difficult.

So this was really interesting to see. And then every day, you know, as I said, every day we see

interesting things people do with it. And what I generally think was the most unexpected thing is that people cared about this thing in the first place. That's something I still

I still can scarcely believe that, you know, we made this thing.

And,

yes, as I mentioned, we did talk to some people and we thought, okay,

if they care about this, then probably other people care about this. But I am quite impressed about the adoption and this caused a big responsibility.

You know, whenever we make a release,

we have to actually be sure.

But that people cared really surprised

me a lot. So I would like to say thank you.

And for people who are interested in the capabilities of DuckDV and they want to be able to have this easy on ramp, quick access to performance SQL queries, what are the cases where it's the wrong choice? And maybe they are better suited with

running a 32 node Spark cluster or putting it into

a Snowflake or a Teradata

or using some other approach to managing these data analysis and manipulation?

Yeah. I think we've briefly had this. If you have many, many concurrent users on a dataset that is, you know, centrally managed, you probably want to use traditional

data warehouse.

When would you go for more the

Spark

sort of kind of way?

Yeah. You know? This is, of course, the big question is, like, how big should the data be? If you have petabytes of data sitting around, I would say,

probably, you don't wanna use dark DB on that yet.

But if you have the 32 node Spark cluster

I mean, actually, you should never use a 32 node Spark cluster.

You should use a 1, 000 node Spark cluster or don't. You know? I think there's no good reason

to have these mini mini clusters around because you're getting, like,

all the pain of

running clusters.

And, basically, as I've explained to you in this example that we did, all the computers is basically eaten up by the coordination that is designed

to deal with a 1, 000 node cluster. Right?

So why anybody would not run a 1, 000 node clusters beyond me? Does it make sense to you? I mean, why would you Yeah. Why would you have the small cluster? That makes perfect sense. And, yeah, there's definitely a lot of just kind of cargo culting of, oh, we use Spark because everybody uses Spark, and that's the thing that you use Spark for.

Yeah.

And then, you know, they're in Python, and then they make a computation with Spark, and then they say, but now I wanna use matplotlib.

And it's like, yeah. No.

Because you have to now forget

about it. Because

in order to pull this from Spark to your Python shell,

Spark, in its infinite wisdom, serializes this data using the Hive protocol, which

I've written a paper about. And the Hive protocol uses Thrift

to encode column data,

which is not a good idea probably, and which means that, actually, to pull something from your 1, 000 nodes bar cluster

that is not that crazy big, let's say, 2 gigabytes or something like that. You know? You pull that into your Python. Maybe you won't. Maybe you even wanna use learning on that. Who knows? You know? These days. In order to do that, you know, you're gonna spend

the vast majority of time that this query takes end to end because yeah. Okay. The 1, 000 nodes were done quickly running your extract or whatever it is on some parquet files. But then pulling the result into your Python shell is gonna take, you know,

hours.

But you could also say, hey, if I run this query in tech DB, it's already in the correct memory.

So I can just do whoop, and it's there. I think there was a tweet some time ago. It's like where somebody said, okay. Yeah. Time

to produce

query result in Postgres is 0.1 milliseconds,

time to construct SQLAlchemy

results at 2 seconds.

You know, I think that's more true than people want to believe Absolutely. How much performance we leave on the table.

I think we've touched on this a little bit, but in terms of the near to medium term future for DuckDV, what are some of the projects that you're focused on or new capabilities that you're looking to add or areas that you're excited to dig into? And given that it's open source, any areas of contribution that you're looking for help with? We do love contributions. We really do. And I think we've also tried to make a welcoming

community on our Discord and our GitHub.

I can maybe go through the strategic directions for the project itself so that people get an idea of what would be contributions we would value. Of course, we value all the contributions. There's issues that we have a lot of, like, open tickets on GitHub that some of them attack with good first issue. People want us to get started. That's

a great place to start. But in terms of the more general

directions that we wanna go into,

I think the 1 big 1 right now is the the out of core capability for all operators.

I think this is important.

We call this the galaxy quest principle,

never give up, never surrender, because

we wanna be able to complete queries

if at all possible. Right? And that sometimes means you have to go out of memory in the sense that

I mentioned this before. You know, you have these pipelines, and sometimes you have pipeline break operators that need to materialize some input, like a sort, needs to see all the input before it can sort. Okay?

And these operators,

they might need a lot of memory. If you build a join hash table on a table that has, you know, a terabyte, then this might take a terabyte of join hash table. That's not great. Probably don't have a terabyte of memory.

So we're constantly focusing a lot on

teaching every single 1 of our operators, and we have, like, 20 of them, how to gracefully

start using disk space instead of memory if possible,

if we run out of memory. And this is only so we can

make sure we can always complete queries. So this is a big focus, and, yeah, we're working on that. We've been looking at a bunch of operators already. Much to be done. It's also really fun to

think about that, how to do this in a dynamic way.

That's 1 thing. Another thing is that remember we talked about our storage format, the hashes of the blocks in order to

hashes of blocks sounds familiar, doesn't it?

The hass of blocks that protect against the bit flips on disk

or disk corruptions.

This is also something that the storage format is something we want to finalize this year if possible because it's kind of what keeping us from releasing a 1.0

because

we want to not make a backwards incompatible change

to the on disk storage format. Again,

after releasing the 1.0, it's our goal. We'll see whether we manage.

And right now, there's just too much going on in order to guarantee that. So we wanna

stabilize the storage format so we don't

break backwards compatibility

with existing database files anymore. Mind you, Postgres doesn't guarantee that. Right? This is something that normal database systems don't do. Right? In Postgres, I think they guarantee compatibility within the same major version, but that's it. What we try to do is something more akin to SQLite

where they say anything that has been created with any SQLite version bigger than 3 dot o is still readable and writable by current SQLite versions.

I think that's also a bit far, but we would at least wanna have the backwards compatibility

there going. So that's a big topic. We're also actively working on compressed storage.

So dactyv will automatically

analyze the

tables on disk during checkpointing

to see which compression method would be most applicable

per each

row group. So row groups are like subsets of tables that we store individually.

We run for every column, and we run this analyze phase during checkpointing to see which compression method is most applicable and then apply that method. We have built a framework. We've built a bunch of compression methods, but we have to do more there. So we wanna finalize that. We wanna add more compression methods. Also, maybe a good way to getting started, implement the compression method.

Because also they're pluggable. These are also user pluggable. It's pretty cool. If you know better than us how to compress your data,

sure. You know? Write your own compressor. Why not? We are generally also working on performance engineering, of course.

While DuckDee is already pretty fast, there's always

some edges where we need to improve. And for anyone interested in low level c plus plus hacking to make things go fast is also something that we enjoy.

And if you enjoy that kind of thing, maybe we should work together then. And these are, I think, some main directions.

And then we also have the whole testing area.

We wanna do more in terms of destructive testing as well in the sense of, you know, defending against misbehaving disks, misbehaving memory,

and then integrating all that into the testing framework so that

every time, you know, we commit, we will

run all our tests also and some simulated memory. That might also be fun for people to look into. So these kind of things. But I would say that's the main directions that we wanna go into.

Wanna go, like, continue for the main project.

And, obviously,

there are some more sort of domain specific things like people. Some people are interested to work on geographical

data type support for DuckDb.

We haven't done anything yet. Some people are working on making more graph,

like use cases work on DuckDb.

We'll see what comes out of there. But I'm really excited to see, like, a lot of, like, things springing up left and right.

And we also have worked really hard to make an extension API. So DuckDuty has this plug in support where you can change a lot of things about, you know, which types, which functions, and so on are available in the systems

simply by loading plug ins. Also, it's very similar to SQLite, which is where we copied it from. And that I think also would allow

for lots of these things to live on their own. You know? For example, imagine, like, DuckGiz or something like this similar to Postgres,

which is just a plug in, you know, that doesn't actually

have to change anything in the base system. I think that would be really interesting.

Are there any other aspects of the DuckDb project or the applications

of OLAP capabilities

in process in language ecosystems and the work that you're doing at duckdb Labs that we didn't discuss yet that you would like to cover before we close out the show? It's been a privilege to work with everyone

on DuckDB because, yeah, it's a big project. Let's say, a big undertaking.

And, yeah, I'm really, really thankful for everyone that has contributed and is contributing

because it would not be possible without them. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And

as my final question, I'd be interested to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. You know, it's very interesting because

I think the design of Tech TV was kind of motivated by 1 of these gaps. Right?

So, okay, there was no in process OLAP system yet, so we made 1.

And there are many of these things.

For example,

we can throw out some startup ideas. Right? For example, I think strings in data management systems,

like text,

is still 1 of these huge

unsolved things

where, yes, they work in principle,

but it's just painful.

It's bad performance.

It's, you know,

inefficient storage, the whole you know, we have done this experiment where we took a benchmark like TPCH, right,

and converted everything to strings and we run it on different database systems.

It's, you know, orders of magnitude later, you get your query result. That's not great. Somebody should work on that. I think, generally, the debuggability

of data pipelines is pretty bad.

Running query engines is something that

hasn't mattered traditionally.

And it also inductively, it's not something that I'm particularly proud of. But seeing what's going on while it's going on, I think that's also something

that so often we just, you know, we hit a button and then we have to kind of wait and pray that it will finish eventually. But we have no concept how long it will take. You know, is there progress? Who knows?

So I think there's a lot there. But I think similar to the gap that we found,

there are a lot of these

blatantly obvious,

let's say, you know,

nightmares

tooling nightmares that that are just waiting for somebody to tackle. I would say strings.

Alright. Well, thank you very much for taking the time today to join me and share your adventures with DuckDV. It's definitely a very interesting project

and, obviously, a very

important

and complex undertaking. So I appreciate the time that you and your colleagues have taken to bring this to the world and continue to help improve it and add new capabilities and functionality. So appreciate you taking the time today, and I hope you enjoy the rest of your day. Thanks so much for having me. I'm really thankful.

Listening. Don't forget to check out our other show, podcast.init@pythonpodcast.com

to learn about the Python language, its community, and the innovative ways it is being used.

And visit the site at dataengineeringpodcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts at data engineering podcast.com

with your story. And to help other people find the

the show, please leave a review on Itunes and tell your friends and coworkers.

Data Engineering Podcast

Summary

Announcements

Interview

Contact Info

Parting Question

Links