Summary
In this episode of the Data Engineering Podcast, Sam Kleinman talks about the pivotal role of databases in software engineering. Sam shares his journey into the world of data and discusses the complexities of database selection, highlighting the trade-offs between different database architectures and how these choices affect system design, query performance, and the need for ETL processes. He emphasizes the importance of understanding specific requirements to choose the right database engine and warns against over-engineering solutions that can lead to increased complexity. Sam also touches on the tendency of engineers to move logic to the application layer due to skepticism about database longevity and advises teams to leverage database capabilities instead. Finally, he identifies a significant gap in data management tooling: the lack of easy-to-use testing tools for database interactions, highlighting the need for better testing paradigms to ensure reliability and reduce bugs in data-driven applications.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- It’s 2024. Why are we still doing data migrations by hand? Teams spend months—sometimes years—manually converting queries and validating data, burning resources and crushing morale. Datafold's AI-powered Migration Agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source-to-target parity.
- Your host is Tobias Macey and today I'm interviewing Sam Kleinman about database tradeoffs across operating environments and axes of scale
Interview
- Introduction
- How did you get involved in the area of data management?
- The database engine you use has a substantial impact on how you architect your overall system. When starting a greenfield project, what do you see as the most important factor to consider when selecting a database?
- points of friction introduced by database capabilities
- embedded databases (e.g. SQLite, DuckDB, LanceDB), when to use and when do they become a bottleneck
- single-node database engines (e.g. Postgres, MySQL), when are they legitimately a problem
- distributed databases (e.g. CockroachDB, PlanetScale, MongoDB)
- polyglot storage vs. general-purpose/multimodal databases
- federated queries, benefits and limitations
- ease of integration vs. variability of performance and access control
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering podcast, the show about modern data management.
[00:00:16] Tobias Macey:
It's 2024. Why are we still doing data migrations by hand? Teams spend months, sometimes years, manually converting queries and validating data, burning resources and crushing morale. Datafold's AI powered migration agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source to target parity. Your host is Tobias Macey, and today I'm interviewing Sam Kleinman about database trade offs across operating environments and axes of scale. So, Sam, can you start by introducing yourself?
[00:01:09] Sam Kleinman:
Yeah. Sure. I am a software engineer with an eclectic background, I guess. I, you know, like, I I live in Brooklyn, and I've I I worked for a I spent a long time working on MongoDB in a couple of different roles and then have, like, spent some time working on other other engines, Neon and and and also a startup called ClearAD. And I think I think databases are or database engines are are sort of the I don't know, have always sort of been the, like, the biggest lever in the in sort of that you can that you can jump on to, like, impact the way that software engineers are able to be productive and build cool things. And so it's, it's always been a really exciting it's weird to say, oh, databases are exciting, but but they they they have always been an exciting place to to, like, do work and see and, like, have an impact on our industry. So that's that's sort of how I got here, belatedly.
[00:02:24] Tobias Macey:
And do you remember how you first got started working in data?
[00:02:28] Sam Kleinman:
Oh, yeah. No. I, like, I I have a a liberal arts undergraduate degree in social sciences, basically, and and interdisciplinary stuff that I, and I had a dumb job, like, in college, like, filing paper for the, fundraising department in, like, file cabinets, which I can't imagine anyone ever used. But, you know, it's a sort of economic endeavor, and I I would listen to podcasts, which is the thing. It's ironic here we are on a podcast, but I'm, like, not a I don't listen to a lot of podcasts since then, but I listened to something and, where they they interviewed some people who are working on me as who's sort of in the the last data tooling boom. And I, you know, listened to people talk about a lot of a lot of these databases that were, like, it was the sort of peak of NoSQL sort of or the heading towards the peak of NoSQL's moment. And I and I was like, oh, this data thing is, like, real big and gonna you know, like, it it sort of seems you know, and I had been interested in open source. And and I I had always been a nerd, but somehow as a as a younger young person, I was very resistant into in in, like, taking taking CS classes or whatever. And so I I never really studied it, but I definitely got I got interested in in learning about the technology and thought this that there's, like, a lot of cool things to be done, around infrastructure. And so I I think, like, I can't you know, it's sort of weird now if you say infrastructure, it means you're a sysadmin, and I'm not exactly a sysadmin.
But I think infrastructure in this, like, abstract sense is really cool because of what it enables and the sort of power the ways that infrastructure lets you focus on the important things of your problem. And databases mean that you're not spending lots and lots of time sort of figuring out how to absolutely, like, indelibly save data or make sure that you, like, recall like, there are these problems that databases just, like, solve for you. So as a programmer, you don't have to sort of wonder if you've, like, written the search algorithm correctly. Right? You just select whatever whatever you want.
And and I think that that that that there's, like, many levels of that. And I think we're actually, like I mean, we've had relational databases for, however long, you know, decades. And we are, like, not yet to the point where the, like, the abstraction between application code and and state persistence and and retrieval code or the database engine. Like, that boundary is not exactly where it should be, and it's, like, interesting to watch it sort of settle over time, like, over the, like, long time. But yeah. So all of those things, like, have have sort of you know? And then I you know? And then you get a job and you keep working at it. And I and I've and then your next job looks like your previous job, and suddenly you're, you know, suddenly suddenly you have a thing, somehow. I think that's a, like, an accident that many of us kind of stumble into.
[00:06:07] Tobias Macey:
And on the point of database engines, they are the system of record for whatever application or service or system you're building where database is a very broad term where it can be anything from I happen to have written a file to disk and then I am reading it back to some of these deconstructed database engines where you have the lakehouse architecture that's emerging where you have a query engine, you have some object storage, you're off live somewhere else, and everything in between across different modalities of data types and data formats, vectors have become increasingly popular as a data type because of the AI growth.
[00:06:53] Sam Kleinman:
Yeah. Well and and there's, like, there's, like indexing is sort of a classic. Like, I I kind of actually think of, like, vet there's a lot of vector search problems, which are, like, subclasses of indexing problems or sometimes they can be. There's also, like, database is an abstraction. I think I think that's the, like, the thing for for doing this, like, save access and retrieval piece. You said something that I thought was great, just like deconstructed database. I think the there's this sort of orthogonally to all of this, what is a database, right? Like, there's also these questions about how we build applications and that shape how you need to save data. So you have you know, there are very architectural decisions which influence the shape of the data that you have. There are deployment considerations which shape how you how you deploy things.
There's, like, patterns or, like, workload properties that constrain how you manage the data that you have. I I think the truly revolutionary thing that's, like, happened in the last 15 years in computing, I think, is that that, like, s 3 has become reliable in a way that's kind of alarming, cheap in a way that's also alarming, and and ubiquitous. And I think that and the cheapness makes it ubiquitous, but but that that means that we it used to be the databases were like, how do you attach enough disk to a machine that has enough RAM to do the thing that you need it to do? And and the, like, the size of the disk and the you know, like, that was the sort of thing that mattered. And now, like, you just put it on s 3. You can, like and, like, network having storage that is connected over Ethernet rather than, like, over, like, some sort of, you know, PCI bus or whatever, SCSI bus, is is like that that shift changes how all the rest of these systems work. And I think that and it lets you it lets us as database makers, engine makers, separate storage from compute and have them vary separately. And that's real like, that's really powerful.
And it gives rise to, like, some of these, like, serverless things that are that are cool. And I think, again, we haven't quite seen all of it. Like, there's gonna be, like s 3, as good as it is, is gonna get better along a number of dimensions. And and I think we will, we'll start to see even even cooler things there.
[00:09:43] Tobias Macey:
And I think on the S3 point, there have been a few developments over the past 5 or so years that have also increased its utility, most notably the read after write consistency that didn't exist for a while where you would make a write, and then depending on where you are reading from, it may or may not exist yet. Now they've got guaranteed read after write consistency. And most recently, they've added an optional update pattern where you can say only update this file if it is actually different than the one that I'm trying to upload, which can add a lot of efficiencies as well as far as IO for conditional updates or where you're not actually changing any of the data.
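The conditional-update idea Tobias describes can be approximated from the client side today. Below is a minimal sketch using boto3, with hypothetical bucket and key names: it compares a local MD5 against the stored object's ETag and skips the upload when nothing changed. The ETag-equals-MD5 shortcut only holds for simple (non-multipart, non-KMS) uploads, and S3's newer server-side conditional headers (If-None-Match / If-Match) are mentioned only in a comment because client support for them depends on SDK version.

```python
# A rough sketch of "don't rewrite data that hasn't changed" against S3.
# Bucket and key are placeholders. For server-side enforcement, newer S3
# conditional-write headers (If-None-Match / If-Match) can be used instead,
# where the SDK version supports them.
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def put_if_changed(bucket: str, key: str, body: bytes) -> bool:
    """Upload the object only when its content differs from what is stored."""
    local_md5 = hashlib.md5(body).hexdigest()
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
        if head["ETag"].strip('"') == local_md5:
            return False  # identical content: skip the PUT entirely
    except ClientError as err:
        # head_object surfaces a missing key as a 404; anything else is a real error
        if err.response["Error"]["Code"] not in ("404", "NoSuchKey", "NotFound"):
            raise
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return True
```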
[00:10:26] Sam Kleinman:
Yeah. Yeah. Yeah. No. It's I even think the, like, non the, like, in the inconsistent read after write version of s 3 is, like, is, like, so good in a lot of ways that that you or that I like I don't even think about all these other things as I'm like, yes. This is great, of course. But but I think there's they have also been adding, like, faster s 3 tiers, which are more expensive, but, like, gives you a level of consistency. And then you can do all these, like, crazy tiering. But but regardless of all of that, it means that you can, like, allocate your compute very elastically and and not have to worry about stateful, like, not have to worry about managing disks. Right? And and it's actually like when people are like, why do you need a new database now? Aren't the things that we have good enough? And the and, like, the things that like, the these, like, old industrial battle tested pieces of software are are really good.
And, like, it takes a long time to make a database kind of robust in that way just because, like, you have to bang on it from all sides. But the but the but they're all they all assume local storage in this way that, like, you can sort of rip out the bottom and put something new in, but it doesn't it, like the the transition from, like, colocated storage and compute to separated storage and compute is, like, a really huge change in in, like, in the software. And and sort of get getting us to that point, I think, is is really, is gonna be is, like, is an interesting, like, technological shift, that that, like, that all that people writing software today are just just beginning to sort of figure out how to do. And there are a lot of legacy tools which are sort of, like, sort of shoehorning themselves into being serverless in a way that is, like, absolutely not gonna, like, fly, from a, I mean, frankly, from an accounting perspective, let alone, like, technology.
But but I think it's it's this is another thing that's super interesting to watch.
[00:12:51] Tobias Macey:
And one of the points that you made early on is the impact on the system design that the database has, and I think that that's definitely a very real consideration for anybody who is trying to design or evolve any sort of service, whether it's a consumer facing transactional application or an analytical workload or machine learning system. And I'm wondering if you can give some of the main points of consideration that engineers and teams need to be thinking about as they're starting to determine what is their persistence mechanism for the system that they're building and how that shapes the capabilities of that overall system.
[00:13:40] Sam Kleinman:
Yeah. The it's sort of I I think that I I often kind of, like, think about backing into this in a different in a different way. Or, like, the the answer to the the way so we'll get I think we'll get there, but it won't feel like it for a little bit. So I think, like, the thing that you end up wanting to do is to think about the queries that you're running and think about things like, how does the data get updated? Is this aggregating data from a lot of different places, or is this data that we're writing and reading? Does data get modified after after it's created or not? There are some kinds of data where that's true or not.
There's also a a expense question. Like, how much money do you have to spend on on engine on the, like, on the database? And then and then there's this, like this sort of hints at, like, the the analytics question, but, like, are you returning records in some kind of what like, business object record of that is, like, persisted in a database, persisted in the database, and modified in in the application? Or are you returning aggregates or computed values that sort of summarize larger parts of the dataset? And and the thing or sort of so, like, once you answer those questions, once you have answers to to, like, what kinds of queries you're doing, which really kinds of depends on like, there's no wrong answer to any of these questions. There's just you you sort of, like, you have to do the thing where you write all of your business objects on index cards and lay them out in front of you and see how they interact with with, each other and and where the you know, what what's what needs to where things need to go. But you, that's a very old school way of thinking about this, I guess. But but you you once you sort of have that, like, what are the components, what are the objects in the system, and how do they interact? You can sort of and then you have to figure out how you need to save them and and or where where the objects exist in existing systems if they're, like, accessed over over APIs. And then so you you have so you have to, like once you have where all of the data is and have all of the the the, like, components and the objects, you you start to sort of think about how you're gonna use it in the in in whatever thing that you're building.
And there are lots of different tools which do, you know, will do sort of, like, one thing really well and a and a a few things less, like, fine, but not great. And the and the challenge is, like, always, like, picking the sort of the best tool for what you're doing and what you're planning to do based on how and where you're going. And I think that, like, the thing that that I the thing the sort of, like, sort of key things are, like, right throughput is sort of hard to like, that's like there are some I guess, like, maybe maybe actually the way to think about this is, like, what do you need to think about first? And or, like, what are the things which you can't compromise on? And so there is this, like like, write throughput is really like, if you pick a system that has slow writes, it's really hard to make it fast.
If you have a system which doesn't really support efficient updates, then you can't like, and your workload is very update heavy, then you really like, you're in trouble. And and and and then you sort of, like, go there are, like you know? So, like in some ways, like, insert throughput, update throughput, and then various kinds of, like, can you do recall at the level that you need? You can make there's, like there are ways of making every query fast. If you're if you're, like, if you're determined, you can make a database give you a query faster usually, But but there are it's hard to make it write faster if if you're sort of, like, teetering on the edge of its, like, tolerances.
But the but I think one of the sort of big, big one of these, like, big sort of top level sort of properties of engines is, like, is the data stored in a in a, like, row oriented format or a column oriented format? And I heard and I, you know, as a early career pre pre working on database engines, like, heard these people say columnar databases. And it took me a, like, really long time to sort of make sense of, like, how and why this was important. But, basically, the idea is that if you so row row oriented is very sort of reasonable, and you can think about it. You have records or rows.
Everyone sort of thinks about this now. It's like an Excel spreadsheet, but you know what a row is and all of the all of the columns, all of the fields, positions in the row are things that you, like, are related to each other. And then columns, right, are sort of all of the all of the different values for each of your records are stored sequentially. And so you will till you have, like, every you know, if you have an ID and and a, you know, like, a a name and a balance and a last updated time, you would store the names and then all of the balances and all of the last updated times. And when you do that, you can there are, like, ways to compress it more quickly, and there are ways of computing averages very quickly because you you can just sort of, like, shove, shove the the data through through the process or really, like, there's, like, very little copying that goes on. It's it's it's straightforward and and quite fast. And this is how analytics engines, like, make things fast is they just write the data out in a way that makes the query that they're gonna do quicker.
And so knowing how these things these things work is like, really lets you answer the questions of, like, what kind of engine do we need? Because we're gonna be computing averages, and so we need a columnar format. Or we're computing rows, and so we need or we're, like, recalling one row at a time or a collection of rows or a a filtering of a section of rows. If you say have, I mean, this is sort of obvious, but if you're just, like, returning records again and again, you like, columnar formats may be very quick for answering some kinds of queries, but they're quite slow at returning rows. That's obvious. Right? But, sounds dumb to say it. But but that, like, either either you store things as rows or or you store them as as columns are is sort of one of these, like, big choices in in databases. And it impacts things like write throughput and update performance if the data isn't if you're updating a whole row and they're in columns, you have to jump around a lot to get the update to work.
So that's that is, like, one of these, like, fundamental decisions that you make.
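To make the row-versus-column distinction Sam is drawing concrete, here is a toy sketch in Python with made-up data. Real engines add compression, vectorized execution, and indexes on top of this, but the layouts alone already show why point lookups favor rows while aggregates favor columns.

```python
# Row-oriented: each record's fields live together.
rows = [
    {"id": 1, "name": "alice", "balance": 120.0, "updated": "2024-01-03"},
    {"id": 2, "name": "bob",   "balance":  75.5, "updated": "2024-02-11"},
    {"id": 3, "name": "carol", "balance": 310.2, "updated": "2024-02-28"},
]

# Column-oriented: all values for one field live together.
columns = {
    "id":      [1, 2, 3],
    "name":    ["alice", "bob", "carol"],
    "balance": [120.0, 75.5, 310.2],
    "updated": ["2024-01-03", "2024-02-11", "2024-02-28"],
}

# Returning a whole record is natural in the row layout...
record = next(r for r in rows if r["id"] == 2)

# ...while an aggregate only has to scan one contiguous column in the
# columnar layout, instead of touching every field of every row.
avg_balance = sum(columns["balance"]) / len(columns["balance"])

print(record, avg_balance)
```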
[00:21:37] Tobias Macey:
That dichotomy of the row versus column oriented record storage is the fundamental divide that has introduced the practice of extract transform load or extract load transform whichever sequence you want to do it in because of the fact that if you have row oriented data, it's easy to insert and update, but it's slower to do aggregates. And if you're trying to do aggregates across your transactional database, it's going to negatively impact the application. So you move all the data to a different system to do all the analytics, which increases a lot of cost and complexity. You're duplicating the data, so you have to duplicate the storage cost.
And so that's definitely one of the fundamental aspects of how you think about the system: are you primarily doing analytics, or are you primarily doing single record events or updates, or what do you need to be able to retrieve?
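As a minimal sketch of the extract-transform-load pattern Tobias is describing, assuming a hypothetical orders table in a transactional SQLite database and using pandas (writing Parquet requires pyarrow or fastparquet to be installed):

```python
import sqlite3

import pandas as pd

# Extract: read from the row-oriented transactional store.
with sqlite3.connect("app.db") as conn:
    orders = pd.read_sql_query(
        "SELECT order_id, customer_id, amount, created_at FROM orders", conn
    )

# Transform: reshape into whatever the analytics side needs.
daily = (
    orders.assign(day=pd.to_datetime(orders["created_at"]).dt.date)
          .groupby("day", as_index=False)["amount"].sum()
)

# Load: land a columnar copy that an analytical engine can scan cheaply,
# without competing with the application for the transactional database.
daily.to_parquet("daily_revenue.parquet", index=False)
```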
[00:22:39] Sam Kleinman:
Yeah. I and I but and I think that's like a I I kind of think of this like the ETL probe problem is I sort of I I kind of feel like the ETL is kind of this, like, problem of our own creation. The the there is this to to sort of that that thing that you it it I mean, this is this is like, this is this is a sort of interesting thing. But we have there is this sort of to like, there is this, like, historical myth that you can, like, have one database, and you'll just, like it will be huge, and all of your code will access the same database. And and, you know, and this is, like, a sort of this is the business model of, like, general purpose database vendors is that you're just gonna, like, you're just gonna make one giant machine or a couple of giant machines, and, you know, it's gonna run DB 2 or Oracle or one of these, like like, behemoths, and and it's gonna solve all of your problems with data. And and this is, like, basically always been a myth for the for literally this, like, this thing that you just said where if you run analytics queries against your transactional workload, then the workloads conflict with each other, and they make each other slow. Right? And the mechanism for this is, like, quite simple. You have you have more data than RAM. The data isn't in you know, the data working set for your instance at at any given time isn't in memory.
But if you're really smart or clever, you can keep the working set for a given workload in memory, and you'd sort of optimize the workload in the application so that for the most part, you're not modify you're not going to disk as much as you can. And transactional systems use, like, indexes for this. Right? And indexes are kind of like real time ETLs of of your of your system, in a lot of ways. There are these there are things like, materialized views, which are, again, the same kind of thing. And they, like, put the data in a different form and keep it up to date sort of sequentially, or, like like, strictly, like, in a consistent sort of way. But but but those those also sort of solve this, like, how like, ways that we can kind of keep the data sort of positioned in a way that makes the queries that you're gonna run stay in memory, not have to sort of bounce bounce to random IO on on disks a lot. And then and then you're sort of good, but if you have more than what if you have too much of this going on, then the workloads thrash a limited set of memory. And then and then you're into this, like, okay. We have to ETL and copy it over. You can also like, there are you can also imagine having having diff like, using read replicas to satisfy some of that some of those workloads.
You can imagine having, you know, like, having ways of persisting data that don't involve updates very much. Right? There are these, like, event driven systems, and so you will, like, materialize out the row data, you know, on some interval. And maybe that's, like, enough. Like, you don't like, having this, like, having the objects be sort of strictly read from memory and, like, material like, in memory as, like, a row and record is, like, maybe not always the thing that that we we need. And so there is like, you can write applications that do both, or you can have some of the like, you can use materializations or or other sort of tools within a system along with replication and and all of this to get kind of, to get the the the sort of query performance that you need for a given workload. Right? Because everyone some things like, you're always doing both.
Right? There's not a no. Look. This is the application where we save numbers, but we never average them. Right? Like, no. I'm like, okay. What are you doing? Like, you know, that's a that's not a that's not a thing. Stop it. Right? There are, you know and and and you say this other there is this other thing of this, like, assumption that that, like, ETL ing is is, like, all like, has this this, like, flaw of of, like, your data's stale and you're saving another copy. But if you're doing like, if you're on some kind of event driven, like, message passing, like, persistent like, using persistence via like, with messages and records and events, then and not, like, objects that you're you're updating in place, then you which, like, that's a thing we do. Like, that's sort of, like, pretty good way of doing things in in a lot of cases.
Then the thing that you're ETL ing is archives, which you need to do anyway. And so so, like and and it's not even it's not even an interesting ETL at that point. Right? You're just, like like, streaming data out to a file. Right? Like, it's and it'll so so so all of this all of this to say, like, there are a lot of tools. There are a lot of these, like, patterns that we use. And I think that we tend to sort of default to using databases with in these, like, very sort of specific ways with, like, all of these, like, very you know, like, I'm I'm, like, kind of, like, picking apart all of your assumptions and all of your, like, the all of these things that you're saying. But you're also sort of saying exactly, like, the thing that is is sort of like, the the the common way that we think and talk about about these, like, solving these data persistence problems.
Like, of course, we use ETL or ELT or TL I don't know. Whatever. It doesn't like, the the the this is, you know, like, processing your data and writing it, you know, like, that's it's called a query. Like, you know, it's somewhat like like like like, ETL should, like, be just like, I I've I've always sort of you know, like, every so often, I, like, stumble across an ETL flow. I'm like, this is you're just running a query in a CronJob. Like, I I get it. I like it. I like CronJobs a lot, but but I think that I think that we we often, you know, and we often sort of fall into, like, leading. And this is this is this is a, like, common engineering.
Engineers do this a lot, but you sort of start with the tools and work backwards to the business problem. And I think they'll, like, actually, you know, they're it take you know, like, you need the sort of grumpy, like, architect or principal engineer type to sort of say, like, okay. Like, let's back up and, like, go go in the other direction and sort of say, like like, ask why a few more times. Like, why are why are we doing this, and what what is the thing that we're trying to accomplish? Not, you know, like, do we need to archive? Do we need to get this this one query really fast or this other query? Or we need write throughput or or whatever the whatever the issue is. So, yeah, I I think it's I think it's I think it's these are, like, really important questions, and it and it impacts, like, how we how we what tool like like, if you you you get to the same like, you still build a thing that works at the end of the day.
But the the sort of assumptions that you need a thing of a certain shape is is you know, leads leads us into some weird contortions, I think, actually.
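Sam's observation that a lot of ETL is "just running a query in a cron job" can be made literal with a small sketch. The table and column names are hypothetical; the script rebuilds a summary table inside the same SQLite database, which is roughly what a materialized-view refresh does in engines that support them natively, and it would be scheduled with something like `*/15 * * * * python refresh_summary.py`.

```python
import sqlite3

# Requires SQLite 3.24+ for the ON CONFLICT upsert syntax.
REFRESH_SQL = """
CREATE TABLE IF NOT EXISTS daily_revenue (day TEXT PRIMARY KEY, total REAL);

INSERT INTO daily_revenue (day, total)
SELECT date(created_at) AS day, SUM(amount) AS total
FROM orders
GROUP BY date(created_at)
ON CONFLICT(day) DO UPDATE SET total = excluded.total;
"""

def refresh(db_path: str = "app.db") -> None:
    # A poor man's materialized-view refresh: recompute the aggregate and upsert it.
    with sqlite3.connect(db_path) as conn:
        conn.executescript(REFRESH_SQL)

if __name__ == "__main__":
    refresh()
```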
[00:31:00] Tobias Macey:
That's definitely true. And I think another interesting angle to it is starting from that business case of I need to solve this problem. So given the requirements constraints of the problem space, I will then choose this, this, and this solution is that businesses and the context in which they're operating are constantly evolving. So what is the right solution for a business at a certain point in time will not necessarily be the right solution for all time, And so you get into those situations of iterative development, iterative architecture evolution. And in a lot of cases, the database engine that you selected because you thought that it was going to solve all the problems that you had, and maybe it did for a time, then becomes a liability because the assumptions of the business are no longer true.
So you need to be able to evolve to accommodate the new realities of what you're trying to solve for. And I'm wondering, what are some of the useful ways for teams to think about, 1, future proofing, recognizing that that is never the reality? You can never actually future proof, but planning for evolution of the architecture and some of the ways that teams, when they do hit that point of realization that the engine that they have selected is no longer viable and is now a liability, how can they then make that transition to a new system as painless as possible?
[00:32:41] Sam Kleinman:
Yeah. I mean, I think you sort of you, like, you you got it in the, like, you can't future proof anything, point. So, now I have to think of something cleverer to say yeah. So I think that I think that what you I mean, in some ways, like, don't try is sort of the like, the answer is, like, you know, I sort of I obliquely referenced this the, like, the, like, x y problem of sort of jumping to a solution too soon. And I think that, like, the the sort of the thing that you can do to sort of be the one who says time out time out, folks. Let's let's, like, let's ask a few more questions is to all to think that, like, okay. I'm an engineer. I should I should always ask one more question and not or, like, I should always sort of state my assumptions.
And and often, like, that is the that is the tool that you use to future proof what you're doing. I think the so that's, like, that's that's a nod in the you know, that's, like, too too sort of hand wavy because people will, like, you know, like, engineers will almost always think that they're doing that and not quite do it. So, like, asking asking more questions and, like, starting, like, I assume starting sort of, like, every ticket with every every Jira issue, every document that you write, every meeting with, like, these are the things that I'm assuming. And then and then you sort of do do design work and architecture thinking, like, based on that requirements gathering based on on those assumptions. The the sort of not the the sort of the the second so that's like that's a whole that's like a whole kind of we could spend there's like a whole podcast series that we could have on that question. There's the the sort of on the other side of of it is this, like, realization that engineers have this tendency and desire to over engineer things and to know that sort of this, like, planning for future scaling is is is always this, like, over engineering, premature optimization kind of path and and to sort of, in some ways, just, like, spend less time thinking about it because you're always going to have to do it again in the future, and you should wait for it to be a problem before you start optimizing it or or fixing that kind of thing. And I and that's that's, like, okay. That's, like, really easy to say.
But I think, like, having having an architecture that makes sense for the team that you have that, like, keeps, like, the assumptions, like, well well declared, uses standard tools that that do scale in some way. So, right, like, if you there there is this, like, example. Right? If you have if you're, like, writing all of your data manipulation using pandas or polars, you have basically, you have your queries written in Python. And if you then need to take your 4 gigabyte dataset or your 32 gigabyte data set or what you know, or or, you know, small and terabytes of data that you that have been running okay with pandas, and then suddenly, like like, multiply it by a 100, you have to redo it. Like like, there is just like that's not you know, there's no way around that. You have to redo it. Fine.
That's fine. If it were SQL, you may have to redo it too, but there's a chance that, like, the SQL will, like, translate better across scales because SQL's declarative, and it sort of says, this is the thing I want. You, mister database, missus database, go forth and give it to me, and I don't care what you do on the inside. And you can sort of load the data into whatever tool you need to and see what happens in terms of performance. And so so that that, like, that sort of using using tools in that way, like, maybe is, like, the right answer, which isn't to say that, like, you shouldn't use pandas, because, like, a lot of times you're never gonna do that. And and and there are you know, and sometimes the redoing it is less less intense in one way or another.
Or you or or you have to, like, do some kind of or you have to there's a group by very early in your execution, and you just there's no way to to have that query execute in parallel anyway. And so your scaling options are are limited. Right? So the so it's, like, very kind of use case specific about, like, which which tools you use to kind of to to handle handle that, which is why, like, all the things we said before about about knowing what kinds of queries you're gonna be doing and and how, like, how that relates to where the data is is absolutely crucial. Going and then you, you know, you sort of hit this scaling wall or you watch the graph kind of get close, and then you and then you sort of say, okay. Now we have to, like we we deferred doing this earlier, but now we have the SQLite database, which is it's 2 terabytes.
And, you know, we need we really need to, like, have it not be, like, on the application server because it makes deploys take too long. Fine. Right. You know, like, there there I like, a year ago, I had had coffee with a former coworker who who told you know, described the architecture of a component of the of their of the system they were working on now. I was like, oh, yeah. It's just like it's like an EC 2 instance with with a really big SQLite database. And, like, it's fine. The you know? And and I was like and I was like, but really? And and, you know, if you think about it, like, sometimes, you know, having a having the app and the database in the same place, like, you there was always gonna be one one thing anyway, and there are backups no matter what it is. And so you you end you end up with these situations where, like, yeah. Actually, like, people probably should use SQLite for way more than they're currently doing with it. And and then having and then having have there are ways to, like, architect that in a really sort of sane way.
And then, you know, and then there are sort of you can kind of take that in in lots of different directions and having, you know, having ways of testing it locally. Like like the the oh, this is just a SQLite database or whatever. It's actually, like, gets you, like, really far, you know, in a distressingly really far way. But if the, like, thing if but but, you know, like and then maybe that's fine. And, like, you can later, you know, move it to Postgres or MySQL or whatever you want. And and that, like, that gets you that gets you a lot. And but but and some of this is, like, what tools are you comfortable with? And and some tie and there are definitely database tools, right, like, that do scale up pretty well along any number of dimensions.
Though though it is really it is also true. So, so, like, the the databases at scale, like, there is this there are these, like, class of of databases which have sharding built into the database, Cockroach, MongoDB. I mean, like, obviously, like, I know things about the MongoDB world pretty pretty well. But there are there are, Couch Couchbase does this. There are there are, like, Cassandra's and other sort of native, like, sharding inside of the database tools, and they and they're really like, that's, like, next level stuff. It's also absolutely true that in practice, if you're gonna use sharding, eventually, you sort of have to, like, know that you're potentially gonna be at that scale and, like, spend an hour making sure that you haven't sort of tied yourself to some kind of railroad track eventually. Right? Like, there is this, like, you kind of need to, like, plan how your data is is, like, laid out in order to to make that work. So it it is the the answer, you know, the answer is sort of like don't overengineer anything.
Upgrade the scale when you need to. Use be use the abstractions to your advantage, and also sort of gut check. Like, it's not possible for this dataset to get larger than 2 terabytes. Okay. Like or this is a dataset where we're, like, modifying things all of the time, but it's small. That leads to a different set of a different set of challenges. This is an append only data dataset. Great. You know, there are different different different like, you if you start at a certain place, you, like, get to get to an end more more or less quickly. And and I think scaling is always or scaling is a thing that application developers who know things about who, like, once you know things about how databases work, application developers are really the people in the best position to sort of in coordination with the product managers that one hopes they have to sort of make these make these kinds of decisions, when building when building apps. But it's but you but it's really sort of a thing that, like, works its way all the way up and down all the way up and down the stack.
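Sam's point about SQL being declarative can be shown with the same aggregate asked two ways; the data and names here are made up. The pandas version is tied to a DataFrame in this process's memory, while the SQL string does not care whether it runs against SQLite today or a larger engine later.

```python
import sqlite3

import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [10.0, 25.0, 5.0, 40.0, 12.5, 7.5],
})

# In-process: fine until the data no longer fits on one machine.
by_customer_pd = orders.groupby("customer_id", as_index=False)["amount"].sum()

# Declarative: the engine decides how to execute the same question.
with sqlite3.connect(":memory:") as conn:
    orders.to_sql("orders", conn, index=False)
    by_customer_sql = pd.read_sql_query(
        "SELECT customer_id, SUM(amount) AS amount FROM orders GROUP BY customer_id",
        conn,
    )

print(by_customer_pd)
print(by_customer_sql)
```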
[00:43:25] Tobias Macey:
Another interesting outgrowth of the complexity of selection of a database engine can have is that the engineers who are building on top of that engine become skeptical of the longevity of that system, and so they will refrain from actually taking full advantage of the capabilities of the database and instead move more of that functionality and logic into the application layer. So the application becomes more complex and heavier weight and harder to maintain, and the database becomes effectively just dumb storage. And so I think that that is definitely a a risk that exists as well. And I'm wondering how you have seen teams try to manage that balance or some of the ways that you think about how and when to actually invest in the capabilities of the database engine to improve or enhance some functionality or manage some of the actual application logic versus just being this dumb storage system?
[00:44:28] Sam Kleinman:
Yeah. I mean, I think it's I mean, the things like, I think this is, like, a class of, like, over know, like, oh, we're gonna put an abstraction layer in front of x thing because we might change it later. You're never gonna do it. Right? Like, sorry. Like like, that's not a that's not a thing. Like, no one so there there are a couple of, like, fallacies, I think, in in in the way that people think about databases and data selection. The first is that someone that there's ever a day when anyone in the world ever wakes up and is like, you know what? Today, I need to decide if we're gonna use MongoDB or Postgres, or we're gonna use MySQL or Postgres. Right? That decision has always already been made by someone else. Right? You know, unless you are, like, literally, like, starting out, like, that's we just don't make that decision at a at a low level. And the second thing is we don't change it. You know, there are occasionally, there are people who will do there are systems where you do need to migrate from one to another for some reason, But but that's incredibly rare, and you should and sort of at the point when you need to do that, you have users and money.
And so so it doesn't so so you can afford to do it. Right? If you don't have users and money, it doesn't matter because your data whatever it is, it all fits in RAM and you like like like, you're not making money anyway, so who's gonna pay you to, like, change databases? Like, that's not that's not what we're doing. Unless it's for fun, in which case, you know, go wild. But but we don't we don't have that these kinds of, like, migrations happen very often. So these abstractions like, you should have good tooling around your database that, like, lets you do it. But that that lets you sort of, like, handle like, have some abstraction between the database and and and your code. Right? You, like, you shouldn't be, like, I don't know, like, writing your own database driver inside of your inside you know, in in your model's package or whatever. But but but but you also you don't need to sort of build your own abstraction on top of on top of your database and do this this thing where you put the the all of your business logic in your app that could easily be pushed into the database. It would be faster and cheaper if you did it that way. Again, like, pushing things into the database is like writing SQL. And, you know, if you take a SQL statement and you're like, okay. Now we have to take this SQLite and make it run on Postgres 1. That that way should be pretty easy. And 2, like, it's it's you know, if what you have to do is go through and find a bunch of functions that don't exist or or and write them a different way, like, that's that's a much less onerous kind of thing. The I mean, even the real sort of the thing that I think like, you you sort of have this you the question you posed was, like, do you, like, take advantage of the database?
I think the, like, the risk that people like, say, in some ways, like, it's not just like, taking advantage of the database is, like, a relatively safe thing because people don't change. The thing that, like, I think is risky and people are less afraid of doing for the wrong reasons are are the are systems that have user defined functions, which means that you've sort of pushed some logic custom logic into the database, which has to sort of run and it's difficult to version it correctly. Right? That's kind of terrifying and hard to migrate. There's also, and and various things like triggers and stored procedures and all that, like, have the same same property. There's also if you're using an extension, you get, you know, extend so for systems like Postgres where you just, like, hot load code into into the database, it becomes very version dependent. And using that functionality is, like, actually sort of that. There is some like, I I I get that gives me the willies.
But, but I think, like, I think that I think that all mostly, like, teams I think mostly teams should should just not be as worried about it and and think of think of the these, like, adding extra indirection between the database and the application is really, like, is really a is really sort of a premature optimization. It's also true having said all of that, it's also true that sometimes you do actually wanna write application code. And this was, like, one of the things that, like, the early NoSQL or, like, early mid NoSQL era really kind of, like, leaned into is that you just you know, sometimes the clearest way to express something was in fact, like, writing a 4 loop. And you you you can't do it, like, all of the time, and it's not in a universal rule. But but sometimes that's what you want. Like, you just want a for loop. And having some application code, like, there are times when when you you when it's fine.
But but, I mean, this is like, we're we're sort of the all of my answers to this question are, like, you know, classic engineer. It depends. And and also, like, have a have a really good understanding of the problems that you're trying to solve and the size of data at any given point in a pipeline. And and try and also sort of, like, try and avoid spending your time solving problems of your own creation. Right? Like, there are a lot of you know, like, I sort of I think that, like, teams aren't gonna do this for lots of reasons, but, like, in a lot of cases, ETLing is a sort of problem that we've created because we haven't sort of extended engines in the right ways, where we haven't provided, like, interoperability in one way or another. And and, like, this is a lot of what Better d's working on is this, like, how do we provide the same interface, but also not acknowledge the the, like, reality of database this monolith of give all your money to IBM, get d b 2, put all your data in there, go to sleep happy is is just not true. And there you know? Or any any it's very rare that you find an organization that has only 1 database or only 1 database type. And and I think being being pragmatic with that is is really, really sort of crucial, a really crucial way of of, like, avoiding falling into some of these traps.
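A small contrast for the "push it into the database" point Sam makes: both functions below return the same number, but the second ships far less data out of the engine and lets it use an index on status if one exists. The invoices schema is hypothetical.

```python
import sqlite3

def overdue_total_in_app(conn: sqlite3.Connection) -> float:
    # The "for loop in the application" version: pull every row, filter in Python.
    total = 0.0
    for amount, status in conn.execute("SELECT amount, status FROM invoices"):
        if status == "overdue":
            total += amount
    return total

def overdue_total_in_db(conn: sqlite3.Connection) -> float:
    # The same logic pushed down as a predicate plus an aggregate.
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM invoices WHERE status = ?",
        ("overdue",),
    ).fetchone()
    return row[0]
```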
[00:51:28] Tobias Macey:
I think that's probably a pretty good place to start to close out the conversation. So is there anything that you think is useful as a closing argument or anything that you wanna say to wrap up that we didn't already touch on?
[00:51:42] Sam Kleinman:
I think we got through we certainly got through most of it. And there are definitely, like, a lot of, like, side quests in this conversation, which I think bear a little bit more attention at some point. I I mean, I think they'll, like, the part the parting parting shot, parting wisdom is really, like, ask more questions and acknowledge your ask more questions, acknowledge your assumptions, avoid premature optimization. You like, be aware and be aware sort of pragmatically of, like, what what's going on, what's possible, where you're we're almost always data we have app there's we spend a lot of time talking about applications, but there's, like, on top of there's, like, core application developers and then people working with data generated by applications on top of it. And which are themselves writing writing application applications of a different sort. But that all of these all of these projects really there's there's, like, this, like, pragmatic, like, this this is we don't have control of it. Most of our almost every line of code like, people very rarely just, like, open a text editor and say, today, I'm gonna write a data driven application. Right? That's another this is, like, another one of these, like, things that doesn't happen.
Not you know, don't wake up today and decide which database to use. Don't wake up today with a with a blank application and no data. Right? There's we're building on top of things. And so being letting making it possible and and and acknowledging that we're we're gonna be pragmatic. We have to be pragmatic. This is this is, you know, like, we're not we're building applications and tools and software to solve, hopefully, as much as possible other people's problems. And and, you know, and I think, like, make it so that we can make decisions about businesses, about the world, about our organizations, about whatever the thing is based on data and not vibes.
And if we screw up with you know, like, if a query is too slow, people are just gonna make decisions based on vibes. Right? People will always fall back to the, like, vibes vibes driven, decision making. And, you know, people do okay with their vibes, but we have all these tools. We have all of this data, like, being being sort of able to and flexible with with building building out infrastructure and tools around it is the thing that lets us make better decisions, more clear decisions about about whatever it is that we're doing. And I think that's, like, a really that's the thing that it's, like, important to to keep in mind and and is I know it had and has been true for a really long time at at, you know, the scales of 10 years ago are not the scales of today and aren't gonna be the scales of of, you know, 10 or 20 years from now. And I think that's all of that's okay.
And and our jobs are to sort of make sure that we don't get get we don't look back in 10 years and say, oh, we just we just sort of, like, did that all on vibes. It was you know? And I think that's I think that's sort of the the important part.
[00:55:41] Tobias Macey:
Yeah. And I think, too, making sure that you actually understand what the database that you're selecting is capable of and is optimized for before you decide to invest in it.
[00:55:49] Sam Kleinman:
Yeah. Absolutely. Like, know your tools. Like like, I you know, I you know, as a as a programmer, I, like, have a collection of absolutely absurd facts in my brain about, like, what various run times do. Like, I you know, like, Python programmers sort of, like, have an intuitive sense of where the GIL is. You know? Like, the all there are all sorts of all sorts of cases where just, like, knowing your tools is really good and really, like, gives you a lot of of power and databases are kind of the same. Like, it's a it's a runtime. It's a it's a it I mean, they're they all use the same mostly use the same programming language, but it's all, like, it's all it's all one of those. And so knowing knowing how all of that works makes you a a better consumer.
And I think that's that's useful.
[00:56:49] Tobias Macey:
Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:08] Sam Kleinman:
This is gonna sound like it's out of left field, but I think easy to use testing tools are really the, like is the thing that just, like, always feels like it's missing. You you sort of write all of your application code. You can unit test a lot of stuff if you're, like, functional enough. But then you you start talking to a database and you're like, well, okay. Here we go. I hope I I hope I got that right. And and there's, like, there's nothing. There's like like as great as everything is, there's just, like, absolutely nothing that exists where you're like, oh, yeah. I wanna run a I wanna run something that feels like a unit test with you know, I wanna make some fake data, and it just, like, doesn't exist, and there's no way to do it. And and I and so, you know, that's where all the bugs are all the time. Like, you know, you spend if you if you have an app that someone's using and all every every outage, every user complain is always something that isn't covered by testing. And we have these, like, great tools in other areas of our code for for doing testing, and and, like, there's some paradigm or syntax or something that just, like, doesn't exist today.
And and I think, like, that that that's it.
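One way teams approximate the testing story Sam describes as missing is an in-memory SQLite fixture seeded with a little fake data and exercised like a unit test. The schema, the data, and the function under test here are all hypothetical, and the pattern only covers behaviors SQLite shares with the production engine.

```python
import sqlite3

import pytest

def count_active_users(conn: sqlite3.Connection) -> int:
    # The database interaction under test.
    return conn.execute("SELECT COUNT(*) FROM users WHERE active = 1").fetchone()[0]

@pytest.fixture
def conn():
    # Fresh in-memory database per test, seeded with known fake data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, active INTEGER)")
    conn.executemany(
        "INSERT INTO users (name, active) VALUES (?, ?)",
        [("alice", 1), ("bob", 0), ("carol", 1)],
    )
    yield conn
    conn.close()

def test_count_active_users(conn):
    assert count_active_users(conn) == 2
```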
[00:58:39] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts on this overall space of selection and optimization around the database engine and how you're applying it to the problems you're trying to solve. It is a perennial problem and definitely subject to the overarching it depends answer. So I appreciate you taking the time to explore that with me, and I hope you enjoy the rest of your day.
[00:59:09] Sam Kleinman:
Yeah. You too. Thank you so much.
[00:59:16] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering podcast, the show about modern data management.
[00:00:16] Sam Kleinman:
It's 2024. Why are we still doing data migrations by hand? Teams spend months, sometimes years, manually converting queries and validating data, burning resources and crushing morale. DataFold's AI powered migration agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit data engineering podcast.com/datafold today to learn how DataFold can automate your migration and ensure source to target parity. Your host is Tobias Macy, and today I'm interviewing Sam Kleinman about database trade offs across operating environments and axes of scale. So, Sam, can you start by introducing yourself? Yeah. Sure. I
[00:01:09] Tobias Macey:
am a software engineer with a eclectic background, I guess. I, you know, like, I I live in Brooklyn, and I've I I worked for a I spent a long time working on MongoDB in a couple of different roles and then have, like, spent some time working on other other engines, Neon and and and also a startup called ClearAD. And I think I think databases are or database engines are are sort of the I don't know, have always sort of been the, like, the biggest lever in the in sort of that you can that you can jump on to, like, impact the way that software engineers are able to be productive and build cool things. And so it's, it's always been a really exciting it's weird to say, oh, databases are exciting, but but they they they have always been an exciting place to to, like, do work and see and, like, have an impact on our industry. So that's that's sort of how I got here, belatedly.
[00:02:24] Tobias Macey:
And do you remember how you first got started working in data?
[00:02:28] Sam Kleinman:
Oh, yeah. I have a liberal arts undergraduate degree in the social sciences, basically, interdisciplinary stuff. And I had a dumb job in college filing paper for the fundraising department, in file cabinets which I can't imagine anyone ever used. But it was a sort of economic endeavor, and I would listen to podcasts, which is ironic, here we are on a podcast, though I don't listen to a lot of podcasts these days. And I listened to something where they interviewed people who were working during the last data tooling boom, talking about a lot of these databases. It was the peak of NoSQL's moment, or heading towards the peak. And I was like, oh, this data thing is real big. I had been interested in open source, and I had always been a nerd, but somehow as a young person I was very resistant to taking CS classes or whatever, so I never really studied it. But I definitely got interested in learning about the technology and thought there were a lot of cool things to be done around infrastructure. It's sort of weird now, if you say infrastructure it means you're a sysadmin, and I'm not exactly a sysadmin.
But I think infrastructure in this abstract sense is really cool because of what it enables, the ways that infrastructure lets you focus on the important parts of your problem. Databases mean that you're not spending lots and lots of time figuring out how to indelibly save data, or make sure that you can recall it. There are these problems that databases just solve for you. So as a programmer, you don't have to wonder if you've written the search algorithm correctly. Right? You just select whatever you want.
And there are many levels of that. I mean, we've had relational databases for, what, decades now, and we are not yet to the point where the abstraction between application code and the state persistence and retrieval code, the database engine, sits exactly where it should be. That boundary is interesting to watch settle over the long term. And then you get a job and you keep working at it, and your next job looks like your previous job, and suddenly you have a thing, somehow. I think that's an accident that many of us kind of stumble into.
[00:06:07] Tobias Macey:
And on the point of database engines, they are the system of record for whatever application or service or system you're building, where database is a very broad term. It can be anything from I happen to have written a file to disk and am reading it back, to some of these deconstructed database engines like the emerging lakehouse architecture, where you have a query engine, you have some object storage that lives off somewhere else, and everything in between, across different modalities of data types and data formats. Vectors, for example, have become increasingly popular as a data type because of the AI growth.
[00:06:53] Sam Kleinman:
Yeah. Well, indexing is sort of a classic. I actually think of a lot of vector search problems as subclasses of indexing problems, or at least they sometimes can be. There's also the idea that a database is an abstraction, the thing for doing this save, access, and retrieval piece. You said something that I thought was great, the deconstructed database. And orthogonally to all of this, to what a database is, there are these questions about how we build applications, and that shapes how you need to save data. There are architectural decisions which influence the shape of the data that you have, and there are deployment considerations which shape how you deploy things.
There are also patterns, or workload properties, that constrain how you manage the data that you have. The truly revolutionary thing that's happened in the last 15 years in computing, I think, is that S3 has become reliable in a way that's kind of alarming, cheap in a way that's also alarming, and ubiquitous, and the cheapness makes it ubiquitous. It used to be that databases were about how you attach enough disk to a machine that has enough RAM to do the thing that you need it to do, and the size of the disk was the sort of thing that mattered. Now you just put it on S3. Having storage that is connected over Ethernet rather than over some sort of PCI or SCSI bus changes how all the rest of these systems work. It lets us as database engine makers separate storage from compute and have them scale separately. And that's really powerful.
And it gives rise to some of these serverless things that are cool. Again, we haven't quite seen all of it. S3, as good as it is, is going to get better along a number of dimensions, and I think we'll start to see even cooler things there.
[00:09:43] Tobias Macey:
And I think on the S3 point, there have been a few developments over the past five or so years that have also increased its utility, most notably the read-after-write consistency that didn't exist for a while, where you would make a write and then, depending on where you were reading from, it may or may not exist yet. Now they've got guaranteed read-after-write consistency. And most recently, they've added an optional conditional update pattern, where you can say only update this object if it is actually different from the one that I'm trying to upload, which can add a lot of efficiency in terms of IO for conditional updates, or for cases where you're not actually changing any of the data.
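As a rough sketch of the skip-unchanged-uploads idea, here is an illustrative Python snippet using boto3. The bucket, key, and file names are hypothetical, and this uses a plain ETag comparison rather than S3's newer conditional-write headers:

```python
# Sketch: only upload an object if its content differs from what is already in S3.
# Assumes a non-multipart upload, where the S3 ETag is the hex MD5 of the object body.
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET, KEY = "example-bucket", "reports/daily.csv"  # hypothetical names

def upload_if_changed(path: str) -> bool:
    body = open(path, "rb").read()
    local_etag = hashlib.md5(body).hexdigest()
    try:
        remote_etag = s3.head_object(Bucket=BUCKET, Key=KEY)["ETag"].strip('"')
    except ClientError:
        remote_etag = None  # object does not exist yet (or we cannot read it)
    if remote_etag == local_etag:
        return False  # identical content already stored; skip the write
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=body)
    return True
```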
[00:10:26] Sam Kleinman:
Yeah. Even the inconsistent, read-after-write-less version of S3 was so good in a lot of ways that I don't even think about all these other things; I'm like, yes, this is great, of course. But they have also been adding faster S3 tiers, which are more expensive but give you a level of consistency, and then you can do all this crazy tiering. Regardless of all of that, it means that you can allocate your compute very elastically and not have to worry about managing stateful disks. And when people ask, why do you need a new database now, aren't the things that we have good enough? These old, industrial, battle-tested pieces of software are really good.
And it takes a long time to make a database robust in that way, just because you have to bang on it from all sides. But they all assume local storage, in a way where you can sort of rip out the bottom and put something new in, but the transition from colocated storage and compute to separated storage and compute is a really huge change in the software. Getting to that point is an interesting technological shift that people writing software today are just beginning to figure out how to do. And there are a lot of legacy tools which are shoehorning themselves into being serverless in a way that is absolutely not going to fly, frankly, from an accounting perspective, let alone the technology.
But I think this is another thing that's super interesting to watch.
[00:12:51] Tobias Macey:
And one of the points that you made early on is the impact that the database has on the system design, and I think that's a very real consideration for anybody who is trying to design or evolve any sort of service, whether it's a consumer-facing transactional application, an analytical workload, or a machine learning system. And I'm wondering if you can give some of the main points of consideration that engineers and teams need to be thinking about as they're starting to determine what the persistence mechanism is for the system that they're building, and how that shapes the capabilities of that overall system.
[00:13:40] Sam Kleinman:
Yeah. I often think about backing into this in a different way. We'll get there, but it won't feel like it for a little bit. The thing that you end up wanting to do is to think about the queries that you're running: how does the data get updated? Is this aggregating data from a lot of different places, or is this data that we're writing and then reading back? Does data get modified after it's created or not? There are some kinds of data where that's true and some where it isn't.
There's also an expense question: how much money do you have to spend on the database? And then there's the analytics question: are you returning records, some kind of business object that is persisted in the database and modified in the application? Or are you returning aggregates or computed values that summarize larger parts of the dataset? There's no wrong answer to any of these questions. You sort of have to do the thing where you write all of your business objects on index cards, lay them out in front of you, and see how they interact with each other and where things need to go. That's a very old-school way of thinking about this, I guess. But once you have that, what are the components, what are the objects in the system, and how do they interact, you figure out how you need to save them, or where the objects live in existing systems if they're accessed over APIs. And then, once you know where all of the data is and have all of the components and objects, you start to think about how you're going to use it in whatever thing you're building.
And there are lots of different tools which will do one thing really well and a few other things fine, but not great. The challenge is always picking the best tool for what you're doing and what you're planning to do, based on where you're going. Maybe the way to think about this is, what do you need to think about first, or what are the things you can't compromise on? Write throughput is one of them: if you pick a system that has slow writes, it's really hard to make it fast.
If you have a system which doesn't really support efficient updates and your workload is very update heavy, then you're in trouble. So, in some ways, it's insert throughput, update throughput, and then various kinds of, can you do recall at the level that you need? There are ways of making every query fast; if you're determined, you can usually make a database answer a query faster. But it's hard to make it write faster if you're teetering on the edge of its tolerances.
I think one of the big, top-level properties of engines is whether the data is stored in a row-oriented format or a column-oriented format. Early in my career, before working on database engines, I heard people say columnar databases, and it took me a really long time to make sense of how and why this was important. But basically, the idea is this: row oriented is very reasonable, and you can think about it. You have records, or rows.
Everyone sort of thinks about this now; it's like an Excel spreadsheet. You know what a row is, and all of the columns, all of the fields, the positions in the row, are things that are related to each other. In a columnar layout, all of the different values for each of your records are stored together. So if you have an ID and a name and a balance and a last-updated time, you would store all of the names, and then all of the balances, and then all of the last-updated times. When you do that, there are ways to compress the data more effectively, and there are ways of computing averages very quickly, because you can just shove the data through the processor; there's very little copying that goes on. It's straightforward and quite fast. And this is how analytics engines make things fast: they write the data out in a way that makes the query they're going to run quicker.
Knowing how these things work really lets you answer the question of what kind of engine you need. We're going to be computing averages, so we need a columnar format. Or we're recalling one row at a time, or a collection of rows, or a filtered subset of rows. This is sort of obvious, but if you're just returning records again and again, columnar formats may be very quick for answering some kinds of queries, but they're quite slow at returning rows. That sounds dumb to say, but whether you store things as rows or as columns is one of these big choices in databases. And it impacts things like write throughput and update performance: if you're updating a whole row and the data is stored in columns, you have to jump around a lot to get the update to work.
So that is one of these fundamental decisions that you make.
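To make the row-versus-column idea concrete, here is a minimal, illustrative Python sketch; the records and field names are made up:

```python
# The same three records laid out two ways.
records = [
    {"id": 1, "name": "alice", "balance": 120.0, "updated": "2024-01-02"},
    {"id": 2, "name": "bob",   "balance": 75.5,  "updated": "2024-01-03"},
    {"id": 3, "name": "cara",  "balance": 310.2, "updated": "2024-01-05"},
]

# Row-oriented: each record's fields are stored together.
# Fetching one whole record is cheap; scanning one field across all records touches everything.
row_store = records

# Column-oriented: all values for each field are stored together.
# Aggregating one field is a tight scan over contiguous values; rebuilding a full row
# means touching every column.
column_store = {field: [rec[field] for rec in records] for field in records[0]}

# An analytics-style query (average balance) only needs one column in the columnar layout.
avg_balance = sum(column_store["balance"]) / len(column_store["balance"])

# A transactional-style query (give me record 2) is a single lookup in the row layout,
# but requires stitching values back together from every column in the columnar layout.
row_2 = next(rec for rec in row_store if rec["id"] == 2)
print(avg_balance, row_2)
```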
[00:21:37] Tobias Macey:
That dichotomy of row- versus column-oriented storage is the fundamental divide that introduced the practice of extract-transform-load, or extract-load-transform, whichever sequence you want to do it in, because if you have row-oriented data it's easy to insert and update, but it's slower to do aggregates. And if you're trying to do aggregates across your transactional database, it's going to negatively impact the application. So you move all the data to a different system to do the analytics, which adds a lot of cost and complexity; you're duplicating the data, so you have to duplicate the storage cost.
And so that's definitely one of the fundamental aspects of how you think about the system: are you primarily doing analytics, are you primarily doing single-record events or updates, or what do you need to be able to retrieve?
[00:22:39] Sam Kleinman:
Yeah, but I kind of feel like the ETL problem is a problem of our own creation. This is a sort of interesting thing. There is this historical myth that you can have one database, it will just be huge, and all of your code will access the same database. This is the business model of general-purpose database vendors: you're going to make one giant machine, or a couple of giant machines, it's going to run DB2 or Oracle or one of these behemoths, and it's going to solve all of your problems with data. And this has basically always been a myth, for literally the reason you just said: if you run analytics queries against your transactional workload, the workloads conflict with each other and make each other slow. Right? And the mechanism for this is quite simple. You have more data than RAM; the working set for your instance at any given time isn't in memory.
But if you're really clever, you can keep the working set for a given workload in memory, and you optimize the workload in the application so that, for the most part, you're not going to disk more than you have to. Transactional systems use indexes for this, and indexes are kind of like real-time ETLs of your system, in a lot of ways. There are things like materialized views, which are again the same kind of thing: they put the data in a different form and keep it up to date in a strictly consistent sort of way. Those also solve the problem of keeping the data positioned so that the queries you're going to run stay in memory and don't have to bounce to random IO on disk a lot. And then you're sort of good. But if you have too much of this going on, the workloads thrash a limited amount of memory, and then you're into, okay, we have to ETL and copy it over. You can also imagine using read replicas to satisfy some of those workloads.
You can also imagine ways of persisting data that don't involve updates very much. There are these event-driven systems, where you materialize out the row data on some interval, and maybe that's enough. Having the objects be strictly read from memory, materialized as a row and record, is maybe not always the thing we need. So you can write applications that do both, or you can use materializations or other tools within a system, along with replication, to get the query performance you need for a given workload. Because you're always doing both.
Right? There's no, look, this is the application where we save numbers but we never average them. No. Okay, what are you doing? That's not a thing. Stop it. And there is this other assumption, that ETLing has this flaw of your data being stale and you saving another copy. But if you're on some kind of event-driven, message-passing system, using persistence via messages and records and events, and not objects that you're updating in place, which is a thing we do, and is a pretty good way of doing things in a lot of cases...
...then the thing that you're ETLing is archives, which you need to produce anyway. And it's not even an interesting ETL at that point. You're just streaming data out to a file. So all of this is to say, there are a lot of tools and a lot of patterns that we use, and I think we tend to default to using databases in these very specific ways. I'm kind of picking apart all of your assumptions and all of these things that you're saying, but you're also saying exactly the common way that we think and talk about solving these data persistence problems.
Of course we use ETL or ELT or whatever; it doesn't matter. Processing your data and writing it out, that's called a query, as in the sketch below. Every so often I stumble across an ETL flow and I'm like, this is just running a query in a cron job. I get it. I like cron jobs a lot. But I think we often fall into leading with the tools. And this is common engineering behavior.
Engineers do this a lot: you start with the tools and work backwards to the business problem. You need the sort of grumpy architect or principal engineer type to say, okay, let's back up, go in the other direction, and ask why a few more times. Why are we doing this, and what is the thing that we're trying to accomplish? Do we need to archive? Do we need to get this one query really fast, or this other query? Do we need write throughput, or whatever the issue is? So, yeah, I think these are really important questions, and they impact what tools we use. You still build a thing that works at the end of the day.
But the assumption that you need a thing of a certain shape leads us into some weird contortions, I think, actually.
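As a minimal sketch of the "an ETL flow is often just a query on a schedule" idea, here is illustrative Python over SQLite; the orders table, its columns, and the summary table are all made up, and a real materialized view or warehouse load could stand in for the summary table here:

```python
# Sketch: periodically materialize an aggregate out of a transactional store.
# Run this from cron (or any scheduler); it is "ETL" in the sense of a query plus a copy.
import sqlite3

def refresh_daily_summary(db_path: str = "app.db") -> None:
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(
            """
            CREATE TABLE IF NOT EXISTS daily_order_summary (
                day     TEXT PRIMARY KEY,
                orders  INTEGER NOT NULL,
                revenue REAL NOT NULL
            );
            """
        )
        # Recompute the summary from the (hypothetical) orders table.
        conn.execute("DELETE FROM daily_order_summary;")
        conn.execute(
            """
            INSERT INTO daily_order_summary (day, orders, revenue)
            SELECT date(created_at), COUNT(*), SUM(amount)
            FROM orders
            GROUP BY date(created_at);
            """
        )
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    refresh_daily_summary()
```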
[00:31:00] Tobias Macey:
That's definitely true. And I think another interesting angle is that, even when you start from the business case of I need to solve this problem, so given the requirements and constraints of the problem space I will choose this, this, and this solution, businesses and the context in which they're operating are constantly evolving. What is the right solution for a business at a certain point in time will not necessarily be the right solution for all time. And so you get into those situations of iterative development and iterative architecture evolution. In a lot of cases, the database engine that you selected because you thought it was going to solve all the problems that you had, and maybe it did for a time, then becomes a liability because the assumptions of the business are no longer true.
So you need to be able to evolve to accommodate the new realities of what you're trying to solve for. And I'm wondering what some useful ways are for teams to think about, one, future-proofing, recognizing that that is never the reality, you can never actually future-proof, but planning for evolution of the architecture; and two, when teams do hit that point of realization that the engine they selected is no longer viable and is now a liability, how they can make the transition to a new system as painless as possible.
[00:32:41] Sam Kleinman:
Yeah. I mean, you got it with the you-can't-future-proof-anything point, so now I have to think of something cleverer to say. In some ways, "don't try" is sort of the answer. I obliquely referenced this earlier, the X-Y problem of jumping to a solution too soon. And the thing that you can do, to be the one who says time out, folks, let's ask a few more questions, is to think: okay, I'm an engineer, I should always ask one more question, or I should always state my assumptions.
Often that is the tool you use to future-proof what you're doing. That's too hand-wavy, though, because engineers will almost always think they're doing that and not quite do it. So ask more questions, and start every Jira issue, every document that you write, every meeting with: these are the things that I'm assuming. Then you do design work and architecture thinking based on that requirements gathering, based on those assumptions. There's a whole podcast series we could have on that question. On the other side of it is the realization that engineers have this tendency and desire to over-engineer things, and that planning for future scaling is always this over-engineering, premature-optimization kind of path. In some ways, just spend less time thinking about it, because you're always going to have to do it again in the future, and you should wait for it to be a problem before you start optimizing or fixing that kind of thing. And that's really easy to say.
But I think it helps to have an architecture that makes sense for the team you have, keeps the assumptions well declared, and uses standard tools that do scale in some way. There's this example: if you're writing all of your data manipulation using pandas or polars, you basically have your queries written in Python. And if you then need to take your 4 gigabyte dataset, or your 32 gigabyte dataset, or the small number of terabytes of data that has been running okay with pandas, and suddenly multiply it by a hundred, you have to redo it. There's no way around that. You have to redo it. Fine.
That's fine. If it were SQL, you might have to redo it too, but there's a chance the SQL will translate better across scales, because SQL is declarative. It says, this is the thing I want; you, mister database, missus database, go forth and give it to me, and I don't care what you do on the inside. You can load the data into whatever tool you need and see what happens in terms of performance. So using tools in that way maybe is the right answer, which isn't to say you shouldn't use pandas, because a lot of the time you're never going to hit that scale. And sometimes the redoing is less intense in one way or another.
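To make the declarative-versus-imperative point concrete, here is an illustrative sketch of the same aggregation written both ways; the orders.csv file, the orders table, and the column names are hypothetical:

```python
# The same "revenue per customer" question, expressed two ways.
import sqlite3

import pandas as pd

# pandas: the steps are spelled out in Python and run in one machine's memory.
df = pd.read_csv("orders.csv")
revenue_pd = df.groupby("customer_id")["amount"].sum().reset_index()

# SQL: the statement describes the result, and the engine decides how to execute it.
# The same text can later run against SQLite, Postgres, or a warehouse with more headroom.
conn = sqlite3.connect("app.db")
revenue_sql = pd.read_sql_query(
    "SELECT customer_id, SUM(amount) AS revenue FROM orders GROUP BY customer_id;",
    conn,
)
conn.close()
```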
Or there's a group-by very early in your execution, there's just no way to have that query execute in parallel anyway, and so your scaling options are limited. So it's very use-case specific which tools you use to handle that, which is why everything we said before, about knowing what kinds of queries you're going to be doing and how that relates to where the data is, is absolutely crucial. And then you hit a scaling wall, or you watch the graph get close, and you say, okay, we deferred doing this earlier, but now we have the SQLite database and it's 2 terabytes.
And we really need it to not be on the application server, because it makes deploys take too long. Fine. A year ago, I had coffee with a former coworker who described the architecture of a component of the system they were working on, and it was basically an EC2 instance with a really big SQLite database. And it's fine. I was like, but really? And if you think about it, sometimes having the app and the database in the same place is okay; there was always going to be one thing anyway, and there are backups no matter what it is. So you end up with these situations where, yeah, actually, people probably should use SQLite for way more than they're currently doing with it. And there are ways to architect that in a really sane way.
Then you can take that in lots of different directions, including having ways of testing it locally. The "oh, this is just a SQLite database or whatever" approach actually gets you really far, distressingly far. And maybe that's fine, and you can later move it to Postgres or MySQL or whatever you want. That gets you a lot. Some of this is about which tools you're comfortable with, and there are definitely database tools that do scale up pretty well along any number of dimensions.
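As a minimal sketch of the "SQLite next to the application" pattern, here is illustrative Python; the file name and schema are made up, and WAL mode is one common way to let readers and a single writer coexist:

```python
# Sketch: an application-local SQLite database that can later be swapped for Postgres/MySQL.
import sqlite3

def get_conn(path: str = "app.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL;")    # readers don't block the single writer
    conn.execute("PRAGMA synchronous=NORMAL;")  # common durability/throughput trade-off
    return conn

def init_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS events (
            id         INTEGER PRIMARY KEY,
            kind       TEXT NOT NULL,
            payload    TEXT NOT NULL,
            created_at TEXT NOT NULL DEFAULT (datetime('now'))
        );
        """
    )
    conn.commit()

if __name__ == "__main__":
    conn = get_conn()
    init_schema(conn)
    conn.execute("INSERT INTO events (kind, payload) VALUES (?, ?)", ("signup", "{}"))
    conn.commit()
```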
Though it is also true that there is this class of databases which have sharding built into the engine: Cockroach, MongoDB, and obviously I know the MongoDB world pretty well, and Couchbase does this, and Cassandra, and other tools with native sharding inside the database. That's next-level stuff. It's also absolutely true that in practice, if you're going to use sharding, you eventually have to know that you're potentially going to be at that scale, and spend an hour making sure you haven't tied yourself to some kind of railroad track. You kind of need to plan how your data is laid out in order to make that work. So the answer is sort of: don't over-engineer anything.
Upgrade the scale when you need to. Use the abstractions to your advantage, and also gut check: it's not possible for this dataset to get larger than 2 terabytes, okay. Or, this is a dataset where we're modifying things all of the time, but it's small; that leads to a different set of challenges. This is an append-only dataset, great. If you start at a certain place, you get to an end more or less quickly. And scaling is a thing where application developers, once they know how databases work, are really the people in the best position, in coordination with the product managers one hopes they have, to make these kinds of decisions when building apps. But it's really something that works its way all the way up and down the stack.
[00:43:25] Tobias Macey:
Another interesting outgrowth of the complexity that database engine selection can have is that the engineers who are building on top of that engine become skeptical of the longevity of that system, and so they refrain from actually taking full advantage of the capabilities of the database and instead move more of that functionality and logic into the application layer. So the application becomes more complex, heavier weight, and harder to maintain, and the database becomes effectively just dumb storage. And I think that's definitely a risk that exists as well. I'm wondering how you have seen teams try to manage that balance, or some of the ways that you think about how and when to actually invest in the capabilities of the database engine to improve or enhance some functionality or manage some of the actual application logic, versus just treating it as a dumb storage system.
[00:44:28] Sam Kleinman:
Yeah. I think this is a class of over-engineering: oh, we're going to put an abstraction layer in front of X thing because we might change it later. You're never going to do it. Right? Sorry, that's not a thing. There are a couple of fallacies, I think, in the way people think about databases and database selection. The first is that there's ever a day when anyone in the world wakes up and says, you know what, today I need to decide if we're going to use MongoDB or Postgres, or MySQL or Postgres. That decision has always already been made by someone else. Unless you are literally just starting out, we don't make that decision at a low level. And the second thing is, we don't change it. Occasionally there are systems where you do need to migrate from one to another for some reason, but that's incredibly rare, and at the point when you need to do that, you have users and money.
And so you can afford to do it. If you don't have users and money, it doesn't matter, because your data, whatever it is, all fits in RAM, and you're not making money anyway, so who's going to pay you to change databases? That's not what we're doing, unless it's for fun, in which case, go wild. So these kinds of migrations don't happen very often. As for abstractions: you should have good tooling around your database that gives you some separation between the database and your code. You shouldn't be writing your own database driver inside your models package or whatever. But you also don't need to build your own abstraction on top of your database and do this thing where you put all of your business logic in your app when it could easily be pushed into the database. It would be faster and cheaper if you did it that way. Pushing things into the database is, again, like writing SQL. And if you have to take a SQL statement from SQLite and make it run on Postgres: one, that should be pretty easy, and two, if what you have to do is go through and find a bunch of functions that don't exist and write them a different way, that's a much less onerous kind of thing.
The question you posed was, do you take advantage of the database? Taking advantage of the database is a relatively safe thing, because people don't change databases. The thing that I think is risky, and that people are less afraid of for the wrong reasons, is systems that have user-defined functions, which means you've pushed some custom logic into the database, which has to run there and is difficult to version correctly. That's kind of terrifying and hard to migrate. Various things like triggers and stored procedures have the same property. There are also extensions: for systems like Postgres where you hot-load code into the database, it becomes very version dependent. Using that functionality, I'll admit, gives me the willies.
But I think that mostly teams should not be as worried about it, and should think of adding extra indirection between the database and the application as really a premature optimization. Having said all of that, it's also true that sometimes you do actually want to write application code. This was one of the things that the early and mid NoSQL era really leaned into: sometimes the clearest way to express something was, in fact, writing a for loop. You can't do it all of the time, and it's not a universal rule, but sometimes that's what you want. You just want a for loop, and having some application code is fine at times.
But all of my answers to this question are classic engineer: it depends. Have a really good understanding of the problems you're trying to solve and the size of the data at any given point in a pipeline, and try to avoid spending your time solving problems of your own creation. In a lot of cases, ETLing is a problem that we've created because we haven't extended engines in the right ways, or we haven't provided interoperability in one way or another. And this is a lot of what Better d's is working on: how do we provide the same interface while acknowledging that the reality of the database monolith, give all your money to IBM, get DB2, put all your data in there, go to sleep happy, is just not true. It's very rare to find an organization that has only one database or only one database type. And being pragmatic about that is a really crucial way of avoiding falling into some of these traps.
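As an illustrative sketch of pushing work into the database rather than looping in the application, here is a small Python example over SQLite; the accounts table and its columns are hypothetical:

```python
# Two ways to answer "what is the total balance?" against a (hypothetical) accounts table.
import sqlite3

conn = sqlite3.connect("app.db")

# Application-side loop: pulls every row back and aggregates in Python.
total_in_app = 0.0
for (balance,) in conn.execute("SELECT balance FROM accounts"):
    total_in_app += balance

# Pushed into the database: one statement, one row back, and the engine can use
# whatever plan or index it has available.
(total_in_db,) = conn.execute("SELECT SUM(balance) FROM accounts").fetchone()

conn.close()
print(total_in_app, total_in_db)
```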
[00:51:28] Tobias Macey:
I think that's probably a pretty good place to start to close out the conversation. So is there anything that you think is useful as a closing argument or anything that you wanna say to wrap up that we didn't already touch on?
[00:51:42] Sam Kleinman:
I think we certainly got through most of it, and there are definitely a lot of side quests in this conversation which bear a little more attention at some point. The parting shot, the parting wisdom, is really: ask more questions, acknowledge your assumptions, and avoid premature optimization. Be aware, pragmatically, of what's going on and what's possible. We spend a lot of time talking about applications, but there are core application developers, and then there are people working with the data generated by those applications on top of them, who are themselves writing applications of a different sort. All of these projects have to be pragmatic; we don't have full control of them. People very rarely just open a text editor and say, today I'm going to write a data-driven application. That's another one of these things that doesn't happen.
You don't wake up today and decide which database to use. You don't wake up today with a blank application and no data. We're building on top of things. So we have to acknowledge that we're going to be pragmatic. We're building applications and tools and software to solve, hopefully, as much as possible, other people's problems. And we should make it so that we can make decisions about businesses, about the world, about our organizations, about whatever the thing is, based on data and not vibes.
And if we screw up, if a query is too slow, people are just going to make decisions based on vibes. People will always fall back to vibes-driven decision making, and people do okay with their vibes, but we have all these tools and all of this data. Being flexible with building out infrastructure and tooling around it is the thing that lets us make better, clearer decisions about whatever it is that we're doing. I think that's important to keep in mind, and it has been true for a really long time. The scales of 10 years ago are not the scales of today and aren't going to be the scales of 10 or 20 years from now. And all of that's okay.
Our jobs are to make sure that we don't look back in 10 years and say, oh, we just sort of did all of that on vibes. I think that's the important part.
[00:55:41] Tobias Macey:
Yeah. And I think, too, making sure that you actually understand what the database that you're selecting is capable of and optimized for before you decide to invest in it.
[00:55:49] Sam Kleinman:
Yeah, absolutely. Know your tools. As a programmer, I have a collection of absolutely absurd facts in my brain about what various runtimes do. Python programmers sort of have an intuitive sense of where the GIL is. There are all sorts of cases where just knowing your tools is really good and gives you a lot of power, and databases are kind of the same. A database is a runtime. I mean, they mostly all use the same programming language, but it's one of those. And knowing how all of that works makes you a better consumer.
And I think that's useful.
[00:56:49] Tobias Macey:
Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:08] Sam Kleinman:
This is going to sound like it's out of left field, but I think easy-to-use testing tools are the thing that always feels like it's missing. You write all of your application code, and you can unit test a lot of it if you're functional enough. But then you start talking to a database and you're like, well, okay, here we go, I hope I got that right. And there's nothing. As great as everything is, there's just absolutely nothing that exists where you can say, I want to run something that feels like a unit test, I want to make some fake data, and it just doesn't exist; there's no way to do it. And so that's where all the bugs are, all the time. If you have an app that someone's using, every outage, every user complaint is always something that isn't covered by testing. We have these great tools in other areas of our code for doing testing, and there's some paradigm or syntax or something here that just doesn't exist today.
And I think that's it.
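One common workaround today, for what it's worth, is to fake the database inside a unit-test fixture. Here is a minimal sketch with pytest and an in-memory SQLite database; the schema, seed rows, and query under test are all made up:

```python
# Sketch: a unit-style test against a throwaway, in-memory database with fake data.
import sqlite3

import pytest

@pytest.fixture
def db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5)],  # fake data
    )
    conn.commit()
    yield conn
    conn.close()

def revenue_for_customer(conn, customer_id: int) -> float:
    # The query under test; in a real app this would live in the data-access layer.
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return total

def test_revenue_for_customer(db):
    assert revenue_for_customer(db, 1) == 15.0
    assert revenue_for_customer(db, 99) == 0
```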
[00:58:39] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts on this overall space of selecting and optimizing around the database engine and how you're applying it to the problems you're trying to solve. It is a perennial problem and definitely subject to the overarching it-depends answer. So I appreciate you taking the time to explore that with me, and I hope you enjoy the rest of your day.
[00:59:09] Sam Kleinman:
Yeah, you too. Thank you so much.
[00:59:16] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
The Role of Database Engines in System Architecture
Considerations for Choosing a Database
Future-Proofing and Evolving Database Architecture
Balancing Application Logic and Database Capabilities
Pragmatic Approaches to Database Selection and Use