Summary
In this episode of the Data Engineering Podcast, Sam Kleinman talks about the pivotal role of databases in software engineering. Sam shares his journey into the world of data and discusses the complexities of database selection, highlighting the trade-offs between different database architectures and how these choices affect system design, query performance, and the need for ETL processes. He emphasizes the importance of understanding specific requirements to choose the right database engine and warns against over-engineering solutions that can lead to increased complexity. Sam also touches on the tendency of engineers to move logic to the application layer due to skepticism about database longevity and advises teams to leverage database capabilities instead. Finally, he identifies a significant gap in data management tooling: the lack of easy-to-use testing tools for database interactions, highlighting the need for better testing paradigms to ensure reliability and reduce bugs in data-driven applications.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- It’s 2024. Why are we still doing data migrations by hand? Teams spend months—sometimes years—manually converting queries and validating data, burning resources and crushing morale. Datafold's AI-powered Migration Agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source-to-target parity.
- Your host is Tobias Macey and today I'm interviewing Sam Kleinman about database tradeoffs across operating environments and axes of scale
Interview
- Introduction
- How did you get involved in the area of data management?
- The database engine you use has a substantial impact on how you architect your overall system. When starting a greenfield project, what do you see as the most important factor to consider when selecting a database?
- points of friction introduced by database capabilities
- embedded databases (e.g. SQLite, DuckDB, LanceDB), when to use and when do they become a bottleneck
- single-node database engines (e.g. Postgres, MySQL), when are they legitimately a problem
- distributed databases (e.g. CockroachDB, PlanetScale, MongoDB)
- polyglot storage vs. general-purpose/multimodal databases
- federated queries, benefits and limitations
- ease of integration vs. variability of performance and access control
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering podcast, the show about modern data management.
[00:00:16] Tobias Macey:
It's 2024. Why are we still doing data migrations by hand? Teams spend months, sometimes years, manually converting queries and validating data, burning resources and crushing morale. Datafold's AI powered migration agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source to target parity. Your host is Tobias Macey, and today I'm interviewing Sam Kleinman about database trade offs across operating environments and axes of scale. So, Sam, can you start by introducing yourself?
[00:01:09] Sam Kleinman:
Yeah. Sure. I am a software engineer with an eclectic background, I guess. I, you know, like, I I live in Brooklyn, and I've I I worked for a I spent a long time working on MongoDB in a couple of different roles and then have, like, spent some time working on other other engines, Neon and and and also a startup called ClearAD. And I think I think databases are or database engines are are sort of the I don't know, have always sort of been the, like, the biggest lever in the in sort of that you can that you can jump on to, like, impact the way that software engineers are able to be productive and build cool things. And so it's, it's always been a really exciting it's weird to say, oh, databases are exciting, but but they they they have always been an exciting place to to, like, do work and see and, like, have an impact on our industry. So that's that's sort of how I got here, belatedly.
[00:02:24] Tobias Macey:
And do you remember how you first got started working in data?
[00:02:28] Sam Kleinman:
Oh, yeah. No. I, like, I I have a a liberal arts undergraduate degree in social sciences, basically, and and interdisciplinary stuff that I, and I had a dumb job, like, in college, like, filing paper for the, fundraising department in, like, file cabinets, which I can't imagine anyone ever used. But, you know, it's a sort of economic endeavor, and I I would listen to podcasts, which is the thing. It's ironic here we are on a podcast, but I'm, like, not a I don't listen to a lot of podcasts since then, but I listened to something and, where they they interviewed some people who are working on me as who's sort of in the the last data tooling boom. And I, you know, listened to people talk about a lot of a lot of these databases that were, like, it was the sort of peak of NoSQL sort of or the heading towards the peak of NoSQL's moment. And I and I was like, oh, this data thing is, like, real big and gonna you know, like, it it sort of seems you know, and I had been interested in open source. And and I I had always been a nerd, but somehow as a as a younger young person, I was very resistant into in in, like, taking taking CS classes or whatever. And so I I never really studied it, but I definitely got I got interested in in learning about the technology and thought this that there's, like, a lot of cool things to be done, around infrastructure. And so I I think, like, I can't you know, it's sort of weird now if you say infrastructure, it means you're a sysadmin, and I'm not exactly a sysadmin.
But I think infrastructure in this, like, abstract sense is really cool because of what it enables and the sort of power the ways that infrastructure lets you focus on the important things of your problem. And databases mean that you're not spending lots and lots of time sort of figuring out how to absolutely, like, indelibly save data or make sure that you, like, recall like, there are these problems that databases just, like, solve for you. So as a programmer, you don't have to sort of wonder if you've, like, written the search algorithm correctly. Right? You just select whatever whatever you want.
And and I think that that that that there's, like, many levels of that. And I think we're actually, like I mean, we've had relational databases for, however long, you know, decades. And we are, like, not yet to the point where the, like, the abstraction between application code and and state persistence and and retrieval code or the database engine. Like, that boundary is not exactly where it should be, and it's, like, interesting to watch it sort of settle over time, like, over the, like, long time. But yeah. So all of those things, like, have have sort of you know? And then I you know? And then you get a job and you keep working at it. And I and I've and then your next job looks like your previous job, and suddenly you're, you know, suddenly suddenly you have a thing, somehow. I think that's a, like, an accident that many of us kind of stumble into.
[00:06:07] Tobias Macey:
And on the point of database engines, they are the system of record for whatever application or service or system you're building where database is a very broad term where it can be anything from I happen to have written a file to disk and then I am reading it back to some of these deconstructed database engines where you have the lakehouse architecture that's emerging where you have a query engine, you have some object storage, you're off live somewhere else, and everything in between across different modalities of data types and data formats, vectors have become increasingly popular as a data type because of the AI growth.
[00:06:53] Sam Kleinman:
Yeah. Well and and there's, like, there's, like indexing is sort of a classic. Like, I I kind of actually think of, like, vet there's a lot of vector search problems, which are, like, subclasses of indexing problems or sometimes they can be. There's also, like, database is an abstraction. I think I think that's the, like, the thing for for doing this, like, save access and retrieval piece. You said something that I thought was great, just like deconstructed database. I think the there's this sort of orthogonally to all of this, what is a database, right? Like, there's also these questions about how we build applications and that shape how you need to save data. So you have you know, there are very architectural decisions which influence the shape of the data that you have. There are deployment considerations which shape how you how you deploy things.
There's, like, patterns or, like, workload properties that constrain how you manage the data that you have. I I think the truly revolutionary thing that's, like, happened in the last 15 years in computing, I think, is that that, like, s 3 has become reliable in a way that's kind of alarming, cheap in a way that's also alarming, and and ubiquitous. And I think that and the cheapness makes it ubiquitous, but but that that means that we it used to be the databases were like, how do you attach enough disk to a machine that has enough RAM to do the thing that you need it to do? And and the, like, the size of the disk and the you know, like, that was the sort of thing that mattered. And now, like, you just put it on s 3. You can, like and, like, network having storage that is connected over Ethernet rather than, like, over, like, some sort of, you know, PCI bus or whatever, SCSI bus, is is like that that shift changes how all the rest of these systems work. And I think that and it lets you it lets us as database makers, engine makers, separate storage from compute and have them vary separately. And that's real like, that's really powerful.
And it gives rise to, like, some of these, like, serverless things that are that are cool. And I think, again, we haven't quite seen all of it. Like, there's gonna be, like s 3, as good as it is, is gonna get better along a number of dimensions. And and I think we will, we'll start to see even even cooler things there.
[00:09:43] Tobias Macey:
And I think on the S3 point, there have been a few developments over the past 5 or so years that have also increased its utility, most notably the read after write consistency that didn't exist for a while where you would make a write, and then depending on where you are reading from, it may or may not exist yet. Now they've got guaranteed read after write consistency. And most recently, they've added an optional update pattern where you can say only update this file if it is actually different than the one that I'm trying to upload, which can add a lot of efficiencies as well as far as IO for conditional updates or where you're not actually changing any of the data.
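The conditional-update idea Tobias describes can be approximated from the client side today. Below is a minimal sketch using boto3, with hypothetical bucket and key names: it compares a local MD5 against the stored object's ETag and skips the upload when nothing changed. The ETag-equals-MD5 shortcut only holds for simple (non-multipart, non-KMS) uploads, and S3's newer server-side conditional headers (If-None-Match / If-Match) are mentioned only in a comment because client support for them depends on SDK version.

```python
# A rough sketch of "don't rewrite data that hasn't changed" against S3.
# Bucket and key are placeholders. For server-side enforcement, newer S3
# conditional-write headers (If-None-Match / If-Match) can be used instead,
# where the SDK version supports them.
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def put_if_changed(bucket: str, key: str, body: bytes) -> bool:
    """Upload the object only when its content differs from what is stored."""
    local_md5 = hashlib.md5(body).hexdigest()
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
        if head["ETag"].strip('"') == local_md5:
            return False  # identical content: skip the PUT entirely
    except ClientError as err:
        # head_object surfaces a missing key as a 404; anything else is a real error
        if err.response["Error"]["Code"] not in ("404", "NoSuchKey", "NotFound"):
            raise
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return True
```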
[00:10:26] Sam Kleinman:
Yeah. Yeah. Yeah. No. It's I even think the, like, non the, like, in the inconsistent read after write version of s 3 is, like, is, like, so good in a lot of ways that that you or that I like I don't even think about all these other things as I'm like, yes. This is great, of course. But but I think there's they have also been adding, like, faster s 3 tiers, which are more expensive, but, like, gives you a level of consistency. And then you can do all these, like, crazy tiering. But but regardless of all of that, it means that you can, like, allocate your compute very elastically and and not have to worry about stateful, like, not have to worry about managing disks. Right? And and it's actually like when people are like, why do you need a new database now? Aren't the things that we have good enough? And the and, like, the things that like, the these, like, old industrial battle tested pieces of software are are really good.
And, like, it takes a long time to make a database kind of robust in that way just because, like, you have to bang on it from all sides. But the but the but they're all they all assume local storage in this way that, like, you can sort of rip out the bottom and put something new in, but it doesn't it, like the the transition from, like, colocated storage and compute to separated storage and compute is, like, a really huge change in in, like, in the software. And and sort of get getting us to that point, I think, is is really, is gonna be is, like, is an interesting, like, technological shift, that that, like, that all that people writing software today are just just beginning to sort of figure out how to do. And there are a lot of legacy tools which are sort of, like, sort of shoehorning themselves into being serverless in a way that is, like, absolutely not gonna, like, fly, from a, I mean, frankly, from an accounting perspective, let alone, like, technology.
But but I think it's it's this is another thing that's super interesting to watch.
[00:12:51] Tobias Macey:
And one of the points that you made early on is the impact on the system design that the database has, and I think that that's definitely a very real consideration for anybody who is trying to design or evolve any sort of service, whether it's a consumer facing transactional application or an analytical workload or machine learning system. And I'm wondering if you can give some of the main points of consideration that engineers and teams need to be thinking about as they're starting to determine what is their persistence mechanism for the system that they're building and how that shapes the capabilities of that overall system.
[00:13:40] Sam Kleinman:
Yeah. The it's sort of I I think that I I often kind of, like, think about backing into this in a different in a different way. Or, like, the the answer to the the way so we'll get I think we'll get there, but it won't feel like it for a little bit. So I think, like, the thing that you end up wanting to do is to think about the queries that you're running and think about things like, how does the data get updated? Is this aggregating data from a lot of different places, or is this data that we're writing and reading? Does data get modified after after it's created or not? There are some kinds of data where that's true or not.
There's also a a expense question. Like, how much money do you have to spend on on engine on the, like, on the database? And then and then there's this, like this sort of hints at, like, the the analytics question, but, like, are you returning records in some kind of what like, business object record of that is, like, persisted in a database, persisted in the database, and modified in in the application? Or are you returning aggregates or computed values that sort of summarize larger parts of the dataset? And and the thing or sort of so, like, once you answer those questions, once you have answers to to, like, what kinds of queries you're doing, which really kinds of depends on like, there's no wrong answer to any of these questions. There's just you you sort of, like, you have to do the thing where you write all of your business objects on index cards and lay them out in front of you and see how they interact with with, each other and and where the you know, what what's what needs to where things need to go. But you, that's a very old school way of thinking about this, I guess. But but you you once you sort of have that, like, what are the components, what are the objects in the system, and how do they interact? You can sort of and then you have to figure out how you need to save them and and or where where the objects exist in existing systems if they're, like, accessed over over APIs. And then so you you have so you have to, like once you have where all of the data is and have all of the the the, like, components and the objects, you you start to sort of think about how you're gonna use it in the in in whatever thing that you're building.
And there are lots of different tools which do, you know, will do sort of, like, one thing really well and a and a a few things less, like, fine, but not great. And the and the challenge is, like, always, like, picking the sort of the best tool for what you're doing and what you're planning to do based on how and where you're going. And I think that, like, the thing that that I the thing the sort of, like, sort of key things are, like, right throughput is sort of hard to like, that's like there are some I guess, like, maybe maybe actually the way to think about this is, like, what do you need to think about first? And or, like, what are the things which you can't compromise on? And so there is this, like like, write throughput is really like, if you pick a system that has slow writes, it's really hard to make it fast.
If you have a system which doesn't really support efficient updates, then you can't like, and your workload is very update heavy, then you really like, you're in trouble. And and and and then you sort of, like, go there are, like you know? So, like in some ways, like, insert throughput, update throughput, and then various kinds of, like, can you do recall at the level that you need? You can make there's, like there are ways of making every query fast. If you're if you're, like, if you're determined, you can make a database give you a query faster usually, But but there are it's hard to make it write faster if if you're sort of, like, teetering on the edge of its, like, tolerances.
But the but I think one of the sort of big, big one of these, like, big sort of top level sort of properties of engines is, like, is the data stored in a in a, like, row oriented format or a column oriented format? And I heard and I, you know, as a early career pre pre working on database engines, like, heard these people say columnar databases. And it took me a, like, really long time to sort of make sense of, like, how and why this was important. But, basically, the idea is that if you so row row oriented is very sort of reasonable, and you can think about it. You have records or rows.
Everyone sort of thinks about this now. It's like an Excel spreadsheet, but you know what a row is and all of the all of the columns, all of the fields, positions in the row are things that you, like, are related to each other. And then columns, right, are sort of all of the all of the different values for each of your records are stored sequentially. And so you will till you have, like, every you know, if you have an ID and and a, you know, like, a a name and a balance and a last updated time, you would store the names and then all of the balances and all of the last updated times. And when you do that, you can there are, like, ways to compress it more quickly, and there are ways of computing averages very quickly because you you can just sort of, like, shove, shove the the data through through the process or really, like, there's, like, very little copying that goes on. It's it's it's straightforward and and quite fast. And this is how analytics engines, like, make things fast is they just write the data out in a way that makes the query that they're gonna do quicker.
And so knowing how these things these things work is like, really lets you answer the questions of, like, what kind of engine do we need? Because we're gonna be computing averages, and so we need a columnar format. Or we're computing rows, and so we need or we're, like, recalling one row at a time or a collection of rows or a a filtering of a section of rows. If you say have, I mean, this is sort of obvious, but if you're just, like, returning records again and again, you like, columnar formats may be very quick for answering some kinds of queries, but they're quite slow at returning rows. That's obvious. Right? But, sounds dumb to say it. But but that, like, either either you store things as rows or or you store them as as columns are is sort of one of these, like, big choices in in databases. And it impacts things like write throughput and update performance if the data isn't if you're updating a whole row and they're in columns, you have to jump around a lot to get the update to work.
So that's that is, like, one of these, like, fundamental decisions that you make.
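To make the row-versus-column distinction Sam is drawing concrete, here is a toy sketch in Python with made-up data. Real engines add compression, vectorized execution, and indexes on top of this, but the layouts alone already show why point lookups favor rows while aggregates favor columns.

```python
# Row-oriented: each record's fields live together.
rows = [
    {"id": 1, "name": "alice", "balance": 120.0, "updated": "2024-01-03"},
    {"id": 2, "name": "bob",   "balance":  75.5, "updated": "2024-02-11"},
    {"id": 3, "name": "carol", "balance": 310.2, "updated": "2024-02-28"},
]

# Column-oriented: all values for one field live together.
columns = {
    "id":      [1, 2, 3],
    "name":    ["alice", "bob", "carol"],
    "balance": [120.0, 75.5, 310.2],
    "updated": ["2024-01-03", "2024-02-11", "2024-02-28"],
}

# Returning a whole record is natural in the row layout...
record = next(r for r in rows if r["id"] == 2)

# ...while an aggregate only has to scan one contiguous column in the
# columnar layout, instead of touching every field of every row.
avg_balance = sum(columns["balance"]) / len(columns["balance"])

print(record, avg_balance)
```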
[00:21:37] Tobias Macey:
That dichotomy of the row versus column oriented record storage is the fundamental divide that has introduced the practice of extract transform load or extract load transform whichever sequence you want to do it in because of the fact that if you have row oriented data, it's easy to insert and update, but it's slower to do aggregates. And if you're trying to do aggregates across your transactional database, it's going to negatively impact the application. So you move all the data to a different system to do all the analytics, which increases a lot of cost and complexity. You're duplicating the data, so you have to duplicate the storage cost.
And so that's definitely one of the fundamental aspects of how you think about the system: are you primarily doing analytics, or are you primarily doing single record events or updates, or what do you need to be able to retrieve?
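As a minimal sketch of the extract-transform-load pattern Tobias is describing, assuming a hypothetical orders table in a transactional SQLite database and using pandas (writing Parquet requires pyarrow or fastparquet to be installed):

```python
import sqlite3

import pandas as pd

# Extract: read from the row-oriented transactional store.
with sqlite3.connect("app.db") as conn:
    orders = pd.read_sql_query(
        "SELECT order_id, customer_id, amount, created_at FROM orders", conn
    )

# Transform: reshape into whatever the analytics side needs.
daily = (
    orders.assign(day=pd.to_datetime(orders["created_at"]).dt.date)
          .groupby("day", as_index=False)["amount"].sum()
)

# Load: land a columnar copy that an analytical engine can scan cheaply,
# without competing with the application for the transactional database.
daily.to_parquet("daily_revenue.parquet", index=False)
```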
[00:22:39] Sam Kleinman:
Yeah. I and I but and I think that's like a I I kind of think of this like the ETL probe problem is I sort of I I kind of feel like the ETL is kind of this, like, problem of our own creation. The the there is this to to sort of that that thing that you it it I mean, this is this is like, this is this is a sort of interesting thing. But we have there is this sort of to like, there is this, like, historical myth that you can, like, have one database, and you'll just, like it will be huge, and all of your code will access the same database. And and, you know, and this is, like, a sort of this is the business model of, like, general purpose database vendors is that you're just gonna, like, you're just gonna make one giant machine or a couple of giant machines, and, you know, it's gonna run DB 2 or Oracle or one of these, like like, behemoths, and and it's gonna solve all of your problems with data. And and this is, like, basically always been a myth for the for literally this, like, this thing that you just said where if you run analytics queries against your transactional workload, then the workloads conflict with each other, and they make each other slow. Right? And the mechanism for this is, like, quite simple. You have you have more data than RAM. The data isn't in you know, the data working set for your instance at at any given time isn't in memory.
But if you're really smart or clever, you can keep the working set for a given workload in memory, and you'd sort of optimize the workload in the application so that for the most part, you're not modify you're not going to disk as much as you can. And transactional systems use, like, indexes for this. Right? And indexes are kind of like real time ETLs of of your of your system, in a lot of ways. There are these there are things like, materialized views, which are, again, the same kind of thing. And they, like, put the data in a different form and keep it up to date sort of sequentially, or, like like, strictly, like, in a consistent sort of way. But but but those those also sort of solve this, like, how like, ways that we can kind of keep the data sort of positioned in a way that makes the queries that you're gonna run stay in memory, not have to sort of bounce bounce to random IO on on disks a lot. And then and then you're sort of good, but if you have more than what if you have too much of this going on, then the workloads thrash a limited set of memory. And then and then you're into this, like, okay. We have to ETL and copy it over. You can also like, there are you can also imagine having having diff like, using read replicas to satisfy some of that some of those workloads.
You can imagine having, you know, like, having ways of persisting data that don't involve updates very much. Right? There are these, like, event driven systems, and so you will, like, materialize out the row data, you know, on some interval. And maybe that's, like, enough. Like, you don't like, having this, like, having the objects be sort of strictly read from memory and, like, material like, in memory as, like, a row and record is, like, maybe not always the thing that that we we need. And so there is like, you can write applications that do both, or you can have some of the like, you can use materializations or or other sort of tools within a system along with replication and and all of this to get kind of, to get the the the sort of query performance that you need for a given workload. Right? Because everyone some things like, you're always doing both.
Right? There's not a no. Look. This is the application where we save numbers, but we never average them. Right? Like, no. I'm like, okay. What are you doing? Like, you know, that's a that's not a that's not a thing. Stop it. Right? There are, you know and and and you say this other there is this other thing of this, like, assumption that that, like, ETL ing is is, like, all like, has this this, like, flaw of of, like, your data's stale and you're saving another copy. But if you're doing like, if you're on some kind of event driven, like, message passing, like, persistent like, using persistence via like, with messages and records and events, then and not, like, objects that you're you're updating in place, then you which, like, that's a thing we do. Like, that's sort of, like, pretty good way of doing things in in a lot of cases.
Then the thing that you're ETL ing is archives, which you need to do anyway. And so so, like and and it's not even it's not even an interesting ETL at that point. Right? You're just, like like, streaming data out to a file. Right? Like, it's and it'll so so so all of this all of this to say, like, there are a lot of tools. There are a lot of these, like, patterns that we use. And I think that we tend to sort of default to using databases with in these, like, very sort of specific ways with, like, all of these, like, very you know, like, I'm I'm, like, kind of, like, picking apart all of your assumptions and all of your, like, the all of these things that you're saying. But you're also sort of saying exactly, like, the thing that is is sort of like, the the the common way that we think and talk about about these, like, solving these data persistence problems.
Like, of course, we use ETL or ELT or TL I don't know. Whatever. It doesn't like, the the the this is, you know, like, processing your data and writing it, you know, like, that's it's called a query. Like, you know, it's somewhat like like like like, ETL should, like, be just like, I I've I've always sort of you know, like, every so often, I, like, stumble across an ETL flow. I'm like, this is you're just running a query in a CronJob. Like, I I get it. I like it. I like CronJobs a lot, but but I think that I think that we we often, you know, and we often sort of fall into, like, leading. And this is this is this is a, like, common engineering.
Engineers do this a lot, but you sort of start with the tools and work backwards to the business problem. And I think they'll, like, actually, you know, they're it take you know, like, you need the sort of grumpy, like, architect or principal engineer type to sort of say, like, okay. Like, let's back up and, like, go go in the other direction and sort of say, like like, ask why a few more times. Like, why are why are we doing this, and what what is the thing that we're trying to accomplish? Not, you know, like, do we need to archive? Do we need to get this this one query really fast or this other query? Or we need write throughput or or whatever the whatever the issue is. So, yeah, I I think it's I think it's I think it's these are, like, really important questions, and it and it impacts, like, how we how we what tool like like, if you you you get to the same like, you still build a thing that works at the end of the day.
But the the sort of assumptions that you need a thing of a certain shape is is you know, leads leads us into some weird contortions, I think, actually.
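Sam's observation that a lot of ETL is "just running a query in a cron job" can be made literal with a small sketch. The table and column names are hypothetical; the script rebuilds a summary table inside the same SQLite database, which is roughly what a materialized-view refresh does in engines that support them natively, and it would be scheduled with something like `*/15 * * * * python refresh_summary.py`.

```python
import sqlite3

# Requires SQLite 3.24+ for the ON CONFLICT upsert syntax.
REFRESH_SQL = """
CREATE TABLE IF NOT EXISTS daily_revenue (day TEXT PRIMARY KEY, total REAL);

INSERT INTO daily_revenue (day, total)
SELECT date(created_at) AS day, SUM(amount) AS total
FROM orders
GROUP BY date(created_at)
ON CONFLICT(day) DO UPDATE SET total = excluded.total;
"""

def refresh(db_path: str = "app.db") -> None:
    # A poor man's materialized-view refresh: recompute the aggregate and upsert it.
    with sqlite3.connect(db_path) as conn:
        conn.executescript(REFRESH_SQL)

if __name__ == "__main__":
    refresh()
```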
[00:31:00] Tobias Macey:
That's definitely true. And I think another interesting angle to it is starting from that business case of I need to solve this problem. So given the requirements constraints of the problem space, I will then choose this, this, and this solution is that businesses and the context in which they're operating are constantly evolving. So what is the right solution for a business at a certain point in time will not necessarily be the right solution for all time, And so you get into those situations of iterative development, iterative architecture evolution. And in a lot of cases, the database engine that you selected because you thought that it was going to solve all the problems that you had, and maybe it did for a time, then becomes a liability because the assumptions of the business are no longer true.
So you need to be able to evolve to accommodate the new realities of what you're trying to solve for. And I'm wondering, what are some of the useful ways for teams to think about, 1, future proofing, recognizing that that is never the reality? You can never actually future proof, but planning for evolution of the architecture and some of the ways that teams, when they do hit that point of realization that the engine that they have selected is no longer viable and is now a liability, how can they then make that transition to a new system as painless as possible?
[00:32:41] Sam Kleinman:
Yeah. I mean, I think you sort of you, like, you you got it in the, like, you can't future proof anything, point. So, now I have to think of something cleverer to say yeah. So I think that I think that what you I mean, in some ways, like, don't try is sort of the like, the answer is, like, you know, I sort of I obliquely referenced this the, like, the, like, x y problem of sort of jumping to a solution too soon. And I think that, like, the the sort of the thing that you can do to sort of be the one who says time out time out, folks. Let's let's, like, let's ask a few more questions is to all to think that, like, okay. I'm an engineer. I should I should always ask one more question and not or, like, I should always sort of state my assumptions.
And and often, like, that is the that is the tool that you use to future proof what you're doing. I think the so that's, like, that's that's a nod in the you know, that's, like, too too sort of hand wavy because people will, like, you know, like, engineers will almost always think that they're doing that and not quite do it. So, like, asking asking more questions and, like, starting, like, I assume starting sort of, like, every ticket with every every Jira issue, every document that you write, every meeting with, like, these are the things that I'm assuming. And then and then you sort of do do design work and architecture thinking, like, based on that requirements gathering based on on those assumptions. The the sort of not the the sort of the the second so that's like that's a whole that's like a whole kind of we could spend there's like a whole podcast series that we could have on that question. There's the the sort of on the other side of of it is this, like, realization that engineers have this tendency and desire to over engineer things and to know that sort of this, like, planning for future scaling is is is always this, like, over engineering, premature optimization kind of path and and to sort of, in some ways, just, like, spend less time thinking about it because you're always going to have to do it again in the future, and you should wait for it to be a problem before you start optimizing it or or fixing that kind of thing. And I and that's that's, like, okay. That's, like, really easy to say.
But I think, like, having having an architecture that makes sense for the team that you have that, like, keeps, like, the assumptions, like, well well declared, uses standard tools that that do scale in some way. So, right, like, if you there there is this, like, example. Right? If you have if you're, like, writing all of your data manipulation using pandas or polars, you have basically, you have your queries written in Python. And if you then need to take your 4 gigabyte dataset or your 32 gigabyte data set or what you know, or or, you know, small and terabytes of data that you that have been running okay with pandas, and then suddenly, like like, multiply it by a 100, you have to redo it. Like like, there is just like that's not you know, there's no way around that. You have to redo it. Fine.
That's fine. If it were SQL, you may have to redo it too, but there's a chance that, like, the SQL will, like, translate better across scales because SQL's declarative, and it sort of says, this is the thing I want. You, mister database, missus database, go forth and give it to me, and I don't care what you do on the inside. And you can sort of load the data into whatever tool you need to and see what happens in terms of performance. And so so that that, like, that sort of using using tools in that way, like, maybe is, like, the right answer, which isn't to say that, like, you shouldn't use pandas, because, like, a lot of times you're never gonna do that. And and and there are you know, and sometimes the redoing it is less less intense in one way or another.
Or you or or you have to, like, do some kind of or you have to there's a group by very early in your execution, and you just there's no way to to have that query execute in parallel anyway. And so your scaling options are are limited. Right? So the so it's, like, very kind of use case specific about, like, which which tools you use to kind of to to handle handle that, which is why, like, all the things we said before about about knowing what kinds of queries you're gonna be doing and and how, like, how that relates to where the data is is absolutely crucial. Going and then you, you know, you sort of hit this scaling wall or you watch the graph kind of get close, and then you and then you sort of say, okay. Now we have to, like we we deferred doing this earlier, but now we have the SQLite database, which is it's 2 terabytes.
And, you know, we need we really need to, like, have it not be, like, on the application server because it makes deploys take too long. Fine. Right. You know, like, there there I like, a year ago, I had had coffee with a former coworker who who told you know, described the architecture of a component of the of their of the system they were working on now. I was like, oh, yeah. It's just like it's like an EC 2 instance with with a really big SQLite database. And, like, it's fine. The you know? And and I was like and I was like, but really? And and, you know, if you think about it, like, sometimes, you know, having a having the app and the database in the same place, like, you there was always gonna be one one thing anyway, and there are backups no matter what it is. And so you you end you end up with these situations where, like, yeah. Actually, like, people probably should use SQLite for way more than they're currently doing with it. And and then having and then having have there are ways to, like, architect that in a really sort of sane way.
And then, you know, and then there are sort of you can kind of take that in in lots of different directions and having, you know, having ways of testing it locally. Like like the the oh, this is just a SQLite database or whatever. It's actually, like, gets you, like, really far, you know, in a distressingly really far way. But if the, like, thing if but but, you know, like and then maybe that's fine. And, like, you can later, you know, move it to Postgres or MySQL or whatever you want. And and that, like, that gets you that gets you a lot. And but but and some of this is, like, what tools are you comfortable with? And and some tie and there are definitely database tools, right, like, that do scale up pretty well along any number of dimensions.
Though though it is really it is also true. So, so, like, the the databases at scale, like, there is this there are these, like, class of of databases which have sharding built into the database, Cockroach, MongoDB. I mean, like, obviously, like, I know things about the MongoDB world pretty pretty well. But there are there are, Couch Couchbase does this. There are there are, like, Cassandra's and other sort of native, like, sharding inside of the database tools, and they and they're really like, that's, like, next level stuff. It's also absolutely true that in practice, if you're gonna use sharding, eventually, you sort of have to, like, know that you're potentially gonna be at that scale and, like, spend an hour making sure that you haven't sort of tied yourself to some kind of railroad track eventually. Right? Like, there is this, like, you kind of need to, like, plan how your data is is, like, laid out in order to to make that work. So it it is the the answer, you know, the answer is sort of like don't overengineer anything.
Upgrade the scale when you need to. Use be use the abstractions to your advantage, and also sort of gut check. Like, it's not possible for this dataset to get larger than 2 terabytes. Okay. Like or this is a dataset where we're, like, modifying things all of the time, but it's small. That leads to a different set of a different set of challenges. This is an append only data dataset. Great. You know, there are different different different like, you if you start at a certain place, you, like, get to get to an end more more or less quickly. And and I think scaling is always or scaling is a thing that application developers who know things about who, like, once you know things about how databases work, application developers are really the people in the best position to sort of in coordination with the product managers that one hopes they have to sort of make these make these kinds of decisions, when building when building apps. But it's but you but it's really sort of a thing that, like, works its way all the way up and down all the way up and down the stack.
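Sam's point about SQL being declarative can be shown with the same aggregate asked two ways; the data and names here are made up. The pandas version is tied to a DataFrame in this process's memory, while the SQL string does not care whether it runs against SQLite today or a larger engine later.

```python
import sqlite3

import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [10.0, 25.0, 5.0, 40.0, 12.5, 7.5],
})

# In-process: fine until the data no longer fits on one machine.
by_customer_pd = orders.groupby("customer_id", as_index=False)["amount"].sum()

# Declarative: the engine decides how to execute the same question.
with sqlite3.connect(":memory:") as conn:
    orders.to_sql("orders", conn, index=False)
    by_customer_sql = pd.read_sql_query(
        "SELECT customer_id, SUM(amount) AS amount FROM orders GROUP BY customer_id",
        conn,
    )

print(by_customer_pd)
print(by_customer_sql)
```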
[00:43:25] Tobias Macey:
Another interesting outgrowth of the complexity of selection of a database engine can have is that the engineers who are building on top of that engine become skeptical of the longevity of that system, and so they will refrain from actually taking full advantage of the capabilities of the database and instead move more of that functionality and logic into the application layer. So the application becomes more complex and heavier weight and harder to maintain, and the database becomes effectively just dumb storage. And so I think that that is definitely a a risk that exists as well. And I'm wondering how you have seen teams try to manage that balance or some of the ways that you think about how and when to actually invest in the capabilities of the database engine to improve or enhance some functionality or manage some of the actual application logic versus just being this dumb storage system?
[00:44:28] Sam Kleinman:
Yeah. I mean, I think it's I mean, the things like, I think this is, like, a class of, like, over know, like, oh, we're gonna put an abstraction layer in front of x thing because we might change it later. You're never gonna do it. Right? Like, sorry. Like like, that's not a that's not a thing. Like, no one so there there are a couple of, like, fallacies, I think, in in in the way that people think about databases and data selection. The first is that someone that there's ever a day when anyone in the world ever wakes up and is like, you know what? Today, I need to decide if we're gonna use MongoDB or Postgres, or we're gonna use MySQL or Postgres. Right? That decision has always already been made by someone else. Right? You know, unless you are, like, literally, like, starting out, like, that's we just don't make that decision at a at a low level. And the second thing is we don't change it. You know, there are occasionally, there are people who will do there are systems where you do need to migrate from one to another for some reason, But but that's incredibly rare, and you should and sort of at the point when you need to do that, you have users and money.
And so so it doesn't so so you can afford to do it. Right? If you don't have users and money, it doesn't matter because your data whatever it is, it all fits in RAM and you like like like, you're not making money anyway, so who's gonna pay you to, like, change databases? Like, that's not that's not what we're doing. Unless it's for fun, in which case, you know, go wild. But but we don't we don't have that these kinds of, like, migrations happen very often. So these abstractions like, you should have good tooling around your database that, like, lets you do it. But that that lets you sort of, like, handle like, have some abstraction between the database and and and your code. Right? You, like, you shouldn't be, like, I don't know, like, writing your own database driver inside of your inside you know, in in your model's package or whatever. But but but but you also you don't need to sort of build your own abstraction on top of on top of your database and do this this thing where you put the the all of your business logic in your app that could easily be pushed into the database. It would be faster and cheaper if you did it that way. Again, like, pushing things into the database is like writing SQL. And, you know, if you take a SQL statement and you're like, okay. Now we have to take this SQLite and make it run on Postgres 1. That that way should be pretty easy. And 2, like, it's it's you know, if what you have to do is go through and find a bunch of functions that don't exist or or and write them a different way, like, that's that's a much less onerous kind of thing. The I mean, even the real sort of the thing that I think like, you you sort of have this you the question you posed was, like, do you, like, take advantage of the database?
I think the, like, the risk that people like, say, in some ways, like, it's not just like, taking advantage of the database is, like, a relatively safe thing because people don't change. The thing that, like, I think is risky and people are less afraid of doing for the wrong reasons are are the are systems that have user defined functions, which means that you've sort of pushed some logic custom logic into the database, which has to sort of run and it's difficult to version it correctly. Right? That's kind of terrifying and hard to migrate. There's also, and and various things like triggers and stored procedures and all that, like, have the same same property. There's also if you're using an extension, you get, you know, extend so for systems like Postgres where you just, like, hot load code into into the database, it becomes very version dependent. And using that functionality is, like, actually sort of that. There is some like, I I I get that gives me the willies.
But, but I think, like, I think that I think that all mostly, like, teams I think mostly teams should should just not be as worried about it and and think of think of the these, like, adding extra indirection between the database and the application is really, like, is really a is really sort of a premature optimization. It's also true having said all of that, it's also true that sometimes you do actually wanna write application code. And this was, like, one of the things that, like, the early NoSQL or, like, early mid NoSQL era really kind of, like, leaned into is that you just you know, sometimes the clearest way to express something was in fact, like, writing a 4 loop. And you you you can't do it, like, all of the time, and it's not in a universal rule. But but sometimes that's what you want. Like, you just want a for loop. And having some application code, like, there are times when when you you when it's fine.
But but, I mean, this is like, we're we're sort of the all of my answers to this question are, like, you know, classic engineer. It depends. And and also, like, have a have a really good understanding of the problems that you're trying to solve and the size of data at any given point in a pipeline. And and try and also sort of, like, try and avoid spending your time solving problems of your own creation. Right? Like, there are a lot of you know, like, I sort of I think that, like, teams aren't gonna do this for lots of reasons, but, like, in a lot of cases, ETLing is a sort of problem that we've created because we haven't sort of extended engines in the right ways, where we haven't provided, like, interoperability in one way or another. And and, like, this is a lot of what Better d's working on is this, like, how do we provide the same interface, but also not acknowledge the the, like, reality of database this monolith of give all your money to IBM, get d b 2, put all your data in there, go to sleep happy is is just not true. And there you know? Or any any it's very rare that you find an organization that has only 1 database or only 1 database type. And and I think being being pragmatic with that is is really, really sort of crucial, a really crucial way of of, like, avoiding falling into some of these traps.
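A small contrast for the "push it into the database" point Sam makes: both functions below return the same number, but the second ships far less data out of the engine and lets it use an index on status if one exists. The invoices schema is hypothetical.

```python
import sqlite3

def overdue_total_in_app(conn: sqlite3.Connection) -> float:
    # The "for loop in the application" version: pull every row, filter in Python.
    total = 0.0
    for amount, status in conn.execute("SELECT amount, status FROM invoices"):
        if status == "overdue":
            total += amount
    return total

def overdue_total_in_db(conn: sqlite3.Connection) -> float:
    # The same logic pushed down as a predicate plus an aggregate.
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM invoices WHERE status = ?",
        ("overdue",),
    ).fetchone()
    return row[0]
```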
[00:51:28] Tobias Macey:
I think that's probably a pretty good place to start to close out the conversation. So is there anything that you think is useful as a closing argument or anything that you wanna say to wrap up that we didn't already touch on?
[00:51:42] Sam Kleinman:
I think we got through we certainly got through most of it. And there are definitely, like, a lot of, like, side quests in this conversation, which I think bear a little bit more attention at some point. I I mean, I think they'll, like, the part the parting parting shot, parting wisdom is really, like, ask more questions and acknowledge your ask more questions, acknowledge your assumptions, avoid premature optimization. You like, be aware and be aware sort of pragmatically of, like, what what's going on, what's possible, where you're we're almost always data we have app there's we spend a lot of time talking about applications, but there's, like, on top of there's, like, core application developers and then people working with data generated by applications on top of it. And which are themselves writing writing application applications of a different sort. But that all of these all of these projects really there's there's, like, this, like, pragmatic, like, this this is we don't have control of it. Most of our almost every line of code like, people very rarely just, like, open a text editor and say, today, I'm gonna write a data driven application. Right? That's another this is, like, another one of these, like, things that doesn't happen.
Not you know, don't wake up today and decide which database to use. Don't wake up today with a with a blank application and no data. Right? There's we're building on top of things. And so being letting making it possible and and and acknowledging that we're we're gonna be pragmatic. We have to be pragmatic. This is this is, you know, like, we're not we're building applications and tools and software to solve, hopefully, as much as possible other people's problems. And and, you know, and I think, like, make it so that we can make decisions about businesses, about the world, about our organizations, about whatever the thing is based on data and not vibes.
And if we screw up with you know, like, if a query is too slow, people are just gonna make decisions based on vibes. Right? People will always fall back to the, like, vibes vibes driven, decision making. And, you know, people do okay with their vibes, but we have all these tools. We have all of this data, like, being being sort of able to and flexible with with building building out infrastructure and tools around it is the thing that lets us make better decisions, more clear decisions about about whatever it is that we're doing. And I think that's, like, a really that's the thing that it's, like, important to to keep in mind and and is I know it had and has been true for a really long time at at, you know, the scales of 10 years ago are not the scales of today and aren't gonna be the scales of of, you know, 10 or 20 years from now. And I think that's all of that's okay.
And and our jobs are to sort of make sure that we don't get get we don't look back in 10 years and say, oh, we just we just sort of, like, did that all on vibes. It was you know? And I think that's I think that's sort of the the important part.
[00:55:41] Tobias Macey:
Yeah. And I think, too, making sure that you actually understand what the database that you're selecting is capable of and is optimized for before you decide to invest in it.
[00:55:49] Sam Kleinman:
Yeah. Absolutely. Like, know your tools. Like like, I you know, I you know, as a as a programmer, I, like, have a collection of absolutely absurd facts in my brain about, like, what various run times do. Like, I you know, like, Python programmers sort of, like, have an intuitive sense of where the GIL is. You know? Like, the all there are all sorts of all sorts of cases where just, like, knowing your tools is really good and really, like, gives you a lot of of power and databases are kind of the same. Like, it's a it's a runtime. It's a it's a it I mean, they're they all use the same mostly use the same programming language, but it's all, like, it's all it's all one of those. And so knowing knowing how all of that works makes you a a better consumer.
And I think that's that's useful.
[00:56:49] Tobias Macey:
Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:08] Sam Kleinman:
This is gonna sound like it's out of left field, but I think easy to use testing tools are really the, like is the thing that just, like, always feels like it's missing. You you sort of write all of your application code. You can unit test a lot of stuff if you're, like, functional enough. But then you you start talking to a database and you're like, well, okay. Here we go. I hope I I hope I got that right. And and there's, like, there's nothing. There's like like as great as everything is, there's just, like, absolutely nothing that exists where you're like, oh, yeah. I wanna run a I wanna run something that feels like a unit test with you know, I wanna make some fake data, and it just, like, doesn't exist, and there's no way to do it. And and I and so, you know, that's where all the bugs are all the time. Like, you know, you spend if you if you have an app that someone's using and all every every outage, every user complain is always something that isn't covered by testing. And we have these, like, great tools in other areas of our code for for doing testing, and and, like, there's some paradigm or syntax or something that just, like, doesn't exist today.
And and I think, like, that that that's it.
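One way teams approximate the testing story Sam describes as missing is an in-memory SQLite fixture seeded with a little fake data and exercised like a unit test. The schema, the data, and the function under test here are all hypothetical, and the pattern only covers behaviors SQLite shares with the production engine.

```python
import sqlite3

import pytest

def count_active_users(conn: sqlite3.Connection) -> int:
    # The database interaction under test.
    return conn.execute("SELECT COUNT(*) FROM users WHERE active = 1").fetchone()[0]

@pytest.fixture
def conn():
    # Fresh in-memory database per test, seeded with known fake data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, active INTEGER)")
    conn.executemany(
        "INSERT INTO users (name, active) VALUES (?, ?)",
        [("alice", 1), ("bob", 0), ("carol", 1)],
    )
    yield conn
    conn.close()

def test_count_active_users(conn):
    assert count_active_users(conn) == 2
```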
[00:58:39] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts on this overall space of selection and optimization around the database engine and how you're applying it to the problems you're trying to solve. It is a perennial problem and definitely subject to the overarching it depends answer. So I appreciate you taking the time to explore that with me, and I hope you enjoy the rest of your day.
[00:59:09] Sam Kleinman:
Yeah. You too. Thank you so much.
[00:59:16] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering podcast, the show about modern data management.
[00:00:16] Sam Kleinman:
It's 2024. Why are we still doing data migrations by hand? Teams spend months, sometimes years, manually converting queries and validating data, burning resources and crushing morale. DataFold's AI powered migration agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit data engineering podcast.com/datafold today to learn how DataFold can automate your migration and ensure source to target parity. Your host is Tobias Macy, and today I'm interviewing Sam Kleinman about database trade offs across operating environments and axes of scale. So, Sam, can you start by introducing yourself? Yeah. Sure. I
[00:01:09] Tobias Macey:
am a software engineer with a eclectic background, I guess. I, you know, like, I I live in Brooklyn, and I've I I worked for a I spent a long time working on MongoDB in a couple of different roles and then have, like, spent some time working on other other engines, Neon and and and also a startup called ClearAD. And I think I think databases are or database engines are are sort of the I don't know, have always sort of been the, like, the biggest lever in the in sort of that you can that you can jump on to, like, impact the way that software engineers are able to be productive and build cool things. And so it's, it's always been a really exciting it's weird to say, oh, databases are exciting, but but they they they have always been an exciting place to to, like, do work and see and, like, have an impact on our industry. So that's that's sort of how I got here, belatedly.
[00:02:24] Tobias Macey:
And do you remember how you first got started working in data?
[00:02:28] Sam Kleinman:
Oh, yeah. I have a liberal arts undergraduate degree in the social sciences, basically, interdisciplinary stuff. And I had a dumb job in college filing paper for the fundraising department, in file cabinets which I can't imagine anyone ever used. But it was a sort of economic endeavor, and I would listen to podcasts, which is ironic, here we are on a podcast, though I don't listen to a lot of podcasts these days. And I listened to something where they interviewed people who were working during the last data tooling boom, talking about a lot of these databases. It was the peak of NoSQL's moment, or heading towards the peak. And I was like, oh, this data thing is real big. I had been interested in open source, and I had always been a nerd, but somehow as a young person I was very resistant to taking CS classes or whatever, so I never really studied it. But I definitely got interested in learning about the technology and thought there were a lot of cool things to be done around infrastructure. It's sort of weird now, if you say infrastructure it means you're a sysadmin, and I'm not exactly a sysadmin.
But I think infrastructure in this abstract sense is really cool because of what it enables, the ways that infrastructure lets you focus on the important parts of your problem. Databases mean that you're not spending lots and lots of time figuring out how to indelibly save data, or make sure that you can recall it. There are these problems that databases just solve for you. So as a programmer, you don't have to wonder if you've written the search algorithm correctly. Right? You just select whatever you want.
And there are many levels of that. I mean, we've had relational databases for, what, decades now, and we are not yet to the point where the abstraction between application code and the state persistence and retrieval code, the database engine, sits exactly where it should be. That boundary is interesting to watch settle over the long term. And then you get a job and you keep working at it, and your next job looks like your previous job, and suddenly you have a thing, somehow. I think that's an accident that many of us kind of stumble into.
[00:06:07] Tobias Macey:
And on the point of database engines, they are the system of record for whatever application or service or system you're building, where database is a very broad term. It can be anything from I happen to have written a file to disk and am reading it back, to some of these deconstructed database engines like the emerging lakehouse architecture, where you have a query engine, you have some object storage that lives off somewhere else, and everything in between, across different modalities of data types and data formats. Vectors, for example, have become increasingly popular as a data type because of the AI growth.
[00:06:53] Sam Kleinman:
Yeah. Well, indexing is sort of a classic. I actually think of a lot of vector search problems as subclasses of indexing problems, or at least they sometimes can be. There's also the idea that a database is an abstraction, the thing for doing this save, access, and retrieval piece. You said something that I thought was great, the deconstructed database. And orthogonally to all of this, to what a database is, there are these questions about how we build applications, and that shapes how you need to save data. There are architectural decisions which influence the shape of the data that you have, and there are deployment considerations which shape how you deploy things.
There are also patterns, or workload properties, that constrain how you manage the data that you have. The truly revolutionary thing that's happened in the last 15 years in computing, I think, is that S3 has become reliable in a way that's kind of alarming, cheap in a way that's also alarming, and ubiquitous, and the cheapness makes it ubiquitous. It used to be that databases were about how you attach enough disk to a machine that has enough RAM to do the thing that you need it to do, and the size of the disk was the sort of thing that mattered. Now you just put it on S3. Having storage that is connected over Ethernet rather than over some sort of PCI or SCSI bus changes how all the rest of these systems work. It lets us as database engine makers separate storage from compute and have them scale separately. And that's really powerful.
And it gives rise to some of these serverless things that are cool. Again, we haven't quite seen all of it. S3, as good as it is, is going to get better along a number of dimensions, and I think we'll start to see even cooler things there.
[00:09:43] Tobias Macey:
And I think on the S3 point, there have been a few developments over the past five or so years that have also increased its utility, most notably the read-after-write consistency that didn't exist for a while, where you would make a write and then, depending on where you were reading from, it may or may not exist yet. Now they've got guaranteed read-after-write consistency. And most recently, they've added an optional conditional update pattern, where you can say only update this object if it is actually different from the one that I'm trying to upload, which can add a lot of efficiency in terms of IO for conditional updates, or for cases where you're not actually changing any of the data.
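As a rough sketch of the skip-unchanged-uploads idea, here is an illustrative Python snippet using boto3. The bucket, key, and file names are hypothetical, and this uses a plain ETag comparison rather than S3's newer conditional-write headers:

```python
# Sketch: only upload an object if its content differs from what is already in S3.
# Assumes a non-multipart upload, where the S3 ETag is the hex MD5 of the object body.
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET, KEY = "example-bucket", "reports/daily.csv"  # hypothetical names

def upload_if_changed(path: str) -> bool:
    body = open(path, "rb").read()
    local_etag = hashlib.md5(body).hexdigest()
    try:
        remote_etag = s3.head_object(Bucket=BUCKET, Key=KEY)["ETag"].strip('"')
    except ClientError:
        remote_etag = None  # object does not exist yet (or we cannot read it)
    if remote_etag == local_etag:
        return False  # identical content already stored; skip the write
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=body)
    return True
```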
[00:10:26] Sam Kleinman:
Yeah. Even the inconsistent, read-after-write-less version of S3 was so good in a lot of ways that I don't even think about all these other things; I'm like, yes, this is great, of course. But they have also been adding faster S3 tiers, which are more expensive but give you a level of consistency, and then you can do all this crazy tiering. Regardless of all of that, it means that you can allocate your compute very elastically and not have to worry about managing stateful disks. And when people ask, why do you need a new database now, aren't the things that we have good enough? These old, industrial, battle-tested pieces of software are really good.
And it takes a long time to make a database robust in that way, just because you have to bang on it from all sides. But they all assume local storage, in a way where you can sort of rip out the bottom and put something new in, but the transition from colocated storage and compute to separated storage and compute is a really huge change in the software. Getting to that point is an interesting technological shift that people writing software today are just beginning to figure out how to do. And there are a lot of legacy tools which are shoehorning themselves into being serverless in a way that is absolutely not going to fly, frankly, from an accounting perspective, let alone the technology.
But I think this is another thing that's super interesting to watch.
[00:12:51] Tobias Macey:
And one of the points that you made early on is the impact that the database has on the system design, and I think that's a very real consideration for anybody who is trying to design or evolve any sort of service, whether it's a consumer-facing transactional application, an analytical workload, or a machine learning system. And I'm wondering if you can give some of the main points of consideration that engineers and teams need to be thinking about as they're starting to determine what the persistence mechanism is for the system that they're building, and how that shapes the capabilities of that overall system.
[00:13:40] Sam Kleinman:
Yeah. I often think about backing into this in a different way. We'll get there, but it won't feel like it for a little bit. The thing that you end up wanting to do is to think about the queries that you're running: how does the data get updated? Is this aggregating data from a lot of different places, or is this data that we're writing and then reading back? Does data get modified after it's created or not? There are some kinds of data where that's true and some where it isn't.
There's also an expense question: how much money do you have to spend on the database? And then there's the analytics question: are you returning records, some kind of business object that is persisted in the database and modified in the application? Or are you returning aggregates or computed values that summarize larger parts of the dataset? There's no wrong answer to any of these questions. You sort of have to do the thing where you write all of your business objects on index cards, lay them out in front of you, and see how they interact with each other and where things need to go. That's a very old-school way of thinking about this, I guess. But once you have that, what are the components, what are the objects in the system, and how do they interact, you figure out how you need to save them, or where the objects live in existing systems if they're accessed over APIs. And then, once you know where all of the data is and have all of the components and objects, you start to think about how you're going to use it in whatever thing you're building.
And there are lots of different tools which will do one thing really well and a few other things fine, but not great. The challenge is always picking the best tool for what you're doing and what you're planning to do, based on where you're going. Maybe the way to think about this is, what do you need to think about first, or what are the things you can't compromise on? Write throughput is one of them: if you pick a system that has slow writes, it's really hard to make it fast.
If you have a system which doesn't really support efficient updates and your workload is very update heavy, then you're in trouble. So, in some ways, it's insert throughput, update throughput, and then various kinds of, can you do recall at the level that you need? There are ways of making every query fast; if you're determined, you can usually make a database answer a query faster. But it's hard to make it write faster if you're teetering on the edge of its tolerances.
I think one of the big, top-level properties of engines is whether the data is stored in a row-oriented format or a column-oriented format. Early in my career, before working on database engines, I heard people say columnar databases, and it took me a really long time to make sense of how and why this was important. But basically, the idea is this: row oriented is very reasonable, and you can think about it. You have records, or rows.
Everyone sort of thinks about this now; it's like an Excel spreadsheet. You know what a row is, and all of the columns, all of the fields, the positions in the row, are things that are related to each other. In a columnar layout, all of the different values for each of your records are stored together. So if you have an ID and a name and a balance and a last-updated time, you would store all of the names, and then all of the balances, and then all of the last-updated times. When you do that, there are ways to compress the data more effectively, and there are ways of computing averages very quickly, because you can just shove the data through the processor; there's very little copying that goes on. It's straightforward and quite fast. And this is how analytics engines make things fast: they write the data out in a way that makes the query they're going to run quicker.
Knowing how these things work really lets you answer the question of what kind of engine you need. We're going to be computing averages, so we need a columnar format. Or we're recalling one row at a time, or a collection of rows, or a filtered subset of rows. This is sort of obvious, but if you're just returning records again and again, columnar formats may be very quick for answering some kinds of queries, but they're quite slow at returning rows. That sounds dumb to say, but whether you store things as rows or as columns is one of these big choices in databases. And it impacts things like write throughput and update performance: if you're updating a whole row and the data is stored in columns, you have to jump around a lot to get the update to work.
So that is one of these fundamental decisions that you make.
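To make the row-versus-column idea concrete, here is a minimal, illustrative Python sketch; the records and field names are made up:

```python
# The same three records laid out two ways.
records = [
    {"id": 1, "name": "alice", "balance": 120.0, "updated": "2024-01-02"},
    {"id": 2, "name": "bob",   "balance": 75.5,  "updated": "2024-01-03"},
    {"id": 3, "name": "cara",  "balance": 310.2, "updated": "2024-01-05"},
]

# Row-oriented: each record's fields are stored together.
# Fetching one whole record is cheap; scanning one field across all records touches everything.
row_store = records

# Column-oriented: all values for each field are stored together.
# Aggregating one field is a tight scan over contiguous values; rebuilding a full row
# means touching every column.
column_store = {field: [rec[field] for rec in records] for field in records[0]}

# An analytics-style query (average balance) only needs one column in the columnar layout.
avg_balance = sum(column_store["balance"]) / len(column_store["balance"])

# A transactional-style query (give me record 2) is a single lookup in the row layout,
# but requires stitching values back together from every column in the columnar layout.
row_2 = next(rec for rec in row_store if rec["id"] == 2)
print(avg_balance, row_2)
```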
[00:21:37] Tobias Macey:
That dichotomy of row- versus column-oriented storage is the fundamental divide that introduced the practice of extract-transform-load, or extract-load-transform, whichever sequence you want to do it in, because if you have row-oriented data it's easy to insert and update, but it's slower to do aggregates. And if you're trying to do aggregates across your transactional database, it's going to negatively impact the application. So you move all the data to a different system to do the analytics, which adds a lot of cost and complexity; you're duplicating the data, so you have to duplicate the storage cost.
And so that's definitely one of the fundamental aspects of how you think about the system: are you primarily doing analytics, are you primarily doing single-record events or updates, or what do you need to be able to retrieve?
[00:22:39] Sam Kleinman:
Yeah, but I kind of feel like the ETL problem is a problem of our own creation. This is a sort of interesting thing. There is this historical myth that you can have one database, it will just be huge, and all of your code will access the same database. This is the business model of general-purpose database vendors: you're going to make one giant machine, or a couple of giant machines, it's going to run DB2 or Oracle or one of these behemoths, and it's going to solve all of your problems with data. And this has basically always been a myth, for literally the reason you just said: if you run analytics queries against your transactional workload, the workloads conflict with each other and make each other slow. Right? And the mechanism for this is quite simple. You have more data than RAM; the working set for your instance at any given time isn't in memory.
But if you're really clever, you can keep the working set for a given workload in memory, and you optimize the workload in the application so that, for the most part, you're not going to disk more than you have to. Transactional systems use indexes for this, and indexes are kind of like real-time ETLs of your system, in a lot of ways. There are things like materialized views, which are again the same kind of thing: they put the data in a different form and keep it up to date in a strictly consistent sort of way. Those also solve the problem of keeping the data positioned so that the queries you're going to run stay in memory and don't have to bounce to random IO on disk a lot. And then you're sort of good. But if you have too much of this going on, the workloads thrash a limited amount of memory, and then you're into, okay, we have to ETL and copy it over. You can also imagine using read replicas to satisfy some of those workloads.
You can also imagine ways of persisting data that don't involve updates very much. There are these event-driven systems, where you materialize out the row data on some interval, and maybe that's enough. Having the objects be strictly read from memory, materialized as a row and record, is maybe not always the thing we need. So you can write applications that do both, or you can use materializations or other tools within a system, along with replication, to get the query performance you need for a given workload. Because you're always doing both.
Right? There's no, look, this is the application where we save numbers but we never average them. No. Okay, what are you doing? That's not a thing. Stop it. And there is this other assumption, that ETLing has this flaw of your data being stale and you saving another copy. But if you're on some kind of event-driven, message-passing system, using persistence via messages and records and events, and not objects that you're updating in place, which is a thing we do, and is a pretty good way of doing things in a lot of cases...
...then the thing that you're ETLing is archives, which you need to produce anyway. And it's not even an interesting ETL at that point. You're just streaming data out to a file. So all of this is to say, there are a lot of tools and a lot of patterns that we use, and I think we tend to default to using databases in these very specific ways. I'm kind of picking apart all of your assumptions and all of these things that you're saying, but you're also saying exactly the common way that we think and talk about solving these data persistence problems.
Of course we use ETL or ELT or whatever; it doesn't matter. Processing your data and writing it out, that's called a query, as in the sketch below. Every so often I stumble across an ETL flow and I'm like, this is just running a query in a cron job. I get it. I like cron jobs a lot. But I think we often fall into leading with the tools. And this is common engineering behavior.
Engineers do this a lot: you start with the tools and work backwards to the business problem. You need the sort of grumpy architect or principal engineer type to say, okay, let's back up, go in the other direction, and ask why a few more times. Why are we doing this, and what is the thing that we're trying to accomplish? Do we need to archive? Do we need to get this one query really fast, or this other query? Do we need write throughput, or whatever the issue is? So, yeah, I think these are really important questions, and they impact what tools we use. You still build a thing that works at the end of the day.
But the assumption that you need a thing of a certain shape leads us into some weird contortions, I think, actually.
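As a minimal sketch of the "an ETL flow is often just a query on a schedule" idea, here is illustrative Python over SQLite; the orders table, its columns, and the summary table are all made up, and a real materialized view or warehouse load could stand in for the summary table here:

```python
# Sketch: periodically materialize an aggregate out of a transactional store.
# Run this from cron (or any scheduler); it is "ETL" in the sense of a query plus a copy.
import sqlite3

def refresh_daily_summary(db_path: str = "app.db") -> None:
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(
            """
            CREATE TABLE IF NOT EXISTS daily_order_summary (
                day     TEXT PRIMARY KEY,
                orders  INTEGER NOT NULL,
                revenue REAL NOT NULL
            );
            """
        )
        # Recompute the summary from the (hypothetical) orders table.
        conn.execute("DELETE FROM daily_order_summary;")
        conn.execute(
            """
            INSERT INTO daily_order_summary (day, orders, revenue)
            SELECT date(created_at), COUNT(*), SUM(amount)
            FROM orders
            GROUP BY date(created_at);
            """
        )
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    refresh_daily_summary()
```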
[00:31:00] Tobias Macey:
That's definitely true. And I think another interesting angle is that, even when you start from the business case of I need to solve this problem, so given the requirements and constraints of the problem space I will choose this, this, and this solution, businesses and the context in which they're operating are constantly evolving. What is the right solution for a business at a certain point in time will not necessarily be the right solution for all time. And so you get into those situations of iterative development and iterative architecture evolution. In a lot of cases, the database engine that you selected because you thought it was going to solve all the problems that you had, and maybe it did for a time, then becomes a liability because the assumptions of the business are no longer true.
So you need to be able to evolve to accommodate the new realities of what you're trying to solve for. And I'm wondering what some useful ways are for teams to think about, one, future-proofing, recognizing that that is never the reality, you can never actually future-proof, but planning for evolution of the architecture; and two, when teams do hit that point of realization that the engine they selected is no longer viable and is now a liability, how they can make the transition to a new system as painless as possible.
[00:32:41] Sam Kleinman:
Yeah. I mean, you got it with the you-can't-future-proof-anything point, so now I have to think of something cleverer to say. In some ways, "don't try" is sort of the answer. I obliquely referenced this earlier, the X-Y problem of jumping to a solution too soon. And the thing that you can do, to be the one who says time out, folks, let's ask a few more questions, is to think: okay, I'm an engineer, I should always ask one more question, or I should always state my assumptions.
Often that is the tool you use to future-proof what you're doing. That's too hand-wavy, though, because engineers will almost always think they're doing that and not quite do it. So ask more questions, and start every Jira issue, every document that you write, every meeting with: these are the things that I'm assuming. Then you do design work and architecture thinking based on that requirements gathering, based on those assumptions. There's a whole podcast series we could have on that question. On the other side of it is the realization that engineers have this tendency and desire to over-engineer things, and that planning for future scaling is always this over-engineering, premature-optimization kind of path. In some ways, just spend less time thinking about it, because you're always going to have to do it again in the future, and you should wait for it to be a problem before you start optimizing or fixing that kind of thing. And that's really easy to say.
But I think it helps to have an architecture that makes sense for the team you have, keeps the assumptions well declared, and uses standard tools that do scale in some way. There's this example: if you're writing all of your data manipulation using pandas or polars, you basically have your queries written in Python. And if you then need to take your 4 gigabyte dataset, or your 32 gigabyte dataset, or the small number of terabytes of data that has been running okay with pandas, and suddenly multiply it by a hundred, you have to redo it. There's no way around that. You have to redo it. Fine.
That's fine. If it were SQL, you might have to redo it too, but there's a chance the SQL will translate better across scales, because SQL is declarative. It says, this is the thing I want; you, mister database, missus database, go forth and give it to me, and I don't care what you do on the inside. You can load the data into whatever tool you need and see what happens in terms of performance. So using tools in that way maybe is the right answer, which isn't to say you shouldn't use pandas, because a lot of the time you're never going to hit that scale. And sometimes the redoing is less intense in one way or another.
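To make the declarative-versus-imperative point concrete, here is an illustrative sketch of the same aggregation written both ways; the orders.csv file, the orders table, and the column names are hypothetical:

```python
# The same "revenue per customer" question, expressed two ways.
import sqlite3

import pandas as pd

# pandas: the steps are spelled out in Python and run in one machine's memory.
df = pd.read_csv("orders.csv")
revenue_pd = df.groupby("customer_id")["amount"].sum().reset_index()

# SQL: the statement describes the result, and the engine decides how to execute it.
# The same text can later run against SQLite, Postgres, or a warehouse with more headroom.
conn = sqlite3.connect("app.db")
revenue_sql = pd.read_sql_query(
    "SELECT customer_id, SUM(amount) AS revenue FROM orders GROUP BY customer_id;",
    conn,
)
conn.close()
```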
Or there's a group-by very early in your execution, there's just no way to have that query execute in parallel anyway, and so your scaling options are limited. So it's very use-case specific which tools you use to handle that, which is why everything we said before, about knowing what kinds of queries you're going to be doing and how that relates to where the data is, is absolutely crucial. And then you hit a scaling wall, or you watch the graph get close, and you say, okay, we deferred doing this earlier, but now we have the SQLite database and it's 2 terabytes.
And we really need it to not be on the application server, because it makes deploys take too long. Fine. A year ago, I had coffee with a former coworker who described the architecture of a component of the system they were working on, and it was basically an EC2 instance with a really big SQLite database. And it's fine. I was like, but really? And if you think about it, sometimes having the app and the database in the same place is okay; there was always going to be one thing anyway, and there are backups no matter what it is. So you end up with these situations where, yeah, actually, people probably should use SQLite for way more than they're currently doing with it. And there are ways to architect that in a really sane way.
Then you can take that in lots of different directions, including having ways of testing it locally. The "oh, this is just a SQLite database or whatever" approach actually gets you really far, distressingly far. And maybe that's fine, and you can later move it to Postgres or MySQL or whatever you want. That gets you a lot. Some of this is about which tools you're comfortable with, and there are definitely database tools that do scale up pretty well along any number of dimensions.
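As a minimal sketch of the "SQLite next to the application" pattern, here is illustrative Python; the file name and schema are made up, and WAL mode is one common way to let readers and a single writer coexist:

```python
# Sketch: an application-local SQLite database that can later be swapped for Postgres/MySQL.
import sqlite3

def get_conn(path: str = "app.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL;")    # readers don't block the single writer
    conn.execute("PRAGMA synchronous=NORMAL;")  # common durability/throughput trade-off
    return conn

def init_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS events (
            id         INTEGER PRIMARY KEY,
            kind       TEXT NOT NULL,
            payload    TEXT NOT NULL,
            created_at TEXT NOT NULL DEFAULT (datetime('now'))
        );
        """
    )
    conn.commit()

if __name__ == "__main__":
    conn = get_conn()
    init_schema(conn)
    conn.execute("INSERT INTO events (kind, payload) VALUES (?, ?)", ("signup", "{}"))
    conn.commit()
```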
Though it is also true that there is this class of databases which have sharding built into the engine: Cockroach, MongoDB, and obviously I know the MongoDB world pretty well, and Couchbase does this, and Cassandra, and other tools with native sharding inside the database. That's next-level stuff. It's also absolutely true that in practice, if you're going to use sharding, you eventually have to know that you're potentially going to be at that scale, and spend an hour making sure you haven't tied yourself to some kind of railroad track. You kind of need to plan how your data is laid out in order to make that work. So the answer is sort of: don't over-engineer anything.
Upgrade the scale when you need to. Use the abstractions to your advantage, and also gut check: it's not possible for this dataset to get larger than 2 terabytes, okay. Or, this is a dataset where we're modifying things all of the time, but it's small; that leads to a different set of challenges. This is an append-only dataset, great. If you start at a certain place, you get to an end more or less quickly. And scaling is a thing where application developers, once they know how databases work, are really the people in the best position, in coordination with the product managers one hopes they have, to make these kinds of decisions when building apps. But it's really something that works its way all the way up and down the stack.
[00:43:25] Tobias Macey:
Another interesting outgrowth of the complexity that database engine selection can have is that the engineers who are building on top of that engine become skeptical of the longevity of that system, and so they refrain from actually taking full advantage of the capabilities of the database and instead move more of that functionality and logic into the application layer. So the application becomes more complex, heavier weight, and harder to maintain, and the database becomes effectively just dumb storage. And I think that's definitely a risk that exists as well. I'm wondering how you have seen teams try to manage that balance, or some of the ways that you think about how and when to actually invest in the capabilities of the database engine to improve or enhance some functionality or manage some of the actual application logic, versus just treating it as a dumb storage system.
[00:44:28] Sam Kleinman:
Yeah. I think this is a class of over-engineering: oh, we're going to put an abstraction layer in front of X thing because we might change it later. You're never going to do it. Right? Sorry, that's not a thing. There are a couple of fallacies, I think, in the way people think about databases and database selection. The first is that there's ever a day when anyone in the world wakes up and says, you know what, today I need to decide if we're going to use MongoDB or Postgres, or MySQL or Postgres. That decision has always already been made by someone else. Unless you are literally just starting out, we don't make that decision at a low level. And the second thing is, we don't change it. Occasionally there are systems where you do need to migrate from one to another for some reason, but that's incredibly rare, and at the point when you need to do that, you have users and money.
And so you can afford to do it. If you don't have users and money, it doesn't matter, because your data, whatever it is, all fits in RAM, and you're not making money anyway, so who's going to pay you to change databases? That's not what we're doing, unless it's for fun, in which case, go wild. So these kinds of migrations don't happen very often. As for abstractions: you should have good tooling around your database that gives you some separation between the database and your code. You shouldn't be writing your own database driver inside your models package or whatever. But you also don't need to build your own abstraction on top of your database and do this thing where you put all of your business logic in your app when it could easily be pushed into the database. It would be faster and cheaper if you did it that way. Pushing things into the database is, again, like writing SQL. And if you have to take a SQL statement from SQLite and make it run on Postgres: one, that should be pretty easy, and two, if what you have to do is go through and find a bunch of functions that don't exist and write them a different way, that's a much less onerous kind of thing.
The question you posed was, do you take advantage of the database? Taking advantage of the database is a relatively safe thing, because people don't change databases. The thing that I think is risky, and that people are less afraid of for the wrong reasons, is systems that have user-defined functions, which means you've pushed some custom logic into the database, which has to run there and is difficult to version correctly. That's kind of terrifying and hard to migrate. Various things like triggers and stored procedures have the same property. There are also extensions: for systems like Postgres where you hot-load code into the database, it becomes very version dependent. Using that functionality, I'll admit, gives me the willies.
But I think that mostly teams should not be as worried about it, and should think of adding extra indirection between the database and the application as really a premature optimization. Having said all of that, it's also true that sometimes you do actually want to write application code. This was one of the things that the early and mid NoSQL era really leaned into: sometimes the clearest way to express something was, in fact, writing a for loop. You can't do it all of the time, and it's not a universal rule, but sometimes that's what you want. You just want a for loop, and having some application code is fine at times.
But all of my answers to this question are classic engineer: it depends. Have a really good understanding of the problems you're trying to solve and the size of the data at any given point in a pipeline, and try to avoid spending your time solving problems of your own creation. In a lot of cases, ETLing is a problem that we've created because we haven't extended engines in the right ways, or we haven't provided interoperability in one way or another. And this is a lot of what Better d's is working on: how do we provide the same interface while acknowledging that the reality of the database monolith, give all your money to IBM, get DB2, put all your data in there, go to sleep happy, is just not true. It's very rare to find an organization that has only one database or only one database type. And being pragmatic about that is a really crucial way of avoiding falling into some of these traps.
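As an illustrative sketch of pushing work into the database rather than looping in the application, here is a small Python example over SQLite; the accounts table and its columns are hypothetical:

```python
# Two ways to answer "what is the total balance?" against a (hypothetical) accounts table.
import sqlite3

conn = sqlite3.connect("app.db")

# Application-side loop: pulls every row back and aggregates in Python.
total_in_app = 0.0
for (balance,) in conn.execute("SELECT balance FROM accounts"):
    total_in_app += balance

# Pushed into the database: one statement, one row back, and the engine can use
# whatever plan or index it has available.
(total_in_db,) = conn.execute("SELECT SUM(balance) FROM accounts").fetchone()

conn.close()
print(total_in_app, total_in_db)
```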
[00:51:28] Tobias Macey:
I think that's probably a pretty good place to start to close out the conversation. So is there anything that you think is useful as a closing argument or anything that you wanna say to wrap up that we didn't already touch on?
[00:51:42] Sam Kleinman:
I think we certainly got through most of it, and there are definitely a lot of side quests in this conversation which bear a little more attention at some point. The parting shot, the parting wisdom, is really: ask more questions, acknowledge your assumptions, and avoid premature optimization. Be aware, pragmatically, of what's going on and what's possible. We spend a lot of time talking about applications, but there are core application developers, and then there are people working with the data generated by those applications on top of them, who are themselves writing applications of a different sort. All of these projects have to be pragmatic; we don't have full control of them. People very rarely just open a text editor and say, today I'm going to write a data-driven application. That's another one of these things that doesn't happen.
You don't wake up today and decide which database to use. You don't wake up today with a blank application and no data. We're building on top of things. So we have to acknowledge that we're going to be pragmatic. We're building applications and tools and software to solve, hopefully, as much as possible, other people's problems. And we should make it so that we can make decisions about businesses, about the world, about our organizations, about whatever the thing is, based on data and not vibes.
And if we screw up, if a query is too slow, people are just going to make decisions based on vibes. People will always fall back to vibes-driven decision making, and people do okay with their vibes, but we have all these tools and all of this data. Being flexible with building out infrastructure and tooling around it is the thing that lets us make better, clearer decisions about whatever it is that we're doing. I think that's important to keep in mind, and it has been true for a really long time. The scales of 10 years ago are not the scales of today and aren't going to be the scales of 10 or 20 years from now. And all of that's okay.
Our jobs are to make sure that we don't look back in 10 years and say, oh, we just sort of did all of that on vibes. I think that's the important part.
[00:55:41] Tobias Macey:
Yeah. And I think, too, making sure that you actually understand what the database that you're selecting is capable of and optimized for before you decide to invest in it.
[00:55:49] Sam Kleinman:
Yeah, absolutely. Know your tools. As a programmer, I have a collection of absolutely absurd facts in my brain about what various runtimes do. Python programmers sort of have an intuitive sense of where the GIL is. There are all sorts of cases where just knowing your tools is really good and gives you a lot of power, and databases are kind of the same. A database is a runtime. I mean, they mostly all use the same programming language, but it's one of those. And knowing how all of that works makes you a better consumer.
And I think that's useful.
[00:56:49] Tobias Macey:
Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:08] Sam Kleinman:
This is going to sound like it's out of left field, but I think easy-to-use testing tools are the thing that always feels like it's missing. You write all of your application code, and you can unit test a lot of it if you're functional enough. But then you start talking to a database and you're like, well, okay, here we go, I hope I got that right. And there's nothing. As great as everything is, there's just absolutely nothing that exists where you can say, I want to run something that feels like a unit test, I want to make some fake data, and it just doesn't exist; there's no way to do it. And so that's where all the bugs are, all the time. If you have an app that someone's using, every outage, every user complaint is always something that isn't covered by testing. We have these great tools in other areas of our code for doing testing, and there's some paradigm or syntax or something here that just doesn't exist today.
And I think that's it.
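One common workaround today, for what it's worth, is to fake the database inside a unit-test fixture. Here is a minimal sketch with pytest and an in-memory SQLite database; the schema, seed rows, and query under test are all made up:

```python
# Sketch: a unit-style test against a throwaway, in-memory database with fake data.
import sqlite3

import pytest

@pytest.fixture
def db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5)],  # fake data
    )
    conn.commit()
    yield conn
    conn.close()

def revenue_for_customer(conn, customer_id: int) -> float:
    # The query under test; in a real app this would live in the data-access layer.
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return total

def test_revenue_for_customer(db):
    assert revenue_for_customer(db, 1) == 15.0
    assert revenue_for_customer(db, 99) == 0
```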
[00:58:39] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your thoughts on this overall space of selecting and optimizing around the database engine and how you're applying it to the problems you're trying to solve. It is a perennial problem and definitely subject to the overarching it-depends answer. So I appreciate you taking the time to explore that with me, and I hope you enjoy the rest of your day.
[00:59:09] Sam Kleinman:
Yeah, you too. Thank you so much.
[00:59:16] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
The Role of Database Engines in System Architecture
Considerations for Choosing a Database
Future-Proofing and Evolving Database Architecture
Balancing Application Logic and Database Capabilities
Pragmatic Approaches to Database Selection and Use