Summary
In this episode of the Data Engineering Podcast, Jeremy Edberg, CEO of DBOS, talks about durable execution and its impact on designing and implementing business logic for data systems. Jeremy explains how DBOS's serverless platform and orchestrator provide local resilience and reduce operational overhead, ensuring exactly-once execution in distributed systems through the use of the Transact library. He discusses the importance of version management in long-running workflows and how DBOS simplifies system design by reducing infrastructure needs like queues and CI pipelines, making it beneficial for data pipelines, AI workloads, and agentic AI.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Jeremy Edberg about durable execution and how it influences the design and implementation of business logic
- Introduction
- How did you get involved in the area of data management?
- Can you describe what DBOS is and the story behind it?
- What is durable execution?
- What are some of the notable ways that inclusion of durable execution in an application architecture changes the ways that the rest of the application is implemented? (e.g. error handling, logic flow, etc.)
- Many data pipelines involve complex, multi-step workflows. How does DBOS simplify the creation and management of resilient data pipelines?
- How does durable execution impact the operational complexity of data management systems?
- One of the complexities in durable execution is managing code/data changes to workflows while existing executions are still processing. What are some of the useful patterns for addressing that challenge and how does DBOS help?
- Can you describe how DBOS is architected?
- How have the design and goals of the system changed since you first started working on it?
- What are the characteristics of Postgres that make it suitable for the persistence mechanism of DBOS?
- What are the guiding principles that you rely on to determine the boundaries between the open source and commercial elements of DBOS?
- What are the most interesting, innovative, or unexpected ways that you have seen DBOS used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on DBOS?
- When is DBOS the wrong choice?
- What do you have planned for the future of DBOS?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- DBOS
- Exactly Once Semantics
- Temporal
- Semaphore
- Postgres
- DBOS Transact
- Idempotency Keys
- Agentic AI
- State Machine
- YugabyteDB
- CockroachDB
- Supabase
- Neon
- Airflow
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey, and today, I'm interviewing Jeremy Edberg about durable execution and how it influences the design and implementation of business logic for data systems. So, Jeremy, can you start by introducing yourself?
[00:01:00] Jeremy Edberg:
Hi. My name is Jeremy Edberg. I'm currently the CEO of DBOS. Previously, I've worked at places like Reddit, Netflix, PayPal, Apple, eBay, Amazon. And do you remember how you first got started working in data? Well, I've kind of been working in data the whole time. So I was the first employee at Reddit, and, obviously, that involved a lot of data. So by necessity, I had to manage a cluster of Postgres databases, which I later found out was the largest cluster of Postgres databases on AWS for a while. At Netflix, it was kind of the same. I was there in charge of reliability, and I moved to the data team because data is a huge part of Netflix, and making the system reliable revolved around making the data systems reliable and fast and easy to use.
[00:01:45] Tobias Macey:
So I've sort of been involved in data through my whole career and even before that. And as you mentioned, you're currently running your company. You said DBOS? I wasn't sure if it was pronounced "DeBoss" or "D-B-O-S."
[00:01:56] Jeremy Edberg:
Yeah. Funny story. So it started as DBOS because it was originally going to be an operating system based on a database, so using a database as the bottom layer instead of a file system. The founders moved away from that model as they realized it was very hard, would take a long time, and wasn't commercially viable to start. And instead, they used their learnings from their research into that model to build the serverless platform and orchestrator that we offer today. So our systems are based on that concept.
[00:02:31] Tobias Macey:
And so digging a bit more into what you're building and also, a little bit about this concept of durable execution, I'm wondering if you can just describe some of the overview about what you're trying to solve and some of the system that you're building around it to address that problem?
[00:02:47] Jeremy Edberg:
Yeah. So durable execution is a concept that's very old, but the term is fairly new. You've probably dealt with it through reliable pipelines or error correction, things like that. But, really, all it means is executing in a way where the software does what it's supposed to do, and in our case, exactly once. Right? So exactly once has been a very tough problem to solve since we've moved to distributed systems, and durable computing is intended to solve that problem across distributed systems.
[00:03:18] Tobias Macey:
And some of the other contexts, at least in recent memory, that I have heard this concept of durable execution, whether using that specific terminology or not, is largely around the Temporal project, which was an outgrowth of the orchestration and long-running execution systems from Uber. And I'm wondering if you can talk to some of the ways that what you're building is either related to or disparate from the concepts that are built into Temporal and the capabilities that it focuses
[00:03:49] Jeremy Edberg:
on? Yeah. So Temporal is definitely the 800-pound gorilla in the durable execution space. And we are building a system that acts similarly, but is built completely differently. So we solve the same problems that they do, with once-and-only-once execution, long-running workflows, so on and so forth. But we do it in a very different way. We store everything locally. Our system is designed to be much more locally resilient, so you don't need an executor that's remote where your reliability depends on another system. So that's how we significantly differ from Temporal.
[00:04:26] Tobias Macey:
And from an architecture and application design and business logic perspective, obviously, having durable execution as the core substrate changes the way that you would think about the problem versus if you were just building it all from scratch, if you're doing just arbitrary Python scripts or trying to build a distributed pipeline system where you have a sequencing of tasks, whether that's from cron jobs or CI pipelines or Airflow, etcetera. I'm just curious how that inclusion of durable execution, the exactly-once semantics, changes the ways that engineers think about what problems they can solve and how they address solving those problems.
[00:05:04] Jeremy Edberg:
Yeah. So what's really interesting about it is, if you're building your distributed system in the quote, unquote correct way, you are already building it to be almost durable. And the way we operate is we just add some decorators, and so that lets you decorate your existing code. But starting from scratch, what it enables is you don't need a lot of the error-checking "whoops" code that you would normally need. For example, you don't need all the transaction rollback. What do I do if somebody made an order but the payment failed? Now I have to put the inventory back. You don't have to worry about that because it's already a series of work steps and the workflow will get automatically rolled back. So that inventory will just go back. So you don't even have to worry about that. So the way it significantly changes is it makes programming much easier to reason about. And in some cases, we've seen it reduce code by over 90% because you don't need all of those edge cases, branches, error corrections, because it's all just built right in: if I can guarantee that this will run either exactly one time or zero times, I don't have to worry about a lot of that other stuff.
[00:06:16] Tobias Macey:
From a data pipeline perspective, as you mentioned, if you're doing it correctly, you're already bringing in some of those semantics of using things like semaphores or doing check-before-write, etcetera. I'm wondering how the approach for data pipelines in particular changes across multi-step workflows, across different execution contexts, when it's relying on something like DBOS for that exactly-once semantics and the failure recovery, the ability to resume execution, things like that?
[00:06:24] Jeremy Edberg:
Yeah. By using DBOS and the concepts of durable compute, you can eliminate a lot of the extra things that you need to do on your data pipeline. You can eliminate the check at the end whether the data got through. You can eliminate a lot of the queues. We support queues, but you can eliminate a lot of the external queues because you don't need to worry so much about it anymore. And so that's the big thing with the semantics: a lot of our customers are actually doing this. Make sure that this chunk of data gets from point a to point b, and it happens exactly one time, and it did happen. And so by using durable compute, you can eliminate a lot of the extra infrastructure around your data pipeline that is there to make sure that the data pipeline actually worked. And things that you mentioned, the semaphores, that stuff, it's all built in because it's part of the durability of the library.
[00:07:31] Tobias Macey:
You mentioned too, in the context of Temporal, how DBOS is more self-contained. One of the major challenges of building data systems or complex application architectures, particularly bringing in things like service oriented architecture or microservices, is the operational overhead of managing them and managing coordination across them. What are some of the ways that DBOS addresses some of that operational overhead and operational complexity of these distributed architectures, particularly around synchronization and state management?
[00:08:03] Jeremy Edberg:
So you said it at the end best. Right? It's the statefulness, and we're strong believers in putting everything in Postgres. And so what we have is a metadata database, and that helps you eliminate a lot of that operational complexity. You can technically build a fully functional, operational, reliable system with a single file using the semantics of our library. And so you don't have to worry about all those other systems, managing those other systems, making sure they're scaling. All you have to do is scale your executors and your Postgres database. And both of these are well solved problems. Right? People have been scaling databases for fifty years and Postgres for forty. And same with execution. Right? People have been scaling execution for at least the last ten, fifteen plus years. And so those are the only two things you really have to worry about. There's very little database overhead, and so the system works really well because you're already scaling your database because it's where your data is. And so that's why this eliminates a lot of that operational overhead because you don't have to worry about those extra systems.
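To make the "single file plus Postgres" shape concrete, here is a minimal Python sketch. The names used (`DBOS()`, `@DBOS.step()`, `@DBOS.workflow()`, `DBOS.launch()`) follow the conventions in the DBOS Transact documentation, but treat the exact API and configuration as assumptions rather than verified details; the point is that the workflow is ordinary code plus annotations, with its progress checkpointed in the Postgres metadata database.

```python
# Hypothetical single-file durable app. The decorator and launch names assume
# the DBOS Transact Python API and may differ from the shipped library.
from dbos import DBOS

DBOS()  # connects to the Postgres metadata database described above


@DBOS.step()
def extract(batch_id: int) -> list[dict]:
    # Step inputs and outputs are checkpointed, so a crash after this step
    # resumes the workflow here instead of starting over.
    return [{"batch": batch_id, "row": i} for i in range(100)]


@DBOS.step()
def load(rows: list[dict]) -> int:
    # Stand-in for writing to a warehouse; completed steps are replayed from
    # their stored results rather than re-executed.
    return len(rows)


@DBOS.workflow()
def pipeline(batch_id: int) -> int:
    return load(extract(batch_id))


if __name__ == "__main__":
    DBOS.launch()           # start the durable runtime
    print(pipeline(42))     # survives process restarts mid-run
```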
[00:09:08] Tobias Macey:
When we're talking about durable execution, workflow management, a lot of times, data pipelines or systems that people are typically interacting with are executing on the scale of seconds to hours, maybe days at most. But, obviously, there are workflows, particularly in large and complex organizational contexts, where the workflow might take weeks, months, years to complete. And, obviously, that brings in a lot of challenges around version management, change management, where you have a workflow, you kick it off, you know that it's not going to complete for up to a year, but you also need to be able to evolve your system within that time because you can't just freeze everything, and you don't want to necessarily stamp out multiple copies where each copy corresponds to a change that you're trying to release. And I'm curious how that factors into the way that you think about what workflows to incorporate in DBOS or a durable execution framework, how to manage that change of version, how to manage the change of state representation, and some of the ways that you're thinking about that in the ways that you're designing and implementing the DBOS substrate.
[00:10:20] Jeremy Edberg:
Mhmm. Yeah. So versioning is key, and versioning is a first-class citizen within the Transact library, because of the exact reason that you just pointed out: you don't want your workflow to change versions in the middle, or maybe you do, but you wanna have control over that. You don't want it to change when you're not expecting it to. And so that is a very fundamental part of the Transact library and the services that we offer: managing versions. So by default, every workflow will finish on the version of software that it started on. That is the default behavior, because we assume that's mostly the expected behavior for most people.
You can manually change it, though. If you're in the middle of a workflow and you've realized you had a bug, you upload some new code, you can say this workflow now finishes on this new piece of code. That is certainly available to you. And so versioning is key. It's first class, and everything in the system is around those versions to make sure that the state data is tied to the version but can be changed if you want to.
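Conceptually, the default rule is that each workflow execution records the application version it started under, and an executor only resumes workflows whose recorded version matches its own unless an operator reassigns them. The sketch below is purely illustrative pseudologic under that assumption; the field names and helper functions are invented, not the Transact schema or API.

```python
# Illustrative pseudologic for version pinning; the field names and helpers
# are invented for this example, not the actual Transact schema or API.
import hashlib


def application_version(source_files: list[bytes]) -> str:
    """One possible scheme: derive a version identifier from the deployed code."""
    digest = hashlib.sha256()
    for blob in source_files:
        digest.update(blob)
    return digest.hexdigest()[:12]


def recoverable(workflow_row: dict, executor_version: str) -> bool:
    # Default rule: a workflow is only resumed by executors running the same
    # application version it started on.
    return workflow_row["status"] == "PENDING" and workflow_row["app_version"] == executor_version


def reassign_version(workflow_row: dict, new_version: str) -> None:
    # Manual override: an operator decides this workflow should finish on
    # newly deployed code, e.g. after shipping a bug fix.
    workflow_row["app_version"] = new_version
```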
[00:11:21] Tobias Macey:
So now digging into DBOS, you mentioned the Transact library. We can dig a little bit more into the specifics of that change management as we dig a bit deeper. But I'm wondering if you can talk through what is the overall design and architecture of DBOS? What is the role of the Transact library in the broader context of the corporate offering, and just some of the ways that the scope and focus have changed from when you first started building out these capabilities?
[00:11:51] Jeremy Edberg:
Right. So the Transact library is the free, open source, MIT-licensed library that gives you durability. It is the one that handles creating that metadata database, managing all the metadata in the database. So, for the nerds who care, we essentially keep a record of every piece of data that's changed and which transaction ID in Postgres made that change so that we can roll back to any particular transaction ID and say all of this happened after, so undo it. And so what this lets you do is, if you are storing your data in the same database, the whole thing can be wrapped in a database transaction.
So it either completes or it doesn't, and then it's all stored together. If you're using a third party, you can get idempotency keys, which is another thing that the DBOS library provides management of, and they are stored with your data so that, assuming that third party offers you some functionality to roll stuff back, you can still do that. And so that is what the core library offers. It offers these primitives. It offers crons, queues, a few other things that you might want if you're building software that's based on data manipulation.
And those are all in the library, and that can all be run locally. That can all be tested locally. What we offer commercially is the ability to run that as a production workload reliably, securely, observably. So you get a bunch of observability tools. You get a single pane of glass management system to see all the workflows, to manage the workflows. You get a system that moves the workflows from one executor to another if they fail. That is, it manages to guarantee that the workflow is finished across multiple executors. So Transact will be on a single executor. It'll make sure that everything happens on that single executor.
Our commercial offering helps you manage it across executors as well as giving you observability and a lot of data around your workflows.
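One way the idempotency-key idea plays out with a third-party service: pass a durable identifier for the current workflow as the idempotency key so a retried step cannot double-apply the side effect. In this sketch, the decorators and `DBOS.workflow_id` follow DBOS documentation conventions, and the payments endpoint with its `Idempotency-Key` header is a hypothetical external API; treat all of it as an illustrative assumption.

```python
# Idempotency-key sketch. The decorators and DBOS.workflow_id follow DBOS
# documentation conventions; the payments endpoint is hypothetical.
import requests
from dbos import DBOS

DBOS()


@DBOS.step()
def charge(amount_cents: int) -> dict:
    # If this step is retried after a crash, the provider sees the same key
    # and returns the original charge instead of billing twice.
    response = requests.post(
        "https://payments.example.com/charges",         # hypothetical third-party API
        json={"amount": amount_cents},
        headers={"Idempotency-Key": DBOS.workflow_id},  # stable per workflow execution
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


@DBOS.workflow()
def checkout(amount_cents: int) -> dict:
    return charge(amount_cents)
```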
[00:13:53] Tobias Macey:
And so digging more into the Transact library and the ways that you're thinking about its application and the ecosystem that it sits within, to my understanding, it's a Python library. So, obviously, that means that you're betting heavily on the Python ecosystem and people needing the durable execution context within the Python ecosystem. And I'm wondering what are some of the ways that you're thinking about the types of use cases and the types of applications that are the target here and any potential for expanding to other language ecosystems as you establish your foothold within Python?
[00:14:36] Jeremy Edberg:
Yeah. So just to clarify, we offer both a Python and a TypeScript library. So it's the same semantics in both, and you can use either one. And because it's TypeScript, it'll support, you know, Node.js, it'll support Next.js, all of your favorite JS libraries. So that is one important thing. But what we're targeting really is people, or I should say, we're targeting every workload, but the ones that have shown up the most are data pipeline workloads. I need data to get from point a to point b. AI workloads.
So I need to do a training run, and I need to guarantee that these 10,000 documents have all been trained on. Agentic AI: I need a human in the middle, and so that could be a long-running workflow. So those use cases all work for us. Although we do even have random people doing back end application development. We have one person who manages hunting areas in the American South. So it's literally a life and death situation, because if you have not reserved your hunting area, you could get shot by somebody who has, and so that is running on our system. So it's a wide range of use cases, but we started with TypeScript because a lot of people were building full stack applications using TypeScript in their back end. We then moved on to Python because of the AI use cases, the data pipeline use cases, the fact that NumPy exists, SciPy, all of those massive data manipulation libraries, and, you know, those create long-running workflows. Right? It could take hours, days to analyze your data. And so you wanna have checkpointing there so you can restart it and so on and so forth. The next couple of languages, we don't have a road map for any particular language yet, but the most requested are Go and Java.
So probably one of those next, most likely Go because of the complications of Java. But right now we're going with those two for those use cases, and then whatever people want next.
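The "guarantee that these 10,000 documents have all been trained on" case maps naturally onto the queue primitive mentioned earlier. A rough sketch, again assuming the DBOS Transact Python conventions (`Queue`, `enqueue`, and `get_result` are taken from its documentation and may differ in detail):

```python
# Durable fan-out sketch. Queue, enqueue, and get_result follow DBOS Transact
# documentation conventions and should be treated as assumptions here.
from dbos import DBOS, Queue

DBOS()
embedding_queue = Queue("embedding", concurrency=10)  # cap concurrent workers


@DBOS.workflow()
def embed_document(doc_id: str) -> str:
    return f"embedded:{doc_id}"  # stand-in for the real embedding/training call


@DBOS.workflow()
def embed_corpus(doc_ids: list[str]) -> list[str]:
    # Queue state lives in Postgres, so a crash mid-run resumes without
    # re-processing documents that already finished.
    handles = [embedding_queue.enqueue(embed_document, doc_id) for doc_id in doc_ids]
    return [handle.get_result() for handle in handles]
```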
[00:16:38] Tobias Macey:
For cases where people are leveraging both the Python and the TypeScript SDKs, what are some of the ways that they approach the cross language boundaries for workflows that need to span across different process environments where maybe they have something in the browser or in node runtime that also needs to interact with workflows that have execution context within that Python ecosystem?
[00:17:04] Jeremy Edberg:
That is a very interesting question. So in reality, no one is doing that because of all the complications you just mentioned. Anybody who's using both languages, if they are, it's two applications that talk to each other via a standard API. Mainly because you don't want to be sharing those databases because of the different semantics of where and how data is used. Right? So it's like a standard microservices thing. You generally don't wanna use the same database for multiple microservices because then you don't really have microservices.
Two languages basically are two different microservices. So that would be a bad practice, and luckily, no one that we know of is actually doing it. Another interesting parallel in computer science and just the ways that people think about designing systems,
[00:17:47] Tobias Macey:
there's the durable execution, which helps you with failure recovery: the process died, and now I need to start back up from where I left off. Another parallel for managing transitions and sequencing within workflows is the idea of a state machine. And I'm wondering what are some of the areas of overlap with state machines and the ways that those are incorporated into application architectures and the guarantees that are offered by durable execution, and some of the ways that you're actually using state machines within DBOS to manage some of those transition requirements.
[00:18:21] Jeremy Edberg:
Yeah. I mean, you could think of every workflow as a big state machine. Right? There's a state and it changes, and we track how it's changed over time. That is actually one really good way to reason about a workflow: it's a series of steps in a state machine, and we can always determine the current state of the state machine by querying the metadata database, which, interestingly, is completely open and available to the user. So the user is certainly able to build any query they want against that database to expand the functionality or get whatever observability they need in particular. There's nothing secret there.
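Because the metadata database is plain Postgres, "what state is this state machine in right now?" can be an ordinary SQL query. The table and column names below are placeholders rather than the actual Transact schema; the point is only that the workflow state is open and queryable.

```python
# Hypothetical query against the workflow metadata; the table and column
# names are placeholders, not the real Transact schema.
import psycopg

with psycopg.connect("dbname=app_metadata") as conn:
    rows = conn.execute(
        """
        SELECT workflow_id, status, updated_at
        FROM workflow_status              -- placeholder table name
        WHERE status NOT IN ('SUCCESS', 'ERROR')
        ORDER BY updated_at
        """
    ).fetchall()

for workflow_id, status, updated_at in rows:
    print(f"{workflow_id}: {status} (last transition {updated_at})")
```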
[00:18:58] Tobias Macey:
And you mentioned that Postgres is the persistence layer. So, obviously, you're betting a lot on its reliability, its transaction isolation, and its scalability. And I'm wondering what are the aspects of Postgres as an engine that led you to selecting it for such a critical piece of the architecture and maybe any other engines or storage layers that you considered on the path to that ultimate decision?
[00:19:26] Jeremy Edberg:
Yeah. So Postgres was selected in part because our cofounder created Postgres. But we did look at others, or I should say they did, because it was before I joined. They looked at other storage engines. I don't wanna say their names, but they did look at other storage engines, and none of them have the bulletproof reliability that forty years of Postgres gives. And Postgres was settled on mostly because it's just bulletproof. It's known to be bulletproof. It's been around a long time. There's a lot of research that has gone into making it stronger and better and a lot of people attacking it to make sure that it works. So it was just a good solution. There's good solutions for reliability there, for shipping data back and forth, lots of experience. So it just made a lot of sense. And most applications can be run on one or two Postgres databases.
There's very few applications that require more than that. And for applications that do require
[00:20:20] Tobias Macey:
more horizontal scalability, there are systems such as CockroachDB and Yugabyte that have at least interface compatibility with Postgres, and I'm wondering what your thoughts are on the viability of using one of those options or something similar in place of the actual core Postgres and some of the ways that that compromises the actual guarantees that you're focused on with the vanilla Postgres as the core of your architecture.
[00:20:47] Jeremy Edberg:
Yeah. So you can use any Postgres-compatible database, but they have to actually be Postgres compatible. So there's a range of Postgres compatible. Right? There's some that claim to be, but really, they only support, you know, 70% of the actual functionality. That's not gonna work. But there are ones that do support it fully, and we do have people using other databases. We have people on Supabase. We have a partnership with Supabase. We have NeonDB. We've had someone try it with Cockroach, but we have not; they disappeared. So we don't certify databases, but we have tried it on other databases.
And we can, if we get enough requests, try it on a database to see if they truly are Postgres compatible. But as long as they truly are, it'll work just fine. Yugabyte, Cockroach: we've talked to all of these folks, and at least somebody has tried it on them. And so for people who are
[00:21:38] Tobias Macey:
building a system, they need guaranteed exactly-once semantics. They need reliability, resumability. I'm wondering if you can talk through the process of actually building a system with DBOS and just what that looks like from a program design perspective, some of the ways that they should be thinking about it, what are the touch points to the Transact library and the execution context for the actual workload components, what are some of the boundaries where you want to keep your application logic maybe in a separate process that uses Transact in a subprocess, or just some of the ways that that factors into the overall system design and implementation workflow?
[00:22:18] Jeremy Edberg:
So we have designed the system so that you can, if you want to, do it all in a single file. You can have a single process that does the whole thing. Now, obviously, if it's very big, you might want to separate your application into multiple applications, multiple microservices basically, for separation of concerns, or perhaps your databases are separate for scaling. Generally, you would use the same design considerations that you would use if you're building microservices: does it make sense to put a boundary here? But you can certainly start with a single file and then take particular workflows and move them to somewhere else if that makes sense. When you're building with durable execution, you generally wanna think in workflows. So, what things are gonna happen and what are the series of steps that are gonna be involved. A workflow fundamentally is just defined as a series of steps. Each step is executed, and then we store the inputs and the outputs of each step so we can rerun them as necessary. And so that is the fundamental basis of where you start your design: what are the workflows, what are the steps. Then where is the data coming from? The answer is usually a database, and it doesn't even have to be a Postgres database. It could be any database. We've definitely had people that are attaching to MongoDB or whatever it is, their favorite key value store, DocumentDB. It doesn't really matter. It's still just a regular application. You just get extra protection if you're using Postgres as your data store. And so that's gonna be your boundary. It's gonna be where's your data, and then this is the application that's going to move the data from one place to another. And if you wanna scale that independently, then you would separate it into two applications.
One that runs some workflows, one that runs another. If you have, like, cron jobs in there, you may want to make a separate application that's just the cron jobs, for example. But it doesn't particularly matter because the executor is going to be resident. It's gonna execute. It's going to re-execute as things fail. The main thing that you have to be responsible for is restarting that executor if it fails or moving those workloads. And that's where our commercial product comes in and does that stuff for you. Or if you use our cloud, of course, all of that is taken care of for you as well.
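For the "separate application that's just the cron jobs" pattern, a small sketch of what a scheduled workflow can look like. `@DBOS.scheduled` and its two datetime arguments are taken from DBOS documentation conventions and should be treated as assumptions here.

```python
# Hypothetical cron-only application; @DBOS.scheduled and its signature are
# assumed from DBOS documentation conventions.
import threading
from datetime import datetime

from dbos import DBOS

DBOS()


@DBOS.step()
def refresh_report(as_of: datetime) -> None:
    print(f"refreshing nightly report as of {as_of.isoformat()}")  # stand-in for real work


@DBOS.scheduled("0 2 * * *")  # every day at 02:00
@DBOS.workflow()
def nightly_report(scheduled_time: datetime, actual_time: datetime) -> None:
    # Durable scheduling: a missed or crashed run is recovered rather than lost.
    refresh_report(scheduled_time)


if __name__ == "__main__":
    DBOS.launch()
    threading.Event().wait()  # keep the scheduler process alive
```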
[00:24:31] Tobias Macey:
Circling back now to that version management piece, one of the complexities in the software ecosystem is that everything is in a constant state of flux. Even if you want to, for instance, run some piece of software from twenty or thirty years ago, you might not be able to because you might not be able to access all the dependencies unless you very intentionally created a sort of hermetic environment for all of it. And so in the context of DBOS, for applications that are pulling in other dependencies, so if you're using Python and you're relying on NumPy for the execution and then you upgrade the version of NumPy or change some of the business logic, but you want the workflow that kicked off six months ago to continue executing, I'm wondering how you actually manage the versioning of the code and the ability to maintain the version of code that was started for a workflow to ensure that it continues to completion while also being able to deploy newer versions.
[00:25:25] Jeremy Edberg:
Okay. So I'll start with our cloud. Our cloud does that hermetic sealing. Right? So we actually install the libraries for you based on your requirements.txt or, you know, the way you presented to us the libraries with the version numbers. And so your version on our cloud is that entire package of your software plus all of its dependencies, frozen at the point where you upload them or what you've chosen as the freezing point. If you are running it yourself, then you do need to manage that for yourself. You have to manage your dependencies the same way that you would in any other distributed system without durability. Right? You're still gonna have to manage that; it's a problem no matter what: things getting upgraded underneath, stuff like that. So you still have to manage that yourself if you are managing it yourself.
But if you're using our commercial products, then we help you manage that. I guess it's the best way to say it. But otherwise, yes. So you're gonna have to manage it the same way that you manage any other software upgrade in any other distributed system. You mentioned the capabilities
[00:26:28] Tobias Macey:
in your commercial offering. We've mentioned the transact library being open source. I'm curious how you approach the heuristics of what are the capabilities that belong in the open source, and what is the boundary for what is part of the commercial offering and some of the ways that you think about the sustainability and ecosystem investment around the open source component and how that can coincide with the commercial, focus of running a sustainable business.
[00:26:57] Jeremy Edberg:
So the boundary is: Transact will help you build durable software. It'll always be anything regarding workflows, steps, etcetera. DBOS commercially will help you operate that software. So that is the boundary. We help you operate it well and reliably, and we give away for free the ability to build software that can be reliably operated. So you can do it on your own if you want, or you can pay us some money, and we will help you do it or do it for you, depending on whether you use our cloud or not. And so the boundary is operations versus build: build for free, and we'll help you operate it well. And in your experience
[00:27:37] Tobias Macey:
of building DBOS and the Transact library, you mentioned some of the use cases. I'm just curious. What are some of the most interesting or innovative or unexpected ways that you've seen that technology stack used?
[00:27:48] Jeremy Edberg:
Yeah. So I told you about the back end application. And honestly, that's not super innovative, but it's just really interesting to me that there's a life and death situation there. The most interesting ones have been these data pipeline use cases. It's the AI training ones, and we have a quote on our website from a customer who was attempting to move data from one CRM to another. And they were trying to build this for months, and they couldn't do it until they found our durable execution library, and then they were able to build it in two days. And so that's probably the most interesting use case, the one where the once-and-only-once semantics are the most important thing to you. I want to make sure that this data actually got there, but I don't want duplicates. And then the other interesting one is the agentic AI. Think about a customer service agent where you can ask an AI for a refund. And if it's a small amount of money, the AI just gives you a refund. But if it's a large amount of money, a human needs to approve that. And that human in the loop part is interesting because it could take seconds, it could take hours, it could take days for that human to approve that. And so that workflow needs to hang open and running, but the way that we've designed it is there is no hanging open. Right? That step is completed, and we're waiting for the next step. And whenever that comes in, that is when we will resume execution of this workflow. And so these LLM use cases, we end up saving the customer a ton of money because they don't have to sit there waiting for these answers to happen.
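A rough sketch of that refund flow. `DBOS.recv` and `DBOS.send` follow the workflow-messaging conventions in the DBOS documentation, and the approval threshold and helper steps are invented for illustration; the key idea is that "waiting for a human" becomes a checkpointed pause rather than a process held open.

```python
# Human-in-the-loop refund sketch. DBOS.recv/DBOS.send and the decorators
# follow DBOS documentation conventions; the limit and the two steps are
# hypothetical application logic.
from dbos import DBOS

DBOS()

AUTO_APPROVE_LIMIT_CENTS = 5_000


@DBOS.step()
def issue_refund(order_id: str, amount_cents: int) -> None:
    print(f"refunding {amount_cents} cents for order {order_id}")  # stand-in for the real call


@DBOS.step()
def notify_approver(order_id: str, amount_cents: int, workflow_id: str) -> None:
    print(f"asking a human to approve {amount_cents} cents for {order_id} ({workflow_id})")


@DBOS.workflow()
def refund_request(order_id: str, amount_cents: int) -> str:
    if amount_cents <= AUTO_APPROVE_LIMIT_CENTS:
        issue_refund(order_id, amount_cents)
        return "auto-approved"

    notify_approver(order_id, amount_cents, DBOS.workflow_id)
    # The workflow checkpoints here; nothing sits open burning CPU. Whenever
    # the decision arrives (seconds or days later), execution resumes.
    decision = DBOS.recv(topic="approval", timeout_seconds=7 * 24 * 3600)
    if decision == "approved":
        issue_refund(order_id, amount_cents)
        return "approved"
    return "denied"


def record_decision(workflow_id: str, decision: str) -> None:
    # Called from the approval endpoint when the human clicks approve/deny.
    DBOS.send(workflow_id, decision, topic="approval")
```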
[00:29:20] Tobias Macey:
In the workflow stage completion piece, another thing that we didn't really talk through yet is some of the triggering of what is the actual signal for the next step to execute and some of the ways that Transact handles that trigger management to say, okay, the step completed, now I need to do this thing. Or in the case of what you were just saying where there's the human in the loop of: I completed this step, and I sent some sort of notification. Now I'm just going to wait, and now I need some other signal back in to be able to trigger the next step of it, and just some of that signal management and sequencing.
[00:29:50] Jeremy Edberg:
Yeah. So, typically, the signal is the next API call. Right? It's the response. How the response comes in depends, of course. It could be a webhook. It could just be a response on a request. And, yes, there is a listener listening for that, but we can do it very efficiently in our cloud because we can be listening for thousands of responses at once. But none of those are running, so there's no real CPU time being used. But, yeah, it's typically an incoming API call that triggers the next step. It's the same as if you didn't have durable computing. It's just that you don't have to have a listener of your application running and waiting for that response. There's just a generic listener waiting for incoming API requests, and then that triggers a workflow based on what it's a response to. In the context of these
[00:30:36] Tobias Macey:
LLM use cases, data pipelines, for people who are adopting DBOS and the Transact library, I'm wondering how you're seeing that influence the rest of the architectural decisions that they make. Like, are there components of their data stack that they're able to retire because of the capabilities that DBOS offers, or are there just cases where they were using
[00:30:58] Jeremy Edberg:
maybe an Airflow or some other sort of orchestrator, but they move pieces of that to DBOS for managing the workflows that require that reliability, or just some of the ways that it influences the overall architecture decisions that they either have already made or would otherwise make. Yeah. We've definitely seen people who have been able to drop their Airflow or at least turn it down, and their Kafka queues. That's a big one. They don't need their Kafka queue anymore. In some cases, they still use the Kafka queue as a backup, and then after a while, they shut it down realizing, oh, we don't need it. It's not doing anything anymore. So there's definitely that case. I'd say the biggest one is queues getting shut down. We don't need all these extra queues. We don't need these extra processes that are monitoring these queues, that kind of thing. That is probably the biggest change that people make. Oh, and their CI pipelines. They can reduce the steps in their CI pipelines because they can depend on DBOS to take care of it for them. Another complexity
[00:31:55] Tobias Macey:
of executions that have time as a component is testing, which I know that there are libraries in different language ecosystems that will simulate passage of time. But from a testing perspective, I'm just curious how you've also addressed that in DBOS and transact to allow people to verify that the workflows that they are building are actually
[00:32:18] Jeremy Edberg:
correct, or at least correct to the point of the effort that they're willing to put into validating it. So we have an interesting testing story because, one, you can run everything locally. So you can speed up time, right, to test your workflows. And then the other really interesting thing is we have a time travel debugger. Because of the way that we store every input and output, you can step forward and backward through actual inputs and outputs. So you can use this for testing locally, debugging something that didn't work right. You can go back and replay the inputs and outputs, see where it failed. And in production, if you get a bug in production in a microservices system, usually your options are: hope that you had sufficient logging to catch it, turn on sufficient logging to try to catch it in the future, or try to figure out how to repeat it. But with the time travel debugger, you don't need any of that. You can literally replay the actual production bug that happened. So that's a big one. That's sort of a nice side effect of the way that we store all of the state: it allows you to replay that state. So that's a big one. But, yeah, as far as testing goes, it's still gonna be pretty similar to the way it was before, where you either run a version that has a much shorter time span in between,
[00:33:25] Tobias Macey:
or you speed up time to see what happens, that kind of thing. And in your own work of building DBOS, building the transact library, working with the community, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:33:39] Jeremy Edberg:
Testing these things is still very hard, even if you have the right libraries. And in part, that's because it's still very hard to reason about distributed systems over time. And it's hard to build tests to test things that happen over time. Right? How do you test for race conditions? It's not easy. Right? You have to cause them. So I'd say that's probably the biggest one. Otherwise, the community has been great. It's mostly just been about features. I would like to observe this, or cron jobs was a feature that was added because somebody said, I want to schedule a workflow at a particular time, and we're like, oh, wait a second, that's literally a cron job. Let's just build that right in because we can totally do that. So, yeah, some of the features have definitely come out of community suggestions.
And then just building our own demo apps has caused us to add new functionality to it. For people who are
[00:34:33] Tobias Macey:
interested in the reliability guarantees, the sequencing, workflow management, what are the cases where DBOS is the wrong choice?
[00:34:42] Jeremy Edberg:
So the wrong choice would be something that is extremely compute heavy. Video transcoding is a perfect example of something that would definitely not be a DBOS use case. If you have a workflow where your CPU is pegged at 100% all the time, it's probably not the right use case. Our use cases are the ones where I need a bunch of data, and then I need to manipulate it and do something with it. And so there's pauses in the workflow. So that would probably be the best answer to what doesn't work. And as you continue to build and iterate on the capabilities
[00:35:14] Tobias Macey:
of the open source and commercial offerings, what are some of the things you have planned for the near to medium term or any particular projects or capabilities you're excited to explore?
[00:35:23] Jeremy Edberg:
Yeah. So in the nearest term, we have a product coming out in just a couple of weeks, in April, that will allow you to actually run your Transact workloads outside of our cloud. It'll be called Conductor, and it will be the product that manages your executors, making sure that they're running and the workloads are moving and so on and so forth. All the stuff that we provide today in our cloud, you'll be able to do in your infrastructure. That's the most immediate. After that, there are some interesting possibilities. So, obviously, more languages, more language support. There's a lot of really interesting security implications because of the way that we keep track of the state and the inputs and the outputs.
There was one experiment where we were able to detect a fake intrusion much faster than standard intrusion detection tools because we could see it in the database. Right? We could see the erroneous inputs and outputs. So there's interesting security implications. There's certainly a lot of AI implications. There's the training and the agentic AI and things like that. So we'll definitely build out that capability. Those are probably the two biggest ones right now going forward. Beyond that, we are a reliability company, so we will probably start adding in all the reliability products. Right? Chaos testing or remediation
[00:36:46] Tobias Macey:
or things of that nature. Are there any other aspects of the work that you're doing on DBOS, the Transact library capabilities, or the overall space of durable compute that we didn't discuss yet that you would like to cover before we close out the show? I guess the main thing is that the term itself, durable compute, is fairly niche.
[00:37:03] Jeremy Edberg:
And so there isn't anything that we didn't discuss, but this may be a new term to some of your listeners. And so I would encourage you to go out and learn more about it and realize that it's very much a new name for some old concepts.
[00:37:17] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:37:33] Jeremy Edberg:
The biggest gap in data management. I think the biggest gap is still around cleanliness. As everyone who deals with data knows, data cleanliness is gonna be the biggest problem you have to solve. From the most basic, where this person called it f_name and that person calls it first_name, to much more complicated problems. And I think there's definitely a missing set of tooling around that, but I also think that the latest LLMs or even SLMs can help solve a lot of that. Right? Funny enough, I was just helping someone with this problem yesterday where they had to compare two lists of names, and I was able to use an LLM to ask it, are these two names the same?
And it was more accurate than even, like, a fuzzy matcher, for example. And so I think that is where the tooling lacks, but it will improve quickly with the new tools that are available to us. The biggest issue is, of course, the cost of the compute of asking it, are these two names the same?
[00:38:36] Tobias Macey:
Absolutely. Yep. Let's swat this fly with a sledgehammer.
[00:38:41] Jeremy Edberg:
Yeah. Exactly. Exactly.
[00:38:43] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on DBOS and the investments that you're making into durable compute and making that available to people to build these reliable workflows. So I appreciate the time and energy you're putting into that, and I hope you enjoy the rest of your day. Thank you. Thanks for having me. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. DataFold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafolds today for the details.
Your host is Tobias Macy, and today, I'm interviewing Jeremy Edberg about durable execution and how it influences the design and implementation of business logic for data systems. So, Jeremy, can you start by introducing yourself?
[00:01:00] Jeremy Edberg:
Hi. My name is Jeremy Edberg. I'm currently the CEO of Deboss. Previously, I've worked at places like Reddit, Netflix, PayPal, Apple, eBay, Amazon. And do you remember how you first got started working in data? Well, I've kind of been working in data the whole time. So I was the first employee at Reddit, and, obviously, that involved a lot of data. So by necessity, had to manage a cluster of Postgres databases, which I later found out was the largest cluster of Postgres databases on AWS for a while. Yet Netflix, it was kind of the same. I was there in charge of reliability, and I moved to the data team because data is a huge part of Netflix, and making the system reliable revolved around making the data systems reliable and fast and easy to use.
[00:01:45] Tobias Macey:
So I I've sort of been involved in data my through my whole career and even before that. And as you mentioned, you're currently running your company. You said DBOS? I wasn't sure if it was DBOS or DBOS.
[00:01:56] Jeremy Edberg:
Yeah. So funny okay. So it started as DBOS because it was originally going to be an operating system based on a database. So using a database as the the bottom layer instead of a file system. The the founders moved away from that model as they realized it was very hard, would take a long time, and wasn't commercially viable to start. And instead used their learnings from their research into that model to build the serverless platform and orchestrator that we offer today. So our systems are based on that concept.
[00:02:31] Tobias Macey:
And so digging a bit more into what you're building and also, a little bit about this concept of durable execution, I'm wondering if you can just describe some of the overview about what you're trying to solve and some of the system that you're building around it to address that problem?
[00:02:47] Jeremy Edberg:
Yeah. So durable execution is a concept that's very old, but the term is fairly new. You've probably dealt with it with reliable pipelines or, error correction, things like that. But, really, all it means is executing in a way where the software does what it's supposed to do, and in our case, exactly once. Right? So exactly once has been a very tough problem to solve since we've moved to distributed systems, and durable computing is intended to solve that problem across distributed systems.
[00:03:18] Tobias Macey:
And some of the other contexts, at least in recent memory, that I have heard this concept of durable execution, whether using that specific terminology or not, is largely around the temporal project, which was an outgrowth of the orchestration and, log running execution systems from Uber. And I'm wondering if you can talk to some of the ways that what you're building is either related to or disparate from the concepts that are built into temporal and the capabilities that it focuses
[00:03:49] Jeremy Edberg:
on? Yeah. So temporal is definitely the, the 800 pound gorilla in the durable execution space. And we are building a system that acts similarly, but is built completely differently. So we solve the same problems that they do, with once and only once execution, long running workflows, so on and so forth. But we do it in a very different way. We store everything locally. Our system is designed to be much more locally resilient, so you don't need an, an executor that's remote where your reliability depends on another system. So that's how we significantly differ from temporal. And
[00:04:26] Tobias Macey:
from an architecture and application design and business logic perspective, obviously, having durable execution as the core substrate changes the way that you would think about the problem versus if you were just building it all from scratch, if you're doing just arbitrary Python scripts or trying to build a distributed pipeline system where you have a sequencing of tasks, whether that's from cron jobs or CI pipelines or airflow, etcetera. I'm just curious how that inclusion of durable execution, the exactly one semantics changes the ways that engineers think about what problems they can solve and how they address solving those problems.
[00:05:04] Jeremy Edberg:
Yeah. So so what's really interesting about it is, if you're building your distributed system in the quote, unquote correct way, you are already building it to be almost durable. And the way we operate is we just add some decorators, and so that lets you decorate your existing code. But starting from scratch, what it enables is you don't need a lot of the error checking whoops code that you would normally need. For example, you don't need all the transaction rollback. What do I do if somebody made an order but the payment failed? Now I have to put the inventory back. You don't have to worry about that because it's already a series of work steps and the workflow will get automatically rolled back. So that inventory will just go back. So you don't even have to worry about that. So the way it significantly changes is it makes programming much easier to reason about. And in some cases, we've seen it reduces code by over 90% because you don't need all of those, edge cases, branches, error corrections because it's all just built right into well, if I can guarantee that this will run either exactly one time or zero times, I don't have to worry about a lot of that other stuff. From a data pipeline perspective, as you mentioned, if you're doing it correctly, you're already bringing in some of those semantics of using things like semaphores
[00:06:16] Tobias Macey:
or doing check before right, etcetera. I'm wondering how the approach for data pipelines in particular changes
[00:06:24] Jeremy Edberg:
across multi step workflows, across different execution contexts when it's relying on something like DBOS for that exactly one semantics and the failure recovery, the ability to resume execution, things like that? Yeah. By using DBOS and the concepts of durable compute, you can eliminate a lot of the extra things that you need to do on your data pipeline. You can eliminate the check at the end whether the data got through. You can eliminate a lot of the queues. We support queues, but you can eliminate a lot of the external queues because you don't need to worry so much about it anymore. And so that's the big thing with the Symantec is a lot of our customers are actually doing this. Make sure that this chunk of data gets from point a to point b, and it happens exactly one time, and it did happen. And so by using durable compute, you can eliminate a lot of the extra infrastructure around your data pipeline that is there to make sure that the data pipeline actually worked. And things like you that you mentioned, the semifours, that stuff, it's all it's built in because it's part of the durability of the library. You mentioned
[00:07:31] Tobias Macey:
too in the context of temporal, how Deepass is more self contained. One of the major challenges of building data systems or complex application architectures, particularly bringing in things like service oriented architecture or microservices, is the operational overhead of managing them and managing coordination across them. What are some of the ways that DBOS addresses some of that operational overhead and operational complexity of these distributed architectures, particularly around synchronization and state management?
[00:08:03] Jeremy Edberg:
So you said it at the end best. Right? It's the statefulness, and it's the we're strong believers in put everything in Postgres. And so what we have is a metadata database, and that helps you eliminate a lot of that operational complexity. You can technically build a fully functional operational reliable system with a single file using the semantics of our library. And so you don't have to worry about all those other systems, managing those other systems, making sure they're scaling. All you have to do is scale your executors and your Postgres database. And both of these are well solved problems. Right? People have been scaling databases for fifty years and Postgres for forty. And same with execution. Right? People have been scaling execution for at least last ten, fifteen plus years. And so those are the only two things you really have to worry about. There's very little database overhead, and so the system works really well because you're already scaling your database because it's where your data is. And so that's why this eliminates a lot of that operational overhead because you don't have to worry about those extra systems.
[00:09:08] Tobias Macey:
When we're talking about durable execution, workflow management, a lot of times, data pipelines or systems that people are typically interacting with are executing on the scale of seconds to hours, maybe days at most. But, obviously, there are workflows particularly in large and complex organizational contexts where the workflow might take weeks, months, years to complete. And, obviously, that brings in a lot of challenges around version management, change management, where you have a workflow, you kick it off, you'd you know that it's not going to complete for up to a year, but you also need to be able to evolve your system within that time because you can't just freeze everything, and you don't want to necessarily stamp out multiple copies where each copy corresponds to a change that you're trying to release. And I'm curious how that factors into the way that you think about what workflows to incorporate in DBOS or a durable execution framework, how to manage that change of version, how to manage the change of state representation, and some of the ways that you're thinking about that in the ways that you're designing and implementing the DBOS substrate.
[00:10:20] Jeremy Edberg:
Mhmm. Yeah. So versioning is key, and versioning is a first-class citizen within the Transact library, for the exact reason that you just pointed out: you don't want your workflow to change versions in the middle, or maybe you do, but you wanna have control over that. You don't want it to change when you're not expecting it to. And so a very fundamental part of the Transact library and the services that we offer is managing versions. By default, every workflow will finish on the version of the software that it started on. That is the default behavior, because we assume that's the expected behavior for most people.
You can manually change it, though. If you're in the middle of a workflow and you've realized you had a bug, you upload some new code, you can say this workflow now finishes on this new piece of code. That is certainly available to you. So versioning is key. It's first class, and everything in the system is organized around those versions to make sure that the state data is tied to the version but can be changed if you want to.
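As a rough illustration of that pinning behavior (not the DBOS internals, just the shape of the idea): a workflow record carries the code version it started on, recovery only picks up workflows that match the executor's version, and an operator can explicitly reassign a workflow to newer code.

```python
# Not the DBOS implementation, just a sketch of the version-pinning idea.
from dataclasses import dataclass


@dataclass
class WorkflowRecord:
    workflow_id: str
    app_version: str          # version of the code the workflow started on
    status: str = "PENDING"


def claim_for_recovery(records: list[WorkflowRecord], executor_version: str) -> list[WorkflowRecord]:
    # Default behavior: a workflow finishes on the version it started on, so an
    # executor only resumes workflows whose recorded version matches its own.
    return [r for r in records if r.status == "PENDING" and r.app_version == executor_version]


def reassign_version(record: WorkflowRecord, new_version: str) -> None:
    # Manual override: "this workflow now finishes on this new piece of code."
    record.app_version = new_version


pending = [WorkflowRecord("wf-1", "v1"), WorkflowRecord("wf-2", "v2")]
print([r.workflow_id for r in claim_for_recovery(pending, "v2")])   # ['wf-2']
reassign_version(pending[0], "v2")
print([r.workflow_id for r in claim_for_recovery(pending, "v2")])   # ['wf-1', 'wf-2']
```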
[00:11:21] Tobias Macey:
So now digging into DBOS, you mentioned the Transact library. We can dig a little bit more into the specifics of that change management as we go deeper. But I'm wondering if you can talk through what the overall design and architecture of DBOS is. What is the role of the Transact library in the broader context of the commercial offering, and what are some of the ways that the scope and focus have changed from when you first started building out these capabilities?
[00:11:51] Jeremy Edberg:
Right. So the Transact library is the free, open source, MIT-licensed library that gives you durability. It is the one that handles creating that metadata database and managing all the metadata in the database. So, for the nerds who care, we essentially keep a record of every piece of data that's changed and which transaction ID in Postgres made that change, so that we can roll back to any particular transaction ID and say all of this happened after, so undo it. And what this lets you do is, if you are storing your data in the same database, the whole thing can be wrapped in a database transaction.
So it either completes or it doesn't, and then it's all stored together. If you're using a third party, you can get idempotency keys, which are another thing the DBOS library provides management of, and they're stored with your data so that, assuming that third party offers you some functionality to roll stuff back, you can still do that. And so that is what the core library offers. It offers these primitives. It offers crons, queues, a few other things that you might want if you're building software that's based on data manipulation.
And those are all in the library, and that can all be run locally. That can all be tested locally. What we offer commercially is the ability to run that as a production workload reliably, securely, observably. So you get a bunch of observability tools. You get a single-pane-of-glass management system to see all the workflows and to manage the workflows. You get a system that moves the workflows from one executor to another if they fail; it manages guaranteeing that the workflow finishes across multiple executors. Transact will be on a single executor. It'll make sure that everything happens on that single executor.
Our commercial offering helps you manage it across executors as well as giving you observability and a lot of data around your workflows.
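Going back to the idempotency keys Jeremy mentioned for third-party calls: my understanding is that in the Python library you supply the key by pinning the workflow's ID, so a retried request with the same key deduplicates to a single execution. A hedged sketch, with a hypothetical payment step:

```python
# Sketch only: SetWorkflowID reflects my understanding of how the DBOS
# Transact Python API takes an idempotency key; the payment step is hypothetical.
from dbos import DBOS, SetWorkflowID

DBOS()  # assumes configuration comes from the project's DBOS config


@DBOS.step()
def charge_card(order_id: str, amount_cents: int) -> str:
    # Hypothetical third-party call; its result is checkpointed, so a retried
    # workflow reuses the stored result instead of charging the card again.
    return f"charge-{order_id}"


@DBOS.workflow()
def checkout(order_id: str, amount_cents: int) -> str:
    return charge_card(order_id, amount_cents)


if __name__ == "__main__":
    DBOS.launch()
    # Reusing the same workflow ID acts as the idempotency key: calling this
    # twice (say, from a retried HTTP request) executes the workflow once.
    with SetWorkflowID("order-12345"):
        print(checkout("12345", 4999))
```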
[00:13:53] Tobias Macey:
And so digging more into the Transact library and the ways that you're thinking about its application and the ecosystem that it sits within, to my understanding, it's a Python library. So, obviously, that means that you're betting heavily on the Python ecosystem and on people needing the durable execution context within the Python ecosystem. And I'm wondering what are some of the ways that you're thinking about the types of use cases and the types of applications that are the target here, and any potential for expanding to other language ecosystems as you establish your foothold within Python?
[00:14:36] Jeremy Edberg:
Yeah. So just to clarify, we offer both a Python and a TypeScript library. It's the same semantics in both, and you can use either one. And TypeScript will support, you know, Node.js, it'll support Next.js, all of your favorite JS libraries. So that is one important thing. But what we're targeting really, or I should say, we're targeting every workload, but the ones that have shown up the most are data pipeline workloads (I need data to get from point A to point B) and AI workloads.
So I need to do a training run, and I need to guarantee that these 10,000 documents have all been trained on. Agentic AI: I need a human in the middle, and so that could be a long-running workflow. So those use cases all work for us. Although we do even have random people doing back-end application development. We have one person who manages hunting areas in the American South. It's literally a life-and-death situation, because if you have not reserved your hunting area, you could get shot by somebody who has, and that is running on our system. So it's a wide range of use cases, but we started with TypeScript because a lot of people were building full-stack applications using TypeScript in their back end. We then moved on to Python because of the AI use cases, the data pipeline use cases, the fact that NumPy exists, SciPy, all of those massive data manipulation libraries, and, you know, those create long-running workflows. Right? It could take hours or days to analyze your data. And so you wanna have checkpointing there so you can restart it and so on and so forth. For the next couple of languages, we don't have a road map for any particular language yet, but the most popularly requested are Go and Java.
So probably one of those next, most likely Go because of the complications of Java. But right now we're going with those two for those use cases, and then whatever people want next.
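For the "guarantee that all 10,000 documents were processed" case, a queue-based workflow is the usual shape. The sketch below is loosely based on the DBOS Transact queue API; the Queue constructor arguments and handle methods are assumptions.

```python
# Sketch only: Queue arguments and handle methods approximate the DBOS
# Transact Python API and may differ in detail.
from dbos import DBOS, Queue

DBOS()  # assumes configuration comes from the project's DBOS config

embed_queue = Queue("embed_queue", concurrency=10)  # assumed constructor signature


@DBOS.step()
def embed_one(doc_id: int) -> str:
    # Placeholder for the expensive per-document work (embedding, training, etc.).
    return f"embedded-{doc_id}"


@DBOS.workflow()
def embed_document(doc_id: int) -> str:
    return embed_one(doc_id)


@DBOS.workflow()
def embed_corpus(doc_ids: list[int]) -> list[str]:
    # Enqueue every document, then wait for all results. Because enqueues and
    # results are checkpointed, a crash mid-run resumes without re-enqueueing
    # documents that finished or silently dropping the rest.
    handles = [embed_queue.enqueue(embed_document, doc_id) for doc_id in doc_ids]
    return [handle.get_result() for handle in handles]


if __name__ == "__main__":
    DBOS.launch()
    print(len(embed_corpus(list(range(100)))))
```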
[00:16:38] Tobias Macey:
For cases where people are leveraging both the Python and the TypeScript SDKs, what are some of the ways that they approach the cross-language boundaries for workflows that need to span different process environments, where maybe they have something in the browser or in a Node runtime that also needs to interact with workflows that have execution context within that Python ecosystem?
[00:17:04] Jeremy Edberg:
That is a very interesting question. So in reality, no one is doing that, because of all the complications you just mentioned. Anybody who's using both languages, if they are, it's two applications that talk to each other via a standard API. Mainly because you don't want to be sharing those databases, because of the different semantics of where and how data is used. Right? So it's like a standard microservices thing. You generally don't wanna use the same database for multiple microservices, because then you don't really have microservices.
Two languages are basically two different microservices. So that would be a bad practice, and luckily, no one that we know of is actually doing it.
[00:17:47] Tobias Macey:
Another interesting parallel in computer science, and just the ways that people think about designing systems: there's durable execution, which helps you with failure recovery, as in the process died and now I need to start back up from where I left off. Another parallel for managing transitions and sequencing within workflows is the idea of a state machine. And I'm wondering what are some of the areas of overlap between state machines, the ways that those are incorporated into application architectures, and the guarantees that are offered by durable execution, and some of the ways that you're actually using state machines within DBOS to manage some of those transition requirements.
[00:18:21] Jeremy Edberg:
Yeah. I mean, you could think of every workflow as a big state machine. Right? There's a state and it changes, and we track how it's changed over time. That is actually one really good way to reason about a workflow: it's a series of steps in a state machine, and we can always determine the current state of the state machine by querying the metadata database, which, interestingly, is completely open and available to the user. So the user is certainly able to build any query they want against that database to expand the functionality or get whatever observability they need in particular. There's nothing secret there.
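Because that metadata lives in ordinary Postgres tables, "what state is every workflow in" is a plain SQL question. In the sketch below, the schema, table, and column names are assumptions used only for illustration, not a documented contract; the point is simply that nothing stops you from querying the metadata directly.

```python
# Illustration only: the table and column names (dbos.workflow_status,
# workflow_uuid, status, created_at) are assumptions, not a documented contract.
import psycopg

with psycopg.connect("postgresql://localhost/my_dbos_app") as conn:
    rows = conn.execute(
        """
        SELECT workflow_uuid, name, status
        FROM dbos.workflow_status
        WHERE status NOT IN ('SUCCESS', 'ERROR')
        ORDER BY created_at DESC
        LIMIT 20
        """
    ).fetchall()
    for workflow_uuid, name, status in rows:
        print(workflow_uuid, name, status)
```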
[00:18:58] Tobias Macey:
And you mentioned that Postgres is the persistence layer. So, obviously, you're betting a lot on its reliability, its transaction isolation, and its scalability. And I'm wondering what are the aspects of Postgres as an engine that led you to selecting it for such a critical piece of the architecture and maybe any other engines or storage layers that you considered on the path to that ultimate decision?
[00:19:26] Jeremy Edberg:
Yeah. So Postgres was selected in part because our cofounder created Postgres. But we did look at others, or I should say they did, because it was before I joined. They looked at other storage engines. I don't wanna say their names, but they did look at other storage engines, and none of them have the bulletproof reliability that forty years of Postgres gives. Postgres was settled on mostly because it's just bulletproof. It's known to be bulletproof. It's been around a long time. There's a lot of research that has gone into making it stronger and better, and a lot of people attacking it to make sure that it works. So it was just a good solution. There are good solutions for reliability there, for shipping data back and forth, lots of experience. So it just made a lot of sense. And most applications can be run on one or two Postgres databases.
There are very few applications that require more than that.
[00:20:20] Tobias Macey:
And for applications that do require more horizontal scalability, there are systems such as CockroachDB and Yugabyte that have at least interface compatibility with Postgres. I'm wondering what your thoughts are on the viability of using one of those options or something similar in place of core Postgres, and whether that compromises the guarantees that you're focused on with vanilla Postgres at the core of your architecture.
[00:20:47] Jeremy Edberg:
Yeah. So you can use any Postgres-compatible database, but they have to actually be Postgres compatible. There's a range of Postgres compatible. Right? There's some that claim to be, but really they only support, you know, 70% of the actual functionality. That's not gonna work. But the ones that do support it fully do work, and we do have people using other databases. We have people on Supabase. We have a partnership with Supabase. We have NeonDB. We've had someone try it with Cockroach, but we have not; they disappeared. So we don't certify databases, but we have tried it on other databases.
And we can, if we get enough requests, try it on a database to see if it truly is Postgres compatible. But as long as they truly are, it'll work just fine. You could try Cockroach. We've talked to all of these folks, and at least somebody has tried it on them.
[00:21:38] Tobias Macey:
And so for people who are building a system, they need guaranteed exactly-once semantics. They need reliability, resumability. I'm wondering if you can talk through the process of actually building a system with DBOS and just what that looks like from a program design perspective: some of the ways that they should be thinking about what the touch points are to the Transact library and the execution context for the actual workload components, what some of the boundaries are where you want to keep your application logic maybe in a separate process that uses Transact in a subprocess, or just some of the ways that that factors into the overall system design and implementation workflow?
[00:22:18] Jeremy Edberg:
So we have designed the system so that you can, if you want to, do it all in a single file. You can have a single process that does the whole thing. Now, obviously, if it's very big, you might want to separate your application into multiple applications, multiple microservices basically, for separation of concerns, or perhaps your databases are separate for scaling. Generally, you would use the same design considerations that you would use if you're building microservices: does it make sense to put a boundary here? But you can certainly start with a single file and then take particular workflows and move them somewhere else if that makes sense. When you're building with durable execution, you generally wanna think in workflows. So, what things are gonna happen, and what are the series of steps that are gonna be involved? A workflow fundamentally is just defined as a series of steps. Each step is executed, and then we store the inputs and the outputs of each step so we can rerun them as necessary. And so the fundamental basis of where you start your design is: what are the workflows, what are the steps? Then, where is the data coming from? The answer is usually a database, and it doesn't even have to be a Postgres database. It could be any database. We've definitely had people that are attaching to MongoDB or whatever it is, their favorite key-value store, DocumentDB. It doesn't really matter. It's still just a regular application. You just get extra protection if you're using Postgres as your data store. And so that's gonna be your boundary. It's gonna be, where's your data, and then this is the application that's going to move the data from one place to another. And if you wanna scale that independently, then you would separate it into two applications.
One that runs some workflows, one that runs another. If you have, like, cron jobs in there, you may want to make a separate application that's just the cron jobs, for example. But it doesn't particularly matter, because the executor is going to be resident. It's gonna execute. It's going to re-execute as things fail. The main thing that you have to be responsible for is restarting that executor if it fails or moving those workloads. And that's where our commercial product comes in and does that stuff for you. Or if you use our cloud, of course, all of that is taken care of for you as well.
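For the cron-job case, a scheduled workflow looks roughly like the sketch below. The @DBOS.scheduled decorator and its callback signature reflect my understanding of the library and may differ in detail.

```python
# Sketch only: the @DBOS.scheduled decorator and its (scheduled_time,
# actual_time) callback signature are my understanding of the DBOS Transact
# Python API and may differ in detail.
import threading
from datetime import datetime

from dbos import DBOS

DBOS()  # assumes configuration comes from the project's DBOS config


@DBOS.scheduled("0 * * * *")  # top of every hour
@DBOS.workflow()
def hourly_rollup(scheduled_time: datetime, actual_time: datetime) -> None:
    # Because each scheduled run is itself a durable workflow, a crash during
    # the rollup gets resumed rather than silently skipped.
    DBOS.logger.info(f"rollup scheduled for {scheduled_time}, started at {actual_time}")


if __name__ == "__main__":
    DBOS.launch()
    threading.Event().wait()  # keep the process alive so scheduled runs can fire
```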
[00:24:31] Tobias Macey:
Circling back now to that version management piece, one of the complexities in the software ecosystem is that everything is in a constant state of flux. Even if you want to, for instance, run some piece of software from twenty or thirty years ago, you might not be able to, because you might not be able to access all the dependencies unless you very intentionally created a sort of hermetic environment for all of it. And so in the context of DBOS, for applications that are pulling in other dependencies, if you're using Python and you're relying on NumPy for the execution, and then you upgrade the version of NumPy or change some of the business logic, but you want the workflow that kicked off six months ago to continue executing, I'm wondering how you actually manage the versioning of the code and the ability to maintain the version of code that a workflow was started with, to ensure that it continues to completion while also being able to deploy newer versions.
[00:25:25] Jeremy Edberg:
Okay. So I'll start with our cloud. Our cloud does that hermetic sealing. Right? So we actually install the libraries for you based on your requirements.txt or, you know, the way you presented the libraries to us with the version numbers. And so your version on our cloud is that entire package of your software plus all of its dependencies, frozen at the point where you upload them, or whatever you've chosen as the freezing point. If you are running it yourself, then you do need to manage that for yourself. You have to manage your dependencies the same way that you would in any other distributed system without durability. Right? You're still gonna have to manage that. That's a problem no matter what: things getting upgraded underneath, stuff like that. So you still have to manage that yourself if you are managing it yourself.
But if you're using our commercial products, then we help you manage that. I guess that's the best way to say it. But otherwise, yes, you're gonna have to manage it the same way that you manage any other software upgrade in any other distributed system.
[00:26:28] Tobias Macey:
You mentioned the capabilities in your commercial offering. We've mentioned the Transact library being open source. I'm curious how you approach the heuristics of which capabilities belong in the open source, what the boundary is for what is part of the commercial offering, and some of the ways that you think about the sustainability and ecosystem investment around the open source component and how that can coincide with the commercial focus of running a sustainable business.
[00:26:57] Jeremy Edberg:
So the boundary is: Transact will help you build durable software. It'll always be anything regarding workflows, steps, etcetera. DBOS commercially will help you operate that software. That is the boundary. We help you operate it well and reliably, and we give away for free the ability to build software that can be reliably operated. So you can do it on your own if you want, or you can pay us some money, and we will help you do it or do it for you, depending on whether you use our cloud or not. And so the boundary is operations versus build. Build for free; we'll help you operate it well.
[00:27:37] Tobias Macey:
And in your experience of building DBOS and the Transact library, you mentioned some of the use cases. I'm just curious, what are some of the most interesting or innovative or unexpected ways that you've seen that technology stack used?
[00:27:48] Jeremy Edberg:
Yeah. So I told you about the back-end application. And honestly, that's not super innovative, but it's just really interesting to me that there's a life-and-death situation there. The most interesting ones have been these data pipeline use cases. It's the AI training ones, and we have a quote on our website from a customer who was attempting to move data from one CRM to another. They were trying to build this for months, and they couldn't do it until they found our durable execution library, and then they were able to build it in two days. And so that's probably the most interesting use case, the one where the once-and-only-once semantics are the most important thing to you: I want to make sure that this data actually got there, but I don't want duplicates. And then the other interesting one is the agentic AI. Think about a customer service agent where you can ask an AI for a refund. If it's a small amount of money, the AI just gives you a refund. But if it's a large amount of money, a human needs to approve that. And that human-in-the-loop part is interesting because it could take seconds, it could take hours, it could take days for that human to approve it. And so that workflow needs to hang open and keep running, but the way that we've designed it is there is no hanging open. Right? That step is completed, and we're waiting for the next step. And whenever that comes in, that is when we will resume execution of this workflow. And so with these LLM use cases, we end up saving the customer a ton of money because they don't have to sit there waiting for these answers to happen.
[00:29:20] Tobias Macey:
In the workflow stage completion piece, another thing that we didn't really talk through yet is some of the triggering: what is the actual signal for the next step to execute, and some of the ways that Transact handles that trigger management to say, okay, the step completed, now I need to do this thing. Or, in the case of what you were just saying where there's the human in the loop: I completed this step, and I sent some sort of notification, now I'm just going to wait, and now I need some other signal to come back in to be able to trigger the next step of it. Just some of that signal management and sequencing.
[00:29:50] Jeremy Edberg:
Yeah. So, typically, the signal is the next API call. Right? It's the response. How the response comes in depends, of course. It could be a webhook. It could just be a response to a request. And, yes, there is a listener listening for that, but we can do it very efficiently in our cloud because we can be listening for thousands of responses at once. But none of those are running, so there's no real CPU time being used. But, yeah, it's typically an incoming API call that triggers the next step. It's the same as if you didn't have durable computing. It's just that you don't have to have a listener of your application running and waiting for that response. There's just a generic listener waiting for incoming API requests, and then that triggers a workflow based on what it's a response to.
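The human-in-the-loop shape Jeremy described earlier maps onto send/receive primitives, at least as I understand the Python library; the names, arguments, approval threshold, and webhook helper below are illustrative assumptions.

```python
# Sketch only: DBOS.recv/DBOS.send usage reflects my understanding of the
# Transact Python API; the threshold and webhook helper are hypothetical.
from dbos import DBOS

DBOS()  # assumes configuration comes from the project's DBOS config


@DBOS.workflow()
def refund(order_id: str, amount_cents: int) -> str:
    if amount_cents < 10_000:
        return "auto-approved"
    # Park here durably. No process sits blocked waiting; the workflow simply
    # resumes whenever a matching message arrives, whether that is seconds,
    # hours, or days later.
    decision = DBOS.recv(topic="approval", timeout_seconds=7 * 24 * 3600)
    return "approved" if decision == "approve" else "denied"


def approval_webhook(workflow_id: str, decision: str) -> None:
    # Hypothetical handler for the approver's API call: it signals the parked
    # workflow, which is what triggers the next step.
    DBOS.send(workflow_id, decision, topic="approval")


if __name__ == "__main__":
    DBOS.launch()
```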
[00:30:36] Tobias Macey:
In the context of these LLM use cases and data pipelines, for people who are adopting DBOS and the Transact library, I'm wondering how you're seeing that influence the rest of the architectural decisions that they make. Like, are there components of their data stack that they're able to retire because of the capabilities that DBOS offers? Or are there cases where they were using maybe an Airflow or some other sort of orchestrator, but they move pieces of that to DBOS for managing the workflows that require that reliability? Or just some of the ways that it influences the overall architecture decisions that they either have already made or would otherwise make.
[00:30:58] Jeremy Edberg:
Yeah. We've definitely seen people who have been able to drop their Airflow, or at least turn it down, and their Kafka queues. That's a big one. They don't need their Kafka queue anymore. In some cases, they still use the Kafka queue as a backup, and then after a while they shut it down, realizing, oh, we don't need it; it's not doing anything anymore. So there's definitely that case. I'd say the biggest one is queues getting shut down. We don't need all these extra queues. We don't need these extra processes that are monitoring these queues, that kind of thing. That is probably the biggest change that people make. Oh, and their CI pipelines. They can reduce the steps in their CI pipelines because they can depend on DBOS to take care of it for them.
[00:31:55] Tobias Macey:
Another complexity of executions that have time as a component is testing, which I know there are libraries in different language ecosystems that will simulate the passage of time for. But from a testing perspective, I'm just curious how you've also addressed that in DBOS and Transact to allow people to verify that the workflows that they are building are actually correct, or at least correct to the point of the effort that they're willing to put into validating it.
[00:32:18] Jeremy Edberg:
So we have an interesting testing story, because, one, you can run everything locally. So you can speed up time, right, to test your workflows. And then the other really interesting thing is we have a time travel debugger. Because of the way that we store every input and output, you can step forward and backward through actual inputs and outputs. So you can use this for testing locally, debugging something that didn't work right. You can go back and replay the inputs and outputs, see where it failed. And in production, if you get a bug in production in a microservices system, usually your options are: hope that you had sufficient logging to catch it, turn on sufficient logging to try to catch it in the future, or try to figure out how to repeat it. But with the time travel debugger, you don't need any of that. You can literally replay the actual production bug that happened. So that's a big one. It's sort of a nice side effect of the way that we store all of the state: it allows you to replay that state. But, yeah, as far as testing goes, it's still gonna be pretty similar to the way it was before, where you either run a version that has a much shorter time span in between, or you speed up time to see what happens, that kind of thing.
[00:33:25] Tobias Macey:
And in your own work of building DBOS, building the Transact library, and working with the community, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:33:39] Jeremy Edberg:
Testing these things is still very hard, even if you have the right libraries. And in part, that's because it's still very hard to reason about distributed systems over time. And it's hard to build tests for things that happen over time. Right? How do you test for race conditions? It's not easy. Right? You have to cause them. So I'd say that's probably the biggest one. Otherwise, the community has been great. It's mostly just been about features: I would like to observe this, or cron jobs, which was a feature that was added because somebody said, I want to schedule a workflow at a particular time, and we were like, oh, wait a second, that's literally a cron job. Let's just build that right in, because we can totally do that. So, yeah, some of the features have definitely come out of community suggestions.
And then just building our own demo apps has caused us to add new functionality to it.
[00:34:33] Tobias Macey:
For people who are interested in the reliability guarantees, the sequencing, the workflow management, what are the cases where DBOS is the wrong choice?
[00:34:42] Jeremy Edberg:
So the wrong choice would be something that is extremely compute heavy. Video transcoding is a perfect example of something that would definitely not be a DBOS use case. If you have a workflow where your CPU is pegged at a hundred percent all the time, it's probably not the right use case. Our use cases are the ones where I need a bunch of data, and then I need to manipulate it and do something with it, and so there are pauses in the workflow. So that would probably be the best answer to what doesn't work.
[00:35:14] Tobias Macey:
And as you continue to build and iterate on the capabilities of the open source and commercial offerings, what are some of the things you have planned for the near to medium term, or any particular projects or capabilities you're excited to explore?
[00:35:23] Jeremy Edberg:
Yeah. So in the nearest term, we have a product coming out in just a couple of weeks, in April, that will allow you to actually run your Transact workloads outside of our cloud. It'll be called Conductor, and it will be the product that manages your executors, making sure that they're running and the workloads are moving and so on and so forth. All the stuff that we provide today in our cloud, you'll be able to do in your own infrastructure. That's the most immediate. After that, there are some interesting possibilities. So, obviously, more languages, more language support. There are a lot of really interesting security implications because of the way that we keep track of the state and the inputs and the outputs.
There was one experiment where we were able to detect a fake intrusion much faster than standard intrusion detection tools, because we could see it in the database. Right? We could see the erroneous inputs and outputs. So there are interesting security implications. There are certainly a lot of AI implications: the training and the agentic AI and things like that. So we'll definitely build out that capability. Those are probably the two biggest ones right now going forward. Beyond that, we are a reliability company, so we will probably start adding in all the reliability products. Right? Chaos testing or remediation or things of that nature.
[00:36:46] Tobias Macey:
Are there any other aspects of the work that you're doing on DBOS, the Transact library capabilities, or the overall space of durable compute that we didn't discuss yet that you would like to cover before we close out the show?
[00:37:03] Jeremy Edberg:
I guess the main thing is that the term itself, durable compute, is fairly niche. So there isn't anything that we didn't discuss, but this may be a new term to some of your listeners. And so I would encourage you to go out and learn more about it and realize that it's very much a new name for some old concepts.
[00:37:17] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:37:33] Jeremy Edberg:
The biggest gap in data management. I think the biggest gap is still around cleanliness. As everyone who deals with data knows, data cleanliness is gonna be the biggest problem you have to solve, from the most basic (this person called it f name and this person calls it first name) to much more complicated problems. And I think there's definitely a missing set of tooling around that, but I also think that the latest LLMs, or even SLMs, can help solve a lot of that. Right? Funny enough, I was just helping someone with this problem yesterday where they had to compare two lists of names, and I was able to use an LLM to ask it, are these two names the same?
And it was more accurate than even, like, a fuzzy matcher, for example. And so I think that is where the tooling lacks, but it will improve quickly with the new tools that are available to us. The biggest issue is, of course, the cost of the compute of asking it, are these two names the same?
[00:38:36] Tobias Macey:
Absolutely. Yep. Let's swat this fly with a sledgehammer.
[00:38:41] Jeremy Edberg:
Yeah. Exactly. Exactly.
[00:38:43] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on DBOS and the investments that you're making into durable compute and making that available to people to build these reliable workflows. I appreciate the time and energy you're putting into that, and I hope you enjoy the rest of your day. Thank you. Thanks for having me. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.