Summary
The current stage of evolution in the data management ecosystem has resulted in domain and use case specific orchestration capabilities being incorporated into various tools. This complicates the work involved in making end-to-end workflows visible and integrated. Dagster has invested in bringing insights about external tools’ dependency graphs into one place through its "software defined assets" functionality. In this episode Nick Schrock discusses the importance of orchestration and a central location for managing data systems, the road to Dagster’s 1.0 release, and the new features coming with Dagster Cloud’s general availability.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Nick Schrock about software defined assets and improving the developer experience for data orchestration with Dagster
Interview
- Introduction
- How did you get involved in the area of data management?
- What are the notable updates in Dagster since the last time we spoke? (November, 2021)
- One of the core concepts that you introduced and then stabilized in recent releases is the "software defined asset" (SDA). How have your users reacted to this capability?
- What are the notable outcomes in development and product practices that you have seen as a result?
- What are the changes to the interfaces and internals of Dagster that were necessary to support SDA?
- How did the API design shift from the initial implementation once the community started providing feedback?
- You’re releasing the stable 1.0 version of Dagster as part of something called "Dagster Day" on August 9th. What do you have planned for that event and what does the release mean for users who have been refraining from using the framework until now?
- Along with your 1.0 commitment to a stable interface in the framework you are also opening your cloud platform for general availability. What are the major lessons that you and your team learned in the beta period?
- What new capabilities are coming with the GA release?
- A core thesis in your work on Dagster is that developer tooling for data professionals has been lacking. What are your thoughts on the overall progress that has been made as an industry?
- What are the sharp edges that still need to be addressed?
- A core facet of product-focused software development over the past decade+ is CI/CD and the use of pre-production environments for testing changes, which is still a challenging aspect of data-focused engineering. How are you thinking about those capabilities for orchestration workflows in the Dagster context?
- What are the missing pieces in the broader ecosystem that make this a challenge even with support from tools and frameworks?
- How has the situation improved in the recent past and looking toward the near future?
- What role does the SDA approach have in pushing on these capabilities?
- What are the most interesting, innovative, or unexpected ways that you have seen Dagster used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on bringing Dagster to 1.0 and cloud to GA?
- When is Dagster/Dagster Cloud the wrong choice?
- What do you have planned for the future of Dagster and Elementl?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Dagster Day
- Dagster
- Elementl
- GraphQL
- Unbundling Airflow
- Feast
- Spark SQL
- Dagster Cloud Branch Deployments
- Dagster custom I/O manager
- LakeFS
- Iceberg
- Project Nessie
- Prefect
- Astronomer
- Temporal
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Your host is Tobias Macey. And today, I'm welcoming back Nick Schrock to talk about software defined assets and his work at Elementl on Dagster to improve the developer experience for data orchestration. So, Nick, can you introduce yourself for folks who haven't heard of you before?
[00:01:06] Unknown:
Yeah. Thanks for having me, Tobias, and thanks for that intro. As you said, my name is Nick Schrock. I'm the CEO and founder of Elementl, which is the company behind Dagster. Before Elementl, my career was mostly spent at Facebook from 2009 to 2017. And there, I founded this team called product infrastructure, which was to make our application developers more efficient and productive. And kind of the most notable piece of work that came out of that, that I was personally involved in, was GraphQL, and I'm one of the co-creators of that. And then I moved on from Facebook in 2017 and got into this entire domain and started working on data orchestration.
[00:01:46] Unknown:
And in terms of the overall Dagster project, for folks who wanna understand more about what it is and some of the history, I'll point them to some of the other interviews that we've done. But the last time we spoke was last November, and I'm wondering if you can just give a quick update on what's new in Dagster and its reaction to the evolution of the data space since the last time we spoke.
[00:02:10] Unknown:
Yeah. So since November, we've been making a ton of progress. So in December, the month after that interview, we launched our cloud product to early access, and that's been going better than I anticipated, actually. And we have tons of users across all sorts of companies all the way from startups to Fortune 500 companies. And as we'll talk about later in the episode, we're actually, you know, gonna be launching that to general availability to the open public next month on August 9th. So that's super exciting. In terms of the open source project, we had talked about it on that podcast, but we've really been doubling down on this software defined assets direction.
And in November, it was a super early prototype. Not super early. It was a fairly early prototype being used by a few early design partners. And now, you know, adoption and development has accelerated massively. And we've launched it with a stable API. And then we're actually
[00:03:09] Unknown:
also launching Dagster 1.0 next month at Dagster Day along with our Cloud GA launch. So it's been a huge 6 months for us. And so in terms of the overall software defined asset concept and some of the ways that it manifests, I'm wondering if you can talk to a bit of what the motivation is and what the meaningful impact is for data teams who are working with software defined assets as their primary abstraction rather than the core abstraction of this task based DAG that folks have become familiar with since the early days of data orchestration and especially exemplified by the Airflow community?
[00:03:49] Unknown:
Yeah. So in terms of the motivation, what we saw, and this was last summer when we really started to dig into this, is that a task based orchestration system was increasingly out of step with the way that data teams wanted to approach their data platforms overall and also the way that all the constituent tools in modern data, I'll avoid the term modern data stack, which were very kind of asset oriented or model oriented. So if you go to Airbyte, they'll talk about sources. Right? And they, like, model a distinct set of sources in their system, or Fivetran similarly. If you go to dbt, they'll be modeling the system in terms of dbt models and not tasks.
And what was interesting, and this kinda came out in this, you know, discussion around bundling and unbundling platforms, which happened in the spring. But, effectively, tons of the information that used to be in the task based orchestration layer was pushed down into these constituent tools. Whereas, for example, prior to, say, dbt, each table would have had a corresponding task in Airflow, say. But now you're just invoking dbt Cloud or dbt run. That entire DAG is represented by a single task in the orchestration layer. And, therefore, the orchestration layer is not nearly as useful in terms of providing a single place where you can have visibility into your entire data platform.
So we saw that happening where task based orchestration no longer mapped to the constituent tools as much. And then second of all, just in terms of even absent those external vendors, the platforms are becoming more and more complicated, and people wanted a more declarative approach where they could manage more complexity in their head. And the kind of more declarative approach of software defined assets really resonated with us and our early users. And that was kind of the motivation that happened there.
[00:05:52] Unknown:
And in terms of that early experience of releasing this capability, digging into that particular approach, and some of the ways that that maps to different teams' mental models. I'm wondering what were some of the notable outcomes as far as the development approach that teams had to building their data systems and data products and how that factored into their overall practices of how they thought about what the output of their work actually was.
[00:06:21] Unknown:
Yeah. That's a great question. So in terms of outcomes here, I think, first of all, the practitioners who are working in Python and orchestrating all these systems and building data assets in Python, they just became more productive using the system. They have to keep less stuff in their head. They no longer have to write one centralized DAG artifact or flow artifact manually. So if you come from Airflow, there's typically this DAG construction file, which ends up being a sort of centralized, unowned dumping ground of code that gets more complex as your data platform gets more complex. And with software defined assets, you no longer have that centralized artifact. The complexity is distributed across the entire system in a good way, meaning that that centralized DAG is constructed for you.
So it just makes it so that coding is faster, collaboration is easier, and it's an overall win. I think more profoundly, you know, what we found, and this is also based on some early feedback too, is that this allowed users who want to, say, develop their entire data platform across a bunch of different independent GitHub repos by independent teams. They can each individually code in their own world. Once deployed, they can actually interconnect all the assets that are built in those different constituent GitHub repos. So it's really been a way to unify the different practitioners who might be working quite independently, but are still building up to one cohesive data platform.
And then that's kinda connected to the last bit, which is now they finally have a place to go to where they can understand how their data products interrelate. So they no longer have to go into different constituent tools and kind of, like, have it all in their heads. There's just, like, one spot where everything is defined. It's been pretty interesting to see how that's developed. Like, one anecdote I really like is that we had an early cloud customer who started to adopt software defined assets, and they kinda started with our out of the box integrations. So they use the ingest tools, they use some custom Python scripts, and they integrated that into software defined assets, and then they also imported their dbt graph. And they saw this interconnected web of assets. But then they also use a feature store called Feast, and it kind of was in its own world because they hadn't imported that into software defined assets, and we didn't have an out of the box integration for that.
So they actually built their own integration to bring Feast into software defined assets land. And I thought that was interesting. One, it was just, like, the value of it compelled them to write that integration. And second of all, it actually interconnected the data engineering and ML sides of the house into one cohesive lineage graph. And I think that's a lot of the promise here is that you no longer have to desilo your organization. Right? You can kind of conform everyone to this common standard and have your entire org, regardless of persona, kind of interconnected in this single fabric of assets.
[00:09:26] Unknown:
And to that point of building their own integration to be able to represent what was happening in Feast as a software defined asset, can you talk a bit more about what is actually involved in building that type of an integration and what it means to actually represent some of these external systems through this asset oriented abstraction?
[00:09:46] Unknown:
I mean, in the end, if you look at the Python code, that's a software defined asset. It's just a function. So it's not this huge lift to do an integration. It's merely annotating your existing code, because all these systems generally have Python integrations. Right? So in the end, you're writing some Python code, which invokes the tool that you wanna invoke. The magic is sort of how we structure the software and structure the metadata. So it just means that you've written an integration. It ends up being a Python function, which invokes something else, but then you can plug into our system to interconnect it to the other assets, and that's kind of the critical piece here.
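For readers following along, here is a minimal sketch of the pattern Nick describes, assuming Dagster's @asset decorator from the stable 1.0 API; the data and the external call are placeholders for illustration, not a real Feast integration.

```python
from dagster import asset


@asset
def raw_orders():
    # Stand-in for a call to an external system's Python client
    # (an ingest tool, Feast, a warehouse query, and so on).
    return [{"id": 1, "status": "shipped"}, {"id": 2, "status": "cancelled"}]


@asset
def cleaned_orders(raw_orders):
    # Naming an upstream asset as a function argument is what wires the
    # dependency edge into the asset graph that Dagster builds for you.
    return [order for order in raw_orders if order["status"] != "cancelled"]
```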
[00:10:26] Unknown:
And as you went through that early phase of proving out the idea, testing it out with your early adopters, and then iterating on that with some of the subsequent releases. I know that you have stabilized the API, and I'm wondering what were some of the shifts in the interfaces and the abstractions and paradigms you wanted to expose for people to be able to work in this mode of the software defined asset, and what were some of the modifications that were necessary internally to Dagster and the framework to be able to support that as a first class capability?
[00:11:01] Unknown:
So I'll start with the first one, which is what are the changes we had to make to core Dagster in order to do this? And the answer is that the core was relatively stable. Right? So software defined assets is just a layer on top of our existing core. Right? So at its core, a software defined asset refers to an op, which is our core unit of computation, kind of the foundation of our task based orchestration layer. If you write an asset, it builds the centralized DAG artifact, the job, for you, and it also manages interactions at a fine-grained level with our asset catalog, which also predated software defined assets. So it's actually this relatively thin layer of software on top of the existing capabilities in our core system. And that's important as well because our task based orchestrator is not going away, that core orchestration engine. Some tasks, not to double use the word, some activities in data platforms are still highly amenable to kind of more imperative programming models, and those are critical to support, and we will always support that.
And in fact, the software defined asset layer needs that layer in order to execute well. You can kind of, like, think of it, it's not a perfect analogy, but in that way: you know, in Spark, for example, computations can be encoded in SQL. And maybe in the, like, long term, 80% of the computations will be in Spark SQL, but there's still gonna be this critical 20% that are encoded in data frames. It's more imperative, but you need to do it in order to accomplish a lot of the activities you wanna do. We kind of view software defined assets and our task based system in a similar sort of way, where we anticipate over time more and more of the activity in the system will be encoded in terms of the software defined assets API, but there's still gonna be absolutely critical activities that you need to use this lower level for. So that all being said, software defined assets is a layer on top of our op layer, and that will forever and always be true. So most of the work was actually more about kind of conceptual work and getting that layer absolutely right, and then building tooling on top of it, which was by far where the lion's share of the effort was.
So that's one thing in terms of your question about changes to the interfaces and internals of the system. Your other question was, how did we respond to feedback and how did the API shift? And I think that focused on maybe three different things. One was a lot of the users who understood this very intuitively and wanted it were our modern data stack users that were heavy dbt users and wanted much more advanced orchestration capabilities on top of dbt. So tons of features there. In particular, I thought one was interesting, which speaks to what I was talking about before, was, you know, we had this ability initially where you can, like, write in the same GitHub repo, be able to load dbt projects and orchestrate it right there. But what tons of teams wanted actually was the ability to have their analytics engineering team actually have a more independent GitHub repo and their own development cycle, and then have the orchestrator actually consume the dbt manifest file, for example. And that required some changes on our end, but it spoke to the fact that people are really thinking about the interfaces between teams in these systems and then how to make it so that, you know, yes, they can deploy and operate independently, but once deployed to prod, the orchestrator kind of stitches the entire thing together.
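As a rough illustration of the manifest-driven pattern described above, a sketch assuming the dagster-dbt integration's manifest loader (exact function names and file paths vary by version): the orchestration project consumes the compiled manifest.json artifact produced by the analytics team's repo rather than the dbt source tree itself.

```python
import json

from dagster_dbt import load_assets_from_dbt_manifest

# manifest.json is produced in the analytics team's own repo (for example by
# `dbt compile` in their CI) and published as a build artifact; the
# orchestration project only consumes it.
with open("target/manifest.json") as manifest_file:
    dbt_assets = load_assets_from_dbt_manifest(json.load(manifest_file))

# dbt_assets can then be combined with non-dbt assets in the same repository
# definition so the full lineage graph is stitched together once deployed.
```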
So we got a lot of feedback around how our dbt integration worked and how that kind of trended, and then I also viewed it as feedback about how people wanna organize their teams, and that was super interesting. On a more tactical level, you know, we were originally gonna deemphasize our config system a little bit in the software defined assets layer, which is effectively the ability to parameterize computations without changing code. But super early on, people were like, wait. I want this. I miss this capability in the old system. So, like, we actually had to front-load that quite a bit. And then lastly, you know, we knew there was gonna be demand, but, you know, immediately, people wanted to focus on interoperability.
So they have their existing task graphs. Some stuff is appropriate to model as task graphs. How do you wanna interleave those layers? So I think those were kind of the early themes and feedback which we responded to. But the core conceptual underpinnings of the project have remained quite constant from the beginning.
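To make the config point above concrete, a minimal sketch of parameterizing an op through run config rather than code changes; the op name and config fields here are illustrative, not from the episode.

```python
from dagster import job, op


@op(config_schema={"target_date": str, "limit": int})
def export_snapshot(context):
    # Values arrive via run config at launch time, so the same code can be
    # pointed at a different date or row limit without being edited.
    context.log.info(
        f"exporting {context.op_config['limit']} rows for {context.op_config['target_date']}"
    )


@job
def snapshot_job():
    export_snapshot()


if __name__ == "__main__":
    snapshot_job.execute_in_process(
        run_config={
            "ops": {"export_snapshot": {"config": {"target_date": "2022-08-09", "limit": 100}}}
        }
    )
```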
[00:15:53] Unknown:
As you have gone through this process of working through the software defined assets capability, updating the interfaces and APIs and some of the tooling and just general messaging around that functionality. In parallel, you've also been going through a journey of the early beta testing and early adoption of your cloud product, and I'm curious how that journey has paralleled the capabilities of software defined assets and some of the ways that those two areas of effort have played off of each other in terms of your own internal engineering work? Yeah. That's a good question.
[00:16:29] Unknown:
I mean, I think the easiest answer and why I'm so excited about the cloud product happening is that the ability to have a hosted service allows you to deploy new capabilities and bug fixes and new system infrastructure to your users in a completely permissionless way, and that allows you to iterate with your users far more quickly. You know, in a pure open source model, let's say you fix a bug or have a new feature in your UI, you push it up to PyPI, and then you're kinda like waiting for people to upgrade, and then you're subject to their own internal infrastructure release processes and whatnot.
And with cloud, we can push whenever we want, get new capabilities into users hands extremely quickly and have a much tighter feedback loop with them. And that's been able to accelerate our product development dramatically, and I'm super excited to be able to do that at scale.
[00:17:30] Unknown:
So you've been going through this process of adding the software defined assets capability, cementing that API. Just prior to that, you went through the work of updating the core abstractions for Dagster from these nomenclatures of solids and pipelines and
[00:17:47] Unknown:
Tobias, you're gonna bring those up?
[00:17:49] Unknown:
And that whole work and then, you know, streamlining that into this jobs/ops/graphs paradigm. So you've gone through a number of iterations. There's been a whole host of work. You've seen a lot of adoption. And so that leads us to what you mentioned earlier. On August 9th, you're having this Dagster Day event where you're going to be committing to a 1.0 stable version of Dagster. You're going to be releasing the cloud product as GA. I'm wondering what are the major lessons that you and your team have learned in the process of going through the initial development, the initial adoption cycles, figuring out what are the concepts that actually make sense in this space, exploring how you can keep up with the rapid pace of evolution in the data ecosystem writ large while staying true to the core ideas and abstractions that you're putting into Dagster and just some of the commitments and what it is about where you are in your journey right now that makes you feel confident in committing to that stable version identifier.
[00:18:51] Unknown:
One of the big takeaways here is don't be too cute with your naming. The name Solid was my fault. It was actually originally gonna be the name of the project itself, and I kinda like the name. And then I, anyway, we don't need to belabor that one. So don't be too cute. Name things in the most obvious way possible. You know, it was a critical thing for us to switch to a more natural nomenclature and fix some very core API problems with that original pipeline/solid API. There were other things called modes and presets and all sorts of stuff. And we wanted to simplify and consolidate that into a far more elegant API. And that was a huge step forward for the system.
And then assets were always gonna be a layer on top of that. So really the foundation of the 1.0 release and the big breaking change we needed to make was switching to this op/job/graph model. And that is the stable core foundation that we're building everything on top of. The feedback on those APIs has been extremely positive. All of our major users have migrated everything. We're super confident in those APIs, that core stable foundation. And it was really that that laid the foundation for this 1.0 release. And that doesn't mean we're gonna be stopping development by any sort of means, but any further changes are gonna be additive.
And we will, like, commit to supporting these APIs for the long term.
[00:20:14] Unknown:
I guess, can you remind me of the other subcomponents of the question? There was a bunch in there. The main driving point of the question was why you're confident with committing to this 1.0 version specifier, but also just some of the overall sort of lessons that you've learned as far as how to keep up with the rapid pace of change in the data ecosystem and what are the core ideas and lessons that you've been able to lean on that have stayed true regardless of the external turbulence that is a constant in this space.
[00:20:47] Unknown:
The word that you used, I think, is telling, which is turbulence. So I think there's two things going on here. There's, like, what are the fundamental things we are trying to accomplish in our jobs? And regardless of all the different noise in the ecosystem, that doesn't change that frequently. At the end of the day, we're writing code to produce assets and whatnot. There is tons of activity in the tooling ecosystem to optimize different parts of that workflow and make certain parts easier and improve ergonomics and whatnot. So, you know, the 1.0 signifier is mostly around our core abstractions.
You know? And then there's the entire integration layer and all the different libraries that both we and our community maintain. And that's kind of where you, you know, are dealing with the turbulence, so to speak. And as new tools come and go, the core notions in our framework, like, will remain the same.
[00:21:43] Unknown:
Along with the GA release of cloud, what are some of the new capabilities that you're planning on releasing as part of that?
[00:21:52] Unknown:
So there's a ton of new stuff that we're gonna be rolling out. But, you know, one thing I'm incredibly excited about in our GA release that's coming out. So first of all, for Dagster 1.0, there are no huge changes coming out with Dagster 1.0. Right? The whole point is that there's a stable set of features that we are marking as mature, and you can be confident that things aren't gonna change in that stable core going forward. The Cloud GA release is a bit different. And first of all, what is the cloud product? And it's a managed orchestration service so that you can write your code and we host your control plane.
So you can deploy Dagster effortlessly in an enterprise grade way, and it comes with features like, you know, role-based access control, authentication, and, you know, you can just focus on your business logic. A new capability that's coming out that I'm incredibly excited about, which I think solves the sort of, like, unsolved problem in data engineering and data management is what we call branch deployments. So, you know, there's this current problem in, I think, all the orchestrators, which is you can spend a bunch of effort structuring your code so that it's locally testable.
And so you have a fast development life cycle. That's only in certain of the frameworks. Like, Airflow doesn't really support local development at all. And then what teams are left to do in those cases... well, let me step back. So let's imagine you do do a bunch of work and you're, like, mocking out your data warehouses and doing all this work to try and make it so you have a nice local development flow on your laptop. There's a problem there, which is, one, there's a ton of stuff that still isn't tested if you aren't interacting with the services you're actually gonna interact with. And then if you try to interact with those services from your laptop, a lot of the time your, you know, security policies prevent you from interacting with that. And then what people are left to do is still to do a bunch of testing in a staging environment. But those centralized staging environments are also a pain in the butt to manage.
And often, if you have multiple developers deploying code to that staging environment, they're overriding each other's changes, and it just gets complicated and whatnot. So you either have these, like, clumsy to use, clumsy to manage staging environments, or what people all too often do is just end up testing in production. They just push their code, and then they manually trigger a job or a DAG and just see if it seems to work. And it's this incredibly low, low productivity life cycle. What branch deployments allow you to do, when using cloud, is that for every PR that you make, it actually creates a lightweight staging environment that's specific for that PR.
So you write your code, you branch, you push, then our infrastructure takes over, deploys your container, and then actually spins up a lightweight environment where you can either automatically launch a job on every push or manually do it in our UI. And it provides a sort of, like, very lightweight staging environment that serves almost as like an IDE because you can, like, click runs and see if they complete or not. It's a complete game changer in terms of the development life cycle, and it just fits so naturally into development workflow. And it's just an incredibly powerful capability.
So I'm incredibly excited about that.
[00:25:24] Unknown:
It's time to make sense of today's data tooling ecosystem. Go to dataengineeringpodcast.com/rudder to get a guide that will help you build a practical data stack for every phase of your company's journey to data maturity. The guide includes architectures and tactical advice to help you progress through 4 stages: starter, growth, machine learning, and real time. Go to dataengineeringpodcast.com/rudder today. And to your point of the staging experience being able to make data workflows and data orchestration something that is tenable for having a fast feedback loop for the development workflow.
It's definitely great having that capability built into the orchestrator, but then there's also the question about what about all the pieces that that orchestrator is connecting to? How do you manage those branches and preventing accidentally polluting the actual production data or the production systems and just the overall space of developer tooling, CI/CD, just making sure that you have those fast feedback cycles and are able to validate your logic and your end to end workflow in the data space is still very painful. And I'm wondering what you see as some of the other sharp edges that folks are going to have to deal with now that they do have this capability of being able to easily branch their orchestration layer, just how to think about what the broader impact is on the rest of their systems with that capability? It's a great question, Tobias,
[00:26:58] Unknown:
and, you know, we are able to use the orchestrator itself to manage all these different test environments. So let me give you a simple example. Snowflake actually has good support for this. They have a feature where you can clone a schema, which effectively makes a lightweight copy of your data in a copy-on-write way, but it does it safely. So if you do any sort of mutation, it does not affect the source production data. This lends itself very nicely to this branch deployment model because effectively, you make a clone of the schema for every branched environment.
You can't actually pollute the production data, but you have access to it in order to test your flows. And then if you actually do any writes in that data warehouse, it still is safe. Right? What's awesome about branch deployments is that you can set it up so that you can automatically run a job whenever you push up new code. So the way it works is that you actually have a job in Dagster that clones the Snowflake schema. And every time you push up a new PR, or a new change within that PR, it re-clones the schema, so you kinda start from fresh. So by marrying the ability to have underlying infrastructure kinda support these things and have the orchestrator be able to orchestrate the management of those test environments, it kind of all comes together in one nice package.
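A hedged sketch of what a clone-per-branch job could look like, assuming the snowflake-connector-python package; the environment variables, schema names, and the way the branch name is discovered are assumptions for illustration, not Dagster Cloud specifics.

```python
import os

import snowflake.connector  # assumes snowflake-connector-python is installed
from dagster import job, op


@op
def clone_schema_for_branch(context):
    # Hypothetical wiring: in a branch deployment the branch name would come
    # from the deployment's environment rather than a hard-coded default.
    branch = os.environ.get("BRANCH_NAME", "feature-new-model")
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
    )
    try:
        # CLONE creates a zero-copy, copy-on-write schema, so anything written
        # while testing the branch never touches the production tables.
        conn.cursor().execute(
            f'CREATE OR REPLACE SCHEMA "ANALYTICS_{branch.upper()}" CLONE ANALYTICS'
        )
    finally:
        conn.close()


@job
def setup_branch_environment():
    clone_schema_for_branch()
```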
Now the different vendors and tools support those type of things to different degrees. Right? So Snowflake has kind of the nice version. If you have, like, a full unmanaged data lake, you might actually have to kick off a job that copies a bunch of data. Right? Or write what we call a custom I/O manager that would know how to read from kind of the source tables and then write out to copies elsewhere. So depending on the tool support and whatnot, it depends on how much work you have to do. But the entire idea is that by having this capability built into the orchestration layer, the orchestration layer itself can manage all these test environments.
And that's really where this magic comes together.
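And a toy sketch of the custom I/O manager idea mentioned above: reads come from production storage while writes land in a per-branch location. The storage layout and branch wiring are assumptions; only the IOManager structure with handle_output and load_input comes from Dagster's API.

```python
from dagster import IOManager, io_manager


class BranchAwareIOManager(IOManager):
    """Reads from production locations, writes to per-branch locations."""

    def __init__(self, branch):
        self.branch = branch

    def handle_output(self, context, obj):
        # Hypothetical layout: a real implementation would write to warehouse
        # tables or object-store paths derived from the output's asset key.
        path = f"/data/branches/{self.branch}/{context.step_key}.json"
        context.log.info(f"would write {obj!r} to {path}")

    def load_input(self, context):
        # Inputs are always read from the production copy of the data.
        prod_path = f"/data/prod/{context.upstream_output.step_key}.json"
        context.log.info(f"would read from {prod_path}")
        return None


@io_manager(config_schema={"branch": str})
def branch_aware_io_manager(init_context):
    return BranchAwareIOManager(init_context.resource_config["branch"])
```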
[00:29:01] Unknown:
For the data lake use case, also, there are things like LakeFS that allow you to have a sort of branch and merge style workflow for your data in S3. I know Iceberg and Project Nessie are looking to be able to do that in the space of things like Hive tables or tables in your, you know, Amazon Athena lake or things like that. And now that there is more movement in some of these different vendors and in what you're providing with the branch deployments, and it is a conversation that is being had. I'm curious what you see as some of the potential next steps that we, as a community, can and should take to help keep that ball moving or keep that flywheel spinning and continue on improvements in being able to actually manage these complex and heavyweight workflows while being able to safely iterate quickly on updating logic, modifying schemas, figuring out how these data flows actually impact the business, and just the end to end workflow of being able to manage these high value, high risk, you know, heavyweight tasks?
[00:30:09] Unknown:
This is one of the reasons why I'm so excited about an orchestrator finally supporting this type of thing is that I think it will provide a great incentive structure so that people are forced to make progress more quickly. You know? Because what will happen will be something like this, where people orchestrate two different technologies, and one of them, like Snowflake or LakeFS, has a super easy, very straightforward way of doing this. And then they're trying to orchestrate it with a tool that doesn't. And then they'll immediately be like, wait, they have this workflow ready to go, and it's just one tool that's preventing it. And then those users will exert pressure on their vendors and tools in order to support that better.
So, yeah, I'm just super excited about having a place where there's an end to end workflow that kind of aligns with this trend that you're seeing so that, you know, it makes it clear what tools have fallen short here, so to speak. And I think it will provide a nice incentive structure for everyone to get their ducks in a row. Because right now, if you're trying to orchestrate all these different capabilities, right, LakeFS and, you know, Snowflake's clone schema or copying whatever, and doing all their different bespoke branch and merge workflows, it's so much work to construct that end to end that no one does it. And therefore, people don't even know what they're missing yet. So to me, the way to really make progress on these type of issues is not to, like, gather a conference together and, like, convince every single vendor and, like, nag them to do this and this and this. It's just, like, with actual tooling, with actual users and actual incentives, just move everything much more quickly.
[00:31:51] Unknown:
I agree that having that capability of managing the full kind of test environment in the core of your data platform is definitely going to be a massive leap forward in terms of people's experience and even thinking about that as being a possibility because as you said, otherwise, it's this very heavyweight exercise that is full of toil of just saying, okay. How do I automate each of these pieces individually and be able to manage all the infrastructure and manage all of the logic and hopefully not shoot myself in both feet by accident in the process.
Yeah. Totally. And going back to this release of the 1.0 version of Dagster, the GA release of the cloud product with these new capabilities. As you said, you've got this Dagster Day event that you're planning to make all of these announcements. I'm wondering what are some of the other things that you have planned as part of that event and what you're hoping
[00:32:49] Unknown:
the participants will take away from it. The event is, you know, short but sweet. I think we're gonna clock in at, you know, 45 minutes-ish or so. And so what we really want it to be is, like, a very information dense, you know, high signal-to-noise event where you can come understand the entire value proposition of the platform end to end all the way from the framework to the cloud product, and then have a forum where you can connect to other community members and ask questions. And in general, be kind of the first to know what's going on with all this stuff. And, you know, it'll be available afterwards, of course, to view. But we're really excited to say, you know, it really is this coming of age moment for the platform where, you know, we're powering with cloud and open source, like, some of the most sophisticated data and ML platforms in the world. People have bet their entire mission critical workflows and their entire data platforms on this technology.
And this is more kind of a coming of age party that communicates that underlying reality accurately and then opens up this capability to the entire world. So, you know, it's a big day for
[00:34:02] Unknown:
us. One of the interesting aspects of marketing something as stable with that 1.0 version is that it opens the doors for a lot of folks to even consider using it in the first place because there are organizations that have restrictions that say, we will never actually use something until it hits version 1.0, or there are people who aren't comfortable being the guinea pigs or testing things out with the possibility of breaking changes because they don't want to put in the time to actually stay up to date with things as they evolve. I'm curious what you anticipate as the overall impact on the size and composition of the community now that you are hitting the stable milestone and some of the things that you as an organization and managers of the community are doing to prepare for these new categories of users that are likely to start kicking the tires on Dagster?
[00:34:59] Unknown:
Yeah. Like you said, there's a whole set of users out there who it's extremely important for them to have the company and the project signal to them that they're not gonna be broken and it's a stable foundation to be built on. And the reality is we've been operating like a 1.0 project for a long time now, both in terms of the reliability that we provide to our users as well as effectively the job. The core task based layer, for example, has been stable for a long, long time and we don't break anyone there. So on an operational basis, day to day, not that much is gonna change for us, actually. We've been, like, operating like this for a while. You know, we have a renewed focus on our documentation.
You know, we have a full time technical writing staff now. They're doing fantastic work to improve the docs. So we're preparing to scale our product education and community management alongside this.
[00:36:02] Unknown:
Now that you do have this stable point, this marker of a next stage in your life cycle, what are some of the new and upcoming capabilities that you're starting to look to with this solid platform to be able to build from?
[00:36:19] Unknown:
We are gonna continue to invest, you know, in additional integrations and additional tooling on top of the software defined assets, you know, stuff in particular. And, you know, I think you can think of the system as layered. Right? So when you have a core stable foundation, then your activities generally end up being focused at the tooling built on top of that. And that's really where a bunch of our focus is gonna be as well as on integrations. I think there is just a ton of work in terms of integrating, both at the software defined assets level but also working with partners to make those tools work with branch deployments, for example.
And I don't think we can underestimate both how much work that is and how much value there is there. So I think you'll see us focus a ton on integrations work and value-added tooling over that stable core. In terms of other directions, you know, on the commercial side, you'll see us also working on supporting organizations as they scale across our platform, whether that means more advanced role-based access control (RBAC) capabilities, the ability to, you know, track lineage across your entire org and across deployments in the Dagster platform.
But, you know, we are super focused on getting 1.0 and the GA product out the door, and, you know, we'll be publishing more of a roadmap after that. In one of our earlier episodes, we had a good conversation about your opinions and framework for deciding
[00:37:59] Unknown:
what capabilities belong in the open source framework capabilities of Dagster and which features belong in the commercial cloud offering. And I'm wondering how this stable marker of the platform and of the cloud system and its release into general availability is going to factor into your ongoing commitment to or modification of that philosophy and just some of the ways that you're continuing to manage that tension of which capabilities need to be paid for the sustainability of the organization and which capabilities need to be part of the open source because they are necessary for the continued viability of Dagster as its own project.
[00:38:45] Unknown:
First of all, let's step back and talk about how we even approach this issue here. The goal of Dagster is to become as broadly adopted a standard as possible for structuring your data platforms and data assets. And then the goal of Dagster Cloud is to build, you know, an awesome managed service that is structured on top of that standard. And in terms of what goes into that standard, the open source, versus what goes into the proprietary platform, I kinda divide the world into three layers of complexity here: application complexity, operational, and enterprise.
So application complexity is about how developers work with the framework to structure their code in order to have, like, better API ergonomics, testability, all that stuff. And that's effectively the relationship between the engineer, developer, practitioner, and the code that runs in their process. Right? Like, literally, they are pip installing something, and they are running Python code that we have authored that calls into their code. So on a very practical level, that needs to be open source because you have code in the same stack trace, and there's no process boundary and all that stuff.
And, also, we want that framework to be composable and embeddable in all sorts of interesting contexts because we can't predict how everything's going to be used. And that's like the fun and the joy of open source. So that's one layer, the application complexity and the open source framework that will forever and always be open source. Then you get into the land of, like, tooling built on top of that stuff. Right? And I think that's where this gets more subtle and interesting. So we want teams to be able to self host open source Dagster and run production workloads on it. And we want them to be able to run that on different infrastructure like ECS or Kubernetes or just an EC2 node or whatnot.
We want people to be able to do that. We want them to be able to self host that, and we will always support that. But we also have Dagster Cloud, which is a managed service which hosts the computations on their behalf. And the way that I kind of divide the world here in terms of what should be proprietary and what should not be is actually a lot of it is in terms of execution efficiency of the organization. Meaning, it is much easier for us to develop advanced operational capabilities on a centralized hosted platform.
We can deploy it whenever we want. We can fix bugs immediately and so on and so forth. Let me give you a very specific example. Like a database migration, it is, like, so much easier for every single stakeholder involved if we run 1 centralized database migration instead of forcing our, you know, thousands of open source users to run their own database migration against their own infrastructure. Doesn't mean we're never gonna implement anything that requires a database migration. It just is an example of if we can centrally manage something, we can move faster and we don't have infinite engineering resources. And then we can fix bugs on the user's behalf and whatnot.
So I often think of this in terms of if there are operational capabilities where we can centrally host it and there are massive economies of scale for one organization to do it instead of distributing that work across thousands of organizations. We should really bias at minimum towards initial development being in the proprietary domain because we can move faster and provide better quality of service. I think the value exchange here, and this is something we're rolling out with GA, is that there should be a fair usage based pricing model where small players can participate as well as the largest enterprises.
So just to give another example of, like, where economies of scale don't apply, you know, we have a Kubernetes executor. It's not like we're gonna withhold a bug fix from that and only apply that to the proprietary stack. First of all, it's just the wrong thing to do and goes against our values. But in the rubric that I just laid out, there's no economies of scale to not fixing that more broadly. Right? It just is better for everyone. And then the last layer of complexity is enterprise complexity. So security, auditing, etcetera, these are capabilities which are complicated to implement that we wanna be able to push changes instantaneously and fix any issues there. And that enterprises who want those capabilities typically want a paid relationship anyway.
Like, they want to write a check because then there's someone on the hook when something goes wrong. So, again, that was kind of a long winded answer, but three layers: application complexity, that's really the relationship with the individual practitioner. Those capabilities will always be open source. Managing operational complexity, we wanna support production workloads, but, you know, for really complicated capabilities that take a lot of work to host and fix and all that. We wanna bias towards centrally hosting those. And then lastly, enterprise complexity, which is strictly in the province of the proprietary domain.
[00:44:01] Unknown:
Digging into that middle layer of the operational complexity where there is this decision to be made about whether it is proprietary or open source. You mentioned that at least the initial development for some of these capabilities belongs in the commercial offering, and I'm wondering what your decision structure or mechanisms are around being able to say, okay. We've built this internally, but this is actually something that is useful in the open source. And so being able to do that initial development, offering it as a paid capability for your cloud customers, and then eventually migrating that into the open source and just how that decision structure and how that actual management of the code and being able to make it something that can be migrated plays out in your mind.
[00:44:51] Unknown:
Yeah. We don't have a specific process in place yet. I guess what I'll say is that a lot of it just kind of happens because the proprietary service is really kind of a shell where work that's done on it naturally flows down into our core infrastructure. So that kind of just happens for free. And in terms of, like, an ongoing process where we develop things that are focused on the proprietary platform...
[00:45:28] Unknown:
In your experience of rolling out the software defined assets capability and onboarding some of your early customers to the cloud platform, what are some of the most interesting or innovative or unexpected ways that you've seen those capabilities used?
[00:45:41] Unknown:
A couple examples come to mind. A bunch of our cloud users are actually energy companies that you probably have heard of, and one of them has really integrated Dagster into their geologists' workflow as they, you know, are doing analysis, to the point where it's actually completely integrated with their own internal custom apps. So they have internal apps where they click a button and it actually kicks off Dagster Cloud runs. That metadata is then fed back into their own custom proprietary app. It's kind of like their custom operational tools and our tooling have almost, like, fused into one centralized system, which is super cool to see. You know, it's just like it demonstrates the value of, like, using open source standards, like GraphQL to model your web app because then someone can literally effectively build a lot of those capabilities in different contexts and integrate it in interesting ways.
So that's been super interesting to see. Unexpected, this is one of my favorites. One of our customers is KIPP, the Knowledge Is Power Program, which is a charter school program. And a couple of their geographies (they have a fairly fragmented organization) are actually using Dagster to, in effect, build report cards and progress reports for their students, which sounds wild, but it makes sense if you think about it. They actually use a bunch of different SaaS tools to track student activity, to record their work, to, you know, do attendance, all sorts of stuff. And then they actually built a data platform in order to assemble those tools and integrate the data and then produce a cohesive report that can be viewed by parents and students.
So, you know, I love it because I can explain this to, like, my mother. It's like, okay. Imagine you have this report card. That is, in effect, a data asset in the system, which is composed of other data assets, and it's been integrated and computed upon. And in the end, you know, there are data pipelines which produce these data assets for parents and teachers, and it's a very, like, graspable example that people can understand. And something I did not anticipate is that Dagster would be used as an orchestration platform for student report cards, but here we are. It's all data. It's all information that has to go somewhere. That's right.
[00:48:12] Unknown:
In your experience of bringing Dagster and Elementl up to this point of having a stable, solid foundation to build from and marking it with this 1.0 release and the general availability of the cloud product and all of the new capabilities and features that you have planned subsequent to that. What are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:48:40] Unknown:
It does speak to the 1.0 release, and it's something I understood intellectually before, but now really felt in my bones. It's just how important it is to communicate clear expectations around stability and future changes and make sure to do enough upfront design in order to do that effectively. I think early in the project we were a little fast and loose about kind of, we want to be able to move quickly and whatnot, but it's very different. You know, you can change your UIs quickly, and that's fine because you're not breaking people's code, but you just have to have a very different standard, you know, in terms of communicating that this is a stable foundation to build on, and that you have to do more upfront deliberate design, especially in these cases. Yeah. So I think that's been a lesson, and we've had to learn that. And, you know, we've learned from that and have changed our processes. And now we're signaling this new phase.
And the other thing, and I spoke to it earlier, is again, I knew it intellectually, but it just in the last 6 months, it has just become super, super real is that, you know, we're excited about cloud. Like, obviously, you know, we're excited to build a sustainable business model and make forward progress on that front so we become a sustainable organization. But cloud has also just increased our pace of product velocity because we have a tighter relationship with those customers. We can see the effects of our changes immediately. We can push them changes. They can provide feedback.
In a day, we can fix a bug or add a feature, and it's just like the super exciting new phase of development. So those are kind of the two things that come to mind. So we've spent the
[00:50:27] Unknown:
past hour or so extolling all of the virtues of Dagster and the new capabilities and benefits that you're providing, but what are the cases where Dagster and/or Dagster Cloud are actually the wrong choice and somebody is better suited with a different orchestration system or just writing a bunch of bash scripts or just using the built in orchestrator for their point solution tool?
[00:50:50] Unknown:
Yeah. So to the last point, if you are only using that one tool and it has a built in orchestrator, you know, just use that one tool. I think that's, you know, fairly clear. And you only need, like, cron based scheduling, for example. But broadly, I do think that orchestration is a base capability that you should build in for your platform from day one. And obviously, it's, like, self-interested of me to say that, but I do believe it's true. It's kind of like telling someone if they're, like, writing Python, or this isn't the best example, but, you know, let's say, like, someone's a Python programmer. It's like, well, only use classes if your program is gonna get big. You know? It's like, no. You should, like, structure your programming from day one, because, you know, even small things become big later. So you might as well kinda make good engineering decisions all the way from front to back.
In terms of within the orchestration domain, when Dagster and Dagster Cloud is the wrong choice, I think you're actually seeing a pretty interesting bifurcation in the orchestration domain where, you know, when people are evaluating orchestration from the ground up these days, they're kind of generally choosing between three options: Prefect, Astronomer, and Dagster. And I think those three solutions are actually kind of going their separate ways, so to speak, with their most recent directions. So, for example, Prefect has their new Orion project, which is gonna be their 2.0 project. And I actually think Orion is quite cool.
And it actually makes it a system much more like Temporal, which is a microservices orchestration engine. So Temporal and Orion, or Prefect, I should say, they're actually more general tools than Dagster, and they're more operationally flexible, but they're much more imperative. So if you need to kind of build a state machine instead of a DAG, you should definitely choose those projects. And they're much less tailored towards the data platform use case. So they have no notion of data assets. You can kind of write almost like a Turing complete state machine using them. But as a result, you can't pre visualize what the computations are gonna be. There's very strict trade offs there. So if you want something more general and more imperative, and something with less operational constraints, and therefore more operationally complicated, you know, Temporal or Prefect is kind of the way to go. Airflow is kind of your, you know, task based orchestration system, and that doesn't really think about the full developer workflow.
And if you don't think that orchestration should have a fast developer feedback loop and that's not really what its purpose is, that you should have scripts and tools and then the orchestrator's only job is to order and schedule them in production and that's its only domain, then Airflow is a better solution. But for Dagster, it's like, if you are building a data platform and need a control plane for it and you believe that you should bring kind of software engineering best practices to the data domain and you want the orchestrator to be the seat of your full end to end life cycle, then we're kind of the solution for you.
[00:54:08] Unknown:
We've touched on this a bit already as far as what you've got planned for the near to medium term future of Dagster and Elementl. I'm just not sure if there is anything else that you wanted to bring up before we start to close out the show. Oh, there's so many plans and future directions to go. You know, we can save that for the next show. We are, you know, as a company
[00:54:28] Unknown:
internally and as well for our external communication purposes, we are laser focused on Dagster Day and the 1.0 release and, you know, launching cloud to the world. So those are our plans.
[00:54:42] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing or find out more about Dagster Day, I'll have you add your preferred contact information to the show notes. And as the final question, what is your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:55:00] Unknown:
And, Tobias, you always ask this of tool authors and vendors, and I feel like we're being
[00:55:08] Unknown:
sucked into a trap because, you know, it's gotta be what we're working on. Well, obviously, it's not because you're already solving that problem.
[00:55:16] Unknown:
Yeah. You know, to me, it's a couple things. One, I do think the lack of a cohesive end to end developer workflow in the context of data is still a huge problem, an unsolved problem. And I think some of the stuff that we're working on is gonna be a linchpin for that, but it's gonna be an ecosystem wide initiative to get that solved. The other thing that comes to mind, and this goes back to this kind of unbundling and bundling conversation, and not to rehash that, but this general notion that we are asking too much of data teams, that they have to assemble, like, 12 different tools in order to get basic end to end capabilities. And I saw a tweet on this once, like, if the data ecosystem was in charge of the car industry, there'd be a tire vendor, a steering wheel vendor, a seat vendor, a body vendor, and every single consumer would have to assemble the car by themselves.
And that resonated with me. That's why in our platform, we are building in a bunch of capabilities. I call them like a base layer because we're not trying to replace all the products, but, you know, we have integrated lineage, observability, orchestration, and a basic catalog. So that out of the box, you kind of have, like, these basic capabilities built in so that if you only need a simple version of those tools, you don't have to integrate a whole other vendor. So to me, it's about the end to end developer workflow and also this notion of, like, not having to integrate a dozen vendors just to get basic capabilities into your platform.
[00:56:51] Unknown:
Well, thank you very much for taking the time today to join me and share some of the recent updates in Dagster and the work that you're doing at Elementl and on Dagster Cloud. It's definitely great to see this commitment to stable APIs and releasing your cloud product as generally available. So definitely excited to see that happen and what comes next. I appreciate all the time and energy that you and your team are putting into building such a quality product. So I thank you again for that, and hope you enjoy the rest of your day. Thanks so much, Tobias. Thanks for having me. Thank you for listening. Don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest on modern data management, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you have learned something or tried out a project from the show, then tell us about it. Email hosts@pythonpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Updates on Dagster and Elementl
Software Defined Assets: Motivation and Impact
Early Outcomes and Development Approaches
Parallel Development of Cloud Product
Dagster 1.0 and Cloud GA Release
Branch Deployments and Development Lifecycle
Managing Test Environments and Tool Integration
Impact of 1.0 Release on Community
Open Source vs. Proprietary Features
Innovative Uses of Dagster and Cloud
Lessons Learned and Future Directions
When Dagster is Not the Right Choice
Closing Remarks and Future Plans