Summary
The technology for scaling storage and processing of data has gone through a massive evolution over the past decade, leaving us with the ability to work with massive datasets at the cost of significant complexity. Nick Schrock created the Dagster framework to help tame that complexity and scale the organizational capacity for working with data. In this episode he shares the journey that he and his team at Elementl have taken to understand the state of the ecosystem and how they can provide a foundational layer for a holistic data platform.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform and blazing fast NVMe storage there’s nothing slowing you down. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Your host is Tobias Macey and today I’m interviewing Nick Schrock about the evolution of Dagster and its path forward
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Dagster is and the story behind it?
- How has the project and community changed/evolved since we last spoke 2 years ago?
- How has the experience of the past 2 years clarified the challenges and opportunities that exist in the data ecosystem?
- What do you see as the foundational vs transient complexities that are germane to the industry?
- One of the emerging ideas in Dagster is the "software defined data asset" as the central entity in the framework. How has that shifted the way that engineers approach pipeline design and composition?
- How did that conceptual shift inform the accompanying refactor of the core principles in the framework? (jobs, ops, graphs)
- One of the powerful elements of the Dagster framework is the investment in rich metadata as a foundational principle. What are the opportunities for integrating and extending that context throughout the rest of an organization's data platform?
- What do you see as the potential for efforts such as OpenLineage and OpenMetadata to allow for other components in the data platform to create and propagate that context more freely?
- What are some of the project architecture/repository structure/pipeline composition patterns that have begun to form in the community and your own internal work with Dagster?
- What are some of the anti-patterns that you have seen users fall into when working with Dagster?
- Along with your recent refactoring of the core API you have also started to roll out the Dagster Cloud offering. What was your process for determining the path to commercialization for the Dagster project and community?
- How are you managing governance and long-term viability of the open source elements of Dagster?
- What are your design principles for deciding the boundaries between OSS and commercial features?
- What do you see as the role of Dagster in the creation of a data platform architecture?
- What are the opportunities that it creates for data platform engineers?
- What is your perspective on the tradeoffs of pipelines as software vs. pipelines as "code" vs. low/no-code pipelines?
- What (if any) option do you see for language agnostic/multi-language pipeline definitions in Dagster?
- What do you see as the biggest threats to the future success of Dagster/Elementl?
- You were a relative outsider to the data ecosystem when you first started Dagster/Elementl. What have been the most interesting and surprising experiences as you have invested your time and energy in contributing to the community?
- What are the most interesting, innovative, or unexpected ways that you have seen Dagster used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dagster?
- When is Dagster the wrong choice?
- What do you have planned for the future of Dagster?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Elementl
- Video on software-defined assets
- Dagster
- GraphQL
- dbt
- Open Source Data Stack Conference
- Meltano
- Amundsen
- DataHub
- Hashicorp
- Vercel
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey. And today, I'm welcoming back Nick Schrock to talk about the evolution of Dagster and its path forward. So, Nick, can you start by introducing yourself? Yeah. Thanks, Tobias. My name is Nick Schrock. I'm the CEO and founder of Elementl, which is the company behind Dagster, and it's an honor to be here. And do you remember how you first got involved in the data ecosystem, for folks who haven't listened to your previous episode? So I worked at Facebook from 2009 to 2017,
[00:01:56] Unknown:
and I didn't really work in data management then. The thing I'm known for from that period of time is that I was one of the co-creators of GraphQL, which has gone on to be a broadly adopted open source technology. And when I left Facebook, I was figuring out what to do next, and I started talking to companies both inside and outside the valley about what their biggest technical and engineering liabilities were. And the notion of data infrastructure and ML infrastructure just kept on coming up universally. You know, I like to describe it as the biggest mismatch or gap between the criticality and complexity of a problem domain and the tools to support that domain that I'd ever encountered.
It was also a critical domain insofar as, you know, the job of data management is to curate data assets, which are the basis of all decision making in the enterprise these days, whether it be a human making strategic decisions or a machine learning model making automated decisions. So it's a really important problem, and, you know, there are practitioners in pain, which is very motivating to me.
[00:03:12] Unknown:
And I started digging in, you know, about 4 years ago now. So that's a bit of the story behind what motivated you to build Dagster. I'm wondering if you can talk through some of the core concepts and design elements that you baked into those early versions of it, aimed at some of the pain points that people were experiencing in the conversations that you had?
[00:03:37] Unknown:
So very quickly, orchestration stuck out as both a pain point and a massive opportunity, because most of the orchestration solutions that people had used were fairly narrowly conceived as purely operational tools. Meaning, like, this is how you order and schedule things in production. And I thought that was a very missed opportunity. You know, what I found was that these orchestration systems fell short in their developer life cycle. You know, the whole job here is to build graphs of computations that consume and produce data assets. And, you know, solutions like Airflow and the container based solutions really didn't have smooth local development experiences.
And then I also found that the DAGs or graphs they encoded were very metadata poor, and they were not data aware, typically. And I thought that was a huge missed opportunity, because the DAG can be a source of truth for a large number of things. You can encode what data is produced and that data's lineage. You can make the computations themselves much more self describing. You can structure this thing so that the computations can be rendered in tooling prior to computation, which is actually very important for making it a system of record for your data platform.
And so that was kind of the entry point and why I started Dagster.
[00:05:01] Unknown:
Given that sort of foundational concept and the idea of building much richer metadata into the computation graph that you're building up with Dagster. And we've talked a lot about some of the core elements and the architecture of the project in the previous episodes; we'll add a link in the show notes for people who wanna revisit that. But that was 2 years ago now that you were on the podcast last, and I'm sure a lot has changed in the tool and in the ecosystem. So I'm wondering if you can just talk through some of the ways that the project has changed, how the community has evolved, and some of the ways that the ecosystem around the project has shifted your goals and priorities.
[00:05:36] Unknown:
So it's a much more mature and robust system now, and it's been, you know, conceived a little differently, insofar as the original vision was to have Dagster be more of a pure software layer over other orchestration systems. And it became clear relatively quickly after that podcast (which, by the way, I can't believe was 2 years ago) that we needed more vertical integration to achieve our objectives and satisfy our users. Meaning that, effectively, we built, like, a full orchestrator underneath the hood. Because, you know, we got feedback that people were like, we love the programming model, but we wanna only adopt one thing, not 2 things. And that made sense.
So in terms of high level technical properties, that is the biggest shift, and it really changed the scope of the project, and, you know, it's been serving us very well. You know, the community has changed and grown dramatically since then. You know, we had, like, no usage and maybe 150 people on our Slack when you were kind enough to have us on air. Fast forward, now we have thousands of people on our Slack and really awesome design partners that power their data platforms and ML platforms with Dagster: you know, Good Eggs, Prezi, Loom, GoPuff, you know, UNICEF, which is a nice one, Drizly, Scale, Mapbox, a whole bunch of awesome users.
The Philippines government used us for the data platform for their COVID response, which was surreal. You know, it's been deployed into fintech and telecom companies in Southeast Asia. So, you know, we've made a ton of progress on that front. Like, you know, we've grown the team a bunch. We have, you know, 20 people on the team now and, you know, we're making a ton of progress. And then, I think we're gonna talk about this later, but we're also, you know, on the cusp of doing an open announcement of our cloud product. So there's been massive progress on a number of fronts in the last 2 years.
[00:07:34] Unknown:
As you have been exploring the data ecosystem, particularly given your sort of fresh eyes on the problem, I'm wondering what you have grown to see as some of the core foundational challenges and complexities in the existing ecosystem, and some of the things that have been sort of transient or short lived or potentially mutable in the problem space, as people try to deal with the complexities that arise due to the nature of data? So a lot of foundational things have not changed.
[00:08:04] Unknown:
You know, there are fundamental underlying properties of data processing applications. I use that term generically for ML training pipelines or ETL jobs or data pipelines, because they're fundamentally doing the same exact activity. There are fundamental properties of those computations which have not changed, meaning that they typically cut across the organization but are interconnected. So there are multiple tools in play. That has not changed. It's multi persona, and they span teams. That has not changed. And then these systems also have this unique property in that the people crafting their computations do not have control over their inputs.
And this is very distinct from web applications where the programmer actually has much more control over the user. So, you know, you put a form in front of them. If they, like, mistype something or misformat something, you throw up a red box and say, please change this. So you can constrain your inputs, and, therefore, you have much more control. That is not true in data processing applications where you just get data from an external source, and it is what it is. You can't control it. Often, you can't, like, have the upstream person change it. So you just have to be far more defensive in your computations.
And it's much more akin to a manufacturing process where, you know, instead of an assembly line, you have an assembly DAG, and the raw materials flow through the system. In the end, you want a meaningful output. You know, as in manufacturing, you want QA at every step. And then, you know, that property makes it so that you have a much more intricate relationship between your data assets and the code that produces your data assets than you do in other forms of programming. So you can think about, like, a particular data asset might represent the data that was produced on a certain date.
It was processed by code that was last updated on a certain date and then recomputed at a later date. So there are, like, 3 notions of time just in that sentence. And that is really hard to deal with. The complexity is almost, like, hidden and subtle, because a lot of times you look at the code and there's, like, 2 lines of Pandas code, and you're splitting a number. Why is this hard? And it's hard because of those underlying properties that I referred to, and that is not going away. I think the transient or incidental complexity is going away because it's becoming received and accepted wisdom that the way out of a lot of these problems is to apply software engineering techniques to the data domain.
And I think you're really seeing that. Like, a good example of this is the meteoric rise of dbt, right, which is really, like, taking the lessons of software engineering and getting analysts to adopt them. You know, they retitle themselves analytics engineers, and they're dramatically more productive and leveraged. And I think that lesson needs to be reapplied across multiple domains in data. You know, I'm thrilled by that development.
[00:11:17] Unknown:
There are a number of different directions to go from here, and I think that the first one I wanna talk about is this idea of applying software engineering principles to the domain of the data engineer, and some of the challenges that that brings, because data, unlike software engineering, is inherently cross functional, where it requires participation from every member of the organization: there are producers of the data, consumers of the data, you know, manipulators of the data, and they all need to be able to align and interact along the entire process. And so there are concepts such as no code or low code platforms, or, you know, the sometimes loved, sometimes hated UI oriented pipeline building tools, and then, you know, this movement towards x as code, where code essentially means a YAML file or a JSON definition, or the other trend that's pulling people in another direction of containerizing everything so that you can have multiple languages coexisting in the different stages of computation.
And given the fact that Dagster is unapologetically a software tool, you know, written and implemented and consumed in Python, I'm wondering what you see as some of the potential challenges and opportunities in the overall acceptance of this very unapologetically software oriented approach to managing data computation, and some of the opportunities for being able to bring in some of these other concepts of, you know, polyglot data applications, or data applications in a more sort of low code, no code approach.
[00:13:03] Unknown:
There's a lot to unpack there, but I guess we can start with the premise that, yeah, we like to say that Dagster is built by engineers for engineers. You know, our core adopters always self identify as engineers, and that's very deliberate. So, you know, frequently, the people who really latch on to Dagster and really get a ton of value out of it describe themselves as data platform engineers, which means they themselves are engineers, but they are serving tons of non engineering stakeholders. So, like, you know, I just brought up dbt. Right? Like, you know, we have platform engineers who work in lockstep with analytics engineering teams. Those analytics engineering teams are still coding in dbt, but we have a nice integration.
And then those dbt users can use us to understand how their computations interrelate to all of their stakeholders, and then also have kind of data observability so they know how their assets interrelate. And, you know, even though I'm a grumpy engineer, I think there is a place for low code and no code environments properly conceived and placed within a software engineering process. So, you know, I actually have kind of a picture in my head, which we don't have to get into, and I even have a domain for it, but I'm not gonna tell you what it is because I've had domain names stolen from me before: a no code, low code tool that plays nicely in a composable way with the rest of the data platform instead of being a completely siloed system that is completely foreign to the engineering process.
So I think there's a spot for that. But, really, what I see happening almost you know, there's a term du jour out there, the modern data stack. Right? And there's, like, debate about what it means, and is it a set of technologies or, like, an emotional state? And, you know, I really think it's a mindset of a way of approaching the problem. And, effectively, data infrastructure is being rebuilt from the ground up for the modern cloud era and the software engineeringification of data. So starting at the most primitive levels where there's an ingest tool, a cloud data warehouse, a transform layer, and a BI tool. But there's gonna be more and more tools kind of embraced, you know, under that mantra, both because in reality, people wanna do more things, and then there's also the incentive structure where every vendor is gonna label themselves the modern data stack x. And you're already seeing that. Right?
But for low code, no code stuff, I think there's definitely gonna be a space that can deliver a ton of value: solutions that align with the values of the emerging modern data stack, that can bring in a business user, but in a way that comports with the software engineering process laid out by the data platform team. And then along the lines of the sort of,
[00:15:56] Unknown:
you know, data pipeline as code, where I just say, here's my YAML file, I want it to do the thing, and I just throw that at Dagster, and Dagster does what it's supposed to do. I'm wondering if you see a potential future for that, or maybe you have a set of kind of prebuilt off the shelf operations that somebody can just reference in a YAML file to say, I want this thing to happen, and I want it to tie into this other thing, and I don't wanna have to actually write all of the code that does those different pieces, because Dagster comes with all of those out of the box for me. Totally. So I think there's 2 components to what you're talking about. One is, like, a YAML file or an equivalent DSL.
[00:16:30] Unknown:
We have users who do that now. And we didn't wanna prescribe a single YAML spec to rule them all, because we find that the needs are often context specific. But what we do tell people is, like, you know, we have an example, which I wrote, I think, actually, of, like, oh, here's how you would take an ingest YAML file and produce a pipeline out of that. And it's, like, you know, relatively simple, and people have taken that and repurposed it. We actually had a user who took that, wired it all together, and built a full WYSIWYG drag and drop system on top of Dagster. So, you know, it's a layered system that you can build on top of. I think the other component that's kind of implied in your question is prebuilt integrations and prebuilt compute, such that you don't have to write code in order to integrate with a tool in a straightforward way.
And, you know, our ecosystem of integrations is growing and continues to grow every day, both kind of in our monorepo as well as, you know, out in the wild. You know, we recently gave a talk at the Open Source Data Stack Conference, and we were like, we don't have a Meltano integration. And then someone just googled it, and lo and behold, someone had written one out in the wild. And that's the magic of open source in these ecosystems. So I think there's kind of 2 components there. One, just to summarize, you know, we've written a layered system with well structured APIs where you can overlay your own DSLs on top of it. And then, you know, by leveraging the power of open source and having clear pluggability points, the surface area of our integrations has expanded dramatically.
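To make that layered-DSL idea concrete, here is a minimal sketch, not the actual example from the Dagster repo, of turning a YAML ingest spec into a job. The spec format, op names, and op bodies are all invented for illustration; the point is that Dagster's programmatic API lets you generate ops and jobs from whatever DSL you choose.

```python
import yaml
from dagster import job, op

# A hypothetical ingest spec; in practice this would live in its own file.
SPEC = """
steps:
  - name: extract_orders
    table: orders
  - name: extract_users
    table: users
"""

def make_ingest_op(step):
    # Each step in the YAML spec becomes a Dagster op; the body is a stub.
    @op(name=step["name"])
    def _ingest():
        print(f"ingesting table {step['table']}")
    return _ingest

def job_from_yaml(text):
    steps = yaml.safe_load(text)["steps"]
    ingest_ops = [make_ingest_op(step) for step in steps]

    @job(name="ingest_job")
    def _job():
        # Independent ingest steps; dependencies could also be wired here.
        for ingest in ingest_ops:
            ingest()

    return _job

ingest_job = job_from_yaml(SPEC)
```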
[00:18:03] Unknown:
Another aspect of what you're saying there, of not being prescriptive about how people are building their computations and how people are defining the sort of DSL that sits on top of Dagster to be able to be this more high level data platform experience. You know, I started using Dagster very early in the journey, and so there was definitely still a lot of experimentation with how you structure the repository and how you actually write out the different operations. And I'm wondering if there has been any sort of emerging best practice for how to actually architect the project, how to compose the different elements together, and be able to build units of computation that can be built on top of to create a sort of large and complex system without getting stuck in the trap of, you know, dependency hell or, you know, mountains of technical debt. Yeah. There's a lot to unpack there. Just in terms of good patterns and whatnot,
[00:18:59] Unknown:
you know, I've been very pleased to see people dig into our resource system a lot. And without digging into too much detail, this is kind of the seam that allows Dagster computations to be far more testable than any of its peer systems. You know, people who invest a little bit in their resources up front and set up, like, a development environment and a production environment see, like, massive productivity wins for a relatively limited upfront investment; you just have to kind of think about it and get it right (there's a sketch of this pattern at the end of this answer). And then, you know, another thing that's been a really successful pattern for folks, which ties into your dependency issue, is instead of kind of making everything in one big pipeline and having 3 different teams, you know, collaborate on that, having, like, what we call the mega DAG or the monolithic DAG. It's not particularly scalable. It's operationally really difficult.
We've really pushed our users and encouraged our users to leverage our asset awareness to connect different teams' computations through event based, what we call asset sensors. So, you know, our system, you can set it up so that it tracks, like, okay. Like, a previous job says that I just updated this database table that's tracked in our asset catalog, and you can set up a downstream job to just listen on that,
[00:20:21] Unknown:
be like, hey, whenever this is updated, like, kick off a new 1. And that allows those 2 teams to operate in dramatically
[00:20:24] Unknown:
more decoupled fashion. They live in what we call separate repositories, meaning they can easily use different container images. They can deploy at their own independent rate, and they don't have to know anything about the structure of each other's pipelines. Like, I hate to break it to data engineers, but no one cares about the structure of your pipeline. No one cares. I know you're very proud of it. All they care about is the asset. So, you know, really doubling down on the asset as the interface between teams has been really successful.
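A hedged sketch of that pattern using Dagster's asset sensor API; the asset key and the downstream job here are invented for illustration. The only thing the two teams share is the asset key, not any pipeline internals.

```python
from dagster import AssetKey, RunRequest, asset_sensor, job, op

@op
def rebuild_report():
    # Placeholder for the downstream team's computation.
    ...

@job
def report_job():
    rebuild_report()

# Fires whenever an upstream job records a materialization of the
# "orders_table" asset in the catalog, kicking off report_job.
@asset_sensor(asset_key=AssetKey("orders_table"), job=report_job)
def orders_updated_sensor(context, asset_event):
    yield RunRequest(run_key=context.cursor)
```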
You know, and that very much lines up with, not to get into the data mesh because, like, I don't wanna start a Twitter fight, but, like, I think one of the great ideas of the data mesh is that the asset is promoted as more important than the pipeline, and that's the API between teams. And that kind of structure reflects that. In terms of, you know, antipatterns and things people shouldn't do: you know, we built this very sophisticated configuration system, and we had this type ahead and all this stuff. And I think, actually, we encouraged users to make things overly configurable, which caused people to generalize things too early and do things in the config system that they should have just done in plain old code. And I think we're still kinda clawing that back a bit, actually.
Yeah. I think also, sometimes, because the API is so lightweight, it's just a Python function, people are like, oh, I can decompose my pipeline into a billion little things. And, well, that actually can cause performance overhead and understandability problems. So, you know, there are patterns like that that we're still sussing out.
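And here is a quick sketch of the resource seam mentioned at the top of this answer, with invented names: the op programs against an abstract warehouse resource, and each environment binds its own implementation, which is what makes the computation testable without touching production systems.

```python
from dagster import job, op, resource

class InMemoryWarehouse:
    """Stand-in warehouse used for local development and tests."""
    def __init__(self):
        self.tables = {}

    def insert(self, table, rows):
        self.tables.setdefault(table, []).extend(rows)

@resource
def mock_warehouse(_init_context):
    return InMemoryWarehouse()

@op(required_resource_keys={"warehouse"})
def load_orders(context):
    # The op only knows the resource interface, so swapping dev/prod
    # implementations requires no changes to business logic.
    context.resources.warehouse.insert("orders", [{"id": 1}])

@job(resource_defs={"warehouse": mock_warehouse})
def dev_job():
    load_orders()

# A prod job would bind the same op to a real warehouse client instead.
```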
[00:22:02] Unknown:
You mentioned the asset as kind of the boundary condition between pipelines, and the API that is sort of foundational to how people are using and consuming data, and that that's actually the most concrete element of the work that we're doing. And I know that one of the concepts that you and the Elementl team and the Dagster product are starting to orient around, and that has become sort of an emerging concept within the framework, is the idea of the software defined data asset. And I'm wondering if you can talk through some of the ways that the evolution of that as the kind of orienting concept has changed the ways that engineers approach pipeline design, and some of the ways that it is influencing the direction of the API contract that you have within Dagster as to how you actually construct these different units of compute and then compose them into these larger DAGs and pipelines?
[00:23:00] Unknown:
Yeah. Well, I mean, first of all, thank you, Tobias, for paying such close attention to the product, since we have not aggressively marketed this experimental capability and we've never really talked about it in any sort of public forum. There's, I think, a single GitHub discussion about it, but it's a direction that I am very excited about. And one interesting thing you said before was about, you know, x as code, that trend. I wish all of it was called software defined. So, for example, like infrastructure as code, generally, people associate that with Terraform, right, or some sort of declarative DSL where they, like, made up their own language.
I much prefer the term software defined because that is a little more expansive, where you can imagine a system like Terraform that was built on top of Python, where you could, like, use functions and stuff, and then you end up using Python to construct an in-memory software artifact, which is then kind of consumed by a system and, you know, reconciled within the same deployment scope. So we really put software defined assets in that same lineage as software defined networking, software defined storage, and software defined infrastructure: a way to conceive of your data assets and manage them.
You know, we're still figuring it out; it's very early days. But, you know, I think you can provide a really elegant, intuitive programming model this way where you no longer have to manually construct your DAGs, which is a huge short term win, and it becomes a very natural way to express the dependencies between teams. There's no way you can write the code without defining your asset lineage, which is awesome. You know, I like to call this our property of, like, the developers fall into the pit of success. Like, by default, they're doing, you know, the so called right thing. And then I think we can really shift the mental model of orchestration.
Yeah. Internally, we like to say, like, orchestration becomes reconciliation, where you think of it in a much more declarative way.
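A minimal sketch of that programming model, using the then-experimental asset decorator (the import path and asset names here are illustrative and may differ by Dagster version). The dependency is declared simply by naming the upstream asset as a function parameter, so the lineage graph falls out of the code itself.

```python
from dagster import asset

@asset
def raw_orders():
    # Pretend this pulls rows from an external source.
    return [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]

@asset
def order_totals(raw_orders):
    # Taking raw_orders as a parameter *is* the lineage declaration: you
    # cannot write this asset without stating what it derives from.
    return sum(row["amount"] for row in raw_orders)
```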
[00:25:08] Unknown:
And so, yeah, I'm very excited about the direction. And I know that in order to be able to support that more natural model of assets being the core element and the core conceptual aspect of the goals of the framework, you have also changed some of the API models, where originally the sort of unit of computation was called a solid, which has its own sort of interesting backstory; we don't necessarily need to get into it. And then there were sort of different levels of granularity for being able to compose those units of computation into the larger graph, where there was the pipeline, which was sort of its own special entity, but then there were also composite solids, which were a sort of subdag of a unit of computation, and you've been able to kind of break free from that and unify those into the idea of the graph. And I'm wondering if you can just talk through some of the ways that the idea of the software defined data asset and these more
[00:26:03] Unknown:
streamlined API concepts have kind of played against each other as you explored that space more fully. I mentioned the software defined asset thing, which is a very experimental capability that we're just starting on. What you just mentioned is kind of our latest release, 0.13.0, which renamed a bunch of stuff. You know, it turns out the name Solid was incredibly stupid. Like, I apologize to my team and the entire community. That's my fault and, you know, I kind of had to, like, eat my sin there. But more important than the rename, and I think, Tobias, as a user, you can speak to this, is that the system had a lot of power before this change, but you had to sift through some duplicative abstractions, and there was some goofy naming and all that stuff. So, you know, we kinda took a step back and we thought from first principles: okay, for the users who have kind of reached through that muck and found the value, what's the value they got, and what's the most elegant, most pure way to capture that value and express it?
And then as a result of that, you know, I think the team did extraordinary work here, and effectively, I think the system has gotten more powerful while boiling down the concepts. There used to be, like, 6 or 7 concepts you needed to learn, and we boiled that down to, like, 3 or 4. You know, when you do that in one of these systems, the core concepts interrelate to each other. So for every additional concept you add in a core system, there's often this combinatorial explosion of, like, oh, how does this thing interact with this? So I think this latest release is a massive step forward.
And even though we're changing a bunch of names and we're making people change code and all that stuff, the feedback is off the charts positive, which never happens. So I'm thrilled with the results. And in a way, right, by boiling down the system to a more stable, coherent core, it has now given us the space to move forward and build more capabilities on top of that, and it will feel better and, you know, allow for what we call the progressive disclosure of complexity. And I think, you know, it's just a much more stable foundation to continue to build the stack of capabilities that we have. So I'm incredibly excited about what just happened there.
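For readers who haven't seen the post-0.13.0 API, a minimal sketch of the boiled-down trio: ops compose into a graph, and a graph becomes a runnable job. The op names here are invented for illustration.

```python
from dagster import graph, op

@op
def extract():
    return [1, 2, 3]

@op
def transform(rows):
    return [r * 2 for r in rows]

@graph
def etl():
    # Dependencies are expressed by passing one op's output to the next.
    transform(extract())

# A graph plus resources/config becomes a job, the unit of scheduling
# and execution.
etl_job = etl.to_job()
```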
[00:28:24] Unknown:
Going back to one of the things you were saying earlier about the richness of metadata that can and should be embedded into the different stages of computation, and the interrelations between the different computations that are happening. That's definitely one of the more powerful elements that I appreciate: it does have this very expressive metadata graph and a lot of opportunities for being able to expose and propagate metadata in those units of computation. And I'm wondering if you can just talk to some of the opportunities for being able to take advantage of that metadata, both within the bounds of Dagster and some of the ways that it can be leveraged outside of the Dagster context and exposed into some sort of more universal metadata catalogs and data discovery tooling, and just some of the contracts and interfaces that you're thinking about for that? Totally.
[00:29:21] Unknown:
So, yeah, I'm a huge believer in allowing the engineer or the developer to directly encode metadata in their code, whether it's annotating the code itself, having structured metadata in their logs so they can communicate interesting facts during the computation, and also attaching metadata to the results of the computation. And by having metadata in all those different forms, you can provide enormous amounts of context about anything in the system. Right? You can look up an asset.
You can see information about that. You can then click on it, go back to the run that produced that asset, and get a ton of context about everything that happened in that run. Then you can go to the job that was the basis of that run, and you can see how the code was annotated, who owns it, and all that stuff. By being able to fluidly navigate through the system, any stakeholder can get tons of context about what's actually going on. And it is so much more powerful than Ctrl-F-ing through a log. If you fully buy into it, it is incredibly powerful. So I'm a huge believer in that, you know, and especially the format where the code itself is annotated with descriptions. So one of the properties of DAGs is you can load up these computational graphs prior to computation without any infrastructure requirements. You can just view the graph and, like, get all this descriptive information. It becomes a very useful system of record for your data platform.
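A hedged sketch of that kind of annotation; the asset key and metadata entries are invented, and exact parameter names vary across Dagster versions. It shows the three forms mentioned above: a description on the op itself, a structured event logged during the computation, and metadata attached to the result.

```python
from dagster import AssetMaterialization, Output, op

@op(description="Builds the daily revenue summary table.")
def build_summary():
    rows = [{"amount": 10}, {"amount": 25}]
    # Structured event recorded mid-computation, visible in tooling
    # rather than buried in a log file.
    yield AssetMaterialization(
        asset_key="revenue_summary",
        metadata={"row_count": len(rows)},
    )
    # Metadata attached to the result itself.
    yield Output(rows, metadata={"row_count": len(rows)})
```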
In terms of extending the context throughout the rest of the system, one, everything that's exposed in our UI is backed by an API, or, spoiler, a GraphQL API. So, you know, people can integrate that into any of their tools, and we encourage that. Right? We want Dagster to be the interconnective tissue, not the one tool to rule them all. And then, you know, where I'm really excited to kind of integrate with other data catalog and metadata systems is to really dig in to this structured event stream that we produce, to allow an arbitrary consumer to ingest that and then build up their own interesting indexes and integrate it into their system. We have, like, a built in, what I'll call, operational asset catalog that tracks assets that are produced by the Dagster system itself, and we can, like, enable interesting operational use cases. We're not interested in, like, the universe of data cataloging capabilities, you know, like crawlers and, you know, really complex ontologies that are expressed in tools like Amundsen and DataHub. It's not our business.
We want them to be able to ingest our stuff, and I think the right way to do that, you know, will be structured event ingest. But, you know, all of these tools that I mentioned are actually quite early days. We haven't seen, like, a massive market demand for those integrations yet. I'm excited for the day when that occurs. You know? But before you're integrating a data catalog tool with an orchestrator and all this other stuff, you need to, like, adopt it in the first place and kind of build up to those capabilities. So, I think I stole this from Tomasz from Redpoint, but I think he's right: the next decade in software engineering is a decade of data, and it's still super early days. So, you know, I'm speaking a lot in terms of aspirations, but this is kind of our general approach. And so another interesting element of this is the case where you maybe have multiple distinct
[00:32:47] Unknown:
deployments of Dagster, and each of those deployments has their own set of computational graphs that they're concerned with, their own set of assets that they're trying to create, but they are still within the bounds of a given organization. And I'm wondering what you see as the path towards being able to gain visibility across those maybe multiple deployments or distinct graphs to be able to see sort of the full lineage, where maybe you have a data asset that is the output of one pipeline, and then the other installation has a sensor that is keyed off of that asset to be able to trigger another downstream pipeline, and then being able to kind of stitch that all together in one sort of comprehensive view to understand, as the end consumer of that second pipeline, where the overall graph started.
[00:33:35] Unknown:
Totally. And that's really where our cloud product comes into play. You know, I don't know exactly what date this episode is gonna be published, but I believe, and I will request, that it'll be after the announcement of the early access to our cloud product. And we consider those kinda enterprise capabilities. You could think of it like Dagster federation, almost, where you're federating all the different teams in a really well structured way. And, you know, I'm really excited to explore all those capabilities. You know, we're still early days on this, but, you know, we're building in capabilities where you can have cross-deployment asset sensors out of the box, you know, be able to provision those deployments very quickly with one click, and then have enterprise governance around those deployments. There's RBAC and, you know, various kinds of enterprise features around having a really full fledged enterprise data platform built on top of Dagster.
So, you know, operating the application logic to do all that is actually, like, very complicated. And, you know, it makes sense for us to centrally manage that stuff. And we can get into kind of the open source commercialization boundary as we go on. But, you know, those sorts of enterprise organizational use cases are definitely a focus of our upcoming or recently launched Dagster Cloud product.
[00:34:57] Unknown:
Struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world's first end to end, fully automated data observability platform. In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo swag box.
Digging more into the cloud aspect, I know that you've been working on Dagster for a long time, and you delayed the commercialization aspect of it for a while until you were very certain of where it fit within the ecosystem and that you were solving the right problems. And I'm wondering if you can talk through some of the journey from, I have this idea of a product that I want to build, here's the open source implementation of it, I'm going to use this to explore the problem domain, to now, I see a clear path towards a commercial option to be able to actually monetize this without cannibalizing the community that I've built up along the way. So that journey,
[00:36:29] Unknown:
I always start with the developer experience. It's kind of, you know, maybe it's because I'm a developer and I'm selfish, and that's who I empathize with. But, you know, that's what I think of. I think that developers wanna be able to solve their productivity problems and code without interaction with a commercial product, necessarily. And, also, the technology has what I'll call, like, an intimate relationship with the user's code. You know, there's a framework and it calls into the user's code in the same process. There's very practical reasons why you want that to be open source. Like, the code's in the same stack trace, and you want people to be able to build integrations and have a community. So even just beyond, like, the general notion that, like, I like open source communities and I think they're incredibly powerful. And the GraphQL experience, which I was directly involved with, and the React experience, which I witnessed close up but wasn't as involved with, were inspiring and made lots of people's careers and really democratized technology, and I wanted that to happen.
But I think you need to figure out something that people want first, and I wanted to do that in the open source domain. I think if I went back in time, I might have kicked off the commercialization earlier, not because we needed to make money. Luckily, we had very patient investors who kind of get open source, and were very on board for the, like, hey, let's invest in the core technology, and there's no rush to a commercialization that, and I'm trying to come up with a metaphor that isn't macabre, would prematurely kind of hamstring the technology.
But, you know, in the modern world, what's interesting is that there's an increasing comfort with hosted services, where it's not like the old days where there was an open source technology and you're, like, pulling teeth to get people to use your commercial service. Now it's the opposite, where lots of people view the open source project having a SaaS offering as an adoption requirement, which completely flips the script, which has been a fascinating development. So, yeah, I think, like, you know, that's been going on. I think we knew we wanted to do the big API change and, like, simplify the system, but we knew we could also start building the commercial offering in parallel. So we kinda kicked off doing the commercial product early this year and did that in parallel. Now that we have the API we wanna stand behind, the early alpha users are getting a lot of value out of the commercial service, and we've matured that infrastructure, it just makes sense to kind of couple them now and move forward. And we really think it can kind of, like, kick start adoption and even accelerate that further and just provide a ton of value for our community who wants this. Like, people don't wanna migrate a database.
It's like the worst thing in the world. Well, not the worst thing in the world. On the scale of software engineering, it's a very annoying thing to deal with. And it just makes sense for a company to centralize and manage those issues on users' behalf. You know, the company can do it way more efficiently. And so there's a very natural, mutually beneficial transaction there for users large and small.
[00:39:31] Unknown:
In terms of the value add that you're baking into the cloud platform, I'm wondering if you can just talk to some of the ways that it's architected to be able to add some of those additional capabilities and simplify the operations of the system while still being able to sort of leave control in the end user's hand and maintain some of the sensitive aspects of data because of the fact that it is liable to have PII or
[00:39:57] Unknown:
regulated information or, you know, security constraints imposed upon it? Yeah. So, you know, we've chosen an architecture that I think is increasingly becoming an industry standard, more or less. You know, there's a few names for it. Some people call it a managed control plane and a user data plane. Some people call it hybrid SaaS. You know, it's like what our CI system, Buildkite, does. There's an agent that lives in our VPC. It phones home to ask, like, oh, should I kick off a job? And then the compute actually happens in our VPC. We pay the bill to Amazon directly, so there's not a tax on the compute. And that works really well. Databricks is similar with their so called E2 architecture.
You know, Snowflake even does this to some degree. They deploy compute into users' VPCs. So we have a similar model where our goal is to host as many of the stateful, complicated services as possible, so the metadata database, the web server, long running processes that deal with scheduling, and so on and so forth, but then still have the user have control of their code and their data, such that it can operate on any infrastructure, from your laptop all the way to a K8s cluster. And then there's just an agent that phones home and kicks off compute when there's something to do.
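An entirely hypothetical sketch of that agent model (the endpoint and payloads are invented, not Dagster Cloud's actual protocol): the agent polls the managed control plane, runs compute inside the user's VPC, and only structured metadata travels back up.

```python
import time
import requests

CONTROL_PLANE = "https://control-plane.example.com/api"  # hypothetical endpoint

def execute_locally(run):
    # Compute happens here, inside the user's VPC, next to code and data.
    return {"run_id": run["id"], "status": "SUCCESS"}

def poll_once():
    # Ask the control plane whether there is anything to do.
    pending = requests.get(f"{CONTROL_PLANE}/pending-runs").json()
    for run in pending:
        result = execute_locally(run)
        # Only structured metadata is streamed back up.
        requests.post(f"{CONTROL_PLANE}/events", json=result)

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(30)
```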
And then we stream up structured metadata through that same channel, or a similar channel, and that powers all of our kind of operational tooling, our web UI, our asset catalog. And we think this provides, you know, a great balance for people. You know, what's nice about what we call the user cloud is that it's stateless. Like, all it does is kick off compute; it runs for a while, and it spins down. So it's very amenable to fully elastic compute and, you know, cloud native infrastructure. It's great for cost. You can run on spot instances, all that good stuff. So minimal operational overhead. But that's just the start of it. You know, I think that we're really interested in exploring the various modalities of this. And, you know, we'll respond to user feedback too. Like, we'd be excited to either have more managed solutions or partner with people who can provide out of the box managed solutions. But, you know, we've talked to dozens of early users on our wait list, and they seem overall very happy with that trade off of having to do a little ops on their side, but they get to own their compute and their data. And then we kind of manage all the complicated, like, stateful services that they don't have to manage, and then we can also provide value add proprietary services on top of that. As
[00:42:26] Unknown:
you progress along this journey of commercializing and delineating the boundaries between the open source project and the capabilities there and what you're building into this managed service. I'm wondering what are the design principles that you're using to make those decisions as new capabilities come up as a possibility to add to 1 or the other and some of the ways that you are kind of managing the governance and long term viability of the open source project?
[00:42:53] Unknown:
Yeah. I love this question, Tobias, and I think it's a really important one to ask open source founders, because, you know, if someone doesn't have a principled framework around doing this, they will be completely thrashed all the time by, you know, oh, we have to open source this because our competitor did this. Or, like, oh no, there's a new PM who has a different take on this, and they wanna be grabby and decide to deny capabilities to your community to coerce them to use the commercial service, and that's no fun. Right? That doesn't provide predictability. I think that open source communities are very worldly and understand that we need to build a sustainable business.
They understand that. They just want to be able to predict our behavior and have there be a fair business model that works for everyone. So in that vein, you know, I think that we have our own framework, kind of starting with the HashiCorp model as a base foundation. They chop up the world between what they call technical complexity and organizational complexity, where technical complexity is broadly in the open source domain and organizational complexity is in the commercial domain. And I thought that was a great place to start, but I thought, you know, given our context, we could even go a little further than that. So instead of 2 complexities, we kinda subdivide the world into 3. One is application complexity, meaning, like, how users structure their code and, very concretely, how they consume and interact with the open source Dagster framework. Right? There are kinda some rules you can depend on. Like, if the code is in process with your code, it's gonna be open source. Right? If you're pip installing Python, that's gonna be open source.
The basic core framework level will always be open source. Then there's what I call the enterprise complexity, where you're dealing with, like, SOC 2 compliance and governance and all of these capabilities which are very expensive to develop and very complicated, and you wanna be able to change things quickly to fix bugs and develop new product capabilities and all that. The people who use those capabilities want a commercial contract. They don't trust a pure open source solution. So literally, you're enabling usage by providing one. You know, a commercial contract is a thing that business people understand. It's like, oh, I'm paying you money and you have this obligation. And that makes sense. So there's application complexity, which heavily biases, maybe even 100% biases, towards open source. And then there's the enterprise complexity, I'll call it, for which people want a commercial relationship.
I think the interesting one that's fraught, that people struggle with, is this middle category I call operational complexity. So, operationalizing these computations. Because, you know, there's this balance of: you want open source users to be able to use this in very real scenarios, because our goal is to make Dagster a durable open standard, and you don't wanna coerce people to use a commercial product in order to use and push forward the standard. We don't want that. But there's this opposite thing where if you put 100% of your operational capabilities in the open source domain, one, you are subject to disruption by the cloud providers. That defensibility is real.
Second of all, and I think this is underappreciated, maybe actually kind of the determinative factor here, is that it is much slower to develop things in the open source domain. So, you know, if you open source an entire complicated infrastructure, which has multiple interacting back end services and whatnot, there's actually very few organizations on the planet who can run that thing effectively. And then, like, debugging all those issues at all those installations across all those open source users is a huge tax on the team. Right? And we don't have the engineering throughput to deal with that. So there's, like, this very practical issue that some operational capabilities make a ton of sense to centralize, because you wanna be able to move fast.
You wanna have more engineering throughput. And then if the only organizations that can run it anyway are yourself, maybe the big mega tech companies, and the cloud providers, that doesn't help anyone. So I really categorize it between, like, operational capabilities that have massive economies of scale to centralize. Like, for example, if we can never have one of our open source users run a database migration ever again, that would be awesome. It is so much easier for us to do that on the user's behalf. That's just, like, a super concrete example of the centralized economies of scale I'm talking about. But, you know, we are also going to have an open source Kubernetes implementation of the system, and we would never, like, deny a bug fix to the open source community because we wanna, like, get them to use the non-buggy software.
First of all, it's ridiculous. And second of all, it goes against our values, and there's no economy of scale benefit from fixing that bug. So, just to summarize again: there's application complexity, that's open source; enterprise complexity, that's proprietary and centralized; and then you can kinda chop operational complexity in two. If it's complicated enough that not that many people can run it anyway, and it benefits from centralization, that's proprietary. And if it's, you know, just like building a Kubernetes executor that has very strict and very well defined properties, but that is pluggable so you can, like, extend it and whatnot, that is in the open source domain.
It's complicated, you know, there's a lot there, but that's kind of the general framework of how we're approaching things, and we'll be writing more about this as time goes on. Yeah. And that all makes good sense. And
[00:48:36] Unknown:
specifically, in terms of the operational aspects of it, if you do have a sort of prescribed way of deploying everything end to end and this is just how Dagster works, then like you said, it's never gonna fit everybody's use case. And it's going to be a point of friction for adoption, because if somebody says, oh, well, the only way I can run this is if I run this one script that happens to deploy 6 things into Amazon and 5 things into Google, well, I don't use Google, and I don't wanna put anything in Google. Why would I ever do this? And so having those kind of defined interfaces of, you know, this is how it runs, these are the layers that you can add your own capabilities to, to run it in the way that you want it to. But if you don't wanna deal with that extra engineering overhead on your end, then, you know, pay us whatever it is per month, and we'll deal with it for you. Totally. And I think the other thing about this is that there's this implicit contract with the community as well, that, you know, where we wanna get to is a fair
[00:49:30] Unknown:
usage based pricing model where both the small players and the large players can participate in a way that feels win-win for everyone. Yeah. And I'm not saying we're gonna get there tomorrow. There's a lot of work to do there, but that is the eventual goal. And I don't think people give open source communities enough credit. Everyone's grown up, and they know the deal, you know. Our goal, unapologetically, is to build an awesome open technology and standard, and to build an awesome business on top of that. And we think that is beneficial for every stakeholder in the ecosystem.
[00:50:01] Unknown:
So as you are iterating on the core product of Dagster and the commercialization efforts, and you have this ambition of being an open standard for data computation, I'm wondering what you see as some of the potential threats to the future success of the Dagster product and the Elementl business.
[00:50:21] Unknown:
Yeah. I think the threats are still at a basic level, which is, will enough people get enough value out of us? Like, I'm an optimist founder, so, of course, I believe that, but we still need to grow the technology and make sure it applies to a broad enough set of people to make it a project worth investing in, a business worth building. So there's always that existential threat, which is, you know, is the thing you're building what people want? And we're definitely on the path there, but that needs to be proven out at a greater scale. So that's what I think about mostly. I guess the other existential threats are if some of the core premises of the business or the project are not true.
Right? If this whole notion of the software engineeringification of data is not the massive wave that we think it is, and instead it's a niche thing, and we can all move to no-code solutions where there's no engineer in the critical path at all, and everything's gonna be solved by managed services, and the engineers can go home and, you know, retire or cry depending on their position in life, I guess. You know, I don't think that's true. I think the path forward is not to eliminate engineers, but to make them more productive. And I think there's this glorious flywheel as a result of that. So, anyway, I'm trying to refute the existential threat even as I raise it. But if some of the core underlying assumptions of the project are not correct, then that's, you know, potentially a threat to the business. But other than that, I'm not too concerned. I'm very confident about the future of the project and the company. I was struggling for those answers, actually, you know, but I had to say something.
[00:52:07] Unknown:
And then going back to the beginning of the conversation, as we mentioned, to some people's eyes you're a relative outsider to the land of data engineering and data management, and you've been here for four years now, give or take. As somebody who is a relatively new entrant to the ecosystem, I'm wondering what have been some of the most interesting and surprising experiences and lessons that you've learned about the overall space as you have been investing your time and energy in contributing to and growing this community that sits so deeply in the core concerns of the data ecosystem?
[00:52:48] Unknown:
I mean, I don't think it's surprising, but I was gratified. I think that people in data are very excited. You know, data engineers have a reputation of being grumpy old people who like their tools and are like, get off my lawn, but I don't really think that's true. Throughout large swaths of the data community, there's a lot of open-mindedness. I think there's a lot of acknowledgment that there's a ton of work to do. There are a lot of different projects and teams experimenting in different directions, and even with people who are putative competitors, you know, it's a very collaborative relationship in the vast majority of cases.
So I think that has been good. One thing that I underestimated, and maybe this is because of my background at Facebook where there are centralized infrastructure teams taking care of all sorts of other issues, is just how big of a deal simplifying the DevOps story for all of this stuff is. You know, because I originally set out to build a software abstraction. I'm like, oh, the infrastructure is someone else's problem. Right? But we've had to vertically integrate more, and the deployment and DevOps aspects of this stuff, and sanding off all the edges and improving docs and all of that, is just so critical for onboarding and adoption. And I wish I had realized that earlier.
Definitely, I think we would have front-loaded the cloud product more and focused on that earlier. But, you know, I'm also really excited about the capabilities. Yeah. And I think coming at things from a fresh perspective is good, but you need to partner with people in the ecosystem. But, you know, I think some of the capabilities of our cloud product are really exciting. We're heavy Vercel users, which is kind of like a hipster JavaScript hosted environment that has this amazing DevOps story, and we're really inspired by that. And our goal is to create a Vercel-level deployment and user experience, but in the data domain. And I think that's, like, a completely alien concept there, you know. I'm obviously hyping things up, but I'm very excited to see what the reception in the marketplace will be for that.
So, yeah, I think those are the two things. I think it's been cool. Like, I think it's a crowded ecosystem, but the vast majority of people are very open minded and collaborative. And all this DevOps and infrastructure stuff, you know, people are just struggling. There's no centralized infrastructure team coming to save them. You know? And I don't think I had really wrapped my head around that, given my background.
[00:55:20] Unknown:
As you have been building out the Dagster team and the product and the community around it, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:55:32] Unknown:
You know, we bias towards pluggability and flexibility, sometimes at the expense of things not being out-of-the-box enough, but the trade-off of that is that people come up with all sorts of stuff. One of my favorites, and he spoke at a community meeting, it was either a company contracted by IKEA or IKEA itself, where he repurposed Dagster to be in the critical path of the application to render 3D models of furniture. They were able to repurpose it because they needed orchestrated DAG compute, and they liked the pluggability points and all that, but they were using it for a completely different use case than anticipated.
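The IKEA story is described only at a high level, but the underlying idea, expressing a graph of compute as dependent ops, looks roughly like this in Dagster's Python API. All of the op and job names here are hypothetical illustrations:

```python
from dagster import job, op


@op
def load_geometry():
    # Hypothetical step: fetch the raw 3D mesh for a piece of furniture
    return {"mesh": "sofa.obj"}


@op
def apply_textures(geometry):
    # Hypothetical step: attach material and texture information to the mesh
    return {**geometry, "textures": ["fabric", "oak"]}


@op
def render_model(textured):
    # Hypothetical step: produce the final rendered image
    print(f"rendering {textured['mesh']} with {textured['textures']}")


@job
def furniture_render_job():
    # The DAG is inferred from the function composition: load, texture, render
    render_model(apply_textures(load_geometry()))
```

The point of the anecdote is that nothing in this structure is specific to analytics pipelines: any workload that needs ordered, retryable DAG compute can reuse the same machinery.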
The other thing is that I think we're one of the only orchestration systems that really works on Windows. That's been really cool because it's been kind of an entry point to a lot of single players, so to speak, who work in more locked-down IT environments where they have to run on Windows. And I'm really passionate about that, because one of the things I was gratified by in the GraphQL experience was how it penetrated legacy enterprises earlier in its life cycle than I expected. It was like, okay, the kids in SoMa are using it, but then really early on, Walmart and KLM were using it. Right? And because of this Windows capability we have, we have folks at Honda using Dagster. Right? And I really like that dynamic. And then the other thing, I mentioned the person who built the drag-and-drop GUI interface. I thought that was fun.
One story I really like is Good Eggs. They run their entire data platform on Dagster, and they actually trained their nontechnical staff who work physically on the warehouse floor to use Dagster's operational tooling, because they have to ingest data from their contractors, and sometimes that data is malformed and they have to retry it and whatnot. The platform team did a bunch of work up front, but they were able to make it self-serve for people who are literally on the warehouse floor. And it was really gratifying to see that our investments in consumer-grade UI tooling paid off, where a non-engineer, not even an analyst, just, you know, a business user
[00:57:48] Unknown:
can come in and, with just a little training, intuitively use the system, and that was great. In your experience of being a founder in this space and building the product, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:58:02] Unknown:
I think that we probably moved a little too quickly at the beginning, because my mental model was the same development style that I'd used internally at Facebook to develop a lot of the core infrastructure there. We moved super quickly, and we made mistakes, and we corrected them. In open source projects, it's way harder to correct a mistake. At Facebook, you know, if I made a mistake and I needed to change the name of a method or something, I could just be a madman and, like, stay up all night and change all the call sites and move on with my life. That's not possible in the open source domain.
And I don't think I really internalized that lesson from the GraphQL experience, because GraphQL was a well-formed, coherent thing. We open sourced a spec, which has really stood the test of time because it was well baked by usage. So the artifact we open sourced was this durable thing. Whereas with Dagster, we were trying to do a new thing and really push the boundaries on a few different dimensions, coming up with ways to achieve the outcomes we wanted to achieve. And, you know, with this latest release I talked about, we're kind of paying the price for that, correcting some goofy names and consolidating some abstractions. So that's been kind of a humbling lesson, and I'm really privileged to work on a team that, one, was able to do that, but, two, was also willing to do it. Because it's like, oh, Nick, if you would've just not done that, then you would've saved three months of my life cleaning up your bullshit, to use a blunt term. But everyone's been super cool about that, and our community has also been
[00:59:40] Unknown:
great on that front. So I guess that's very top of mind because we just lived through it. And so for people who are looking for some foundational component for their data platform, or who are looking for a way to manage these graphs of compute in their ecosystem, what are the cases where Dagster is the wrong choice?
[00:59:57] Unknown:
Where is it the wrong choice? Let's see. We don't handle, like, real-time applications or streaming. The IKEA example I gave is slightly to the contrary, it was kind of a real-timey case, but they really did some clever stuff to make that work. You know? So I think that makes sense. We really target data platforms. I think other folks in the space are much more focused on pure orchestration applied to anything, right? So, like, CI/CD pipelines and any sort of workflow orchestration. And while you can use Dagster for that, that's not our focused use case. We are focused on the use case where you're building a data platform, and the purpose of that platform is to build, manage, and curate data assets. That's our business.
You know, it's funny. One of our design partners, Mapbox, is a very sophisticated user of the system. The person I talked to there was like, you know, we communicate this fairly clearly, and one of his stakeholder teams was like, I wanna use this for CI/CD. Is that allowed? I'm like, well, you can do whatever you want. Like, it runs code in order, and it retries things. So, you know, you can use it for a CI/CD pipeline, and I think it works reasonably well for that, but our tooling will not be focused on that use case. So, yeah, that's kind of my answer.
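"It runs code in order, and it retries things" maps directly onto Dagster's API: ops can be sequenced without passing data by using Nothing dependencies, and flaky steps can carry a retry policy. A minimal sketch with hypothetical step names:

```python
from dagster import In, Nothing, RetryPolicy, job, op


@op
def run_tests():
    # Hypothetical first step of a CI/CD-style pipeline
    print("running test suite")


# A Nothing input expresses pure ordering: deploy waits for run_tests
# to finish without consuming any data from it. The retry policy re-runs
# the op up to three times, ten seconds apart, if it fails.
@op(ins={"start": In(Nothing)}, retry_policy=RetryPolicy(max_retries=3, delay=10))
def deploy():
    print("deploying")


@job
def ci_pipeline():
    deploy(run_tests())
```

This is exactly the kind of generic workflow Nick describes as possible but out of focus: the machinery works for it, but the tooling is aimed at data assets rather than deployments.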
[01:01:23] Unknown:
As you continue to iterate on the commercial product and the open source community, what are some of the things you have planned for the near to medium term, and any particular projects that you're excited about? Yeah. The one I mentioned before, software-defined assets, I think
[01:01:37] Unknown:
is, yeah, really, really exciting. And I think that's gonna open up huge possibilities in our tooling. Like, I think we can completely reimagine the way that backfills work. I think we can really go a long way towards making a much more modern-data-stack-native, dbt-native orchestrator, and I think there's a real market need for that. I'm also really excited to see that programming model married with our cloud environment. I think we can help a lot with the problem of creating ephemeral development environments for data practitioners, where they can just push something up to a branch, and we effectively automate the process an infrastructure team would otherwise have to do by hand: okay, you're developing your data assets, so make a copy of the input data, have test schemas, all that stuff, and have a really safe environment where you can iterate quickly in a cloud environment. I'm really excited about that.
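For readers who haven't seen the software-defined asset model, here is a minimal sketch of the declarative style it enables. The asset names are hypothetical, but the pattern, where each function declares the data asset it produces and dependencies are inferred from the upstream asset names in its signature, is the core of the idea:

```python
from dagster import asset


@asset
def raw_orders():
    # Hypothetical ingestion step, stubbed out with inline data
    return [{"order_id": 1, "amount": 42.0}, {"order_id": 2, "amount": 17.5}]


@asset
def daily_revenue(raw_orders):
    # Dagster infers that daily_revenue depends on raw_orders from the
    # parameter name, so the asset graph is derived from the code itself
    return sum(row["amount"] for row in raw_orders)
```

Because the orchestrator knows which assets exist and how they relate, features like backfills and branch-scoped test environments can operate on assets rather than on opaque tasks, which is what makes the cloud workflow described above plausible.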
And then, yeah, in terms of cloud, you know, I get excited when you can use a technical system to help with organizational problems, this kind of deployment federation stuff we were talking about earlier. And we're starting at a pretty base layer for Dagster Cloud, where it's like, okay, we're making it easy to spin up a deployment, we're managing your ops. But I think there is so much runway in managing all that enterprise and organizational complexity for tons of use cases, and that can bring in way more stakeholders than any other orchestrator out there. So,
[01:03:10] Unknown:
yeah, those are kind of the immediate directions I'm excited about. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Alright. This is kind of an unfair question to ask of someone who's building the infrastructure, because, of course, you only think about, you know, your own stuff. I guess I'll refer back to the capability that I was talking about before.
[01:03:41] Unknown:
This notion of having a fast development life cycle when there's a bunch of managed services involved and you need to make copies of production data and all this stuff. I think the orchestration system will have a part to play there, but it can't solve that alone. And just the ability to do that, to have multi-tool environments where you can iterate on test data knowing you're not gonna do anything bad, would be a huge unlock.
[01:04:08] Unknown:
And like I said, I think that's an ecosystem-wide problem that'll take a lot of collaboration, but I think that's a huge hole. Well, thank you very much for taking the time today to join me again and talk about the work that you've been doing on Dagster. I'm definitely excited to start applying it more to my own environments and build up some more abstractions on top of it. So I definitely appreciate all the time that you and the team have put into it, and I hope you enjoy the rest of your day. Thanks so much, Tobias. Thanks for having me. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Chapters
Introduction and Guest Welcome
Nick Schrock's Background and Motivation
Core Concepts and Design of Dagster
Evolution of Dagster and Community Growth
Foundational Challenges in Data Ecosystem
Applying Software Engineering Principles to Data Engineering
Best Practices and Anti-Patterns in Pipeline Design
Software Defined Data Assets
Rich Metadata and Integration with Other Tools
Dagster Cloud and Enterprise Capabilities
Commercialization Journey and Open Source Balance
Design Principles for Open Source and Commercial Features
Potential Threats to Dagster's Success
Lessons Learned in Data Ecosystem
Innovative Uses of Dagster
Challenges and Lessons as a Founder
Future Plans for Dagster
Biggest Gap in Data Management Tooling
Closing Remarks