Summary
Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations. This offers a single location for managing visibility and error handling so that data platform engineers can tame that complexity. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and its applications, to help inform how you implement it in your environment.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- Your host is Tobias Macey and today I'm welcoming back Nick Schrock to talk about the state of the ecosystem for data orchestration
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by defining what data orchestration is and how it differs from other types of orchestration systems? (e.g. container orchestration, generalized workflow orchestration, etc.)
- What are the misconceptions about the applications of/need for/cost to implement data orchestration?
- How do those challenges of customer education change across roles/personas?
- Because of the multi-faceted nature of data in an organization, how does that influence the capabilities and interfaces that are needed in an orchestration engine?
- You have been working on Dagster for five years now. How have the requirements/adoption/application for orchestrators changed in that time?
- One of the challenges for any orchestration engine is to balance the need for robust and extensible core capabilities with a rich suite of integrations to the broader data ecosystem. What are the factors that you have seen make the most influence in driving adoption of a given engine?
- What are the most interesting, innovative, or unexpected ways that you have seen data orchestration implemented and/or used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data orchestration?
- When is a data orchestrator the wrong choice?
- What do you have planned for the future of orchestration with Dagster?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Dagster
- GraphQL
- K8s == Kubernetes
- Airbyte
- Hightouch
- Airflow
- Prefect
- Flyte
- dbt
- DAG == Directed Acyclic Graph
- Temporal
- Software Defined Assets
- DataForm
- Gradient Flow State Of Orchestration Report 2022
- MLOps Is 98% Data Engineering
- DataHub
- OpenMetadata
- Atlan
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable enriched data to every downstream team. You specify the customer traits, then profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades old batch computation model for an efficient incremental engine to get complex queries that are always up to date.
With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free. Your host is Tobias Macey. And today, I'm welcoming back Nick Schrock to talk about the state of the ecosystem for data orchestration. So, Nick, for people who haven't heard you on any of your past appearances, can you just give a quick introduction?
[00:01:34] Unknown:
Sure. Thanks for having me, Tobias. My name is Nick Schrock. I'm the CTO and founder of Elementl, the company behind Dagster. And before Dagster, I was best known as one of the cocreators of GraphQL. And, yeah, I founded Elementl, incorporated it in 2018, and the company really got going in 2019. I've been building that ever since.
[00:01:56] Unknown:
And do you remember how you first got started working in data?
[00:02:00] Unknown:
Yeah. I mean, I think we can start with the fact that, you know, the bulk of my career at Facebook was dealing with our data fetching stack for our application developers, and that was naturally kind of data management adjacent. But when I really went head first into it was, you know, when I left Facebook. I was figuring out what to do next, and I kept on talking to all these companies, both inside and outside the valley, and data infrastructure, ML infrastructure just kept on coming up over and over again as the biggest technical liability that they were facing. They couldn't get engineers to work on it. All the developer workflows are completely broken. It felt like going back in time relative to other software engineering domains, etcetera, etcetera. And then, you know, I kinda started to look into it. And I was hooked because it shared a bunch of the properties that I'm really interested in. You know? One, I knew about data management issues, and I was adjacent to it at Facebook. But it was, you know, a domain where, like, the engineers were absolutely in pain.
That really personally motivated me. It's a huge horizontal market. I really like working on projects where there's potentially hundreds of thousands and millions of users and developers. I get a lot of energy from that. And then orchestration in particular was this really interesting strategic point of leverage in that stack because the orchestrator kind of invokes every runtime, which touches every storage engine, and every single practitioner who's putting a data asset or pipeline into production generally has to interact with orchestration because all data comes from somewhere and goes somewhere.
And, you know, in the end, like, this was also a problem that really mattered. You know, these data platforms at these companies drive extraordinarily important societal processes. ML systems determine who gets loans and who doesn't. You know? How we price health care. You know, all these absolutely essential functions of society. And I think it's, like, incredibly important that it's all built on solid engineering principles and that the ecosystem is compounding and moving forward. So that was kind of the background of how we got into this. And so in terms of data
[00:04:23] Unknown:
orchestration, we've already mentioned that term a few times. The idea of orchestration is very abstract and exists in a number of different venues and instances. And for the purpose of this conversation, I'd like to just get a shared understanding of what we mean when we talk about data orchestration and how that differs from other types of orchestrators that people might be experiencing, particularly thinking of things like container orchestration, generalized workflow orchestration,
[00:04:50] Unknown:
things like that? Yeah. It's a great question. I mean, at some level in computing, all orchestration is doing in all of these domains is spinning up compute and then spinning it down, for the purpose of achieving some sort of outcome or goal. With container orchestration, and I think the kind of standard thing people think of when they think of container orchestration is K8s, typically that's used to spin up compute to run a service or just to run raw compute. That's generally what it's used for. You know, we at Dagster, we build on K8s to build a lot of our services. In terms of data orchestration, you are still generally spinning up compute, but it is in service of a goal that's more specialized than container orchestration. And that goal, in this case, is to orchestrate how data is managed in your platform. You know? All data at some level is produced by computation.
Right? So from the standpoint of a company, they have input data, which is typically data inputted by a user, crawled from an external system outside of their control, maybe generated by a sensor. Every single other data artifact that's managed by the system is derived from those raw sources and produced by compute. And the data, both the size and the, well, not necessarily the size, but certainly the number of different tables, is dominated by these derived data artifacts in the system. You know? Like, companies have, like, hundreds of thousands of these things, and that has to be managed. And the data orchestration system is responsible for scheduling the computations that produce all those derived datasets and data assets and spinning up the compute to do it. So to summarize, like, all the orchestration variants in technology are generally about orchestrating compute. It's then a question of to what end and what specialized features you need for that. Yeah. And what are the things that the orchestrators are paying attention to? So in data orchestration,
[00:06:52] Unknown:
you are ideally focused on what are the aspects of the data that you're trying to work through, what are the operations that you're trying to do on it, and what information does the orchestrator need to have to be able to effectively drive the state machine of that workflow, versus a generalized compute orchestrator like Kubernetes that needs to be aware of what are all of the hardware capabilities of the system, what are the resources that you're trying to request from it, what are the scheduling aspects of when this particular unit of compute needs to be executing, for how long, and under what failure modes. And, you know, for the data orchestrator, what are the failure modes of data, and how much does the orchestrator need to know about that to help the end consumer of that orchestration make sure that they are on top of what those failures are and resolving them. And so in terms of the adoption of data orchestration, as somebody who's been working in this space for, I looked in the Git history, and it's been 5 years now since you first started working on Dagster, a little bit less than that since it was generally public and in use. But what are some of the misconceptions that you've come across in terms of the applications of, the need for, and the cost to implement data orchestration as you've worked with people who are consumers of Dagster or who are evaluating the overall space?
[00:08:06] Unknown:
Yeah. I think that the one big misconception, which still exists and we still have work to do to educate the market on this, and I think we're gonna get into this, is that everyone needs orchestration. And right now, I think the mental position of practitioners is that bringing in an orchestration layer is something you do kind of late in the development of a data platform. And I think in reality, you need to be thinking about orchestration from day 1. And because people don't bring in broad based orchestration layers until too late, it often becomes just, like, kind of an adoption struggle, and you need to retrofit your systems to slot into it. But what happens is that, you know, especially in the modern data stack world, people spin up all these point solutions that all have their little baby orchestrators inside of them. And they're running overlapping cron jobs. You know? So you might have, like, a cron job running in Airbyte, one running in dbt Cloud, one running in Snowflake, one running in Hightouch, you know, whatever solutions. Then you're left with this Rube Goldberg machine where you have no idea what's going on. And then only later, at great cost, do you bring in an orchestrator to kind of unify all that together. I think, and, you know, I'm an orchestration vendor, so this sounds self centered and self serving, but I genuinely believe it's true, that the ecosystem and the practitioners would be far better served if they viewed the orchestrator as kinda one of the first tools they bring in when they start building their data pipelines as opposed to a sort of after the fact thing. I think it's partially on us to make that more obvious and kind of bring it farther left in the adoption curve. It's just kinda like how you can't make something secure after the fact. It's, like, much more costly to do after the fact. Just makes sense to kinda start with this foundation
[00:10:02] Unknown:
and have a unified view of, like, all your data processes from day 1. It just makes your life so much easier. Another interesting aspect of that is the conversations with people who already have some form of orchestration in place, particularly if they say, oh, I've already got Kubernetes, or, oh, I've already got a CI/CD system. Why do I need something like Dagster or Prefect or Airflow or whatever it is? I've already got a way to make sure all these things are running. I may even have some way of baking in some useful triggers off the data aspects that I care about. Why do I need a dedicated system just for that when I've already got an orchestrator or I've already got a workflow engine?
[00:10:40] Unknown:
Yeah. I mean, it kinda depends on what you're looking for and what your problems are. But, you know, I would just start asking questions about, like, okay, how do we test all this stuff? Once there's an issue, like, let's go through your debugging workflow. Let's say you have two different data engineers or two different practitioners in your system who are trying to interrelate what each other are doing. What's going on? How do you track the lineage of your data going through the system? How do you track the metadata of all your data assets? How do you alert off of that? There's just an entire set of capabilities that don't exist. I guess, like, you know, just on kind of an emotional level, I feel like if you just have some cron job running using, like, Kubernetes' native CronJob or something, just to throw out, like, one of the baby orchestrators.
You kinda feel like you're treading water as things work, but then once things start going wrong, it becomes very difficult. And it's very difficult as well to kinda have a single place where you can organize your entire data platform and know what's going on. So, I mean, part of this is, like, the existing problems that you have. But the other thing is that, you know, once you kind of are in a world with some more sophisticated orchestration
[00:11:59] Unknown:
system that gives you this, like, so called single pane of glass, it's really kinda hard to go back. You're like, how did I operate without having this? Absolutely. Yeah. From personal experience, I've definitely had conversations with somebody who said, oh, I've already got a Singer tap and target running in Jenkins. Why do I need something more than that?
[00:12:17] Unknown:
Or why do I need to bring in something like an Airflow? I've already got Argo CD that I can use to define these chains. I mean, and, also, part of it is, what are your needs? Right? It's kinda, you know, like if someone's just running Jenkins and wants to do one thing, and that's all they wanna do, and it works fine, that works for them, in the same way that, like, someone can write a web page using HTML if they just wanna display some static content. They don't need to bring in a modern front end framework.
But if you want to have, like, a more modern experience, you know, and do more sophisticated things, which in today's world is table stakes, you generally need to bring in a framework to help you out. Absolutely.
[00:13:02] Unknown:
For the requirements gathering and evaluation of these frameworks, data is very multifaceted in nature, largely because of the different personas and stakeholders involved, not even necessarily just from the technical aspects. It's largely the organizational and sociotechnical aspects that really make it complex. And I'm wondering what you see as the interfaces that are necessary to be able to properly support the various needs of different stakeholders in being able to work with, gain visibility into, and understand the applications of data in that sociotechnical framework?
[00:13:45] Unknown:
Yeah. This is a great question. Actually, something that is super top of mind for me because, you know, I might start repurposing some of your language for this internal road map document I'm working on. But, you know, internally, what we talk about a lot is that in these data platform teams, you know, a ton of our users describe themselves either as data platform engineers or as data engineers who work on data platform teams. And that set of users, that champion, is just thinking in terms of stakeholders all the time because that's their life. Their entire job is to scale data engineering across an organization.
And I think what we're really learning here, and what we're actively working on, is making it so that central driving data platform team can incorporate their stakeholders into the Dagster ecosystem with as few code changes as possible. You know? Like, I think we have this pretty well dialed with our dbt integration, for example, where, you know, the data platform team can come in, implement Dagster or Dagster Cloud, and then kinda import the existing dbt project that the analytics engineer is working on, and their workflow isn't disrupted. We can import their graph, view it natively in our system, and then that analytics engineer working on dbt doesn't have to disrupt their workflow, but they can get a bunch of value out of the tool because they can view the graph there and all their data lineage and the way it interrelates between systems. And then the data platform team gets a bunch of value because they've consolidated the control plane of their data platform under one system. And I think we need to replicate that dynamic across, you know, a bunch of things. Right now, what we've learned is that we often get into a situation where the centralized team is super excited about Dagster. They're willing to invest the time to really get up to speed in the system. But then the prospect of them kind of teaching the rest of their stakeholders to also be deeper users is just a bridge too far. You know? And so we're really thinking about dividing our users into this champion persona and the stakeholder persona. And they have different needs from the programming model and the tool. As I mentioned earlier, you've been working on Dagster for 5 years now, give or take. At that time, Airflow was
[00:16:16] Unknown:
definitely the predominant orchestrator that people were talking about. It still is one of the largest players in the space. I'm wondering, over that time period, how have you seen the requirements and adoption and application of orchestrators change?
[00:16:32] Unknown:
Yeah. This is a super broad question. I guess it's worth talking about all the things that have happened. So since 2018, there's obviously been kind of a new generation of orchestrators and tools that have cropped up. So, you know, Airflow still exists, and they've been making progress, and they're still definitely, like, the kind of largest incumbent. But what hasn't changed is that in 2018, Airflow was something that people lived with but did not love. And I still think that's true today, by and large. So it's kind of, like, it does its job, but it doesn't feel like a complete solution. It doesn't have a great developer workflow and all this stuff. And so I think everyone has realized this, and there's kind of a new crop of successors that have popped up vying for their place. And I think the primary ones you can think about are Dagster.
There's Prefect. And I also categorize, you know, dbt Cloud especially as another entrant in the orchestration space, even though traditionally it's not thought of that way. There's also another kind of family of tools that we can get into too, which is these vertically integrated MLOps platforms that usually have an orchestration system baked into them. And lastly, among the people vying to be the successor to core orchestration, there's the MLOps platforms, and then there's people who just kind of assemble orchestration systems out of these point solutions where they all have their little baby cron jobs. And, again, for the audience, they're like, okay, the Dagster guy is gonna now, you know, talk his book and talk about Dagster. But I will attempt to put on, like, my analyst hat and try to be as objective as possible here. I'll talk about the core orchestration kind of domain first. I think what's interesting is that Airflow, Dagster, and Prefect, say, in 2019, 2020, each system had its kind of unique value proposition that it was claiming. But in the end, they actually, from a certain point of view, had fairly similar concepts. They were all orchestrating DAGs of tasks.
Building a DAG of tasks, a DAG of computation, with Python and running it. Right? And we all had our different take on that. And then, you know, what happened in late 2020 to 2022 is that both Prefect and Dagster went two kinda different ways from Airflow. And, like, the category transformed from, kinda, like, three players who were all claiming, hey, we're the best way to orchestrate DAGs of tasks; instead they each went their own separate way. And Airflow kinda stayed the course. Right? They are still, we build workflows with Python. Prefect, they called it Orion, but it's now Prefect v2. They went a completely different direction.
They actually look much more, technologically, like Temporal now, or Cadence, in that they don't programmatically build DAGs. Right? They just run code, and it's more like a distributed state machine. So you can write loops, it's more dynamic, it's more general purpose, more imperative. And I think, like, the concrete difference there appears very early in the product experience, where in Prefect you can't visualize the DAG of compute until it starts running, whereas in Airflow or Dagster, you know much more about the structure ahead of time. And then we at Dagster kind of went a different direction. Instead of going more imperative, we went more declarative.
Right? And more specifically purpose built for the data platform use case. It's actually returning to the roots of the project. I was just reviewing my initial deck from 2018, actually, but we went this direction where you declare a graph of assets. Right? And we try to kind of abstract away the execution on your behalf. And that places it sort of on the spectrum in between Airflow and dbt, right, which, like, declares data assets, but only using Jinja templated SQL over a data warehouse. And I think it's actually a great development for the ecosystem that all these systems have kinda put their stake in the ground, so to speak, and said, we believe this is the way these things can be organized. It's given the customers, the community, a lot more choice, and I think we're exploring the solution space, like, very nicely.
So I think that's been, like, a really interesting development over the last 5 years.
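To make the declarative, asset-oriented model Nick describes a bit more concrete, here is a minimal sketch of declaring a small graph of software-defined assets in Dagster. The asset names, file path, and transformation logic are invented for illustration, and exact APIs can vary between Dagster versions.

```python
import pandas as pd
from dagster import Definitions, asset


# A "raw" source asset. In practice this might come from an ingestion tool
# (Airbyte, a sensor feed, etc.); the CSV path here is just a placeholder.
@asset
def raw_orders() -> pd.DataFrame:
    return pd.read_csv("orders.csv")


# A derived asset. Dagster infers the dependency on raw_orders from the
# function parameter name, so the graph is declared rather than wired imperatively.
@asset
def cleaned_orders(raw_orders: pd.DataFrame) -> pd.DataFrame:
    return raw_orders.dropna(subset=["order_id", "amount"])


# Another derived asset further downstream in the same declared graph.
@asset
def daily_revenue(cleaned_orders: pd.DataFrame) -> pd.DataFrame:
    return cleaned_orders.groupby("order_date", as_index=False)["amount"].sum()


defs = Definitions(assets=[raw_orders, cleaned_orders, daily_revenue])
```

The point of the declarative approach is that the orchestrator knows the full asset graph ahead of time, which is what enables the lineage, cataloging, and scheduling behavior discussed in the rest of the conversation.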
[00:21:13] Unknown:
I just talked forever, so please feel free to follow up on that. No, I think that that's all very useful context and very accurate. And as you said, the next generation of orchestrators that have been introduced since that initial foray of Dagster and Prefect, as a response to the pains that people are experiencing with Airflow, are largely focused either on the ML use case, or we just want everything to be cloud native, whatever that might mean, which largely just means we rely on Kubernetes as the core engine of what we are doing and we're largely bringing data awareness to Kubernetes as an orchestrator. So I'm thinking in particular of the Flyte project from Lyft. Or there is the response of, we are going to be a vertically integrated solution so that orchestration is a component of it, but we want to own the entire life cycle of your data, and we will be the ones responsible for actually managing the underlying compute, versus a Dagster or a Prefect or an Airflow that says,
[00:22:22] Unknown:
you own your compute. We're just here to help you tell it what to do. Yeah. And I was remiss to not talk about Flyte as well, which is a project I have a bunch of respect for, which is kind of interesting. I would say they're, like, in the space in between being an Airflow successor and one of the MLOps platforms. They very much target the ML orchestration use case, but in reality, they're quite a general purpose orchestrator. So, yeah, that's a good call out.
[00:22:50] Unknown:
Going back to the question of adoption and the motivating factors for somebody to bring an orchestrator into their stack. There's definitely an incremental cost. How large of an increment that is depends on what system you're using, what you already had in place, and what you're trying to do with it. But what are some of the ways that you're thinking about mitigating that incremental cost and smoothing the adoption curve, particularly when it comes to integration with that broader ecosystem? So you mentioned having a very good story for people using dbt so the analysts don't have to be aware of everything that Dagster is doing. There are people who maybe already have some investment in Airflow, most likely because of its state as the incumbent, and then there's a whole host of other tools and new ones being added all the time. And if you're investing in an orchestrator, you want to make sure that it is going to work with everything that you need it to work with, preferably without having to do a whole bunch of reengineering on your own to make that happen. So I'm wondering how you've been thinking about the motivators for adoption and ways to reduce the barrier to entry, particularly with the constant volatility of the overall ecosystem.
[00:24:02] Unknown:
Yeah. You know, one of the challenges of orchestration is that, you know, you're really putting yourself at the center of the storm, so to speak, on a few different dimensions, both how you deploy to all the heterogeneous infrastructure at companies, so there's that aspect, and then there's also what you talked about, which is integrating with the rest of the data tools where people are actually crafting their computations. And they're kinda two separate questions. I think, one, we've really tried to meet users where they are on the infrastructure piece. Our commercial product, Dagster Cloud, has a serverless variant where we literally host all the computation. And for a bunch of users, that works really well for them, especially if all they're doing is orchestrating other SaaS services and all their compute is managed anyway; like, they don't really care whether the orchestration compute is on prem or not. And then we also have this hybrid solution where, you know, we host the control plane and the metadata store, and people can run their compute and keep their data in their VPC or on prem. And they end up running a fairly simple cluster. I like to call it a stateless cluster. All it does is spin up one unit of compute for a run. It runs until it's complete, then it spins back down. So that ends up being a pretty simple model infrastructure wise. And helping out users with that is, you know, just really doing a bunch of blocking and tackling on the quality of our integrations. And then, importantly, always having ways, when a user reaches the edge of what is reasonable for us to generalize, of allowing them to kind of punch down in the abstraction layer and add their customization, because software systems' requirements are so different depending on the domain of the business. So, you know, we've really tried to minimize the infrastructure overhead as much as possible and really make it effortless to run the compute. Then there's the other aspect you're talking about, which is all the different integrations you have to deal with. And, you know, that's also a challenge too. I think that we have done a ton of work building up an integration library. We also have an active community that has submitted tons of integrations, and I'd love to see us kind of continue to mature
[00:26:25] Unknown:
and expand that, especially with getting more community involvement. Yeah. Did you wanna follow up on any particular aspect of that? I think that the integration piece is, at least personally, where a lot of my interest lies, and I realize I probably should have called out earlier in the conversation my personal bias as a user of Dagster, but anybody who's been listening to the show long enough already knows that. So in particular, as you said, you do have a good suite of preexisting integrations. You have a very good set of abstractions and framework layers that people can hook into to add their own behavior and integration with other systems. There's also the question of how much does that integration need to do for the user versus it just being a hook into another system that you then have to add some glue code to manage. So for instance, for the Airbyte and dbt use case that's gaining a lot of popularity, you have a good answer where you can automatically introspect the set of connections in Airbyte, the tables that that will generate, and you can easily wire that to the set of inputs that the dbt models are expecting, and you can generate this full asset graph with a few lines of code. I think I've got it in something on the order of, like, 15 or 20 lines of code in my project.
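For readers curious what that roughly 15 to 20 lines of wiring can look like, here is a rough sketch of loading an Airbyte instance and a dbt project into a single Dagster asset graph. The host, port, and project path are placeholders, and the loader functions and resource classes in dagster-airbyte and dagster-dbt have shifted across releases, so treat this as illustrative rather than canonical.

```python
from dagster import Definitions
from dagster_airbyte import AirbyteResource, load_assets_from_airbyte_instance
from dagster_dbt import load_assets_from_dbt_project

# Placeholder connection details for a self-hosted Airbyte instance.
airbyte_instance = AirbyteResource(host="localhost", port="8000")

# Introspect the configured Airbyte connections and expose each synced
# table as a Dagster asset.
airbyte_assets = load_assets_from_airbyte_instance(airbyte_instance)

# Load the dbt models as assets; their declared sources can line up with
# the Airbyte tables by name, letting Dagster stitch the graphs together.
dbt_assets = load_assets_from_dbt_project(project_dir="path/to/dbt_project")

defs = Definitions(assets=[airbyte_assets, *dbt_assets])
```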
Whereas there are other integrations, thinking in terms of something like AWS, where the integration largely just looks like, we will prepopulate the S3 client for you, the rest is up to you, we don't have any opinions. Which is definitely great; it's useful having that layer of fundamental connection. But then there are also questions of, for the people who you're trying to target, how much hand holding do they need? Do they just want to be able to snap together some LEGO bricks and not have to worry about the mechanics, or do they need to be intricately aware of what is being done and have full control over that execution? And how do you balance those levels of completeness for those integrations?
[00:28:18] Unknown:
The answer is people always want both, because we can't predict all the upfront needs. A lot of it is a reflection of the properties of the target systems as well. Right? The way I think about it is, you know, what does it even mean to have an integration with S3 that would get as much information as the integration with Airbyte? Right? If someone has encoded a bunch of knowledge in their Airbyte installation, they've set up sources, they've set up all this configuration, there's all this state that lives in their Airbyte install.
And what Dagster is doing with our integrations is we're saying, like, hey, you've done all this work to kind of configure your data assets in the system. Instead of making you do it twice, we're gonna reach in and grab that information out so you don't have to, so there's a source of truth somewhere. And similar with dbt, similar with, you know, Hightouch, all of our integrations with kinda higher level SaaS tools. In something like S3, no one is encoding that level of information there. Right? So we kinda try to keep it DRY, you know, don't repeat yourself, in that way. And I think, you know, we can't cover every single system for all time. But, for example, you know, this is where the open source aspect really comes into play.
There's a dbt-like system called Dataform, which was acquired by Google, if memory serves. And one of our commercial customers was kind of a legacy Dataform user, and they wanted to have a similar capability as the dbt integration. And we're like, listen, we can't build and maintain that integration ourselves, it's not widespread enough, but, look, we can work with you, and you can kind of ape or copy the integration that we wrote with dbt for your own purposes. And they're like, okay, we're totally willing to do that. And that was, like, a great example of, we have the existing integration, the customer understood, and they were totally reasonable. They're like, yeah, why would you maintain a super mature integration for kind of a more niche system? But we gave them the tools so that they could, you know, replicate that experience in their own infrastructure. And I think that's, like, one of the powers of having one of these open systems. Absolutely. Yeah.
[00:30:30] Unknown:
Other examples that I've come across in recent memory in my own work of that balance of having enough information to be able to build your own integrations and enough preexisting infrastructure to make it relatively simple are the work on the OpenMetadata project, where they actually, as part of their default install, will embed Airflow as their kind of scheduling system for being able to pull in the different sources of metadata. They've recently added an interface to be able to make that agnostic to whatever orchestrator you want. And then in that story of dbt versus Dataform, the newest challenger, or one of the newest challengers, is SQLMesh, which at least looks compelling. So there's also that aspect. What aspect of SQLMesh do you find most compelling? I'll be the interviewer. I'm very curious about your take on this. I think the most interesting aspect of SQLMesh is that they are trying to have a native understanding of the SQL semantics, versus dbt, which is very focused on, we just wanna be able to wire together all these SQL blobs for you. We'll add some templating, but it doesn't actually understand the SQL. Whereas SQLMesh is trying to go at a deeper level of wiring together the different blobs of SQL by understanding, okay, this generates this table, this uses this table, but only these 3 columns. And so we're actually going to automatically generate column level lineage because we know what's happening, versus dbt of, we're going to make you do the wiring together by using these different Jinja macros, and we'll use that to be able to wire things together, but other than that, we don't actually know what's happening. So you think the big difference is that SQLMesh is aware of the SQL AST
[00:32:11] Unknown:
and can therefore understand native column level lineage, whereas dbt is more kind of smashing together strings with Jinja templates.
[00:32:19] Unknown:
Therefore, it doesn't have deep understanding of the SQL. Got it. Exactly. And, also, because SQLMesh has that AST level understanding, it's able to do automatic translation between different SQL dialects. So if you write your code for DuckDB, but you wanna deploy to Snowflake, they'll actually do that translation for you. Very interesting. It's always fun to have these little forays into side channels. And another aspect that I wanted to touch on is the visibility that the core orchestrator has into the overall system, and the ways that that authority is challenged, not in terms of different orchestration competitors, but just in terms of trying to run the system. You mentioned, for instance, Temporal.
Airbyte uses that internally for their scheduling and execution. I mentioned OpenMetadata, which wants to embed Airflow as part of their system. You mentioned that dbt Cloud has its orchestration capabilities that they want to control so that they know what's happening in the system. And then from the organizational level, there's the aspect of developers wanting to be able to get things done. And if they're writing dbt code, maybe they just run the build from their laptop, and then Dagster has no visibility into the fact that that happened without you adding a little bit of extra work around that. And I'm wondering, what are some of the ways that you think about incentivizing everyone to use that central system as the means of actually driving computation, versus all of these other competing interests that want to own their piece of the puzzle and then having to retroactively
[00:33:46] Unknown:
recreate a visibility of that overall state. Yeah. I guess when I think about it, I think about it from two dimensions. And in the situation you're talking about, I am usually thinking about it in terms of, there's a centralized data platform team, someone who's owning orchestration, who is responsible for the general health of the overall data platform, and then the stakeholder teams. Yeah. What we find is that the earlier a stakeholder team can kind of plug into our asset graph, that lineage graph, and they can see visually both their own assets as well as the way they interrelate to the other teams, that's an immediate unlock in terms of the value of the system. They can now place themselves into kind of context in the rest of the org. And then the other thing is, generally, if the data platform team is doing a good job, and we hope that Dagster can help them with that, they provide a very clear incentive structure for the stakeholder teams to kind of onboard onto the more centralized platform. So, generally, that involves a much smoother development workflow.
So the ability for all the stakeholders to use, like, the branch deployments feature in Dagster Cloud, which, you know, creates a lightweight staging environment with every PR, which people instinctively understand is very valuable. And then typically, there's also some sort of contract between the platform team and the stakeholder team that makes it much easier for those stakeholder teams to self serve their pipelines operationally with higher quality tools. And that's where being able to use this centralized UI, where your data platform comes alive and you can kinda understand what's going on and build alerting on it, comes in. So I think there's both this, like, what's the immediate value that we provide to the stakeholders, and then also, like, what's the exchange of value that's been set up between the platform team and the stakeholder teams.
[00:35:46] Unknown:
This episode is brought to you by Datafold, a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data diffing to compare production and development environments and column level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs into your data CI for team wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration.
Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today. One of the other things that has been gaining a lot of ground, particularly in the past 2 to 3 years in the overall space of data work, is the idea of low code or no code, so that it's easier for nonengineering teams to self serve into that data workflow or be able to build some of their own workflows without having to bring an engineer into the picture and be blocked by whatever their availability happens to be. And I'm curious whether you see that low code, no code interface as the responsibility of the orchestration layer, or as a utility that plugs into the orchestrator, or that the orchestrator federates out to for managing those execution graphs?
[00:37:13] Unknown:
So it's a good question. I would say 90% of the time, I'm pretty skeptical of low code, no code solutions, in that I think they fall over and break when pushed, and therefore you fall back to code anyway, or you're left with this unobservable, undebuggable mess. I didn't say a 100%, though. And I do think there are, and I'm looking forward to, future directions in the Dagster ecosystem where there are low code, no code solutions. But to me, in order for them to work well, two things have to be true. One is that you still have to be incorporated into a software engineering process, meaning that if the user makes a change to their data pipeline using the no code tool, you need to be able to validate that things are working before you deploy it to production. And most of the no code tools don't do that, which is a big problem. Second of all, I think any low code, no code solution needs to be integrated into the platform.
The important bit is, like, maybe that no code, low code solution can handle, like, 90% of the nodes in a data pipeline, but then there's that critical 10% that are too bespoke and have too many customization needs, where you need code, and you need the ability to kind of break the no code abstraction layer and have an engineer take over that particularly critical or custom piece, something that can't be encoded with a drag and drop tool. In that eventual future where there's a no code, low code tool that's incorporated into the software engineering process and more deeply incorporated into the data platform, to me, the orchestrator has to be at the heart of that and has to be, you know, at minimum, very cognizant of these tools and integrated in a way, or even built in natively. So that's kind of the way I think about it. But, you know, you can imagine right now with this branch deployment idea, which I think is very powerful, you branch your code, and then it deploys a lightweight staging environment for every PR. You can imagine, with a no code tool, you could kind of be dragging and dropping a pipeline in a branch as well, and then have the same kind of, oh, I'm branching this drag and drop pipeline and then, like, remerging it, so you can operate in a branch, test it out, make sure it's working, and only then merge it back. But the no code tool would have to be deeply aware of the branch deployment abstraction underneath for all that to work. So I'm not ideologically opposed to no code tools. I just think they have to be incorporated into a software engineering process for something as high stakes as a data platform. Absolutely.
[00:39:47] Unknown:
And that is something that I've been seeing as a general trend in recent years, the awareness that low code is not the entire solution, and it can't be, because of the fact that it can be very brittle. And you do have to get down into the level of writing custom code for it to be able to power every use case. I think one of the interesting implementations I've seen of that is from the folks over at Prophecy, where they have a drag and drop UI of, we'll just give you your building blocks, you drag and drop them together, but it actually translates that into code that lives in a git repo. And you're also able to define your own custom blocks, writing code that people can then snap together as long as you know the interfaces, and they use a lot of compiler tricks to be able to translate between the two layers. They actually build on top of, I think, Airflow and Spark as their core infrastructure components that they target. I think the folks at Ascend.io are also doing something somewhat similar, where they, I think, have a low code aspect. I think that it is still textual, but they try to have some of that same pluggability of different connectors, but you can drop down into the layer of writing Spark itself as well. So I think it's an interesting evolution of the space, because I definitely have been in the, you know, earlier iterations of low code or no code tools and having to say, okay, well, if you wanna write code, here's a little text box. Good luck.
[00:41:08] Unknown:
No. No. No. No. That's when you know you failed. Absolutely. It's the, oh, write Python in this text box, and it'll be fine. It's like, you're not gonna be fine. It's not gonna be okay. This is, in your app, that box where you can write
[00:41:25] Unknown:
arbitrary Python and save it to a database somewhere, and then the system loads it for you, that is the box of death and pain. Like, please avoid that. Yeah. I think a lot of the allergic reaction that folks have to low code and no code is from the drag and drop tools of the early 2000s. In my experience, those were things like Microsoft SSIS, or SQL Server Integration Services, and, I think it was something called Kettle from the Pentaho project. I think I, like, started poking at those a couple of times and said, no way. I'm done. Right.
Alright. So for the conversation so far, we've largely been talking in the abstract about data orchestration without speaking specifically to what the purpose of the orchestration is. Generally, the assumption is around business analytics or business intelligence, but machine learning has been gaining a lot of ground. And then with the introduction of large language models as a consumer grade capability, the whole space has taken off like a rocket ship. And most of the time, when you see people talking about machine learning, they're building their own machine learning with the presumption that all of the data is already cleaned and prepped and ready to go, and there are entirely different machine learning stacks with feature platforms, testing and experimentation frameworks, etcetera.
What have you seen as the ways that orchestration is integrating the two sides of data prep and data cleaning with the ML workflows? Are those two separate concerns where you just have a clean interface for the handoff, or is that something that needs to be a continuous flow owned and orchestrated from a single system? So there was a great article that hit Hacker News,
[00:43:11] Unknown:
and I think it was titled MLOps Is 98% Data Engineering, or Mostly Data Engineering. I don't know if the number 98 might have been in the body of the article. But, anyway, I thought it was a great article, and it articulated, I think better than we've been able to, what we think about this, which is that ML engineering is mostly data engineering with some additional capabilities you need layered on top. But the core activity of building an ML pipeline that ingests data, processes it, and then produces a model out of it is fundamentally the same exact thing as building a data pipeline where the end data asset is used for data analytics. Furthermore, generally, it is critical for the data scientist or ML engineer to have full control over a good chunk of the data processing, because so much of building the ML model is actually understanding the domain of your data, and the only way to do that is to actually be the one writing the pandas code or the SQL code or the Polars code that is, you know, mucking around with this and doing the data processing. So that's kind of the way we look at it. We view Dagster as a data engineering tool, and there should be a common data engineering layer that spans multiple disciplines, because ML engineering is mostly data engineering, analytics engineering is mostly data engineering, and data engineering is mostly data engineering. So we're in this weird space now where, you know, if you go into ML, often there's this entire parallel universe of tools, and that's completely siloed. And we don't think that's the right configuration of the world. We think that MLOps should be a layer over a data engineering layer rather than its own entire silo.
And this has both organizational and technological benefits. You know, if you use the same data engineering layer, you have a unified lineage graph that crosses all your different teams. That's great. And then your centralized data platform team can make all sorts of investments in that data engineering layer that benefit the ML teams, not just the data teams. And we've seen a lot of organic behavior among our customer base with this, because we haven't really pushed this too much in our public marketing, which I think is a mistake. I think we'll look forward to 2024 on that one. But I think, you know, 90% of our customers use us for analytics and ETL. That makes sense.
And then 50% use us for ML and 40% for production use cases. And that adds up to well over 100%, which means that multi use case is the norm. And what we find is that usually data platform teams bring us in for the analytics piece and driving that, and then it's kinda like, wait, obviously we could also use this for ML, because those are just pipelines written in Python, and this is a great tool for that. So we're seeing some organic behavior that matches kind of our ideological view of this.
And I'm really excited for that direction going forward. But yeah. So to summarize, you know, we think ML engineering is mostly data engineering, and you should have a common toolkit, and this current state of the universe, where ML is a whole other universe, is, like, not appropriate, not great for anyone, actually.
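As a rough illustration of the "ML engineering is mostly data engineering" point, here is a hedged sketch of a model training step expressed as just another asset downstream of a cleaned dataset in the same graph. The file path, feature columns, and model choice are invented for the example and are not drawn from the conversation.

```python
import pandas as pd
from dagster import asset
from sklearn.linear_model import LogisticRegression


# Upstream data engineering asset: the same kind of cleaning/prep step an
# analytics pipeline would have. The parquet path is a placeholder.
@asset
def training_data() -> pd.DataFrame:
    return pd.read_parquet("features.parquet").dropna()


# The ML step is just another node in the same asset graph, so it shares
# lineage, scheduling, and alerting with the rest of the data platform.
@asset
def churn_model(training_data: pd.DataFrame) -> LogisticRegression:
    features = training_data[["tenure_days", "monthly_spend"]]  # hypothetical columns
    labels = training_data["churned"]
    return LogisticRegression().fit(features, labels)
```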
[00:46:37] Unknown:
And I think that also is an interesting view of the adoption curve paralleling the curve of data sophistication within an organization, where in the book from Joe Reis and Matt Housley, they talk about that idea of your kind of level of data maturity. At the very beginning, you're just trying to figure out, what data do I have, how do I even start working with it; then, okay, I've got a good handle on the core data needs, I have ETL pipelines, I can do the very basics with it; then, okay, now I have data in a good place, and I can start building more sophisticated usages of it, so I can start doing things like reverse ETL and operational analytics, I can start thinking about building things like recommender systems; and then, okay, now I can start doing proper data science and machine learning on top of it. And having the orchestration engine be able to grow with you along that maturity curve is also a very valuable aspect of the consideration of the ways that you want to be able to manage your workflows that are related to data, whether that is, I just need to know how to even start working with it, to, everything is great, and now I'm doing some really sophisticated stuff with it. Yeah. That was extremely well said. You might have a future in product marketing, Tobias. That was great.
And so another aspect of the single pane of glass, full visibility into the overall stack, is that idea of lineage that you mentioned, where you want to make sure that everybody has the same view of how did I get to here and where did I come from to get there. I've asked you this in past interviews as well, and I'm curious a little bit on whether your opinions have changed at all or whether there's more nuance that has been built up. But is the orchestration engine the right place to own all of that metadata, or is it too much of an effort to try and bring all of that information into the orchestration system, and you actually do still need a separate metadata catalog or lineage graph to be able to get that complete, comprehensive, cross cutting visibility of everything that's happening in your data suite, using something like a DataHub or an OpenMetadata or one of the commercial offerings that are out there?
[00:48:44] Unknown:
So, first of all, I think it is totally appropriate and natural and intuitive for there to be built-in lineage and cataloging in the orchestrator. And I think anyone who uses an orchestrator that has that kind of capability can't go back. That being said, all those other products you mentioned, you know, there's DataHub, there's Atlan, there's all this other stuff, they have a vast array of metadata capabilities, and they often speak of metadata activation. And they incorporate metadata into all sorts of business stakeholder facing workflows with, like, manual annotation and all sorts of processes that are more like a business app rather than an operations-facing technical app. So I view Dagster's built-in lineage graph as a higher level of abstraction that those asset catalog vendors can build on. Actually, there's a previous episode, I think you called it, like, the future of data ops in your interview, and it was a great episode. And it was actually the cofounder of Monte Carlo.
I'm not recalling his name, but he kind of laid out, effectively, the gold standard of data ops in his view, which is this global lineage graph where everything is alive and you can remediate issues directly within that graph. And he thought it would be so awesome if there was a world where all the tools didn't have to reimplement that from scratch. You know? And I was like, preach on. Testify. I was very excited about that, because that's effectively what our vision is for that: it makes sense for there to be a system of record that takes operational responsibility for all these assets.
The integration is very clear, but we want it to be an open API where people can kind of build additional capabilities on top of that. Yeah. Lior Gavish is the gentleman you were trying to remember.
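As a small illustration of what built-in lineage and cataloging can look like from the pipeline author's side, here is a hedged sketch of attaching metadata to a Dagster asset so the orchestrator's catalog records it alongside the lineage graph. The asset name, the inline data, and the metadata keys are hypothetical.

```python
# A hedged sketch: attach metadata to an asset output so the orchestrator's
# built-in catalog can display it next to the lineage graph. The asset name
# and metadata keys are hypothetical.
import pandas as pd
from dagster import MetadataValue, Output, asset


@asset
def customer_summary():
    df = pd.DataFrame({"customer_id": [1, 2], "lifetime_value": [120.0, 85.5]})
    # Returning an Output with metadata records these values in the catalog
    # for this materialization, where external tools could also consume them.
    return Output(
        df,
        metadata={
            "num_rows": len(df),
            "source": MetadataValue.text("synthetic example data"),
        },
    )
```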
[00:50:41] Unknown:
And so now, having worked in this space and made a significant contribution in terms of the product that you've built in the form of Dagster, and your experience of working with customers who are trying to fathom this space of data orchestration and how it will help them get a grasp on this volatile ecosystem of data engineering and data management, what are some of the most interesting or innovative or unexpected ways that you have seen data orchestration implemented or applied?
[00:51:11] Unknown:
So it's been a theme we've talked about, but my mind always goes to these kind of data platform engineers who enable very interesting internal applications. So, you know, for example, we work with an energy company where there's a data platform team that has built tools for their geologists. They have these custom tools where, using a GUI, the geologist can configure runs they wanna do, with extremely bespoke visualization of, like, the depth of oil wells and stuff like that. But then they press a button, and that actually kicks off runs in Dagster. So they've kind of integrated the two UIs and also built their own framework on top of Dagster for those geologists to use. So it's, like, very clever usage of the system that shows it's a platform for tool building in a domain that I wouldn't have guessed is applicable.
And, you know, a similar story: there's another pharmaceutical research company that's a long-time cloud user. It's kind of a similar case where they had built a layer of technology for their researchers to use, and they've kind of wedged into our abstractions so that they can transparently move compute from one system to another to minimize cost without changing any of their business logic, and they have empowered their less technical stakeholders, who are research scientists, to do it. So those use cases, where people are really integrating Dagster into their production apps and their internal tools in a fine-grained and interesting way, have been super cool to watch. And it's happened more than once, which means it's not a fluke, which is sweet.
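For a sense of how an internal tool can press a button and kick off a run, here is a hedged sketch using Dagster's GraphQL client. The host, port, job name, and run config keys are hypothetical, and this assumes the dagster-graphql package is installed and a Dagster webserver is reachable; it is a sketch of one possible integration, not a description of the customer's actual setup.

```python
# A hedged sketch of an internal tool submitting a Dagster run when a user
# clicks a button in a custom GUI. Host, job name, and config keys are
# hypothetical; assumes the dagster-graphql package and a reachable webserver.
from dagster_graphql import DagsterGraphQLClient

client = DagsterGraphQLClient("dagster.internal.example.com", port_number=3000)


def launch_well_analysis(well_id: str, depth_m: float) -> str:
    # Pass the GUI's inputs through run config so the pipeline code itself
    # stays unchanged; the returned run id can be used for status polling.
    return client.submit_job_execution(
        "well_analysis_job",
        run_config={
            "ops": {"load_well": {"config": {"well_id": well_id, "depth_m": depth_m}}}
        },
    )
```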
[00:53:15] Unknown:
And in your work in this space, what are the most interesting or unexpected or challenging lessons that you've learned? Oh, good lord.
[00:53:23] Unknown:
You know, I think it's a classic thing when you build any startup or project, which is that you're building a very generalized tool, and people wanna pull you in a bunch of different directions. They want every integration ever. They want it to plug into every infrastructure, integrate with every single system, and it's impossible for a small company to cover all the bases. So balancing the messaging and the technology such that the easy things are easy and the challenging things are possible is always a struggle. And you wanna empower people, but also set expectations properly about the amount of work they have to do. So I think that has always been a challenge. And then combine that with the pace of change in the ecosystem, you know, because everyone in the data ecosystem starts with point solutions, and then they shape their product to the realities that they're seeing on the ground, and maybe they expand to adjacencies. So it's this constantly shifting landscape of who's doing what and what their focus is. And as the orchestrator, you are kind of at the center of
[00:54:33] Unknown:
this storm. So it's a very challenging engineering domain to work in, and you're faced with a ton of trade-offs all the time that don't make everyone happy all the time. Yeah. And I was just realizing too that in all of our conversation about integrations, we have largely focused on integrations with other complete systems, but there are also the integrations with the underlying compute substrate that we completely failed to touch on, which is probably a whole other conversation on its own. But just to quickly tick a few off, there's Kubernetes, Spark, Dask, Docker containers, Celery. So
[00:55:05] Unknown:
Yeah. ECS too, actually. ECS, I feel, is kind of, people don't appreciate how huge ECS is. It is used all over the place. Everyone kind of assumes that the default is Kubernetes, but we find that AWS users default to ECS. Yep. And so,
[00:55:22] Unknown:
I think I know the answer to this, but for people who are just starting in this space, trying to comprehend what the options are out there and what are the ways that data orchestration fits into their stack, what are the cases where it's just the wrong choice and they're better off with just using cron?
[00:55:39] Unknown:
Yeah. I think if you use a single tool and you're gonna use a single tool for all of time, then bringing in an orchestrator is probably too much. Right? So if you literally just wanna replicate data using an ETL tool, like Fivetran or Airbyte, and then directly query that data using a BI tool, it's probably overkill. If you're a small analytics engineering team and you're only gonna use dbt and no other technologies, then a simpler orchestration solution works for you. So, you know, those are the cases where I'd really think about it. Or if you're using a fully integrated, like, drag-and-drop solution where you don't need to bring an external orchestrator. But broadly, I think if you plan on using more than one technology and actively developing over time, then investing in an orchestrator makes a lot of sense regardless of which one you choose. Absolutely.
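To ground the "just use cron" comparison, here is a hedged, hypothetical sketch of the two extremes: the cron version of running dbt daily is a single crontab line, while a minimal Dagster equivalent is a handful of lines. The job name, the schedule, and shelling out to dbt are illustrative assumptions rather than a recommended integration.

```python
# Hedged illustration of the tradeoff discussed above. The cron version of
# "run dbt daily" is one crontab line, e.g.:
#   0 6 * * * cd /srv/analytics && dbt build
# A minimal Dagster equivalent looks roughly like this (names hypothetical;
# shelling out to dbt stands in for a real dbt integration):
import subprocess

from dagster import Definitions, ScheduleDefinition, job, op


@op
def run_dbt_build():
    # Run dbt as a subprocess; a production setup would likely use a richer
    # dbt integration instead of shelling out.
    subprocess.run(["dbt", "build"], check=True)


@job
def daily_dbt_job():
    run_dbt_build()


daily_dbt_schedule = ScheduleDefinition(job=daily_dbt_job, cron_schedule="0 6 * * *")

defs = Definitions(jobs=[daily_dbt_job], schedules=[daily_dbt_schedule])
```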
[00:56:28] Unknown:
And so as you continue to work in this space, iterate on your product, keep tabs on what the other players in the market are doing, what are some of the things you have planned for the near to medium term or any useful thought exercises or evaluation
[00:56:45] Unknown:
criteria that you have for people who are just getting started in this space? Yeah. In terms of what we're thinking about next, in the same way that we think a data orchestrator should kinda have, you know, a cataloging capability in it, we also think that it's natural for the orchestrator to be involved in consumption and spend management as well as data quality. It's very natural for the orchestrator to have an opinion about consumption management, because it is the thing that kicks off the compute that ends up draining spend. So if you can debug your pipelines and understand, like, what's causing the question of why is my Snowflake bill $100,000 this month, we think the orchestrator can play a big part in both diagnosing that and preventing it in the future. And then, you know, if our mission is to build a single pane of glass for the data platform, integrating data quality capabilities into that single pane of glass makes a lot of sense. You know, you should be able to look at your asset graph and be like, oh, this passed all my data quality tests, there's a green dot, great. And I should be able to alert off that. So you'll see us go in that direction, and then really supercharging this kind of data platform stakeholder dynamic. We have a bunch of work to do there. So that's kind of what we're thinking in terms of our short to medium term roadmap.
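As one concrete, hypothetical example of what a "green dot" on the asset graph can mean in code, here is a hedged sketch of a data quality check attached to an asset. It assumes a Dagster version that supports asset checks; the asset name, column, and check logic are illustrative.

```python
# Hedged sketch of a data quality check attached to an asset; asset name,
# column, and check logic are hypothetical. Assumes a Dagster version with
# asset check support.
import pandas as pd
from dagster import AssetCheckResult, Definitions, asset, asset_check


@asset
def orders() -> pd.DataFrame:
    return pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.0, 40.0]})


@asset_check(asset=orders)
def order_ids_have_no_nulls(orders: pd.DataFrame) -> AssetCheckResult:
    # A passing check is the "green dot" on the asset; a failure can drive alerts.
    null_count = int(orders["order_id"].isna().sum())
    return AssetCheckResult(passed=null_count == 0, metadata={"null_order_ids": null_count})


defs = Definitions(assets=[orders], asset_checks=[order_ids_have_no_nulls])
```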
[00:58:03] Unknown:
Are there any other aspects of the overall space of data orchestration, the capabilities that it offers, ways that people should be thinking about it, or any pontifications about its future direction that we didn't discuss yet that you'd like to cover before we close out the show? No. I think we're well over an hour, and so I'm sure that everyone could use a break from the data orchestration talk, and I think we've covered a lot of ground. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the usual final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:58:41] Unknown:
Yeah. I mean, it's always tough to ask this of a vendor, because it's nearly impossible not to just, again, talk your own book. Well, the idea is that as a vendor, you've already filled the gap that you saw, so you must have a new one. The journey is never complete, Tobias. I guess, to me, the biggest gap, or maybe the most interesting question that will be resolved in the next few years, is how the industry is gonna solve this fragmentation problem. Like, I think everyone has vendor fatigue. Everyone understands that, like, you have to bring in too many point solutions to assemble a platform. And what is the shape of the solution to that going to be? Is it gonna be, like, everyone bets on one vendor with complete vertical integration, or is it gonna be something that makes it easier to horizontally integrate a bunch of vendors? For both the sake of running the company and the way I like software to be configured in the world, I hope it's the horizontal version of that. You know? And we feel like we're building our own platform internally. It's just like you have to deal with a lot of moving pieces all the time. And I'm excited to hopefully participate in a solution where people feel like they have choice, but they can make choices and have the ease of use of an integrated solution. But I think that's one of the fundamental questions that's gonna play out over the next few years. Absolutely. Well, thank you again as usual for taking the time today to join me, share your thoughts and perspective on the ecosystem of data orchestration
[01:00:11] Unknown:
and making a best effort to have an unopinionated take on it. So I appreciate the time and energy that you and your team are putting into making orchestration an easier and more interesting
[01:00:22] Unknown:
problem area to work in, and I hope you enjoy the rest of your day. Yeah. And for the audience, by the way, I don't claim I was unbiased during the whole thing. But during that section where you asked me to give the state of orchestration, I did try to have an objective viewpoint. Thanks a lot, Tobias. It was great to be on the show.
[01:00:45] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Guest Introduction: Nick Schrock
Nick's Journey into Data Engineering
Understanding Data Orchestration
Misconceptions About Data Orchestration
Stakeholder Interfaces in Data Platforms
Evolution of Data Orchestrators
Adoption and Integration Challenges
Low Code/No Code in Data Orchestration
Data Orchestration for Machine Learning
Lineage and Metadata Management
Innovative Uses of Data Orchestration
Lessons Learned in Data Orchestration
Future Directions and Challenges
Closing Remarks and Contact Information