Summary
The current stage of evolution in the data management ecosystem has resulted in domain and use case specific orchestration capabilities being incorporated into various tools. This complicates the work involved in making end-to-end workflows visible and integrated. Dagster has invested in bringing insights about external tools’ dependency graphs into one place through its "software defined assets" functionality. In this episode Nick Schrock discusses the importance of orchestration and a central location for managing data systems, the road to Dagster’s 1.0 release, and the new features coming with Dagster Cloud’s general availability.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Nick Schrock about software defined assets and improving the developer experience for data orchestration with Dagster
Interview
- Introduction
- How did you get involved in the area of data management?
- What are the notable updates in Dagster since the last time we spoke? (November, 2021)
- One of the core concepts that you introduced and then stabilized in recent releases is the "software defined asset" (SDA). How have your users reacted to this capability?
- What are the notable outcomes in development and product practices that you have seen as a result?
- What are the changes to the interfaces and internals of Dagster that were necessary to support SDA?
- How did the API design shift from the initial implementation once the community started providing feedback?
- You’re releasing the stable 1.0 version of Dagster as part of something called "Dagster Day" on August 9th. What do you have planned for that event and what does the release mean for users who have been refraining from using the framework until now?
- Along with your 1.0 commitment to a stable interface in the framework you are also opening your cloud platform for general availability. What are the major lessons that you and your team learned in the beta period?
- What new capabilities are coming with the GA release?
- A core thesis in your work on Dagster is that developer tooling for data professionals has been lacking. What are your thoughts on the overall progress that has been made as an industry?
- What are the sharp edges that still need to be addressed?
- A core facet of product-focused software development over the past decade+ is CI/CD and the use of pre-production environments for testing changes, which is still a challenging aspect of data-focused engineering. How are you thinking about those capabilities for orchestration workflows in the Dagster context?
- What are the missing pieces in the broader ecosystem that make this a challenge even with support from tools and frameworks?
- How has the situation improved in the recent past and looking toward the near future?
- What role does the SDA approach have in pushing on these capabilities?
- What are the most interesting, innovative, or unexpected ways that you have seen Dagster used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on bringing Dagster to 1.0 and cloud to GA?
- When is Dagster/Dagster Cloud the wrong choice?
- What do you have planned for the future of Dagster and Elementl?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Dagster Day
- Dagster
- Elementl
- GraphQL
- Unbundling Airflow
- Feast
- Spark SQL
- Dagster Cloud Branch Deployments
- Dagster custom I/O manager
- LakeFS
- Iceberg
- Project Nessie
- Prefect
- Astronomer
- Temporal
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Your host is Tobias Macey. And today, I'm welcoming back Nick Schrock to talk about software defined assets and his work at Elementl on Dagster to improve the developer experience for data orchestration. So, Nick, can you introduce yourself for folks who haven't heard of you before?
[00:01:06] Unknown:
Yeah. Thanks for having me, Tobias, and thanks for that intro. As you said, my name is Nick Schrock. I'm the CEO and founder of Elementl, which is the company behind Dagster. Before Elementl, my career was mostly spent at Facebook from 2009 to 2017. And there, I founded this team called product infrastructure, which was to make our application developers more efficient and productive. And kind of the most notable piece of work that came out of that, that I was personally involved in, was GraphQL, and I'm one of the co-creators of that. And then I moved on from Facebook in 2017 and got into this entire domain and started working on data orchestration.
[00:01:46] Unknown:
And in terms of the overall Dagster project, for folks who wanna understand more about what it is and some of the history, I'll point them to some of the other interviews that we've done. But the last time we spoke was last November, and I'm wondering if you can just give a quick update on what's new in Dagster and its reaction to the evolution of the data space since the last time we spoke.
[00:02:10] Unknown:
Yeah. So since November, we've been making a ton of progress. So in December, the month after that interview, we launched our cloud product to early access, and that's been going better than I anticipated, actually. And we have tons of users across all sorts of companies all the way from startups to Fortune 500 companies. And as we'll talk about later in the episode, we're actually, you know, gonna be launching that to general availability to the open public next month on August 9th. So that's super exciting. In terms of the open source project, we had talked about it on that podcast, but we've really been doubling down on this software defined assets direction.
And in November, it was a super early prototype. Not super early. It was a fairly early prototype being used by a few early design partners. And now, you know, adoption and development has accelerated massively. And we've launched it with a stable API. And then we're actually
[00:03:09] Unknown:
also launching Dagster 1.0 next month at Dagster Day along with our Cloud GA launch. So it's been a huge 6 months for us. And so in terms of the overall software defined asset concept and some of the ways that it manifests, I'm wondering if you can talk to a bit of what the motivation is and what the meaningful impact is for data teams who are working with software defined assets as their primary abstraction rather than the core abstraction of this task based DAG that folks have become familiar with since the early days of data orchestration and especially exemplified by the Airflow community?
[00:03:49] Unknown:
Yeah. So in terms of the motivation, what we saw, and this was last summer when we really started to dig into this, is that a task based orchestration system was increasingly out of step with the way that data teams wanted to approach their data platforms overall and also the way that all the constituent tools in modern data, I'll avoid the term modern data stack, which were very kind of asset oriented or model oriented. So if you go to Airbyte, they'll talk about sources. Right? And they, like, model a distinct set of sources in their system, or Fivetran similarly. If you go to dbt, they'll be modeling the system in terms of dbt models and not tasks.
And what was interesting, and this kinda came out in this, you know, discussion around bundling and unbundling platforms, which happened in the spring. But, effectively, tons of the information that used to be in the task based orchestration layer was pushed down into these constituent tools. Whereas, for example, prior to, say, dbt, each table would have had a corresponding task in Airflow, say. But now you're just invoking dbt Cloud or dbt run. That entire DAG is represented by a single task in the orchestration layer. And, therefore, the orchestration layer is not nearly as useful in terms of providing a single place where you can have visibility into your entire data platform.
So we saw that happening where task based orchestration no longer mapped to the constituent tools as much. And then second of all, just in terms of even absent those external vendors, the platforms are becoming more and more complicated, and people wanted a more declarative approach where they could manage more complexity in their head. And the kind of more declarative approach of software defined assets really resonated with us and our early users. And that was kind of the motivation that happened there.
[00:05:52] Unknown:
And in terms of that early experience of releasing this capability, digging into that particular approach, and some of the ways that that maps to different teams' mental models. I'm wondering what were some of the notable outcomes as far as the development approach that teams had to building their data systems and data products and how that factored into their overall practices of how they thought about what the output of their work actually was.
[00:06:21] Unknown:
Yeah. That's a great question. So in terms of outcomes here, I think, first of all, the practitioners who are working in Python and orchestrating all these systems and building data assets in Python, they just became more productive using the system. They have to keep less stuff in their head. They no longer have to write one centralized DAG artifact or flow artifact manually. So if you come from Airflow, there's typically this DAG construction file, which ends up being a sort of centralized, unowned dumping ground of code that gets more complex as your data platform gets more complex. And with software defined assets, you no longer have that centralized artifact. The complexity is distributed across the entire system in a good way, meaning that that centralized DAG is constructed for you.
So it just makes it so that coding is faster, collaboration is easier, and it's an overall win. I think more profoundly, you know, what we found, and this is also based on some early feedback too, is that this allowed users who want to, say, develop their entire data platform across a bunch of different independent GitHub repos by independent teams. They can each individually code in their own world. Once deployed, they can actually interconnect all the assets that are built in those different constituent GitHub repos. So it's really been a way to unify the different practitioners who might be working quite independently, but are still building up to one cohesive data platform.
And then that's kinda connected to the last bit, which is now they finally have a place to go to where they can understand how their data products interrelate. So they no longer have to go into different constituent tools and kind of, like, have it all in their heads. There's just, like, one spot where everything is defined. It's been pretty interesting to see how that's developed. Like, one anecdote I really like is that we had an early cloud customer who started to adopt software defined assets, and they kinda started with our out of the box integrations. So they use the ingest tools, they use some custom Python scripts, and they integrated that into software defined assets, and then they also imported their dbt graph. And they saw this interconnected web of assets. But then they also use a feature store called Feast, and it kind of was in its own world because they hadn't imported that into software defined assets, and we didn't have an out of the box integration for that.
So they actually built their own integration to bring Feast into software defined assets land. And I thought that was interesting. One, it was just, like, the value of it compelled them to write that integration. And second of all, it actually interconnected the data engineering and ML sides of the house into one cohesive lineage graph. And I think that's a lot of the promise here is that you no longer have to desilo your organization. Right? You can kind of conform everyone to this common standard and have your entire org, regardless of persona, kind of interconnected in this single fabric of assets.
[00:09:26] Unknown:
And to that point of building their own integration to be able to represent what was happening in Feast as a software defined asset, can you talk a bit more about what is actually involved in building that type of an integration and what it means to actually represent some of these external systems through this asset oriented abstraction?
[00:09:46] Unknown:
I mean, in the end, if you look at the Python code, that's a software defined asset. It's just a function. So it's not this huge lift to do an integration. It's merely annotating your existing code, because all these systems generally have Python integrations. Right? So in the end, you're writing some Python code, which invokes the tool that you wanna invoke. The magic is sort of how we structure the software and structure the metadata. So it just means that you've written an integration. It ends up being a Python function, which invokes something else, but then you can plug into our system to interconnect it to the other assets, and that's kind of the critical piece here.
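For readers following along, here is a minimal sketch of the pattern Nick describes, assuming Dagster's @asset decorator from the stable 1.0 API; the data and the external call are placeholders for illustration, not a real Feast integration.

```python
from dagster import asset


@asset
def raw_orders():
    # Stand-in for a call to an external system's Python client
    # (an ingest tool, Feast, a warehouse query, and so on).
    return [{"id": 1, "status": "shipped"}, {"id": 2, "status": "cancelled"}]


@asset
def cleaned_orders(raw_orders):
    # Naming an upstream asset as a function argument is what wires the
    # dependency edge into the asset graph that Dagster builds for you.
    return [order for order in raw_orders if order["status"] != "cancelled"]
```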
[00:10:26] Unknown:
And as you went through that early phase of proving out the idea, testing it out with your early adopters, and then iterating on that with some of the subsequent releases. I know that you have stabilized the API, and I'm wondering what were some of the shifts in the interfaces and the abstractions and paradigms you wanted to expose for people to be able to work in this mode of the software defined asset, and what were some of the modifications that were necessary internally to Dagster and the framework to be able to support that as a first class capability?
[00:11:01] Unknown:
So I'll start with the first one, which is what are the changes we had to make to core Dagster in order to do this? And the answer is that the core was relatively stable. Right? So software defined assets is just a layer on top of our existing core. Right? So at its core, a software defined asset refers to an op, which is our core unit of computation, kind of the foundation of our task based orchestration layer. If you write an asset, it builds the centralized DAG artifact, the job, for you, and it also manages interactions at a fine-grained level with our asset catalog, which also predated software defined assets. So it's actually this relatively thin layer of software on top of the existing capabilities in our core system. And that's important as well because our task based orchestrator is not going away, that core orchestration engine. Some tasks, not to double use the word, some activities in data platforms are still highly amenable to kind of more imperative programming models, and those are critical to support, and we will always support that.
And in fact, the software defined asset layer needs that layer in order to execute well. You can kind of, like, think of it, it's not a perfect analogy, but in that way: you know, in Spark, for example, computations can be encoded in SQL. And maybe in the, like, long term, 80% of the computations will be in Spark SQL, but there's still gonna be this critical 20% that are encoded in data frames. It's more imperative, but you need to do it in order to accomplish a lot of the activities you wanna do. We kind of view software defined assets and our task based system in a similar sort of way, where we anticipate over time more and more of the activity in the system will be encoded in terms of the software defined assets API, but there's still gonna be absolutely critical activities that you need to use this lower level for. So that all being said, software defined assets is a layer on top of our op layer, and that will forever and always be true. So most of the work was actually more about kind of conceptual work and getting that layer absolutely right, and then building tooling on top of it, which was by far where the lion's share of the effort was.
So that's one thing in terms of your question about changes to the interfaces and internals of the system. Your other question was, how did we respond to feedback and how did the API shift? And I think that focused on maybe three different things. One was a lot of the users who understood this very intuitively and wanted it were our modern data stack users that were heavy dbt users and wanted much more advanced orchestration capabilities on top of dbt. So tons of features there. In particular, I thought one was interesting, which speaks to what I was talking about before, was, you know, we had this ability initially where you can, like, write in the same GitHub repo, be able to load dbt projects and orchestrate it right there. But what tons of teams wanted actually was the ability to have their analytics engineering team actually have a more independent GitHub repo and their own development cycle, and then have the orchestrator actually consume the dbt manifest file, for example. And that required some changes on our end, but it spoke to the fact that people are really thinking about the interfaces between teams in these systems and then how to make it so that, you know, yes, they can deploy and operate independently, but once deployed to prod, the orchestrator kind of stitches the entire thing together.
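As a rough illustration of the manifest-driven pattern described above, a sketch assuming the dagster-dbt integration's manifest loader (exact function names and file paths vary by version): the orchestration project consumes the compiled manifest.json artifact produced by the analytics team's repo rather than the dbt source tree itself.

```python
import json

from dagster_dbt import load_assets_from_dbt_manifest

# manifest.json is produced in the analytics team's own repo (for example by
# `dbt compile` in their CI) and published as a build artifact; the
# orchestration project only consumes it.
with open("target/manifest.json") as manifest_file:
    dbt_assets = load_assets_from_dbt_manifest(json.load(manifest_file))

# dbt_assets can then be combined with non-dbt assets in the same repository
# definition so the full lineage graph is stitched together once deployed.
```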
So we got a lot of feedback around how our dbt integration worked and how that kind of trended, and then I also viewed it as feedback about how people wanna organize their teams, and that was super interesting. On a more tactical level, you know, we were originally gonna deemphasize our config system a little bit in the software defined assets layer, which is effectively the ability to parameterize computations without changing code. But super early on, people were like, wait. I want this. I miss this capability in the old system. So, like, we actually had to front-load that quite a bit. And then lastly, you know, we knew there was gonna be demand, but, you know, immediately, people wanted to focus on interoperability.
So they have their existing task graphs. Some stuff is appropriate to model as task graphs. How do you wanna interleave those layers? So I think those were kind of the early themes and feedback which we responded to. But the core conceptual underpinnings of the project have remained quite constant from the beginning.
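To make the config point above concrete, a minimal sketch of parameterizing an op through run config rather than code changes; the op name and config fields here are illustrative, not from the episode.

```python
from dagster import job, op


@op(config_schema={"target_date": str, "limit": int})
def export_snapshot(context):
    # Values arrive via run config at launch time, so the same code can be
    # pointed at a different date or row limit without being edited.
    context.log.info(
        f"exporting {context.op_config['limit']} rows for {context.op_config['target_date']}"
    )


@job
def snapshot_job():
    export_snapshot()


if __name__ == "__main__":
    snapshot_job.execute_in_process(
        run_config={
            "ops": {"export_snapshot": {"config": {"target_date": "2022-08-09", "limit": 100}}}
        }
    )
```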
[00:15:53] Unknown:
As you have gone through this process of working through the software defined assets capability, updating the interfaces and APIs and some of the tooling and just general messaging around that functionality. In parallel, you've also been going through a journey of the early beta testing and early adoption of your cloud product, and I'm curious how that journey has paralleled the capabilities of software defined assets and some of the ways that those two areas of effort have played off of each other in terms of your own internal engineering work? Yeah. That's a good question.
[00:16:29] Unknown:
I mean, I think the easiest answer and why I'm so excited about the cloud product happening is that the ability to have a hosted service allows you to deploy new capabilities and bug fixes and new system infrastructure to your users in a completely permissionless way, and that allows you to iterate with your users far more quickly. You know, in a pure open source model, let's say you fix a bug or have a new feature in your UI, you push it up to PyPI, and then you're kinda like waiting for people to upgrade, and then you're subject to their own internal infrastructure release processes and whatnot.
And with cloud, we can push whenever we want, get new capabilities into users hands extremely quickly and have a much tighter feedback loop with them. And that's been able to accelerate our product development dramatically, and I'm super excited to be able to do that at scale.
[00:17:30] Unknown:
So you've been going through this process of adding the software defined assets capability, cementing that API. Just prior to that, you went through the work of updating the core abstractions for Dagster from these nomenclatures of solids and pipelines and
[00:17:47] Unknown:
Tobias, you're gonna bring those up?
[00:17:49] Unknown:
And that whole work and then, you know, streamlining that into this jobs/ops/graphs paradigm. So you've gone through a number of iterations. There's been a whole host of work. You've seen a lot of adoption. And so that leads us to what you mentioned earlier. On August 9th, you're having this Dagster Day event where you're going to be committing to a 1.0 stable version of Dagster. You're going to be releasing the cloud product as GA. I'm wondering what are the major lessons that you and your team have learned in the process of going through the initial development, the initial adoption cycles, figuring out what are the concepts that actually make sense in this space, exploring how you can keep up with the rapid pace of evolution in the data ecosystem writ large while staying true to the core ideas and abstractions that you're putting into Dagster and just some of the commitments and what it is about where you are in your journey right now that makes you feel confident in committing to that stable version identifier.
[00:18:51] Unknown:
One of the big takeaways here is don't be too cute with your naming. The name Solid was my fault. It was actually originally gonna be the name of the project itself, and I kinda like the name. And then I, anyway, we don't need to belabor that one. So don't be too cute. Name things in the most obvious way possible. You know, it was a critical thing for us to switch to a more natural nomenclature and fix some very core API problems with that original pipeline/solid API. There were other things called modes and presets and all sorts of stuff. And we wanted to simplify and consolidate that into a far more elegant API. And that was a huge step forward for the system.
And then assets were always gonna be a layer on top of that. So really the foundation of the 1.0 release and the big breaking change we needed to make was switching to this op/job/graph model. And that is the stable core foundation that we're building everything on top of. The feedback on those APIs has been extremely positive. All of our major users have migrated everything. We're super confident in those APIs, that core stable foundation. And it was really that that laid the foundation for this 1.0 release. And that doesn't mean we're gonna be stopping development by any sort of means, but any further changes are gonna be additive.
And we will, like, commit to supporting these APIs for the long term.
[00:20:14] Unknown:
I guess, can you remind me of the other subcomponents of the question? There was a bunch in there. The main driving point of the question was why you're confident with committing to this 1.0 version specifier, but also just some of the overall sort of lessons that you've learned as far as how to keep up with the rapid pace of change in the data ecosystem and what are the core ideas and lessons that you've been able to lean on that have stayed true regardless of the external turbulence that is a constant in this space.
[00:20:47] Unknown:
The word that you used, I think, is telling, which is turbulence. So I think there's two things going on here. There's, like, what are the fundamental things we are trying to accomplish in our jobs? And regardless of all the different noise in the ecosystem, that doesn't change that frequently. At the end of the day, we're writing code to produce assets and whatnot. There is tons of activity in the tooling ecosystem to optimize different parts of that workflow and make certain parts easier and improve ergonomics and whatnot. So, you know, the 1.0 signifier is mostly around our core abstractions.
You know? And then there's the entire integration layer and all the different libraries that both we and our community maintain. And that's kind of where you, you know, are dealing with the turbulence, so to speak. And as new tools come and go, the core notions in our framework, like, will remain the same.
[00:21:43] Unknown:
Along with the GA release of cloud, what are some of the new capabilities that you're planning on releasing as part of that?
[00:21:52] Unknown:
So there's a ton of new stuff that we're gonna be rolling out. But, you know, one thing I'm incredibly excited about in our GA release that's coming out. So first of all, for Dagster 1.0, there are no huge changes coming out with Dagster 1.0. Right? The whole point is that there's a stable set of features that we are marking as mature, and you can be confident that things aren't gonna change in that stable core going forward. The Cloud GA release is a bit different. And first of all, what is the cloud product? And it's a managed orchestration service so that you can write your code and we host your control plane.
So you can deploy Dagster effortlessly in an enterprise grade way, and it comes with features like, you know, role-based access control, authentication, and, you know, you can just focus on your business logic. A new capability that's coming out that I'm incredibly excited about, which I think solves the sort of, like, unsolved problem in data engineering and data management is what we call branch deployments. So, you know, there's this current problem in, I think, all the orchestrators, which is you can spend a bunch of effort structuring your code so that it's locally testable.
And so you have a fast development life cycle. That's only in certain of the frameworks. Like, Airflow doesn't really support local development at all. And then what teams are left to do in those cases... well, let me step back. So let's imagine you do do a bunch of work and you're, like, mocking out your data warehouses and doing all this work to try and make it so you have a nice local development flow on your laptop. There's a problem there, which is, one, there's a ton of stuff that still isn't tested if you aren't interacting with the services you're actually gonna interact with. And then if you try to interact with those services from your laptop, a lot of the time your, you know, security policies prevent you from interacting with that. And then what people are left to do is still to do a bunch of testing in a staging environment. But those centralized staging environments are also a pain in the butt to manage.
And often, if you have multiple developers deploying code to that staging environment, they're overriding each other's changes, and it just gets complicated and whatnot. So you either have these, like, clumsy to use, clumsy to manage staging environments, or what people all too often do is just end up testing in production. They just push their code, and then they manually trigger a job or a DAG and just see if it seems to work. And it's this incredibly low, low productivity life cycle. What branch deployments allow you to do, when using cloud, is that for every PR that you make, it actually creates a lightweight staging environment that's specific for that PR.
So you write your code, you branch, you push, then our infrastructure takes over, deploys your container, and then actually spins up a lightweight environment where you can either automatically launch a job on every push or manually do it in our UI. And it provides a sort of, like, very lightweight staging environment that serves almost as like an IDE because you can, like, click runs and see if they complete or not. It's a complete game changer in terms of the development life cycle, and it just fits so naturally into development workflow. And it's just an incredibly powerful capability.
So I'm incredibly excited about that.
[00:25:24] Unknown:
It's time to make sense of today's data tooling ecosystem. Go to dataengineeringpodcast.com/rudder to get a guide that will help you build a practical data stack for every phase of your company's journey to data maturity. The guide includes architectures and tactical advice to help you progress through 4 stages: starter, growth, machine learning, and real time. Go to dataengineeringpodcast.com/rudder today. And to your point of the staging experience being able to make data workflows and data orchestration something that is tenable for having a fast feedback loop for the development workflow.
It's definitely great having that capability built into the orchestrator, but then there's also the question about what about all the pieces that that orchestrator is connecting to? How do you manage those branches and preventing accidentally polluting the actual production data or the production systems and just the overall space of developer tooling, CI/CD, just making sure that you have those fast feedback cycles and are able to validate your logic and your end to end workflow in the data space is still very painful. And I'm wondering what you see as some of the other sharp edges that folks are going to have to deal with now that they do have this capability of being able to easily branch their orchestration layer, just how to think about what the broader impact is on the rest of their systems with that capability? It's a great question, Tobias,
[00:26:58] Unknown:
and, you know, we are able to use the orchestrator itself to manage all these different test environments. So let me give you a simple example. Snowflake actually has good support for this. They have a feature where you can clone a schema, which effectively makes a lightweight copy of your data in a copy-on-write way, but it does it safely. So if you do any sort of mutation, it does not affect the source production data. This lends itself very nicely to this branch deployment model because effectively, you make a clone of the schema for every branched environment.
You can't actually pollute the production data, but you have access to it in order to test your flows. And then if you actually do any writes in that data warehouse, it still is safe. Right? What's awesome about branch deployments is that you can set it up so that you can automatically run a job whenever you push up new code. So the way it works is that you actually have a job in Dagster that clones the Snowflake schema. And every time you push up a new PR, or a new change within that PR, it re-clones the schema, so you kinda start from fresh. So by marrying the ability to have underlying infrastructure kinda support these things and have the orchestrator be able to orchestrate the management of those test environments, it kind of all comes together in one nice package.
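A hedged sketch of what a clone-per-branch job could look like, assuming the snowflake-connector-python package; the environment variables, schema names, and the way the branch name is discovered are assumptions for illustration, not Dagster Cloud specifics.

```python
import os

import snowflake.connector  # assumes snowflake-connector-python is installed
from dagster import job, op


@op
def clone_schema_for_branch(context):
    # Hypothetical wiring: in a branch deployment the branch name would come
    # from the deployment's environment rather than a hard-coded default.
    branch = os.environ.get("BRANCH_NAME", "feature-new-model")
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
    )
    try:
        # CLONE creates a zero-copy, copy-on-write schema, so anything written
        # while testing the branch never touches the production tables.
        conn.cursor().execute(
            f'CREATE OR REPLACE SCHEMA "ANALYTICS_{branch.upper()}" CLONE ANALYTICS'
        )
    finally:
        conn.close()


@job
def setup_branch_environment():
    clone_schema_for_branch()
```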
Now the different vendors and tools support those type of things to different degrees. Right? So Snowflake has kind of the nice version. If you have, like, a full unmanaged data lake, you might actually have to kick off a job that copies a bunch of data. Right? Or write what we call a custom I/O manager that would know how to read from kind of the source tables and then write out to copies elsewhere. So depending on the tool support and whatnot, it depends on how much work you have to do. But the entire idea is that by having this capability built into the orchestration layer, the orchestration layer itself can manage all these test environments.
And that's really where this magic comes together.
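And a toy sketch of the custom I/O manager idea mentioned above: reads come from production storage while writes land in a per-branch location. The storage layout and branch wiring are assumptions; only the IOManager structure with handle_output and load_input comes from Dagster's API.

```python
from dagster import IOManager, io_manager


class BranchAwareIOManager(IOManager):
    """Reads from production locations, writes to per-branch locations."""

    def __init__(self, branch):
        self.branch = branch

    def handle_output(self, context, obj):
        # Hypothetical layout: a real implementation would write to warehouse
        # tables or object-store paths derived from the output's asset key.
        path = f"/data/branches/{self.branch}/{context.step_key}.json"
        context.log.info(f"would write {obj!r} to {path}")

    def load_input(self, context):
        # Inputs are always read from the production copy of the data.
        prod_path = f"/data/prod/{context.upstream_output.step_key}.json"
        context.log.info(f"would read from {prod_path}")
        return None


@io_manager(config_schema={"branch": str})
def branch_aware_io_manager(init_context):
    return BranchAwareIOManager(init_context.resource_config["branch"])
```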
[00:29:01] Unknown:
For the data lake use case, also, there are things like LakeFS that allow you to have a sort of branch and merge style workflow for your data in S3. I know Iceberg and Project Nessie are looking to be able to do that in the space of things like Hive tables or tables in your, you know, Amazon Athena lake or things like that. And now that there is more movement in some of these different vendors and in what you're providing with the branch deployments, and it is a conversation that is being had. I'm curious what you see as some of the potential next steps that we, as a community, can and should take to help keep that ball moving or keep that flywheel spinning and continue on improvements in being able to actually manage these complex and heavyweight workflows while being able to safely iterate quickly on updating logic, modifying schemas, figuring out how these data flows actually impact the business, and just the end to end workflow of being able to manage these high value, high risk, you know, heavyweight tasks?
[00:30:09] Unknown:
This is one of the reasons why I'm so excited about an orchestrator finally supporting this type of thing is that I think it will provide a great incentive structure so that people are forced to make progress more quickly. You know? Because what will happen will be something like this, where people orchestrate two different technologies, and one of them, like Snowflake or LakeFS, has a super easy, very straightforward way of doing this. And then they're trying to orchestrate it with a tool that doesn't. And then they'll immediately be like, wait, they have this workflow ready to go, and it's just one tool that's preventing it. And then those users will exert pressure on their vendors and tools in order to support that better.
So, yeah, I'm just super excited about having a place where there's an end to end workflow that kind of aligns with this trend that you're seeing so that, you know, it makes it clear what tools have fallen short here, so to speak. And I think it will provide a nice incentive structure for everyone to get their ducks in a row. Because right now, if you're trying to orchestrate all these different capabilities, right, LakeFS and, you know, Snowflake's clone schema or copying whatever, and doing all their different bespoke branch and merge workflows, it's so much work to construct that end to end that no one does it. And therefore, people don't even know what they're missing yet. So to me, the way to really make progress on these type of issues is not to, like, gather a conference together and, like, convince every single vendor and, like, nag them to do this and this and this. It's just, like, with actual tooling, with actual users and actual incentives, just move everything much more quickly.
[00:31:51] Unknown:
I agree that having that capability of managing the full kind of test environment in the core of your data platform is definitely going to be a massive leap forward in terms of people's experience and even thinking about that as being a possibility because as you said, otherwise, it's this very heavyweight exercise that is full of toil of just saying, okay. How do I automate each of these pieces individually and be able to manage all the infrastructure and manage all of the logic and hopefully not shoot myself in both feet by accident in the process.
Yeah. Totally. And going back to this release of the 1.0 version of Dagster, the GA release of the cloud product with these new capabilities. As you said, you've got this Dagster Day event that you're planning to make all of these announcements. I'm wondering what are some of the other things that you have planned as part of that event and what you're hoping
[00:32:49] Unknown:
the participants will take away from it. The event is, you know, short but sweet. I think we're gonna clock in at, you know, 45 minutes-ish or so. And so what we really want it to be is, like, a very information dense, you know, high signal-to-noise event where you can come understand the entire value proposition of the platform end to end all the way from the framework to the cloud product, and then have a forum where you can connect to other community members and ask questions. And in general, be kind of the first to know what's going on with all this stuff. And, you know, it'll be available afterwards, of course, to view. But we're really excited to say, you know, it really is this coming of age moment for the platform where, you know, we're powering with cloud and open source, like, some of the most sophisticated data and ML platforms in the world. People have bet their entire mission critical workflows and their entire data platforms on this technology.
And this is more kind of a coming of age party that communicates that underlying reality accurately and then opens up this capability to the entire world. So, you know, it's a big day for
[00:34:02] Unknown:
us. One of the interesting aspects of marketing something as stable with that 1.0 version is that it opens the doors for a lot of folks to even consider using it in the first place because there are organizations that have restrictions that say, we will never actually use something until it hits version 1.0, or there are people who aren't comfortable being the guinea pigs or testing things out with the possibility of breaking changes because they don't want to put in the time to actually stay up to date with things as they evolve. I'm curious what you anticipate as the overall impact on the size and composition of the community now that you are hitting the stable milestone and some of the things that you as an organization and managers of the community are doing to prepare for these new categories of users that are likely to start kicking the tires on Dagster?
[00:34:59] Unknown:
Yeah. Like you said, there's a whole set of users out there who it's extremely important for them to have the company and the project signal to them that they're not gonna be broken and it's a stable foundation to be built on. And the reality is we've been operating like a 1.0 project for a long time now, both in terms of the reliability that we provide to our users as well as effectively the job. The core task based layer, for example, has been stable for a long, long time and we don't break anyone there. So on an operational basis, day to day, not that much is gonna change for us, actually. We've been, like, operating like this for a while. You know, we have a renewed focus on our documentation.
You know, we have a full time technical writing staff now. They're doing fantastic work to improve the docs. So we're preparing to scale our product education and community management alongside this.
[00:36:02] Unknown:
Now that you do have this stable point, this marker of a next stage in your life cycle, what are some of the new and upcoming capabilities that you're starting to look to with this solid platform to be able to build from?
[00:36:19] Unknown:
We are gonna continue to invest, you know, in additional integrations and additional tooling on top of the software defined assets, you know, stuff in particular. And, you know, I think you can think of the system as layered. Right? So when you have a core stable foundation, then your activities generally end up being focused at the tooling built on top of that. And that's really where a bunch of our focus is gonna be as well as on integrations. I think there is just a ton of work in terms of integrating, both at the software defined assets level but also working with partners to make those tools work with branch deployments, for example.
And I don't think we can underestimate both how much work that is and how much value there is there. So I think you'll see us focus a ton on integrations work and value-added tooling over that stable core. In terms of other directions, you know, on the commercial side, you'll see us also working on supporting organizations as they scale across our platform, whether that means more advanced role-based access control (RBAC) capabilities, the ability to, you know, track lineage across your entire org and across deployments in the Dagster platform.
But, you know, we are super focused on getting 1.0 and the GA product out the door, and, you know, we'll be publishing more of a roadmap after that. In one of our earlier episodes, we had a good conversation about your opinions and framework for deciding
[00:37:59] Unknown:
what capabilities belong in the open source framework capabilities of Dagster and which features belong in the commercial cloud offering. And I'm wondering how this stable marker of the platform and of the cloud system and its release into general availability is going to factor into your ongoing commitment to or modification of that philosophy and just some of the ways that you're continuing to manage that tension of which capabilities need to be paid for the sustainability of the organization and which capabilities need to be part of the open source because they are necessary for the continued viability of Dagster as its own project.
[00:38:45] Unknown:
First of all, let's step back and talk about how we even approach this issue here. The goal of Dagster is to become as broadly adopted a standard as possible for structuring your data platforms and data assets. And then the goal of Dagster Cloud is to build, you know, an awesome managed service that is structured on top of that standard. And in terms of what goes into that standard, the open source, versus what goes into the proprietary platform, I kinda divide the world into three layers of complexity here: application complexity, operational, and enterprise.
So application complexity is about how developers work with the framework to structure their code in order to have, like, better API ergonomics, testability, all that stuff. And that's effectively the relationship between the engineer, developer, practitioner, and the code that runs in their process. Right? Like, literally, they are pip installing something, and they are running Python code that we have authored that calls into their code. So on a very practical level, that needs to be open source because you have code in the same stack trace, and there's no process boundary and all that stuff.
And, also, we want that framework to be composable and embeddable in all sorts of interesting contexts because we can't predict how everything's going to be used. And that's like the fun and the joy of open source. So that's one layer, the application complexity and the open source framework that will forever and always be open source. Then you get into the land of, like, tooling built on top of that stuff. Right? And I think that's where this gets more subtle and interesting. So we want teams to be able to self host open source Dagster and run production workloads on it. And we want them to be able to run that on different infrastructure like ECS or Kubernetes or just an EC2 node or whatnot.
We want people to be able to do that. We want them to be able to self host that, and we will always support that. But we also have Dagster Cloud, which is a managed service which hosts the computations on their behalf. And the way that I kind of divide the world here in terms of what should be proprietary and what should not be is actually a lot of it is in terms of execution efficiency of the organization. Meaning, it is much easier for us to develop advanced operational capabilities on a centralized hosted platform.
We can deploy it whenever we want. We can fix bugs immediately and so on and so forth. Let me give you a very specific example. Like a database migration, it is, like, so much easier for every single stakeholder involved if we run 1 centralized database migration instead of forcing our, you know, thousands of open source users to run their own database migration against their own infrastructure. Doesn't mean we're never gonna implement anything that requires a database migration. It just is an example of if we can centrally manage something, we can move faster and we don't have infinite engineering resources. And then we can fix bugs on the user's behalf and whatnot.
So I often think of this in terms of if there are operational capabilities where we can centrally host it and there are massive economies of scale for one organization to do it instead of distributing that work across thousands of organizations. We should really bias at minimum towards initial development being in the proprietary domain because we can move faster and provide better quality of service. I think the value exchange here, and this is something we're rolling out with GA, is that there should be a fair usage based pricing model where small players can participate as well as the largest enterprises.
So just to give another example of, like, where economies of scale don't apply, you know, we have a Kubernetes executor. It's not like we're gonna withhold a bug fix from that and only apply that to the proprietary stack. First of all, it's just the wrong thing to do and goes against our values. But in the rubric that I just laid out, there's no economies of scale to not fixing that more broadly. Right? It just is better for everyone. And then the last layer of complexity is enterprise complexity. So security, auditing, etcetera, these are capabilities which are complicated to implement that we wanna be able to push changes instantaneously and fix any issues there. And that enterprises who want those capabilities typically want a paid relationship anyway.
Like, they want to write a check because then there's someone on the hook when something goes wrong. So, again, that was kind of a long winded answer, but three layers: application complexity, that's really the relationship with the individual practitioner. Those capabilities will always be open source. Managing operational complexity, we wanna support production workloads, but, you know, for really complicated capabilities that take a lot of work to host and fix and all that. We wanna bias towards centrally hosting those. And then lastly, enterprise complexity, which is strictly in the province of the proprietary domain.
[00:44:01] Unknown:
Digging into that middle layer of the operational complexity where there is this decision to be made about whether it is proprietary or open source. You mentioned that at least the initial development for some of these capabilities belongs in the commercial offering, and I'm wondering what your decision structure or mechanisms are around being able to say, okay. We've built this internally, but this is actually something that is useful in the open source. And so being able to do that initial development, offering it as a paid capability for your cloud customers, and then eventually migrating that into the open source and just how that decision structure and how that actual management of the code and being able to make it something that can be migrated plays out in your mind.
[00:44:51] Unknown:
Yeah. We don't have a specific process in place yet. I guess what I'll say is that a lot of it just kind of happens because the proprietary service is really kind of a shell where work that's done on it naturally flows down into our core infrastructure. So that kind of just happens for free. And in terms of, like, an ongoing process where we develop things that are focused on the proprietary platform...
[00:45:28] Unknown:
In your experience of rolling out the software defined assets capability and onboarding some of your early customers to the cloud platform, what are some of the most interesting or innovative or unexpected ways that you've seen those capabilities used?
[00:45:41] Unknown:
A couple examples come to mind. A bunch of our cloud users are actually energy companies that you probably have heard of, and one of them has really integrated Dagster into their geologists' workflow as they, you know, are doing analysis, to the point where it's actually completely integrated with their own internal custom apps. So they have internal apps where they click a button and it actually kicks off Dagster Cloud runs. That metadata is then fed back into their own custom proprietary app. It's kind of like their custom operational tools and our tooling have almost, like, fused into one centralized system, which is super cool to see. You know, it's just like it demonstrates the value of, like, using open source standards, like GraphQL to model your web app because then someone can literally effectively build a lot of those capabilities in different contexts and integrate it in interesting ways.
So that's been super interesting to see. Unexpected, this is one of my favorites. One of our customers is KIPP, the Knowledge Is Power Program, which is a charter school program. And a couple of their geographies (they have a fairly fragmented organization) are actually using Dagster to, in effect, build report cards and progress reports for their students, which sounds wild, but it makes sense if you think about it. They actually use a bunch of different SaaS tools to track student activity, to record their work, to, you know, do attendance, all sorts of stuff. And then they actually built a data platform in order to assemble those tools and integrate the data and then produce a cohesive report that can be viewed by parents and students.
So, you know, I love it because I can explain this to, like, my mother. It's like, okay. Imagine you have this report card. That is, in effect, a data asset in the system, which is composed of other data assets, and it's been integrated and computed upon. And in the end, you know, there are data pipelines which produce these data assets for parents and teachers, and it's a very, like, graspable example that people can understand. And something I did not anticipate is that Dagster would be used as an orchestration platform for student report cards, but here we are. It's all data. It's all information that has to go somewhere. That's right.
[00:48:12] Unknown:
In your experience of bringing Dagster and Elementl up to this point of having a stable, solid foundation to build from and marking it with this 1.0 release and the general availability of the cloud product and all of the new capabilities and features that you have planned subsequent to that. What are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:48:40] Unknown:
It does speak to the 1.0 release, and it's something I understood intellectually before, but now really felt in my bones. It's just how important it is to communicate clear expectations around stability and future changes and make sure to do enough upfront design in order to do that effectively. I think early in the project we were a little fast and loose about kind of, we want to be able to move quickly and whatnot, but it's very different. You know, you can change your UIs quickly, and that's fine because you're not breaking people's code, but you just have to have a very different standard, you know, in terms of communicating that this is a stable foundation to build on, and that you have to do more upfront deliberate design, especially in these cases. Yeah. So I think that's been a lesson, and we've had to learn that. And, you know, we've learned from that and have changed our processes. And now we're signaling this new phase.
And the other thing, and I spoke to it earlier, is again, I knew it intellectually, but it just in the last 6 months, it has just become super, super real is that, you know, we're excited about cloud. Like, obviously, you know, we're excited to build a sustainable business model and make forward progress on that front so we become a sustainable organization. But cloud has also just increased our pace of product velocity because we have a tighter relationship with those customers. We can see the effects of our changes immediately. We can push them changes. They can provide feedback.
In a day, we can fix a bug or add a feature, and it's just like the super exciting new phase of development. So those are kind of the two things that come to mind. So we've spent the
[00:50:27] Unknown:
past hour or so extolling all of the virtues of Dagster and the new capabilities and benefits that you're providing, but what are the cases where Dagster and/or Dagster Cloud are actually the wrong choice and somebody is better suited with a different orchestration system or just writing a bunch of bash scripts or just using the built in orchestrator for their point solution tool?
[00:50:50] Unknown:
Yeah. So to the last point, if you are only using that one tool and it has a built in orchestrator, you know, just use that one tool. I think that's, you know, fairly clear. And you only need, like, cron based scheduling, for example. But broadly, I do think that orchestration is a base capability that you should build in for your platform from day one. And obviously, it's, like, self-interested of me to say that, but I do believe it's true. It's kind of like telling someone if they're, like, writing Python, or this isn't the best example, but, you know, let's say, like, someone's a Python programmer. It's like, well, only use classes if your program is gonna get big. You know? It's like, no. You should, like, structure your programming from day one, because, you know, even small things become big later. So you might as well kinda make good engineering decisions all the way from front to back.
In terms of within the orchestration domain, when Dagster and Dagster Cloud is the wrong choice, I think you're actually seeing a pretty interesting bifurcation in the orchestration domain where, you know, when people are evaluating orchestration from the ground up these days, they're kind of generally choosing between three options: Prefect, Astronomer, and Dagster. And I think those three solutions are actually kind of going their separate ways, so to speak, with their most recent directions. So, for example, Prefect has their new Orion project, which is gonna be their 2.0 project. And I actually think Orion is quite cool.
And it actually makes it a system much more like Temporal, which is a microservices orchestration engine. So Temporal and Orion, or Prefect, I should say, they're actually more general tools than Dagster, and they're more operationally flexible, but they're much more imperative. So if you need to kind of build a state machine instead of a DAG, you should definitely choose those projects. And they're much less tailored towards the data platform use case. So they have no notion of data assets. You can kind of write almost like a Turing complete state machine using them. But as a result, you can't pre visualize what the computations are gonna be. There's very strict trade offs there. So if you want something more general and more imperative, and something with less operational constraints, and therefore more operationally complicated, you know, Temporal or Prefect is kind of the way to go. Airflow is kind of your, you know, task based orchestration system, and that doesn't really think about the full developer workflow.
And if you don't think that orchestration should have a fast developer feedback loop and that's not really what its purpose is, that you should have scripts and tools and then the orchestrator's only job is to order and schedule them in production and that's its only domain, then Airflow is a better solution. But for Dagster, it's like, if you are building a data platform and need a control plane for it and you believe that you should bring kind of software engineering best practices to the data domain and you want the orchestrator to be the seat of your full end to end life cycle, then we're kind of the solution for you.
[00:54:08] Unknown:
We've touched on this a bit already as far as what you've got planned for the near to medium term future of Dagster and Elementl. I'm just not sure if there is anything else that you wanted to bring up before we start to close out the show. Oh, there's so many plans and future directions to go. You know, we can save that for the next show. We are, you know, as a company
[00:54:28] Unknown:
internally and as well for our external communication purposes, we are laser focused on Dagster Day and the 1.0 release and, you know, launching cloud to the world. So those are our plans.
[00:54:42] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing or find out more about Dagster Day, I'll have you add your preferred contact information to the show notes. And as the final question, what is your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:55:00] Unknown:
And, Tobias, you always ask this of tool authors and vendors, and I feel like we're being
[00:55:08] Unknown:
sucked into a trap because, you know, it's gotta be what we're working on. Well, obviously, it's not because you're already solving that problem.
[00:55:16] Unknown:
Yeah. You know, to me, it's a couple things. One, I do think the lack of a cohesive end to end developer workflow in the context of data is still a huge problem, an unsolved problem. And I think some of the stuff that we're working on is gonna be a linchpin for that, but it's gonna be an ecosystem wide initiative to get that solved. The other thing that comes to mind, and this goes back to this kind of unbundling and bundling conversation, and not to rehash that, but this general notion that we are asking too much of data teams, that they have to assemble, like, 12 different tools in order to get basic end to end capabilities. And I saw a tweet on this once, like, if the data ecosystem was in charge of the car industry, there'd be a tire vendor, a steering wheel vendor, a seat vendor, a body vendor, and every single consumer would have to assemble the car by themselves.
And that resonated with me. That's why in our platform, we are building in a bunch of capabilities. I call them like a base layer because we're not trying to replace all the products, but, you know, we have integrated lineage, observability, orchestration, and a basic catalog. So that out of the box, you kind of have, like, these basic capabilities built in so that if you only need a simple version of those tools, you don't have to integrate a whole other vendor. So to me, it's about the end to end developer workflow and also this notion of, like, not having to integrate a dozen vendors just to get basic capabilities into your platform.
[00:56:51] Unknown:
Well, thank you very much for taking the time today to join me and share some of the recent updates in Dagster and the work that you're doing at Elementl and on Dagster Cloud. It's definitely great to see this commitment to stable APIs and releasing your cloud product as generally available. So definitely excited to see that happen and what comes next. I appreciate all the time and energy that you and your team are putting into building such a quality product. So I thank you again for that, and hope you enjoy the rest of your day. Thanks so much, Tobias. Thanks for having me. Thank you for listening. Don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest on modern data management, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you have learned something or tried out a project from the show, then tell us about it. Email hosts@pythonpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Updates on Dagster and Elementl
Software Defined Assets: Motivation and Impact
Early Outcomes and Development Approaches
Parallel Development of Cloud Product
Dagster 1.0 and Cloud GA Release
Branch Deployments and Development Lifecycle
Managing Test Environments and Tool Integration
Impact of 1.0 Release on Community
Open Source vs. Proprietary Features
Innovative Uses of Dagster and Cloud
Lessons Learned and Future Directions
When Dagster is Not the Right Choice
Closing Remarks and Future Plans