Summary
A core differentiator of Dagster in the ecosystem of data orchestration is its focus on software-defined assets as a means of building declarative workflows. With the launch of Dagster+ as the redesigned commercial companion to the open source project, the team is investing in that capability with a suite of new features. In this episode Pete Hunt, CEO of Dagster Labs, outlines these new capabilities, how they reduce the burden on data teams, and the increased collaboration that they enable across teams and business units.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Your host is Tobias Macey and today I'm interviewing Pete Hunt about how the launch of Dagster+ will level up your data platform and enable orchestration across languages
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what the focus of Dagster+ is and the story behind it?
- What problems are you trying to solve with Dagster+?
- What are the notable enhancements beyond the Dagster Core project that this updated platform provides?
- How is it different from the current Dagster Cloud product?
- In the launch announcement you tease new capabilities that would be great to explore in turn:
- Make data a team sport, enabling data teams across the organization
- Deliver reliable, high quality data the organization can trust
- Observe and manage data platform costs
- Master the heterogeneous collection of technologies—both traditional and Modern Data Stack
- What are the business/product goals that you are focused on improving with the launch of Dagster+?
- What are the most interesting, innovative, or unexpected ways that you have seen Dagster used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on the design and launch of Dagster+?
- When is Dagster+ the wrong choice?
- What do you have planned for the future of Dagster/Dagster Cloud/Dagster+?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- Dagster
- Dagster+ Launch Event
- Hadoop
- MapReduce
- Pydantic
- Software Defined Assets
- Dagster Insights
- Dagster Pipes
- Conway's Law
- Data Mesh
- Dagster Code Locations
- Dagster Asset Checks
- Dave & Buster's
- SQLMesh
- SDF
- Malloy
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png) Data teams are tasked with helping organizations deliver on the promise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. Dagster is an open-source orchestration solution that helps data teams rein in this complexity and build data platforms that provide unparalleled observability and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to [dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast) today to get your first 30 days free!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open source, cloud native orchestrator for the whole development life cycle with integrated lineage and observability, a declarative programming model, and best in class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise class hosted solution that offers serverless and hybrid deployments, enhanced security, and on demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started, and your first 30 days are free. Data lakes are notoriously complex.
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey. And today, I'd like to welcome Pete Hunt to talk about the launch of Dagster+ and how that's going to help level up your data platform and orchestration across languages. So, Pete, can you start by introducing yourself? Hey. Thanks, Tobias. I'm Pete Hunt. I'm the CEO here at Dagster Labs. And do you remember how you first got started working in data?
[00:02:01] Unknown:
Yeah. Well, if you want to go way back, I got a master's in distributed systems, and that was really about big data back when I was doing it. Back then, Hadoop was getting really popular and MapReduce was the state of the art for how things were done. So I would say that was my first taste of it. I went on to work at Facebook for a while, where I didn't actually do much with data, but I fell in love with dev tools and open source. I was a founding member of the React.js team over there. Eventually I left and found my way back to large scale distributed systems and data. I was cofounder of a startup that did trust and safety as a service. We would ingest a bunch of event data from social networks and marketplaces, and we would basically try to find bad guys on the Internet: spam accounts, hacked accounts, those sorts of things. There was a lot of real time stream processing with Kafka, a lot of orchestrating the training and productionization of machine learning models, and a lot of unsupervised anomaly detection using someone's intuition to create a heuristic. The real world of this stuff is pretty messy.
And I just really fell in love with it. We sold it to Twitter, and I managed some data teams there. I've been working in the space ever since.
[00:03:29] Unknown:
For folks who want to dig a bit deeper into Dagster specifically, we've done a number of episodes about that, so I'll add links in the show notes. But for the focus of today, you're getting ready to launch the new Dagster+ service. I'm wondering if you can just start by giving a bit of an overview of what the focus of Dagster+ is as a product, the story behind how this came to be, and the concepts that are core to this product launch?
[00:03:58] Unknown:
Yeah. So Dagster+ is kind of everything that you need to build a productive and scalable data platform for your organization, with Dagster at its core. I'll dig into that, because it sounds a little marketing-y and I'll try to be clear, but at the same time, that really is descriptive of what we're trying to do. Let's start with what Dagster is all about, because Dagster is at the core of this thing, then how we talked to customers and users, understood where the gaps were and what we were seeing in the macro trends, and then how we responded to it with Dagster+. Dagster is this open source project. It's been around for a long time, and we play in this data orchestration category. Traditionally, the data orchestrator is the main framework that you use to build your data pipelines.
It schedules them. It makes sure that they run on time, and that if they fail, they get retried. If the retries hit a certain limit, you page the on-call, and those sorts of things. We started the project originally because there were a lot of complaints around developer experience with existing solutions. There's a category of challenges with those systems: they're just not designed with modern software engineering best practices in mind. They require every team to share a single Python environment, for example, so you get into fights over whether Pydantic 1 or Pydantic 2 is the one true way that we're going to develop. Or they don't support local development and testing very well, because in order to create a true multi-tenant environment, you have to orchestrate a bunch of Kubernetes jobs and things like that. So, anyway, a lot of the existing orchestration solutions didn't have a great developer experience, and we started Dagster to get at that problem.
And it went pretty well. One of the key things that we figured out was important for delivering a good developer experience was getting the programming model right. It actually took a couple of swings for us to get there, but we finally did. We settled on this thing that we call software-defined assets. The way I would describe it is that Dagster open source and Dagster+ are both asset-oriented orchestrators, asset-oriented tools, whereas if you were to use something like Airflow or Prefect or one of these other tools, they're very workflow oriented. It's right on the website: they're workflow engines, and they do a good job at that. But what we started to realize as we were tackling this developer experience challenge was that asset orientation was really important, and it made a lot of things much easier. First of all, the mental model for developing data pipelines was much clearer. You write your code in a declarative way, you go asset by asset, and it becomes very easy for you to say, hey, for this given asset that my data consumer is asking questions about, how do I get from that to the code that produced it? If you develop in an asset-oriented way, it's very easy to make that leap. If you develop in a workflow-oriented style, you have to map this world of your data assets and the lineage separately from the workflows that produce and consume them. And you end up in these situations where it can be hard to figure out which workflow, or which step in the workflow, produced or consumed the asset. If you have multiple workflows depending on a single upstream asset, you often end up either overmaterializing it to keep it too fresh, which can create a bunch of problems around cost, or you construct these interlocking, fragile cron expressions where if one workflow gets slow, the other one doesn't get the data in time. So there were a lot of problems with this workflow-oriented style, and we didn't even realize it at the time. But we started to zoom out and take a look at other problems that users were facing in the space, and they started to get louder around 2022.
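To make the asset-oriented programming model concrete, here is a minimal sketch of two software-defined assets using Dagster's Python `@asset` API; the asset names and transformation logic are illustrative, not from the episode:

```python
from dagster import Definitions, asset


@asset
def raw_orders() -> list[dict]:
    # In a real pipeline this would pull from an API or a warehouse.
    return [{"order_id": 1, "amount": 42.0}, {"order_id": 2, "amount": 17.5}]


@asset
def order_totals(raw_orders: list[dict]) -> float:
    # Dagster infers the dependency on raw_orders from the parameter name,
    # so the lineage graph is derived directly from the code.
    return sum(row["amount"] for row in raw_orders)


defs = Definitions(assets=[raw_orders, order_totals])
```

The point of the example is that lineage is a byproduct of writing the code, rather than a separate mapping maintained alongside a workflow definition.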
So, 2022: interest rates went up, and tech left the boom era and went into a quieter, more economically challenging environment. A lot of data teams started to realize that these giant market maps of modern data stack tools that they were buying and integrating together were not sustainable. We had this asset-oriented orchestrator, and we were talking to our customers about their challenges with the explosion of complexity in the stack. And what we realized was that a lot of the users we were talking to who felt the pain the most were actually not Dagster users yet. They were struggling with this impedance mismatch between building workflows in something like Airflow and trying to map those to asset-oriented tools like Fivetran or dbt.
dbt calls them models, but we would argue that their models and assets are very comparable concepts. Every tool in the data stack, whether it was looking at metadata or doing data transformation or data movement, fundamentally dealt in terms of data assets. And mapping those into this workflow paradigm was very painful, and it required a lot of different point solutions to be manually integrated. That's what really gave us the idea to go after Dagster+ as our strategy moving forward. I just want to pause there and see if that makes sense. I know I talked a lot. Yeah. No. That makes good sense. And
[00:09:36] Unknown:
to the point of asset oriented versus workflow oriented, I know that in the initial formulation of Dagster, it was very much focused on that. We're going to string together a bunch of tasks, and under the covers, it actually still does that to a certain extent, but you're able to intelligently understand what the different subcomponents or the different outputs of the steps are, beyond just the macro of "I have this big batch of things." And in terms of the juxtaposition of Dagster open source, Dagster+, and also Dagster Cloud, which is your current paid offering, I'm wondering if you can give a bit of an overview of what that Venn diagram looks like, some of the new capabilities that you're hoping to address in Dagster+, and some of the ways that it is intended to address this aspect of complexity and multiple points of integration?
[00:10:38] Unknown:
Sure. Yeah. That's a great question. So, first of all, Dagster+ is the next evolution of Dagster Cloud. The Dagster Cloud name is going away; it's going to be called Dagster+ now. The reason for that is the original formulation of Dagster Cloud was, number one, hosted Dagster, so less infrastructure to manage, and in the case of serverless, no infrastructure to manage. It had some enterprise features around security, auditing, and role-based access control with fine-grained roles, and it had some developer quality of life improvements like branch deployments, which let you fork your data platform for every pull request and have a really nice and slick SDLC.
And that was pretty much it for Dagster Cloud. Dagster+ has grown a lot more features and capabilities and really is not just an orchestrator anymore, at least in the traditional definition. We like to say that we're trying to evolve the category a little bit and reposition where the integrations need to be. With assets at its core, the stack starts to look a little bit different, is what I'm trying to get at. For Dagster+ specifically, I could rattle off a list of features, but we have the asset graph at its core.
So we talk to customers, and we ask, what are some of the things that you want to do with that, or some of the questions that you want to answer? Oftentimes, these are around observability. They say, we need to understand our Snowflake spend or our BigQuery spend or our S3 spend. And the way they do that today is the data platform team goes out to all of their partners and customers and says, we need you to tag all your resources, and they're told that it will happen three quarters from now. So in the meantime, the platform team gets a call from finance every couple of months saying, hey, can you go figure out why we're spending too much money on such-and-such tool?
And what we're able to do, because the orchestrator runs all the compute, is know all the different services that your individual assets are talking to, and we also know the different teams associated with the assets and everything like that. We can attribute the spend on those different third party or first party services to the data asset itself. That becomes really powerful, because teams can now manage their own spend, and the central data platform team can look at the platform holistically and start to identify trends. And you can do all this without the big project of tagging the resources in AWS, building the connectors to something like Datadog (and paying Datadog, by the way), and then keeping those up to date. A lot of times we would hear that customers would do a spike on that once a year or every 18 months. It would work for a while, and then they would buy another company or spin up a new effort and have to repeat that work again, and oftentimes the teams would forget. So for attributing cost, spend, and resource utilization, we have a capability called Dagster Insights that leverages all the metadata and logs that the orchestrator is the source of truth for, and activates that for customers.
So everything else in Dagster+ is a flavor of that. I guess, Tobias, do you have any questions about cost, or should I go on to the other things that we're doing with Dagster+?
[00:14:09] Unknown:
No specific questions, but it's definitely, as you said, a very important element of the problem, and also one that requires a lot of focus and detail to do well and get right. And when you're talking about money, getting it right is obviously paramount. So
[00:14:26] Unknown:
Yeah. Yeah. Definitely. And so, again, we only build stuff in Dagster+ that takes advantage of Dagster Core's unique architecture and differentiators. With a workflow engine, you can't really attribute the spend in the same way. If you attribute it to a task, that is helpful, but your stakeholders don't know anything about tasks, and tasks often might be co-owned by six different teams depending on what they do. The data asset is a much cleaner way to do that. And second, we've spent a lot of time on developer experience in Dagster Core. We have this resources system that abstracts away interactions with external systems, and what we're able to do is inject additional metadata as part of those integrations that we maintain. So if you're using dbt and Snowflake, for example, you just flip a switch, and now you've got a comprehensive cloud cost management solution, which is building on these layers of abstraction that we set up a while ago. That felt pretty cool.
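As a rough illustration of the resources system he describes, here is a minimal sketch using Dagster's Pythonic `ConfigurableResource`. The warehouse resource and its fields are hypothetical stand-ins, not Dagster's actual Snowflake integration:

```python
from dagster import ConfigurableResource, Definitions, asset


class WarehouseResource(ConfigurableResource):
    """Hypothetical stand-in for an external system like Snowflake."""

    account: str
    role: str

    def run_query(self, sql: str) -> None:
        # A real resource would open a connection and execute the query;
        # maintained integrations can also emit usage metadata from here,
        # which is the kind of signal that powers cost attribution.
        print(f"[{self.account}/{self.role}] {sql}")


@asset
def daily_revenue(warehouse: WarehouseResource) -> None:
    # The resource is injected by name, so tests can swap in a fake.
    warehouse.run_query("select sum(amount) from orders")


defs = Definitions(
    assets=[daily_revenue],
    resources={"warehouse": WarehouseResource(account="acme", role="loader")},
)
```

Because every interaction with the external system flows through the resource, swapping in an instrumented implementation is the "flip a switch" he mentions.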
The second thing that we're doing for Dagster+ is we've got a really awesome data catalog that we've been working on for a while. If you've used Dagster before, and I know, Tobias, you've used it, we're this asset-oriented orchestrator, and we have a view of all the data assets that we call the asset catalog in Dagster. We've called it that for a long time. We were talking to users, and they said, hey, this thing is really close to all we need for a data catalog, but it's not quite there. Part of it was that it was missing some features, and we've built some of those. A good example is we now have dbt column-level lineage in the Dagster+ data catalog, and the metadata in the open source version.
The other thing is the information architecture was all wrong. We had all this cool information that they wanted access to, but the stuff that we put front and center was what data platform teams care about: the materialization history, what resources it was using, execution time over the last week, that kind of thing. What people actually wanted was to see the schema and a description, and have those be the top things in the UI. So a lot of this was just reorganizing the UI, doing some user research, and talking to customers. We've now got this thing into a good place, and by the way, open source has gotten a lot of these improvements too. The open source data catalog is much better than it used to be. If the majority of your data platform is in Dagster, you probably don't need a separate data catalog. If you're using a data catalog as a thing that sucks in metadata from a bunch of different systems, we don't have a big connector library that does that. But we think that centralizing on Dagster, again, in this asset-oriented way, produces a lot of productivity benefits, and you should get a data catalog for free for doing that. And then Plus levels up the data catalog, specifically around search and discovery. So, again, Dagster+ is all the stuff that you need to build a scalable and productive data platform for your organization. Data discovery is really important, and so we have a lot of multiplayer features and a totally different search experience in Dagster+ that is, I think, way better than open source. It's hard to build a search experience in open source anyway, because it requires custom infrastructure and lots of iterating on ranking and things like that. So
[00:18:11] Unknown:
Absolutely. And in the launch announcement for Dagster+, there are a few different bullet points that you lay out. Taking them in turn: there's making data a team sport; delivering reliable, high quality data that the organization can trust; observing and managing data platform costs; and mastering the heterogeneous collection of technologies. You've addressed some of those, in particular the cost aspect and the reliability and observability aspects. One of the interesting pieces, and I think they play off of each other, is making it a team sport and being able to work across the different platform components, and in particular across language barriers, where up until now Dagster has largely been single player mode: if you write Python, then great; if you don't write Python, well, then just write some Python. I know that with the Dagster Pipes feature, you're branching beyond that a bit more. I'm wondering if you can talk to the heterogeneous aspect of the data environment, some of the ways that you're thinking to move beyond Python as the be-all end-all of the orchestration definition and the execution context, and how that enables the team sport element of being able to collaborate across a wider group of individuals and teams?
[00:19:32] Unknown:
Yeah. That's a great question. So first of all, and I know I sound like a broken record, but asset orientation is really key here. Let me maybe tell you how multi-language support would work in more of a workflow-oriented world. I bet a lot of listeners, if they've built a data pipeline, have a CI/CD pipeline that builds a Docker container, and they have some cron job or Airflow task or something else that spins it up and runs it somewhere. And if they want to be able to observe that, or model it in their data catalog, each of those is a separate integration. In an asset-oriented world, the default should be to get asset lineage data out of this thing, along with all the appropriate metadata. What we heard from a lot of customers was that they were writing a lot of Python that would spin up their Docker container or Kubernetes job or whatever and run it. Then they would find a way to communicate with that thing and log the asset data back using Dagster's Python APIs.
So we basically took that pattern and built a protocol around it that we call Dagster Pipes. This is a protocol that supports pluggable transport mechanisms. There's a subprocess launcher for this thing, a Kubernetes launcher, a Databricks launcher, and a local Docker one, I think, too. It communicates with a very basic JSON protocol, and it passes execution metadata from Dagster to your subprocess: hey, this is the partition range that we are materializing today; here's a bunch of other information about the run. Then that custom code, written in whatever language you want to use and running in whatever environment you want, as long as we have a way of getting the JSON payload back (we have a couple of different transports to do that), will tell you: here's all my metadata, here's the trace of my execution, here's whether I succeeded or failed, and a lot of stuff like that. We have a reference implementation for that in Python, obviously. So if you want to take the Kubernetes jobs that you're currently orchestrating in Airflow, for example, and port those over to the system, it's very, very simple. And we also have prototypes in languages like Rust.
We've also got an internal one, I think, in TypeScript too. You're starting to see these AI native practitioners, I'm going to call them, that are actually writing data jobs in JavaScript. So we think that having a language independent protocol for asset-oriented orchestration is really important. That's what Pipes is all about.
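For a feel of that shape, here is a minimal sketch of both sides of a subprocess-based Pipes run, assuming the `dagster` and `dagster_pipes` Python APIs from around the time of this launch; the file names and metadata values are illustrative:

```python
# external_script.py -- runs in its own process and environment
from dagster_pipes import open_dagster_pipes

with open_dagster_pipes() as pipes:
    # Execution metadata (partition range, run info) arrives from the
    # orchestrator over the Pipes JSON protocol.
    pipes.log.info("starting external computation")
    row_count = 1_000  # stand-in for real work
    pipes.report_asset_materialization(metadata={"row_count": row_count})
```

```python
# orchestrator side
import sys

from dagster import AssetExecutionContext, Definitions, PipesSubprocessClient, asset


@asset
def external_asset(context: AssetExecutionContext, pipes_client: PipesSubprocessClient):
    # Launch the script and relay its JSON messages back as Dagster events.
    return pipes_client.run(
        command=[sys.executable, "external_script.py"],
        context=context,
    ).get_materialize_result()


defs = Definitions(
    assets=[external_asset],
    resources={"pipes_client": PipesSubprocessClient()},
)
```

The external side only needs to speak the JSON protocol, which is why reimplementations in Rust or TypeScript are feasible.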
[00:22:17] Unknown:
And as a brief digression, one of the interesting thought exercises is trying to decide what it means for something to actually be a data asset versus just a step in a graph. I'm wondering what are some of the philosophical ways that you have tried to make that ideology concrete and help people navigate it?
[00:22:47] Unknown:
Yeah. We think the default should be asset, but that's not a super helpful framing of the answer. What I will say is, first of all, there should be a manifestation of the asset somewhere. Here's something that's not an asset: sending a notification or sending an email. That is clearly a side effect that happens in the world, and it is probably not an asset. But if you are taking a bucket of bytes and storing them somewhere, that is probably an asset.
Because if you think about it, if you have some durable state somewhere, you probably want to have some metadata about it. Other folks in the organization will probably want to depend on it. And you'll probably want some SLAs and policies around, hey, how often do we refresh this thing? How often is too often, and how often is not often enough? We've got a lot of features in Dagster Core, which is our open source version, and Dagster+ that support that. So, broadly, I think if you're storing data in a system, that is probably a data asset.
[00:23:58] Unknown:
Yeah, I agree there. Where it really starts to get tricky is: I just need to dump this thing into an S3 bucket somewhere to be able to pass it off to another thing. Do I call that an asset, or is that intermediate state that should just be hidden somewhere? I think, ultimately, it doesn't hurt to treat it as an asset, because maybe there is some other downstream use of that intermediate state that is useful in somebody else's context.
[00:24:20] Unknown:
Yeah. You can certainly have fatter or skinnier assets. I've noticed that different teams take more of a micro asset or larger asset approach (please, god, let's not use those terms). But I think you're right that sometimes it can be a little unclear, and erring on the side of modeling them as assets is, I think, the safe option. And, just previewing a little bit of where we're going to go, one thing that we've been talking about, and we want to get it right, is this notion of public versus private assets.
So that intermediate S3 file: for your team, you actually might want to visualize it and have multiple things that depend on it. But you probably don't want some external team taking a dependency on it, because that's an implementation detail that they shouldn't be aware of. What we're working on later this year, which is not going to be part of the Dagster+ launch, is this notion of different teams, which assets are visible to which teams, and which ones are considered public versus private.
[00:25:34] Unknown:
To throw some more buzzwords in there: which ones are a data product, and where are the boundaries of your data mesh?
[00:25:41] Unknown:
Well, yeah. We could talk about data mesh. It's always funny, because whenever we say the words data mesh internally, somebody always laughs. It is very buzzwordy, but it's also a thing that people are actually doing. I like to call it data decentralization, but really, what it's about... I guess I don't actually know. Does anyone really know what data mesh means? The way I see it, anyway, is that it's this balance between certain facets of the data platform that are best managed by a central team and other aspects that are best decentralized and managed by separate product areas. And really dialing in that balance is pretty tough.
But if you can get it right, it can unlock a ton of productivity. It harkens back to the old microservices versus monolith debate at the end of the day.
[00:26:38] Unknown:
Absolutely. Ultimately, it all boils down to Conway's Law.
[00:26:41] Unknown:
Yeah. Exactly. That is totally true. I'm working on a presentation for Data Council right now, and I'm not sure if I'm going to put this slide in there or not, but it makes the case for why the asset graph is so important by overlaying the asset graph with the org chart and showing that your graph of data dependencies often represents the team boundaries within your organization. It's important to be able to visualize them and maintain them explicitly in the system of record.
[00:27:10] Unknown:
And that also brings us more into the multiplayer mode aspect of the data orchestrator and the capabilities of the data platform, and your note of the public versus private assets and what the visibility boundaries of those different assets are. I'm wondering if you can dig a bit more into what it means for the data orchestrator to be an enabling force for this multiplayer and data collaboration requirement across an organization and across those different organizational boundaries?
[00:27:43] Unknown:
Yeah. So I'll start with what we've got today, today being everything that we've built in Dagster Core so far plus the Dagster+ launch, and then maybe I'll tell you a little bit about the road map that hasn't quite landed yet. In terms of what we've got today, one of the biggest enablers here is making your data assets or data products discoverable to the right stakeholders and helping them self serve information about those data assets. That is key to making this thing multiplayer. It seems like every single shop that we've talked to that has deployed a data catalog says, yes, we need this, but the data is always out of date, and nobody ends up actually using it because they can't trust the data that's in these things. Fundamentally, that goes back to this impedance mismatch between workflows and the data catalog: you have to do this mapping between the workflows that are maintaining the data assets and the catalogs that are visualizing them, and if you don't constantly keep that up to date, the data catalog can fall out of date really easily. Then you're back to the data health channel in Slack, where people are just asking questions all the time. With Dagster+, we're really trying to get at that problem. We think that deep integration between the data catalog and the orchestrator makes a lot of sense, because the orchestrator is the source of truth for the operational data that should be visualized in the data catalog. The second thing is, we think the programming model is just better for dealing with multiple teams.
So I can publish my data assets into this graph, and another team can depend on them. There's a clear contract between the consumer of the data asset and the producer of the data asset. In a workflow-oriented system, you don't have that clear contract. There's maybe a Google Sheet somewhere that says, here's the S3 URL where the data will be, and I update it at 9 AM every morning. Then you read from it and hope that it's updated, and maybe you run some assertions on the last modified time or something. But you don't have that explicit dependency graph between the two, and a team can't go spelunk through it and see who is depending on a dataset: if I were to delete this, what would break? A lot of times, teams just delete it, see what breaks, restore it, and then work with the affected teams. Having that lineage is really, really important. In terms of where we're going with Dagster+ with respect to these multiplayer, team sport types of features, we think there is a lot of opportunity in governance. I've said the words data catalog a lot while really focusing on discoverability and observability, but a lot of folks really lean on the governance capabilities of their data catalogs. We don't have any of that stuff yet, but I think there's a lot of exciting potential.
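As a sketch of that producer/consumer contract, a downstream team can declare an explicit dependency on another team's asset by key, so the lineage lives in the system of record rather than in a spreadsheet; the asset names here are hypothetical:

```python
from dagster import AssetKey, asset


# Owned by the upstream team, possibly in a different code location.
@asset
def customer_events():
    ...


# The downstream team declares its dependency explicitly by key, so the
# orchestrator knows exactly who breaks if customer_events goes away.
@asset(deps=[AssetKey("customer_events")])
def weekly_engagement_report():
    ...
```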
So first of all, we've got the lineage: the lineage of the data assets, as well as the column level lineage if you're using a technology like dbt that supports it. And so we can write assertions about how personal data flows through the system. A common question that we would get in my last job was, the Irish data protection authority wants to know how you're using EU citizen email addresses. Answering that question without this type of infrastructure can be really challenging. But an asset-oriented orchestrator has all that information baked in, and we can start to write assertions saying, hey, if my upstream has personal data in it, all the downstreams by default will have personal data in them unless we explicitly opt out. So we can start to watch personal data flow through the system.
And because we control the execution, we can go further. Again, this is more future stuff, and we don't have it built yet, but you could imagine a world where we have data quality checks that examine the access control on the data itself and actually refuse to refresh the pipeline: your downstream will refuse to materialize if the upstreams do not have the right access control bits set, wherever their data is stored. Again, this deep integration of the asset graph and the execution engine, or the data orchestrator as we like to call it, opens up, dare I say, a paradigm shift in how we build a lot of these things. A lot of things just become way easier when this impedance mismatch goes away. To the point of the asset graph and the organizational boundaries and the visibility of
[00:32:13] Unknown:
the flow of data throughout these multiple stages: in the task oriented view, typically the only lineage you'll get is the specific dependency sequencing across those discrete tasks. As soon as that terminal node completes, that's the last bit of visibility you're going to get about that piece of data, unless you have an overarching system like a data catalog where you're ingesting all of the metadata and then reconstituting the visibility of how those boundaries propagate. Whereas with Dagster, presuming that it's all deployed under a single orchestrating node of the Dagster web UI or Dagster Cloud, it doesn't matter what the actual task boundaries are. As long as the metadata linkages for those assets are designed appropriately, you can automatically view that dependency graph beyond the process boundaries and beyond the pipeline boundaries, regardless of what the specific task steps are.
And I'm wondering what are some of the ways that you're seeing that change how data teams think about their organizational structures and their collaboration across the broader business context, beyond just the core teams and their core responsibilities for the data assets that they're producing in the traditional task oriented flow?
[00:33:33] Unknown:
Yeah. That's really, really cool. And you touched on something there, or danced right close to it, that I think is really interesting, which is that there's not a one-to-one relationship between data assets and tasks. You often see tasks, meaning steps in a workflow, that in the case of something like dbt might materialize hundreds of tables. One of the things that's secret sauce in our implementation is that we can take these steps and subset them: our engine understands how to materialize only smaller subsets of the dbt graph in the event that's what you want. There are other systems that try to model dbt projects as a graph of tasks, and you end up with one task per model, and it's super slow and kind of painful. Anyway, that's a bit of a tangent. You asked about the ways that organizations change when they adopt this approach. What we have seen is that at smaller shops, say a three person data engineering shop, they like this asset orientation approach because it's just a better developer experience.
There's not much of an organization to transform, so organizational transformation is not really top of mind for them. But they like writing code in a declarative style and being able to visualize it and observe it. For larger organizations, though, what we have found is that budgets have stopped going up. They're not really decreasing so much; they've been flat for the last couple of years. But the needs of the business have been increasing. We have customers with hundreds of seats: data engineers embedded in every product team, producing insights for their individual product areas. There's more and more data available via APIs every single year, and they want more of it. They want to be able to make business decisions based on it. This is a phenomenon that has been going on for 20 years.
Right? But it's still going on, and the stacks have been really complicated. So they're really trying to empower those embedded data engineers within each product area to self serve as much as possible. Again, there's this question of how much the central data platform team needs to own versus what it can farm out to the individual product areas. I think there's this basic desire to do the DevOps thing and give as much autonomy to those embedded data engineers as possible. And that's great, but it's harder than just doing it in practice, because you have compliance concerns that often need to be solved by a central team, and there are infrastructure management challenges where you want to take advantage of economies of scale. What we've basically found with Dagster Core and Dagster+ is there's one uber DAG of assets across the whole organization. So you get this single pane of glass, top down view of all the metadata, all the freshness information, and all the spend, subject to the access control that we have.
Then you can chop up ownership of that graph and give it to different teams, and because there's a strict contract between the consumer of an asset and the producer of an asset, that's something you can do. As for where those actually execute, there's a layer below: we do have a workflow layer that sits below the asset-oriented layer. It's just that most of the time, people are writing code at the higher level; you can drop down into the underlying workflow engine. And we have this notion, a core architectural decision we made, called code locations. The idea is that one or more assets can map to a single underlying thing we would call an op, which Airflow would call a task. That is, most of the time, completely invisible to the user. They just write assets, and the op runs in something called a code location. You can do that mapping however you want. What we found is that smaller shops have one code location that does everything, a very monolithic approach. Most shops are in the middle, where they have a small number of code locations and multiple teams or multiple people contributing to a single code location. A code location, by the way, is basically a service that runs on Kubernetes, runs all your code, and speaks a gRPC protocol to the Dagster orchestrator and scheduler. And we are now seeing this really interesting thing where shops run hundreds of code locations, and they really are trying to have every individual data engineer run their own code location and be accountable for its runtime environment and all that. I tend to be a moderate in these situations. I'm a macroservices guy: service-oriented architecture is great, but you should have a smaller number of multi-tenant services. So the folks in the middle that have a code location per team make sense to me. I don't know if that was pretty technical, but is that getting at what you're talking about? Yeah. Absolutely. And digging more into that code location element and
[00:38:53] Unknown:
the reusability and the distributed versus centralized aspect, there are a lot of elements to that. For my use case, for instance, I have a fairly small team, but we support a number of different product lines, and so I've been orienting the code locations around those different product lines, not necessarily around the team dynamics. And the reusability aspect comes in with some of those core platform elements, in particular the IO managers: I just want to be able to fetch and push files to object storage and use those in the context of the workflow, or I want to have some reusable logic around how a particular type of data element is processed, for consistency and, to your point, compliance.
And so being able to have that flexibility in terms of how those core primitives are composed into a given platform and deployment architecture has been very valuable.
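As a rough sketch of that pattern, here is a hypothetical code location that bundles a reusable object-storage-style IO manager with a product line's assets, using Dagster's `ConfigurableIOManager`; the local-disk storage, paths, and names are illustrative stand-ins (a production version would target S3 or GCS):

```python
import json
import pathlib

from dagster import ConfigurableIOManager, Definitions, asset


class ObjectStoreIOManager(ConfigurableIOManager):
    """Hypothetical IO manager that persists asset outputs as JSON files."""

    base_dir: str

    def _path(self, context) -> pathlib.Path:
        return pathlib.Path(self.base_dir, *context.asset_key.path).with_suffix(".json")

    def handle_output(self, context, obj) -> None:
        # Called after an asset materializes; stores the value by asset key.
        path = self._path(context)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(obj))

    def load_input(self, context):
        # Called when a downstream asset needs this value back.
        return json.loads(self._path(context).read_text())


@asset
def product_line_metrics() -> dict:
    return {"signups": 128}


# One code location corresponds to one Definitions object served to Dagster.
defs = Definitions(
    assets=[product_line_metrics],
    resources={"io_manager": ObjectStoreIOManager(base_dir="/tmp/assets")},
)
```

Because the IO manager is just a resource, several code locations can share one implementation while configuring different storage targets.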
[00:39:54] Unknown:
Yeah. And I think it's another place where Conway's Law can sneak in. The structure of your code locations and the structure of your organization: one starts to resemble the other over time, I think.
[00:40:06] Unknown:
Absolutely. And in your reimagining of what Dagster Cloud is intended to provide, and how this relaunch in the form of Dagster+ is being conceived of and deployed, what are the core business and product goals that you're focused on addressing with this relaunch of your paid platform?
[00:40:29] Unknown:
Yeah. That's a great question. So there are a couple of focuses here. The first one is to enable data as a team sport with best in class developer experience and observability. Specifically, we have this feature called branch deployments, and we've really leveled it up. We have basically a branch deployments v2 coming out in Dagster+ that's already in the hands of design partners, and we're getting a lot of really positive feedback about it. Branch deployments basically lets you fork your whole data platform for every pull request.
So you don't have to fight over staging environments anymore. You can preview what this thing is going to do, run it, and configure it to write new data assets into a staging area. Again, this is another advantage of an asset-oriented approach versus a workflow-oriented approach. I think I can say with a straight face that we have the best developer experience, or at least the most modern SDLC, of any of the alternatives out there. The second one is really helping teams manage the costs of their data platform at scale. That's Dagster Insights.
By default, if you're using one of our official integrations, like our dbt integration, our Snowflake integration, BigQuery, or OpenAI, you'll get cost information flowing in automatically with no work. You'll be able to set alarms on it, attribute the spend to specific data assets, and make decisions about what to do about it. You can also use that to optimize your spend on our cloud product too, by the way. The third is, again, around the data catalog. We've really focused on leveling up the data catalog, making it easier for both data pipeline builders and data consumers, like analysts, to use. We've also really leveled up the data discovery features: we built a brand new search experience that is fundamentally multiplayer in nature. The last one is around data reliability.
We have a feature in Dagster Core and Dagster+ called asset checks. You can think of it as a way to model data quality checks in an asset native way. So, for example, if you import your dbt project into Dagster, which is a first class integration we have, all your dbt tests will show up as asset checks, and you can run them separately and make orchestration decisions based on them. That's an open source feature as well as a Dagster+ feature. But with Plus, we are building on that a lot. We have integration with the Insights feature, so you can start to visualize how your data quality is trending over time.
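For a sense of how the dbt integration he mentions is wired up, here is a minimal sketch assuming the `dagster-dbt` package APIs from around this launch; the manifest and project paths are illustrative:

```python
from dagster import AssetExecutionContext, Definitions
from dagster_dbt import DbtCliResource, dbt_assets


@dbt_assets(manifest="target/manifest.json")
def my_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # "dbt build" runs models and tests; results stream back to Dagster as
    # asset materializations and asset check evaluations, and the engine can
    # subset this single step down to individual models when needed.
    yield from dbt.cli(["build"], context=context).stream()


defs = Definitions(
    assets=[my_dbt_assets],
    resources={"dbt": DbtCliResource(project_dir=".")},
)
```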
We're also shipping built-in freshness checks: looking at the last time your dataset was updated and deciding whether that is fresh enough. When we talk to our customers and ask, hey, what are your biggest data quality challenges, freshness is the biggest thing by a mile. So we have a rules based freshness capability coming out in Dagster+ where you can just say, hey, if this thing is older than 8 hours, page me. And then we also have an unsupervised anomaly detection feature coming out in Dagster+ as well, which we're very excited about.
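As an illustration of the asset check shape described here, the sketch below hand-rolls an "older than 8 hours" freshness rule with the open source `@asset_check` API (still experimental around the time of this launch); the asset name and lookup logic are stand-ins, and the rules based checks in Dagster+ are meant to remove exactly this kind of boilerplate:

```python
from datetime import datetime, timedelta, timezone

from dagster import AssetCheckResult, Definitions, asset, asset_check


@asset
def orders_table() -> None:
    ...  # materializes the table somewhere


@asset_check(asset=orders_table)
def orders_table_is_fresh() -> AssetCheckResult:
    # Stand-in for however you look up the table's last update time.
    last_updated = datetime.now(timezone.utc) - timedelta(hours=2)
    age = datetime.now(timezone.utc) - last_updated
    return AssetCheckResult(
        passed=age < timedelta(hours=8),
        metadata={"age_hours": age.total_seconds() / 3600},
    )


defs = Definitions(assets=[orders_table], asset_checks=[orders_table_is_fresh])
```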
[00:43:52] Unknown:
And as you have been building and expanding on the core Dagster project and the paid services that you're building around it, what are some of the most interesting or innovative or unexpected ways that you've seen those utilities used?
[00:44:09] Unknown:
Well, one of the things that I really like about working in data is that it's so horizontal. Every company in every domain uses data and has data engineers. I was driving down the highway the other day, and I saw this big sign for Dave & Buster's, which is like an arcade for adults: they serve nachos and beer, and you play pinball or something. I was talking to somebody about the company while we were passing by, and I said, you know, data engineers are everywhere. I bet Dave & Buster's even has data engineers. And I went on LinkedIn and searched for "data engineer Dave & Buster's", and there was a whole data engineering team at Dave & Buster's. You wouldn't think that a place that serves nachos and has pinball machines would have a data engineering team, but they do. They're not a customer, by the way; they're just a random company that I drove by on the highway. But you've got companies like them building data pipelines.
You've got Dagster customers that are launching rockets, doing drug discovery, and making large scale fast food and consumer packaged goods. We've got energy companies using us, financial institutions, and, yeah, trendy Silicon Valley tech companies too that do social networking stuff. It's really horizontal, and it's really cool; we're all doing the same stuff at the end of the day. You could take a data engineer from Dave & Buster's and sit them next to a data engineer from some place like SpaceX, and they'd have something to talk about. I think that's really cool. We see that in the Dagster community, but, frankly, it's not exclusive to us. It's everywhere in the data community.
It is interesting, though, seeing how people push the system and organize their code in new and exciting ways that we didn't think of. The other day, we saw somebody... well, we've just seen some interesting tech stacks. I'll leave it at that.
[00:46:25] Unknown:
And in your work of guiding the direction of the company, working with your customers, and working with the technologies involved, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process, in particular in the lead up to the Dagster+ launch?
[00:46:45] Unknown:
That's a broad question. I think there are some tactical lessons that we're learning. One key question for us is always: do we make this UI driven, or do we make this code driven? That's something that we're still dialing in. Most of our features in Dagster+ are going to be UI driven, but a handful of them are quite code driven. So there's a question of how much you make people scaffold up and have it be really flexible, versus how easy you make it. I think that, and this might be a little too transparent, our last launch event was, I think, in Q3 of last year, and that was a real hustle.
We worked really, really hard, and it was really down to the wire. For this one, the features have, for the most part, been with design partners and early access customers for a while. I don't think we're going to have any crunch this time. And we don't want to be a company that relies on crunch in order to ship. We want to build a big, successful, sustainable company, and that requires being sustainable in the way that we work, and crunch is not a sustainable thing. So I'm very proud of the team and the culture that we've built, and I think we have a steady launch this time, which has been great. And for people who
[00:48:19] Unknown:
And for people who are either interested in Dagster and starting to adopt it, or who've been using Dagster for a long time, what are the cases where Dagster+ is the wrong choice?
[00:48:30] Unknown:
The wrong choice. Well, I think if you're very early in your journey. If you are writing your first line of code, and especially if you're a curious engineer and you want to stand up Postgres and stand up a Kubernetes cluster and own the whole stack end to end, I think that's a very valuable exercise for individual engineers to do. Maybe not for every project, but at least the first one or two times. You definitely want to use open source for that. On the other hand, the signals are if you find yourself starting to get paged for infrastructure, or if you're finding that you have to integrate a lot of tools.
Because, again, while Dagster Core is asset oriented, if you want to do all that cost management stuff or anomaly detection, or have a rich, searchable data catalog for your at-scale teams, those are all going to be a lot of manual work. And so if you find yourself with more than one or two business stakeholders, you might want to consider going to cloud.
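For readers unfamiliar with what "asset oriented" means in practice, here is a minimal sketch of a software-defined asset in open source Dagster. The `@asset` decorator is the core primitive being referenced; the asset names and their contents are hypothetical, invented for illustration:

```python
from dagster import asset

# A software-defined asset: a declarative description of a data artifact
# (here, a hypothetical "orders" table) plus the code that produces it.
@asset
def raw_orders() -> list[dict]:
    # In a real pipeline this might pull from an API or a warehouse.
    return [{"order_id": 1, "amount": 42.0}]

# Downstream assets declare their upstream dependencies as parameters,
# which is what gives Dagster its built-in lineage graph.
@asset
def order_totals(raw_orders: list[dict]) -> float:
    return sum(row["amount"] for row in raw_orders)
```

The point of the model is that you declare the artifacts you want to exist, and the orchestrator derives the dependency graph from the code itself.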
[00:49:50] Unknown:
And as you continue to build on and expand the Dagster Core and Dagster+ offerings, you've mentioned a few of the different things in the roadmap. I'm just wondering if there's anything else that you have planned for the near to medium term, or if there are any particular projects or problem areas that you're excited to explore and understand at a deeper level.
[00:50:11] Unknown:
Like I said, I think that from the Dagster+ perspective, we're going to double down on the areas that we've talked about. So we're going to continue to evolve and improve our data reliability offerings. We're probably going to introduce some kind of data governance capabilities towards the end of the year. That public versus private assets thing, and understanding how PII is flowing through the system, are two areas that we think are ripe for exploration. But, you know, we've talked a lot about Dagster+, but Dagster open source is a huge priority for us too. We're going to have a bunch of stuff landing there, probably at the end of Q2 or early Q3, and we'll probably do some comms around that. A lot of work related to our core scheduling primitives is happening over there. We have this auto-materialization capability that we're working on leveling up right now.
So stay tuned on that. Like I said, Dagster Core, our open source offering, did benefit a lot from the Dagster+ work, but we're going to have some more open source-y stuff landing at the end of Q2 or early Q3.
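As a rough illustration of the auto-materialization capability mentioned above, here is a minimal sketch using the `AutoMaterializePolicy` API that shipped in open source Dagster around this time. The asset names are hypothetical, and the API has continued to evolve since this conversation, so treat this as a sketch rather than current reference documentation:

```python
from dagster import AutoMaterializePolicy, asset

@asset
def upstream_table() -> list[int]:
    # Hypothetical source asset; imagine this ingests fresh data.
    return [1, 2, 3]

# With an eager policy attached, Dagster's scheduling machinery will
# automatically materialize this asset whenever its upstream
# dependencies have newer materializations, instead of requiring a
# hand-written cron schedule.
@asset(auto_materialize_policy=AutoMaterializePolicy.eager())
def downstream_table(upstream_table: list[int]) -> int:
    return sum(upstream_table)
```

This is the declarative scheduling idea in miniature: you state the freshness behavior you want per asset, and the orchestrator decides when runs need to happen.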
[00:51:28] Unknown:
Are there any other aspects of the Dagster+ launch, the features that you're focused on shipping as part of that, and the ways that you're hoping to empower data teams as a result of those capabilities, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:51:45] Unknown:
The last note that I would add, and you had asked about open source versus Dagster+: open source versus Dagster+ is not on-prem versus fully managed cloud. We have this cool hybrid deployment option for Dagster Cloud where you run an agent in your infrastructure, and then we only exchange metadata signals and some logs. We don't actually see any secrets or personal data or anything like that. What we found is that companies will come to us and say, hey, our security team will never approve a cloud service. We're just talking to you to learn a little bit more, but we're probably not going to go with you. And once they learn a little more about the architecture of our Dagster+ cloud service, they end up getting super comfortable with it, because it's a very low risk thing. So that's the only thing that we didn't quite touch on, the underlying architecture.
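To make the hybrid architecture concrete, here is a simplified, hypothetical sketch of the agent pattern being described. This is not Dagster's actual agent code or API; the endpoint names, payloads, and URL are invented for illustration. The key point is the direction of traffic: the agent polls outward from inside your network, launches work locally, and reports back only metadata:

```python
import time

import requests  # third-party HTTP client

CONTROL_PLANE = "https://example-control-plane.invalid/api"  # hypothetical URL
AGENT_TOKEN = "agent-token-from-env"  # in practice, read from a secret store

def poll_for_work() -> list[dict]:
    """The agent initiates all connections; the control plane never dials in."""
    resp = requests.get(
        f"{CONTROL_PLANE}/pending-runs",
        headers={"Authorization": f"Bearer {AGENT_TOKEN}"},
        timeout=10,
    )
    return resp.json()

def launch_run_locally(run_spec: dict) -> str:
    # Compute, code, and credentials all stay inside your infrastructure.
    # Here we just pretend the run succeeded.
    return "SUCCESS"

def report_metadata(run_id: str, status: str) -> None:
    # Only metadata (status, timings, logs) crosses the network boundary;
    # no secrets or row-level data are ever sent to the control plane.
    requests.post(
        f"{CONTROL_PLANE}/runs/{run_id}/status",
        headers={"Authorization": f"Bearer {AGENT_TOKEN}"},
        json={"status": status},
        timeout=10,
    )

if __name__ == "__main__":
    while True:
        for run in poll_for_work():
            report_metadata(run["id"], launch_run_locally(run))
        time.sleep(30)  # poll interval
```

Because the agent only makes outbound requests and ships status metadata, the security review surface is much smaller than with a fully managed deployment, which is the point being made above.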
[00:52:45] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing and to take part in the launch event, I'll have you add your preferred contact information to the show notes, and I'll add a link to the launch event there as well. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:53:08] Unknown:
Yeah. That's a great question. I had two candidates for that one. The first is, I had mentioned governance a lot. I still think that governance is a challenge, and nobody is happy with the status quo. Engineers aren't happy. The lawyers aren't happy. Nobody's happy with the state of data governance today. The second one is that I'm really excited for there to be a TypeScript for SQL, something that brings modularity and type checking to the SQL query language. dbt has taken some great steps in that direction and has completely revolutionized how people do it. But in the front-end world, we saw this move from string-based templating languages towards truly composable, statically typed, modular systems, and I think we're going to see that in the SQL world soon.
So I'm very excited for that to happen.
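As a hypothetical sketch of the contrast being drawn, in Python rather than any actual SQL dialect: the first approach composes queries as strings, in the dbt-templating spirit, where a typo only surfaces when the database rejects the query; the second models the query as typed objects that tooling can validate before anything executes. All names here are invented for illustration:

```python
from dataclasses import dataclass

# String templating: a misspelled column name is only caught at run time,
# when the database rejects the generated SQL.
def orders_by_region_sql(region: str) -> str:
    return f"SELECT order_id, amount FROM orders WHERE region = '{region}'"

# A composable, typed alternative: the query is data, so column references
# can be checked against a declared schema before execution.
@dataclass(frozen=True)
class Column:
    name: str
    dtype: type

@dataclass(frozen=True)
class Table:
    name: str
    columns: tuple[Column, ...]

    def select(self, *names: str) -> "Table":
        known = {c.name: c for c in self.columns}
        missing = [n for n in names if n not in known]
        if missing:
            # The "type error" surfaces immediately, not at query time.
            raise KeyError(f"unknown columns: {missing}")
        return Table(self.name, tuple(known[n] for n in names))

orders = Table("orders", (Column("order_id", int), Column("amount", float)))
orders.select("order_id", "amount")   # fine
# orders.select("amonut")             # would raise KeyError: a caught typo
```

A real "TypeScript for SQL" would go much further, but the shift is the same one described above: from stitching strings together to composing checked, typed query fragments.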
[00:54:02] Unknown:
And we are seeing that a little bit already with things like PRQL and Malloy, and I think it really becomes a matter of what the adoption curve of these tools is and what surface area they're able to cover.
[00:54:15] Unknown:
Yeah. There are a couple of others too. SQLMesh and SDF are other exciting projects in this space. There's not a winner yet, I don't think, but there's a lot of exciting innovation happening.
[00:54:30] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you and your team have been doing on the Dagster+ product and the upcoming launch. It's definitely great to see the continued investment in the capabilities that you're offering to data engineers and data teams. So I appreciate all the time and energy that you folks are putting into that, and I hope you enjoy the rest of your day. Yeah. Thanks, Tobias.
[00:54:59] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview
Guest Introduction: Pete Hunt
Dagster+ Launch Overview
Dagster Open Source vs. Dagster+
Key Features of Dagster+
Data Collaboration and Team Sport
Business and Product Goals of Dagster+
Innovative Uses and Lessons Learned
Future Roadmap and Governance
Closing Remarks