Summary
The technology for scaling storage and processing of data has gone through a massive evolution over the past decade, leaving us with the ability to work with massive datasets at the cost of significant complexity. Nick Schrock created the Dagster framework to help tame that complexity and scale the organizational capacity for working with data. In this episode he shares the journey that he and his team at Elementl have taken to understand the state of the ecosystem and how they can provide a foundational layer for a holistic data platform.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform and blazing fast NVMe storage there’s nothing slowing you down. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Your host is Tobias Macey and today I’m interviewing Nick Schrock about the evolution of Dagster and its path forward
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Dagster is and the story behind it?
- How has the project and community changed/evolved since we last spoke 2 years ago?
- How has the experience of the past 2 years clarified the challenges and opportunities that exist in the data ecosystem?
- What do you see as the foundational vs transient complexities that are germane to the industry?
- One of the emerging ideas in Dagster is the "software defined data asset" as the central entity in the framework. How has that shifted the way that engineers approach pipeline design and composition?
- How did that conceptual shift inform the accompanying refactor of the core principles in the framework? (jobs, ops, graphs)
- One of the powerful elements of the Dagster framework is the investment in rich metadata as a foundational principle. What are the opportunities for integrating and extending that context throughout the rest of an organization's data platform?
- What do you see as the potential for efforts such as OpenLineage and OpenMetadata to allow for other components in the data platform to create and propagate that context more freely?
- What are some of the project architecture/repository structure/pipeline composition patterns that have begun to form in the community and your own internal work with Dagster?
- What are some of the anti-patterns that you have seen users fall into when working with Dagster?
- Along with your recent refactoring of the core API you have also started to roll out the Dagster Cloud offering. What was your process for determining the path to commercialization for the Dagster project and community?
- How are you managing governance and long-term viability of the open source elements of Dagster?
- What are your design principles for deciding the boundaries between OSS and commercial features?
- What do you see as the role of Dagster in the creation of a data platform architecture?
- What are the opportunities that it creates for data platform engineers?
- What is your perspective on the tradeoffs of pipelines as software vs. pipelines as "code" vs. low/no-code pipelines?
- What (if any) option do you see for language agnostic/multi-language pipeline definitions in Dagster?
- What do you see as the biggest threats to the future success of Dagster/Elementl?
- You were a relative outsider to the data ecosystem when you first started Dagster/Elementl. What have been the most interesting and surprising experiences as you have invested your time and energy in contributing to the community?
- What are the most interesting, innovative, or unexpected ways that you have seen Dagster used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dagster?
- When is Dagster the wrong choice?
- What do you have planned for the future of Dagster?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Elementl
- Video on software-defined assets
- Dagster
- GraphQL
- dbt
- Open Source Data Stack Conference
- Meltano
- Amundsen
- DataHub
- Hashicorp
- Vercel
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey. And today, I'm welcoming back Nick Schrock to talk about the evolution of Dagster and its path forward. So, Nick, can you start by introducing yourself? Yeah. Thanks, Tobias. My name is Nick Schrock. I'm the CEO and founder of Elementl, which is the company behind Dagster, and it's an honor to be here. And do you remember how you first got involved in the data ecosystem, for folks who haven't listened to your previous episode? So I worked at Facebook from 2009 to 2017,
[00:01:56] Unknown:
and I didn't really work in data management then. The thing I'm known for from that period of time is that I was one of the co-creators of GraphQL, which has gone on to be a broadly adopted open source technology. And when I left Facebook, I was figuring out what to do next, and I started talking to companies both inside and outside the valley about what their biggest technical and engineering liabilities were. And the notion of data infrastructure and ML infrastructure just kept on coming up universally. You know, I like to describe it as the biggest mismatch or gap between the criticality and complexity of a problem domain and the tools to support that domain that I'd ever encountered.
It was also a critical domain insofar as, you know, the job of data management is to curate data assets, which are the basis of all decision making in the enterprise these days, whether it be a human making strategic decisions or a machine learning model making automated decisions. So it's a really important problem, and, you know, there are practitioners in pain, which is very motivating to me.
[00:03:12] Unknown:
And I started digging in, you know, about 4 years ago now. So that's a bit of the story behind what motivated you to build Dagster. I'm wondering if you can talk through some of the core concepts and design elements that you baked into those early versions of it, aimed at some of the pain points that people were experiencing in the conversations that you had?
[00:03:37] Unknown:
So very quickly, orchestration stuck out as both a pain point and a massive opportunity, because most of the orchestration solutions that people had used were fairly narrowly conceived as purely operational tools. Meaning, like, this is how you order and schedule things in production. And I thought that was a very missed opportunity. You know, what I found was that these orchestration systems fell short in their developer life cycle. You know, the whole job here is to build graphs of computations that consume and produce data assets. And, you know, solutions like Airflow and the container based solutions really didn't have smooth local development experiences.
And then I also found that the DAGs or graphs they encoded were very metadata poor, and they were not data aware, typically. And I thought that was a huge missed opportunity, because the DAG can be a source of truth for a large number of things. You can encode what data is produced and that data's lineage. You can make the computations themselves much more self describing. You can structure this thing so that the computations can be rendered in tooling prior to computation, which is actually very important for making it a system of record for your data platform.
And so that was kind of the entry point and why I started Dagster.
[00:05:01] Unknown:
Given that sort of foundational concept and the idea of building much richer metadata into the computation graph that you're building up with Dagster. And we've talked a lot about some of the core elements and the architecture of the project in the previous episodes; we'll add a link in the show notes for people who wanna revisit that. But that was 2 years ago now that you were on the podcast last, and I'm sure a lot has changed in the tool and in the ecosystem. So I'm wondering if you can just talk through some of the ways that the project has changed, how the community has evolved, and some of the ways that the ecosystem around the project has shifted your goals and priorities.
[00:05:36] Unknown:
So it's a much more mature and robust system now, and it's been, you know, conceived a little differently, insofar as the original vision was to have Dagster be more of a pure software layer over other orchestration systems. And it became clear relatively quickly after that podcast (which, by the way, I can't believe was 2 years ago) that we needed more vertical integration to achieve our objectives and satisfy our users. Meaning that, effectively, we built, like, a full orchestrator underneath the hood. Because, you know, we got feedback that people were like, we love the programming model, but we wanna only adopt one thing, not 2 things. And that made sense.
So in terms of high level technical properties, that is the biggest shift, and it really changed the scope of the project, and, you know, it's been serving us very well. You know, the community has changed and grown dramatically since then. You know, we had, like, no usage and maybe 150 people on our Slack when you were kind enough to have us on air. Fast forward, now we have thousands of people on our Slack and really awesome design partners that power their data platforms and ML platforms with Dagster: you know, Good Eggs, Prezi, Loom, GoPuff, you know, UNICEF, which is a nice one, Drizly, Scale, Mapbox, a whole bunch of awesome users.
The Philippines government used us for the data platform for their COVID response, which was surreal. You know, it's been deployed into fintech and telecom companies in Southeast Asia. So, you know, we've made a ton of progress on that front. Like, you know, we've grown the team a bunch. We have, you know, 20 people on the team now and, you know, we're making a ton of progress. And then, I think we're gonna talk about this later, but we're also, you know, on the cusp of doing an open announcement of our cloud product. So there's been massive progress on a number of fronts in the last 2 years.
[00:07:34] Unknown:
As you have been exploring the data ecosystem, particularly given your sort of fresh eyes on the problem, I'm wondering what you have grown to see as some of the core foundational challenges and complexities in the existing ecosystem, and some of the things that have been sort of transient or short lived or potentially mutable in the problem space, as people try to deal with the complexities that arise due to the nature of data? So a lot of foundational things have not changed.
[00:08:04] Unknown:
You know, there are fundamental underlying properties of data processing applications. I use that term generically for ML training pipelines or ETL jobs or data pipelines, because they're fundamentally doing the same exact activity. There are fundamental properties of those computations which have not changed, meaning that they typically cut across the organization but are interconnected. So there are multiple tools in play. That has not changed. It's multi persona, and they span teams. That has not changed. And then these systems also have this unique property in that the people crafting their computations do not have control over their inputs.
And this is very distinct from web applications where the programmer actually has much more control over the user. So, you know, you put a form in front of them. If they, like, mistype something or misformat something, you throw up a red box and say, please change this. So you can constrain your inputs, and, therefore, you have much more control. That is not true in data processing applications where you just get data from an external source, and it is what it is. You can't control it. Often, you can't, like, have the upstream person change it. So you just have to be far more defensive in your computations.
And it's much more akin to a manufacturing process where, you know, instead of an assembly line, you have an assembly DAG, and the raw materials flow through the system. In the end, you want a meaningful output. You know, as in manufacturing, you want QA at every step. And then, you know, that property makes it so that you have a much more intricate relationship between your data assets and the code that produces your data assets than you do in other forms of programming. So you can think about, like, a particular data asset might represent the data that was produced on a certain date.
It was processed by code that was last updated on a certain date and then recomputed at a later date. So there are, like, 3 notions of time just in that sentence. And that is really hard to deal with. The complexity is almost, like, hidden and subtle, because a lot of times you look at the code and there's, like, 2 lines of Pandas code, and you're splitting a number. Why is this hard? And it's hard because of those underlying properties that I referred to, and that is not going away. I think the transient or incidental complexity is going away because it's becoming received and accepted wisdom that the way out of a lot of these problems is to apply software engineering techniques to the data domain.
And I think you're really seeing that. Like, a good example of this is the meteoric rise of dbt, right, which is really, like, taking the lessons of software engineering and getting analysts to adopt them. You know, they retitle themselves analytics engineers, and they're dramatically more productive and leveraged. And I think that lesson needs to be reapplied across multiple domains in data. You know, I'm thrilled by that development.
[00:11:17] Unknown:
There are a number of different directions to go from here, and I think that the first one I wanna talk about is this idea of applying software engineering principles to the domain of the data engineer, and some of the challenges that that brings, because data, unlike software engineering, is inherently cross functional, where it requires participation from every member of the organization: there are producers of the data, consumers of the data, you know, manipulators of the data, and they all need to be able to align and interact along the entire process. And so there are concepts such as no code or low code platforms, or, you know, the sometimes loved, sometimes hated UI oriented pipeline building tools, and then, you know, this movement towards x as code, where code essentially means a YAML file or a JSON definition, or the other trend that's pulling people in another direction of containerizing everything so that you can have multiple languages coexisting in the different stages of computation.
And given the fact that Dagster is unapologetically a software tool, you know, written and implemented and consumed in Python, I'm wondering what you see as some of the potential challenges and opportunities in the overall acceptance of this very unapologetically software oriented approach to managing data computation, and some of the opportunities for being able to bring in some of these other concepts of, you know, polyglot data applications, or data applications in a more sort of low code, no code approach.
[00:13:03] Unknown:
There's a lot to unpack there, but I guess we can start with the premise that, yeah, we like to say that Dagster is built by engineers for engineers. You know, our core adopters always self identify as engineers, and that's very deliberate. So, you know, frequently, the people who really latch on to Dagster and really get a ton of value out of it describe themselves as data platform engineers, which means they themselves are engineers, but they are serving tons of non engineering stakeholders. So, like, you know, I just brought up dbt. Right? Like, you know, we have platform engineers who work in lockstep with analytics engineering teams. Those analytics engineering teams are still coding in dbt, but we have a nice integration.
And then those dbt users can use us to understand how their computations interrelate to all of their stakeholders, and then also have kind of data observability so they know how their assets interrelate. And, you know, even though I'm a grumpy engineer, I think there is a place for low code and no code environments properly conceived and placed within a software engineering process. So, you know, I actually have kind of a picture in my head, which we don't have to get into, and I even have a domain for it, but I'm not gonna tell you what it is because I've had domain names stolen from me before: a no code, low code tool that plays nicely in a composable way with the rest of the data platform instead of being a completely siloed system that is completely foreign to the engineering process.
So I think there's a spot for that. But, really, what I see happening almost you know, there's a term du jour out there, the modern data stack. Right? And there's, like, debate about what it means, and is it a set of technologies or, like, an emotional state? And, you know, I really think it's a mindset of a way of approaching the problem. And, effectively, data infrastructure is being rebuilt from the ground up for the modern cloud era and the software engineeringification of data. So starting at the most primitive levels where there's an ingest tool, a cloud data warehouse, a transform layer, and a BI tool. But there's gonna be more and more tools kind of embraced, you know, under that mantra, both because in reality, people wanna do more things, and then there's also the incentive structure where every vendor is gonna label themselves the modern data stack x. And you're already seeing that. Right?
But for low code, no code stuff, I think there's definitely gonna be a space that can deliver a ton of value: solutions that align with the values of the emerging modern data stack, that can bring in a business user, but in a way that comports with the software engineering process laid out by the data platform team. And then along the lines of the sort of,
[00:15:56] Unknown:
you know, data pipeline as code, where I just say, here's my YAML file, I want it to do the thing, and I just throw that at Dagster, and Dagster does what it's supposed to do. I'm wondering if you see a potential future for that, or maybe you have a set of kind of prebuilt off the shelf operations that somebody can just reference in a YAML file to say, I want this thing to happen, and I want it to tie into this other thing, and I don't wanna have to actually write all of the code that does those different pieces, because Dagster comes with all of those out of the box for me. Totally. So I think there's 2 components to what you're talking about. One is, like, a YAML file or an equivalent DSL.
[00:16:30] Unknown:
We have users who do that now. And we didn't wanna prescribe a single YAML spec to rule them all, because we find that the needs are often context specific. But what we do tell people is, like, you know, we have an example, which I wrote, I think, actually, of, like, oh, here's how you would take an ingest YAML file and produce a pipeline out of that. And it's, like, you know, relatively simple, and people have taken that and repurposed it. We actually had a user who took that, wired it all together, and built a full WYSIWYG drag and drop system on top of Dagster. So, you know, it's a layered system that you can build on top of. I think the other component that's kind of implied in your question is prebuilt integrations and prebuilt compute, such that you don't have to write code in order to integrate with a tool in a straightforward way.
And, you know, our ecosystem of integrations is growing and continues to grow every day, both kind of in our monorepo as well as, you know, out in the wild. You know, we recently gave a talk at the Open Source Data Stack Conference, and we were like, we don't have a Meltano integration. And then someone just googled it, and lo and behold, someone had written one out in the wild. And that's the magic of open source in these ecosystems. So I think there's kind of 2 components there. One, just to summarize, you know, we've written a layered system with well structured APIs where you can overlay your own DSLs on top of it. And then, you know, by leveraging the power of open source and having clear pluggability points, the surface area of our integrations has expanded dramatically.
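To make that layered-DSL idea concrete, here is a minimal sketch, not the actual example from the Dagster repo, of turning a YAML ingest spec into a job. The spec format, op names, and op bodies are all invented for illustration; the point is that Dagster's programmatic API lets you generate ops and jobs from whatever DSL you choose.

```python
import yaml
from dagster import job, op

# A hypothetical ingest spec; in practice this would live in its own file.
SPEC = """
steps:
  - name: extract_orders
    table: orders
  - name: extract_users
    table: users
"""

def make_ingest_op(step):
    # Each step in the YAML spec becomes a Dagster op; the body is a stub.
    @op(name=step["name"])
    def _ingest():
        print(f"ingesting table {step['table']}")
    return _ingest

def job_from_yaml(text):
    steps = yaml.safe_load(text)["steps"]
    ingest_ops = [make_ingest_op(step) for step in steps]

    @job(name="ingest_job")
    def _job():
        # Independent ingest steps; dependencies could also be wired here.
        for ingest in ingest_ops:
            ingest()

    return _job

ingest_job = job_from_yaml(SPEC)
```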
[00:18:03] Unknown:
Another aspect of what you're saying there, of not being prescriptive about how people are building their computations and how people are defining the sort of DSL that sits on top of Dagster to be able to be this more high level data platform experience. You know, I started using Dagster very early in the journey, and so there was definitely still a lot of experimentation with how you structure the repository and how you actually write out the different operations. And I'm wondering if there has been any sort of emerging best practice for how to actually architect the project, how to compose the different elements together, and be able to build units of computation that can be built on top of to create a sort of large and complex system without getting stuck in the trap of, you know, dependency hell or, you know, mountains of technical debt. Yeah. There's a lot to unpack there. Just in terms of good patterns and whatnot,
[00:18:59] Unknown:
you know, I've been very pleased to see people dig into our resource system a lot. And without digging into too much detail, this is kind of the seam that allows Dagster computations to be far more testable than any of its peer systems. You know, people who invest a little bit in their resources up front and set up, like, a development environment and a production environment see, like, massive productivity wins for a relatively limited upfront investment; you just have to kind of think about it and get it right (there's a sketch of this pattern at the end of this answer). And then, you know, another thing that's been a really successful pattern for folks, which ties into your dependency issue, is instead of kind of making everything in one big pipeline and having 3 different teams, you know, collaborate on that, having, like, what we call the mega DAG or the monolithic DAG. It's not particularly scalable. It's operationally really difficult.
We've really pushed our users and encouraged our users to leverage our asset awareness to connect different teams' computations through event based, what we call asset sensors. So, you know, our system, you can set it up so that it tracks, like, okay. Like, a previous job says that I just updated this database table that's tracked in our asset catalog, and you can set up a downstream job to just listen on that,
[00:20:21] Unknown:
be like, hey, whenever this is updated, like, kick off a new 1. And that allows those 2 teams to operate in dramatically
[00:20:24] Unknown:
more decoupled fashion. They live in what we call separate repositories, meaning they can easily use different container images. They can deploy at their own independent rate, and they don't have to know anything about the structure of each other's pipelines. Like, I hate to break it to data engineers, but no one cares about the structure of your pipeline. No one cares. I know you're very proud of it. All they care about is the asset. So, you know, really doubling down on the asset as the interface between teams has been really successful.
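A hedged sketch of that pattern using Dagster's asset sensor API; the asset key and the downstream job here are invented for illustration. The only thing the two teams share is the asset key, not any pipeline internals.

```python
from dagster import AssetKey, RunRequest, asset_sensor, job, op

@op
def rebuild_report():
    # Placeholder for the downstream team's computation.
    ...

@job
def report_job():
    rebuild_report()

# Fires whenever an upstream job records a materialization of the
# "orders_table" asset in the catalog, kicking off report_job.
@asset_sensor(asset_key=AssetKey("orders_table"), job=report_job)
def orders_updated_sensor(context, asset_event):
    yield RunRequest(run_key=context.cursor)
```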
You know, and that very much lines up with, not to get into the data mesh because, like, I don't wanna start a Twitter fight, but, like, I think one of the great ideas of the data mesh is that the asset is promoted as more important than the pipeline, and that's the API between teams. And that kind of structure reflects that. In terms of, you know, antipatterns and things people shouldn't do: you know, we built this very sophisticated configuration system, and we had this type ahead and all this stuff. And I think, actually, we encouraged users to make things overly configurable, which caused people to generalize things too early and do things in the config system that they should have just done in plain old code. And I think we're still kinda clawing that back a bit, actually.
Yeah. I think also, sometimes, because the API is so lightweight, it's just a Python function, people are like, oh, I can decompose my pipeline into a billion little things. And, well, that actually can cause performance overhead and understandability problems. So, you know, there are patterns like that that we're still sussing out.
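And here is a quick sketch of the resource seam mentioned at the top of this answer, with invented names: the op programs against an abstract warehouse resource, and each environment binds its own implementation, which is what makes the computation testable without touching production systems.

```python
from dagster import job, op, resource

class InMemoryWarehouse:
    """Stand-in warehouse used for local development and tests."""
    def __init__(self):
        self.tables = {}

    def insert(self, table, rows):
        self.tables.setdefault(table, []).extend(rows)

@resource
def mock_warehouse(_init_context):
    return InMemoryWarehouse()

@op(required_resource_keys={"warehouse"})
def load_orders(context):
    # The op only knows the resource interface, so swapping dev/prod
    # implementations requires no changes to business logic.
    context.resources.warehouse.insert("orders", [{"id": 1}])

@job(resource_defs={"warehouse": mock_warehouse})
def dev_job():
    load_orders()

# A prod job would bind the same op to a real warehouse client instead.
```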
[00:22:02] Unknown:
You mentioned the asset as kind of the boundary condition between pipelines, and the API that is sort of foundational to how people are using and consuming data, and that that's actually the most concrete element of the work that we're doing. And I know that one of the concepts that you and the Elementl team and the Dagster product are starting to orient around, and that has become sort of an emerging concept within the framework, is the idea of the software defined data asset. And I'm wondering if you can talk through some of the ways that the evolution of that as the kind of orienting concept has changed the ways that engineers approach pipeline design, and some of the ways that it is influencing the direction of the API contract that you have within Dagster as to how you actually construct these different units of compute and then compose them into these larger DAGs and pipelines?
[00:23:00] Unknown:
Yeah. Well, I mean, first of all, thank you, Tobias, for paying such close attention to the product, since we have not aggressively marketed this experimental capability and we've never really talked about it in any sort of public forum. There's, I think, a single GitHub discussion about it, but it's a direction that I am very excited about. And one interesting thing you said before was about, you know, x as code, that trend. I wish all of it was called software defined. So, for example, like infrastructure as code, generally, people associate that with Terraform, right, or some sort of declarative DSL where they, like, made up their own language.
I much prefer the term software defined because that is a little more expansive, where you can imagine a system like Terraform that was built on top of Python, where you could, like, use functions and stuff, and then you end up using Python to construct an in-memory software artifact, which is then kind of consumed by a system and, you know, reconciled within the same deployment scope. So we really put software defined assets in that same lineage as software defined networking, software defined storage, and software defined infrastructure: a way to conceive of your data assets and manage them.
You know, we're still figuring it out; it's very early days. But, you know, I think you can provide a really elegant, intuitive programming model this way where you no longer have to manually construct your DAGs, which is a huge short term win, and it becomes a very natural way to express the dependencies between teams. There's no way you can write the code without defining your asset lineage, which is awesome. You know, I like to call this our property of, like, the developers fall into the pit of success. Like, by default, they're doing, you know, the so called right thing. And then I think we can really shift the mental model of orchestration.
Yeah. Internally, we like to say, like, orchestration becomes reconciliation, where you think of it in a much more declarative way.
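A minimal sketch of that programming model, using the then-experimental asset decorator (the import path and asset names here are illustrative and may differ by Dagster version). The dependency is declared simply by naming the upstream asset as a function parameter, so the lineage graph falls out of the code itself.

```python
from dagster import asset

@asset
def raw_orders():
    # Pretend this pulls rows from an external source.
    return [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]

@asset
def order_totals(raw_orders):
    # Taking raw_orders as a parameter *is* the lineage declaration: you
    # cannot write this asset without stating what it derives from.
    return sum(row["amount"] for row in raw_orders)
```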
[00:25:08] Unknown:
And so, yeah, I'm very excited about the direction. And I know that in order to be able to support that more natural model of assets being the core element and the core conceptual aspect of the goals of the framework, you have also changed some of the API models, where originally the sort of unit of computation was called a solid, which has its own sort of interesting backstory; we don't necessarily need to get into it. And then there were sort of different levels of granularity for being able to compose those units of computation into the larger graph, where there was the pipeline, which was sort of its own special entity, but then there were also composite solids, which were a sort of subdag of a unit of computation, and you've been able to kind of break free from that and unify those into the idea of the graph. And I'm wondering if you can just talk through some of the ways that the idea of the software defined data asset and these more
[00:26:03] Unknown:
streamlined API concepts have kind of played against each other as you explored that space more fully. I mentioned the software defined asset thing, which is a very experimental capability that we're just starting on. What you just mentioned is kind of our latest release, 0.13.0, which renamed a bunch of stuff. You know, it turns out the name Solid was incredibly stupid. Like, I apologize to my team and the entire community. That's my fault and, you know, I kind of had to, like, eat my sin there. But more important than the rename, and I think, Tobias, as a user, you can speak to this, is that the system had a lot of power before this change, but you had to sift through some duplicative abstractions, and there was some goofy naming and all that stuff. So, you know, we kinda took a step back and we thought from first principles: okay, for the users who have kind of reached through that muck and found the value, what's the value they got, and what's the most elegant, most pure way to capture that value and express it?
And then as a result of that, you know, I think the team did extraordinary work here, and effectively, I think the system has gotten more powerful while boiling down the concepts. There used to be, like, 6 or 7 concepts you needed to learn, and we boiled that down to, like, 3 or 4. You know, when you do that in one of these systems, the core concepts interrelate to each other. So for every additional concept you add in a core system, there's often this combinatorial explosion of, like, oh, how does this thing interact with this? So I think this latest release is a massive step forward.
And even though we're changing a bunch of names and we're making people change code and all that stuff, the feedback is off the charts positive, which never happens. So I'm thrilled with the results. And in a way, right, by boiling down the system to a more stable, coherent core, it has now given us the space to move forward and build more capabilities on top of that, and it will feel better and, you know, allow for what we call the progressive disclosure of complexity. And I think, you know, it's just a much more stable foundation to continue to build the stack of capabilities that we have. So I'm incredibly excited about what just happened there.
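For readers who haven't seen the post-0.13.0 API, a minimal sketch of the boiled-down trio: ops compose into a graph, and a graph becomes a runnable job. The op names here are invented for illustration.

```python
from dagster import graph, op

@op
def extract():
    return [1, 2, 3]

@op
def transform(rows):
    return [r * 2 for r in rows]

@graph
def etl():
    # Dependencies are expressed by passing one op's output to the next.
    transform(extract())

# A graph plus resources/config becomes a job, the unit of scheduling
# and execution.
etl_job = etl.to_job()
```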
[00:28:24] Unknown:
Going back to one of the things you were saying earlier about the richness of metadata that can and should be embedded into the different stages of computation, and the interrelations between the different computations that are happening. That's definitely one of the more powerful elements that I appreciate: it does have this very expressive metadata graph and a lot of opportunities for being able to expose and propagate metadata in those units of computation. And I'm wondering if you can just talk to some of the opportunities for being able to take advantage of that metadata, both within the bounds of Dagster and some of the ways that it can be leveraged outside of the Dagster context and exposed into some sort of more universal metadata catalogs and data discovery tooling, and just some of the contracts and interfaces that you're thinking about for that? Totally.
[00:29:21] Unknown:
So, yeah, I'm a huge believer in allowing the engineer or the developer to directly encode metadata in their code, whether it's annotating the code itself, having structured metadata in their logs so they can communicate interesting facts during the computation, and also attaching metadata to the results of the computation. And by having metadata in all those different forms, you can provide enormous amounts of context about anything in the system. Right? You can look up an asset.
You can see information about that. You can then click on it, go back to the run that produced that asset, and get a ton of context about everything that happened in that run. Then you can go to the job that was the basis of that run, and you can see how the code was annotated, who owns it, and all that stuff. By being able to fluidly navigate through the system, any stakeholder can get tons of context about what's actually going on. And it is so much more powerful than Ctrl-F-ing through a log. If you fully buy into it, it is incredibly powerful. So I'm a huge believer in that, you know, and especially the format where the code itself is annotated with descriptions. So one of the properties of DAGs is you can load up these computational graphs prior to computation without any infrastructure requirements. You can just view the graph and, like, get all this descriptive information. It becomes a very useful system of record for your data platform.
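A hedged sketch of that kind of annotation; the asset key and metadata entries are invented, and exact parameter names vary across Dagster versions. It shows the three forms mentioned above: a description on the op itself, a structured event logged during the computation, and metadata attached to the result.

```python
from dagster import AssetMaterialization, Output, op

@op(description="Builds the daily revenue summary table.")
def build_summary():
    rows = [{"amount": 10}, {"amount": 25}]
    # Structured event recorded mid-computation, visible in tooling
    # rather than buried in a log file.
    yield AssetMaterialization(
        asset_key="revenue_summary",
        metadata={"row_count": len(rows)},
    )
    # Metadata attached to the result itself.
    yield Output(rows, metadata={"row_count": len(rows)})
```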
In terms of extending the context throughout the rest of the system, one, everything that's exposed in our UI is backed by an API, or, spoiler, a GraphQL API. So, you know, people can integrate that into any of their tools, and we encourage that. Right? We want Dagster to be the interconnective tissue, not the one tool to rule them all. And then, you know, where I'm really excited to kind of integrate with other data catalog and metadata systems is to really dig in to this structured event stream that we produce, to allow an arbitrary consumer to ingest that and then build up their own interesting indexes and integrate it into their system. We have, like, a built in, what I'll call, operational asset catalog that tracks assets that are produced by the Dagster system itself, and we can, like, enable interesting operational use cases. We're not interested in, like, the universe of data cataloging capabilities, you know, like crawlers and, you know, really complex ontologies that are expressed in tools like Amundsen and DataHub. It's not our business.
We want them to be able to ingest our stuff, and I think the right way to do that, you know, will be structured event ingest. But, you know, all of these tools that I mentioned are actually quite early days. We haven't seen, like, a massive market demand for those integrations yet. I'm excited for the day when that occurs. You know? But before you're integrating a data catalog tool with an orchestrator and all this other stuff, you need to, like, adopt it in the first place and kind of build up to those capabilities. So, I think I stole this from Tomasz from Redpoint, but I think he's right: the next decade in software engineering is a decade of data, and it's still super early days. So, you know, I'm speaking a lot in terms of aspirations, but this is kind of our general approach. And so another interesting element of this is the case where you maybe have multiple distinct
[00:32:47] Unknown:
deployments of Dagster, and each of those deployments has their own set of computational graphs that they're concerned with, their own set of assets that they're trying to create, but they are still within the bounds of a given organization. And I'm wondering what you see as the path towards being able to gain visibility across those maybe multiple deployments or distinct graphs to be able to see sort of the full lineage, where maybe you have a data asset that is the output of one pipeline, and then the other installation has a sensor that is keyed off of that asset to be able to trigger another downstream pipeline, and then being able to kind of stitch that all together in one sort of comprehensive view to understand, as the end consumer of that second pipeline, where the overall graph started.
[00:33:35] Unknown:
Totally. And that's really where our cloud product comes into play. You know, I don't know exactly what date this episode is gonna be published, but I believe, and I will request, that it'll be after the announcement of the early access to our cloud product. And we consider those kinda enterprise capabilities. You could think of it like Dagster federation, almost, where you're federating all the different teams in a really well structured way. And, you know, I'm really excited to explore all those capabilities. You know, we're still early days on this, but, you know, we're building in capabilities where you can have cross-deployment asset sensors out of the box, you know, be able to provision those deployments very quickly with one click, and then have enterprise governance around those deployments. There's RBAC and, you know, various kinds of enterprise features around having a really full fledged enterprise data platform built on top of Dagster.
So, you know, operating the application logic to do all that is actually, like, very complicated. And, you know, it makes sense for us to centrally manage that stuff. And we can get into kind of the open source commercialization boundary as we go on. But, you know, those sorts of enterprise organizational use cases are definitely a focus of our upcoming or recently launched Dagster Cloud product.
[00:34:57] Unknown:
Struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world's first end to end, fully automated data observability platform. In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo swag box.
Digging more into the cloud aspect, I know that you've been working on Dagster for a long time, and you delayed the commercialization aspect of it for a while until you were very certain of where it fit within the ecosystem and that you were solving the right problems. And I'm wondering if you can talk through some of the journey from, I have this idea of a product that I want to build, here's the open source implementation of it, I'm going to use this to explore the problem domain, to now, I see a clear path towards a commercial option to be able to actually monetize this without cannibalizing the community that I've built up along the way. So that journey,
[00:36:29] Unknown:
I always start with the developer experience. It's kind of, you know, maybe it's because I'm a developer and I'm selfish, and that's who I empathize with. But, you know, that's what I think of. I think that developers wanna be able to solve their productivity problems and code without interaction with a commercial product, necessarily. And, also, the technology has what I'll call, like, an intimate relationship with the user's code. You know, there's a framework and it calls into the user's code in the same process. There's very practical reasons why you want that to be open source. Like, the code's in the same stack trace, and you want people to be able to build integrations and have a community. So even just beyond, like, the general notion that, like, I like open source communities and I think they're incredibly powerful. And the GraphQL experience, which I was directly involved with, and the React experience, which I witnessed close up but wasn't as involved with, were inspiring and made lots of people's careers and really democratized technology, and I wanted that to happen.
But I think you need to figure out something that people want first, and I wanted to do that in the open source domain. I think if I went back in time, I might have kicked off the commercialization earlier, not because we needed to make money. Luckily, we had very patient investors who kind of get open source, and were very on board for the, like, hey, let's invest in the core technology, and there's no rush to a commercialization that, and I'm trying to come up with a metaphor that isn't macabre, would prematurely kind of hamstring the technology.
But, you know, in the modern world, what's interesting is that there's an increasing comfort with hosted services, where it's not like the old days where there was an open source technology and you're, like, pulling teeth to get people to use your commercial service. Now it's the opposite, where lots of people view the open source project having a SaaS offering as an adoption requirement, which completely flips the script, which has been a fascinating development. So, yeah, I think, like, you know, that's been going on. I think we knew we wanted to do the big API change and, like, simplify the system, but we knew we could also start building the commercial offering in parallel. So we kinda kicked off doing the commercial product early this year and did that in parallel. Now that we have the API we wanna stand behind, the early alpha users are getting a lot of value out of the commercial service, and we've matured that infrastructure, it just makes sense to kind of couple them now and move forward. And we really think it can kind of, like, kick start adoption and even accelerate that further and just provide a ton of value for our community who wants this. Like, people don't wanna migrate a database.
It's like the worst thing in the world. Well, not the worst thing in the world. On the scale of software engineering, it's a very annoying thing to deal with. And it just makes sense for a company to centralize and manage those issues on users' behalf. You know, the company can do it way more efficiently. And so there's a very natural, mutually beneficial transaction there for users large and small.
[00:39:31] Unknown:
In terms of the value add that you're baking into the cloud platform, I'm wondering if you can just talk to some of the ways that it's architected to be able to add some of those additional capabilities and simplify the operations of the system while still being able to sort of leave control in the end user's hand and maintain some of the sensitive aspects of data because of the fact that it is liable to have PII or
[00:39:57] Unknown:
regulated information or, you know, security constraints imposed upon it? Yeah. So, you know, we've chosen an architecture that I think is increasingly becoming an industry standard, more or less. You know, there's a few names for it. Some people call it a managed control plane and a user data plane. Some people call it hybrid SaaS. You know, it's like what our CI system, Buildkite, does. There's an agent that lives in our VPC. It phones home to ask, like, oh, should I kick off a job? And then the compute actually happens in our VPC. We pay the bill to Amazon directly, so there's not a tax on the compute. And that works really well. Databricks is similar with their so called E2 architecture.
You know, Snowflake even does this to some degree. They deploy compute into users' VPCs. So we have a similar model where our goal is to host as many of the stateful, complicated services as possible, so the metadata database, the web server, long running processes that deal with scheduling, and so on and so forth, but then still have the user have control of their code and their data, such that it can operate on any infrastructure, from your laptop all the way to a K8s cluster. And then there's just an agent that phones home and kicks off compute when there's something to do.
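An entirely hypothetical sketch of that agent model (the endpoint and payloads are invented, not Dagster Cloud's actual protocol): the agent polls the managed control plane, runs compute inside the user's VPC, and only structured metadata travels back up.

```python
import time
import requests

CONTROL_PLANE = "https://control-plane.example.com/api"  # hypothetical endpoint

def execute_locally(run):
    # Compute happens here, inside the user's VPC, next to code and data.
    return {"run_id": run["id"], "status": "SUCCESS"}

def poll_once():
    # Ask the control plane whether there is anything to do.
    pending = requests.get(f"{CONTROL_PLANE}/pending-runs").json()
    for run in pending:
        result = execute_locally(run)
        # Only structured metadata is streamed back up.
        requests.post(f"{CONTROL_PLANE}/events", json=result)

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(30)
```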
And then we stream up structured metadata through that same channel, or a similar channel, and that powers all of our kind of operational tooling, our web UI, our asset catalog. And we think this provides, you know, a great balance for people. You know, what's nice about what we call the user cloud is that it's stateless. Like, all it does is kick off compute; it runs for a while, and it spins down. So it's very amenable to fully elastic compute and, you know, cloud native infrastructure. It's great for cost. You can run on spot instances, all that good stuff. So minimal operational overhead. But that's just the start of it. You know, I think that we're really interested in exploring the various modalities of this. And, you know, we'll respond to user feedback too. Like, we'd be excited to either have more managed solutions or partner with people who can provide out of the box managed solutions. But, you know, we've talked to dozens of early users on our wait list, and they seem overall very happy with that trade off of having to do a little ops on their side, but they get to own their compute and their data. And then we kind of manage all the complicated, like, stateful services that they don't have to manage, and then we can also provide value add proprietary services on top of that. As
[00:42:26] Unknown:
you progress along this journey of commercializing and delineating the boundaries between the open source project and the capabilities there and what you're building into this managed service. I'm wondering what are the design principles that you're using to make those decisions as new capabilities come up as a possibility to add to 1 or the other and some of the ways that you are kind of managing the governance and long term viability of the open source project?
[00:42:53] Unknown:
Yeah. I love this question, Tobias, and I think it's a really important one to ask open source founders, because, you know, if someone doesn't have a principled framework around doing this, they will be completely thrashed all the time by, you know, oh, we have to open source this because our competitor did this. Or, like, oh no, there's a new PM who has a different take on this, and they wanna be grabby and decide to deny capabilities to your community to coerce them to use the commercial service, and that's no fun. Right? That doesn't provide predictability. I think that open source communities are very worldly and understand that we need to build a sustainable business.
They understand that. They just want to be able to predict our behavior and have there be a fair business model that works for everyone. So in that vein, you know, I think that we have our own framework, kind of starting with the HashiCorp model as a base foundation. They chop up the world between what they call technical complexity and organizational complexity, where technical complexity is broadly in the open source domain and organizational complexity is in the commercial domain. And I thought that was a great place to start, but I thought, you know, given our context, we could even go a little further than that. So instead of 2 complexities, we kinda subdivide the world into 3. One is application complexity, meaning, like, how users structure their code and, very concretely, how they consume and interact with the open source Dagster framework. Right? There are kinda some rules you can depend on. Like, if the code is in process with your code, it's gonna be open source. Right? If you're pip installing Python, that's gonna be open source.
The basic core framework level will always be open source. Then there's what I call the enterprise complexity, where you're dealing with, like, SOC 2 compliance and governance and all of these capabilities which are very expensive to develop and very complicated, and you wanna be able to change things quickly to fix bugs and develop new product capabilities and all that. The people who use those capabilities want a commercial contract. They don't trust a pure open source solution. So literally, you're enabling usage by providing one. You know, a commercial contract is a thing that business people understand. It's like, oh, I'm paying you money and you have this obligation. And that makes sense. So there's application complexity, which heavily biases, maybe even 100% biases, towards open source. And then there's the enterprise complexity, I'll call it, for which people want a commercial relationship.
I think the interesting one that's fraught, that people struggle with, is this middle category I call operational complexity. So, operationalizing these computations. Because, you know, there's this balance of: you want open source users to be able to use this in very real scenarios, because our goal is to make Dagster a durable open standard, and you don't wanna coerce people to use a commercial product in order to use and push forward the standard. We don't want that. But there's this opposite thing where if you put 100% of your operational capabilities in the open source domain, one, you are subject to disruption by the cloud providers. That defensibility is real.
Second of all, and I think this is underappreciated, maybe actually kind of the determinative factor here, is that it is much slower to develop things in the open source domain. So, you know, if you open source an entire complicated infrastructure, which has multiple interacting back end services and whatnot, there's actually very few organizations on the planet who can run that thing effectively. And then, like, debugging all those issues at all those installations across all those open source users is a huge tax on the team. Right? And we don't have the engineering throughput to deal with that. So there's, like, this very practical issue that some operational capabilities make a ton of sense to centralize, because you wanna be able to move fast.
You wanna have more engineering throughput. And then if the only organizations that can run it anyway are yourself, maybe the big mega tech companies, and the cloud providers, that doesn't help anyone. So I really categorize it between, like, operational capabilities that have massive economies of scale to centralize. Like, for example, if we can never have one of our open source users run a database migration ever again, that would be awesome. It is so much easier for us to do that on the user's behalf. That's just, like, a super concrete example of the centralized economies of scale I'm talking about. But, you know, we are also going to have an open source Kubernetes implementation of the system, and we would never, like, deny a bug fix to the open source community because we wanna, like, get them to use the non-buggy software.
First of all, it's ridiculous. And second of all, it goes against our values, and there's no economy of scale benefit from fixing that bug. So, just to summarize again: there's application complexity, that's open source; enterprise complexity, that's proprietary and centralized; and then you can kinda chop operational complexity in two. If it's complicated enough that not that many people can run it anyway, and it benefits from centralization, that's proprietary. And if it's, you know, just like building a Kubernetes executor that has very strict and very well defined properties, but that is pluggable so you can, like, extend it and whatnot, that is in the open source domain.
It's complicated, you know, there's a lot there, but that's kind of the general framework of how we're approaching things, and we'll be writing more about this as time goes on. Yeah. And that all makes good sense. And
[00:48:36] Unknown:
specifically, in terms of the operational aspects of it, if you do have a sort of prescribed way of deploying everything end to end and this is just how Dagster works, then like you said, it's never gonna fit everybody's use case. And it's going to be a point of friction for adoption, because if somebody says, oh, well, the only way I can run this is if I run this one script that happens to deploy 6 things into Amazon and 5 things into Google, well, I don't use Google, and I don't wanna put anything in Google. Why would I ever do this? And so having those kind of defined interfaces of, you know, this is how it runs, these are the layers that you can add your own capabilities to, to run it in the way that you want it to. But if you don't wanna deal with that extra engineering overhead on your end, then, you know, pay us whatever it is per month, and we'll deal with it for you. Totally. And I think the other thing about this is that there's this implicit contract with the community as well, that, you know, where we wanna get to is a fair
[00:49:30] Unknown:
usage based pricing model where both the small players and the large players can participate in a way that feels win-win for everyone. Yeah. And I'm not saying we're gonna get there tomorrow. There's a lot of work to do there, but that is the eventual goal. And I don't think people give open source communities enough credit. Everyone's grown up, and they know the deal, you know. Our goal, unapologetically, is to build an awesome open technology and standard, and to build an awesome business on top of that. And we think that is beneficial for every stakeholder in the ecosystem.
[00:50:01] Unknown:
So as you are iterating on the core product of Dagster and the commercialization efforts, and you have this ambition of being an open standard for data computation, I'm wondering what you see as some of the potential threats to the future success of the Dagster product and the Elementl business.
[00:50:21] Unknown:
Yeah. I think the threats are still at a basic level, which is, will enough people get enough value out of us? Like, I'm an optimist founder, so, of course, I believe that, but we still need to grow the technology and make sure it applies to a broad enough set of people to make it a project worth investing in, a business worth building. So there's always that existential threat, which is, you know, is the thing you're building what people want? And we're definitely on the path there, but that needs to be proven out at a greater scale. So that's what I think about mostly. I guess the other existential threats are if some of the core premises of the business or the project are not true.
Right? If this whole notion of the software engineeringification of data is not the massive wave that we think it is, and instead it's a niche thing, and we can all move to no-code solutions where there's no engineer in the critical path at all, and everything's gonna be solved by managed services, and the engineers can go home and, you know, retire or cry depending on their position in life, I guess. You know, I don't think that's true. I think the path forward is not to eliminate engineers, but to make them more productive. And I think there's this glorious flywheel as a result of that. So, anyway, I'm trying to refute the existential threat even as I raise it. But if some of the core underlying assumptions of the project are not correct, then that's, you know, potentially a threat to the business. But other than that, I'm not too concerned. I'm very confident about the future of the project and the company. I was struggling for those answers, actually, you know, but I had to say something.
[00:52:07] Unknown:
And then going back to the beginning of the conversation, as we mentioned, to some people's eyes you're a relative outsider to the land of data engineering and data management, and you've been here for four years now, give or take. As somebody who is a relatively new entrant to the ecosystem, I'm wondering what have been some of the most interesting and surprising experiences and lessons that you've learned about the overall space as you have been investing your time and energy in contributing to and growing this community that sits so deeply in the core concerns of the data ecosystem?
[00:52:48] Unknown:
I mean, I don't think it's surprising, but I was gratified. I think that people in data are very excited. You know, data engineers have a reputation of being grumpy old people who like their tools and are like, get off my lawn, but I don't really think that's true. Throughout large swaths of the data community, there's a lot of open-mindedness. I think there's a lot of acknowledgment that there's a ton of work to do. There are a lot of different projects and teams experimenting in different directions, and even with people who are putative competitors, you know, it's a very collaborative relationship in the vast majority of cases.
So I think that has been good. One thing that I underestimated, and maybe this is because of my background at Facebook where there are centralized infrastructure teams taking care of all sorts of other issues, is just how big of a deal simplifying the DevOps story for all of this stuff is. You know, because I originally set out to build a software abstraction. I'm like, oh, the infrastructure is someone else's problem. Right? But we've had to vertically integrate more, and the deployment and DevOps aspects of this stuff, and sanding off all the edges and improving docs and all of that, is just so critical for onboarding and adoption. And I wish I had realized that earlier.
Definitely, I think we would have front-loaded the cloud product more and focused on that earlier. But, you know, I'm also really excited about the capabilities. Yeah. And I think coming at things from a fresh perspective is good, but you need to partner with people in the ecosystem. But, you know, I think some of the capabilities of our cloud product are really exciting. We're heavy Vercel users, which is kind of like a hipster JavaScript hosted environment that has this amazing DevOps story, and we're really inspired by that. And our goal is to create a Vercel-level deployment and user experience, but in the data domain. And I think that's, like, a completely alien concept there, you know. I'm obviously hyping things up, but I'm very excited to see what the reception in the marketplace will be for that.
So, yeah, I think those are the two things. I think it's been cool. Like, I think it's a crowded ecosystem, but the vast majority of people are very open minded and collaborative. And all this DevOps and infrastructure stuff, you know, people are just struggling. There's no centralized infrastructure team coming to save them. You know? And I don't think I had really wrapped my head around that, given my background.
[00:55:20] Unknown:
As you have been building out the Dagster team and the product and the community around it, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:55:32] Unknown:
You know, we bias towards pluggability and flexibility, sometimes at the expense of things not being out-of-the-box enough, but the trade-off of that is that people come up with all sorts of stuff. One of my favorites, and he spoke at a community meeting, it was either a company contracted by IKEA or IKEA itself, where he repurposed Dagster to be in the critical path of the application to render 3D models of furniture. They were able to repurpose it because they needed orchestrated DAG compute, and they liked the pluggability points and all that, but they were using it for a completely different use case than anticipated.
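The IKEA story is described only at a high level, but the underlying idea, expressing a graph of compute as dependent ops, looks roughly like this in Dagster's Python API. All of the op and job names here are hypothetical illustrations:

```python
from dagster import job, op


@op
def load_geometry():
    # Hypothetical step: fetch the raw 3D mesh for a piece of furniture
    return {"mesh": "sofa.obj"}


@op
def apply_textures(geometry):
    # Hypothetical step: attach material and texture information to the mesh
    return {**geometry, "textures": ["fabric", "oak"]}


@op
def render_model(textured):
    # Hypothetical step: produce the final rendered image
    print(f"rendering {textured['mesh']} with {textured['textures']}")


@job
def furniture_render_job():
    # The DAG is inferred from the function composition: load, texture, render
    render_model(apply_textures(load_geometry()))
```

The point of the anecdote is that nothing in this structure is specific to analytics pipelines: any workload that needs ordered, retryable DAG compute can reuse the same machinery.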
The other thing is that I think we're one of the only orchestration systems that really works on Windows. That's been really cool because it's been kind of an entry point to a lot of single players, so to speak, who work in more locked-down IT environments where they have to run on Windows. And I'm really passionate about that, because one of the things I was gratified by in the GraphQL experience was how it penetrated legacy enterprises earlier in its life cycle than I expected. It was like, okay, the kids in SoMa are using it, but then really early on, Walmart and KLM were using it. Right? And because of this Windows capability we have, we have folks at Honda using Dagster. Right? And I really like that dynamic. And then the other thing, I mentioned the person who built the drag-and-drop GUI interface. I thought that was fun.
One story I really like is Good Eggs. They run their entire data platform on Dagster, and they actually trained their nontechnical staff who work physically on the warehouse floor to use Dagster's operational tooling, because they have to ingest data from their contractors, and sometimes that data is malformed and they have to retry it and whatnot. The platform team did a bunch of work up front, but they were able to make it self-serve for people who are literally on the warehouse floor. And it was really gratifying to see that our investments in consumer-grade UI tooling paid off, where a non-engineer, not even an analyst, just, you know, a business user
[00:57:48] Unknown:
can come in and, with just a little training, intuitively use the system, and that was great. In your experience of being a founder in this space and building the product, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:58:02] Unknown:
I think that we probably moved a little too quickly at the beginning, because my mental model was the same development style that I'd used internally at Facebook to develop a lot of the core infrastructure there. We moved super quickly, and we made mistakes, and we corrected them. In open source projects, it's way harder to correct a mistake. At Facebook, you know, if I made a mistake and I needed to change the name of a method or something, I could just be a madman and, like, stay up all night and change all the call sites and move on with my life. That's not possible in the open source domain.
And I don't think I really internalized that lesson from the GraphQL experience, because GraphQL was a well-formed, coherent thing. We open sourced a spec, which has really stood the test of time because it was well baked by usage. So the artifact we open sourced was this durable thing. Whereas with Dagster, we were trying to do a new thing and really push the boundaries on a few different dimensions, coming up with ways to achieve the outcomes we wanted to achieve. And, you know, with this latest release I talked about, we're kind of paying the price for that, correcting some goofy names and consolidating some abstractions. So that's been kind of a humbling lesson, and I'm really privileged to work on a team that, one, was able to do that, but, two, was also willing to do it. Because it's like, oh, Nick, if you would've just not done that, then you would've saved three months of my life cleaning up your bullshit, to use a blunt term. But everyone's been super cool about that, and our community has also been
[00:59:40] Unknown:
great on that front. So I guess that's very top of mind because we just lived through it. And so for people who are looking for some foundational component for their data platform, or who are looking for a way to manage these graphs of compute in their ecosystem, what are the cases where Dagster is the wrong choice?
[00:59:57] Unknown:
Where is it the wrong choice? Let's see. We don't handle, like, real-time applications or streaming. The IKEA example I gave is slightly to the contrary, it was kind of a real-timey case, but they really did some clever stuff to make that work. You know? So I think that makes sense. We really target data platforms. I think other folks in the space are much more focused on pure orchestration applied to anything, right? So, like, CI/CD pipelines and any sort of workflow orchestration. And while you can use Dagster for that, that's not our focused use case. We are focused on the use case where you're building a data platform, and the purpose of that platform is to build, manage, and curate data assets. That's our business.
You know, it's funny. One of our design partners, Mapbox, is a very sophisticated user of the system. The person I talked to there was like, you know, we communicate this fairly clearly, and one of his stakeholder teams was like, I wanna use this for CI/CD. Is that allowed? I'm like, well, you can do whatever you want. Like, it runs code in order, and it retries things. So, you know, you can use it for a CI/CD pipeline, and I think it works reasonably well for that, but our tooling will not be focused on that use case. So, yeah, that's kind of my answer.
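"It runs code in order, and it retries things" maps directly onto Dagster's API: ops can be sequenced without passing data by using Nothing dependencies, and flaky steps can carry a retry policy. A minimal sketch with hypothetical step names:

```python
from dagster import In, Nothing, RetryPolicy, job, op


@op
def run_tests():
    # Hypothetical first step of a CI/CD-style pipeline
    print("running test suite")


# A Nothing input expresses pure ordering: deploy waits for run_tests
# to finish without consuming any data from it. The retry policy re-runs
# the op up to three times, ten seconds apart, if it fails.
@op(ins={"start": In(Nothing)}, retry_policy=RetryPolicy(max_retries=3, delay=10))
def deploy():
    print("deploying")


@job
def ci_pipeline():
    deploy(run_tests())
```

This is exactly the kind of generic workflow Nick describes as possible but out of focus: the machinery works for it, but the tooling is aimed at data assets rather than deployments.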
[01:01:23] Unknown:
As you continue to iterate on the commercial product and the open source community, what are some of the things you have planned for the near to medium term, and any particular projects that you're excited about? Yeah. The one I mentioned before, software-defined assets, I think
[01:01:37] Unknown:
is, yeah, really, really exciting. And I think that's gonna open up huge possibilities in our tooling. Like, I think we can completely reimagine the way that backfills work. I think we can really go a long way towards making a much more modern-data-stack-native, dbt-native orchestrator, and I think there's a real market need for that. I'm also really excited to see that programming model married with our cloud environment. I think we can help a lot with the problem of creating ephemeral development environments for data practitioners, where they can just push something up to a branch, and we effectively automate the process an infrastructure team would otherwise have to do by hand: okay, you're developing your data assets, so make a copy of the input data, have test schemas, all that stuff, and have a really safe environment where you can iterate quickly in a cloud environment. I'm really excited about that.
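For readers who haven't seen the software-defined asset model, here is a minimal sketch of the declarative style it enables. The asset names are hypothetical, but the pattern, where each function declares the data asset it produces and dependencies are inferred from the upstream asset names in its signature, is the core of the idea:

```python
from dagster import asset


@asset
def raw_orders():
    # Hypothetical ingestion step, stubbed out with inline data
    return [{"order_id": 1, "amount": 42.0}, {"order_id": 2, "amount": 17.5}]


@asset
def daily_revenue(raw_orders):
    # Dagster infers that daily_revenue depends on raw_orders from the
    # parameter name, so the asset graph is derived from the code itself
    return sum(row["amount"] for row in raw_orders)
```

Because the orchestrator knows which assets exist and how they relate, features like backfills and branch-scoped test environments can operate on assets rather than on opaque tasks, which is what makes the cloud workflow described above plausible.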
And then, yeah, in terms of cloud, you know, I get excited when you can use a technical system to help with organizational problems, this kind of deployment federation stuff we were talking about earlier. And we're starting at a pretty base layer for Dagster Cloud, where it's like, okay, we're making it easy to spin up a deployment, we're managing your ops. But I think there is so much runway in managing all that enterprise and organizational complexity for tons of use cases, and that can bring in way more stakeholders than any other orchestrator out there. So,
[01:03:10] Unknown:
yeah, those are kind of the immediate directions I'm excited about. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Alright. This is kind of an unfair question to ask of someone who's building the infrastructure, because, of course, you only think about, you know, your own stuff. I guess I'll refer back to the capability that I was talking about before.
[01:03:41] Unknown:
This notion of having a fast development life cycle when there's a bunch of managed services involved and you need to make copies of production data and all this stuff. I think the orchestration system will have a part to play there, but it can't solve that alone. And just the ability to do that, to have multi-tool environments where you can iterate on test data knowing you're not gonna do anything bad, would be a huge unlock.
[01:04:08] Unknown:
And like I said, I think that's an ecosystem-wide problem that'll take a lot of collaboration, but I think that's a huge hole. Well, thank you very much for taking the time today to join me again and talk about the work that you've been doing on Dagster. I'm definitely excited to start applying it more to my own environments and build up some more abstractions on top of it. So I definitely appreciate all the time that you and the team have put into it, and I hope you enjoy the rest of your day. Thanks so much, Tobias. Thanks for having me. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Chapters
Introduction and Guest Welcome
Nick Schrock's Background and Motivation
Core Concepts and Design of Dagster
Evolution of Dagster and Community Growth
Foundational Challenges in Data Ecosystem
Applying Software Engineering Principles to Data Engineering
Best Practices and Anti-Patterns in Pipeline Design
Software Defined Data Assets
Rich Metadata and Integration with Other Tools
Dagster Cloud and Enterprise Capabilities
Commercialization Journey and Open Source Balance
Design Principles for Open Source and Commercial Features
Potential Threats to Dagster's Success
Lessons Learned in Data Ecosystem
Innovative Uses of Dagster
Challenges and Lessons as a Founder
Future Plans for Dagster
Biggest Gap in Data Management Tooling
Closing Remarks