Summary
Despite the fact that businesses have relied on useful and accurate data to succeed for decades now, the state of the art for obtaining and maintaining that information still leaves much to be desired. In an effort to create a better abstraction for building data applications, Nick Schrock created Dagster. In this episode he explains his motivation for creating a product for data management, how the programming model simplifies the work of building testable and maintainable pipelines, and his vision for the future of data programming. If you are building dataflows, then Dagster is definitely worth exploring.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- This week’s episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal of making SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Nick Schrock about Dagster, an open source system for building modern data applications
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Dagster is and the origin story for the project?
- In the tagline for Dagster you describe it as "a system for building modern data applications". There are a lot of contending terms that one might use in this context, such as ETL, data pipelines, etc. Can you describe your thinking as to what the term "data application" means, and the types of use cases that Dagster is well suited for?
- Can you talk through how Dagster is architected and some of the ways that it has evolved since you first began working on it?
- What do you see as the current industry trends that are leading us away from full stack frameworks such as Airflow and Oozie for ETL and into an abstracted programming environment that is composable with different execution contexts?
- What are some of the initial assumptions that you had which have been challenged or updated in the process of working with users of Dagster?
- For someone who wants to extend Dagster, or integrate it with other components of their data infrastructure, such as a metadata engine, what interfaces do you provide for extensibility?
- For someone who wants to get started with Dagster can you describe a typical workflow for writing a data pipeline?
- Once they have something working, what is involved in deploying it?
- One of the things that stands out about Dagster is the strong contracts that it enforces between computation nodes, or "solids". Why do you feel that those contracts are necessary, and what benefits do they provide during the full lifecycle of a data application?
- Another difficult aspect of data applications is testing, both before and after deploying it to a production environment. How does Dagster help in that regard?
- It is also challenging to keep track of the entirety of a DAG for a given workflow. How does Dagit keep track of the task dependencies, and what are the limitations of that tool?
- Can you give an overview of where you see Dagster fitting in the overall ecosystem of data tools?
- What are some of the features or capabilities of Dagster which are often overlooked that you would like to highlight for the listeners?
- Your recent release of Dagster includes a built-in scheduler, as well as a built-in deployment capability. Why did you feel that those were necessary capabilities to incorporate, rather than continuing to leave that as end-user considerations?
- You have built a new company around Dagster in the form of Elementl. How are you approaching sustainability and governance of Dagster, and what is your path to sustainability for the business?
- What should listeners be keeping an eye out for in the near to medium future from Elementl and Dagster?
- What is on your roadmap that you consider necessary before creating a 1.0 release?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Dagster
- Elementl
- ETL
- GraphQL
- React
- Matei Zaharia
- DataOps Episode
- Kafka
- Fivetran
- Spark
- Supervised Learning
- DevOps
- Luigi
- Airflow
- Dask
- Kubernetes
- Ray
- Maxime Beauchemin
- Dagster Testing Guide
- Great Expectations
- Papermill
- DBT
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or you want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. This week's episode is also sponsored by Datacoral. They provide an AWS native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any of their own infrastructure. Datacoral's customers report that their data engineers are able to spend 80% of their work time invested in data transformations rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo and Facebook, scaling from mere terabytes to petabytes of analytic data.
He started Datacoral with the goal of making SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as Dataversity, Corinium Global Intelligence, Alluxio, and Data Council.
Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in New York City. Go to dataengineeringpodcast.com/conferences today to learn more about these and other events and take advantage of our partner discounts to save money when you register. Your host is Tobias Macey. And today, I'm interviewing Nick Schrock about Dagster, an open source system for building modern data applications. So, Nick, can you start by introducing yourself?
[00:02:30] Unknown:
Yeah. Thanks for having me, Tobias. My name is Nick Schrock. I'm the founder of a company called Elementl, and our current project is, as you mentioned, this open source framework for building data applications, which is kind of the word we use for describing systems like ETL pipelines and ML pipelines, and I'm sure we're gonna get into that. Before Elementl and Dagster, the bulk of my career was spent at Facebook, where I worked from 2009 to 2017, and for most of that time I worked on a team that I formed called product infrastructure, whose job it was to produce technology to empower our product developers and the users that they serve.
And that team, you know, ended up producing some open source artifacts of note, namely React, which I had nothing to do with, but I worked next to those folks for years, and then GraphQL, which I was one of the co-creators of.
[00:03:33] Unknown:
And so from that, can you explain how you first got involved in the area of data management?
[00:03:39] Unknown:
Yeah. Absolutely. So I left Facebook in February of 2017, which is actually a little over 2 years ago. You know, I took some time off, but I was thinking about what I was gonna do next. And I actually, you know, started talking to people across various industries, because I was looking for almost a non-tech industry to work in that needed tech help, like a legacy industry such as health care or finance or those types of industries. And as I was talking to people across various companies and organizations, I would ask them what their primary technology challenges were, and data engineering, data integration, ML pipelines, analytics, etcetera kept coming up over and over again. And, you know, I would then go to practitioners in the field and ask them, like, hey, can you show me what your workflow looks like and what your tools look like?
And, you know, there are amazing components of technology in this sector, but when you look at kind of the developer experience, or what I'll call the builder experience, because it's not just software developers, analysts and data scientists also participate in this, from someone with my background in what I fondly call the full hipster stack, meaning, like, React, GraphQL, and the associated technologies, the aesthetics and the tooling were just not of the quality that I was accustomed to. And then, you know, you would go back and talk to these business leaders after talking to their engineers, and they would say something like, listen, our ability to transform health care, we think, is actually limited by our ability to do data processing.
And I remember this meeting distinctly. I was like, wait, wait, wait, you're telling me that what you think is preventing you from transforming American healthcare is the ability to do the moral equivalent of regularized computation on a CSV file? And they're like, yeah, that's probably it. And I was just like, this is crazy. And that kind of started me on the path of looking into this.
[00:05:55] Unknown:
And given the fact that you didn't have a lot of background context in data management and data engineering before that, I'm curious how you managed to get up to speed and get so embroiled in the overall space of data engineering and data tooling, and how you identified where to approach the problem.
[00:06:16] Unknown:
I mean, it's just pure immersion, you know. I just started reading and consuming as much material as possible and talking to as many people as possible. So, yeah, I knew it was time to go back to work. I was actually on my honeymoon with my wife, and we were on a train, and I was reading the PhD thesis of Matei Zaharia, the creator of Spark, on the train. She was like, Nick, this is ridiculous. Put that paper down; you're going back to work when we get home. But, you know, Tobias, your podcast and podcasts like it have been utterly invaluable. And through those podcasts, I also was able to connect with like-minded people and really get their feedback and understand what they were doing. I particularly loved, for example, your episode with Chris Bergh about DataOps. I thought that was super insightful. But effectively, it was just, you know, when you start learning something, everyone in the history of the world who knows something at some point did not know that thing. So you just put one step in front of the other, start reading every single thing out there, talk to every person that you know about it, and then just start building and experimenting with stuff.
[00:07:32] Unknown:
So from all of that, you ended up creating the Dagster project. I'm wondering if you can just explain a bit about what it is and some of the early steps of getting along that path and understanding how to approach the problem.
[00:07:45] Unknown:
Yeah. So, you know, a lot of this comes from my, you know, everything is biased and seen through the lens of your previous experiences. So, yeah, I was definitely trying to think about what are the design principles that led to things like GraphQL that I thought were applicable in this space. And, you know, I started to think a lot about why programming in data management, broadly, feels so different, and what are the properties that make it so that, seemingly, software engineering practices end up being different in this domain than in a traditional application domain.
And as I was thinking about that, one of the properties of these systems that stood out to me is the relationship between the computation and the underlying data. Meaning, in a traditional application, you have, let's say, a single database table, and that is manipulated in a transactional fashion, meaning there are lots of different pieces of software and entry points into the system that are mutating that table. Right? Like, this user updates this setting from this endpoint, and this other user updates this other setting from that endpoint, all against that shared state. One of the big things that's different about this domain of computation is that typically there's a one to one correlation between a data asset and the computation that produced it. Meaning that if there's a data lake somewhere and there's a set of Parquet files that are being produced, or, to simplify, just a single Parquet file.
Typically, there has only been kind of one logical computation that has been producing that thing throughout time. Meaning that you have a function somewhere, let's say in a very abstract sense, a piece of computation like a Spark job, and it's been producing daily partitions in Parquet files over and over and over again. And there's a one to one correlation between that set of partitions in the data lake and the computation that produced it. And you can actually generalize that to almost anything. Like, all of these systems, whether you call them ETL pipelines or ML supervised learning processes or whatever, are typically just DAGs of functions that consume and produce data assets. And what was really interesting about focusing on the computation itself is that it is actually kind of a more essential definition of the data than the data itself, in some ways. Let me give you an example. Imagine that you had a computation that said, hey, I'm a computation and I produce a sequence of tuples that have strings and ints, right, and imagine that you could actually, in a really standardized way, instruct that computation to conditionally either generate a CSV with that schema or a JSON file with that schema.
In reality, it's the computation that is the source of truth there, and not the produced CSV or JSON file. So it was kind of this, like, hey, why don't we start focusing on attaching metadata and a type system and a standardized API around these broad computations instead of the data itself, and that was kind of the fundamental insight that led to the project.
[00:11:00] Unknown:
And in the tagline, you use the term modern data application, rather than terms that people might be more familiar with, such as ETL framework or tool for building data pipelines. And I'm just wondering if you can describe your thinking in terms of what you mean when you say data application and some of the main types of use cases that Dagster is well suited for.
[00:11:25] Unknown:
Totally. So, let's frame this by talking about the term ETL. And again, I think this is part of the benefit of me coming in fresh to this about a year and a half ago and kind of assessing, like, what is all this different terminology that's in use and why does it exist? So specifically, ETL, let's talk about that term. Extract, transform, load. And its historical etymology is, you know, the traditional one is, like, you have an Oracle system, a transactional database, on one side and you have a data warehouse on the other side. And every night, you do a one time transformation that extracts that data out of the transactional database, does some computation on it, and then loads it into a data warehouse. But what people call ETL today looks nothing like that. It looks absolutely nothing like that, meaning that it is typically multi stage, it has multiple stages of materialization.
It typically passes through multiple different computational engines. Like, the ingest might be through Kafka or a tool like Fivetran, and then it might be in a data warehouse for a while, or maybe then Spark will operate on it, and then different systems. So the term ETL is no longer attached to its original definition. When people say ETL today, they effectively mean any computation done in the cloud. And the other thing, and the reason why we're kind of interested in capturing a new term, data application, for these, is, one, I believe that ETL, data pipelines, and supervised learning processes are all in effect the same system.
They are graphs of compute that consume and produce data assets. Right? Within every ML pipeline, there's an ETL pipeline. There's just one additional step that produces a model. And the other thing, and this also kind of comes from the origin story, is that I really view this domain as being in a similar spot to where front end engineering was about 10 years ago. And the reason is that if you talk to anyone in data today, they'll say something like, I spend 10 to 20% of my time actually doing my job and 80 to 90% of my time data cleaning. And when I kept on hearing this from people, it started giving me these, like, flashbacks to talking to front end engineers in, say, 2010 within Facebook, and they would say, like, I spend 80 to 90% of my time fighting the browser and 10 to 20% of my time building my app. And, you know, React really changed the world on that front. And one of the things it did is it no longer thought of front end as kind of a sequence of scripts that are stitched together, that you touch once and then never talk to again. It's, like, hey, these are now really complicated pieces of software.
We need a framework that respects the problem and the discipline and, like, lives up to the inherent complexity that's in those apps. I think that data is in a similar spot, where, like, these are no longer just, like, 5 scripts that are stitched together in a DAG that you have to run once a day. Like, those exist, but in reality, we are in a much more complicated world. What were hitherto known as ETL pipelines are much more intermixed with the business logic of your application. Meaning, like, often you're doing transformations that stream data back into the app, and there's kind of a reflexive relationship between the data pipelining and the core behavior of your front end application.
So these are just, like, far more complicated things now, and I think you need to think of them as applications. Meaning, like, they're alive all the time. They have uptimes. There are complicated relationships within them. You have to think about it not just as a one off script; you have to respect the problem, write testing for it, and really start to model these things in a more robust way that's amenable to both human inspection, human authoring, and tooling. And so, you know, that's kind of why we're referring to these things as data applications, because data applications are multidisciplinary, and I think the siloing between the data world and the application world is partially caused by the siloing of terminology, actually. But they're all collaborating on the same activity.
[00:16:09] Unknown:
And I would also argue that application engineers are starting to bleed into the data engineering life cycle as well, where previously an ETL engineer or data engineer would be responsible for pulling information from the system of record that the application uses. But as we get to more real time needs and the requirement of incorporating data as it's being generated, the application engineer needs to be aware of what the overall systems are that are able to process that data downstream, particularly with the introduction of systems such as Kafka as the sort of centralized system of record for the entire application ecosystem, both for the end user applications and for the data applications. And so I think it makes sense to have this unified programming framework that everybody can understand and everybody can work together on, rather than having them be componentized and monolithic and fully vertically integrated.
[00:17:09] Unknown:
I couldn't agree more. And, you know, this is why I mentioned it earlier, your interview with Chris Bergh about DataOps. You're kind of, in different words, describing the DataOps vision. If the analogy is DevOps, there used to be this siloing between developers and operations, and now developers are responsible for operations to some degree, in that there's, like, a programming model where they can program the ops. And I think we need to move to a similar world here, where you can have self contained teams that are responsible for building the app, deploying it, and also integrating it with your data applications internally, because the people who wrote the apps know the most about the domain of their data. We shouldn't be living in a world where an application developer can wake up willy nilly and change their data model and then break everyone else without being responsible for that.
[00:18:05] Unknown:
So can you take a bit of time now to talk a bit about how Dagster itself is architected and some of the ways that it's evolved since you first began working on it?
[00:18:16] Unknown:
Totally. So, you know, Dagster, if you look at it, someone once told me, it's like, oh, this looks like fancy Luigi. So, you know, at first blush, it definitely looks like a fairly traditional ETL framework. I think what distinguishes it, and how it's architected, is that we are very focused on allowing the developer to express what the data application is doing rather than just how it is executed. So if you look at something like Airflow, right, the primary abstractions there are operators, from which you create tasks, and then you build a dependency graph. Right? And if you open up that UI, all you see is kind of a series of nodes and edges, and those nodes have, like, a single string that describes what each one is, and then there are edges between them, and that's all the information that you have.
And the goal of the system is to orchestrate and ensure that those computations complete, and that you can retry them, and things of that nature. Dagster's primary focus, although we do do some of that execution as we'll get into, is enabling the developer to express at a higher level of abstraction what those computations are doing. So, when you write a solid, you know, there's a type system that comes with Dagster, and we say that, hey, every single node in the graph is actually a function that consumes something and produces something, and you should be able to express that. You should also be able to overlay types on top of that so that you can do some data checking as things enter and exit the nodes, as well as express to tooling exactly what is going on with this thing. These things can also express how they get configured; we have strongly typed configuration tools. And then as the computations proceed, they actually inform the enclosing runtime about what's been happening, meaning that, hey, I produced this output; hey, I actually created a materialization that outlives the scope of the computation; I just passed this data quality test, etcetera, etcetera. So our focus is much more on kind of this new, call it the application layer, right, for data management, and that is the primary focus for our programming model.
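To make that programming model concrete, here is a minimal, hypothetical sketch of a pipeline with two solids, an explicit data dependency, and strongly typed config. It assumes the solid/pipeline decorator API from Dagster's 0.x releases, roughly what existed around the time of this episode; names and signatures have changed in later versions, so treat it as illustrative rather than authoritative.

```python
# Illustrative sketch only: assumes the 0.x-era solid/pipeline API.
from dagster import execute_pipeline, pipeline, solid


@solid(config_schema={"path": str, "limit": int})
def load_records(context):
    # Stand-in for a real extract step reading context.solid_config["path"].
    context.log.info(f"loading up to {context.solid_config['limit']} rows")
    return list(range(context.solid_config["limit"]))


@solid
def summarize(context, records):
    # The dependency on load_records is expressed as a data dependency,
    # not just an ordering between opaque tasks.
    total = sum(records)
    context.log.info(f"total = {total}")
    return total


@pipeline
def example_pipeline():
    summarize(load_records())


if __name__ == "__main__":
    execute_pipeline(
        example_pipeline,
        run_config={
            "solids": {"load_records": {"config": {"path": "events.csv", "limit": 10}}}
        },
    )
```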
[00:20:42] Unknown:
And since Dagster itself is focused more on the programming model and isn't vertically integrated, as I mentioned before, as opposed to tools such as Airflow that people might be familiar with, I'm curious how that changes the overall workflow and interaction of the end users of the system, and what your reasoning is for decoupling the programming layer from the actual execution context.
[00:21:08] Unknown:
Yeah. So, you know, the world of infrastructure is changing a lot. Let's just talk about Airflow specifically. Right? You know, Airflow is a very vertically integrated system. It has a UI, it has an execution and cluster management aspect to it, and it also, you know, has this user facing API, such as it is. And, you know, because it's not layered as much, they haven't been able to move as quickly. Like, for example, Airflow still doesn't really have a coherent API layer such that you could really move quickly on the front end of that system in a decoupled way. But I think what's more interesting is that the world of infrastructure is just changing a lot.
And, you know, just to go back to the previous comment, that Dagster's primary concern is about what the data applications are doing rather than exactly how they're doing it: the how of what these systems are doing is going to be changing a ton over time. So I think there are going to be lots of different physical orchestration engines as new cluster computing primitives come along. So, you know, just for example, out there there's Dask, which you can use for cluster management if you just want to do something Python native, and then obviously people are really interested in computational workloads on Kubernetes, but I don't think Kubernetes will be the end all of, you know, compute infrastructure for all time. And so I just think that world is moving very, very quickly, and you also want to be able to use a new software abstraction on existing legacy infrastructure.
Right? So this in some ways comes from my experience working with GraphQL. One of the things I was really pleasantly surprised about in open sourcing GraphQL was just how effectively it penetrated legacy enterprises, and the reason why is that GraphQL is a pure software abstraction that you can overlay on any programming language, any runtime, any storage engine, any ORM. And that meant that if your front end people wanted to go in and use GraphQL but overlay it on top of some legacy IBM WebSphere something or other, you could actually have someone write a GraphQL server which interacts with that thing. And that was an extraordinarily powerful operating modality for an abstraction, to really have a lot of impact not just amongst the cognoscenti building greenfield apps but on an industry wide scale.
So we kind of approach this in the same dimension of what we like to call a horizontal community platform, meaning that, yes, these are just DAGs of functions, and by functions I mean, like, a coarse grained computation, meaning, like, a Spark job, a data warehouse job, or any sort of legacy process that you have in your system. You should be able to orchestrate those computations on arbitrary compute based on your needs and your requirements. But regardless of what's actually doing the compute and what's physically doing the orchestration, there's still a ton of commonality between all those things, and that's kind of the what part of what Dagster is describing, meaning, like, it has types, metadata, etcetera. I mean, we can have common tools that can operate over all of that. And you can actually see this trend of moving away from and unbundling vertically integrated stacks across a few domains of computing, all the way from content management systems to other systems, and I think this is kind of part of that.
[00:25:14] Unknown:
Yeah. I think having these different composable layers provides a lot more longevity for each of those different layers independently. Because, as you said, today you might want to use Airflow as your actual execution context. Tomorrow, it might evolve to Spark because your scaling needs have evolved. Or maybe there's some new framework coming out that you want to be able to leverage, but you don't wanna have to rewrite all of your computation just because you're running across a different execution engine.
[00:25:40] Unknown:
Yeah. And if nothing else, the other thing that I really noted coming at this industry fresh is just how heterogeneous and fractured it is. Meaning that typically you're crossing 3 or 4 technology boundaries when dealing with one of these things. And in a legacy organization where things are complicated, or maybe they've even done acquisitions and stuff, the data infrastructure heterogeneity is absolutely mind boggling. So having this single opinionated layer, whose whole job is to describe what's going on, in a way that can integrate with legacy computational engines and legacy infrastructure, we think is really powerful.
[00:26:32] Unknown:
And just quickly, I'm curious what your evaluation process was to determine that Python was the right implementation target for Dagster, and what other language runtimes or frameworks you might have considered in the process.
[00:26:48] Unknown:
So, you know, when it comes to languages, I'm a pragmatist. And for these types of systems, where you want a wide variety of personas interacting with it successfully, but still being able to build so called real software, and in the data domain, I don't think there's any choice but to use Python. Python has a lot of good things going for it. One, just, like, everyone in data is accustomed to it. It's highly expressive, so for these kind of metadata, metaprogramming types of frameworks it's extremely useful. Python also, and I think this is the reason why it's been successful in this domain, as a programming language just has a vast dynamic range.
Meaning that I think you can grab anyone who, say, is proficient enough to do something complicated in Excel, and you can put them in a Jupyter notebook and they can do meaningful work, but you can also build Instagram on top of Python, and that's kind of Python's superpower. So, what are the other choices in the data domain? You know, a lot of Scala is in vogue. Scala does not have that dynamic range. You cannot plop an Excel user into a Scala program and expect them to be successful. And then, you know, what other languages should I have considered? Actually, it's one of those things where I don't think I even really considered another language, because to me the choice was so obvious.
[00:28:33] Unknown:
And then going into Dagster, I'm curious what were some of the main assumptions that you had, and how have those been challenged or updated as you have put Dagster in front of more end users?
[00:28:46] Unknown:
Yeah. That's a great question. And, actually, it's kind of difficult to go back in time and reconstruct exactly, you know, what my thinking and then the team's thinking was at every step along the way. I think, most recently, and I actually still think this was the correct architectural decision, but initially we were very focused on the use case of, hey, I'm a team, I have an Airflow cluster, I want to have a higher level programming model on top of Airflow such that the people on my team are not manually constructing Airflow DAGs; they're programmatically generating those DAGs from some other API, in our case, Dagster. Because it was a pattern we saw over and over again. Typically, most Airflow shops of sufficient complexity have built their own layer on top of Airflow that, for whatever reason specific to their domain or their context, programmatically generates those Airflow DAGs. So we were really focused on the incremental adoption case, but a lot of our early users came to us and said, hey, I think it's really cool that you have this Airflow integration, and it actually kind of proves that the system is interesting and generic and that we won't be locked into anything, but for right now I just really like your front end tools and I just wanna be able to build my greenfield app on top of this in a one click, one stop shop sort of way. And that's actually what we've been working on for the last couple months: coming up with a kind of Dagster native, vertically integrated instantiation of a Dagster system that has a scheduler and a lightweight execution engine along with some DevOps tools. So, essentially, we have a library called dagster-aws, and you type dagster-aws init and it spins up a node in AWS for you, spins up an RDS database, and you can go from hello world to scheduling a job in about 2 minutes, with a beautiful hosted web UI to monitor and productionize your apps.
So, you know, we kind of started out with this horizontal, integrate with everything, don't be super opinionated approach, and have moved to where we do have, like, one instantiation of an opinion, which is that you can have this kind of out of the box solution, but the architecture is still there to integrate it with other systems and other execution contexts. So, you know, I think we've changed our initial target market. And then I would say the other thing, related to that, is that this started out as a much more vanilla ETL framework, and the insight that allowed it to eventually target different execution engines has definitely been an evolution, so that thinking has definitely changed along the way. But I would have to think about other things in order to answer that question more fully.
[00:32:01] Unknown:
And then for somebody who wants to extend Dagster, and either integrate it with other systems that they're running, or add new capabilities to it, or implement their own scheduler logic, what are the different extension and integration points that Dagster exposes?
[00:32:15] Unknown:
Sure. So we can go through those one by one. So, for example, say you want to use a new compute engine. Let's say you're using Spark, but you really wanna experiment with a similar system that does distributed computation called Ray, for example. It's like, okay, I wanna be able to use Ray within Dagster. Well, all you would do is write one of these what we call solids that generically wraps kicking off a Ray job, and all you need to do is look at the way we integrate with Spark and data warehouses today and use those as patterns in order to build your own solids, and you're off to the races. So literally wrapping existing computational frameworks is relatively straightforward, and you can cargo cult that from our open source repo.
Another example you had was, say, I wanna be able to use my own scheduler. Well, Dagster is fully built on a GraphQL API, so the system is very pluggable. It would actually be very straightforward to implement your own scheduling logic, because all you would need to do is, based on some schedule, essentially execute a GraphQL mutation against your hosted installation, and you'd be able to enqueue jobs to be run. In terms of executing this on a new orchestration engine, we also have a pluggable API for that, and all those examples are also checked into our open source repo. Right now we have integrations with Dask and with Airflow, where effectively we've written code that allows you to take a Dagster representation of a pipeline and then compile that into either an Airflow DAG or a Dask DAG, and if you wanted to use another execution engine, you would just mimic that process.
So the system is designed for pluggability through and through.
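As a rough illustration of the "wrap an existing engine in a solid" extension point described above, the sketch below wraps a hypothetical remote job client. `submit_job` and `wait_for_job` are placeholder names standing in for whatever client your engine (Ray, Spark, a warehouse) actually exposes; they are not Dagster or Ray APIs, and the solid decorator shown is the 0.x-era Dagster API.

```python
# Illustrative sketch only: the solid API is 0.x-era Dagster, and
# submit_job / wait_for_job are hypothetical stand-ins for your engine's client.
from dagster import solid


def submit_job(cluster_address, script):
    """Hypothetical placeholder for an engine-specific submit call."""
    return "job-123"


def wait_for_job(cluster_address, job_id):
    """Hypothetical placeholder for an engine-specific polling call."""
    return {"status": "SUCCESS", "job_id": job_id}


@solid(config_schema={"cluster_address": str, "script": str})
def run_remote_job(context):
    cfg = context.solid_config
    job_id = submit_job(cfg["cluster_address"], cfg["script"])
    context.log.info(f"submitted {job_id} to {cfg['cluster_address']}")
    # Block until the external engine reports completion, then hand the
    # result downstream like any other solid output.
    return wait_for_job(cfg["cluster_address"], job_id)
```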
[00:34:34] Unknown:
And another component of the environment that somebody might want to be able to integrate Dagster with is their metadata engine, to be able to keep track of data provenance and identify what transformations are happening. And I'm curious what would be required for somebody to be able to extract all of the task metadata to integrate into that system?
[00:35:00] Unknown:
Yeah. So that's a great question. You know, the system is definitely designed with that in mind, meaning that whenever you execute a solid, you know what the input and output types are. But in addition to that, those solids can also communicate that, hey, I have created this, what we call a materialization, that will outlive the scope of the computation. So you can subscribe to those events via GraphQL subscription, or you can just consume them with our Python API. But what that allows you to do is, a tool which is consuming that stream of events has an enormous amount of context about what's going on. It knows, like, when the thing was executed.
It might know what container it was executed in. It knows what configuration was used, meaning the Dagster configuration file that was used to kick that off, and then it gets runtime information about the materialization, and it's a totally user pluggable, structured metadata system. And so, you know, definitely on our road map is for us to build our own metastore on top of this, but it's meant to be very pluggable, where you could just write a generic facility which consumes these events, and every time a materialization is consumed you would be able to persist, in a metadata store, enough state to have full lineage and provenance on that produced materialization.
So we don't have anything out of the box to support that right now, but it would actually be pretty straightforward to integrate that with an existing metastore, and we are just really excited about that direction. So if anyone wants to do that, please come talk to us, because we love working with people who like to build on top of this.
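A hypothetical sketch of the event stream described above: a solid reports a materialization that a metadata or lineage tool could consume, then yields its output. This assumes the `AssetMaterialization` and `Output` events from Dagster's 0.x API; the bucket path and asset name are invented for illustration.

```python
# Illustrative sketch only: 0.x-era Dagster events; the path and asset name are invented.
from dagster import AssetMaterialization, Output, solid


@solid
def write_daily_partition(context, rows):
    path = "s3://example-bucket/events/2019-10-01.parquet"  # hypothetical location
    # ... write `rows` to `path` with the storage client of your choice ...
    yield AssetMaterialization(
        asset_key="events_daily",
        description=f"Wrote {len(rows)} rows to {path}",
    )
    # When a solid body yields events, it must also yield its output explicitly.
    yield Output(path)
```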
[00:36:45] Unknown:
And for somebody who is interested in getting started with Dagster and writing their own data flow or data application, can you talk through the overall workflow for somebody to be able to define all of the different computation points and integrate it, and then deploy it to production and make sure that the execution contexts are configured properly?
[00:37:10] Unknown:
Yeah. So, you know, I'll just go through that quickly. So where do you start? Well, you pip install dagster. Right? It is just a Python module. And what you would effectively do to say hello world is you would have a Python file, and you would write what we call a pipeline, which is one of these DAGs, and then a solid, which is just effectively a function which defines a computation. So you write a function; that function is a total black box; you can invoke Pandas, a data warehouse job, a PySpark job, whatever you want. And then we have this kind of elegant DSL for stringing those solids together into a DAG. Once you do that, you can launch and debug it, in a unit testing environment obviously, but also using our development tool called Dagit. So just locally on your machine, without deploying anything, you can run Dagit, you can visualize the DAG, inspect it, you can configure an execution of it, we have this beautiful auto completing config type system, and you can then execute that locally and verify that things happen. So the fact that we've architected this system to be executable in different contexts means that it's also executable on your local machine for testability and whatnot. And we also have abstractions that help users isolate their environment from their business logic, because this is just critical for getting testing going. Okay. So now you have that working.
Deploying it, with our new release, you can actually do in a very straightforward fashion using the DevOps tools that come with this. So once you have that pipeline written, you would then effectively type dagster-aws init, and it would provision an instance and install the correct requirements, you need to have a requirements.txt locally, and then you're up and running in the AWS environment in your VPC. And then we also have a Python API for defining schedules, which is just a light wrap around cron, and so you can go from writing this to deploying it very quickly.
If you further wanna customize it, we have this new abstraction that we call an instance, or you can think of it as, like, an installation, and you can configure that thing. So basically, when you init that AWS environment or init your local environment, you can say, hey, instead of just doing a single threaded, single process executor, we instead want to execute this thing on top of Dask, for example. And so you would configure your instance, which is just a YAML file in a well known spot, to use something like Dask instead of our native toy executor. So it's definitely pluggable on multiple dimensions.
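The schedule API he mentions, a light wrap around cron, looks roughly like the following. This continues the hypothetical `example_pipeline` sketch from earlier and assumes the `@schedule` decorator from Dagster's 0.x releases; the cron expression and names are made up for illustration.

```python
# Illustrative sketch only: 0.x-era @schedule decorator, hypothetical pipeline name.
from dagster import schedule


@schedule(cron_schedule="0 6 * * *", pipeline_name="example_pipeline")
def daily_example_schedule(context):
    # Return the run config that each scheduled run should use.
    return {
        "solids": {"load_records": {"config": {"path": "events.csv", "limit": 10}}}
    }
```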
[00:40:16] Unknown:
And then one of the things that you've commented on, and that stands out about Dagster, is the concept of strong contracts that it enforces between the different solids or computation nodes. And I'm wondering why you feel that those contracts specifically are necessary, and some of the benefits that they provide during the full life cycle of building and maintaining a data application?
[00:40:39] Unknown:
So what struck me about a lot of these systems is the amount of implicit contracting that was in them, and how frequently those contracts were left unexpressed. Meaning, again, contrast it with Airflow. If you look at Airflow's documentation, they say that if you feel compelled to share data between your tasks, you should consider merging them. And then someone actually wrote a system to try to pass data between tasks, called XCom, but that is generally not used that much, and I believe even its creators have said that was kind of a swing and a miss. But the thing is that Airflow tasks are passing data between one another implicitly.
Right? So if you have task A which comes before task B, presumably the dependency exists only because A has changed the state of the world in such a way that it needs to happen before B, right? And if you want that to be testable, it has to have parameters and you have to pass data between them. So to me this wasn't some massive realization; I think everyone understands that there are data dependencies between these things. It's just a question of whether you express them in the system or not. I think it is critically important to express them, for any number of reasons: in terms of human understandability, meaning you can actually inspect this thing in a tool and understand what the computation is doing, and in terms of ensuring or guiding your users to write these things in a testable manner, because if you can't pass data between tasks, there's no way you can test those tasks, right? And I just think that all these data applications are DAGs of functions that produce and consume data assets, and that they should be testable. You should be able to execute arbitrary subsets of them, and in order to do that you need them to be parameterized, and some of the parameters need to come from outputs of the previous task, which means A and B have to be functions.
And then there are also really interesting operational properties that come out of expressing your data dependencies. It's a fundamental layer on which you could build, say, fully incremental computation, and have the system understand how it should memoize produced data and other aspects. And, you know, Max of Airflow fame, Maxime Beauchemin, who's also been on your show, has written a couple blog posts which have been influential in my thinking about so called functional data engineering. So I just think it's the right way to build these systems on any number of dimensions, and I think you can get a lot of value by expressing those data dependencies, those parameters, in your computational graph.
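Because the data dependencies and parameters are explicit, an individual solid can be executed, and unit tested, in isolation by supplying its inputs directly. A small sketch, assuming the `execute_solid` test helper from 0.x Dagster and reusing the hypothetical `summarize` solid from the earlier example:

```python
# Illustrative sketch only: 0.x-era execute_solid test helper.
from dagster import execute_solid, solid


@solid
def summarize(context, records):
    return sum(records)


def test_summarize():
    # Supply the upstream output directly instead of running the whole DAG.
    result = execute_solid(summarize, input_values={"records": [1, 2, 3]})
    assert result.success
    assert result.output_value() == 6
```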
[00:43:45] Unknown:
And you mentioned testing a few times in there, and I've got a couple of questions along those lines. One, in terms of how Dagster itself facilitates the overall process of testing, some of the challenges that exist for testing data applications, and how you approach it. But also, I'm curious how you approached defining the type system for Dagster to be able to encapsulate some of the complex elements that you need to be able to pass between things, such as database connections, to be able to identify that there was some change in a record set, or an S3 connection for being able to define the fact that there were some Parquet files dumped into a particular bucket.
[00:44:31] Unknown:
Okay. Well, I feel like you just asked two questions we could fill up an entire podcast on, but I will do my best. So you asked how Dagster approaches testing, and this is a huge and important subject; not just how Dagster does it, but testing in the data domain in general. Everyone acknowledges that it's really difficult, and it's one of the things that I really noted when I first was learning about this, in terms of things that make this domain different from application programming. The same developer who, in a traditional application, would be writing lots of unit tests, you take that same human being, move them to writing one of these systems, and all of a sudden they're not writing tests anymore, because it's a fundamentally different domain and it's harder to do testing in. So when I think about testing here, I think about 3 different layers.
Let's go through them. So one is unit testing, another is integration testing, and then the last we'll call pipeline or production testing, and each of them in this environment has its own issues. So, unit testing this stuff is hard, and a big reason why is that typically these systems have dependencies on external pieces of infrastructure which are effectively impossible, or very difficult, to mock out. This is one of the reasons why, in Dagster, we flow a context object throughout your entire computation. The goal of that is that instead of grabbing some global resource anywhere, like a database connection with hard coded values or a Spark context or whatever, you instead attach those same exact objects to our context object. That allows the runtime to control the creation of that context and therefore, with the same API, control the environment that the user's code is operating within. So what I mean by that is, instead of calling a global function getconn, as in get connection, you would instead say context.resources.connection, and what that allows you to do is, based on how you configure your computation in a specific instantiation, swap in a different version of that connection so that you can test this stuff in a unit testing context without changing the business logic. And, you know, the thing about the data domain is that you can't capture as much in unit testing as you could in application development, because of this external dependency stuff, but you can still do a lot in the unit testing environment. You can make sure that a refactoring worked, or that a renamed function still wires up, or test that your configurations are actually being parsed correctly. There are all sorts of changes that can be covered there, and I think it's a critical part of the process in the CI/CD pipeline. Okay. Next, integration testing.
So this is more like, hey, we can't mock out our Spark cluster, because mocking out a Spark cluster would be an entire company; it's an enormously complicated piece of software itself. But what we do want to be able to do is easily parameterize the computation so that, in the integration test environment, we actually spin up a very tiny test only Spark cluster, run it on a subsample of the data, and still have something happen in a verifiable way. For that integration testing layer, what I'll emphasize is our built in configuration system. In order to make these pipelines testable, they typically become extraordinarily complex functions with tons of parameters that configure both how they're interacting with their environment and also where they're getting their data from. So we fully embrace that, and we built this configuration management layer that makes it easier to manage complex configurations, and one of the goals of that is to enable better integration testing, so that in your CI/CD pipeline or locally you can have different instantiations of config that will do full or partial integration tests of your pipeline.
That way you can slice and dice your pipeline into whatever subset you want and then be able to execute it within different environments. And then the last component of testing in these data pipelines, which we also have full support for within Dagster, is the notion of pipeline or production tests. Now, my thinking in this area is deeply influenced by Abe Gong, the creator of Great Expectations, who uses the term pipeline test to describe this. And what it says is, one of the differences between this and traditional programming is that in data pipelines or applications, you do not have control over your input. Instead, you're typically ingesting data from a data source you don't control, and that means that if the data changes, some assumption about it changes. Think about it like this: you look at some CSV file, you write some code that parses that CSV file, and there are a bunch of implicit assumptions in the code that you wrote in order to load it correctly. And then if in the next day's data dump some assumption about that changes, and it's not part of a formal contract, your code will break. So what expectations, or data quality tests, are is the notion that, hey, instead of having these implicit contracts between your incoming data and your computation, which only end up getting expressed when things break, why don't you front load that and say that, in order for this computation to work, the incoming data has to conform to this data quality test, meaning, like, I expect the 3rd column to be named Foo, and for no more than 1% of its values to be null, and the rest of them to be integers greater than 0. And because you cannot control the ingest, the only way you can test that and know that it happened is at production time.
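A minimal sketch of a pipeline test in the sense described here: the solid checks a property of the incoming data at run time and reports the result to the runtime as a structured event. It assumes the `ExpectationResult` event from 0.x Dagster; Great Expectations provides a much richer standalone take on the same idea, and the column name checked is invented.

```python
# Illustrative sketch only: 0.x-era ExpectationResult / Output events.
from dagster import ExpectationResult, Output, solid


@solid
def validate_events(context, rows):
    null_ids = sum(1 for row in rows if row.get("id") is None)
    yield ExpectationResult(
        success=null_ids == 0,
        label="no_null_ids",
        description=f"{null_ids} rows arrived without an id",
    )
    # Pass the (now validated) data downstream.
    yield Output(rows)
```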
This is much more like a manufacturing process than a traditional software process, meaning that the data is the raw material: you're getting it shipped from some place, but before it goes into the machine, you need to do tests on that raw material to ensure that it conforms to the requirements of the machine, which you might otherwise be breaking. So, through our abstractions, and with tooling built on those abstractions, we try to guide the developer to execute all 3 of those layers of testing, which are all necessary for a well functioning system. I believe your other question was about our type system and the things that you pass between the solids, correct? Correct. Alright. So the type system of Dagster is a good way to transition from those production tests, because when we went into this, given that property I was talking about, where you typically don't control the ingest, and given the vast heterogeneity of systems that are used to process this data, what can the type system in one of these things that proclaims to span programming languages and span different computational systems actually do? And we came to the conclusion that, for now, the most simple and most flexible thing for a type to say in Dagster is this: when you say, hey, I have a solid and it has an input of type Foo, all that type Foo says at its core is that you provided a function which, when a value is about to be passed into that solid, that value needs to be able to pass. So literally the core capability of a type in the Dagster system is that there's just a function that takes an in memory value, does an arbitrary computation, and then returns true or false, plus some metadata about what happened.
So it's a totally dynamic, flexible, and gradual typing system that allows the user to customize their own types and do whatever they need to do in order to pass that type check. The type system and the inputs and outputs are all about the data that is flowing through the system. The other things you mentioned, though, were database connections, S3 connections, things of that nature. Those we model on a different axis or dimension of the system that we call resources. And I was mentioning that context object that we flow through the entire system: a database connection or an S3 connection is something that you would attach to that context. And where the vision really goes here is that we want to have an entire ecosystem of those resources, so that people are thinking in terms of even higher levels of abstraction. So, like, let's take S3. What are people typically using S3 for? Well, maybe all you're doing is saying, hey, in previous parts of this computation I'm producing a file and I just need to stash it somewhere and have it saved, so that later down the line a solid can take that and do some further processing on it. If you think about that abstraction, I think we call it a file cache, we have a file cache abstraction that comes with Dagster, and there's a local file system implementation of it and there's also an S3 implementation of it, so that you can do that operation of stashing files somewhere in order to perform your business logic. Locally, you can just say, hey, I'm operating in local mode, use the local version of that file stash resource, but in production, operating in a cluster computing environment, you give me the S3 or GCS version of that same exact abstraction. And so things like S3 connections, database connections, and the things stacked on top of them, we model as resources, because they're not business logic concerns, they're operational concerns.
So our goal is to kinda have the type system and the data quality tests be about the data, the meaning of the data that's flowing through the system, and have the context and the resources aspect be about more environmental or operational concerns.
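As an illustration of the resource pattern being described (and not the actual Dagster file cache API), the following sketch shows one interface with a local implementation for development and an S3 implementation for production, where the environment decides which one the pipeline gets; every class, bucket, and path name here is hypothetical.

```python
# An illustrative sketch of the "file cache" resource pattern, not the actual
# Dagster API: one interface, two environment-specific implementations.
import abc
import shutil
from pathlib import Path


class FileCache(abc.ABC):
    """Business logic only needs 'stash this file and give me a handle to it'."""

    @abc.abstractmethod
    def write(self, local_path: str, key: str) -> str:
        ...


class LocalFileCache(FileCache):
    def __init__(self, base_dir: str):
        self.base_dir = Path(base_dir)
        self.base_dir.mkdir(parents=True, exist_ok=True)

    def write(self, local_path: str, key: str) -> str:
        target = self.base_dir / key
        shutil.copy(local_path, target)
        return str(target)


class S3FileCache(FileCache):
    def __init__(self, bucket: str, client):
        self.bucket = bucket
        self.client = client  # e.g. a boto3 S3 client

    def write(self, local_path: str, key: str) -> str:
        self.client.upload_file(local_path, self.bucket, key)
        return f"s3://{self.bucket}/{key}"


def build_file_cache(env: str) -> FileCache:
    # Local development gets the local implementation; a production/cluster
    # deployment would attach the S3 version to the context instead.
    if env == "local":
        return LocalFileCache(base_dir="/tmp/file_cache")
    import boto3  # assumed production dependency

    return S3FileCache(bucket="my-pipeline-scratch", client=boto3.client("s3"))
```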
[00:55:23] Unknown:
And on top of all your work on Dagster itself, you have created a company to be sort of the backstop for it in the form of Elementl. And I'm wondering how you're approaching the overall sustainability and governance of Dagster, what your path to sustainability and success for the business happens to be, and how they relate to each other. Yeah, that's a great question and I think about this stuff a lot, you
[00:55:46] Unknown:
know, the open source governance stuff is top of mind for me actually, because about a year and a half ago, Lee, Dan, and I, the GraphQL co-creators, spun GraphQL out of Facebook and started the GraphQL Foundation, which is now, you know, run in concert with the Linux Foundation. That's been a really interesting experience, and there's also been a lot of, for lack of a better term, hoopla around open source sustainability across many dimensions: should there be new licensing regimes, what's the relationship with the cloud vendors, what is proper governance?
So my belief is that I like to have a pretty clear wall of separation between an open source ecosystem and any commercial entities stacked on top of it or associated with it. So I deliberately chose the name Elementl to be different from Dagster, and my goal in terms of how they're related is that the relationship between Dagster and Elementl, I hope, will be structurally similar to the relationship between GitHub and Git, meaning that Dagster will be an open source project that will forever and always be free. It's not the type of thing where we just flip feature flags and have enterprise features for it; it'll be a self-contained, governable open source project with very well defined properties and very well defined boundaries, such that in the future we can have a neutral governance model that will work well. Simultaneously, though, we are also trying to build a sustainable and healthy business, and that's where Elementl comes in. And the reason why I like the GitHub/Git analogy is that GitHub is a product: they chose to make a bunch of it closed source, it's hosted, there's a login, you have users that do stuff, people are happy paying for it. It leverages the success of Git, right, but it has its own product dynamics and whatnot. It's not just purely hosted Git. And then there's this cool dynamic where GitHub made it easier to host Git, which actually increased the adoption of Git, made Git more of the obvious winner, which then increased the popularity of GitHub, and there's kind of this reflexive relationship.
This is what we want to do eventually with Dagster and the product that will become elementl.cloud, or whatever we end up calling it. So Elementl will eventually be a product that leverages the success of Dagster, meaning that if your team has adopted Dagster as a productivity tool, it will be natural, compelling, and in everyone's best interest to adopt Elementl as your data management tool that leverages the adoption of that abstraction. And I think everyone's incentives are aligned if you do that well, and you can clearly communicate to your users that you're not going to be hoodwinked: if you're just using this as a pure productivity tool, that's totally fine with us, and godspeed, and it's our job to build data management tooling that leverages that, such that the enterprises that, you know, contain or employ those developers that use Dagster feel really good about having a commercial relationship.
[00:59:19] Unknown:
Before you are comfortable cutting a 1.0 release?
[00:59:22] Unknown:
Oh, the ever-present question of a 1.0 release. You know, in terms of the future roadmap, I certainly think that you will see us, one, based on dynamic user feedback, prioritize integrations with specific parts of the ecosystem. So after this 0.6.0 release, I would imagine that the tools will look more compelling to people, in which case certain aspects of that tooling will prompt people to say, hey, I understand that you have this Airflow integration, I'm really interested in using this other tool that I see my friend using, but I still can't move my company off of Airflow in one whole shot, so what can I use as a value add over Airflow? So we anticipate maturing our integrations with other technologies, but that will be based on user demand. I think the other thing is that you'll see us building more and more tooling off these higher-order layers of the computation, being able to say, hey, I did this data quality test, I produced this materialization, which means you can name off any number of things you could build on top of that: a metastore, anomaly detection, data quality dashboards, all sorts of other stuff. But I think for the next, you know, 1 to 2 months it's going to be more of a meat-and-potatoes type of time where, based on feedback, based on ergonomic issues, based on operational issues that come up, we will be evolving the programming model or documentation and doing a getting-back-to-basics type of thing. In terms of a 1.0 release, to me this is mostly about communicating expectations to the users, in terms of, like, hey, this is an API we're going to stand behind for years and really commit to backwards compatibility in a really, really serious way, and, you know, we're super confident that this is the base API layer for the future of the system.
We still have a few iterations to get there. We're not going to be breaking people willy-nilly on this stuff, but I suspect that, based on user feedback and how this stuff gets used organically around this process, we will be changing some APIs and maybe even taking the system in different directions. So to me the whole 1.0 question is mostly about external communication and about expectations for future users, and it's more of a qualitative judgment than anything else. Are there any other aspects of Dagster or your work at Elementl
[01:02:08] Unknown:
or your thoughts on the space of data applications that we didn't discuss yet that you'd like to cover before we close out the show?
[01:02:15] Unknown:
I guess one thing I'll say is that I think most of these systems, and this goes back to the whole "what's the deal with this new terminology" aspect, is that they over-specialize. So there are lots of people who build ML experimentation frameworks, for example, that are totally and wholly separate from their data engineering practice, and all these things end up having to coexist within the same data app anyway, so I think a lot of these tools are overly specialized. One thing I'm really excited about in terms of tooling we'll be able to build is that it will be very straightforward to build an ML experimentation framework over Dagster, because you can use an API to enqueue new jobs with different configuration parameters, right? Which is what you need to do in order to, say, do a hyperparameter search or things of that nature. And, you know, you should be able to use effectively just a lightly specialized tool over the same ecosystem to do ML experimentation rather than use an entirely different domain of computation.
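A hedged sketch of that idea: drive a hyperparameter sweep by submitting the same pipeline repeatedly with different configuration. The submit_run helper and the config shape below are hypothetical stand-ins for whatever run-launching API the orchestrator exposes, not a specific Dagster function.

```python
# Sketch only: sweep a hyperparameter grid by enqueuing one run per config.
from itertools import product


def submit_run(pipeline_name: str, run_config: dict) -> str:
    """Hypothetical helper: hand a run request to the orchestrator's
    run-launching API (e.g. a GraphQL mutation) and return a run id."""
    raise NotImplementedError("wire this up to your orchestrator")


def sweep() -> list[str]:
    learning_rates = [0.001, 0.01, 0.1]
    batch_sizes = [32, 128]
    run_ids = []
    for lr, batch in product(learning_rates, batch_sizes):
        # One enqueued run per point in the grid, differing only in config.
        config = {"train_model": {"lr": lr, "batch_size": batch}}
        run_ids.append(submit_run("training_pipeline", config))
    return run_ids
```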
So, yeah, we very deeply believe in this multidisciplinary aspect of it. One other integration that we didn't really talk about is that Dagster has a first-class integration with Papermill, which I believe you've done an episode on. What that system does is let you turn a notebook, a Jupyter notebook, into a coarse-grained function, effectively, and we in turn make it easy to wrap that within a solid. So I guess what I'd like to emphasize is the multidisciplinary aspect of this: it's a way for people to describe and package computations that are actually encoded in different systems, but express them in a similar way and wrap them in a similar metadata system. In the same vein, we actually have a prototype-quality integration with dbt as well, where you can take a dbt project, which is authored by an analyst or a data engineer, wrap it as a solid, and then execute it within the context of one of our pipelines, and that solid will communicate, hey, this dbt invocation produced these 3 tables and these 2 views, etcetera, etcetera.
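For reference, the notebook-as-a-function idea can be sketched with plain Papermill roughly as follows; the notebook path and parameter names are hypothetical, and Dagster's dagstermill integration wraps this same pattern so the result can run as a solid.

```python
# A minimal sketch of the Papermill idea: a parameterized notebook becomes a
# coarse-grained function you can call from a pipeline step.
import papermill as pm


def clean_trips_notebook(raw_path: str, output_dir: str) -> str:
    """Execute the notebook with injected parameters and return the path to
    the executed copy, which doubles as a record of the run."""
    executed = f"{output_dir}/clean_trips_output.ipynb"
    pm.execute_notebook(
        "notebooks/clean_trips.ipynb",  # hypothetical notebook
        executed,
        parameters={"raw_path": raw_path},
    )
    return executed
```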
So, yeah, I think that, you know,
[01:04:38] Unknown:
we need this sort of unification layer, and that's what we're trying to do. Well, for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:04:58] Unknown:
Yeah, I mean, I guess it's pretty self-serving, but it would be an issue if I was working on something and then thought the answer was totally different from that, you know. From what I see, the gap in the ecosystem is somewhat Dagster-shaped, I'll say. Meaning that I don't think the gap is the next good cluster manager, or just the right drag-and-drop ETL framework that abstracts away the programmer, or something like that. This is a software engineering discipline, so I guess I'll answer the question this way: the biggest gap in tooling is tools that, instead of trying to abstract away the programmer, try to up-skill people who don't consider themselves programmers to participate in the software engineering process, and to really treat these things seriously as applications and not as one-off scripts or something that you just want to drag and drop once and be done with. This is one of the reasons why I'm such a huge fan of dbt, because one of the things they've been able to do is take people who don't conceptualize themselves as software engineers, analysts, and essentially, through a really nice product, allow those analysts to participate in a more industrial-strength software engineering process, and I think that direction is super exciting,
[01:06:35] Unknown:
and we're trying to do that and trying to enable those types of tools with Dagster. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Dagster. It's a tool that I've been keeping a close eye on for a while now, and I look forward to using it more heavily in my own work. So thank you for all of your efforts on that front, and I hope you enjoy the rest of your day. Thanks, Tobias. Thanks for having me. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Nick Schrock and Dagster
Nick's Journey into Data Management
Creating Dagster: The Initial Steps
Modern Data Applications and ETL
Architecting Dagster
Decoupling Programming Layer from Execution Context
Extending and Integrating Dagster
Workflow for Defining and Deploying Data Applications
Testing and Type System in Dagster
Elementl and Open Source Governance
Future Roadmap and 1.0 Release
Multidisciplinary Approach and Integrations
Biggest Gap in Data Management Tooling
Closing Remarks