Astronomer's Role in the Airflow Ecosystem: A Deep Dive with Pete DeJoy

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale.

DataFold's AI powered migration agent changes all that.

Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches.

And they're so confident in their solution, they'll actually guarantee your timeline in writing.

Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafolds

today for the details.

Your host is Tobias Macy, and today I'm interviewing Pete DeJoy about building and managing Airflow pipelines on Astronomer and the upcoming improvements in Airflow three. So, Pete, can you start by introducing yourself?

Yeah. Absolutely. Thanks for having me on, Tobias. My name is Pete. I'm one of the cofounders, and I lead product at Astronomer. I've been working on Airflow and kind of Airflow adjacent technology for, the better part of the last decade,

primarily focused on, one,

building the open source project airflow and working with a very talented team who's who's really driving that forward,

and two, building commercial products to help with managing airflow and running airflow at meaningful levels of scale, and now kind of extending into a broader kind of data ops platform that we're working on.

Astronomer has been around since 2018,

really focused on, again, air all things airflow,

driving the project and and community forward. And,

yeah, it's been it's been a real joy to work with such a vibrant open source community and project and see all the progress that's been made. So And do you remember how you first got started working in data? Yeah. So, actually, like, prior to,

Astronomer being the Airflow company, we we were

working

as kind of a data services company focused on a bunch of different things. This was kind of back in 2015, '20 '16

doing services projects for a bunch of customers. And,

we got to know Maxine very well early on working with Airflow. He had written Airflow at Airbnb and open sourced it in twenty fourteen, twenty fifteen.

And we became kind of in house Airflow pros because we were writing all of these data pipelines on behalf of our customers. And as a result, we got to know the ins and outs of the project very well, and it was time for Maxine to go, start working on his newer project, superset.

We had a discussion with him and, really stepped in to continue driving the airflow project forward, And that's when when we kind of rebranded the company to really focus on Airflow and refocused everything that we did around Airflow. So, yeah, we were kind of doing data services work for a long time before we even got our hands on just running Airflow for a bunch of customers at scale. As you mentioned,

Airflow and Astronomer have grown to be very closely linked by virtue of the fact that that's where you're spending a lot of your time.

Obviously, Airflow is an Apache project, so there's no one company that that controls it. But so I'm interested in how you would characterize the relationship that you have between

Airflow, the open source project, and Astronomer, the company, and the role that you play in that community. Yeah. It's a great question. There are, like, so many different ways to build open source businesses. You you look at companies like MongoDB that really may can maintain control of their licensing strategy.

But then you also look at companies like Databricks and Confluent with Spark and Kafka,

really building businesses around, Apache projects. We we fall into the latter bucket, obviously, focusing on an Apache foundation project and building kind of the commercial abstractions around it. We really do drive every major airflow release. We have 18 of the top 25 committers to the project and eight PMC members on the astronomer payroll. We've contributed at this point about 60% of all airflow code just in the repository. And, yeah, we have a ton of airflow expertise in house. So, generally, a lot of our business is centered around folks that are running airflow for very critical workloads.

And, you know, we we have a very deep batch of folks that know all of the ins and outs of the airflow code base that can help out. So yeah. You know, as with anything, you know, we we don't like to use the word owner that that feels a little bit disingenuous

given that the project is an Apache project, but we certainly feel a great obligation to the community

to making sure that the project maintains its vibrancy and health and that we're building for the next generation of kind of developer and data engineering interactions in the open source. So we certainly feel accountable for the outcome of Airflow. Maybe that's the best way to to position our relationship with the open source project. Like, it's we very much,

have a have a deep best interest in its success. So Because of the fact that Airflow is one of the early movers in

whichever generation of data orchestration we're in at this point, it has long become the target

for any subsequent projects that have come along. I'm thinking most notably of Daxter and Prefect that use Airflow as sort of their boogeyman to

draw comparisons against.

And

so I know that also Airflow itself has not remained static. There have been a number of iterations, and one of the things we'll talk about today is the upcoming version three release. And so I'm wondering if you can

give your sense of the position of Airflow in the ecosystem,

some of the lessons that it has learned from its competitors, and some of the inspirations that it has provided to the overall data ecosystem?

Yeah. That's a really great question, Tobias. Look. I I think we're always just kind of focused on Airflow's success. We we try not to spend too much time worrying about what the other folks are doing. Kindly like so much of our business and our focus is just in growing the airflow market and making sure that we're servicing that market with commercial products that add value. But all signs for us and and all of our engagement with the community says that airflow is only continuing to get more popular

over time and that the community is just continuing to grow. We actually just released our

airflow kind of state of airflow 2025 report yesterday.

Had over 5,000 respondents. It was a five x on in terms of just pure volume as last year's survey. And we just are seeing a lot of adoption of the community. Download counts continue to go up into the right. I'm fairly certain we had about 32,000,000

downloads on PyPI in the month of December kind of closing out the year, and that's up from a much smaller number several years ago. I I don't wanna I don't wanna cite something something wrong here. But you can go kind of look at the PyPI stats yourself and and draw your own conclusions. Airflow is also the most contributed to Apache project of all time at this point. It has the most contributors

just from a pure kind of GitHub contribution perspective.

So the community is still incredibly vibrant. And as we look at Airflow three and think about the next generation of Airflow, we certainly do look at, really what is required to continue supporting the needs of data engineers, things like better data awareness, better abstractions for event driven architectures and scheduling. And that really motivates the way we think about driving the future of the project. But, again, like, that really is done in tandem with our customers and our community members. More so just focused on solving their problems, not so much focused on kinda competitive pressures or decisions.

I also forget exactly what the number is, but airflow came out a number of years ago at this point. If it hasn't already reached it, it's gonna be pushing a decade. And, obviously,

the overall data ecosystem

has changed dramatically since then, especially in the past year or two with the introduction of generative AI and all of the new data infrastructure requirements that that brings with it. And so I'm interested in understanding

the ways that Airflow itself has grown and evolved to account for the underlying shifts in the ecosystem

that it is put into the center of to manage the actual health and well-being of those different platforms and the infrastructure and systems that they support. And when you when you say different platforms, are you specifically referring to kind of generative AI platforms and applications?

No. Just generally speaking, the the data platforms that businesses are building and using Airflow to maintain the health and well-being of making sure that all the data goes where it's supposed to go. So it's it's funny you bring that up, Tobias, because we've seen this, like, very, very interesting trend. I I've been working with Airflow for almost ten years now and, you know, I think you're right. It has been a decade of of Airflow at this point. When we first started engaging with data engineers in the Airflow community, a lot of the discussions we were having were about analytics. It was really about how do I write a data pipeline that, you know, does some kind of data transfer from one of my SAS APIs into my data warehouse so I can build a dashboard of some kind. Like, inter very internal reporting heavy. We've seen this trend,

kind of more broadly over the last several years where a lot of data infrastructure is now powering more critical stuff for these customers than internal reporting dashboards. Not to say internal reporting dashboards are not important. I use them every day in in my job at Astronomer. But we're talking

tables that are embedded in applications for end users to consume such that if the airflow pipeline goes down, there's, like, a product

customer facing SLA breach or things like regulatory reporting for highly regulated industries like, sports betting, gambling,

gaming,

banking,

financial services, etcetera.

Things like machine machine learning models and all of the kind of operationalization

of the training process.

These are the types of things that we see people increasingly leaning increasingly leaning into airflow for. And as a result, the position of airflow and orchestration has just become much more critical. So, like, kind of this whole focus on reliability and the connection to this broader data platform has gone up in the last several years, and we really felt that pressure from our customers. I we have this kind of joke that we say internally. You know? It's it's not,

we certainly are kind of p a p zero service for all of our commercial customers. If if people are working with us, it's also because it's often because they're running airflow for some very critical process that is important to them. Meaning, like, if airflow goes down or if a strontium goes down, there's a really, really big problem. That's not a privilege we take lightly. So primarily for us, this whole trend moving from analytics into operational data products

has been really interesting to see because it's just increased the criticality of data engineering workloads and the work that data engineers are doing every day. Now that trend is obviously fueled even more by the rise of generative AI. Now everybody is trying to build something with generative AI and leveraging their domain specific data to have some level of competitive advantage

even even without like getting to fine tuning or actually doing anything fancy or on top of foundation models besides pure retrieval on documents that are proprietary to the company. And we see Airflow community members really leaning into Airflow as the batch orchestrator

of choice for these Jet AI workloads. You know, Airflow is already kind of battle tested and proven for these types of batch oriented workloads. So So when it comes to things like,

chunking and embedding documents and loading them into a vector database for some rag application, airflow is the standard and is the default. When it comes to doing things like batch inference, summarizing,

sales calls was one that one of our customers mentioned earlier this week. They process all the transcripts of their sales calls and send an executive summary every week via an airflow deck that effectively,

you know, pulls in all that data. It calls the LLM

and then produces the result and sends an email to all the executive team letting them know what trends they should be looking at in their sales calls. This is kind of an entirely new category of workload that is, again, just increasing that focus on airflow and on that data platform. So it's very exciting time to be engaging with this community and, something that we're very focused on in airflow three and beyond, building the right kind of abstraction to support more of that. Now digging into the

Airflow three release,

obviously,

major version numbers signify some significant amount of change. We just discussed some of the ways that the underlying Davy ecosystem is changing around it. So wondering if you can give some of the main takeaways

of what you've learned over the course of the version two life cycle and some of the key capabilities that you're looking to introduce with version three? Yeah. I know. It's a great question. So I think people kind of fell in love with Airflow because of its simplicity, this kind of deterministic,

we oriented

way of defining workloads. Right? It was, like, very, very process oriented.

Naturally, one of the big pressure points in the community was to add more, more data awareness. I connect that deterministic

process oriented workflow to the actual tables and assets that are being manipulated

by those processes.

So a lot of, the focus on airflow three for for those use cases was really centered around bringing some level of data awareness into the core orchestration engine and introducing the concept of assets and airflow and and everything,

everything going on there. Additionally, we've we've made some significant architectural improvements to decouple the task execution interface from the core orchestration engine. That's allowed us to do two really interesting things.

One is actually run kind of remote workers for customers rather than closely coupling the workers with the scheduling layer. You can have a lot of flexibility in your compute layer. So orchestration generally sits as one control plane on top of a very highly distributed computing environment. And many of our customers and community members wanna be able to schedule across AWS e c two instances, across servers they have running on premises, a Kubernetes cluster that they're rolling out on OpenShift. And this level of flexibility and optionality

allows customers to find that perfect balance between convenience and cost and security so that they can plug in more flavors of enterprise workloads into the airflow engine. The third category of thing is we're able to actually introduce multi language support at this point. Airflow has always been very Pythonic.

And by decoupling this task execution interface from the core orchestration engine, we're able to build SDKs for languages like Go, TypeScript,

Java, and that's gonna be a big focus of ours over time. So, you know, again, a lot of the way we think about this is Airflow at its core is a really incredible

batch orchestration engine. It is running some of the most critical workloads in the world for some of the biggest companies in the world at this point. So it does have those miles at kind of the enterprise scale testing end of the market. And what we really wanna do is make it such that more flavors of workload can be deployed to that engine. And that's simultaneously a technology problem, but it's also really an interface problem. Like, developers needs to be bet with the right SDKs and abstractions that they're comfortable with. We don't want to go to, like, a new persona and say, hey. You must now learn how to write Python and how to how to know the airflow DSL if you wanna actually, like, schedule some interdependent processes together. So we wanna make it much more accessible for a broad broader audience of people. To that point of the audience

and the growth of the number of people who are interfacing with it, data engineering

has become a very diffuse responsibility where, for a while, it was the DBA became the data engineer. It was, you know, largely a change in title, but not a change in role. As data has become more operationally significant and found its way into consumer facing applications and every other aspect of business, the responsibility

for data has become more than just the DBA or the data warehouse engineer or the business intelligence engineer.

So to that point of meeting people where they are, how are you seeing that persona of who's actually building with and for airflow

evolving as we grow the number of people who are responsible for the data of an organization?

Yeah. So there's this like, one of our gospels at Astronomer that we share with, like, all of our new hires was this post that Nexseem wrote, I think, in 2017 called the rise of the data engineer. I know, like, kind of data engineers have been around for a long time, but that was kind of his classification

of what

the new kind of cloud native data engineering profession really looked like. And he defined, hey. Data engineers historically lived in these vertically integrated ecosystems and did a bunch of kind of WYSIWYG drag and drop work. Data engineers are now looking much more like software engineers where they're defining their data pipelines and processes as code. They're version controlling those assets in the same way that they version control their software applications.

And I do think that was, like, a very big shift

in the industry.

Data engineering really moved into this world that looked much more like software engineering. That I think that definitely took off. Like, early days of the company, we spent a couple years, like, convincing people that kind of pipelines as code and data engineering as software engineering was the right way. We no longer had to had to kind of sell that future pretty quickly because a bunch of people really leaned into it. Now the way we've seen the data engineering and kind of core data platform profession evolve over the last couple years, Tobias, is what we found is as data engineering centers have really built this center of expertise and excellence

in the data platform at large, they really have become the bottleneck to production for a lot of different kinds of data workloads inside of the four walls of a company. So many as data has become more strategic, everybody wants to get involved with AI or or or deep advanced analytics projects. There are many different kind of stakeholders in the data ecosystem in the company than there were ten years ago. You have this whole kind of category of analytics engineers that are largely living in, the world of dbt.

You have data scientists and machine learning engineers who are living in Jupyter notebooks and r and and kind of scripts running on Chrome locally. You have a bunch of folks really focused on data ingestion and management. And then you have a bunch of folks who are actually much more comfortable living in kind of lower code interfaces that might be programming DAGs, but in something like Zapier

or, some other kind of, more accessible consumer oriented workflow tool. And what we found is as the needs of those downstream stakeholders teams have evolved, the data engineering team tends to be the control point between them and kind of the production infrastructure. So generally, if the data engineering team is the one responsible for productionizing

DAGs, what ends up happening is they have an analytics engineering team throwing DBT models over the wall to them and saying,

hey. I need to run this model every night at midnight so that my tables are fresh in the morning.

And then it's on the data engineering team to really reason about, okay. How do I take this this DBT job, turn it into an airflow DAG, put it in production so that it actually is part of our core production scheduling system. Same thing with, like, a a Jupyter notebook. A data scientist might say, hey. Here's a notebook. I just need it to run every night. Schedule it for me. Now the data engineering team knows that, like, scheduling notebooks arbitrarily in production is is usually a pretty bad idea. So they do a bunch of work to translate that notebook code into something that's a little bit more, more rational

and putting it into that production engine. And as a result, a lot of data engineering teams are, like, totally underwater and inundated with tickets

because they're servicing all of these other internal stakeholder teams, and that actually, like, stops them from doing more strategic high leverage work for the business.

So when we think about the next kind of wave of data engineering, this is obviously gonna be assisted by large language models, the productivity improvements that we can see in just data and software engineering workflows with the, embedding of of LLMs.

We want to make data engineers

an enabler

and stop them from being a bottleneck. So allow those teams to all self-service that core orchestration engine, whether it's via developer abstractions for dbt. We have a library called Cosmos that we released that actually

solved this problem in spades, and we saw a lot of viral adoption in. Whether it's meeting the data scientists and machine learning engineers where they are or whether it's meeting any of those other downstream personas where there are some of the data engineering team can set guardrails,

but not actually be standing in that path to production as they are today. And we very much view that as kind of the highest value thing that we can do for data engineering communities these days, and it's where we would like to take that profession and persona. And I think it's it really is posing an exciting opportunity. So this is something that we talk to all the time about our customers because, you know, often we're working with that core data engineering team, and they'll call us up and say,

hey. I I can't free myself to do free myself up to do higher leverage work because I'm I'm so underwater with, you know, productionizing work from we we we even have a term for it, OPC, other people's code, and, I wanna go

do more strategic work, so that's a big focus of ours. One of the constant burdens for data engineers

in a number of situations also is the fact that they are responsible for the underlying infrastructure that their pipelines are executing on. So there there's a wide swath of the stack that they have to be familiar with to be effective. So they need to be able to understand how their VM's working or how Kubernetes operates, how to deploy the airflow and the the associated DAGs, the different other pieces infrastructure that they're managing, maybe the data warehouse.

And to your point, that can consume a lot of attention and time that could otherwise be spent on more high leverage use cases,

which is where your Astro project comes in, where you try to abstract away more about infrastructure management to let them focus just on this is the piece that you care about. These are the DAGs. These are the data flows. This is what you actually want to be doing with your time, not figuring out why the TCP stack in your Kubernetes cluster is crashing or whatever the weird obscure bug is that they might be running into. So I'm wondering if you can talk to some of the ways that you think about how much of that to abstract away

while also

maintaining some of the escape hatches for the case where that engineer really does need to go down to, you know, tuning kernel parameters to optimize throughput or whatever it might be. Yeah. It's, so, like, I think what you're highlighting, Tobias, is, like, the data engineering abstraction tree is very complex.

Like, if you go have ask a data engineer what they have to deal with on a daily basis, they might say anything from, like, a very low level networking bug to, like, something

very deep in Kubernetes

to some credential thing to some Python dependency thing, like, they might be living in in Python requirement hell, all the way up through, like, data like, table column schema level data quality issues. And, like, everything in between there, there's a lot that can go wrong, and whatever can go wrong will go wrong. I I recently sat in on our internal data teams off-site, and they very much modeled a lot of their work as an iceberg where there's a small amount of things at the tip of the iceberg that, like, people see, like, whether that's a dashboard or a report or a table. And then below the line, like, kind of under the surface, there's all of the supporting work that they need to do to keep those dashboards afloat, so to speak. And when we think about where we would like to take the data engineers

in the future, we really want folks to just be able to think about the job that they're trying to accomplish. Like, in this in in some cases, it might be, hey, I have a data product I wanna build that's gonna power some product feature and also and, like, not deal with the things they don't wanna deal with. Infrastructure management is, like, the easy one that's very easy to conceptualize because, like, data engineers wanna write data pipelines, not manage servers. So, like, we can definitely easily take the server management piece away. But I think there's actually even much more we can do. But to your point, our goal isn't to make it all a black box

because we're often dealing with folks that are highly technical, that know the internals of the system very well. So we have to expose the right level of kind of observability and monitoring

such that if something does go wrong, people can go do root cause analysis and and make sure that it works as quickly and as efficiently as possible. We've actually built a bunch of, like, kind of intelligence into our commercial product to help with that RCA flow

to make it much easier for users and customers. But the general way we think about it, Tobias, is we want data engineers to focus on actually the above, like, kind of the above the line things at the tip of the iceberg

and offload the below the line kind of under the surface things to us as a vendor and partner. Such that they can focus on the high leverage activity and not worry about the undifferentiated

heavy lifting. You mentioned observability

as one of those key capabilities.

You recently released the Astra Observe

product suite for giving some of that visibility into what is actually happening in your data flows, what are the alerts that you need to be aware of.

I'm curious how you think about

the golden signals, if you were, of what are the things that are actually useful, how do you address the signals to noise ratio that is always a tricky balance to manage, and some of the ways that you guide

your customers into the pit of success as it were.

So maybe I can just give some context on, like, why we built AstroObserve to begin with as a starting point. So, historically, I mean, we've always been kind of the airflow company, and we run airflow for a lot of really good customers. I mean, read about a bunch of them on our website, but everyone from Marriott to, the Texas Rangers and and everybody in between.

And one of the things that we observed as we started to get really deep with all of those kind of customers that were using us to run their airflow is that every single one of them had cobbled together a house of cards to support data product reliability in

reliability in some way. There was like a mess of data quality tooling, data lineage and observability tooling, third party alerting systems and data cataloging

that people were kind of duct tape and chewing gumming together,

so to speak, to actually, like, sleep well at night and understand that the table that they care about is gonna be up to date in the morning and the way that they think it's gonna be up to date in the morning and of high quality and all that good stuff. It even got as extreme as, like, working with, one of our really big customers on a system that they had rolled out internally to, like, checkpoint all of their airflow DAGs to make sure that, like, if anywhere in the process was delayed in some way, they would be alerted immediately. Like, they would at channel their whole data engineering channel in Slack and let them know, hey. Like, this checkpoint ran a minute later than it was supposed to. Like, everybody drop what you're doing and look into this right now. And there are a bunch of different examples of kind of how folks have built out those stacks in in a very fragile way. So when we started thinking about how we could add incremental value to our customers and started getting a lot of feedback from folks, a lot of the ask were around better observability. Hey. We don't want to have these third party data observability, lineage,

and alerting tools just to make sure that our tables are delivered on time and at a high quality. So AstroObserve's

first kind of tone waters for airflow users. We want you to be able to have all of the airflow centric data observability and data quality tooling that you need to ensure that all of the important data products that you're working on are delivered on time and and at a high quality. One of the great things about AstroObserve is you can use it whether you're running your flow with Astronema or not. So if you're an open source user and you just want better data lineage across your airflow dags and airflow deployments

or you want better alerting extractions on top of your data quality metrics,

you can plug right in. It's very easy to get started with. But this kind of allows us to extend beyond purely just managing airflow for customers and kind of just taking care of that infrastructure management problem. And now talking more about how do we actually help you with those above the line

things that you're worrying about every day and, making sure that you're getting a lot of really great value on top of just the PureFlow management layer at the layer. So that was a lot of what kind of spawned the project. And we're really excited to take it as we're we're rolling out what we call an insights engine into AstroObserve

because of that complexity and the data engineering abstraction tree, a lot can go wrong. And there's a lot we can proactively recommend customers do based on signals that we see. One example is if we see, like, an airflow task that has a high standard deviation and it's run time, that signals to us that there's that might be, like, the weak link in the chain, and it might put your downstream asset at risk. We can give you an insight that says, hey. There's a high standard deviation in this task run time. You should probably do these things at the compute level to actually tweak

your, execution pattern so that they're, you know, that that's a little bit more consistent. But it also gets even more advanced than that. We have, a bunch of really, really awesome,

generative AI and and large language model stuff running under the hood that does kind of AI power root cause analysis there. So if there is a task failure somewhere in the lineage graph, we can tell you exactly what went wrong and where and what to do about it. So we're pretty excited about this. We already have some great customers using it, and, we'd love to talk to anybody that is out there that is interested in better observability for their airflow. Observability also can be very narrow and focused, in this case, on airflow specifically. It can also be very broad where you need to know what is happening across my entire suite of systems and services, both from front end applications

to back end data warehousing

to my orchestration engine to my business intelligence.

And there are a large number of different observability tools across those different problem domains and numerous just in the data ops and data observability domain specifically.

And so I'm curious how you think about the role of AstroObserve

in the broader case of data observability and maybe some of the ways that you are able to

either hook into those other systems so that people don't have to pick one or the other and also maybe some of the ways that you look to provide

visibility into the broader scope beyond just airflow? Yeah. No. That's that's a great question. You're right. There are a lot of tools out there.

Generally, like, we're we have very friendly integration strategy. So, like, if our metadata is valuable to third party observability tooling, we we do tightly integrate there. But where we can deliver, I'd say, differentiated value with Astro observe

is at this intersection of observability and operations because we're actually running data pipelines and airflow deployments for,

a lot of a lot of customers.

We are able to take that observability data and act on it rather than just sending you an alert that says, hey, you had a data quality breach or an SLA breach here. We can go be proactive there and say, hey. We're gonna scale up the underlying nodes. We're gonna give you more memory on this task that failed to guarantee that your table gets delivered on time. So for us, it really is about

solving for that intersection between observability and operations

where we can marry the metadata that we get with the actual,

infrastructure management and kind of task run time

that we manage for our customers.

A good example of this is one of the kind of key features of our observed product is our ability to monitor

credit consumption of the underlying data warehouse. So if you're running Snowflake queries or BigQuery,

queries or Databricks jobs from airflow,

we can actually do cost attribution for you and say, hey.

This is how many tasks you have that are actually amounting to this many dollars in Snowflake. And we can see things like, table duplication

or,

you know, if, you're running a bunch of jobs that are producing tables that are not consumed by downstream

parties.

And we can take action on your behalf if you would like us to that can actually just save you money on the underlying compute layer.

So this is the type of thing that I think is very interesting to us as we look ahead,

really uniting that observability and operations layer. Beyond observability,

beyond abstracting away the infrastructure so that engineers can focus on the data flows that they care about, what are some of the other

aspects of the Airflow ecosystem and enhancements to it that you

are currently offering through the broader Astra platform and some of the other ways that you're thinking about evolving it to add even more value to your customers?

Yeah. Without without maybe giving you too much of, like, an astronomer commercial here. Like, we generally think about our product value in three categories, build, run, and observe. On the build side, we have a bunch of really great developer experience, abstractions that make writing, testing, and deploying airflow codes really nice. We have a CLI tool. We have a full GitHub integration,

and we have an in product IDE

as well that allows you to very easily construct and deploy airflow tags and test them as well as a kind of execution back end that we built.

On the run side, we we've built this proprietary component in our system called the Astro hypervisor

that actually makes the unit economics of running your float scale much better. So we can auto scale things much more aggressively. We can look at your gag schedules and make decisions around what infrastructure should be running and what

what time so that you're not running any idle resources and wasting money. And on the observed side, that's kinda what we just talked through, these

broader data quality and lineage abstractions for, just having more reliable outcomes on your data products. So I hope that's a quick overview. I don't I you know, if if you wanna go in more depth, I'm sure you can learn more at Stronger,

Stronger.io,

but I don't wanna, you know, beat your listeners over the head with an ad here.

Airflow itself has established a fairly substantial ecosystem around it. There are a large number of plug ins, tooling.

People have invested a lot in their

development around and on top of Airflow.

With a major version upgrade, there is always some amount of churn in terms of breakage

and migration paths. I'm wondering if you can talk to some of the ways that

you have tried to mitigate that and some of the ways that teams who are building on top of airflow need to be thinking about how to evolve those capabilities and maybe some of the some of the types of tooling and plug ins that are obviated by the improvements in the underlying Airflow framework?

Yeah. It's a great question. So I think, you know, part of the great piece of on Airflow three that I actually didn't mention earlier is there's a lot of, like, really great community requested features as well independently of some of, like, the more low level,

architectural changes that we've made. We have things like DAG versioning, brand new UI with dark mode and all the great things that people wanna see, a much better backfill interface,

and and and a lot more in there too. So, generally, we see a huge appetite from our customers to get to Airflow three as quickly as possible. And we have a lot of experience with these major kind of data infrastructure

upgrades. We've been doing it for our entire tenure as a company. So when we rolled out Airflow two, we kind of ran into the same question. How do we make sure that we're well prepared to upgrade the community and upgrade our customers? So there's been a lot of work done in, kind of upgrade checks, both on the open source kind of lending side, a lot of, like, kind of static code analysis stuff both around rough. And we also have a full, like, professional service team at Astronomer that helps with, like, the more critical surgical upgrade processes. As you can imagine, if you're, a company like Ford who's using airflow to train all their self driving models, like upgrading isn't as easy as just pressing a button and hoping it works. Like, there's it's it's a pretty surgical process, and, we're very welcome to do so. So I think the the short of it, Tobias, is people really wanna upgrade to Airflow three because of these long awaited features and a lot of the great stuff coming in the release. There's a bunch of open source technology that will help with some of the more basic upgrade checks. And if you have kind of a big system upgrade, we have a lot of expertise in house that can help out with that. So And as you have

been building and growing and maintaining and contributing

to the Airflow project and community and the Astro platform and product, what are some of the most interesting or innovative or unexpected ways that you've seen that combination applied? Yeah. As I mentioned earlier, we, you know, we when we started the company, we were working on a lot of dashboards. Like, we we'll be doing a lot of services projects to do reporting. And as we've just matured and gotten in front of more customers, like, use cases can tend to surprise us. The Texas Rangers use us to do a bunch of low latency in game analytics. They actually used us this season. They won the World Series, which was pretty incredible. They had a my cofounder and a few of our team members down to their ballpark last spring and, like, let us try on the rings and everything, which was pretty which was a pretty special experience. But they look at, like, hitter and pitcher mechanics to predict how the batter should think about going after the pitchers in the next inning and how they should be positioning their infielders and outfielders. It's, like, definitely, like, a very cool consumer oriented. I'm also a big baseball fan, so, like, that was a cool one for me,

baseball fan, so, like, that was a cool one for me, use of data. Another really large customer of ours, who I unfortunately can't name to all this recently, though, that they're running their payroll on Astronomer via Airflow DAGs. You know, payroll, I suppose, especially in a company with variable compensation models is in itself a data product. So that's, like, you know, a very critical thing that is much bigger than just a dashboard. This is one of the amazing things about building infrastructure companies, honestly, because you build this platform that has all these very programmable interfaces on top, and then people very much surprise you with what they do with it. They see a very flexible orchestration engine for taking complex

interdependent

Python jobs and deploying them at scale.

And they say, oh, wow. There's a lot you can do with this. Right? Like so as we continue to extend the platform, you know, I'm just so excited to see what else we can extend to. One other one that actually came up last week with another customer of ours was folks that are using LLMs to seed a bunch of their community forums. So they have a bunch of kind of equivalents of subreddits

on their platform.

And every Monday morning, their community team shows up to work and has a bunch of suggestions for, like, seated conversations in their subreddit equivalents

that are all generated via LLMs that are taking in a bunch of historical context from discussion on these threads in the past week and say, hey. This is something that's likely to get engagement. Do you accept it or do you decline it? And that's actually allowed them to get much more efficiency out of the team and just do a lot of content production that's at actually a very high quality. The level of quality that these elements are able to produce is is really kind of astounding. So as we look ahead and think about what can folks do at the intersections of large language models in airflow, that's where things get really interesting to us and exciting. As well as being able to use those large language models to ask what is happening in my airflow and why did it do that? Of course. Yeah. Yeah. That's I think that's that's table stakes.

But yes.

And as you have been building the company, building the ecosystem and community, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?

Yeah. You know, I I have so many that I could share here. Like, I feel like this could be a podcast episode in and of itself. I don't think it would, like, cater to the date to, like, the data engineering podcast audience necessarily, but I think, a lot of companies are really in the midst of this big kind of push for tooling rationalization

right now. And I think earlier in the astronomer life cycle, I personally I'm a I'm an optimist, you know, I think to be an entrepreneur, you have to be a bit of an optimist.

I very much, you know, believed that a lot of the tools and really awesome ideas and companies that entrepreneurs were seeding were gonna have a place in the market and that there were gonna be big enough markets for all of these tools to really go build good businesses. And what we're seeing in the market and hearing from our customers is that likely isn't the case

as a bunch of buyers and teams are pushing to rationalize their spend and stacks

and finding that

there's a big difference between the tools that are p zero, I. E. If they shut down tomorrow, like, very bad things happen. Like, it's like a it's a very critical thing. Someone, like, gets fired, whatever. And p one tools that if they shut down tomorrow, maybe it's an annoyance, but there's not, like, a huge problem for the company.

And what we're hearing from a lot of our our customers is those p one tools,

there's a lot of scrutiny of whether or not they're required

anymore. Do we need this thing? If it goes away tomorrow, how bad is it really? And is it redundant to functionality that we're getting from the core platform? So I think one of my big learnings is people's buying patterns and technology usage patterns normalize on a long enough timeline.

And generally,

there's an appetite to simplify. Like, there's everybody wants to kinda try to decrease the entropy of the universe on a long enough timeline. I mean, I think it's to be determined how effective that is, but we certainly feel that inside of our customer base. And a lot of kind of what's happening in this broader data ops market and the consolidation of these subcategories

is gonna be a big thing to watch over the next year or two. As we've mentioned, there is a broad range of

options for data orchestration,

how to build your data platform.

Everything in the data ecosystem wants to own its own little piece of orchestration

from

DBT

cloud to Airbyte to Snowflake

to every other system that you can imagine.

What are the cases where for people who are trying to design their systems, you would advocate that Astro or Airflow is not the right choice and maybe they should just use their built in orchestration

for that platform or some other framework

to suit their needs. Yeah. I think, like, if all of your orchestration needs are, quote, unquote, local, it's a really good idea to start there. Like, very commonly, we have somebody we have folks coming to us that have

started by building Fivetran or Airbyte ingestion

process and then a bunch of DBT models. And what they realize is they need broader state management across the entire life cycle. So for us, that's very much an integration story. Like, we use Airflow in tandem with Fivetran and in tandem with DBT

because then you have kind of that broader observability and state management across the entire life cycle. That's actually, like, a very small example.

The reality of enterprise data ecosystems is that they are

very messy. You talk about entropy. Like, the entropy of enterprise data ecosystems

grows faster than the entropy of the universe in many in many ways. I I was looking at our own data catalog recently. Our, like, it's very small five person data team at Astronomer

has a data catalog with 13,000

assets in it. And it's like that is the type of scale that we're talking about as we get into a lot of these discussions.

And it's not just limited to their next generation Snowflake or Databricks kind of lakehouse architecture. There's all of this legacy kind of all of this all these legacy services that need some broader integration into the next generation strategy, whether they're running on premise in a SQL Server instance or in Hadoop. The reality of most enterprise data ecosystems is local

orchestration is often just not an option. But I will say if you're starting a entirely new cloud native greenfield project and all you need to do

is transform some tables in Snowflake,

then local orchestration there actually is a great option and a great place to start. And Airflow really and and kind of this broader,

I'd say, horizontal orchestration comes in when you are starting to deal with a little bit more complexity in the data stack. Same thing with with a tool like Airbyte or Fivetran on the ingestion side. If all you need to do is take your Salesforce customer object and replicate it to your Snowflake instance every night, that's a really great option and a really great place to start.

Now where it gets complex is when you need to, like, figure out how to enrich that data with data from servers that you have living in your Hadoop ecosystem. That's where I think, we see a lot of data teams leaning into Airflow as the right answer. So I hope that answers your question. Generally, I do think that there is a place for these purpose built orchestrators

in their local context. And where where we see Airflow and Astronomer playing is as a very strong partner and integrator into those kind of subcategories.

And as you continue to build and grow and evolve the Astronomer

product suite and contribute to the Airflow project and ecosystem, What are some of the things you have planned for the near to medium term or any particular

projects or problem areas you're excited to explore? Yeah. I think I think the the way in which LLMs disrupt data engineering workflows is so fascinating to us. Because, like, you know, I think we're certainly, like, there's no question that there was a huge hype cycle over the last, you know, two years around LLMs because everybody starts to piece together,

like, how they're gonna change the future of our work. I mean, I can speak for myself. I use tools like Cloud and ChatChapiti

every single day in my daily workflow at this point. That is a fundamental platform shift that changes the way that I interact with technology as like a PM and product leader. Now the same thing is obviously happened with software engineers as well.

Like, tools like Cursor have taken the market by storm, and Cursor is, like, incredible. I I use it when I can and when I get some free time to write some code. Now I think data engineering is really interesting because what we found from surveying all of our customers is that in data engineering, like, what we call context

really matters in a way that it doesn't necessarily for software engineering applications.

Just for example, it's only you're you don't get that much value out of just going to GPT four and saying, hey. Write me an Airflow DAG that does some modeling in Snowflake, which it certainly can do. It can produce you, like, a fine Airflow DAG that gives you some boilerplate to work around with. But the challenge is that's decoupled from the context of the rest of your data platform. What

is the Snowflake schema that I'm supposed to be working in per the guardrails of my organization?

What really does that upstream object look like

for me personally? So in in data work, it's so broad and it spans so many systems. So the context of how your organization

has determined it wants to work with these systems is super important for the model to know. And I think the gap between

data engineers today using tools like ChatGPT or Claude to generate

data pipelines and remove boilerplate

kind of authoring. And the next generation is making sure the models have access to the to the data platforms context, all of its connections, all of the custom patterns that you've built to interact with these systems, all of the guardrails

that you've embedded in those patterns. And Airflow, those take the form off of what are called custom operators,

where people hard code certain parameters and say, like, hey. This is how you must interact with the system. You must upload documentation here, here, and here, etcetera. And that's really exciting and interesting to us because at Astronomer, naturally, like, we have that context. We have all all of the metadata that we need to get feed models to make informed decisions about how they should be recommended data engineers interact with the platform. But that's kind of the next big focus for us, and we very much view that as the next big kind of transformation in the data engineering community. So You mentioned a couple of times already your own internal data team, your own internal data catalog. And I think one of the things I forgot to ask about earlier is what are some of the,

interesting ways that you are dogfooding your own product or some of the useful lessons that you've learned in that process? Yeah. We dog we use Airflow for everything,

as you can imagine. So, like, even things that, like, that, like, maybe you you might not traditionally think you should use Airflow for, we're we're using Airflow for it. One of the really cool things that we did in the last year was we actually started embedding some of our Snowflake tables in our product as a feature. So we have kind of a dashboarding tab in our product that you can go get a bunch of really interesting metadata about how your organization is operating with Airflow. Because a lot of our customers are running Airflow across many teams,

so having a bunch of centralized metadata on, like, how are these 50 teams operating? What are the most commonly used operators? Who's spending the most money? You know, who is actually has the most users deploying every day? That's all, like, analytics data that we have in Snowflake. And what we've done is via our partnership with the guys at Sigma have embedded those dashboards directly into our product. That actually and those dashboards are naturally, those tables are are loaded every day by our own internal airflow dags, of course. That actually created a really interesting cultural shift inside of Astronomer because all of a sudden our data team went from being only accountable for dashboards and reporting and maybe more operational data use cases to having a product SLA. So, like, if if the DAGs failed, like, we were violating our own SLA, like, three nines commitment to our customers,

and they were on call. Like, they were gonna get paged. And I think we've seen, like, a lot of that happen with data chains over the last couple of years where as the use cases for these pipelines have shifted,

the requirements and roles and responsibilities of the data professionals have also shifted. One of my cofounder, Viraj,

used to say that that the DAG is the new microservice in some way. All of a sudden, these data pipelines

are microservices in our core applications, not just kind of back office reporting pipelines that are doing doing different kinds of things. So that was a really interesting use case for us, mostly because of how it changed our culture and how it changed our own internal expectations of our data team. And it's been really fun to find more of those too. Like, our product now is powered by many different DAGs, and, we kinda have a whole dedicated, like, you know, production server running the stuff that powers our our product to customer facing features, and that's been pretty cool. So Are there any other aspects of the work that you're doing at Astro, the work that you're doing to contribute to and support the Airflow community and ecosystem,

and just the overall space of data orchestration that we didn't discuss yet that you'd like to cover before we close out the show? No. I thought my my reaction is we touched it all. If I think of anything, I'll I'll definitely shoot you a note to buy. So this has been super fun. Appreciate you having me on, and always fun to talk about data engineering and and airflow with, an awesome community member. I'm a big fan of the podcast too. Long time listener. So Thank you. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Look. I think the biggest gap is kinda how disjointed it all is, honestly. And, again, people have adopted so many different kinds of tools and point solutions for this whole kind of data ops life cycle. And what we're what we're seeing from our customers is they don't wanna do that. Right? We very much believe orchestration is kind of the natural consolidation point for a lot of these tooling adjacencies. Naturally, like, that's kind of our our world view. Because of how horizontal it is, it has full access into the full life cycle across the enterprise. It's not just local to Databricks or Snowflake or, ingestion. It's gonna be able to tell you everything.

Plus, it is the thing actually running these pipelines at the end of the day. So, actually, like, kind of that operational lever allows us to do some very interesting things at the intersection of observability and data pipeline management and execution

that you're not gonna be able to get elsewhere. But I do think

that generally, one of the biggest problems in the space, especially in the enterprise is tooling sprawl. And how are we gonna actually both simplify and rationalize while creating more value for the end user? I'm actually okay with tooling sprawl if it creates net new value for the end user. But I think we found ourselves in a place where we have a lot of tooling sprawl that is actually detracting and creating worse user experiences for the data engineer. Because if something breaks,

the data product

has a quality or SLA breach. Where do you look while you're now kind of traversing five or six different systems from the Kubernetes cluster to whoever your networking provider is to your kind of data quality tool and table column schema level tool to your alerting system, to your orchestration system, to airflow logs. And I think simplifying that is a major opportunity.

Just going from

actually needing to traverse that abstraction tree to saying, hey. Your table

is not in the state that you want it to be because we know what state you want it to be and when you need it to be in that state. Here's exactly what you need to do about it. And pulling away all of the complexity,

I think is where where this thing needs to go to create net new value. I just think the days of traversing five or six tools to solve one

problem are gone. And having to build five or six integrations just to solve that problem in the way that you want it solved. I'm thinking most notably in my own context about authorization

and authentication

where you can off people, but what are the permissions that they're supposed to have, and how do you actually apply that without making yourself insane by writing the same thing in 15 different ways for five different systems? A %. So believe it or not, this is a funny story. I'm I'm a bad engineer at this point. My team actually gives me crap whenever I, like, try to open up a a PR when I get some free time. But I did I I was involved in writing the first version of our hand rolled auth system. This was, like, prior to, like, really leaning into Okta and Auth0 for just about everything.

And the level of complexity there is so crazy,

and it's it's it's, mind boggling. So I I think that's actually a good good comparison and analogy.

Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on Astronomer and Astro and your contributions to the Airflow community. It's definitely a very

useful and important

member of the overall data ecosystem. So I appreciate the time and effort that you and your team are putting into keeping it healthy and vibrant, and I hope you enjoy the rest of your day. Thanks so much, Tobias. Appreciate you having me on.

Thank you for listening, and don't forget to check out our other shows.

Podcast.net

covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com

with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and colleagues.