Summary
The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing James Meickle about his experiences building a new Airflow installation
Interview
- Introduction
- How did you get involved in the area of data management?
- What was your initial project requirement?
- What tooling did you consider in addition to Airflow?
- What aspects of the Airflow platform led you to choose it as your implementation target?
- Can you describe your current deployment architecture?
- How many engineers are involved in writing tasks for your Airflow installation?
- What resources were the most helpful while learning about Airflow design patterns?
- How have you architected your DAGs for deployment and extensibility?
- What kinds of tests and automation have you put in place to support the ongoing stability of your deployment?
- What are some of the dead-ends or other pitfalls that you encountered during the course of this project?
- What aspects of Airflow have you found to be lacking that you would like to see improved?
- What did you wish someone had told you before you started work on your Airflow installation?
- If you were to start over would you make the same choice?
- If Airflow wasn’t available what would be your second choice?
- What are your next steps for improvements and fixes?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Quantopian
- Harvard Brain Science Initiative
- DevOps Days Boston
- Google Maps API
- Cron
- ETL (Extract, Transform, Load)
- Azkaban
- Luigi
- AWS Glue
- Airflow
- Pachyderm
- AirBnB
- Python
- YAML
- Ansible
- REST (Representational State Transfer)
- SAML (Security Assertion Markup Language)
- RBAC (Role-Based Access Control)
- Maxime Beauchemin
- Celery
- Dask
- PostgreSQL
- Redis
- Cloudformation
- Jupyter Notebook
- Qubole
- Astronomer
- Gunicorn
- Kubernetes
- Airflow Improvement Proposals
- Python Enhancement Proposals (PEP)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And are you struggling to keep up with customer requests and letting errors slip into production? Wanna try some of the innovative ideas in this podcast but don't have time? DataKitchen's DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and datasets while improving quality.
Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement today and sign up for the newsletter at datakitchen.io/de. After that, learn more about why you should be doing DataOps by listening to the head chef in the data kitchen at dataengineeringpodcast.com/datakitchen. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey. And today, I'm interviewing James Meickle about his experiences building a new Airflow installation. So, James, could you start by introducing yourself? Hey, Tobias. Thanks for the invite. I'm James. I'm a site reliability engineer at Quantopian.
[00:01:38] Unknown:
Quantopian is a crowdsourced quantitative investment company, essentially. We have our users from all around the globe learn about finance and write algorithms that can trade on financial data. As for myself, I focus on a lot of our back end data systems, because what you see on the Quantopian website is, of course, just the tip of the iceberg in terms of what we support. Prior to working at Quantopian, I most recently worked at Harvard, at the Center For Brain Science, where I was a site reliability engineer for MRI brain scan data. And I've been all over the place in other jobs, but generally focusing around the DevOps space. I'm also an adviser for DevOps Days Boston this year. And how did you first get involved in the area of data management? It's a great question. I've actually been involved for quite a while because my initial path in life was focusing on psychology and political science research. So even before I was a professional programmer, I was already managing data. I got a little bit away from that for a while when I first moved into full stack web development, but I was always interested in getting back to working with data, processing data. Actually, 1 of the first, you know, paid programming projects that I ever did was in the really early days of the Google Maps API. So I was working with a lot of geospatial data, making that easier and more accessible for users. And as you mentioned,
[00:02:54] Unknown:
you recently, I guess, completed the initial setup of an Airflow installation at Quantopian. So I'm curious when you first got started with that project, what the initial requirements were and what other tools you considered in addition to Airflow. I think for that story to make sense, you have to think of where we were as a company at that time. So the
[00:03:17] Unknown:
prior year, towards the beginning of the year, we had started trading our fund, as we call it, on the stock market using live data. And the nature of these algorithms is that they need a lot of financial data, and that financial data has to be updated day over day. We use data from a lot of different vendors that arrives at different times. We set up a series of cron jobs to do this data processing. And, as with most software projects, it was much more complicated than we had initially scoped. So what started off as a handful of reasonable crontabs sort of grew and grew organically until we had, you know, really hundreds of cron jobs across, you know, a few dozen machines probably. And that wasn't sustainable from a data management perspective. It meant that the site reliability team was getting large numbers of pages to address what were essentially just mistimings of data or data quality checks that had failed. We needed a better platform for managing all of that. So it was actually late in the year, after several months of working through this backlog, that our infrastructure-side issues reduced in frequency and data management issues became a larger slice of the pie. What I actually got as a task from my manager was to research improvements to ETL systems. I was given a very broad scope here. There was no declaration of a specific tool, and I was only given a handful of initial requirements. So the first thing that I did was set up a larger set of requirements.
What were we really hoping to get out of a tool's capabilities? And that was maybe 15, 20 bullet points, categorized. And I took those bullet points, and I did a pretty comprehensive analysis of the ETL tools that were out there. So I had a mix of open source tools like Azkaban or Luigi, more vendor specific tools, ones that may be open source but might be tied to a particular vendor, and even some totally off the shelf options. Like, we looked at the AWS options. I believe Glue is the name of 1 of the tools for this kind of ETL pipeline. And what really struck me in this head to head comparison, where I was looking at at least 10 different tools for ETL jobs, was that Airflow was really head and shoulders above the rest of them. Some of them were, you know, maybe a little better for a specific niche.
Like, Pachyderm looked pretty interesting for an on-desktop machine learning sort of DAG workflow. But when it came to our requirements of running back end data processing for, you know, potentially dozens of services and potentially hundreds of files per night, Airflow was really the right scale for what we needed to do, and it had the right level of polish and operational experience coming out of Airbnb that we would really feel comfortable trusting what's essentially very valuable financial data to it and making trading decisions based off of some of the files processed by Airflow. And
[00:05:56] Unknown:
so you mentioned that in the general case, Airflow had a much better feature set across the board versus some of these other tools that were more targeted at a particular niche or had a better fit within a given context. And I'm curious if there were any particular features of Airflow as a platform that lent themselves particularly well to your set of requirements or your operating environment or level of experience within the team in terms of their
[00:06:27] Unknown:
ability to work with the tool? Yeah. I'd say 1 of the core requirements out of that set that I drafted was that we were really hoping for a tool where Python was the language of choice and that it would be proper Python, so not like some domain specific language or just YAML configuration files. You know, I'm comfortable writing YAML, but I also know its limitations. And I know that when you're writing a data pipeline, you're writing essentially arbitrary code. So it made sense to use code as the native language there. And there are a few different tools that offer this, but with Airflow, they really go all in. The Python files you write are really just standard Python files. You can use any of the features of Python that you want. So that was very appealing. It was very appealing that it was battle tested at scale. You know, I could go find blog articles about how to make a high reliability Airflow instance. With something like Luigi, it's not that there was no content about that, but it clearly wasn't something that a lot of people were doing with Luigi. So that meant we would've had to build more of it in house. Likewise, the visualizations that come out of the box with Airflow were actually pretty important for us. When we're working with financial data pipelines, what can happen is they can organically grow very complex, because you have multiple teams working on linking these pipelines together. They may store data in implicit ways, like writing to an S3 bucket, and then another downstream task reads from that S3 bucket. You may not realize that those 2 tasks actually have a connection until it's too late to do something about it. And when you have tasks that are structured like that, having an operational dashboard is really important.
So you get that operational dashboard with some of these tools, but the dashboard for Airflow is like a fully featured web service. It was designed to be a web service from day 1, so it's not like something extra that was bolted on top of it. What that means is it has a full REST API. It now, as of the most recent version, has role-based access control. It has SAML login. So we have a lot of capabilities there that you might expect from a modern web application that just come for free with Airflow. And that's really important to us because 1 of the goals of our project was, and I'm just stealing Maxime's words here, to democratize data science. So at Quantopian, we have a lot of people who do things that either could be called data science or are adjacent to it, but we wanna open it up to even more people within the organization. And having a dashboard where someone, without necessarily having SSH access to a server, without having privileged access, can just log into a web dashboard, see the state of all the tasks, ask why tasks haven't run yet, and manually click to perform operations.
Those are all very powerful capabilities. Basically, that lets us take a team where we trust them to know the domain, and we trust them to know what sorts of capabilities the platform offers, and now actually trust them with the ability to rerun jobs rather than having to go through an intermediary. And that means the SRE team is less of a bottleneck.
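To make the "plain Python, no DSL" point concrete, here is a minimal sketch of what an Airflow DAG file looks like. The DAG ID, schedule, and task logic are hypothetical placeholders rather than anything from Quantopian's repository, and the import paths follow the Airflow 1.x layout that was current at the time of this episode.

```python
# A minimal Airflow DAG: an ordinary Python module that Airflow discovers and parses.
# The DAG ID, schedule, and task logic below are illustrative only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import paths


def validate_prices(**context):
    # Arbitrary Python is allowed here: API calls, pandas, internal libraries, etc.
    print("running price validation for", context["ds"])


default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_nightly_prices",
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="0 2 * * *",  # familiar cron syntax, but with dependencies and history
)

download = BashOperator(
    task_id="download_vendor_file",
    bash_command="echo 'fetching vendor file...'",
    dag=dag,
)

validate = PythonOperator(
    task_id="validate_prices",
    python_callable=validate_prices,
    provide_context=True,
    dag=dag,
)

download >> validate  # plain Python expresses the dependency graph
```

Because the file is ordinary Python, the dependency graph, retries, and scheduling all live in code that can be reviewed and tested like any other module.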
[00:09:23] Unknown:
So once you decided that Airflow was the right tool for the job, there are a number of potential design patterns that you can go with for airflow. It's a very sort of loosely defined framework for being able to work with data in whatever way you see fit. So I'm curious what type of a deployment architecture you ended up with and if you are focusing on a particular executor or if you're doing a mix and match where some tasks get run on maybe Celery and others get run with Dask or anything along those lines just to sort of get a picture of what the, topology
[00:10:01] Unknown:
and constraints that you're working with are? Yeah. Well, we knew from the start that we wanted to run Airflow in a more distributed manner. It's sort of a requirement for anything that we wanna deploy to have it be high availability. We ran into this problem all the time with crontabs, where we don't want a crontab running on more than 1 instance, so we would have 1 instance running the crontabs. And if that instance goes away, it causes all kinds of problems. It was exactly that type of, like, monolithic cron instance that we wanted to get away from. That was an explicit goal of this Airflow ETL migration. So coming at Airflow from a distributed point of view, it was a natural fit to choose the Celery executor, because that was the primary way of doing this back when we started, like, last December. So once we knew that we were gonna use Celery, we decided to use Postgres as the back end and Redis as the broker, and to implement the instances themselves using what's essentially standard for us, which is CloudFormation and Ansible. So we're not running dockerized Airflow at this point. We actually just have this on EC2 instances and auto scaling groups. So all of the instance types can be killed and will come back fairly gracefully, and we have that separated into the web servers, the workers, and the scheduler. The scheduler is the only component that isn't running concurrently, so we'll have at least 2 workers and at least 2 web servers at any given time. We only run 1 scheduler for the moment because it was a known issue that Airflow couldn't support 2 concurrent schedulers running. There were some ways that we could have adapted this to more of a pattern with an active standby scheduler.
But in practice, the kinds of errors we've seen with the scheduler are not ones that would have been helped by having a second instance anyway, so it's been less of a priority for us. And you said that 1 of your main goals is to democratize
[00:11:47] Unknown:
access to the data that you're processing, and that leads me to wonder how many engineers are involved in writing the tasks that are going to be submitted to the Airflow installation versus people who are just consuming the outputs?
[00:12:01] Unknown:
So I'd say, initially, it was focused on people who are consuming the outputs. So it's sort of like, I got the task to build this project, taking a very long term view of what this project would be for the company. So the 1st month of the project was really just researching, building out prototypes, comparing this to other tools. We wanted to be really sure that this was the direction that we were going to go. The next few months, after I built out, you know, prototypes and documentation, were actually spent running some real tasks on the cluster, starting with some of our staging data loads, as we call them. And, you know, when you have a really organic ETL architecture where the ETL jobs are broken across 7, 8, or more services per night, you really have to take a lot of care to make sure that you're running this migration in a sustainable way. So we've taken a very evolutionary approach.
So we didn't do a dramatic cutover where suddenly everything was running on Airflow. Instead, we've tried to start new projects in Airflow and migrate older projects 1 at a time, particularly when we can justify it because there's been some kind of outage that was Airflow related. What we found was that after a few months of having Airflow running the lower importance tasks, we started thinking when we got into an incident review, hey, this didn't have to happen. If we'd taken the time to put this in Airflow, then this wouldn't have happened. Or Airflow would have fixed this. Or if we'd had Airflow, then we would have gotten paged about this earlier.
So it's really actually demonstrating the capabilities of the platform and gradually getting buy in from the teams. So some teams actually are now essentially using Airflow and may not even realize that that's what's happening under the hood, because we might have set up that initial DAG for them. Whereas other teams are now actively writing software that is geared towards airflow. I think 1 of the most exciting developments is we do a lot of data science at Quantopian because, essentially, our platform is allowing our users to do data science. And we do that in the form of Python notebooks. So all of our users are writing, Python notebooks for analysis purposes, and we use them internally for analyzing what our users are doing, analyzing the performance of algorithms, really for, our internal business process. We're much more likely to use a Python notebook than something like an Excel spreadsheet.
So the upshot of this is that once you have something as a Python notebook, you might want to automate it. And as it turns out, the tools for automating Python notebooks are not very good right now. So this is actually where we're starting to see a lot of value in rapid iteration from airflow. We're working on our initial POC for that right now to take existing Python notebooks that we use for analytics purposes and get them into a much more stable and reliable format where we can run them inside of containers on an automatic basis with the most up to date data so that we don't need to rely on a user remembering to run this notebook so that we can really have our team members focus on writing the notebooks and less on maintaining the notebooks and running them on a scheduled basis. And in the process
[00:14:58] Unknown:
of your own research and learning and in bringing other members of the team up to speed on being able to work with airflow and write DAGs and be able to effectively adopt a set of design patterns that promote easy deployment and easy extensibility of the existing task structures. I'm wondering, what resources you leaned on and some of the design patterns and architectural conventions you have settled on in the process
[00:15:31] Unknown:
of building out this installation? I'd say that there's been some really great blog posts out there for, Airflow, some of them coming from Airbnb or from former Airbnb folks, slides from various Airflow meetups, and certainly a lot of corporate contributions. So Qubole and Astronomer stand out as 2 companies where I read, like, a lot of blog posts about how they do Airflow. And sort of taking inspiration from some of these more managed platforms or the more mature data science platforms in house. And I think it really helped that I had a lot of domain experience going into this because there were a lot of people using Airflow in ways that I think are not necessarily how it was intended to be used. And That was actually something we really initially came up with, was what projects were not going to be targeted for airflow. And we came up with, like, a pretty good set of definitions for, like, what is a job that looks like it's a good fit for Airflow. Because the way that Airflow is designed, it actually does have tasks that it doesn't even attempt to handle. Like, if you're using a streaming data system, Airflow, like, doesn't really have primitives for interacting with streaming. So if you're trying to shoehorn something like that into Airflow or something like really dynamic tasks that change a lot day over day, you're gonna run into some difficulties. And for us, that's not the case. Financial data structure actually changes, pretty slowly.
You know, if 1 of our vendors is going to be changing their data format, they might tell us, like, a year in advance that they're going to do that. So while the values of financial data change all the time and we need up to date calculations, we don't necessarily have a pipeline that's constantly changing parameters. We need to refresh the data, but we don't need to refresh the pipelines that often. Also, a lot of the pipeline work that we do is simply adding new tasks. Once we ingest some type of financial data, we typically keep on ingesting it for a very long time, make that available to all of our users. Because we're always expanding the data on the platform. It's very rare that we take anything off from it. And in terms of managing the deployability of the DAGs,
[00:17:21] Unknown:
as I've done research on various occasions to see sort of what the plug in architecture looks like or what some of the software best practices are for writing the DAGs themselves. I've seen some, mixed levels of maturity in terms of how people are handling the Python code that they're using for it or the deployment mechanisms that they're using for being able to update the software on their airflow deployment. So what have you settled on in terms of being able to deploy new resources, any sort of,
[00:17:54] Unknown:
testing or quality checks that you put in place to gate new tasks before they end up on Airflow or updated tasks? I think that's a great question, and I also think that this is a question that Airflow has answered very poorly, to be honest. It's 1 of the pain points for us. Just to walk through what we do: we have a repository that's essentially our infrastructure repository. That's mostly Ansible code, and it's mostly people on the site reliability engineering team who are contributing to it. It just tells you how to install a cluster and which DAG repositories to install. We'll take that cluster code, and we'll deploy multiple clusters in different environments. So we have 2 staging clusters, 1 production cluster, and then 2 trading-fund-related clusters. And those clusters might share DAG repositories.
So we have 1 repository. It's very cleverly called airflow-tasks, because it contains most of our Airflow tasks. And this repository actually gets run on both staging and production. So we actually just made the sync problem even worse. It's not just syncing within the members of a cluster; now we have the same DAG repository running on 2 clusters in 2 different versions. So the way that we approach this is we have a wrapper script. We just call it find_dags.py, and that actually does most of the work that Airflow is trying to do in recursing through your folders to find your DAGs, but it treats it as more of a factory function. So for instance, we'll pass in an environment variable with what the current Airflow environment is. So we'll know whether we're in staging or production, or we'll say whether we're currently running in unit testing mode or not. And that means the DAG file is not just a bare definition. It actually is more of a factory interface, where we have a function that you're supposed to implement that returns a DAG. So since that DAG is in a function, it makes it a little easier to test. You can also write a test function alongside the DAG with what the behavior of the DAG should actually be. That being said, once you have that file in the repository, you're still gonna run into some syncing issues. The way that we're doing this right now is in-place updates on the instances through Ansible. So a Jenkins build runs the test suite, whatever test suite that DAG has, and in practice, those are still fairly minimal at this stage, mostly just syntax checking. So if that test suite passes, we will start doing a push deploy to any currently existing staging instances.
And that actually has been a source of pain at various points. We've seen, some pretty unusual behavior there. Like, there are circumstances under which an old task definition can get stuck. There are circumstances where the update can succeed, for the most part, but not get picked up in the web interface fast enough. You might have to wait a few extra minutes for, gunicorn workers to die off. So we're actually looking at deprecating this approach. We kind of don't want to do these on file system updates. And we're looking at more of a Git polling approach, where each instance is polling against Git. And anytime it notices a change, it stops the current airflow services, updates on the file system, and then restarts the airflow services. It's from our perspective, all of our tasks need to be resilient to airflow service shutdowns anyways, and we don't have very high confidence in the in place DAG updates.
So that's what it's going to look like for us going forward: more of an atomic on-file-system update. That being said, I don't like the status quo, and I'm really excited about some of the discussions that have been happening on the mailing list. There's a good proposal out there for a DAG fetcher abstraction, where you would be able to write your own implementations for what version of the DAG should be accessible at a given time. And I think that this is going to be much closer to what most people want out of Airflow. What I'm particularly excited about is that it would open up the possibility to have a DAG fetcher return different DAGs for different points in time. This has been an issue for us.
We sometimes want to recalculate previous results with a new version of the DAG, and other times, we actually would not want to calculate historical results for a given task. That might be the case if, for example, we add a new task, where the data wasn't available before a certain point, or we change the definition of a task while keeping the same identity for that task. Those are cases where we actually would want to maintain history as it happened, and currently, Airflow can't do that. Every time you update a new version of a DAG, all historical results for that DAG are also displayed with a currently parsed version of the DAG. That can lead to very confusing history. Like, if you change the order of dependencies, like, if a task moves around inside of the DAG, sometimes you'll see nonsensical results in the past from before you made that change. So I think that this is a challenge that Airflow is working on, the development team, but I wouldn't be hopeful to see it anytime soon, unfortunately.
So for now, you're stuck with either on file system updates, or you have the pain of having to set up a shared file system. And with most of us coming from more of an operations background on the team, we knew that we definitely didn't wanna go the shared file system approach because there's a lot of ways that that can really harm your reliability.
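As a rough illustration of the factory pattern described above, here is a minimal sketch of an environment-aware DAG factory along the lines of the find_dags.py wrapper mentioned in the episode. The function names, the AIRFLOW_ENV variable, and the DAG contents are hypothetical, not the actual Quantopian code.

```python
# Hypothetical sketch of the "DAG as factory function" pattern.
# Instead of defining DAGs at import time, each DAG is produced by a function that
# takes the current environment, which makes it easy to parametrize and to unit test.
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator  # Airflow 1.x import path


def build_nightly_prices_dag(env):
    """Return a DAG whose id and schedule depend on the environment."""
    dag = DAG(
        dag_id="nightly_prices_{}".format(env),
        start_date=datetime(2018, 1, 1),
        # Only production runs on a schedule; staging is trigger-only in this sketch.
        schedule_interval="@daily" if env == "production" else None,
    )
    start = DummyOperator(task_id="start", dag=dag)
    load = DummyOperator(task_id="load_vendor_file", dag=dag)
    start >> load
    return dag


def test_nightly_prices_dag_structure():
    """A unit test can exercise the factory without a running Airflow instance."""
    dag = build_nightly_prices_dag("test")
    assert dag.dag_id == "nightly_prices_test"
    assert {t.task_id for t in dag.tasks} == {"start", "load_vendor_file"}


# A wrapper like the find_dags.py script mentioned above can read the environment and
# register the built DAG at module level so that Airflow's DagBag discovers it.
AIRFLOW_ENV = os.environ.get("AIRFLOW_ENV", "staging")
nightly_prices = build_nightly_prices_dag(AIRFLOW_ENV)
```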
[00:22:47] Unknown:
And 1 of the other issues that I saw when I was looking particularly at the plug ins but also at the DAGs themselves is the issue of dependency management for the Python code that you're installing, where there's the potential to package the plugins or even the DAGs as an actual Python package using the various mechanisms available there, or even just using a requirements.txt file for specifying, when this gets installed, install these versions of the additional Python packages that are needed. But in some of the examples that I was seeing, those were completely lacking, and it was just a matter of put the code in place and then somehow sort of intuit the fact that you need additional dependencies or have those specified elsewhere. So I'm wondering where you landed on that approach in terms of ensuring that you have the necessary dependencies
[00:23:39] Unknown:
installed for the appropriate versions of the plug ins or DAGs? Yeah. This is also a pretty bad pain point for us. We maintain 2 internal Airflow plug ins, 1 for the code that we hope to open source 1 day and 1 for the truly internal code. And I'm really not sold on the existence of Airflow plug ins. They seem to almost be more hassle than they're worth, to package code in the Airflow way as opposed to the Python way. So that may be something where we, in the future, just don't make them Airflow plug ins at all and just use them as Python libraries. I think the story there is really not good, and especially the way that Airflow tasks are typically executed. I'd actually go so far as to say that if you're really taking access control seriously, you shouldn't really be using things like the Python operators anyways. There's just so little isolation or guarantee there. It just seems like a recipe for making mistakes if everything is sharing the same Python process, the same Python libraries. It also leads to really bloated workers with, like, huge Python environments and long start times. So what we prefer, and what I really wish that I'd known we'd be this far along when I first started planning this, like, several months ago, is we're actually moving to execute almost everything in Kubernetes.
And at the time, we knew that Airflow might have some level of Kubernetes integration. But over the past 6 months, it's been really remarkable to see the progress there. And, you know, I'm not sure all your listeners know much about Kubernetes, but, essentially, it's just a tool for running large numbers of containers in a more orchestrated, systematic way. And as it turns out, Kubernetes is really great at running containers, but it's not good for specifying dependency management between containers. And Airflow is really good at specifying dependency management between tasks, but it's not good at executing tasks in a sandboxed and reliable way. So if you use them together, you can actually get the strengths of both, and that turns out to be a really winning combination. You can take whatever Python code you want and bake it into a container. In our case, this is actually pretty great. We have some data science workflows that use Python 2 code for the most part. Some of that has Python 3 incompatibilities, but then there are certain function calls that would benefit from Python 3 speed ups. So we can actually use a mix of Python 2 and 3 containers and orchestrate them together in the same workflow.
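As a sketch of what that combination can look like, the example below uses the KubernetesPodOperator from the Airflow 1.10 contrib package to chain two containerized steps, one built on a Python 2 image and one on a Python 3 image. The image names, namespace, and commands are placeholders, not Quantopian's actual tasks.

```python
# Sketch: Airflow handles the dependency graph, Kubernetes runs each step in its own
# container, so a Python 2 image and a Python 3 image can coexist in one workflow.
# Image names, namespace, and commands below are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

dag = DAG(
    dag_id="example_mixed_python_workflow",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

extract = KubernetesPodOperator(
    task_id="extract_legacy",
    name="extract-legacy",
    namespace="airflow",
    image="example.registry/etl-legacy:py2",  # legacy Python 2 code baked into its own image
    cmds=["python", "extract.py"],
    get_logs=True,
    dag=dag,
)

transform = KubernetesPodOperator(
    task_id="transform_fast",
    name="transform-fast",
    namespace="airflow",
    image="example.registry/etl-modern:py3",  # Python 3 image for the speed-sensitive step
    cmds=["python3", "transform.py"],
    get_logs=True,
    dag=dag,
)

extract >> transform  # Airflow expresses the dependency; Kubernetes isolates the runtimes
```

All of the isolation here comes from the container images that Kubernetes runs, rather than from Airflow itself.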
That's not at all something you could do easily with Airflow. You would have to do things like install separate Python virtual environments for the 2 of them. So in my mind, that's really where the future of Airflow is. I wouldn't be surprised if that's really what sells Airflow for a lot of people, is the ability to take an existing Kubernetes cluster and very trivially start running arbitrary
[00:26:14] Unknown:
user workloads in a highly scheduled manner. Yeah. And for people who aren't at the level of maturity or scale where it makes sense to be operating a Kubernetes cluster, 1, there are a number of managed services available, but another approach that seems like it would be viable is to build in some sort of support for things like pex builds, where you can have the entire Python dependency set for a given package or a plugin built into a single archive and then just execute directly from there rather than polluting the entire system Python environment with all the dependencies for 1 plugin or 1 DAG that is only gonna be executed, you know, a certain percentage of the time. Yeah. I think that that would be a pretty trivial,
[00:26:57] Unknown:
like, side by side class with the existing, Python virtual environment operator that exists. Another 1 that I've seen people use to great effect is, not the Kubernetes operator, but just the Docker operator. Even just switching from, like, subprocesses into actually using, like, a fully containerized environment, that can really get you a lot of mileage. Even if your airflow instance is not running inside a container, it can still start container processes. And that's actually most of the isolation you'd need at a smaller company. And in terms of
[00:27:29] Unknown:
the types of integrity checks or data evaluation or just unit or integration tests that you've built up for ensuring that when you execute a job, it isn't missing dependencies or that the data is in the expected format. Curious what sort of safeguards you've put in place to be able to monitor the reliability
[00:27:53] Unknown:
and the success rates of the Airflow deployment and the tasks that you're executing? Yeah. So in our case, we have a lot of evolutionary code here. So we're taking existing services and scheduling them through Airflow. And the good news is that a lot of our services already do data quality checks, so they're built into the code base. As an example, if we have a service that downloads, like, ticker data for stocks, it might check to make sure that some stocks that it expects should exist in any reasonable universe also exist in this data set. So if we're missing random data, we would be able to spot that more easily. So we have some checks that don't exist within Airflow. The Airflow node would fail if the data quality checks failed, and then we'd see in the log files that it was because of a built-in data quality check. But we are starting to run in-line data quality checks. So let's say that we have a service that checks the data quality as it's related to the format. What I mean by that is it checks that it's, like, a valid file, that it parses correctly, and we upload it to S3. From the service that's generating it, from its perspective, everything's fine. But downstream, we might have something that's a valid data file according to specifications, but the data inside the file is what's invalid. That might be a ticker that we expect is missing, or 1 thing that happens really frequently is companies have all kinds of mergers and spin offs. So what might happen is we end up with some value that we've never seen before. That doesn't necessarily mean that it's wrong, but it might mean that we didn't understand where it came from, and we need to do some initial research into that. So that's not something that the service would have necessarily spotted. It might be a downstream consumer of that service. So what we've started to do is add some intermediary data quality checks like this that are more domain specific.
We can just add those as nodes in the Airflow DAG. We have the option of whether to make the nodes blocking or not. It depends on the kind of data quality check. We actually are fairly atypical in that, when we run Airflow, we're actually running it very online, in the sense of online processing. Every night, we have to get everything ready for the next day's trading session. So it's often not correct for us to stop a task and say, hey, this failed, take a look at it. What you actually often want to do is continue performing the task and then warn someone that the data is invalid, and then they can make a decision about whether to abort what we're doing with that data or not. Because we may only have time for 1 or 2 reprocessing attempts because the timelines are so tight. And in that case, we can have nodes that are more sibling nodes to some other node. They don't block the downstream node from processing, but they can at least alert us if something seems wrong. That's actually something that we've, over the course of the summer, had 1 of our interns, Danny, working on. And she hadn't used Airflow or Kubernetes before, but she's been doing the bulk of our DAG code writing lately, whereas I've been focusing more on the infrastructure and enablement side. And that's taking some existing data quality checks that were actually run manually every morning inside of a Python notebook and putting those data quality checks inside of a DAG so they really free up a lot of time in the morning for our team's members.
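Here is a minimal sketch of the "sibling node" pattern described above: a domain-specific quality check that hangs off the load step and can alert on problems without blocking downstream processing. The task names, XCom wiring, and check logic are hypothetical.

```python
# Sketch of a non-blocking "sibling" data quality check: it hangs off the load step
# and can page or warn, but the downstream processing does not depend on it.
# Task names and the check itself are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def check_expected_tickers(**context):
    # Domain-specific check: fail (and alert) if tickers we always expect are absent.
    expected = {"AAPL", "MSFT"}
    loaded = set(context["ti"].xcom_pull(task_ids="load_vendor_file") or [])
    missing = expected - loaded
    if missing:
        raise ValueError("missing expected tickers: {}".format(sorted(missing)))


dag = DAG(
    dag_id="example_quality_checked_load",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

load = PythonOperator(
    task_id="load_vendor_file",
    python_callable=lambda **ctx: ["AAPL", "MSFT", "GOOG"],  # stand-in for the real loader
    provide_context=True,
    dag=dag,
)

quality_check = PythonOperator(
    task_id="check_expected_tickers",
    python_callable=check_expected_tickers,
    provide_context=True,
    dag=dag,  # a failure here can alert us, but it is not an ancestor of downstream work
)

process = DummyOperator(task_id="downstream_processing", dag=dag)

load >> quality_check  # sibling branch: inspect the loaded data
load >> process        # main branch: keep the nightly pipeline moving
```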
[00:30:58] Unknown:
And in the process of building out this entire deployment, I'm curious what sorts of dead ends or pitfalls that you encountered
[00:31:08] Unknown:
while you were working on it that you wished somebody might have told you about before you got started? Oh, yeah. That's a hard question. Well, don't let my boss hear you, but I think it might have been somewhat of a dead end to host Airflow on our own EC2 instances. In parallel to the Airflow project, which is most of what I've been working on, the rest of the SRE team has been more involved with our Kubernetes project. So we've been adopting them side by side. And I really feel like we might have been in a better place if we had focused more on getting Airflow to run inside of Kubernetes, that we might have been a little bit further along with both projects. Because anytime you don't have to run infrastructure, you probably don't need to. I think that's particularly true now that Google, the Google Cloud team specifically, just released the Airflow operator for Kubernetes, which is different from an Airflow operator. I know it's very confusing.
The operator in the Kubernetes sense is like a prepackaged, bundled version of Airflow that you can sort of 1-click install and get running on Kubernetes. So I definitely feel like sometime within the next year, I'm going to take out all of this artisanally maintained Ansible YAML code and convert it into Kubernetes manifests. I think that's going to be the future for it. I'd say another dead end was more on the internal side. There were some projects that just haven't really panned out for various reasons. I think that I initially was more focused on porting existing systems to run in a more Airflow sort of way. And what's actually had the most mileage is really getting kinda down and dirty with some of these tasks and thinking about how can we take an existing service and make it basically headless. And in this case, 1 of the easiest approaches for us ended up being good old-fashioned SSH. It's not something I'd recommend for, like, a new deployment. But in our case, we had an existing instance that was just running crontab commands. It was much faster, rather than redeploy that instance, to just plan to have Airflow SSH into that instance and give it access to run a limited set of commands. And, no, that's not a perfect long term solution. But I think if the focus from the beginning had been just making it work and then later on replacing it with a more elegant solution, I might have been a little bit further along than I am now. Yeah. The old don't let the perfect be the enemy of the good. Exactly. And I think that the DAG structure is actually really friendly for that kind of progressive enhancement. Because if you have a particular type of task, so let's say, in my case, the SSH operator, what I can do is, later on, I can take that same DAG and replace what kind of operator it is without changing the task ID. So all the success, fail, and time information will be persisted. Of course, it's gonna look a little bit weird. You may not wanna actually look at some of the previous results, but it lets you preserve basically the same dependency structure. So your DAG isn't changing very much. It's just the implementation of the specific nodes that are. And you've spoken a bit about some of the shortcomings
[00:33:57] Unknown:
of airflow as it stands now, and I'm curious if there are any others that you would like to see improvements in as you continue to work on expanding and improving your own installation of it? Yeah. So I've actually submitted recently,
[00:34:13] Unknown:
a PR against Airflow for a real pet peeve of our team, the Airflow SLA mechanism. The Airflow SLAs are kind of clearly geared towards someone who wants to know if their job is failing, but maybe can fix it, like, eventually. Whereas for us, as an SRE team, we need it to be more like a PagerDuty-style managed incident. We need to know what specifically failed. We need to know what the downstream implications of that are. So for us, just getting an email about it really didn't suffice. And some of the APIs around the SLA system are just a little incoherent with the rest of Airflow. So I wrote up a pretty major refactor of how Airflow handles SLAs. I'm hoping to get that merged. There are still some outstanding issues with the PR because I'm still learning the code base. And in particular, I think that the Airflow code base is pretty hard to run the test suite for. I'm not gonna lie. There have been some discussions about how to improve that. So I think a lot of people are aware of that. So on a personal level, I'm really excited to see that move forward at all. Because for us, we actually can't use the SLA feature for paging right now. We have this issue where it'll page us on weekends because we have a lot of skipped tasks on weekends, so tasks don't meet their SLA if they're skipped. That's been fixed in the upcoming version, but it's not quite ready for production for us yet.
So we're really eager to see improvements in that subsystem in particular. On a more general note, I'm really happy to see that the Airflow enhancement proposals are now starting to go forward. I think Airflow Improvement Proposals is what they're calling them. That's something that, being a Python programmer, I'm really familiar with, that style of, like, PEP process. And I'm really hopeful to see that drive longer term Airflow development, because I really like structured plans for how we're gonna improve a framework and where we're gonna bring it in the future. It's nice to get progressive improvements, of course, and Airflow has a lot of that kind of activity. But I really wanna see what system Airflow is planning to become 6 months from now, a year from now, 5 years from now. Because I think that it's really, particularly with the Kubernetes integration, filling a very real data processing niche. You know, there's a lot of tools out there for data science, but the ones that are comparable to Airflow are simply not up to the same challenges that it is. And out of all the other tools that attempt to deal with parallelism or data management or data lineage, a lot of them are actually very complementary to Airflow. So I see it as sort of the missing piece to a lot of data science ecosystems.
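For reference, this is roughly what the stock SLA hooks being discussed look like: a per-task sla timeout plus a DAG-level sla_miss_callback. The notify_pager function below is a placeholder for whatever incident tooling you use, not a real PagerDuty integration, and the callback signature follows the way Airflow 1.x invokes it.

```python
# Sketch of Airflow's stock SLA hooks: a per-task `sla` plus a DAG-level
# sla_miss_callback. This is the mechanism being discussed (and refactored in the
# speaker's PR); notify_pager() is a placeholder, not a real paging integration.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator


def notify_pager(dag, task_list, blocking_task_list, slas, blocking_tis):
    # In a real deployment this would open an incident via a paging service API
    # instead of relying on the default SLA-miss email.
    print("SLA missed for DAG %s; tasks: %s" % (dag.dag_id, task_list))


dag = DAG(
    dag_id="example_sla_dag",
    start_date=datetime(2018, 1, 1),
    schedule_interval="0 1 * * *",
    sla_miss_callback=notify_pager,  # called for SLA misses within this DAG
)

nightly_load = BashOperator(
    task_id="nightly_load",
    bash_command="sleep 5",
    sla=timedelta(hours=1),  # must finish within an hour of the scheduled start
    dag=dag,
)
```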
[00:36:37] Unknown:
And if you were to start the entire project over today knowing what you do now and with the recent additions or improvements in various other tools that are available in the ecosystem, do you think that you would still land on the same choice of using airflow? Or might you choose something else that maybe came out more recently or that might have been your second choice initially?
[00:37:02] Unknown:
I think that, definitely, it still would have been Airflow. If anything, now that I've worked with it more, I understand just how bad off we would have been if we had gone with Luigi. As it turns out, we actually did have a Luigi project in house as well. That 1 has been very difficult to work with for us. And in fact, we've pulled Luigi out of it entirely, whereas Airflow has been so much nicer than we even originally expected it to be. We really have a lot of buy in from various people who are just very invested in learning it. And in particular, it's had the desired effect of pulling in more people. You know, we recently had someone who's on essentially 1 of the Quantopian, like, quantitative marketing teams be interested in how we can use Airflow for the types of analytics reports that they're running. Because, you know, when people hear Airflow, they think, oh, big data and, you know, data science. And, sure, that's a part of it, but it's also this team that needs to run reports on their own analytics. It's also the back end software engineering that we do for processing financial data. Even some of our chat bots are just running on Airflow now for scheduled tasks.
You can really use it as a broker for all kinds of different services. It's really the fact that it's a shared common code base in Python that makes it very easy to write these DAGs without having to learn some obscure domain specific language, but while still providing the full capabilities of Python, which we all use as a common language at Quantopian. That's what's really making it powerful for us. That being said, if I were starting a new Airflow project today, I would very seriously look at Google Cloud Platform's offering. In our case, we wouldn't be able to use it due to both security requirements and the fact that we're on AWS rather than GCP. But if I were starting from scratch for a personal project, I would absolutely go look at that because the idea of a self hosted airflow, I definitely would have saved a lot of time on that. And, of course, there are other, managed airflow providers out there as well. So I think that that's a really great direction that there are now multiple companies competing in this space to all offer really amazing platforms.
[00:39:01] Unknown:
And going in the other direction from the typical focus of data engineering projects and building up data pipelines, have you considered bringing Airflow into some of your more operational processes, such as being the management interface for making sure that various system backups are taking place or that various periodic tasks or integrity or health checks across your infrastructure are being run? So we actually have a big Airflow project board. And for scale, I think it's up to, like, 150
[00:39:32] Unknown:
tasks on this project board. So that's, like, 150 GitHub issues, just about various improvements to the company that Airflow could bring. So we do have backups on there as 1 of the items. I'd say, for us, that's somewhat lower priority. But 1 thing that's really high priority for us is actually accounting processes. It might sound pretty boring, right? But as it turns out, when you are trading money on the stock market, you have a lot of accounting to do, and that's actually a major business requirement for us. So this is something that's pretty far removed from what you think of as data science, but it's actually a great fit for us, of just moving around various financial reports from 1 FTP server to another in a reliable way with a useful dashboard. Airflow is actually amazing for that. So I would say if you're thinking about adopting Airflow at your company, don't just think of it as a data science platform. I mean, yes, it can be that. But I would say if you can't think of any scheduled tasks that you're doing other than data science tasks, you're probably not fully exploring the capabilities of what Airflow could be offering you. Essentially, the way that I think of this is if I have any cron job where I need to run it in relation to any other cron job, so with any kind of dependency order, and it needs to be distributed, like, those criteria alone make it a really good fit for Airflow. The only new crontabs that we're writing nowadays are ones that can run on each instance in a service and are more about keeping those instances healthy. So, like, sure, we still have crontabs for things like log rotations that definitely wouldn't make sense to put inside of Airflow. And even, like, a service level crontab may not make sense if it's very isolated and it just needs to, like, run every hour with no dependencies. But anything that's like a real human facing task that some person is thinking about whether it's running or not, then it probably actually belongs in Airflow. I'll give this example. We have an analytics export where we take things from some machine or another and move it into Mixpanel. And we just wanna know about, like, user interactions with the website. That's fine. That by itself, we could just run with a crontab every 24 hours. But what if someone on the marketing team wants to know about it when it finishes? What if someone wants to then kick off some task inside of Mixpanel? The reality is if it were on a crontab, they might not even know that's how their data is arriving. If we put it into Airflow, if we help them write the initial PR for this, now they know where to go for modifications to this, and they can make those modifications self-service. So even if there aren't any dependencies yet, if you can think of any reasonable dependency that someone might wanna hang off of this in the future, just do the extra effort and get it into Airflow. It's really well suited for that kind of use case. And are there any other aspects of the Airflow project or your implementation
[00:42:11] Unknown:
of it or anything
[00:42:13] Unknown:
surrounding it that we didn't discuss yet that you think we should cover before we close out the show? I think that I've used this term in a few places, but the idea of a data science platform is, I think, really important. And, again, Maxime can put this much more eloquently than I can, so you should definitely go read his Medium articles about this. I think the way that some people, well, let me back up a step. When I say some people, I mean people that I see trying to adopt Airflow. So these are people on Stack Overflow. These are people in the Airflow Gitter chat. So if you ever need support, drop in there. I'm more than happy to help out with it. I see a lot of people trying to take their existing processes and just jam them into Airflow, or take an Airflow instance and just expect, like, a press-a-button, receive-a-data-science-workflow experience. And it's not that Airflow can't be those things, but it can be so much more than those things. But you have to come in with a strategy about how to get other people to do your work for you. Right? I know that sounds bad, but, ultimately, I'm a very lazy engineer, so I don't want to drag all of our tasks into Airflow. What I want to do is figure out which tasks need to be moved into Airflow and then figure out ways to incentivize people to do that for me. Because, ultimately, I'm on the site reliability engineering team. It's part of my mandate to make our data pipelines effective. It's not really my mandate to rewrite all of our data pipelines. I need to enable people to rewrite our data pipelines by essentially providing better tooling for them. I really recommend reading about platforms in general. Like, what constitutes a platform? When do you switch from a piece of code to a framework to a platform?
And it's all about the level of maturity and services that are provided. So thinking about how you're going to address data holistically, it's not just by installing Airflow. A lot of the decisions that I've had to make are adjacent to Airflow. So where does Airflow store data? What other systems does Airflow integrate with? Who has access to Airflow? So installing Airflow was important, and getting some initial tasks in was good for the demo, but the real work was making sure that Airflow fits within the overall data platform vision of the company. And for us, adopting Airflow was kind of a 50/50 split with adopting Kubernetes. We would not have gotten as far as we've gotten with Airflow if we hadn't also been adopting Kubernetes at the same time. So neither of those is actually our data platform. Our data platform is the combination of the 2 of them, plus all of our policies in place around who has access to data, where can you do initial exploratory research. We need to make sure there's an end to end process where you have an idea, you can experiment on the idea, you can build a rapid POC, and then you can productionize it without a lot of pain. And those are all key points that we've had issues with on various other projects.
So making sure that our Airflow installation is prepared to do that entire cycle has been really crucial to success, and I feel like that's something some people miss with Airflow. You can't just install Airflow.
[00:45:06] Unknown:
Even Airbnb doesn't just use Airflow. They have other systems that are augmenting it. It's all part of a larger picture. So, yeah, that's what I would say is a really important thing to know about an Airflow installation. Alright. And for anybody who wants to get in touch with you, follow the work that you're up to, or ask any follow on questions, I'll have you add your preferred contact information to the show notes. And with that, as a final question, I'd like to get your perspective on what you view as being the biggest gap in the tooling or technology that's available for data management today. So I'd actually say the biggest gap is really kind of the next 1 after Airflow.
[00:45:41] Unknown:
So you've got a really cool pipeline. You've got some data that you're processing through it, and you wanna make some prospective changes to it. And that, I think, is a place where the workflow is not very good. You might have to actually copy that DAG file. You might have to press a button in a GUI and type in some parameters. If you compare that to the workflow of, like, branching in Git and running some tests, it's typically much simpler to do something like that than to run an entire end to end data pipeline based on a Git branch. And I think even Airflow doesn't quite get you there. I think Airflow is going to be a tool that will help you get there. But, essentially, the way that I envision data pipelines is it doesn't make sense to have a singular data pipeline that you funnel all of your data through. I think of it as more of a mesh. So anytime that you can make a decision, your pipeline should be able to fork and explore all the possibilities.
Essentially, I want an environment where I can test what would happen if I made some change in the data, go actually model that out in the real world, and come back to me with some set of answers. I want it to be able to perform automatic sensitivity analyses to look at what happens if I add an initial filtering step to clean out some busted data rows. What effects does that have downstream? It's actually very difficult to do that kind of diffing or comparison, particularly once you have any kind of graph data structure. So that's something I've been working on myself lately, but it's a very hard problem.
I'm hopeful that we'll see better tools for that. Because as much fun as it is as a pet project, you know, you can't build everything from scratch. You need to standardize if you want to compete. So I don't see a lot of tools out there in that space of not just managing your data by keeping in 1 place, but actively allowing for exploratory analysis of data decision making. And that's what I see as the next frontier in data science.
[00:47:34] Unknown:
Alright. Well, thank you very much for your time and for sharing your experiences of standing up a new airflow installation and working towards building a unified data science platform for your company. So it's, definitely been very interesting and useful. So I appreciate that, and I hope you enjoy the rest of your night. Well, I appreciate the invitation. Thank you so much, and I'm glad to share the info. Feel free to reach out to me.
Introduction and Guest Introduction
James Meickle's Background and Experience
Initial Requirements and Tool Considerations for Airflow
Key Features and Benefits of Airflow
Deployment Architecture and Executor Choices
Team Involvement and Evolutionary Approach
Resources and Design Patterns for Airflow
Deployment and Testing Strategies
Dependency Management and Kubernetes Integration
Integrity Checks and Data Quality
Challenges and Pitfalls in Airflow Deployment
Shortcomings and Future Improvements for Airflow
Reevaluating Airflow and Alternatives
Operational Processes and Airflow
Final Thoughts on Airflow and Data Science Platforms