Summary
Building a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of both roles. Data engineering platforms have a strong focus on stateful execution and tasks that are strictly ordered based on dependency graphs. Data science platforms provide an environment that is conducive to rapid experimentation and iteration, with data flowing directly between stages. Jeremiah Lowin has gained experience in both styles of working, leading him to be frustrated with all of the available tools. In this episode he explains his motivation for creating a new workflow engine that marries the needs of data engineers and data scientists, how it helps to smooth the handoffs between teams working on data projects, and how the design lets you focus on what you care about while it handles the failure cases for you. It is exciting to see a new generation of workflow engine that is learning from the benefits and failures of previous tools for processing your data pipelines.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Jeremiah Lowin about Prefect, a workflow platform for data engineering
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Prefect is and your motivation for creating it?
- What are the axes along which a workflow engine can differentiate itself, and which of those have you focused on for Prefect?
- In some of your blog posts and your PyData presentation you discuss the concept of negative vs. positive engineering. Can you briefly outline what you mean by that and the ways that Prefect handles the negative cases for you?
- How is Prefect itself implemented and what tools or systems have you relied on most heavily for inspiration?
- How do you manage passing data between stages in a pipeline when they are running across distributed nodes?
- What was your decision making process when deciding to use Dask as your supported execution engine?
- For tasks that require specific resources or dependencies how do you approach the idea of task affinity?
- Does Prefect support managing tasks that bridge network boundaries?
- What are some of the features or capabilities of Prefect that are misunderstood or overlooked by users which you think should be exercised more often?
- What are the limitations of the open source core as compared to the cloud offering that you are building?
- What were your assumptions going into this project and how have they been challenged or updated as you dug deeper into the problem domain and received feedback from users?
- What are some of the most interesting/innovative/unexpected ways that you have seen Prefect used?
- When is Prefect the wrong choice?
- In your experience working on Airflow and Prefect, what are some of the common challenges and anti-patterns that arise in data engineering projects?
- What are some best practices and industry trends that you are most excited by?
- What do you have planned for the future of the Prefect project and company?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, speedy SSDs, and a 40 gigabit public network, you get everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and one opening in Mumbai at the end of the year. And for your machine learning workloads, they just announced dedicated CPU instances where you get to take advantage of their blazing fast compute units.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference.
And coming up this fall are the combined events of Graphorum and the Data Architecture Summit in Chicago. The agendas have been announced, and super early bird registration is available until July 26th for up to $300 off, or you can get the early bird pricing until August 30th for $200 off your ticket. Use the code BNLLC to get an additional 10% off any pass when you register, and go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register for this and other events. You can go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Your host is Tobias Macey. And today, I'm interviewing Jeremiah Lowin about Prefect, a workflow platform for data engineering and data science.
[00:02:16] Unknown:
So, Jeremiah, can you start by introducing yourself? Yeah. First of all, thanks so much for having me on your podcast. I'm the founder of Prefect. As you mentioned, Prefect is a data engineering startup that's based here in DC, and we make some amazing software for building and executing data workflows.
[00:02:37] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:40] Unknown:
Yeah. My career has been a very slow but steady progression from the data science world into the data engineering world. I started out as a statistician. I was doing a lot of analytics and some machine learning work in the very early days there. But over the course of my career, I kept moving more and more into the engineering space simply as a matter of necessity. In order to get the analytics that I was being paid to produce into the world, and in order to make them useful, I had to keep taking steps toward: how do we productionize them? How do we deploy them? I had to understand my infrastructure, and this was a product of a few things. First, the amazing software and developer tooling that's become available over the last decade, which made it possible for an increasingly small team to handle this type of work. But also, hand in hand with that, the fact that I kept moving onto smaller and smaller teams even as I moved into larger and larger institutions.
In a recent role, I was sort of a team of one handling all aspects of a massive analytics effort, and if I wasn't able to handle both sides of that equation, the data science side and the data engineering side, I was sort of lost. That was when I first became involved in the Airflow project. It was literally a lifeline for me: a way to take the hours in the day that I had, and the amount of work I had to do that could not fit in those hours, and start automating it.
[00:04:07] Unknown:
And so before we get too far into it, can we just start by describing a bit about what Prefect is, your motivation for creating it, and some of the story behind building the project and the company?
[00:04:19] Unknown:
Absolutely. So in a simple sense, Prefect is the tool that I always dreamed of having. Inasmuch as my career has been a progression from strictly data science into this hybrid of data science and data engineering, it's been a quest for: how do I build tools faster? How do I deploy ideas that I didn't even have yesterday into production tomorrow? As I mentioned, I became involved in Airflow when I just needed to automate things. I needed to kick things off. But as I stabilized my infrastructure and started turning my attention back toward the more fun, ephemeral, dynamic data science work that I was doing, I very quickly ran up against the limitations of Airflow as a tool. I was doing things that were fast, programmatic, parameterized, dynamic, scaled, different, and Airflow just doesn't quite have the vocabulary there. So Prefect started life as an experiment to see if we could marry some of the things I really loved about Airflow, which is, you know, the idea that it has state and state dependencies, and you can do things like run a task if something else failed,
with everything that was so amazing about living in the data science world. Right? These beautiful analytic toolsets and pipelines, and Dask in particular is one that I was making heavy use of. So the very first prototype of Prefect was: can I take Dask's pipelining technology and combine it with a semantic similar to Airflow's success-or-failure? And in doing so, can I get the best of data engineering tooling and the best of data science tooling in one project? That prototype was so impactful in my work, even as a nothing, as a prototype, that very quickly we started turning our resources and attention to developing it alongside Airflow, and fairly soon it subsumed the work that we had previously given to Airflow.
[00:06:13] Unknown:
And so one of the things that you mentioned in there is the fact that you were looking for some of the dynamism and iterative capabilities of your data science workflow within the data engineering context. So I'm wondering if you can characterize the assumptions that were baked into tools such as Airflow about what's necessary for doing data engineering work, how that contrasts with the needs of a data scientist, and why it's necessary and useful to be able to encapsulate both of those styles of workflow into one tool or platform?
[00:06:50] Unknown:
It's a very good question, and I'll give you the best answer I have. I think that broadly, if you look at the data science world and the data engineering world and how they've evolved in the last 5 or 10 years, you see a philosophical difference that's revealed in the choices that software makes in either camp. In the data engineering world, we tend to have state-driven software. So you do a thing, and then depending on whether or not the thing succeeded, you move down one of two paths to a second thing, and then on and on and on. And, of course, that takes on a very familiar DAG shape.
In the data science world, while we're still dealing with DAGs, the dependency nature is very different. There's this implicit idea that things will succeed. If at any point in a data science pipeline something fails, that's almost always equivalent to the entire pipeline failing. And that's true in pretty much any machine learning framework, anything like that, Beam included: when a node fails, the workflow fails. That's because one world depends on passing data around (that's the data science world), and one depends on passing state around (that's the data engineering world). And while that's somewhat of a vast generalization of the two, since we started beating this drum about two years ago, we've seen it borne out in almost every set of tools we look at. So the idea of bridging that gap really involves combining a state-based workflow, which is to say you can react to whether something succeeded or failed, with a data-based workflow, which is to say you can react to whether data has been produced by an upstream task, and you can make use of that information however you choose.
So this bifurcation of philosophy, which is then entrenched by software choices, is one that we at Prefect reject. If we borrow the best practices that we see in those two fields and literally combine them, in the form of the very rich state objects that drive our system, we can all of a sudden enable amazing things. We can borrow the power of, for example, the Dask cluster that underlies our system to pump massive amounts of data through the system like any traditional pipeline in a data science context. On the other hand, we can, with no extra work, treat any one of those potentially millions of tasks as a fully independent, not necessarily idempotent, retryable piece of logic, from a more traditional data engineering standpoint. That idea, more than anything else, that combination of these two practices, is what drove us forward. Because the truth is, when you look around, things are rarely one or the other. Or, to the extent that people think that things are one or the other, they're ignoring the fact that at some point those two worlds need to collide. Right? I can build a model in TensorFlow. I can use a strictly data science oriented approach, but at some point, I need to get that into the hands of someone who can work with it, productionize it, deploy it, make it robust to who knows what.
Conversely, if I'm in a data engineering context, I am very rarely going to be completely divorced from the idea of moving data or information, even a parameter, through a system. And so while we can live most of the time in one of the two worlds, there's always this friction, this crossover. I think I felt it very keenly in my career, just being someone who literally had a foot in both worlds, so I became very motivated to solve it. Now what we've learned is that while this friction, this surface area, may be relatively small, the time it takes to deal with it is sort of incredible. It is the number one thing that we hear people complaining about, and there's no person in most organizations who's naturally incentivized to solve this problem.
The problem that I just described only exists between two teams. Both teams are probably quite happy, in fact. So by focusing on that, we found a very interesting problem that doesn't have a natural owner, that isn't naturally incentivized to be solved. We came along with Prefect, and we've found that we can actually solve it in a very general way, and that's been a very exciting thing to see resonate with our users.
[00:10:57] Unknown:
Broadening the conversation a little bit: Prefect is a workflow engine, Airflow is a workflow engine, and you might hear some of these referred to with other terms such as ETL tools or data integration platforms. But broadly speaking, in terms of workflow engines and the capabilities and requirements thereof, there are a pretty wide number of them. I'm wondering what you see as the main axes along which the different tools or platforms can differentiate themselves, and among those potential axes, which ones you're focusing on for Prefect that would make somebody choose it over some other option?
[00:11:37] Unknown:
By and large, the workflow systems that have emerged today, in the popular mindset at least, are these monolithic and generally unpleasant pieces of software. They require an enormous amount of setup and infrastructure. They are quite opinionated about what you can and cannot do. In other words, they have a very restricted vocabulary, and the things that their creators chose to make first class really dictate what you're able to do with them. Say, 5 years ago, when the main activity that you wanted to schedule was hitting a third-party system, waiting for it to complete, and moving on to the next thing, that limited vocabulary was fine. All you really needed to do was schedule something, execute a script, and wait for it to complete. That was a vocabulary sufficient for 90% of work. But that's not really what the modern data stack looks like. The modern data stack, as I've mentioned before, is dynamic, and it's fast. It's experimental.
It's often broken. And what we need is to be able to respond quickly, move quickly, and not be beholden to the limitations of the workflow system. I'll tell you a funny story. I met a VC in the early days of seeking funding for Prefect, and he explained to me how he thought a system like Prefect would be deployed into a large technology company that you would recognize. He described a vision of the data process there that, honestly, was just completely foreign to me. He had these data scientists in his head who basically received a nice packet of data, applied a known model to it, produced a report, and then repeated that forever. I remember thinking while he was describing this, what an incredible luxury that would be as a data scientist, to live in that world where there was almost no challenge. It was like being the veal of data science, in a sense, and this image of the data science veal has stuck with me since that conversation, because that's just not what using data today looks like. Using data today is this frantic struggle to figure out: what is usable in this dataset? What can I extract from it, and where can I put it? And you are lucky when you get that into a repeatable process. So what we need as a result is a way to make sure that the workflow system is really governing the workflow itself, the semantics of the workflow, and not restricting the actions that I'm allowed to take. Because it would be a mistake if a workflow system pretended to anticipate the needs that I have as a user and my use case. We spoke earlier about a PyData talk that I gave, and this was the topic of that talk: the way that people essentially end up having to hack around the assumptions of their workflow system in order to get it to comply with their use case. So what we focused on very much with Prefect, to answer your question directly, is not forcing users to learn a new vocabulary, a new set of actions, but rather just embracing Python as an API.
And our design goal is that Prefect should be almost invisible. Prefect's job is to sit between the actions you want to take. When you ask somebody to draw a workflow, what they usually do is draw boxes and then draw arrows between them. Most workflow systems, as I've described them, restrict what you're allowed to put in the box, because they're only capable of handling a relatively well-defined set of actions. But they pride themselves on those arrows. Right? We're going to make sure that you go from box A to box B to box C in the right order at the right time. Prefect takes a very different point of view. We say you can put anything you want in that box. Any valid Python function can go in that box. We're only focused on the arrows, and we will make sure that the syntax, the API that you use to express the workflow, hews as closely as possible to the Python that you are comfortable with. By eliminating that restriction, we open up into this data science world all of a sudden, and we try to tear down the barrier between the two. You don't need to just kick off a third-party script, and you don't have to have a very carefully constructed and idempotent task.
We will do whatever our users want. If it fits in a Python function, Prefect will happily handle it. And what Prefect brings to the table, again, are those workflow semantics: the ideas of retries, error handling, logging, deployment, etcetera. So that's really where we're focused: we want to make sure that no matter what someone brings to us, the answer is yes, Prefect handles this, because Prefect is literally agnostic to what goes in that box.
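To make that concrete, here is a minimal sketch of what "Python as the API" looks like in Prefect's functional style (the task names and logic are illustrative, not from the episode, and details may vary by version):

```python
from prefect import task, Flow

@task
def extract():
    # anything can go in the "box": this is just a Python function
    return [1, 2, 3]

@task
def load(records):
    print(f"loaded {len(records)} records")

# the Flow context only describes the "arrows": calling tasks here
# records the dependency graph instead of executing immediately
with Flow("etl-sketch") as flow:
    load(extract())

flow.run()  # execute locally
```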
[00:16:16] Unknown:
In your presentation that we mentioned, and in some of the blog posts that I read in preparation for this conversation, you introduced this concept of positive versus negative engineering. I'm wondering if you can give a bit of an outline of your thoughts on that topic, what you mean by those terms, and some of the ways that Prefect itself handles the negative cases for you, so that the engineer who's building a workflow doesn't have to guard against them as religiously as they might otherwise?
[00:16:46] Unknown:
Absolutely. So the boxes and arrows that I mentioned a moment ago: conversations very quickly get silly when you just keep talking about boxes and arrows, so we decided to name them. We call them positive and negative engineering. Positive engineering is what you are hired to do as an engineer, right? This is writing code to achieve an outcome. You want to load a database. You want to run a model. You want to produce a report. Whatever it is that you are trying to achieve, we just call that positive engineering. Negative engineering, in contrast, is all of the things that you need to worry about to actually make sure that that positive code runs. How are you going to package it? Where will you run it? How will you deploy it? What happens if it fails? How are you going to retry it?
How are you going to make sure that the data it needs to retry is available? It's this infinitely complex risk-management surface that is orthogonal to the purpose of your code. You may have error handling in your code that's related to things you expect to happen, and that fits in the positive engineering context. But once your code is out of your hands, or maybe even while it is in your hands, this negative engineering concept rears its head, and it's hard enough with one task. Everything that I just mentioned could be applied to a single task. When we start to think about how that task interacts with other tasks (maybe it's one, maybe it's a hundred, maybe it's a completely different workflow), the complexity, the surface area for error, just explodes magnificently.
And so, if we quickly reference those boxes and arrows again: if the boxes are positive engineering and the arrows are negative engineering, Prefect is a tool for negative engineering, specifically for eliminating negative engineering. That is the mission of our company. We would like to say to people: you work on the thing that you were hired to do, the thing that you are good at, the thing that you are an expert in. That's the positive engineering. Build that, decorate it with our code, and Prefect's negative engineering goes to work. All of a sudden, all the things that I just referenced are brought to bear in a best-practices kind of way. When we give our code to someone, we are not telling them what they need to write, how they need to write it, what it needs to do, or what data it needs to produce. That's completely up to them.
Our promise is that if this thing needs to retry, you're not going to write retry code. You're just going to tell us, retry this three times, and move on. If this thing needs to run as a cleanup only if something fails, you're not going to code that. We're going to know how to do that. We're going to make sure that it happens, and we're going to make that guarantee.
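A hedged sketch of what that promise looks like in code (the task bodies are hypothetical; `max_retries`, `retry_delay`, and failure-only triggers are the Prefect 0.x spellings):

```python
from datetime import timedelta
from prefect import task, Flow
from prefect.triggers import any_failed

@task(max_retries=3, retry_delay=timedelta(minutes=1))
def load_database():
    ...  # positive engineering: the work you were hired to do

@task(trigger=any_failed)
def cleanup():
    ...  # negative engineering made declarative: runs only if an upstream task failed

with Flow("guarded-pipeline") as flow:
    result = load_database()
    cleanup(upstream_tasks=[result])
```

The retry and cleanup behavior lives entirely in the decorator arguments; the function bodies stay pure "positive" logic.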
[00:19:17] Unknown:
And as far as the execution context that Prefect provides and some of its built-in capabilities, I'm wondering if you can dig in a bit more as to how it's actually implemented. I know that you mentioned that it relies heavily on Dask for being able to distribute execution, and that it's using Python as the implementation language. So I'm wondering if you can talk through some of the design choices and technology choices that you made and some of the overall architecture of the system that you've settled on.
[00:19:49] Unknown:
So this guiding principle we have, that Python is the API, is a forcing function to make sure that we don't rely too heavily on any one technology, and that we really embrace a very generalist attitude. Dask is a big part of our system. It's how we recommend people deploy Prefect. But if you go into our open source and look at the Dask hook, I think it's less than 50 lines of code, because Prefect is a fully customizable, modular system. It's funny: one of our goals was to make this thing that is batteries-included out of the box, super easy to use, and encapsulates best practices, but that also has this rich, customizable back end that people can plug whatever execution engine they want into. What we ended up doing to achieve that is we split the project into a series of modules, and each module exposes classes with a very simple API. Just as an example, in the execution module, if you want to plug in any back end, whether it's Dask or anything else, there are only two functions that you need to implement on a class.
And Prefect will happily interface with any class that implements that API. So taking this from Dask to, you name it, whatever you want: local process, asynchronous process, Lambda. It's trivial, because the contract is very simple. We've taken that approach throughout the system. Take the environment: how do we package and deploy a workflow? Does it get put in a container? Does it get written to a file? Does it get sent to GitHub? Again, we spend a lot of time on: what is the API? How do we abstract away the bare minimum that's needed to richly express this idea? And then we build a module for that. So Prefect as an engine is a collection of those modules surrounding (and it's literally called the engine in our open source code) a very rigorous set of steps that we believe encode the best practices of how to implement workflow semantics and workflow logic.
There are sort of two pieces to the puzzle that I'm describing. One is that execution pipeline, which is all open source, and anyone can go check it out. That execution pipeline is very clear, well tested, atomic. It's great code. It's much easier to read than any other workflow engine that we've looked at and compared it to. But on the other hand, it is all driven by hooks that are extensible into these modules we built, so that people can rip out any part of our system and replace it without worrying that they've lost any of the functionality that the workflow logic guarantees. I already mentioned the execution pattern and the packaging pattern; logging falls under that category, and how you want your results handled falls under that category.
Really, every piece of it was designed to be modular.
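To illustrate the two-function contract he describes for execution back ends, here is a toy executor. This is a sketch against Prefect 0.x's `Executor` base class; treat the module path and method names as version-dependent assumptions:

```python
from prefect.engine.executors import Executor

class InlineExecutor(Executor):
    """Toy back end that runs every task eagerly in the local process."""

    def submit(self, fn, *args, **kwargs):
        # hand work to the back end; here we just run it immediately
        return fn(*args, **kwargs)

    def wait(self, futures):
        # resolve submitted "futures" into concrete values; ours already are
        return futures

# usage sketch: flow.run(executor=InlineExecutor())
# or, for distributed execution against an existing Dask cluster:
#   from prefect.engine.executors import DaskExecutor
#   flow.run(executor=DaskExecutor(address="tcp://scheduler:8786"))
```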
[00:22:41] Unknown:
And in the process of building this platform, I know that you have drawn fairly heavily on your experience working on Airflow and with Airflow, both for positive and negative ideas of what to do and what to avoid. I'm wondering what are some of the other systems or platforms or experiences that you're drawing from for inspiration?
[00:23:02] Unknown:
So, yeah, it's no secret that Airflow is a big source of inspiration. I've worked on Airflow for many years, since the day it was open sourced, as a matter of fact. But more than Airflow itself as a tool, to be honest, it was the interaction with Airflow users. Over the course of the last, I guess, 4 years, I have worked with hundreds, and I don't think it would be an exaggeration to say thousands (maybe it would be), of data engineers who are working in or around the Airflow ecosystem. And so I have cataloged use cases that are just extraordinary.
That education has been the single most valuable thing, much more than looking at any piece of software, because software tends to be written to solve a problem. And in collecting these hundreds or thousands of problems, we had a very unique opportunity to look around and say, well, what is the problem we should solve here? In other words, I know what my problem is, but maybe my problem is just one outlier in this massive distribution of data engineering problems. So we had this incredible dataset, which informed almost everything about Prefect.
The big challenge that we had was to take these hundreds and thousands of use cases and extract from them the actual features, the things that we needed to deliver, to try and solve ideally every single one. And the amazing thing is, when we boiled it down, there weren't that many. We'd see the same things repeated over and over and over. We'd see, obviously, speed and scale, numbers one and two, probably for anyone who's using Airflow. But beyond that, we would see requests for parameterization. We would see a lot of requests for dynamism, which we implemented as a mapping operator.
You know, there's only a small number of things that would actually solve, or really move the needle for, the vast majority of use cases that just aren't met by current technologies. But the problem is, some of them aren't obvious when you set out with one specific problem that you're trying to solve, and others of them feel like byproducts of a different process. So, for example, passing data around, data flow, is sort of taken for granted by everyone in the data science world. That is how every single tool works. It would be very weird to try and use a tool for data science that couldn't move data around. But on the other hand, as we start to blur the line between data science and data engineering, that requirement is now making it impossible to deploy, in some cases, a data engineering tool that can't even do this most basic thing. And if you go way back into the Airflow history, you'll find that one of the first things I added to it is a thing called an XCom, which was my attempt to solve this problem. Now, with many more years of experience than that implementation, I would never in a million years have done it the way I did it then.
I'm a little bit embarrassed by it in some regard, but it is out there. It's too late to change it. But it's a good example of how, if you just try to tackle one problem at a time and solve it, or do it by committee, you're not going to end up in an optimal place. You're just going to solve the problem you thought you had. So by leveraging not so much what we learned from other tools, but this wealth of feedback and friendly relationships that we've built, we ended up in a really amazing place.
[00:26:25] Unknown:
And the idea of being able to pass the data itself between the tasks, rather than just the current state and a pointer to where that data might reside, is definitely something that you said is very valuable in the data science context, and something that I can see would be very useful in general, because it reduces some potential failure cases. For instance, permissions issues across different workers, where the worker producing the data pushes it to an S3 bucket, but the credentials for the reading worker don't grant access to it. Just various challenges that can come up along those lines.
I'm wondering, particularly when running in a distributed context, whether on Dask or otherwise, how you handle actually being able to pass that data between processes, particularly when you might be dealing with large volumes of information?
[00:27:10] Unknown:
Yeah. This is one of those things where it was very easy for me a couple of years ago to kind of pound the table and say, and it has to pass data around, but then to sit and wonder, well, how are we going to solve this problem? The prototype that I always use is not even necessarily across boundaries where we might have two processes running that need to literally pass data to each other. The harder version of this problem is: what happens when a task fails and needs to retry in an hour? It's the same task, but it needs access to the data produced by upstream tasks that themselves are no longer running. Right? They're successful. They're done. There are variants of this problem, when you actually start to map it out, that bring even more complexity and difficulty to it. So the way that we solve this is by introducing a very originally named class into our open source engine called the result handler, which allows users to specify: what do you want done with the result of a task?
As you mentioned, a good way to do it is to write it into S3, write it into cloud storage, put it on a network share. Again, this is a class that exposes a very simple, I think it's a two-function, API. So anything that you write that's compatible with that API can be used as a result handler. That tells us: okay, this task produced something. Now we know, a, where we put it, and, b, how to retrieve it. But what we can't do is store every single piece of data that's produced by the system. It's not feasible. It's not realistic.
It's just untenable. So instead, we started to look for how we actually know when data is needed. How can we detect, and hint to our system, that data needs to be passed around? An easy way to do that is: if a task ever ends in any state that doesn't indicate that it's finished, that's a pretty good clue that it's going to need to recreate its data. So what we do is flip it: it's actually not the upstream task that stores data; it's the downstream task that decides it's going to cache the data it received. We turn this problem on its head, and now if that task fails, or rather, if that task needs to receive data from somewhere, it is the responsibility of that task to make sure that it invokes a result handler to put that data in a permanent place. Now, we do have sort of what you might call a greedy mode, where, if you're very concerned that the downstream task might not even have the opportunity to cache the data, we can in fact fall back on what I just said. It's not great practice, but it can be appropriate: we literally write every single piece of data into storage, and we never have to worry about losing it. So if all else fails, we pass data through storage, and we do it in a semi-intelligent way that helps us be efficient. However, there are also technologies like Dask, which, again, we love.
Dask ships with what I consider to be the best task scheduler. I should clarify: it's a slightly different type of task in Dask's world than the one we're talking about in Prefect. Let's call Dask the best way to run a distributed computation in Python, in my opinion. It has this millisecond-latency scheduler that can ship tasks wherever they need to go, but it also has a great serialization protocol. So when we're running on Dask, we don't need to use cloud storage at all, unless we are in the retry situation I described before. Dask itself will take care of serializing the data intelligently and making sure that it's available to the task that needs it. Dask will go so far as to schedule tasks on the same node, if appropriate, to minimize the latency and network requirements of moving data around. So by leveraging both a clever approach to understanding when we need to serialize data, and technologies like Dask that have shown the way to doing this very effectively in a concurrent setting, we solve this problem in a very clean and effective way.
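A sketch of the two-function result handler contract he describes, assuming the Prefect 0.x `ResultHandler` base class (the local-disk JSON scheme here is hypothetical, chosen only to keep the example self-contained):

```python
import json
import os
import uuid
from prefect.engine.result_handlers import ResultHandler

class LocalJSONResultHandler(ResultHandler):
    """Toy handler that persists JSON-serializable task results to disk."""

    def write(self, result):
        # persist the result and return a location token Prefect can store
        os.makedirs("/tmp/prefect-results", exist_ok=True)
        path = f"/tmp/prefect-results/{uuid.uuid4()}.json"
        with open(path, "w") as f:
            json.dump(result, f)
        return path

    def read(self, path):
        # given a stored location, recreate the original result (e.g. for a retry)
        with open(path) as f:
            return json.load(f)

# usage sketch: @task(result_handler=LocalJSONResultHandler())
```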
[00:31:28] Unknown:
And another thing that comes up, particularly when running tasks across multiple nodes, is the issue of requirements, whether it's data availability or a certain set of dependencies that need to be installed on a particular instance. So I'm wondering how you approach the idea of task affinity, where a task needs to be executed on a specific instance or within a specific location or capability. And in the event that it's not possible for the specific node to be part of, say, a Dask cluster, and you need to run it as a one-off task or something that's scheduled out of band, how you would then handle capturing the status of the previous execution so it can be surfaced within Prefect itself, and so you can maintain data lineage and provenance for tracking within the broader Prefect installation?
[00:32:19] Unknown:
Yeah. Excellent question. So for the moment, what we've discovered is that in the majority of the use cases we're dealing with today, within a given workflow, tasks tend to share affinity. Put a different way: whatever the most stringent requirements of any task are, they tend to be shared by the others. So the easiest thing to do, and the most common thing to do, is build a single Docker container that contains all of the workflow logic. Now we can just reuse that Docker container anytime we need to run any task in that workflow, and we know the dependencies are there. We know that any special requirements are there. We sort of abstract this problem away; it's built in there. We do have a concept of task tags, where certain tasks can be tagged, which is a way of declaring that they have a requirement or an affinity. This is something that we added very early in the system.
And then, as I mentioned, we're constantly showing things to people and seeing how they use them. This was something that just didn't get a lot of lift in the beginning. It was so easy to just take the entire thing, check it into a Docker container, and know it was going to run, that, frankly, we just didn't see people using this feature. So it's still there, but we haven't done very much with it. And, to be honest with you, version 1 of our platform is actually not going to have a way to express this, because we are waiting for someone to really say to us: we absolutely need this right now, as a higher priority than other workflow requirements.
So, in a very interesting way, we just give people the power to describe the environment they need and then execute it in whatever cluster they want. You basically can describe whatever it is that you need, and that has solved, at least in the use cases that have been brought to us, the need to be hyper-specific about each individual task. In part, I think that's because Prefect encourages a slightly different workflow than what you might see in other workflow engines, because each task in Prefect is designed to be small. We want a task to be a single piece of logic, and the smaller it is, the better, because that means each task in Prefect can be checkpointed individually, in a sense. Whereas other workflow engines, especially ones that can't pass data around, sort of necessitate these monolithic tasks that do an enormous number of things, because you can't pass the data you need to a downstream task, or do a loop, or anything else that isn't natively possible in the workflow manager. And so the idea that tasks would be very different from each other in terms of what environment they need isn't really one that comes up often in the Prefect ecosystem, because tasks are much more like a Python script than a discrete and different set of operations taking place in incredibly different environments.
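For the single-container pattern he describes, later Prefect 0.x releases spell it roughly like this (the registry URL and dependency list are hypothetical, and the storage API moved between releases, so treat this as a sketch rather than a fixed recipe):

```python
from prefect import Flow
from prefect.environments.storage import Docker

with Flow("containerized-flow") as flow:
    ...  # task calls go here

# one image carries the whole workflow, so every task runs
# with the same (most stringent) set of dependencies
flow.storage = Docker(
    registry_url="registry.example.com/team",    # hypothetical registry
    python_dependencies=["pandas", "requests"],  # shared across all tasks
)
```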
[00:35:11] Unknown:
And another challenge that can come up from the environmental context is the idea of needing to cross network boundaries, where you need to potentially distribute the task to be collocated within a network environment that has access to the data. So maybe it's a Hadoop cluster with an HDFS file system that you need to be able to retrieve data from or store results into, but it needs to be able to communicate with the application database that's running in a different network environment. I'm curious how you approach some of the challenges that exist along those boundaries.
[00:35:52] Unknown:
Yeah. So here again, we are exposing the environment class, if you will, to our users, but primarily it's at the flow level. And what we found is that by appropriately pulling secrets and effectively running subtasks wherever they need to be, we've solved this problem to a large degree. One of the things we're working on now, and trying to identify the best API for, is allowing per-task environments, which would allow completely isolated deploys of a task. Not even bridging a network boundary: bridging an everything boundary, to a completely isolated environment. We're actually soliciting feedback right now in our repo, if anyone out there who's listening has a strong opinion about the best way to express that. Our default view is just to use our existing environment classes and allow people to override them on a per-task basis. But, obviously, the devil's in the details, and we want to make sure we do this in a good way. Broadly speaking, though, when it comes to how we run something in one place and another, it's just a matter of designing the workflow so that the workflow itself runs in these two locations.
[00:37:05] Unknown:
And so we've touched on a few of the different capabilities and features of Prefect, but I'm wondering what are some of the others that we didn't touch on yet, and particularly any that are often either overlooked or misunderstood by your users that you think would be valuable and that they should potentially exercise
[00:37:24] Unknown:
more often than they otherwise do? One of the things that comes to mind, as I mentioned earlier, is that we introduced this mapping operator. When people discover our map, they're usually very excited, but we're not sure we're doing a good job of educating people about its existence. By way of history, one of the most common early requests, or early use cases, that we saw when we were designing Prefect was the ability to dynamically generate workflows. In other words, using information that was only known at runtime, we want to adjust the workflow. Now, I used to hack around Airflow to do this. Airflow has this behavior, a feature or a bug depending on how you look at it, where it reparses the DAG every time it runs a task, which means that if you put workflow logic in the DAG definition, you can trick Airflow into running it for you. I would use that, for example, to get Airflow to create new tasks for new files that needed to be processed, which I could discover at runtime, basically. But we don't like that, because we don't like workflow logic running outside of the workflow governance system. That's bad in our eyes. So we discovered that, rather than implementing a fully dynamic workflow, where you can put any task you want in any order you want and it's generated on the fly, almost every single use case that was brought to us was essentially an instance of mapping over a runtime-known list.
That could be: get a list of URLs that you need to go visit, or get a list of user IDs that you need to process, or take things off a queue and work with them. So what we ended up designing, and it solves this problem in a really elegant way, is that any Prefect task, if instead of calling it you just say dot map, will automatically turn itself into a set of parallel pipelines, which can then be further mapped by downstream tasks. You can basically extend these pipelines as far as you want, then reduce them at any time and work with their combined output. Internally, we love this feature. We talk about the Slackbot that we run, and in all honesty, one of the reasons we talk about the Slackbot a lot is because it's the sort of thing that no one would ever dream of running in a traditional workflow system, but it runs great in Prefect. So it's fun to talk about. Our Slackbot runs standup every morning, and in order to do that, it has to send messages to everyone on our team. But we don't want to sit there and hard-code every single person on our team into the workflow. It's a disaster.
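The fix he goes on to describe looks roughly like this in the functional API (a sketch: the task names and the hard-coded roster are hypothetical stand-ins for the Slack-backed versions):

```python
from prefect import task, Flow

@task
def get_usernames():
    # in the real bot this reads the team roster from a database
    return ["alice", "bob", "carol"]

@task
def send_reminder(username):
    print(f"reminding {username} to post a standup update")

with Flow("standup-bot") as flow:
    # one task definition, fanned out over a runtime-known list
    send_reminder.map(get_usernames())
```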
So instead, we use a map operator over a database that we keep anyway, which has all of our Slack usernames in it. We write one task to send a reminder, and then we map that task over the usernames. Now everybody gets a reminder every morning if they haven't sent their update yet. So this map operator is something we love and are really excited about. When people find it, they tend to love it too, but we need to do a better job of telling people about it. And then there's something else we actually haven't talked much about, because it's going to be tied much more to our cloud platform that's coming out later this year, which is versioning.
Versioning is one of those things (I can't remember now if I referenced it or not) that was a really common request: I have a model or I have a workflow, and I need to change it. Traditionally, what you would do is sort of blow away the old one, upload the new one, maybe change its name to say version 5, and hope that you didn't make a mistake. We hate that, obviously, and coming from a data science mindset, I also don't want to get rid of my old versions. They have the metadata; the history is incredibly valuable. So we have what we hope is a very simple, very generalizable versioning concept that we think will let people map whatever their preferred versioning logic is on top of it. What I mean by that is: some people think of versioning as a promotion of the new workflow, where you archive and just keep the old workflow for reference. Some people think of versioning as a way to do A/B testing, where you run two versions of something that are very similar, but you want them both active. As we started researching this, we came up with a handful of different answers to: what does versioning mean to somebody?
And so, again, we did the same exercise we do with pretty much every other use case. We stepped back. We abstracted the commonality. We exposed that in an API. And I'm very hopeful that that becomes a useful feature once we have the platform out.
[00:41:52] Unknown:
And some of the other challenges and requirements that come up often in the context of workflow execution or ETL tasks are the issues of being able to do local development and execution to validate that the code you're writing is doing what you want it to, and also the idea of testability: being able to do validation both ahead of time and at runtime to ensure that your tasks are doing what you want them to do, or that your source data isn't introducing anomalies or defects. So I'm curious how Prefect handles that, and what the interfaces for those capabilities look like.
[00:42:36] Unknown:
Yeah. So this is going to sound like a cop-out answer, but it's a very important one. When I said we'll treat literally anything you write as a Python function, we mean that. When you build a task in Prefect, it is a Python function. It behaves like a Python function. You can call it like a Python function, and, most importantly, you can test it like a Python function. If you go look at Prefect's unit tests, and I'm extremely proud of how many thousands of unit tests there are, you will see complex workflows being generated in the unit tests and then being tested right there, and we can instrument those tests. We don't have to run the whole workflow end to end. We can basically override the output of certain tasks. So if I force the seventh task to return a failed state, I can test that the eighth task is going to respond appropriately.
And we have an entire test utility that ships with the Prefect engine. We basically took all of the little things that we were doing to make our lives easier as we wrote these unit tests, and turned them into utilities that anybody can use. Even things like temporarily overriding configuration or temporarily setting environment variables: we have context managers to do that. Sorry if I sound like a broken record, but we took the same idea of Python as a friendly, known API and asked how we expose that same level of accessibility throughout everything in our platform, and tests are just like that. Testing a Prefect workflow is an awful lot like testing Python. If you want to set a breakpoint in the middle of a task and then jump into an interactive session when that breakpoint is hit, go right ahead. It will work.
All of your tools are available to you, because it's just Python at the end of the day. Now, if you want to do testing in a distributed setting, where maybe you can't just set that breakpoint (you're comfortable locally and now you want to distribute it), we're working on a variety of tools for debugging. Obviously that starts with logs: you want to make sure that you log so you know where you are. We want to make sure that you have a way to recreate a task's data at any time. And as people get deeper into deploying these things and wanting to make sure that contracts are kept and data quality checks pass, we'll be rolling out more and more of our tooling for ensuring that as well.
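A minimal sketch of what "test it like a Python function" means in practice (the task and the config key are illustrative; `set_temporary_config` is the Prefect 0.x utility he alludes to, so treat the exact import path as an assumption):

```python
from prefect import task
from prefect.utilities.configuration import set_temporary_config

@task
def add_one(x):
    return x + 1

def test_add_one_is_just_python():
    # call the task's logic directly, no workflow engine required
    assert add_one.run(2) == 3

def test_with_temporary_config():
    # context manager for overriding Prefect configuration inside a test
    with set_temporary_config({"logging.level": "DEBUG"}):
        assert add_one.run(0) == 1
```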
[00:44:50] Unknown:
Yeah. As you mentioned, the response to a few of my questions has been along the same lines, but I think that's just a signal of good design and a good amount of foresight in terms of how you approached the problem space. Because if you don't have to have special cases for every different type of instance or context, that means the tool itself is built in a way that lets you do what you need to do without having limitations placed on you, like you're mentioning.
[00:45:24] Unknown:
Yeah, and look, I appreciate you saying that. I would invite anyone out there: you don't have to take our word for it. This is an open source package we're talking about. You can go see our docs. You can go play with it. We did work extremely hard to make it feel like Python, because nothing turns users off faster than having to learn a language that is almost, but not quite exactly, like Python in order to build a workflow. And, honestly, Tobias, there's no one here on our team who's going to take credit for necessarily having the foresight to know that this was a good way of doing it. It's just that we made such a strong effort, and such a real attempt at outreach, to make sure that nothing we do is in a vacuum. Everything is shown to our partners. We had more than a hundred people who were partners before we open sourced the package: using it, doing formal research with us, giving us feedback. Some of them would call us every day. Some of them, honestly, we heard from once and didn't really hear from again. But we tried so hard to make sure we had this broad group of partners, honest, legitimate partners, so that we didn't have to guess and be right.
We could fail quickly when we were wrong. And I think that we're all very proud of the API that we've developed, and we're happy when we see data science students picking it up and using it for projects, and we're happy when we see it deployed into massive architectures. Because for the same product to serve both ends of that spectrum is something we're very proud of.
[00:47:04] Unknown:
And as you mentioned, you have a cloud platform that you're working on getting deployed, and you also have the open source core capability of the Prefect engine. So I'm wondering if you can outline the differences in terms of feature sets and capabilities between those two offerings, and what the limitations are of the open source repository as compared to the cloud platform?
[00:47:30] Unknown:
Yeah. So we don't ever want to think of it in terms of limitations, to be honest with you. Prefect didn't start life as a company. It started life as something I was building, honestly, for me. And so in many regards, it takes all the things that I hate about working with most open source tools, and we just said we're not going to do that. Our cloud platform is not a gated version of our open source package. It's a completely separate product. And it's very important to explain why we have this distinction. No one should ever use our open source workflow engine and worry that, oh, we have to pay to get the retry logic or something like that. I can't even think of a good example.
The way we think of it is: Prefect Core, which is what we call our open source package, is a workflow engine. Everything that you need to design the workflow, to test the workflow, to execute the workflow, to package the workflow, to deploy the workflow, to log the workflow, to handle its results, every single thing you could possibly need, is in that open source package, and it is an engine. It is a very powerful engine. Our cloud platform that's coming serves a very different need. It's for once you have a workflow and it's been deployed and it's in production, and we have plenty of people who have put Prefect Core, the open source package, into production.
There's a different stakeholder (or it could be the same person, but often it's a different stakeholder) who becomes extremely interested in what we would call state and interaction, which is to say the metadata that these workflows are throwing off. That's things like maintaining the run history, and doing scheduling in a much richer and more robust way than what we're able to do in the open source, which is bound to a single processor and isn't going to be asynchronous. That is where we see the value of our cloud platform. So our cloud platform is where we ship a database, a GraphQL API for interacting with that database, and the database, of course, is for storing state history.
Our UI is in the cloud platform. Really, the way we think of it is: if Core is the engine, then Cloud is the car. The car is completely worthless without the engine; our cloud platform literally can do nothing without the capabilities brought to it by Core. However, it brings to Core a variety of things that one could very easily build on top of Core if one were so inclined, but we don't really see a reason to do that. We have some opinions about the best way to handle state and do things like make sure tasks only ever run once, certain guarantees that we're able to make, certain ways of tracking logs and exposing them. And that's where we bring our cloud platform to bear. So we're pretty excited about being able to offer it. As I mentioned, open source is a value of our company, and so it's very important to us that there will always be a completely free tier of Prefect Cloud, which will not have features stripped from it that one would have to pay to unlock. We think it's just fundamentally wrong to hold out an open source project and then tell people, oh yeah, we have this other nice thing, it's not quite the same, but sorry, you're not able to use it. We'll never say that.
[00:50:40] Tobias Macey:
Going into the initial implementation and design of Prefect as just a personal project, and then as it grew in scope and ambition to be a company built around this idea, I'm wondering what were some of the assumptions that you had initially, and how have they been challenged or updated as you continued to dig deeper into the problem space and started receiving feedback from users about other ways that they are leveraging the capabilities of Prefect and other adjacent tools?
[00:51:17] Jeremiah Lowin:
Yeah. As I mentioned, we always designed Prefect with users in mind so that when we were wrong about stuff, we could fail relatively quickly and learn it early. There have certainly been many times when that's happened, but fortunately, as a matter of structure, we haven't allowed those mistakes to progress too far. Many of them I won't bore your audience with, because they deal with certain details of the API. But one thing that comes to mind is a mistake that I now see echoed throughout the data engineering space, which is that we assumed we would be working with big data, or machine learning, or complex use cases, and we prepared an enormous amount of outreach and messaging around such use cases. But what emerged when we actually started bringing this to people is something that we jokingly call the apology. It's this moment in many, many of our customer interactions where someone will apologize to us. And it's a little bit odd, to be honest, because the apology always has the following form. They'll say: I'm really sorry, but the problem we're actually having isn't about big data. It's about this CSV file we have that only has a thousand lines. We get it every morning, but it's in some odd format, or the first column is blank, and so we can't import it properly into our software. So there's this apology moment where it turns out the problem is not big, and it's not impactful except to this person who has to go deal with it all the time. And so we learned to shift ourselves a little bit, and we realized we are tackling small problems. We're tackling annoying problems. We're tackling the things where the most common way people describe them to us is: this should be easy, but it's not.
The mistake was to go out and assume we were going to tackle the things that we all agree are hard. So that's been a really interesting lesson for us to learn, and now it's one of the ways that we can very quickly judge whether Prefect is going to fit in well with an organization. Because when that apology moment happens (and I'm calling it the apology facetiously, of course), we know that this person has something that is meaningful to them. It's a problem that many other tools don't deem important enough to solve.
But as I mentioned with our Slackbot example earlier, Prefect is a great solution there because it's so easy, so lightweight. And frankly, we just love helping people with something they hate to do. So that's been a really important lesson, and it's one that scales. Because once that person gets that CSV file working the way they need to, they've freed up their time to go work on the thing they actually want to be doing, which is some giant analytic dataset. And now Prefect has already demonstrated its value to them in a concrete way, so it's very easy to understand how it will demonstrate its value at scale as well.
[00:54:06] Tobias Macey:
Continuing on that idea of different edge cases or interesting problems that people are tackling with it, I'm wondering what are some of the ways that you've seen Prefect used that were particularly interesting or innovative or unexpected?
[00:54:18] Jeremiah Lowin:
Oh, we've seen some very cool things. Some of the most interesting ones honestly come from internally. As you can imagine, anyone who joins Prefect, whether they can code or not, immediately begins writing flows as a way of getting comfortable with the language of our company, if you will. And so we have someone here who uses Prefect to remind her that she needs to do laundry. I just always love that because, like the Slackbot, it's so completely different from anything you would expect from a data engineering toolset. But when you step back, it makes perfect sense. Right? We have something that needs to interact with an API (in this case, a Slack message) on a regular schedule, and it has strict dependencies. We want to make sure it retries itself; we want to guarantee that it succeeds.
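For a sense of how small that kind of workflow is, here is a hedged sketch of a scheduled reminder flow in the Prefect Core 1.x-era API; the webhook URL, message, and schedule are hypothetical stand-ins, not details from the episode:

```python
# A hedged sketch of a small scheduled reminder flow like the laundry
# example, using the Prefect Core 1.x-era API. The webhook URL, message,
# and schedule are hypothetical stand-ins.
from datetime import timedelta

import requests
from prefect import task, Flow
from prefect.schedules import IntervalSchedule

@task(max_retries=3, retry_delay=timedelta(minutes=5))
def send_reminder():
    # The retry settings above provide the "make sure it succeeds" behavior.
    resp = requests.post(
        "https://hooks.slack.com/services/XXX",  # hypothetical webhook
        json={"text": "Time to do the laundry!"},
    )
    resp.raise_for_status()  # an HTTP error raises, triggering a retry

# Run once a day, indefinitely.
schedule = IntervalSchedule(interval=timedelta(days=1))

with Flow("laundry-reminder", schedule=schedule) as flow:
    send_reminder()

if __name__ == "__main__":
    flow.run()  # honors the schedule until interrupted
```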
So apart from the fact that, instead of reading a value out of a Firestore database and telling someone it's time to do laundry, you might instrument that Firestore database itself, it all comes back to this idea that Prefect is about the arrows, not the boxes. And so we've seen just a huge variety of things. Internally, we have designed flows that tell you if your plane is late. We've got flows that calculate the value of a portfolio. And then, in bigger terms, we've got a flow that downloads terabytes of data from NASA every night. In fact, that's a good example of where the apology matters. It turns out that the bottleneck there had nothing to do with the amount of data, the processing of the data, or the storage of the data.
The bottleneck was actually just that sometimes the connection to NASA failed and it needed a retry. And it's sort of extraordinary: when we added the retry, which is all of two lines of code, all of a sudden we were able to enable this massive workflow. It's just so exciting to see. So the spectrum of things we've seen Prefect deployed into is wider than you can imagine, and I'm confident it's being used for things that we haven't even contemplated here.
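As a hedged illustration of the kind of "two lines of code" retry fix described above, using Prefect Core's 1.x-era task options (the download function and its parameters are hypothetical):

```python
# Declaring a retry policy on the task is essentially the whole fix:
# the engine re-runs the task on transient failures instead of failing
# the entire nightly workflow. The function and parameters are hypothetical.
from datetime import timedelta

import requests
from prefect import task

@task(max_retries=5, retry_delay=timedelta(minutes=2))
def download_nightly_data(url: str) -> bytes:
    # A dropped connection raises inside requests.get; an HTTP error
    # status raises below. Either way, the task retries automatically.
    response = requests.get(url, timeout=300)
    response.raise_for_status()
    return response.content
```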
[00:56:14] Tobias Macey:
Conversely, Prefect is this flexible tool that you can do all kinds of different things with, but when is it the wrong choice?
[00:56:20] Jeremiah Lowin:
That's a very important question. Prefect is the wrong choice when, for example, TensorFlow is the right choice. Prefect is not a tool for implementing backpropagation to train a machine learning model. Could you do it in Prefect? Sure, it's Python; you could go right ahead. But we would never recommend it, because you would be forgoing amazing specialized tools for that purpose. What we would love is for Prefect to be the glue around those specialized tools, and that's why it's so important that Prefect has this very flexible, very agnostic API. We're not out there trying to replace software; we're out there trying to play nicely with software.
So when there's a tool (and TensorFlow came to mind first) where you are trying to do a very specific, application-specific thing, use that tool. Wrap it in Prefect, but use that tool. And then, in fairness, to give an answer that's more connected to a business purpose: we've had a surprising amount of interest from the high frequency trading community, and honestly, those are usually calls where we suggest other tools. When we talk about speed and scale for Prefect, we're talking about being able to do things in a near real time fashion, at what we consider scale. But some of these firms that we talked to, firms I interacted with in a different life, are doing things at an order of magnitude difference, and the guarantees that we make are just not at the nanosecond level that they are looking for.
You know, I don't even think Python is the right language for that use case. So that's an example of a field where Prefect is probably just not the right tool for the job.
[00:58:10] Tobias Macey:
Speaking a bit more broadly, based on your experience working in the industry as a data engineer and a data scientist, and now at Prefect interacting with all of the users and partners as you iron out the capabilities and direction of the project, I'm wondering what you have found to be some of the common anti-patterns or challenges that arise in data engineering projects, and some of the overall best practices and industry trends that you're most excited by?
[00:58:40] Jeremiah Lowin:
The talk I gave at PyData touched on this issue in a big way. It was all about how people hack around their workflow system. This is an anti-pattern that we are very sensitive to, because I spent an awful lot of time bug hunting in Airflow, and this was often the culprit. When a system (for our purposes today, let's talk about workflow engines, but almost any bit of infrastructure could fall under this heading) doesn't do exactly what you need, but does most of what you need, you work around it. And you know as well as I do, code is code. We can make it do pretty much whatever we want if we try hard enough.
The danger is that systems can only make guarantees about what they expect to be working with. So when you go around the system, you encounter what, when I oversaw risk management, I would call wrong-way risk. You end up in an environment in which, when things go wrong, you have willfully stepped away from the governance and the guarantees of the systems you put in place, which means that when they go wrong, you don't even find out. The system has no mechanism, no vocabulary, to protect you anymore. So what could have been an error can turn into a catastrophe very quickly, because you went around the system that was supposed to implement those safeguards. This is an anti-pattern that we see, certainly not universally, but commonly enough that it influences our design. Earlier, we talked about the difference between Prefect Core and Prefect Cloud, and increasingly I've come to view Prefect Cloud as a risk management tool that we hope people never look at. I mean that in a very honest way. We have a design goal that 90% of the time, someone should log into it, see a bunch of green lights, and leave.
That would be a successful outcome for us, because we're very sensitive to the idea that if people need to go around the system, then we, or whatever system it is, won't be able to accurately report and reflect back to them the situation that is most meaningful to them in that moment, which is inevitably some sort of error or failure. So that is an anti-pattern that we see, and we are really doing our best to never give someone a reason to go outside our system, or at least to be very upfront about how we recommend approaching certain problems.
The flip side of the question, I think, is what trends am I excited to see? There are a few. I really love seeing data engineering blossoming into an industry in its own right, rather than being a necessary byproduct of the progress being made in the data science world. That's wonderful. The hiring statistics for data engineers (we obviously look at these types of things) have been outpacing those for data scientists for about the last year, for the first time.
And this kind of growth, this kind of interest, these specialization tracks are wonderful to see, because at the end of the day, I can build the greatest model in the world, the greatest analytic in the world, and it does me no good unless I know it's going to be put into production. The best way to make sure it gets put into production is to have wonderful, competent, trained people who love data engineering. And data engineering is this nebulous thing; ask ten people what it is and you'll get ten answers. To me, an expert in data engineering is someone who understands the access patterns of data. I don't look for a data engineer to be an expert in Postgres or MongoDB or any single database, but I expect them to understand the access patterns and the compromises and trade-offs that would be made no matter which underlying technology is used, and similarly for any other technology in the data engineering stack. So one thing I love seeing is data engineering blossoming into an institution in and of itself, which speaks broadly to the growth and importance of data in the modern business world. That's awesome to see.
Another thing I love seeing is cloud-first technologies, things that just would not have been possible five, or certainly ten, years ago. My favorite example is that I can use BigQuery for any project I have. BigQuery is a technology that, when I implemented it many years ago in an organization I worked for, delivered an order of magnitude performance improvement while reducing costs by a similar proportion. That was just an incredible moment. I read the docs and didn't believe it was possible. So I love seeing technologies like that come to bear.
We've loved working with Google, because they make a lot of these things available. So that's something that's very exciting to me.
Tobias Macey:
And looking forward, what do you have planned for the future of the Prefect project and company?
Jeremiah Lowin:
Oh, so much. We are very focused right now on getting our cloud platform out to our early partners, which is happening more or less as we speak; maybe by the time this is released, it will have actually started. That is our number one focus right now. We're working with some really wonderful companies, in the same vein as we did with our open source project, to make sure this is an incredibly successful product and project.
And then, more tangibly, we are working on some very interesting things in our open source as well. We're working on something we call a listener flow, which is going to respond to customer requests for more event-driven workflows. The idea there is that, in response to some asynchronous event (a queue, whatever it is), we're able to launch flows in the same process so that they can share memory, along with all the other benefits that brings. And many of these planned features, we write them all up. We write up all of our design decisions; they're in our docs. We open source them and we discuss them. So even when something is in the idea phase, you can usually go read about it, and it's a lot more information than you'd see on just a roadmap. It's a full proposal, often with code, of how we think we should do it, and we invite people to comment and let us know what they think of the design.
[01:04:59] Tobias Macey:
And are there any other aspects of the Prefect project and business, or the overall state of the industry as far as workflow engines and data integration, that we didn't discuss yet that you'd like to cover before we close out the show?
[01:05:13] Jeremiah Lowin:
You know, I feel like your listeners might murder me if I just keep talking about filling gaps and putting in glue, so I will spare everyone. I'm very, very grateful for the opportunity to talk about what we're building. My whole team is very excited to finally have something that we can show in public and talk about.
[01:05:32] Tobias Macey:
Alright. For anybody who does want to follow up with you or keep track of the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:05:49] Jeremiah Lowin:
You know, I think I just said it, and it would be embarrassing to walk it back. Prefect became a project because it filled a need of mine, and Prefect became a company because I literally couldn't find anything that did what we're doing, and it was the single biggest problem in my world. But to move beyond Prefect for a moment: in a more general sense, I think there's a trend in the data engineering software space where companies build a thing that solves their own problem and then release it to the world. Sometimes, but not always, that technology can be adapted for individual use cases, or whatever someone needs to solve. But we don't see a lot of companies building things that are truly generalizable.
And that makes perfect sense. Very few companies, almost none, are actually incentivized to build something that solves anyone's problems but their own. And something we have learned as a business is that there's a structural belief that if you have a problem and there isn't a third-party package solving it, it must be something unique to your business, and as such it must be correlated to something special that you're doing as a business; in a very interesting way, it's part of your competitive differentiation. So one gap that we see is that this push by large companies, which may share almost nothing in common with the use cases of the majority of companies, to put their technology out and say this is the way to do it, has led to what I would consider suboptimal outcomes. Open source is one way to combat that, although open source is not immune to what I'm describing: many popular open source projects themselves started within large companies. So I don't have a magic bullet that would solve this. But to the extent that we are cognizant of the fear I mentioned earlier, that folks can work around a system to get it to do what they want, and in doing so abandon the guarantees that system is able to make, that's a fear I have as the data engineering world expands in ways that I otherwise think are really incredible and wonderful.
[01:07:49] Tobias Macey:
Alright. Well, thank you very much for taking the time today to share your insight on the state of the industry for workflow tools and the work that you're doing on Prefect. It's definitely an interesting platform and one that I plan to start experimenting with on my own. So thank you for all of your effort and energy on that front, and I hope you enjoy the rest of your day.
[01:08:19] Jeremiah Lowin:
Thank you, Tobias. Thanks for having me on and for inviting me to speak with you today; I really appreciate the opportunity. Hope you have a wonderful day as well.
Introduction and Sponsor Message
Upcoming Conferences and Events
Interview with Jeremiah Lowin Begins
Jeremiah Lowin's Background and Career Progression
Introduction to Prefect and Its Motivation
Data Science vs Data Engineering Needs
Prefect's Approach to Workflow Engines
Positive vs Negative Engineering
Prefect's Execution Context and Design Choices
Inspiration and Lessons from Other Systems
Handling Data Passing and Serialization
Managing Task Requirements and Dependencies
Cross-Network Task Execution
Prefect's Features and Capabilities
Testing and Validation in Prefect
Prefect Core vs Prefect Cloud
Assumptions and Lessons Learned
Interesting Use Cases of Prefect
When Prefect is Not the Right Choice
Common Anti-Patterns and Best Practices
Future Plans for Prefect
Final Thoughts and Closing