Summary
In this episode of the Data Engineering Podcast, Anna Geller talks about the integration of code and UI-driven interfaces for data orchestration. Anna defines data orchestration as automating the coordination of workflow nodes that interact with data across various business functions, discussing how it goes beyond ETL and analytics to enable real-time data processing across different internal systems. She explores the challenges of using existing scheduling tools for data-specific workflows, highlighting limitations and anti-patterns, and discusses Kestra's solution, a low-code orchestration platform that combines code-driven flexibility with UI-driven simplicity. Anna delves into Kestra's architectural design, API-first approach, and pluggable infrastructure, and shares insights on balancing UI and code-driven workflows, the challenges of open-core business models, and innovative user applications of Kestra's platform.
In this episode of the Data Engineering Podcast, host Tobias Macy interviews Anna Geller, a data engineer turned product manager, about the integration of code and UI-driven interfaces for data orchestration. Anna shares her journey from working with data during an internship at KPMG to her current role as a product lead at Kestra. She provides her insights into the concept of data orchestration, emphasizing its broader scope beyond just ETL and analytics, and discusses the challenges and anti-patterns that arise when using existing scheduling systems for data-specific workflows.
Anna explains the overlap between CI/CD, scheduling, and orchestration tools, and the limitations that occur when these tools are used for data workflows. She highlights the importance of visibility and governance at scale and the need for a dedicated orchestrator like Kestra. The conversation also delves into the challenges of using data orchestrators for non-data workflows and the benefits of combining code and UI-driven approaches.
Anna discusses Kestra's architecture, which supports both JDBC and Kafka backends, and its focus on API-first interactions. She explains how Kestra handles task granularity, inputs, and outputs, and the flexibility provided by its plugin system. The episode also explores Kestra's approach to data as assets, the target audience for Kestra, and how it bridges different workflows across organizational boundaries.
The discussion touches on Kestra's open-core model, the challenges of balancing open-source and enterprise features, and the innovative ways Kestra is being applied. Anna shares insights into Kestra's local development experience, the lessons learned in building the product, and the upcoming features and projects that Kestra is excited to explore.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- As a listener of the Data Engineering Podcast you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us you should listen to Data Citizens® Dialogues, the forward-thinking podcast from the folks at Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. They address questions around AI governance, data sharing, and working at global scale. In particular I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field. While data is shaping our world, Data Citizens Dialogues is shaping the conversation. Subscribe to Data Citizens Dialogues on Apple, Spotify, Youtube, or wherever you get your podcasts.
- Your host is Tobias Macey and today I'm interviewing Anna Geller about incorporating both code and UI driven interfaces for data orchestration
- Introduction
- How did you get involved in the area of data management?
- Can you start by sharing a definition of what constitutes "data orchestration"?
- There are many orchestration and scheduling systems that exist in other contexts (e.g. CI/CD systems, Kubernetes, etc.). Those are often adapted to data workflows because they already exist in the organizational context. What are the anti-patterns and limitations that approach introduces in data workflows?
- What are the problems that exist in the opposite direction of using data orchestrators for CI/CD, etc.?
- Data orchestrators have been around for decades, with many different generations and opinions about how and by whom they are used. What do you see as the main motivation for UI vs. code-driven workflows?
- What are the benefits of combining code-driven and UI-driven capabilities in a single orchestrator?
- What constraints does it necessitate to allow for interoperability between those modalities?
- Data Orchestrators need to integrate with many external systems. How does Kestra approach building integrations and ensure governance for all their underlying configurations?
- Managing workflows at scale across teams can be challenging in terms of providing structure and visibility of dependencies across workflows and teams. What features does Kestra offer so that all pipelines and teams stay organised?
- What are the most interesting, innovative, or unexpected ways that you have seen Kestra used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Kestra?
- When is Kestra the wrong choice?
- What do you have planned for the future of Kestra?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
[00:00:11] Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey. And today, I'm interviewing Anna Geller about incorporating both code and UI driven interfaces for data orchestration. So, Anna, can you start by introducing yourself?
[00:00:26] Anna Geller:
Yes, of course. I'm Anna Geller. I'm a data engineer and technical writer turned product manager. I worked in many data engineering roles, including consulting, engineering, and later also DevRel, and currently I work as a product lead at Kestra. And, yeah, that's the subject of today's podcast. And do you remember how you first got started working in data? Yes. I think I started working with data during an internship at KPMG, processing data for year-end audits. There were a lot of Excel spreadsheets and queries to SQL Server. Yeah, that was how I started. I actually also studied, kind of, data engineering
[00:01:14] Tobias Macey:
as my master's. So, yeah. In terms of the scope of this conversation, can you start by giving your definition of what constitutes data orchestration and what is necessary for a system to be able to orchestrate data effectively?
[00:01:31] Anna Geller:
Yeah. So it's always a bit difficult to agree on definitions in the industry. The way I see data orchestration is that it's automated coordination of workflow nodes that touch data. This means that, essentially, any workflow nodes that interact with data, whether they produce data or consume data, all fall into this category. One misconception I see is that many people associate orchestration only with ETL and analytics. Instead, I think we should see it a bit more as a broader concept that covers how data moves across your entire business.
So I think every company has internal APIs that need to exchange data. You need to react to events, like sending an email or maybe updating inventory anytime there's a new shipment. You need to process data across ERP, CRM, PLM, all kinds of internal systems, and you often need to do that in real time rather than in just nightly ETL jobs. Yeah. So I think the distinction is whether you want to automate workflows for the entire IT department, with multiple teams, environments, and internal systems, or whether you just do it for the data team.
[00:02:56] Tobias Macey:
And another aspect of the challenge of trying to really pin down what data orchestration means and what you should use to execute those workflows is that in the technical arena and in organizations, there are numerous different scheduling systems, workflow systems, automation systems, in particular, things like CICD for software delivery. There is a scheduler in Kubernetes and other container orchestrators. There are things like CRON and various other time based scheduling or event based systems such as Kafka or different streaming engines.
And a lot of times, because something already exists within the organizational context, when a new task or requirement comes up, the teams will naturally just reach for what they already have even if it's maybe not necessarily designed for the specific task at hand. And I'm wondering if you can talk to some of the ways that those tendencies can lead to anti patterns and some of the limitations in the approach of using what they already have for data specific workflows.
[00:04:06] Anna Geller:
Yeah. So I believe there is a lot of overlap of functionality between all those CI/CD, scheduling, and orchestration tools. If we think about it, they all have a trigger. Right? So, for example, when a new pull request is opened or merged, you need to do something. They all have a list of jobs or tasks to run when some event is received. They also all have states, so they are all state machines in the end. If a given step fails, you want to maybe restart the entire run from a failed state. And many CI tools, maybe in the data space we don't realize it, but they also have things like notifications on failure.
They have ways to maybe pause after a build step to validate if the build was correct and to approve or reject a deployment. Right? So there's quite a large overlap, and I think it's quite natural for companies, instead of directly considering a dedicated orchestrator, to first try to use what they have and see if they can expand it to use cases like data workflows, automation of microservices, or automation of business processes. I think the limitations usually show up when you have true dependencies across workflows, across repositories, even across teams and infrastructure.
And also when you start running workflows at scale, because then you just lack visibility. It's kind of the same as with AWS Lambda: when you have tons of those different functions, at some point you are just confused. You have no overview of what the health of your platform actually is. And let's take GitHub Actions as one concrete example. GitHub Actions is great, but the moment you have complex dependencies or custom infrastructure requirements, GitHub Actions starts becoming maybe not the right solution. For example, you want to run this job on ECS Fargate, run this job on Kubernetes, and run another job on my on-prem machine to connect to my on-prem database to perform some data processing. Then you have patterns like, run this job only after those 4 jobs complete successfully, or run things at scale. And you want to manage concurrency.
You want to manage multiple code bases from multiple different teams. Already, managing secrets across all those multiple repositories, as you would have to do with GitHub Actions, can become a bit painful when you have multiple teams that maybe want to share them. This kind of visibility and governance at scale is something where I believe you may consider a true orchestrator.
[00:07:01] Tobias Macey:
Another challenge in the opposite direction is that teams that do invest in data orchestration will say, again, I already have something for doing orchestration. Why don't I also use that for CICD or whatever other task automation I have? And I'm curious what you have seen as some of the challenges in that opposite direction of using a data orchestrator for something that is not a data driven workflow?
[00:07:25] Anna Geller:
It depends on what we, in the end, consider a data orchestrator, because many data orchestrators will not be able to perform a task like triggering a CI/CD pipeline to deploy some containers. For example, dbt Cloud: if you consider dbt Cloud to be an orchestrator, you will not be able to, say, start some Terraform apply from dbt. It's obviously not its use case. For Python orchestrators, like Airflow and all the tools in this space, I think it's more feasible, but it can be a bit clunky to orchestrate CI just from Python, because mostly in CI what you do is run CLI commands.
If you do it from Airflow, you would need to have some HTTP sensor that listens to some event webhook, maybe after your pull request was merged or something like this. So it would be feasible, but it can be quite clunky and not easy to maintain. In Kestra, we try to make this pattern really easy, given that you can simply add a list of tasks with your CLI commands, then add a webhook trigger that can react, maybe, to your pull request event. And then it's very simple. I actually have one quote, I don't know if I should just read it out loud, from one user who is doing CI/CD in Kestra, and he mentioned that it was really refreshing. It's so simple yet powerfully flexible.
It really does allow you to create pretty much any flow you require. I have been migrating our pipelines from GitHub Actions to Kestra, and it's been so simple to replicate the logic. The ability to mix and match plugins with basic shell scripting or a script from a language is just amazing. So I think we have some good testimonies that kind of prove that the transition was fairly seamless.
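To make that concrete, here is a rough sketch of what such a CI flow could look like in Kestra's YAML. The plugin and trigger type names follow Kestra's naming conventions but may differ between versions, and the repository URL and commands are placeholders rather than anything mentioned in the episode.

```yaml
id: ci_pipeline
namespace: company.engineering

tasks:
  - id: build_and_test
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - git clone https://github.com/acme/example-repo repo   # placeholder repository
      - cd repo && ./run_tests.sh                              # placeholder build/test script

triggers:
  - id: on_pull_request_merged
    type: io.kestra.plugin.core.trigger.Webhook   # point a GitHub webhook at this trigger's URL
    key: replace-with-a-random-key
```

The webhook trigger gives the flow an HTTP endpoint, so the Git host calls Kestra when the pull request event fires and the CLI tasks run in sequence.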
[00:09:16] Tobias Macey:
Another element of data orchestration is the way in which it's presented and controlled. There have been a number of generations of data orchestration, each focusing on the specific problems of the overall ecosystem at that time. And one of the main dichotomies that has existed throughout is the question of whether it's largely a UI driven or a low code approach where you're dragging and dropping different steps and connecting them up in a DAG or whether it's largely a code driven workflow where that also has some degrees of how code heavy it is or maybe it's a YAML description of what the different tasks are. Maybe it's pure code where a lot of times that will lock you into a particular language environment.
And I'm wondering what you see as some of the main motivators for those UI versus code driven workflows at the technical and the organizational level.
[00:10:11] Anna Geller:
The main motivation to combine code and UI driven approaches is to close the market gap. The way we see the orchestration and automation tool market is that, on the one hand, you have all those code-only frameworks, often requiring you to build your workflows in Python, JavaScript, or Java. And on the other end of the spectrum, you have all those drag and drop ETL or automation tools. In both of those categories, there are many solutions you can pick from. There are a bunch of Python-based orchestration frameworks, there are a bunch of no code drag and drop solutions, but there are very few tools in the middle. And this is the gap that Kestra tries to fill. In general, we believe that Kestra is the best among low code orchestration solutions. And if we make this claim that we are the best, why are we the best? With most tools in this no code UI space, you would first build something in the UI, and they will create a dump of a JSON schema and call it code. What Kestra does differently is that, with every new feature, we start with code and API first, and all those UI components come later. As a result, the YAML definition is readable. It has full auto completion and syntax validation.
You have great UX in terms of built-in documentation, revision history, and Git integration, so that you can iteratively start building everything in the UI. You can then push it to Git when you are ready, and you cover this whole spectrum of having a nice, intuitive UI to iteratively build workflows without compromising the engineering benefits of a framework. To maybe summarize: existing solutions are usually either too rigid, like all the no code tools, or too difficult, like all the frameworks. With Kestra, you have all the benefits of a code-based orchestration framework without the complexity of a framework. So you don't have to package and deploy your code. You can just go to the UI.
You quickly edit it, you run it to check if it's working, and you are done in just a few minutes.
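For reference, a minimal flow in that YAML syntax might look roughly like this; the Log task type is drawn from Kestra's core plugins, and exact type names can vary between versions.

```yaml
id: hello_world
namespace: company.team

tasks:
  - id: say_hello
    type: io.kestra.plugin.core.log.Log   # core plugin; logs a rendered message
    message: Hello from Kestra!
```

The same definition can be typed directly in the UI editor, validated on save, and later pushed to Git unchanged.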
[00:12:36] Tobias Macey:
One of the challenges of having a low code interface, even if there is a code driven workflow available, is that it imposes necessary constraints to be able to ensure that even if you do have a code element, you're able to visually represent it for people who are using that UI driven approach. And a lot of times, I've seen that lock the tool chain into a specific technology stack where maybe it is UI driven. It will generate code for you, which you can then edit, and it will translate that back to the UI, but only if you're running on Spark or only if you're running on Airflow. And I'm wondering if you can talk to some of the ways that that dual modality and the requirement to be able to move between those different interfaces and maintain parity between them imposes constraints as far as
[00:13:32] Anna Geller:
the interfaces or the workflow descriptions or the types of tasks or runtime environments that you're able to execute with? There are no constraints in terms of what you can orchestrate or which technology you want to integrate with. The only constraint is that Kestra has built-in syntax validation, which means that the API doesn't allow you to save the flow if it's invalid. So this is one constraint. There are obviously tons of benefits with this. There are no surprises at run time because the flow is validated during its creation, at build time. If you have invalid, let's say, indentation in your Kestra YAML, Kestra won't let you save that flow. In contrast, we can maybe compare it to how it's handled in Python, because I believe a lot of your audience use tools like Airflow. With a DAG defined in a Python script, your workflow logic can be potentially more flexible, but a wrong indentation in your Python script will be detected at run time. So in the end, it's more flexible, but also more fragile. And as with pretty much everything in technology, it comes down to the trade off of constraints and guarantees that we can offer. With Python, you can have potentially a bit more flexibility in how you define this workflow logic, but at the risk of having additional runtime issues if something is incorrect. And you also have the downside that you have to actually package and deploy that code. With the benefit of being in YAML, Kestra is a bit more constrained, but it's also portable and self contained. It's quite painless to deploy.
It's validated at build time, and you can be sure that everything is working. So, yeah, pretty much the only constraint is that you cannot save an invalid flow.
[00:15:27] Tobias Macey:
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
So in order to explore a little bit further as far as the constraints and benefits, I think it's also worth discussing what the overall architecture of Kestra is and some of the primitives that it assumes for those orchestration tasks. And then we can dig more into some of the ways that you're able to use those primitives to build the specific logic that you need. So if you can just give a bit of an overview about how Kestra is implemented and some of the assumptions that it has about the level of granularity of the tasks and the types of inputs and outputs that it supports.
[00:16:37] Anna Geller:
Yes. So maybe let's start with the architecture. Kestra started with an architecture that relies on Kafka and Elasticsearch, and it was really great in terms of scalability, with no single point of failure. But at the same time, it made it more difficult for new users to get started with the product and to explore it. Many listeners probably know that maintaining Kafka in production can be difficult. So that's why Kestra added an architecture with a JDBC backend in the open source version. This means that you can use Postgres, MySQL, SQL Server, or H2 as your database. And on top of that, you have the typical server components you can expect from an orchestration tool: an executor, a scheduler, a web server, and workers. All of those components can be scaled independently of each other because they are all kind of like microservices.
And if you need more schedulers or more executors, you can just increase the number of replicas in your Kubernetes deployment and everything just works. So this is from the, let's say, DevOps backend perspective of the architecture. In terms of user experience, Kestra relies really heavily on the API. We are an API first product. This is not an orchestration framework where you would just define your code, run it locally, and then deploy the code; instead, everything interacts through the API. So the tasks and triggers you can use are restricted by the plugins that you have in your Kestra instance. You can have as many plugins as you want. By default, Kestra comes prepackaged with all plugins, so you don't need to install anything. This is kind of the main benefit you already get with an orchestration platform like Kestra: there's no need to pip install every dependency that you need to use all those different integrations.
Everything is prepackaged by default. And if you need a bit more flexibility, you can cherry pick which plugins are included. So let's say you are an AWS shop; you don't use Azure or GCP, and you don't want those extra plugins for those other cloud vendors. You simply don't include them in your plugins directory in Kestra, and you just cherry pick the plugins that you need. On top of that, you can build your custom plugins. The entire process is fairly easy. You have a template repository that you can simply fork and build your code on top of. Then you build your jar file, include it in the plugins directory, and you have the custom plugin. Then, in terms of governance that you can have on top of this, as a Kestra administrator you can set plugin defaults for each of the plugins that you added, to, for example, ensure that everybody is using the same AWS credentials.
Or, if you want to globally enforce some pattern that maybe everybody should use, you can enforce those properties globally using plugin defaults. And this pluggable infrastructure has some constraints, in that if you don't have a plugin for something, you will not be able to use it. But the benefit is that you have a lot of governance, and it scales really well with more plugins that you can always add. And we also have the possibility to create custom script tasks. So if some plugin is missing and you don't want to touch Java to build a custom plugin, you can do that, for example, in Python, R, or Node.js.
You can write your custom script, and you can just run it as a container. That's kind of how Kestra can support all those different kinds of integrations.
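As a sketch of the plugin defaults idea, an administrator might pin shared AWS credentials roughly like this; the property names and the secret() expression follow Kestra's documented conventions but should be treated as assumptions that depend on version and edition.

```yaml
# Plugin defaults applied to every task whose type starts with the given prefix
pluginDefaults:
  - type: io.kestra.plugin.aws              # matches all tasks from the AWS plugin group
    values:
      accessKeyId: "{{ secret('AWS_ACCESS_KEY_ID') }}"
      secretKeyId: "{{ secret('AWS_SECRET_KEY_ID') }}"
      region: eu-west-1
```

Because the defaults are applied centrally, individual flows can omit credentials entirely, which is the governance benefit Anna describes.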
[00:20:28] Tobias Macey:
And so in terms of the level of granularity of the tasks or the data assets that you're operating over, what are the assumptions of Kestra as far as the, I guess, scale of data, the types of inputs and outputs, and in particular, the level of detail that you're able to get to as far as what a given task or plugin is going to execute and how that passes off to the next task or plugin?
[00:20:59] Anna Geller:
That's mostly coordinated through inputs and outputs. Each workflow can have as many inputs as you want, and all inputs are strongly typed. So you can say, okay, this input is a boolean, this input should be an integer, and this input is a select, so you can only pick the value from a dropdown. Maybe this input is a multi-select, so you can only choose among the predefined values. You can have JSON, URL, all kinds of different inputs, and the benefit is that they are strongly typed, so the end user, who may not be as technical, will know what values they can pass into the workflow. Then the communication between tasks, to pass data between each other, mostly operates in terms of metadata and internal storage.
If you want to pass some data objects directly, you can do that if your plugin specifies that some data should be output. In addition, you also have input files and output files for script tasks. So you need to explicitly declare that, let's say, this Python task should output those two files, or maybe all JSON files, and then they will be captured and automatically persisted in Kestra's internal storage. You can think of internal storage as an S3 bucket. It can be S3, GCS, etcetera, or just local storage. People familiar with Airflow can think of internal storage as Airflow's XComs without the complexity of having to do the XCom push and pull. So, yeah, that's how tasks can pass data between each other, and you can even pass data across workflows. I think this is huge for governance. We have many users who use, for example, subflows to compose their workflows in a more modular way, so that you can have one parent flow that triggers multiple processes, and each of them is composed of subflows.
And the subflows can output some data as well, and they can pass it between each other, so that you have this way of exchanging data between different teams and different projects without having to hard code any dependencies and without having to rely on implicitly stored files somewhere locally.
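Putting the typed inputs and output passing together, a hedged sketch of a flow could look like the following; the input and task type names reflect Kestra's general conventions (they differ somewhat across versions), and the scripts are illustrative only.

```yaml
id: typed_inputs_and_outputs
namespace: company.team

inputs:
  - id: run_date
    type: DATE
  - id: environment
    type: SELECT              # end users pick from a dropdown instead of free text
    values: ["dev", "prod"]

tasks:
  - id: extract
    type: io.kestra.plugin.scripts.python.Script
    outputFiles:
      - "data.json"           # explicitly declared, then persisted in internal storage
    script: |
      import json
      with open("data.json", "w") as f:
          json.dump({"environment": "{{ inputs.environment }}", "rows": 42}, f)

  - id: load
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - cat "{{ outputs.extract.outputFiles['data.json'] }}"   # downstream task reads the persisted file
```

The downstream task never touches the upstream worker's local disk; it only references the file that was captured into internal storage.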
[00:23:14] Tobias Macey:
Another trend that's been growing in the data orchestration space is the idea of rather than data as tasks, treating data as assets where one task might produce multiple assets. The canonical example largely being dbt where you might have one dbt command line execution that produces tens or hundreds of different tables as an output and being able to track those independently, particularly if there are downstream triggers that depend on one of those tables being updated or materialized. And I'm wondering how Kestra addresses or some of the ways that Kestra is thinking about that level of granularity in terms of a task producing multiple different outputs or assets as a result.
[00:24:02] Anna Geller:
Yeah, that's totally feasible. Each task can output as many results as you want. Maybe I wouldn't recommend outputting, like, 1,000 files because the UI could potentially break, but overall you can output as many things as you wish, and Kestra doesn't introduce any restrictions in terms of what your specific outputs can be. There is one really great feature that people really appreciate in Kestra, which is output preview. If your task run returns, say, a CSV file or a JSON file, you can easily preview it in the UI so that you know if the data format is right and if everything looks good. In the same way, if something fails, you can preview the data and see, okay, maybe what I have in this downstream task is some error in my code; maybe you didn't capture some edge cases. You can redeploy your workflow, so essentially you create a new revision, and you can rerun it only for this new, last task. This is a feature called replay. It's super useful for failure scenarios.
If you process data and something unexpected happens, you don't want to rerun all those previous things, right? Because everything else worked; only this single thing didn't work. So you can very easily reprocess things that don't work simply by fixing the code and pointing the execution to the updated revision.
[00:25:28] Tobias Macey:
In terms of the audience that you're targeting, given the fact that it has this UI and code driven approach, I'm wondering how you think about who the target market is, the types of users, and some of the ways that that dual modality appeals to different team or technical boundaries across the organization?
[00:25:48] Anna Geller:
Yeah, that's a great question. Our target audience currently is mostly engineers who build internal platforms. So usually you would build some workflow patterns, and you want to expose some workflows to less technical users or to external stakeholders. We have lots of architects, software architects, coming to Kestra to support them in replatforming. This usually means they want to move from on prem to cloud, or there's also the completely reversed pattern: there are many companies these days who move from cloud back to on prem because of additional compliance reasons. So, yeah, a lot of people using Kestra are those platform builders who then expose those workflows to less technical users for a variety of use cases. Kestra is not focused exclusively on data pipelines. We also support infrastructure and API automation and business process orchestration.
You have things like approval workflows. One very common scenario is that there are some IT automation tasks that, for example, provision resources, and some DevOps architect or manager needs to approve whether those resources can be deployed. So you have this approval process implemented in Kestra so that the right person can approve the workflow to continue. We also have all those event driven data processing use cases, where you receive events, for example, from Kafka, SQS, or Google Pub/Sub, and you want to trigger some microservice in response to this event.
That's also a perfect use case for Kestra. So it's not restricted to data pipelines. And I would say this is still data orchestration, because you react to some data changes in the business, and you want to run some data processing in response.
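For the approval scenario, a rough sketch of such a flow might be as follows; the Pause task is part of Kestra's core flow plugins (its exact type name varies by version), and the provisioning script and input values are purely illustrative.

```yaml
id: provision_with_approval
namespace: company.infra

inputs:
  - id: instance_type
    type: SELECT
    values: ["small", "medium", "large"]

tasks:
  - id: wait_for_approval
    type: io.kestra.plugin.core.flow.Pause    # execution pauses here until someone resumes it from the UI or API

  - id: provision
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - ./provision.sh "{{ inputs.instance_type }}"   # hypothetical provisioning script
```

The manager reviews the paused execution and resumes (or kills) it, which is how the approve-or-reject step is modeled.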
[00:27:36] Tobias Macey:
As a listener of the Data Engineering Podcast, you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us, you should listen to Data Citizens Dialogues, the forward thinking podcast from the folks at Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. They address questions around AI governance, data sharing, and working at global scale, among others. In particular, I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast moving field.
While data is shaping our world, Data Citizens Dialogues is shaping the conversation. Subscribe to Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts. At the organizational level too, I'm interested in some of the ways that Kestra is able to implicitly bridge these different workflows without the different teams needing to know every detail of what the available data is and how it's produced. Where, for instance, I have a workflow that's taking a file from an SFTP server, processing it, generating some table in the data warehouse as a result, and then somebody else's workflow depends on the contents of that data warehouse table, does some analysis, and produces some data visualization or report, how do you manage the execution of the next task without the person who controls each task having to explicitly communicate between them or requiring that the workflows are directly built on top of each other?
[00:29:19] Anna Geller:
So Kestra supports that pattern using a flow trigger. This was, in fact, one of the most popular patterns from the very beginning of the product. The use case typically looks as follows. You have multiple teams that don't have tight dependencies between each other. So you would say, run this flow only when those 3 other workflows from different teams have successfully completed within the last 24 hours. You can easily define that as a condition for this flow, so that it only runs after those preconditions are true. And you can additionally add conditions for the actual data. So you can say, only if this data returned maybe a 200 status code, or if this data has this number of rows, do something, like trigger this workflow. Kestra doesn't introduce any new concept for this. We already have the concept of triggers.
So implementing those kinds of patterns is a matter of explicitly declaring in your YAML what the expectations are to trigger this workflow. And you can explicitly list all of those flow executions that should be completed within the given time frame, and then it will run. So I think the mindset is quite similar to how many other data orchestrators do that, but without
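A hedged sketch of such a flow trigger is shown below; the trigger and condition type names, and the time window syntax, are approximations that differ between Kestra versions, and the namespaces and flow IDs are placeholders.

```yaml
id: downstream_report
namespace: team.analytics

tasks:
  - id: build_report
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - ./build_report.sh                     # placeholder reporting script

triggers:
  - id: after_upstreams
    type: io.kestra.plugin.core.trigger.Flow
    conditions:
      - id: all_upstreams_succeeded
        type: io.kestra.plugin.core.condition.MultipleCondition
        window: PT24H                         # only fire if all listed executions completed within 24 hours
        conditions:
          ingest_done:
            type: io.kestra.plugin.core.condition.ExecutionFlow
            namespace: team.ingestion
            flowId: sftp_to_warehouse
          quality_done:
            type: io.kestra.plugin.core.condition.ExecutionFlow
            namespace: team.quality
            flowId: warehouse_checks
```

Each team keeps publishing its own flow; the downstream team only declares which upstream executions it cares about.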
[00:30:44] Tobias Macey:
restricting it directly to only being data. Another aspect of what you're building with Kestra is the fact that it is an open core product with a paid service available on top of it. And I'm wondering if you can talk to some of the ways that you think about what are the elements that are available as open source, what are the things that are paid, and some of the ways that you think about the audiences across those two, and how you are working to keep the open source elements of it sustainable and active.
[00:31:17] Anna Geller:
That's the challenge every open core company is asking themselves every day, I'm pretty sure. We have this framework where all features that are about security, scalability, and governance go into the enterprise edition, and all features that are single player, core orchestration capabilities go into the open source version. That's how we try to balance it. I believe there is no single answer; every company tries to find the best solution. What we found out so far is that we have some prospects, some people coming to Kestra, who would prefer to have a fully managed service. And currently, Kestra doesn't offer that. We have open source and a self hostable enterprise solution. So that's something we'll be working on next year. It will be a big priority, especially to enable even more people to try the product, see how it's working, including trying even those paid enterprise grade features, without having to first talk to sales and start an official POC.
[00:32:22] Tobias Macey:
And as you have been building and working with Kestra, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:32:29] Anna Geller:
Yeah. One of the most interesting is that we have one solopreneur who was automating their entire business with Kestra, including payment automation and categorizing customer support tickets using OpenAI. So, a super interesting use case, and great to see that Kestra can be applied for solopreneurs. As for more surprising and unexpected, I would have expected more people to want to write custom code. And what we have found out is that there are many, many users who purely use our plugins. If they need to have some transformations, they would often just add custom Pebble expressions. This is like Jinja in Python, where they transform some data on the fly without writing dedicated code, like extra Python functions. So, yeah, I was frankly a bit surprised. Sometimes it seemed to me personally easier to maybe write custom code for this aspect, but I see users just prefer to keep things simple: just a simple transformation function, and move to the next task. I was also a bit surprised how many users actually leverage the low code aspect of Kestra. Our default UI is to use the code interface, so you need to write your workflow. We have beautiful auto completion and syntax validation when you just type things in the UI. But many users still explicitly opt in to the topology view and just add things from the low code UI forms. So that's one aspect which was also surprising to me. And, overall, I think it's always surprising to see how broad a spectrum of users is coming to us. We have some who, as I mentioned, just prefer to keep things simple; they only use our plugins. And there are other people who just write custom code for everything, so, like, every task is maybe a Ruby or JavaScript or Python task. So the spectrum is really wide, and it's really interesting to see this.
[00:34:31] Tobias Macey:
Another aspect that I forgot to touch on earlier is given that Kestra is by default a platform service, what does the local development experience look like for people who are maybe iterating on a script or trying to test out a workflow before they push it to their production environment?
[00:34:49] Anna Geller:
Yeah, I believe the local development is really great. We have feedback from one user who mentioned that writing workflows in Kestra is fun, which is unheard of in the world of orchestration, that building workflows can be fun. So, essentially, to get started, you run a single Docker container. You open the UI, and you hit a single button to create a flow. From here, you add your ID for the flow, the namespace to which it belongs, the list of tasks that you want to orchestrate, and the triggers, so whether this should run on a schedule or based on some event, when a new file arrives in S3, etcetera. And then, when you start typing your tasks, you get this auto completion and built-in documentation.
You also have blueprints that will guide you through examples of how to leverage some usage pattern. So I believe the local development experience is really unique to Kestra. And as I mentioned, some users even consider this fun, which is very refreshing.
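As a sketch of that getting-started flow, a minimal local setup could be described in a Docker Compose file like the one below; the image name and the `server local` command come from Kestra's quickstart, while the port mapping and the Docker socket mount are assumptions about a typical local configuration (the official compose file in the docs is more complete).

```yaml
# docker-compose.yml: single-container local setup using Kestra's embedded database
services:
  kestra:
    image: kestra/kestra:latest
    command: server local                      # "local" mode runs everything in one process
    ports:
      - "8080:8080"                            # UI and API
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock   # assumed: lets script tasks run as containers
```

With the container up, the UI at localhost:8080 is where flows are created, validated, and run.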
[00:35:48] Tobias Macey:
And in your own work of building and using and communicating about Kestra, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process? One of the most challenging or interesting lessons we've learned is that following the common VC advice of, you know, just start with a niche, then
[00:36:09] Anna Geller:
land and expand. I think this approach didn't work for Kestra as well as we would have wished. At first, Kestra targeted mostly the data engineering for analytics use case. And over time, we expanded to operational teams, and we focused on engineers who are orchestrating custom applications, reacting to events, building scheduled backup jobs, or building infrastructure with Kestra. And this is where the adoption really took off. So the lesson learned is that you don't always need to follow the VC advice. Sometimes following your own vision can be better. Then, in terms of product building, one lesson learned came from trying to use the VS Code editor within Kestra. Within one release, we launched an embedded VS Code editor within the UI. Over time, we found it was really difficult, and in the end it was much easier to build our own custom editor than to keep maintaining the one from VS Code, because you have so little control over how everything looks and how the VS Code extension and the UI interact. Yeah. So I think this was something that was surprising. We thought it would be easier. We also thought that VS Code would be more open and not as restricted.
So if you want to, for example, use GitHub Copilot in your product, you cannot do that. It's really restricted
[00:37:34] Tobias Macey:
to Microsoft only. And for individuals or teams who are evaluating orchestration engines and trying to decide what fits best into their stack, what are the cases where Kestra is the wrong choice?
[00:37:48] Anna Geller:
Yeah. So Kestra is the wrong choice if you build stateful workflows that implicitly depend on side effects produced by other tasks or by other workflows. To give you an example, let's say you have one Python function that writes data to a local file, and there is another task in another workflow that tries to read this local file. Technically speaking, if you use the worker group feature in Kestra, you could make this work. But we consider this implicitly stateful approach a bad practice. We prefer that you declaratively configure that this task outputs a file, and this file will then be persisted in internal storage. And then it can be accessed transparently by other tasks or even by other flows.
In general, we try to bring infrastructure as code best practices to all workflows. So we assume that your local development environment should be the same as what you are, in the end, doing in production. In prod, you usually run things in a distributed fashion, so you cannot guarantee that those two tasks will run on the same worker to access this local file. That's why we consider this an anti-pattern, and each execution in Kestra is by default considered stateless. Only if your tasks explicitly output some results are those results persisted and available to be processed.
[00:39:18] Tobias Macey:
And as you continue to build and iterate on and explore the market for orchestration engines in the data context, what are some of the things you have planned for the near to medium term, or any particular problem areas or projects you're excited to dig into? Yeah. We are really
[00:39:33] Anna Geller:
excited about the feature we will be releasing on December 3rd this year, and this will be apps. This will allow you to build custom applications directly from Kestra. So you can treat your workflows as a backend, and you build custom UIs directly from Kestra. Let's imagine that you have some business stakeholders who want to request some data. They can go to the UI. They can select from the inputs what type of data they want to request. Then your workflow can fetch and process and transform all the data in the way this end stakeholder needs it.
And it can then output this data directly from this custom application. So this eliminates this need, you know, and I think as data engineers, we all know this use case where a stakeholder comes in and asks, could you fetch this data for me? I just need this report. So effectively, they can fully self serve with this approach. Similarly, if you have patterns that need approval, right? Let's say somebody wants to request compute resources. You can fill those inputs in a custom form, then this will go to the manager or to the DevOps engineer who can look at the request.
They can approve it, and then you can maybe see the result. So in the end, those custom applications, I think, will be a feature that will unlock tons of different use cases, and we are very excited about this one. Similarly, since we follow this approach of everything as code, we are building a feature which is custom dashboards. So you can build custom dashboards that visualize how your execution data looks, and you can do that as code. Similarly to how you have workflows as code, you also have your custom dashboards as code, which you can version control. You can track revision history.
This is also another feature that will be launched in December. And long term, in terms of what is on our road map, it's a cloud launch. We need this fully managed service, as I mentioned before, and also some improvements to human in the loop. I think, to accommodate the AI driven world where AI generates some data, you need to have reliable human in the loop processes where a human can approve
[00:42:12] Tobias Macey:
the output generated by AI. So that's also something that we work on even more. Are there any other aspects of the work that you're doing at Kestra or the overall space of UI and code driven orchestration that we didn't discuss yet that you'd like to cover before we close out the show? No. I think we've covered a lot of ground. Thank you so much for inviting me to the show. It's been great. Yeah. I'm very grateful. Well, for anybody who wants to get in touch with you and follow along with the work that you and the rest of the Kestra team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:42:54] Anna Geller:
We briefly mentioned the topic of everything as code. So far, dbt has brought this approach to analytics. We see BI tools catching up, so slowly you can start building dashboards as code, which
[00:43:10] Tobias Macey:
can follow the same engineering practices. I think we are still far away from the world where you can really have everything in the data engineering process managed as code, and I think we should probably close this gap at some point. Alright. Well, thank you very much for taking the time today to join me and share the work that you and the Kestra team are doing on bridging the gap between code and UI driven workflows and expanding beyond data only and ETL only execution. I appreciate the time and energy that you're all putting into that, and I hope you enjoy the rest of your day. Thanks so much. Thank you for listening, and don't forget to check out our other shows.
Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macy. And today, I'm interviewing Anna Geller about incorporating both code and UI driven interfaces for data orchestration. So, Anna, can you start by introducing yourself?
[00:00:26] Anna Geller:
Yes. Of course. I'm Anna Geller. I'm a data engineer and technical writer turned product manager. I worked in many data engineering roles, including, consulting, engineering, and later also DevRel. And currently, I worked as a product lead at Kestra. And, yeah, that's the subject of today's, podcast. And do you remember how you first got started working in data? Yes. So I I think I started working with data during internship at KPMG, processing data for a year end audits. So there there was a lot of Excel spreadsheets and queries to SQL Server. Yeah. That was how I started. I actually also studied, kind of data engineering,
[00:01:14] Tobias Macey:
as as my master. So, yeah. In terms of the scope of this conversation, can you start by giving your definition of what constitutes data orchestration and what is necessary for a system to be able to orchestrate data effectively?
[00:01:31] Anna Geller:
Yeah. So it's it's always, a bit difficult to agree, in the industry on definitions. The way I see data orchestration is that it's, automated coordination of workflow nodes that touch data. This means that, essentially, any workflow nodes that interact with data, whether they produce data or consume data, they all fall into this category. I think one misconception I see, that many people associate the orchestration only with, ETL and analytics. And instead, I think that, we should see it a bit more as a broader concept that, covers how DITA moves, across your entire business.
So I think, every company has their, internal APIs that need to exchange data. You need to react to events, like, sending an email, and maybe update inventory anytime there's a new shipment. You need to process data across ERP, CRM, PLM, all kinds of internal systems, and and you often needs need to do that in real time, rather than in just nightly ETL jobs. Yeah. So I think the distinction is, whether you want to automate workforce for the entire IT departments with multiple teams, environments, internal systems, or whether you just do it for the data team.
[00:02:56] Tobias Macey:
And another aspect of the challenge of trying to really pin down what data orchestration means and what you should use to execute those workflows is that in the technical arena and in organizations, there are numerous different scheduling systems, workflow systems, automation systems, in particular, things like CICD for software delivery. There is a scheduler in Kubernetes and other container orchestrators. There are things like CRON and various other time based scheduling or event based systems such as Kafka or different streaming engines.
And a lot of times, because something already exists within the organizational context, when a new task or requirement comes up, the teams will naturally just reach for what they already have even if it's maybe not necessarily designed for the specific task at hand. And I'm wondering if you can talk to some of the ways that those tendencies can lead to anti patterns and some of the limitations in the approach of using what they already have for data specific workflows.
[00:04:06] Anna Geller:
Yeah. So I believe there is a lot of overlap of functionality between all those CI/CD, scheduling, and orchestration tools. If we think about it, they all have a trigger, right? So for example, when a new pull request is opened or merged, you need to do something. They all have a list of jobs or tasks to run when some event is received. They also all have state, so they are all state machines in the end. If a given step fails, you want to maybe restart the entire run from the failed state. And many CI tools, maybe in the data space we don't realize it, but they also have things like notifications on failure.
They have ways to maybe pause after a build step to validate if the build was correct and to approve or reject a deployment. Right? So there's quite a large overlap, and I think it's quite natural for companies, instead of directly considering a dedicated orchestrator, to first try to use what they have and see if they can expand it to use cases like data workflows, automation of microservices, or automation of business processes. I think the limitations usually show up when you have true dependencies across workflows, across repositories, even across teams and infrastructure.
And also when you start running workflows at scale, because then you just lack visibility. It's kind of the same as with AWS Lambda: when you have tons of those different functions, at some point you are just confused. You have no overview of what the actual health of your platform is. Let's take GitHub Actions as one concrete example. GitHub Actions is great, but the moment you have complex dependencies or custom infrastructure requirements, GitHub Actions starts becoming maybe not the right solution. For example, you want to run this job on ECS Fargate, run this job on Kubernetes, and run this other job on my on-prem machine to connect to my on-prem database to perform some data processing. Then you have patterns like, run this job only after those 4 jobs complete successfully, or run things at scale. And you want to manage concurrency.
You want to manage multiple code bases from multiple different teams. Already, managing secrets across all those repositories, as you would have to do with GitHub Actions, can become a bit painful when you have multiple teams that maybe need to share them. This kind of visibility and governance at scale is where I believe you may want to consider a true orchestrator.
[00:07:01] Tobias Macey:
Another challenge, in the opposite direction, is that teams that do invest in data orchestration will say, again, I already have something for doing orchestration. Why don't I also use that for CI/CD or whatever other task automation I have? And I'm curious what you have seen as some of the challenges in that opposite direction of using a data orchestrator for something that is not a data-driven workflow.
[00:07:25] Anna Geller:
It depends on what we in the end consider a data orchestrator, because many data orchestrators will not be able to perform a task like triggering a CI/CD pipeline to deploy some containers. For example, dbt Cloud. If you consider dbt Cloud to be an orchestrator, you will not be able to start some Terraform apply from dbt. It's obviously not that use case. For Python orchestrators, like Airflow and all the tools in this space, I think it's more feasible, but it can be a bit clunky to orchestrate CI from Python, because mostly in CI what you do is run CLI commands.
If you do it from Airflow, you would need to have some HTTP sensor that listens to some event webhook, maybe after your pull request was merged or something like this. So it would be feasible, but it can be quite clunky and not easy to maintain. In Kestra, we try to make this pattern really easy: you simply add a list of tasks with your CLI commands, then you add a webhook trigger that can react to your pull request event. And then it's very simple. I actually have one quote, I don't know if I should just read it out loud, from one user who is doing CI/CD in Kestra, and he mentioned that it was really refreshing: "It's so simple yet powerfully flexible.
It really does allow you to create pretty much any flow you require. I have been migrating our pipelines from GitHub Actions to Kestra, and it's been so simple to replicate the logic. The ability to mix and match plugins with basic shell scripting or a script from a language is just amazing." So we have some good testimonials that kind of prove that the transition was fairly seamless.
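For readers who want to picture this, here is a minimal sketch of such a CI-style flow in Kestra YAML; the plugin type names, the webhook key property, and all identifiers below are illustrative assumptions and may differ between Kestra versions:

id: ci_on_merge
namespace: company.platform

tasks:
  - id: build_and_test
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - ./scripts/run_tests.sh
      - docker build -t my-app:latest .

triggers:
  - id: pr_merged
    type: io.kestra.plugin.core.trigger.Webhook
    # the key becomes part of the webhook URL that your Git host calls on merge
    key: replace_with_a_secret_key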
[00:09:16] Tobias Macey:
Another element of data orchestration is the way in which it's presented and controlled. There have been a number of generations of data orchestration, each focusing on the specific problems of the overall ecosystem at that time. And one of the main dichotomies that has existed throughout is the question of whether it's largely a UI-driven or low-code approach, where you're dragging and dropping different steps and connecting them up in a DAG, or whether it's largely a code-driven workflow, which also has some degrees of how code-heavy it is: maybe it's a YAML description of what the different tasks are, or maybe it's pure code, where a lot of times that will lock you into a particular language environment.
And I'm wondering what you see as some of the main motivators for those UI versus code driven workflows at the technical and the organizational level.
[00:10:11] Anna Geller:
The main motivation to combine code- and UI-driven approaches is to close the market gap. The way we see the orchestration and automation tool market is that on the one hand you have all those code-only frameworks, often requiring you to build your workflows in Python, JavaScript, or Java. On the other end of the spectrum, you have all those drag-and-drop ETL or automation tools. In both of those categories there are many solutions you can pick from. There are a bunch of Python-style orchestration frameworks and a bunch of no-code, drag-and-drop solutions, but there are very few tools in the middle. This is the gap that Kestra tries to fill, and in general we believe that Kestra is the best among low-code orchestration solutions. And if we make the claim that we are the best, why are we the best? With most tools in this no-code UI space, you would first build something in the UI, and they will create a dump of JSON schema and call it code. What Kestra does differently is that with every new feature, we start with code and API first, and all those UI components come later. As a result, the YAML definition is readable. It has full auto-completion and syntax validation.
You have great UX in terms of built-in documentation, revision history, and Git integration, so that you can iteratively start building everything in the UI. You can then push it to Git when you are ready, and you cover this whole spectrum of having a nice, intuitive UI to iteratively build workflows without compromising the engineering benefits of a framework. To maybe summarize: existing solutions are usually either too rigid, like all the no-code tools, or too difficult, like all the frameworks. To some extent, with Kestra you have all the benefits of a code-based orchestration framework without the complexity of a framework. So you don't have to deploy and package your code. You can just go to the UI.
You quickly edit it, you run it to check if it's working, and you are done in just a few minutes.
[00:12:36] Tobias Macey:
One of the challenges of having a low-code interface, even if there is a code-driven workflow available, is that it imposes necessary constraints to be able to ensure that, even if you do have a code element, you're able to visually represent it for people who are using that UI-driven approach. And a lot of times, I've seen that lock the toolchain into a specific technology stack, where maybe it is UI driven, it will generate code for you, which you can then edit, and it will translate that back to the UI, but only if you're running on Spark or only if you're running on Airflow. And I'm wondering if you can talk to some of the ways that that bi-modality, and the requirement to be able to move between those different interfaces and maintain parity between them, imposes constraints as far as the interfaces, or the workflow descriptions, or the types of tasks or runtime environments that you're able to execute with?
[00:13:32] Anna Geller:
There are no constraints in terms of what you can orchestrate or which technology you want to integrate with. The only constraint is that Kestra has built-in syntax validation, which means that the API doesn't allow you to save the flow if it's invalid. So this is one constraint. There are obviously tons of benefits with this. There are no surprises at runtime, because the flow is validated during its creation, at build time. If you have invalid indentation in your Kestra YAML, Kestra won't let you save that flow. In contrast, we can compare it to how it's handled in Python, because I believe a lot of your audience uses tools like Airflow. With a DAG defined in a Python script, your workflow logic can potentially be more flexible, but a wrong indentation in your Python script will only be detected at runtime. So in the end it's more flexible, but also more fragile. And as with pretty much everything in technology, it comes down to the trade-off between constraints and guarantees that we can offer. With Python, you can have potentially a bit more flexibility in how you define the workflow logic, but at the risk of additional runtime issues if something is incorrect. You also have the downside that you have to actually package and deploy that code. With the benefits of being in YAML, Kestra is a bit more constrained, but it's also portable and self-contained. It's quite painless to deploy.
It's validated at build time, and you can be sure that everything is working. So, yeah, pretty much the only constraint is that you cannot save an invalid flow.
[00:15:27] Tobias Macey:
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. DataFold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafolds today for the details.
So in order to explore a little bit further as far as the constraints and benefits, I think it's also worth discussing what the overall architecture of Kestra is and some of the primitives that it assumes for those orchestration tasks. And then we can dig more into some of the ways that you're able to use those primitives to build the specific logic that you need. So if you can just give a bit of an overview about how Kestra is implemented and some of the assumptions that it has about the level of granularity of the tasks and the types of inputs and outputs that it supports.
[00:16:37] Anna Geller:
Yes. So maybe let's start with the architecture. Kestra started with an architecture that relies on Kafka and Elasticsearch, and it was really great in terms of scalability, with no single point of failure. But at the same time, it made it more difficult for new users to get started with the product and to explore it. Many listeners probably know that maintaining Kafka in production can be difficult. So that's why Kestra added a JDBC backend to the architecture in the open source version. This means that you can use Postgres, MySQL, SQL Server, or H2 as your database. On top of that, you have the typical server components you can expect from an orchestration tool: the executor, scheduler, web server, and workers. All of those components can be scaled independently of each other because all of them are kind of like microservices.
If you need more schedulers or more executors, you can just increase the number of replicas in your Kubernetes deployment and everything just works. So that is the architecture from the DevOps and backend perspective. In terms of user experience, Kestra relies heavily on the API. We are an API-first product. This is not an orchestration framework where you would just define your code, run it locally, then deploy the code. Instead, everything interacts through the API. So the tasks and triggers you can use are restricted by the plugins that you have in your Kestra instance. You can have as many plugins as you want. By default, Kestra comes prepackaged with all plugins, so you don't need to install anything. This is kind of the main benefit you already get with an orchestration platform like Kestra: there's no need to pip install every dependency that you need to use all those different integrations.
Everything is prepackaged by default. And if you need a bit more flexibility, you can cherry-pick which plugins are included. So let's say you are an AWS shop; you don't use Azure or GCP, and you don't want those extra plugins for those other cloud vendors. You simply don't include them in your plugins directory in Kestra, and you just cherry-pick the plugins that you need. On top of that, you can build your custom plugins. The entire process is fairly easy. You have a template repository that you can simply fork and build your code on top of. Then you build your JAR file, include it in the plugins directory, and you have your custom plugin. Then, in terms of governance on top of this, as a Kestra administrator you can set plugin defaults for each of those plugins that you added, to, for example, ensure that everybody is using the same AWS credentials.
Or if you want to globally enforce some pattern that everybody should use, say a particular way of setting those properties, you can enforce them globally using plugin defaults. This pluggable infrastructure has some constraints, in the sense that if you don't have a plugin for something, you will not be able to use it. But the benefit is that you have a lot of governance, and it scales really well with more plugins that you can always add. We also have the possibility to create custom script tasks. So if some plugin is missing and you don't want to touch Java to build a custom plugin, you can do that, for example, in Python, R, or Node.js.
You can write your custom script, and you can just run it as a container. That's how Kestra can support all those different kinds of integrations.
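As a rough illustration of the plugin defaults idea, the snippet below assumes a pluginDefaults block and a secret() function of the kind Kestra documents; the exact property names, plugin types, and bucket are placeholders rather than a verified configuration:

id: aws_governed_flow
namespace: company.data

pluginDefaults:
  # applied to every task whose type starts with this prefix
  - type: io.kestra.plugin.aws
    values:
      accessKeyId: "{{ secret('AWS_ACCESS_KEY_ID') }}"
      secretKeyId: "{{ secret('AWS_SECRET_ACCESS_KEY') }}"
      region: eu-west-1

tasks:
  - id: list_exports
    type: io.kestra.plugin.aws.s3.List
    bucket: my-company-exports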
[00:20:28] Tobias Macey:
And so in terms of the level of granularity of the tasks or the data assets that you're operating over, what are the assumptions of Kestra as far as the, I guess, scale of data, the types of inputs and outputs, and in particular the level of detail that you're able to get to as far as what a given task or plugin is going to execute and how that passes off to the next task or plugin?
[00:20:59] Anna Geller:
That's mostly coordinated through inputs and outputs. Each workflow can have as many inputs as you want, and all inputs are strongly typed. So you can say, okay, this input is a boolean, this input should be an integer, and this input is a select, so you can only pick a value from a drop-down. Maybe an input is a multi-select, so you can only choose among predefined values. You can have JSON, URL, all kinds of different inputs, and that's already a benefit: because they are strongly typed, the end user, who may not be as technical, will know what values they can provide to the workflow. Then the communication between tasks, to pass data between each other, mostly operates in terms of metadata and internal storage.
If you want to pass some data objects directly, you can do that, if your plugin specifies that some data should be output. Similarly, you also have input files and output files for script tasks. So you need to explicitly declare that, let's say, this Python task should output those two files, or maybe all JSON files, and then they will be captured and automatically persisted in Kestra's internal storage. You can think of internal storage as an S3 bucket; it can be S3, GCS, etc., or just local storage. People familiar with Airflow can think of internal storage as Airflow's XComs without the complexity of having to do XCom push and pull. So that's how tasks can pass data between each other, and you can even pass data across workflows. I think this is huge for governance. We have many users who use, for example, subflows to compose their workflows in a more modular way, so that you can have one parent flow that triggers multiple processes, and each of them is encapsulated in a subflow.
And the subflows can output some data as well, and they can pass it between each other, so that you have this way of exchanging data between different teams and different projects, without having to hard-code any dependencies and without having to rely on files implicitly stored somewhere locally.
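To make the typed inputs and subflow discussion concrete, a hedged sketch follows; the input types, the Subflow task type, and the expression syntax are assumptions drawn from Kestra's general conventions and may not match every version:

id: typed_inputs_demo
namespace: company.data

inputs:
  - id: environment
    type: SELECT
    values: [dev, staging, prod]
  - id: full_refresh
    type: BOOLEAN
    defaults: false
  - id: batch_size
    type: INT
    defaults: 500

tasks:
  - id: extract
    type: io.kestra.plugin.core.log.Log
    message: "Extracting {{ inputs.batch_size }} rows in {{ inputs.environment }}"

  - id: transform_and_load
    type: io.kestra.plugin.core.flow.Subflow
    namespace: company.data
    flowId: transform_and_load
    # values passed here become the child flow's strongly typed inputs
    inputs:
      environment: "{{ inputs.environment }}"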
[00:23:14] Tobias Macey:
Another trend that's been growing in the data orchestration space is the idea of, rather than data as tasks, treating data as assets, where one task might produce multiple assets. The canonical example is largely dbt, where you might have one dbt command-line execution that produces tens or hundreds of different tables as an output, and you want to be able to track those independently, particularly if there are downstream triggers that depend on one of those tables being updated or materialized. And I'm wondering how Kestra addresses, or some of the ways that Kestra is thinking about, that level of granularity in terms of a task producing multiple different outputs or assets as a result.
[00:24:02] Anna Geller:
Yeah. That's totally feasible. Each task can output as many results as you want. Maybe I wouldn't recommend outputting, say, 1,000 files, because the UI could potentially break, but overall you can output as many things as you wish, and Kestra doesn't introduce any restrictions in terms of what your specific outputs can be. There is one really great feature that people really appreciate in Kestra, which is outputs preview. If your task run returns, say, a CSV file or a JSON file, you can easily preview it in the UI, so that you know if the data format is right and everything looks good. In the same way, if something fails, you can preview the data and see, okay, maybe what I have in this downstream task is some error in my code, maybe I didn't capture some edge cases. You can redeploy your workflow, so essentially you create a new revision, and you can rerun it for only this last task. This is a feature called replay, and it's super useful for failure scenarios.
If you process data and something unexpected happens, you don't want to rerun all those previous steps, right? Because everything else worked; only this single thing didn't. So you can very easily reprocess the things that didn't work, simply by fixing the code and pointing the execution to the updated revision.
[00:25:28] Tobias Macey:
In terms of the audience that you're targeting, given the fact that it has this UI- and code-driven approach, I'm wondering how you think about who the target market is, the types of users, and some of the ways that that dual modality appeals to different team or technical boundaries across the organization.
[00:25:48] Anna Geller:
Yeah, that's a great question. Our target audience currently is mostly engineers who build internal platforms. Usually you would build some workflow patterns, and you want to expose some workflows to less technical users or to external stakeholders. We have lots of software architects coming to Kestra to support them in replatforming. This usually means they want to move from on-prem to cloud, or there's also the completely reversed pattern: there are many companies these days moving from cloud back to on-prem because of additional compliance reasons. So a lot of people using Kestra are those platform builders who then expose those workflows to less technical users for a variety of use cases. Kestra is not focused exclusively on data pipelines. We also support infrastructure and API automation and business process orchestration.
You have things like approval workflows. One very common scenario is that there are some IT automation tasks that, for example, provision resources, and some DevOps architect or manager needs to approve whether those resources can be deployed. So you have this approval process implemented in Kestra so that the right person can approve the workflow to continue. We also have all those event-driven data processing use cases, where you receive events, for example, from Kafka, SQS, or Google Pub/Sub, and you want to trigger some microservice in response to that event.
That's also a perfect use case for Kestra. So it's not restricted to data pipelines. And I would still call this data orchestration, because you react to some data changes in the business, and you want to run some data processing in response.
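A hedged sketch of that approval pattern, assuming a Pause-style core task that a manager resumes from the UI before provisioning continues; the type names and the Terraform command are illustrative only:

id: provision_with_approval
namespace: company.it

inputs:
  - id: instance_type
    type: SELECT
    values: [t3.medium, m5.large, m5.xlarge]

tasks:
  - id: wait_for_approval
    type: io.kestra.plugin.core.flow.Pause
    # the execution stays paused here until someone resumes it from the UI or the API

  - id: provision
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - terraform apply -auto-approve -var "instance_type={{ inputs.instance_type }}"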
[00:27:36] Tobias Macey:
As a listener of the Data Engineering Podcast, you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us, you should listen to Data Citizens Dialogues, the forward-thinking podcast from the folks at Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. They address questions around AI governance, data sharing, and working at global scale, among others. In particular, I appreciate the ability to hear about the challenges that enterprise-scale businesses are tackling in this fast-moving field.
While data is shaping our world, Data Citizens Dialogues is shaping the conversation. Subscribe to Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts. At the organizational level too, I'm interested in some of the ways that Kestra is able to implicitly bridge these different workflows without the different teams needing to know every detail of what the available data is and how it's produced. Where, for instance, I have a workflow that's taking a file from an SFTP server, processing it, and generating some table in the data warehouse as a result, and then somebody else's workflow depends on the contents of that data warehouse table, does some analysis, and produces some data visualization or report generation, how are you able to trigger execution of the next task without the person who controls each task having to explicitly communicate, or requiring that the workflows are directly built on top of each other?
[00:29:19] Anna Geller:
Kestra supports that pattern using a flow trigger. This was in fact one of the most popular patterns from the very beginning of the product. The use case typically looks as follows: you have multiple teams that don't have tight dependencies between each other, so you would say, run this flow only when those three other workflows from different teams have successfully completed within the last 24 hours. You can easily define that as a condition for this flow, so that it only runs after those preconditions are true. And you can additionally add conditions on the actual data. So you can say, only if this data returned maybe a 200 status code, or if this data has this number of rows, do something, like trigger this workflow. Kestra doesn't introduce any new concept for this; we already have the concept of triggers.
So implementing those kinds of patterns is a matter of explicitly declaring in your YAML what the expectations are to trigger this workflow. You can explicitly list all of those flow executions that should be completed within the given time frame, and then it will run. So I think the mindset is quite similar to how many other data orchestrators do that, but without restricting it to only being about data.
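As an illustration of the flow trigger just described, the sketch below shows a flow that fires only after one upstream flow from another team succeeds; the trigger and condition type names are assumptions, and newer Kestra versions express the multiple-flows-within-a-time-window case with dedicated preconditions:

id: downstream_report
namespace: team.analytics

tasks:
  - id: build_report
    type: io.kestra.plugin.core.log.Log
    message: "Upstream flows finished, building the report"

triggers:
  - id: after_upstream_flows
    type: io.kestra.plugin.core.trigger.Flow
    conditions:
      # fire only for executions of this upstream flow that ended in SUCCESS
      - type: io.kestra.plugin.core.condition.ExecutionFlow
        namespace: team.ingestion
        flowId: load_orders
      - type: io.kestra.plugin.core.condition.ExecutionStatus
        in:
          - SUCCESS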
[00:30:44] Tobias Macey:
Another aspect of what you're building with Kestra is the fact that it is an open-core product with a paid service available on top of it. And I'm wondering if you can talk to some of the ways that you think about what the elements are that are available as open source, what the things are that are paid, and some of the ways that you think about the audiences across those two and how you are working to keep the open source elements of it sustainable and active.
[00:31:17] Anna Geller:
That's the challenge every open-core company asks themselves every day, I'm pretty sure. We have this framework: all features that are about security, scalability, and governance go into the enterprise edition, and all features that are single-player, core orchestration capabilities go into the open source version. That's how we try to balance it. I believe there is no single answer; every company tries to find the best solution. What we have found out so far is that we have some prospects, some people coming to Kestra, who would prefer to have a fully managed service. Currently, Kestra doesn't offer that; we have open source and a self-hostable enterprise solution. So that's something we'll be working on next year. It will be a big priority, especially to enable even more people to try the product, see how it's working, including trying those paid enterprise-grade features, without having to first talk to sales and start an official POC.
[00:32:22] Tobias Macey:
And as you have been building and working with Kestra, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:32:29] Anna Geller:
Yeah. Some of the interesting ones: we have one solopreneur who was automating their entire business with Kestra, including payment automation and categorizing customer support tickets using OpenAI. So a super interesting use case, and great to see that Kestra can be applied for solopreneurs. As for surprising and unexpected, I would have expected more people to write custom code. What we have found out is that there are many, many users who purely use our plugins. If they need some transformations, they will often just add custom Pebble expressions. This is similar to Jinja in Python: they transform some data on the fly without writing dedicated code, like extra Python functions. So, yeah, I was frankly a bit surprised. Sometimes it seemed to me personally easier to write custom code for this kind of thing, but I see users just prefer to keep things simple: just a simple transformation expression, and move to the next task. I was also a bit surprised how many users actually leverage the low-code aspect of Kestra. Our default UI is the code interface, so you need to write your workflow. We have beautiful auto-completion and syntax validation when you type things in the UI. But many users still explicitly opt in to the topology view and just add things from the low-code UI forms. So that's one aspect which was also surprising to me. And overall, I think it's always surprising to see how broad the spectrum of users coming to us is. We have some who, as I mentioned, just prefer to keep things simple and only use our plugins. And there are other people who just write custom code for everything, so every task is maybe a Ruby or JavaScript or Python task. So the spectrum is really wide, and it's really interesting to see this.
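For readers unfamiliar with Pebble, a small hedged example of what such an inline expression might look like inside a task; the upper and date filters are common Pebble filters, but availability can vary by version:

id: pebble_inline_transform
namespace: company.data

inputs:
  - id: customer_name
    type: STRING
    defaults: acme corp

tasks:
  - id: log_transformed
    type: io.kestra.plugin.core.log.Log
    # the templating engine transforms values inline, without a dedicated Python task
    message: "{{ inputs.customer_name | upper }} processed on {{ execution.startDate | date('yyyy-MM-dd') }}"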
[00:34:31] Tobias Macey:
Another aspect that I forgot to touch on earlier is given that Kestra is by default a platform service, what does the local development experience look like for people who are maybe iterating on a script or trying to test out a workflow before they push it to their production environment?
[00:34:49] Anna Geller:
Yeah. I believe the local development experience is really great. We have feedback from one user who mentioned that writing workflows in Kestra is fun, which is unheard of in the world of orchestration, that building workflows can be fun. So, essentially, to get started, you run a single Docker container. You open the UI, and you hit a single button to create a flow. From here, you add your ID for the flow, the namespace to which it belongs, the list of tasks that you want to orchestrate, and the triggers, so whether this should run on a schedule or based on some event, like when a new file arrives, etc. And then when you start typing your tasks, you get auto-completion and built-in documentation.
You also have blueprints that will guide you through examples of how to apply some usage pattern. So I believe the local development experience is really unique to Kestra. And as I mentioned, some users even consider it fun, which is very refreshing.
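To ground those getting-started steps, here is the kind of minimal flow you might type into the editor after launching Kestra locally, with an ID, a namespace, one task, and a schedule trigger; the type names are illustrative and may vary by version:

id: hello_world
namespace: company.sandbox

tasks:
  - id: say_hello
    type: io.kestra.plugin.core.log.Log
    message: Hello from a locally running Kestra

triggers:
  - id: every_morning
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 * * *"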
[00:35:48] Tobias Macey:
And in your own work of building and using and communicating about Kestra, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:36:09] Anna Geller:
One of the most challenging or interesting lessons we've learned is that following the common VC advice of, you know, start with a niche, then land and expand, didn't work for Kestra as well as we would have wished. At first, Kestra targeted mostly the data engineering for analytics use case. Over time, we expanded to operational teams, and we focus on engineers who are orchestrating custom applications, reacting to events, building scheduled backup jobs, or building infrastructure with Kestra. And this is where the adoption really took off. So the lesson learned is that you don't always need to follow the VC advice; sometimes following your own vision can be better. Then, in terms of product building, one lesson learned is that we were trying to use the VS Code editor within Kestra. Within one release, we launched an embedded VS Code editor in the UI. Over time, we found it was really difficult, and in the end it was much easier to build our own custom editor than to keep maintaining the one from VS Code, because you have so little control over how everything looks and how the VS Code extension and the UI interact. So I think this was something that was surprising. We thought it would be easier. We also thought that VS Code would be more open and not as restricted.
So if you want to, for example, use GitHub Copilot in your product, you cannot do that. It's really restricted to Microsoft only.
[00:37:34] Tobias Macey:
And for individuals or teams who are evaluating orchestration engines and trying to decide what fits best into their stack, what are the cases where Kestra is the wrong choice?
[00:37:48] Anna Geller:
Yeah. So Kestra is the wrong choice if you build stateful workflows that implicitly depend on side effects produced by other tasks or by other workflows. To give you an example, let's say you have one Python function that writes data to a local file, and there is another task in another workflow that tries to read this local file. Technically speaking, if you use the worker group feature in Kestra, you could make this work, but we consider this implicitly stateful approach a bad practice. We prefer that you declaratively configure that this task outputs a file, and this file will then be persisted in internal storage. Then it can be accessed transparently by other tasks or even by other flows.
In general, we try to bring infrastructure-as-code best practices to all workflows. So we assume that your local development environment should be the same as what you end up running in production. In prod, you usually run things in a distributed fashion, so you cannot guarantee that those two tasks will run on the same worker to access that local file. That's why we consider this an anti-pattern, and each execution in Kestra is by default considered stateless. Only if your tasks explicitly output some results are those results persisted and available to be processed.
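A rough sketch of the explicitly declared alternative described here: the first task declares an output file that is persisted to internal storage, and the second task pulls it back in through a declared input file instead of a shared local path; property names such as outputFiles and inputFiles are assumptions based on Kestra's script task conventions:

id: explicit_outputs
namespace: company.data

tasks:
  - id: producer
    type: io.kestra.plugin.scripts.python.Script
    # files declared here are uploaded to internal storage when the task finishes
    outputFiles:
      - data.csv
    script: |
      with open("data.csv", "w") as f:
          f.write("id,value\n1,42\n")

  - id: consumer
    type: io.kestra.plugin.scripts.python.Script
    # the declared input file is fetched from internal storage, not from a shared local disk
    inputFiles:
      data.csv: "{{ outputs.producer.outputFiles['data.csv'] }}"
    script: |
      with open("data.csv") as f:
          print(f.read())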
[00:39:18] Tobias Macey:
And as you continue to build and iterate on and explore the market for orchestration engines in the data context, what are some of the things you have planned for the near to medium term, or any particular problem areas or projects you're excited to dig into?
[00:39:33] Anna Geller:
Yeah. We are really excited about the feature we will be releasing on December 3rd this year, and this will be Apps. This will allow you to build custom applications directly from Kestra. So you can treat your workflows as a backend, and you build custom UIs directly from Kestra. Let's imagine that you have some business stakeholders who want to request some data. They can go to the UI, they can select from the inputs what type of data they want to request, and then your workflow can fetch, process, and transform all the data in the way this end stakeholder needs it.
It can then output this data directly from this custom application. This eliminates the need that, I think, as data engineers we all know: the use case where a stakeholder comes in and asks, could you fetch this data for me? I just need this report. So effectively they can fully self-serve with this approach. Similarly, if you have patterns that need approval. Let's say somebody wants to request compute resources. You can fill in those inputs in a custom form, then this will go to the manager or to the DevOps engineer who can look at the request.
They can approve it, and then you can see the result. So in the end, those custom applications, I think, will be a feature that unlocks tons of different use cases, and we are very excited about this one. Similarly, since we follow this approach of everything as code, we are building a feature which is custom dashboards. You can build custom dashboards that visualize how your execution data should look, and you can do that as code. So similarly to how you have workflows as YAML, you also have your custom dashboards as YAML, which you can version control and track revision history for.
This is another feature that will be launched in December. And long term, in terms of what is on our roadmap, it's the cloud launch. We need this fully managed service, as I mentioned before, and also some improvements to human-in-the-loop. I think, to accommodate an AI-driven world where AI generates some data, you need to have reliable human-in-the-loop processes where a human can approve the output generated by AI. So that's also something that we will work on even more.
[00:42:12] Tobias Macey:
Are there any other aspects of the work that you're doing at Kestra, or the overall space of UI- and code-driven orchestration, that we didn't discuss yet that you'd like to cover before we close out the show? No, I think we've covered a lot of ground. Thank you so much for inviting me to the show. It's been great. Yeah, I'm very grateful. Well, for anybody who wants to get in touch with you and follow along with the work that you and the rest of the Kestra team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:42:54] Anna Geller:
We briefly mentioned the topic of everything as code. So far, dbt has brought this approach to analytics, and we see BI tools catching up, so slowly you can start building dashboards as code, which can follow the same engineering practices. I think we are still far away from a world where you can really have everything in the data engineering process managed as code, and I think we should probably close this gap at some point.
[00:43:10] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you and the Kestra team are doing on bridging the gap between code- and UI-driven workflows and expanding beyond data-only and ETL-only execution. I appreciate the time and energy that you're all putting into that, and I hope you enjoy the rest of your day. Thanks so much. Thank you for listening, and don't forget to check out our other shows.
Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Defining Data Orchestration
Challenges in Workflow Systems
UI vs Code Driven Workflows
Kestra's Architecture and Features
Target Audience and Use Cases
Open Source and Enterprise Strategy
Future Plans and Innovations