Summary
Machine learning has become a meaningful target for data applications, bringing with it an increase in the complexity of orchestrating the entire data flow. Flyte is a project that was started at Lyft to address their internal needs for machine learning, built to integrate closely with Kubernetes as the execution manager. In this episode Ketan Umare and Haytham Abuelfutuh share the story of the Flyte project and how their work at Union is focused on supporting and scaling the code and community that have made Flyte successful.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data lake architectures provide the best combination of massive scalability and cost reduction, but they aren’t always the most performant option. That’s why Kyligence has built on top of the leading open source OLAP engine for data lakes, Apache Kylin. With their AI augmented engine they detect patterns from your critical queries, automatically build data marts with optimized table structures, and provide a unified SQL interface across your lake, cubes, and indexes. Their cost-based query router will give you interactive speeds across petabyte scale data sets for BI dashboards and ad-hoc data exploration. Stop struggling to speed up your data lake. Get started with Kyligence today at dataengineeringpodcast.com/kyligence
- Your host is Tobias Macey and today I’m interviewing Ketan Umare and Haytham Abuelfutuh about Flyte, the open source and Kubernetes-native orchestration engine for your data systems
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Flyte is and the story behind it?
- What was missing in the ecosystem of available tools that made it necessary/worthwhile to create Flyte?
- Workflow orchestrators have been around for several years and have gone through a number of generational shifts. How would you characterize Flyte’s position in the ecosystem?
- What do you see as the closest alternatives?
- What are the core differentiators that might lead someone to choose Flyte over e.g. Airflow/Prefect/Dagster?
- What are the core primitives that Flyte exposes for building up complex workflows?
- Machine learning use cases have been a core focus since the project’s inception. What are some of the ways that that manifests in the design and feature set?
- Can you describe the architecture of Flyte?
- How have the design and goals of the platform changed/evolved since you first started working on it?
- What are the changes in the data ecosystem that have had the most substantial impact on the Flyte project? (e.g. roadmap, integrations, pushing people toward adoption, etc.)
- What is the process for setting up a Flyte deployment?
- What are the user personas that you prioritize in the design and feature development for Flyte?
- What is the workflow for someone building a new pipeline in Flyte?
- What are the patterns that you and the community have established to encourage discovery and reuse of granular task definitions?
- Beyond code reuse, how can teams scale usage of Flyte at the company/organization level?
- What are the affordances that you have created to facilitate local development and testing of workflows while ensuring a smooth transition to production?
- What are the patterns that are available for CI/CD of workflows using Flyte?
- How have you approached the design of data contracts/type definitions to provide a consistent/portable API for defining inter-task dependencies across languages?
- What are the available interfaces for extending Flyte and building integrations with other components across the data ecosystem?
- Data orchestration engines are a natural point for generating and taking advantage of rich metadata. How do you manage creation and propagation of metadata within and across the framework boundaries?
- Last year you founded Union to offer a managed version of Flyte. What are the features that you are offering beyond what is available in the open source?
- What are the opportunities that you see for the Flyte ecosystem with a corporate entity to invest in expanding adoption?
- What are the most interesting, innovative, or unexpected ways that you have seen Flyte used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Flyte?
- When is Flyte the wrong choice?
- What do you have planned for the future of Flyte?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
- Flyte
- Union.ai
- Kubeflow
- Airflow
- AWS Step Functions
- Protocol Buffers
- XGBoost
- MLflow
- Dagster
- Prefect
- Arrow
- Parquet
- Metaflow
- Pytorch
- dbt
- FastAPI
- Python Type Annotations
- Modin
- Monad
- Datahub
- OpenMetadata
- Hudi
- Iceberg
- Great Expectations
- Pandera
- UnionML
- Weights and Biases
- Whylogs
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Kyligence: ![Kyligence](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/krLMJxWU.png) Kyligence was founded in 2016 by the original creators of Apache Kylin™, the leading open source OLAP for Big Data. Kyligence offers an Intelligent OLAP Platform to simplify multidimensional analytics for cloud data lake. Its AI-augmented engine detects patterns from most frequently asked business queries, builds governed data marts automatically and brings metrics accountability on the data lake to optimize data pipeline and avoid excessive number of tables. It provides a unified SQL interface between the cloud object store, cubes, indexes and underlying data sources with a cost-based smart query router for business intelligence, ad-hoc analytics and data services at PB-scale. Kyligence is trusted by global leaders in financial services, manufacturing and retail industries including UBS, China Construction Bank, China Merchants Bank, Pingan Bank, MetLife, Costa and Appzen. With technology partnership with Microsoft, Amazon, Tableau and Huawei, Kyligence is on a mission to simplify and govern data lakes to be productive for critical business analytics and data services. Kyligence is dual headquartered in San Jose, CA, United States and Shanghai, China, and is backed by leading investors including Redpoint Ventures, Cisco, Broadband Capital, Shunwei Capital, Eight Roads Ventures, Coatue Management, SPDB International, CICC, Gopher Assets, Guofang Capital, ASG, Jumbo Sheen Fund, and Puxin Capital. Go to [dataengineeringpodcast.com/kyligence](https://www.dataengineeringpodcast.com/kyligence) today to find out more.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer friendly data catalog for the modern data stack. Open source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy to consume SaaS product, which has been adopted by several companies. Sign up for the SaaS today at dataengineeringpodcast.com/acryl. That's A-C-R-Y-L. Your host is Tobias Macey. And today, I'm interviewing Ketan Umare and Haytham Abuelfutuh about Flyte, the open source and Kubernetes native orchestration engine for your data systems. So, Ketan, can you start by introducing yourself?
[00:01:39] Unknown:
My name is Ketan. I'm a software engineer. I still write code, but I'm also an accidental CEO and cofounder at Union. I started Flyte probably about 5 years ago at Lyft. And before that, I worked at multiple different companies that I, you know, gave my blood and sweat to. Along the way there was Amazon, and then Oracle, Lyft, and most recently Union.
[00:02:05] Unknown:
And, Haytham, how about yourself?
[00:02:07] Unknown:
So I started my journey at Microsoft building enterprise software, then at Google afterwards, and Lyft is where I met my awesome cofounders here at Union. We were building Flyte and open sourced Flyte, and then moved on to build Union, leveraging what we built in Flyte into a commercial offering. I'm the CTO at Union, and I love to code and continue to do so.
[00:02:34] Unknown:
Alright. And going back to you, Ketan, do you remember how you first got introduced to the area of data management?
[00:02:39] Unknown:
Actually, when I was thinking about data management just before this, I realized we've always been doing data management as software engineers. Right? But, specifically, in the case of what we are now calling data engineering and data management, it was really only at Lyft where I started building things that were specifically for machine learning. I was responsible for a team that built the models that powered ETAs for Lyft. And that team had an interesting challenge. It had lots of data. And with lots of data, it also wanted to build a lot of machine learning models to power the app. And that is where, I guess, I got introduced to modern data management.
But, otherwise, I've typically led teams doing block storage, and in some way that's managing, you know, data on your cloud file systems. And before that, I worked in mapping, as I said. Mapping was probably 1 of the original big data problems. I don't wanna say the original big data, because some insurance people told me insurance was the original big data, which I think makes sense. But mapping is really a big data problem, which is unstructured and has lots of interesting problems. And then, you know, I also worked in high frequency trading. So there's been data all around, but in the current sense of the word, I guess it was at Lyft when I was working on machine learning. And, Haytham, how did you get started in the area of data?
[00:04:07] Unknown:
A lot less exciting than that story, I think. But, yeah, data is everywhere. Any application you write is probably manipulating data or creating data in some form. But I wasn't really introduced to big data management, in the current, more modern sense of the word, until I joined the Flyte team, and it was mostly about building the infrastructure that others use to do that, rather than doing it myself. And I certainly enjoyed doing that.
[00:04:39] Unknown:
And so now that brings us to the topic of Flyte, and I'm wondering if you can just start by describing a bit about what it is and some of the story behind how it came to be, and why you decided that it was a good idea to write a new workflow engine.
[00:04:53] Unknown:
Yeah. It's never a good idea to write a new workflow engine, by the way. But I had done that a couple of times in the past. That said, I had joined Lyft maybe about a year earlier, and I did not want to do infrastructure. I was specifically working on a problem, and the problem that I was working on, as I said previously, was ETAs. The ETA is, when you open up the app, it tells you the time it takes for a car to, you know, come to you. And when you're about to sit in the car, it tells you the time to the destination. It's ETA and ETD in technical terms. These are 2 numbers that both look like time, but they are so different from each other. ETAs are usually short. ETDs are usually long. And these 2 numbers help drive a lot of, you know, top line and bottom line revenues for Lyft, so they were extremely critical. The team that I inherited, or started leading, had to deliver these models on a consistent basis.
And we were actually working on trying to deliver a new set of models based on analyzing the traffic in the real world. And, you know, traffic changes constantly. So we wanted to update those models almost every 10 minutes, which in 2016 was almost unheard of: that you are training a model and deploying it into production, then training another model and deploying it to production. If something goes wrong, then you're rolling back to a previous model, and so on. Right? And so, just coming from a pure software engineering world and, you know, cloud systems, I was like, there is a way to do this. We've kind of built these DevOps systems to deploy software reliably.
Surely there must be a way to deliver models reliably. And the other thing, during that time, we did not only want to consistently deploy it. We also wanted to, you know, retrain and understand what's happening in production. So we looked at the existing set of tools. And here's the story: when I joined the team, there was 1 engineer who was actually running all these models locally on his laptop and, you know, triggering functions in the remote system. And one time, he created a runaway bill of $1,000,000 just by forgetting to shut something down. And we realized that we wanted to actually do this in a better way. We wanted to manage how we build these models, and ensure we cannot create runaway bills of $1,000,000 every time we run something.
So we wanted to bring structure around it. And I'm like, hey, from my past experience, I know there are engines that do this sort of stuff, because it was not a single step process; it had multiple steps and so on. And I knew there were hosted solution offerings, but I said, let's see what's in open source and what's available at Lyft. Some folks at Lyft were already using Airflow, and it seemed to be the best solution out there in open source. It was surprising to me, because it did not look like the workflow engines that I had worked on in the past. But it was interesting. It was very easy to hack. It was written in Python. So I took it over the weekend, modified it, and we got it running. We delivered, with blood and sweat, really with a lot of tears, something that worked to solve that problem. But it did not solve many other problems. The moment you solve 1 problem, other problems creep up, and you're like, okay, now how do I do this? Back testing of a model and, like, scaling and so on.
And moreover, we actually gave a small talk within the company that, hey, we did this. Every 10 minutes we're deploying a new model, and people were amazed. They said, we wanna do this in different teams. And that led us to thinking, hey, by the way, what I built is not a platform, so please don't use it. And that led me to actually write a paper that described what the platform could be like if I took a first principles approach, which nobody else decided to build. So I was forced to build it, almost just to get people off my back at some point. It took, like, a month or something to build, and within about 6 months of building it, it really took off. We used Step Functions, Batch, and a couple other pieces, whatever we got from the, you know, kitchen sink, literally, and put it together, and we got it running.
And almost 15 to 20 teams started using it in no time. These were all critical teams. Right? From pricing, to driver engagement, to growth, to targeting, to fraud, all kinds of teams within the company. And I was like, what is happening? And so we realized that there is something here. And while we were doing this, we were learning what we were doing wrong, what this new architecture should look like, and so on. And then other companies started approaching us, asking us, like, well, what are you guys doing? And so we explained. We had kind of stumbled, weirdly, upon a system that should have existed, but we think these are the set of things that people need in machine learning. And that led us to talk to a bunch of teams, including Google with Kubeflow and so on. And then around 2018 or 2019, we started rewriting it using Kubernetes, learning from everything that we had done for the last 2 or 3 years. And then in 2020 January, we decided to open source it, and then, you know, the pandemic happened. But that's the story of how we came up with open source Flyte.
[00:10:02] Unknown:
Workflow orchestration is something that's been around in some form or another for many years, and specific to data workflows for maybe on the order of a decade or so, with some of the earlier ones being things like Yarn or Oozie for the Hadoop ecosystem. And a lot of people will maybe orient their kind of baseline around the Airflow project, because that was the first major breakout workflow orchestration engine for data pipelines. And you mentioned that machine learning is 1 of the core capabilities that you were trying to support when you were building Flyte, which is definitely something that I think everybody can agree Airflow is not optimally suited for. Like, you can do it, but you have to do a lot of hacking to make it work well. I'm wondering if you can just talk through the current landscape of workflow engines, how you view it, and how you would characterize Flyte's current position. And maybe juxtapose that with the situation when you first started Flyte, and how you've seen the overall ecosystem evolve from Airflow being kind of the de facto engine that everybody's going to use, to where we are today, where there's a lot more choice and granularity in terms of how you make a decision about which workflow engine you actually want to invest in for being kind of the life force of your data platform?
[00:11:23] Unknown:
Oh, fantastic question. A very loaded question, by the way, so I have to be careful. So, yeah, we did start with Airflow, and then we quickly learned that you need to have Python as a first class citizen within the ecosystem. And then, as I explained in the story, we started using Step Functions, because we did not wanna write an entire workflow engine. Writing workflow engines from scratch is a hard problem. The reason why it's hard is that there is a durability and reliability property that you need to have with workflow engines. And I knew that because I had written a workflow engine in the past. I was like, this is a hard problem, and we didn't wanna solve that problem ourselves. So we wrote a Python library on top of Step Functions to get those things solved. But what we realized is that the problems were really different. I did not know this at that time. I actually went at it with the gusto of a software engineer to solve this problem.
Now when I step back, or maybe over the last couple of years, I've developed this hypothesis of why it's different. Right? I would be lying if I said I knew all this when I did it. I did not. Right? I'll give you an analogy here. If I tell you that there is a new database technology that comes out today and you start using it, you're like, oh, this is crappy. But, you know, the team that's behind it is awesome, and they have 2 years of money to keep on working on it. 2 years from now, if I ask you the question, would the database be better? Mostly, your answer would be, yeah, it's gonna be better. You know, bugs get fixed, scalability gets solved, problems get solved. But now if I tell you there's a model that gets built today that's 98% accurate, 2 years from now, maybe 2 months from now, will this model be as accurate? Probably not. If it's trained pre COVID and then COVID happens, maybe it's gonna be 0% accurate. What's happening is there's this dichotomy in the way software is built versus how models and machine learning products are built. That got us to thinking, okay, so then the tooling itself for building these models and deploying them to production has to be different.
Like, the number 1 problem in data software is that you actually kind of, like, create this entire pipeline of when you build what, how you scale, how you deploy. It can take months, with an incremental approach to delivering small amounts of value. But in machine learning products and data products, that's not the case. You cannot do the small thing. You have to go big, and you have to go really quick at it. Because if you don't, you'll probably hit something like COVID and then, boom, your models are gone. And then many times you don't even know if your model's gonna make sense. You get a model, you deploy it to production, and then maybe you find out in an A/B test that it doesn't really work. It looked as if it worked, but it doesn't really work. Right? And so you need this constant iteration and instant productionization story. We think that's a different sort of tooling that you need.
So even though we have a workflow engine, the problem we are solving is not workflow orchestration. The problem we are solving is this machine learning tooling ecosystem. And why a workflow engine? Because we think there's no 1 solution that solves all the parts of the problem within the data and machine learning ecosystem. Right? You know, there are efforts now to build that 1 system. I don't think that exists. I'll give you an example. Some of your data could be in Snowflake. And even for most people who are using Snowflake, I've heard that some of their sensitive data is not in Snowflake, in some cases, and the reason is because Snowflake is not on prem. So now you already have this dichotomy in the storing of data. And then once you have it stored, you may wanna take it out and process it. You may wanna deploy it. You wanna serve it. So you need to do these multiple steps, and there are some tools which are fantastic at processing, like Spark and Ray and, like, all of these other tools that are coming up. 1 of the problems that the users of these systems have, and it's also counterintuitive to me, is that you first go and start a cluster.
You install all your requirements on the cluster, then go and write the code and install that code. And then, you know, in some cases, with Dask or Spark or so on, your process will run. Now you did this, and somebody else in your team wants to write another piece with some different dependencies. They have to do the same thing over again with a different cluster, and a 3rd person has to do this over again with yet another cluster. And now you end up proliferating the number of clusters in the company, and there are hundreds of people. Who's managing this process? So, yeah, there's a lot of setup and teardown, like, weird things that keep on happening in the system. So what we realized is that you need to abstract that infrastructure from the user. The users should just focus on their business logic.
And in the ideal world, 1 day, hopefully, we'll find the holy grail, and people will just write code and it just works. And if it's involved with multiple different systems, it kind of orchestrates all of these systems to achieve the end goal. And so that's how we think it's different from the current set of solutions out there, and that's because it is actually a different way of looking at the problem and a different way of solving that problem. We took a very first principles approach in this, where we were like, okay, if we were to take every single piece of execution, let's say a Spark job, as a unit of execution, what do we know about it? We know nothing. It could scale to hundreds of machines, thousands of machines, or it could run on 1 machine. We know what inputs you wanna provide it and what outputs you expect from it. And we know some metadata about it, like how people wanna scale it. So we actually built everything around the first principle that everything is a monad of some sort, in pure functional style. We call that a task, and then you bring together multiple of these tasks to form a workflow.
And that's a pipeline. And we created an abstract language to model this system. What that gave us is, like, when we were working with other people, we saw that even though Python is awesome, not every task is written in Python. There are still Scala, Java, and C++ tasks that exist. We were like, okay, this allows us to abstract languages. That's how Flyte was built: to be a serverless orchestration platform that is language agnostic, yet offers you type safety, like what you can get with Python type annotations. Do you wanna add anything to this, Haytham?
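To make the task-and-workflow idea above concrete, here is a minimal sketch in plain Python of typed units of execution composed into a pipeline. This is not the real flytekit API (Flyte's SDK derives a task's typed interface from Python type annotations via its `@task` and `@workflow` decorators); the `task` decorator, `clean`, `score`, and `pipeline` below are simplified stand-ins invented for illustration.

```python
# Sketch of Flyte-style typed tasks: each task declares typed inputs/outputs,
# and the wrapper enforces them, so mismatches fail loudly instead of silently.
from typing import Callable, get_type_hints

def task(fn: Callable) -> Callable:
    """Stand-in decorator: validate inputs and output against annotations."""
    hints = get_type_hints(fn)
    ret_type = hints.pop("return", None)

    def wrapper(**kwargs):
        for name, value in kwargs.items():
            expected = hints.get(name)
            if expected is not None and not isinstance(value, expected):
                raise TypeError(f"{fn.__name__}: {name} expected {expected.__name__}")
        result = fn(**kwargs)
        if ret_type is not None and not isinstance(result, ret_type):
            raise TypeError(f"{fn.__name__}: return expected {ret_type.__name__}")
        return result
    return wrapper

@task
def clean(raw: str) -> str:
    return raw.strip().lower()

@task
def score(text: str) -> int:
    return len(text)

def pipeline(raw: str) -> int:
    # A "workflow" is just a typed composition of tasks.
    return score(text=clean(raw=raw))

print(pipeline("  Hello Flyte  "))  # 11
```

Because the interface is typed, calling `clean(raw=123)` raises a `TypeError` immediately, which is the kind of early failure the type-safe interface is meant to give you.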
[00:17:42] Unknown:
That's a great summary, I guess, of what the platform looks like today. I might add, and maybe dig a bit deeper into, a few of the constructs. I think you talked about the tasks and how they are, you know, arbitrarily sized and formatted. They can be containers, they can be just function calls, they can be really anything in this representation of an intermediate language we have. We approached this with a, you know, compiler, I guess, in mind: you can write in whatever language, and then things get transformed into this intermediate language we use, and that allows us to abstract your programming language and abstract, you know, data types. The typing system is completely represented in this language as well. You can invent new languages in your own world, in your own code, and still be able to interop with existing pipelines in other languages, in other workflows that other people built in your company or in your team.
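The "intermediate language" idea above can be pictured as a language-agnostic record of each task's typed interface. Flyte actually specifies this with Protocol Buffers; the JSON rendering below is only an illustrative sketch, and the `train` task and its field names are made up for the example.

```python
# Sketch: extract a task's typed interface into a language-agnostic spec,
# so a task written in one language can be referenced from another.
import json
from typing import get_type_hints

def train(features: str, epochs: int) -> float:
    # Hypothetical "training" task; returns a fake accuracy.
    return 0.98

def interface_spec(fn) -> str:
    hints = get_type_hints(fn)
    ret = hints.pop("return", None)
    spec = {
        "name": fn.__name__,
        "inputs": {k: t.__name__ for k, t in hints.items()},
        "output": ret.__name__ if ret else None,
    }
    return json.dumps(spec)

print(interface_spec(train))
# {"name": "train", "inputs": {"features": "str", "epochs": "int"}, "output": "float"}
```

Once an interface like this is published, another workflow only needs the spec (not the code or its dependencies) to compose with the task, which is what makes cross-language and cross-codebase references possible.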
We have the workflows on top of those. That's where you, like, connect different tasks to form a pipeline. Workflows, we took them even a step further, I think, from what you traditionally see in pipelines. You can connect not just tasks. They are composable. You can connect other workflows. You can pull tasks and workflows from completely different code bases. They are just, you know, essentially a remote reference to somebody else's published work. You don't have to think about the dependencies and dependency conflicts between your code and their code. You just need to know that, you know, somebody in your company built a task that trains, you know, XGBoost models or something, and you can just reference that. They can also be dynamically generated. And in the current offerings, at least in the market, there are tools that either give you these, you know, statically defined pipelines, or you have to choose other tools that only give you dynamically produced workflows.
You can't mix and match, and I don't think this is right. I think there are problems that are very easily defined in a static, you know, graph. You can visualize it. There are a lot of benefits in having statically compiled workflows. Right? You can do static analysis of the, you know, data types and the interop between the interfaces, and you can see them in, like, some UI and inspect them. Others can view them, and so on. And there are a lot of other use cases for dynamic, you know, workflows, where you want to have some custom logic and, you know, if conditions and loops and so on to help you define the workflow at run time. And with Flyte, you don't have to, you know, choose 1 or the other. You can mix the 2 and use the right tool for the use case you have. You can have a statically defined workflow with parts of it being, you know, this dynamically generated graph.
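A rough sketch of that static-plus-dynamic mix, in plain Python rather than the flytekit API (Flyte's actual construct for runtime-shaped subgraphs is the `@dynamic` decorator; the region-training tasks below are hypothetical):

```python
# Static outer workflow with a dynamically generated inner fan-out:
# the number of training tasks is only known at run time.
from typing import Dict, List

def train_region_model(region: str, rate: float) -> Dict[str, object]:
    # Placeholder "training" task: returns a fake model record.
    return {"region": region, "rate": rate}

def fan_out_training(regions: List[str]) -> List[Dict[str, object]]:
    # Dynamic part: the graph shape (one task per region) depends on input.
    return [train_region_model(region=r, rate=0.01) for r in regions]

def workflow(regions: List[str]) -> int:
    # Static part: fixed structure, fan out then aggregate.
    models = fan_out_training(regions)
    return len(models)

print(workflow(["sf", "nyc", "austin"]))  # 3
```

The outer `workflow` can still be compiled and inspected statically, while the per-region subgraph inside `fan_out_training` is produced at execution time.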
We also saw that at Lyft, where you build a reusable workflow that gets run per, like, region. Right? Lyft and, like, all the ride-share companies will have this, you know, construct of custom or tuned models per region, because not all regions in the world, or in the US or wherever, will behave exactly the same. And it's a very common problem with models in general. Right? You can't just build 1 model and expect it to run in all demographics and, you know, in all areas exactly the same. And you want to give people tools to parameterize these models, or the pipelines they are built on.
And Flyte has, you know, what we call launch plans, which is essentially a customized launch form, or a template, for launching the workflows you built, and you can add schedules on them, retrain, and so on and so forth. As you're talking through the sort of complexities of machine learning, and then juxtaposing that with the
[00:21:39] Unknown:
overall sort of exercises that people will need to go through to be able to do the data preparation for feeding into those models, and just managing the overall platform ecosystem. There are a couple of different categories of tools that I see Flyte kind of sitting in between. You know, 1 is the data workflow engine, where you're looking at things like Airflow, Prefect, Dagster, which are very focused on this: I need to build a data pipeline so that I can, you know, do my extract, load, transform, and then, particularly with Dagster and Prefect, being able to then say, I wanna kick off a, you know, a training job. Maybe I wanna integrate with an MLflow or a Metaflow.
And then there's the other type of tool that I see you sitting in that same Venn diagram with: Metaflow, Kubeflow, MLflow, which are very focused on the model training and delivery and the MLOps workflow piece of it. It seems that you're focused on serving both of those aspects, the intersection of that Venn diagram. So I'm wondering what you see as some of the deciding factors that people in the community, and some of your customers now that you're building this business around it, are going through as they decide: do I wanna use Airflow and pair that with Metaflow, so Airflow does all my data prep and then I hand off to Metaflow for the machine learning piece? Or maybe I just throw it all at Flyte, or maybe I'm using Kubeflow because I've already got the Kubernetes engine. What does that decision matrix look like as people decide which sets of tools they wanna use, and what are the cases where they ultimately decide that, no, Flyte actually does all the pieces that I need, or Flyte does the 80% and this is the other 1 or 2 components that I wanna integrate? What does that process look like as people go through this very complex decision structure?
[00:23:29] Unknown:
I wanna say that I'm putting my open source Flyte contributor hat on. It's very easy to muddy the line between the commercial offering and Flyte, and we wanna say that Flyte is a sacrosanct open source project that, as Union, we just contribute to. We do not own it. So let me put the open source hat on. The answer to that question, and some of the answer to your previous question as well, is that 5 years ago, some of the things that Flyte is doing were not possible. That's because we actually hit the tipping point where running serverless is an acceptable and possible phenomenon. Right? Lambda and Kubernetes and containers and all of these things make it possible.
And if I draw this timeline 5 years into the future, all data products would love to be fully serverless from the user's point of view. Right? So that's the underlying premise here. That being said, Flyte is a general-purpose platform, but Union, the company, is focused on the machine learning usage of Flyte. And we encourage all users of Flyte to see where machine learning is required in their flow, and then I think they'll find good value in the system. I'll give you some examples. For example, 1 of the key decisions in our system follows from it being a workflow engine: if a task is running and, let's say, some network error happens, you'd get a heartbeat failure for that task, and so now you'd restart the task.
Now, Flyte is designed as a cloud-native system, so if you deploy the control plane itself, you could potentially lose those heartbeats if something gets delayed with the task. So should we kill the task? No. It may be a training job that has been running for 18 hours, and you cannot kill it just because of a control plane deployment. We actually designed it in a way that that would never happen. It is designed to sustain these kinds of catastrophic failures or operator errors. Beyond that, we also designed for what happens if a spot machine is used, if you're using spot instances, because cost, as I said, is an extremely important attribute in machine learning.
So what might happen is you are doing a 10 hour training job, it's been 9 hours, and boom, the spot machine goes away. Now, do you restart and do the 9 hours of work again, spending the money on GPUs for another 9 hours, or do you resume from that 9 hour point? And so Flyte ships with intra-task checkpointing. Even within a workflow engine, we realized tasks need checkpointing, because there's no workflow engine that could solve a training loop. It's too tight a loop, and so we don't wanna own that loop. Right? The next part here is: but I did 8 hours, and I have 3 retries.
I reached 9 hours, I again got pulled out, and now it's my last retry. Do I again put you on a spot machine and hope? Hope is not a strategy. So Flyte puts you on an on-demand or reserved machine automatically. Flyte understands that you're using a spot machine. Right? These are extremely important trade-offs that we made while designing the system. They were not bolted on later; they are core to the system. Along with that, having a first-class data frame schema within the system was essential to us, because most users are working with Pandas data frames or Spark data frames that they're analyzing. And when I think about a data platform, there is a part of the data platform that just does query transformations on the data warehouse.
Flyte can do it, but we don't really think it is optimal there. There's dbt, which is probably much better than us at that. Right? But when you're in the machine learning world, you're probably working with the data frames themselves. And so you want to use Arrow and Parquet and optimize the load and store of these data frames. That's a lot of boilerplate. Just last week, I was working on a PyTorch model on a GPU, and I wanted to load it back on my CPU, and I forgot: oh, I have to store the weights in a device-independent way. This is boilerplate. So this is where we are putting a lot of our resources, so that users of the platform don't have to think about this boilerplate. It just works. You go, yeah, this model was trained, I pull it down from Flyte, and it works on my CPU, and I push it back, and it works on a GPU.
So to answer your question, you should think about whether you have workflows, whether serverless is a requirement or an idea that you like, and whether you are a Kubernetes shop, because we are completely Kubernetes-dependent, and that may not be true for everybody. If you think operating and scaling servers is not something you wanna do, then Flyte is probably the right fit for many parts of it. That being said, for pure query transformation, it might be better to go with dbt. If you are a small, 3-person shop, running Flyte on your own might involve running a Kubernetes cluster, because even if you get an EKS or GKE cluster, there is some involvement in running those machines. But if you're comfortable with Kubernetes, it's a much more natural fit than running something that needs you to write YAML, understand the Kubernetes API, and make mistakes. And if you care about reproducibility and auditability in your process, then we think Flyte is a very, very good fit for these scenarios. And that's why we see a lot of adoption within the biotech community.
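The spot-instance story from a couple of answers back, resuming from a checkpoint rather than redoing 9 hours of work, and falling back to on-demand on the last retry, can be sketched in plain Python. This is an illustration of the behavior described, not Flyte's implementation; all names here are hypothetical.

```python
# Sketch: intra-task checkpointing plus last-retry fallback to on-demand.
from typing import Dict, List

checkpoint: Dict[str, int] = {}  # stands in for durable blob storage

def train(task_id: str, total_steps: int, fail_at: int = -1) -> int:
    step = checkpoint.get(task_id, 0)       # resume from the last checkpoint
    while step < total_steps:
        if step == fail_at:                 # simulated spot preemption
            raise RuntimeError("spot instance reclaimed")
        step += 1
        checkpoint[task_id] = step          # persist progress as we go
    return step

def run_with_retries(task_id: str, steps: int, retries: int, preempt_at: List[int]) -> str:
    for attempt in range(retries):
        last_try = attempt == retries - 1   # hope is not a strategy
        machine = "on-demand" if last_try else "spot"
        fail_at = preempt_at[attempt] if machine == "spot" and attempt < len(preempt_at) else -1
        try:
            train(task_id, steps, fail_at)
            return machine
        except RuntimeError:
            continue                        # the checkpoint survives the retry
    raise RuntimeError("exhausted retries")

# Preempted at steps 4 and 7; the third attempt finishes on on-demand,
# resuming from step 7 rather than step 0.
print(run_with_retries("train-eta", 10, 3, [4, 7]))
```

The key point is that each retry picks up from the persisted checkpoint, so the 9 hours already spent on GPUs are never thrown away.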
[00:29:05] Unknown:
Yeah. I want to add 1 thing about the notion of the Venn diagram and the intersection between these tools. We are trying, at least in Flyte, not to reinvent the wheel. Right? There are a lot of very good parts of these projects that we just take out of the box. Kubeflow has a bunch of operators that do distributed training on Kubernetes and does an excellent job at how those run. There is no reason you have to choose between either Kubeflow or Flyte rather than just getting the best of both worlds. And Ketan was talking about dbt and how much easier data transformation is there, which is absolutely true. And there's, again, no reason you have to choose.
It's not that you either do transformation in Flyte, or whatever the workflow engine is, or in dbt. You should be able to say: I want to run this part in dbt, then get the data in that format and pass it along to a distributed training job using the TensorFlow operator, then take that and ship the model in, say, FastAPI. We should be able to express that without having to learn about the infrastructure behind each of those pieces. We try, at least, to make that possible with Flyte. Right? You focus on the business logic, pick the right tools for the right parts of your workflow, and run the most cost-efficient way. If you have a training job, run that on a GPU; there's no point in running regular data processing or moving data around on a GPU machine. That's very expensive, and you shouldn't do that. So, yeah, Flyte sits in between, that's true, but it tries to integrate, for the most part, with the right tools in these platforms.
[00:30:54] Unknown:
Digging now into the architecture of the Flyte project, I know that it's modeled very closely on the Kubernetes API and its capabilities. I'm wondering if you can talk through some of the implementation and some of the ways that you approached the design and user experience of the project to abstract out some of the more complex elements of Kubernetes and make it approachable for data practitioners?
[00:31:53] Unknown:
I think you hit the nail on the head here with abstracting Kubernetes. Nobody wants to talk to Kubernetes APIs directly or figure out how to write a pod or a job, or a YAML or a CRD, or how to deploy that, clean it up, scale it, and recover when it fails. Flyte overall, if you look at the 10,000-foot view, is a very typical cloud service. Right? There is a control plane that can be deployed completely separate from the data plane, where the executions and the data live. All of these pieces can be scaled independently, horizontally and vertically. They were designed so that if you have more load, you throw more machines at it and it scales with you, with, of course, redundant data stores in the back. And it's very portable, thanks to Kubernetes.
It can run on different clouds, on premise, or on your machine. You can start with a very tiny node, like 1 box running on your laptop with very low resource usage, and scale all the way to a multi-cluster setup in production, with data replication and redundancy and so on. When we were at Lyft, the setup was 1 control plane, essentially 1 URL backed by, of course, redundant services, managing multiple data planes. We had, I don't know, 5 or 7 clusters. And from a user perspective, they just knew that 1 URL. They had no idea where their executions ran or how many machines they took, and they didn't have to know. There are also a lot of metrics Flyte exposes that help users with cost management and just understanding their executions.
But, yeah, from a coding-against-Flyte perspective, you don't need to learn about the architecture in the back. We care a lot about giving guarantees about reproducibility and the robustness of these executions and being able to recover executions when they fail. We rely a lot on packaging code and on very strict interfaces. There's a requirement for typing: for any task or workflow you describe in Flyte, the system needs to know the input types and output types. This might add a bit of friction in development, but we see a lot of increased velocity as soon as people adopt this paradigm. You write a function. You add type annotations.
You write the workflow, you add type annotations, and it fits right into your development workflow in any language you use. Even in Python, there's no requirement for anything extra. You just use the native type annotations, and the system takes care of transforming them, doing the validation, and doing the type checks. Workflows get compiled on the fly when they are registered in the system.
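A toy sketch of that idea, that native type annotations become the task's declared interface, checkable at registration time before anything runs. The helper names here are made up; flytekit's real machinery is far more involved.

```python
# Sketch: reject any "task" whose interface is not fully typed.
import typing

def register_task(fn: typing.Callable) -> typing.Callable:
    hints = typing.get_type_hints(fn)
    params = fn.__code__.co_varnames[: fn.__code__.co_argcount]
    missing = [p for p in params if p not in hints]
    if missing or "return" not in hints:
        raise TypeError(f"{fn.__name__}: untyped interface {missing or ['return']}")
    fn.__interface__ = hints  # recorded at "registration" time
    return fn

@register_task
def featurize(raw: str, scale: float) -> float:
    return len(raw) * scale

# The declared interface is now inspectable without running the task.
print(featurize.__interface__)
```

An untyped function, say `lambda x: x`, would be rejected at registration rather than failing deep inside a distributed run.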
[00:35:03] Unknown:
1 other thing: when we first started, as I said, Python was critical. When we first started, we were on Python 2.7, prior to the type annotations world, and we didn't know how to really seamlessly bridge the gap. That's 1 of the reasons why we moved to this new Kubernetes-based architecture in 2018: we wanted to build a very seamless experience for the user. It should feel as if you're writing natively in Python, but it's magic when it goes to this crazy distributed system. And in about January 2021, we actually rewrote the entire SDK into a fully type-safe Python experience where you write code in Python and it gets transformed.
What this enabled is that folks who write code in Python feel almost at ease. Everything runs locally with Python, a simple, single-threaded, local execution kind of experience. And when you wanna scale, it just seamlessly scales to multiple machines. Another thing it enables, because of the typing system, and this was always a core capability but becomes even more natural, is to use a Pandas data frame and then consume it as a Spark data frame, and then downstream consume it as some other data frame. Right? They are all abstracted using the same underlying type system within Flyte. Moreover, we support things like memoization and caching. That means if you've run something, it doesn't have to be run again by anyone in the entire company.
If 1 person ran a query that results in some data frame, you don't have to rerun the query again and again. Run it once, and Flyte captures the results. So we are extremely focused on UX. We were, and still are, very dependent on Docker containers, but we are abstracting how often you have to build containers, to a point where you write 1 file and it just runs immediately. And we're further abstracting how Jupyter Notebooks and Flyte can interact without losing the power of Git and so on. We're constantly innovating on behalf of the users in the user experience department, because we think that's the only way. The second part is that we still want people to use the right language for the right problem. We don't want to force you to use Python for everything. If R is the right thing, use R. If Java is the right thing, use Java. And that will continue to happen more as cost becomes a problem in machine learning. That's why our design is essentially for this multi-language world, just like RPC services.
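The memoization just described, run it once and everyone in the company gets the cached result, comes down to keying results by task identity, task version, and a hash of the inputs. A minimal sketch (names are illustrative, not flytekit's):

```python
# Sketch: Flyte-style memoization keyed by task name, version, and inputs.
import hashlib
import json
from typing import Any, Callable, Dict

_cache: Dict[str, Any] = {}
calls = {"count": 0}  # instrumentation so we can see cache hits

def cached_task(name: str, version: str) -> Callable:
    def wrap(fn: Callable) -> Callable:
        def inner(**inputs: Any) -> Any:
            digest = hashlib.sha256(
                json.dumps(inputs, sort_keys=True).encode()
            ).hexdigest()
            key = f"{name}:{version}:{digest}"
            if key not in _cache:
                calls["count"] += 1
                _cache[key] = fn(**inputs)  # run once, company-wide
            return _cache[key]
        return inner
    return wrap

@cached_task("run_query", "1.0")
def run_query(sql: str) -> str:
    return f"dataframe for {sql!r}"

run_query(sql="select 1")
run_query(sql="select 1")  # cache hit; the function body does not run again
print(calls["count"])
```

Bumping the version string, or changing any input, produces a new key, so stale results are never silently reused.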
When we started doing this, people said every company should use 1 language. But no, most companies are actually polyglot. There are teams that use Go, teams that use Java, and teams that use Scala. And they all work together because of strong interfaces
[00:37:53] Unknown:
way of thinking about this problem. To add to that, yes, I think I did hear that feedback about Flyte, and it's a good opportunity to talk about it: that Flyte is mostly focused on the production-grade side of writing workflows and pipelines, and maybe not on being easy to use or easy to get started with. I think that's related to what Ketan was saying. The way it was designed, it was built to really give you all the robustness and reproducibility guarantees, and that meant a very stringent rule book you have to follow: how you package your code, how you register, how you commit code changes; everything is tracked, and so on. I would say over the past couple of years, we've managed to not lose any of that while making the experience seamless, way more skewed towards prototyping, super quick to get started, iterating really quickly over your code, while still guaranteeing all of the production-grade things you would expect from Flyte.
[00:39:05] Unknown:
In terms of the end users of flight, I'm wondering what you see as the predominant kind of user roles and sort of personas that you're focused on and the different types of affordances that you've built into the system and the user experience and the API design to be able to collaborate across data engineering and MLOps and data science and business users and figuring out how to bring all the different stakeholders together on this kind of core use case and utility that's necessary for every aspect of the business?
[00:39:40] Unknown:
Yeah. Great question. The primary user persona that Flyte's authoring system is focused on is a person who, I don't wanna say is not afraid, but likes writing code in Python or Java or Scala, 1 of those languages. The Java and Scala SDKs were contributed by Spotify. Someone who likes writing code, is not shy of using Git to track their code, and knows a little bit about Docker containers, just barely enough to get started. So that's the user persona for the authoring system. But we realized it's not only the authoring system; there could be folks interacting with the system other than the authors of the code. And that's where the UI comes in. We've built a UI that we don't think is as pretty as it can be. It should be better, but it's very functional. It builds a launch form automatically, which is type-safe. That's another reason we use types for annotating your data inputs: it will automatically say, give me a Parquet file, give me an integer.
And some of our end users have actually built further layers on top that allow things like: give me an integer between 0 and 100. And that brings it down to almost any user. In some cases, like LatchBio and Zymergen and other users of Flyte, their end users are lab technicians, and those people can use Flyte directly without really knowing that they're using Flyte. We are also very heavily invested in the after-the-fact story of an execution, because in machine learning, that is often the regular work: you write something, quickly look at the results, then go back and write something. And that all works through Jupyter Notebooks, because everything is powered by an API. You can just pop open a Jupyter Notebook on any machine, maybe Colab, connect to the Flyte API, and boom, you are running. You can rerender the results, pull the results, get a model, do other things with it. So the user persona for authoring is, as I said, data engineers, ML engineers, software engineers, and lots of product engineers nowadays writing pipelines.
And also some data scientists and research scientists who are comfortable with at least writing Python code and using Git. The end user scenarios could be almost anybody.
[00:42:12] Unknown:
The other challenge that often comes in when you're dealing with these workflow engines, because pipelines have a habit of proliferating, is figuring out how to manage the logical organization of these different types of tasks. How do I encourage reuse of these different modules to ensure that this 1 metric is being computed the right way every time, or that this dependency chain happens every time? So this Spark job needs to happen before I can do this transformation in the data lake, and I wanna make sure that anybody who's trying to do that transformation will automatically pull in that dependency. How do you ensure that sort of organizational scaling of usage for the Flyte workflow engine?
[00:42:55] Unknown:
Fantastic question. Yeah. That's why I say it's not a workflow engine; it is more of a platform. And that's because of where we started: we wanted to offer it as a central service for the entire company. Right? The moment you think about offering a central service for the entire company, you have to think about how different tenants will sit on it. There will be 1 who is big and heavy, who comes in and says, I want to spend $1,000,000 on your product, and there will be others who say, I just wanna do this 1 small thing. And so you have to be fair across this spectrum. So Flyte builds in constructs for project management, and we also have concepts for managing projects through their life cycle, from development to production.
And that's built in. I think it's 1 of the only products of its kind that does this out of the gate. Within those, you can say: for the development domain within my project, I wanna use role X, which doesn't have permission to, let's say, some datasets, but when I go to production, I want to use role Y. When I say role, I mean IAM roles or service accounts used to access some dataset. So we are trying to build governance in here, though we are not really a governance platform, and we are trying to build some data management parts into the system. And then there are the reusability aspects, which I think Haytham briefly touched on.
A task is a monad in the pure functional sense. What that means is that tasks are relocatable: you can just pick 1 up, throw it somewhere else, and it should run. That's the construct. Again, it's easier said than done. If you don't have side effects, that's when it works; if there are side effects, then there's no way out. But assuming there are no side effects, as Spotify's blog and some other people have shown, what they are doing is creating base-level tasks, what we call platform tasks, that anybody else within the company can just reuse, including across teams. And this dramatically improves reuse in the system, because you don't have to think about what the task was written in. Is it Java, Scala, Python?
What dependencies does it have? Does it use TensorFlow, PyTorch, or something else? You don't have to think about any of that. You just say, I wanna call this service, and boom, it runs for you. It may turn out that it's a Spark job, and you just get the results out of it. Another element of
[00:45:24] Unknown:
the work that you're doing at Flyte, and workflow orchestration in general, is the fact that it's a natural choke point for metadata creation and manipulation, and for incorporating metadata from the broader platform back into the workflows. And so I'm curious what types of constructs you have for generating that rich metadata, given that you have all the context of the task graph, the operations, the datasets, and the type information; for being able to consume metadata from external systems such as OpenMetadata or DataHub or the AWS Glue catalog; and for propagating it back out into those systems, so that the metadata doesn't get locked into Flyte and can be used more broadly across the overall life cycle and use cases for data in an organization?
[00:46:10] Unknown:
Excellent question. I love the way you phrased it, because I think, more often than not, a lot of these systems are silos. The data comes in, you do stuff, and then you get the data out, but that's pretty much it. As soon as you leave the system, you lose all the metadata about how this was generated and who did it. And as we were talking about earlier, Flyte is not trying to reinvent the wheel here. We would love to integrate, and we do integrate, with the best tools for the job. There is, of course, a natural metadata engine within Flyte, because of all the cataloging of metadata we collect throughout.
But at the ingestion and egress points, we also provide tools and an events pipeline: the data coming in can come in with metadata, and the data going out can go out annotated with metadata that you can consume. We have some customers who use DataHub for data lineage, and all they do is consume the egress events coming out of Flyte to build up a beautiful DataHub lineage graph of the transformations that happened to the data. The dream, I think, is that when you get some prediction, or maybe a bad prediction, in production, you want to be able to link it to the specific version of the data that came into the pipeline maybe a month ago and made it all the way to production today.
[00:47:47] Unknown:
And without all this rich metadata across different systems, you will not be able to. Yeah. I think 1 place where we're seeing a little bit of a miss, and we are working on this on our side as well: Flyte is probably, again, 1 of the only systems that treats versioning as an intrinsic construct. What that means is that every time you do something, it is versioned and tracked, and hence reproducible from that historical point of view.
We've seen that many of the metadata systems do not take the time axis into account, but that is changing. Even the file formats, like Apache Hudi and Iceberg and so on, are now allowing immutable datasets with point-in-time snapshots and recovery, and that works really well with our philosophy of versions. Going back in time, I can really dream of a world in the future where everything is versioned and immutable at some level, which makes it really beautiful: you can just go back in time and rerun it, or just observe it again as if it were happening.
And for that, we as a community need to change some practices. Some of what was taught to us is, I don't wanna say wrong, but misinformed data engineering practice: oh, if something is broken, let's just delete the table and rerun it. But what if there was a side effect you had already caused? From that corrupted table, you've already trained a model. It's already deployed to production, and people are consuming it. Now you've gone ahead, deleted the entire table, and recreated it. I've lost the history of that model, and I cannot really trace back why it did certain things. And who will inform those consumers? How are we gonna inform them? Flyte does some of this today: it's able to produce events that you can listen to. We don't have a way to consume datasets at that point today; that's 1 of the things we'll be working on later in the year. And I think there is, in general, a bit of a divide happening between ML and data, really, and that's 1 of the reasons why we are in this weird state where you ask, are you here or there? We're trying to bridge the gap.
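The DataHub-lineage integration mentioned earlier, consuming Flyte's egress events to build a lineage graph, might look roughly like this. The event shape here is hypothetical; real deployments map Flyte's actual event schema into DataHub's ingestion API.

```python
# Sketch: fold task egress events into an artifact-level lineage graph.
from collections import defaultdict
from typing import Dict, List

# Hypothetical egress events: each task reports the artifacts it read and wrote.
events = [
    {"task": "extract", "inputs": [], "outputs": ["s3://raw/users"]},
    {"task": "train", "inputs": ["s3://raw/users"], "outputs": ["s3://models/eta-v3"]},
]

def build_lineage(events: List[Dict]) -> Dict[str, List[str]]:
    """Map each output artifact to the upstream artifacts it was derived from."""
    lineage: Dict[str, List[str]] = defaultdict(list)
    for ev in events:
        for out in ev["outputs"]:
            lineage[out].extend(ev["inputs"])
    return dict(lineage)

print(build_lineage(events))
```

With a graph like this, a bad prediction served from `s3://models/eta-v3` can be traced back to the exact raw dataset version that produced it.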
[00:50:09] Unknown:
Data lake architectures provide the best combination of massive scalability and cost reduction, but they aren't always the most performant option. That's why Kyligence is built on top of the leading open source OLAP engine for data lakes, Apache Kylin. With their AI-augmented engine, they detect patterns from your critical queries, automatically build the datamarts with optimized table structures, and provide a unified SQL interface across your lake, cubes, and indexes. Their cost-based query router will give you interactive speeds across petabyte-scale datasets for BI dashboards and ad hoc data exploration. Stop struggling to speed up your data lake. Get started with Kyligence today at dataengineeringpodcast.com/kyligence.
There are a whole number of directions that would be great to dig into, but in the interest of not letting this go on for 2 or 3 hours, I'd like to dig a bit into what you're building at Union as well. Last year, you founded Union AI to offer a managed version of Flyte and to act as a kind of corporate backing for the open source project, which is housed at the Linux Foundation, to my understanding. I'm wondering if you can talk through some of the overall goals that you have as a business, the business model that you are building on top of the Flyte engine, and some of the additional features and capabilities that you're looking to layer on top of the open source foundation?
[00:51:36] Unknown:
Again, this time I'm gonna put the Union AI hat on, so we'll not talk about Flyte in the same way. From Union's point of view, the business model is really purely open source driven. I don't like to call it open core. Some VCs may not like this, but we are trying to build a purely open source platform that serves a large number of companies. The ethos behind that is essentially that such a system, as we have talked about already, if it is not truly open, can never survive and thrive long term. And this is a long-term play. It's not, let's do something in the next 2 years and be done with it. We're trying to build it for a generation. We wanna change the way we think about the problem and solve it, and we think open source is the only real way to build such a thing. If the open source is successful, then by virtue of us being known as the creators and maintainers, we think there's a lot of value we can add. Last year, we thought about a lot of different models of what we should do. We were talking with a bunch of customers and users, asking: what's the best value that we can add? We can add value by improving accessibility to Flyte. That's really the best value we can add today, along with the contributions we make to the open source. And in talking with a bunch of customers, we realized that they really want to get started using best-of-breed technology very quickly, without having a big infrastructure team, because some of our users have big infrastructure teams.
And so we asked, how do we level the playing field between the large users of Flyte and the new folks who wanna start using Flyte? By making Union be their infrastructure counterpart.
And that's really, like, our model in the beginning: essentially, we wanna evangelize Flyte and get more people using Flyte by using Union when it's right for them. When it's not right, the Stripes and Spotifys and Lyfts and whoevers of the world, let them continue to use the open source. What they really give us is access to their brains and the best engineers in the world, literally, to drive this problem. So I think it's a win win for both, ideally. And over time, Union is also building other pieces within this ecosystem that we think can help companies not only adopt Flyte, but improve their overall machine learning progress.
It will all be underpinned by Flyte. We have 1 product strategy, but we think there are aspects of it that help with either the adoption of Flyte or the adoption of various interesting processes by companies. So that's our business model.
[00:54:30] Unknown:
In terms of the usage of Flyte, now that it has gone through such a drastic evolution from when you first started working on it to where you are today, and there's been widespread adoption at every scale, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:54:47] Unknown:
Yeah. Like, I think everything just amazes me. Like, I had no idea. I briefly mentioned biotech, for example. A lot of biotech people are using it. In our head, we kinda group them together as 1 category, but that's not really right. Within biotech, some people are solving early detection of cancer, some people are creating synthetic materials, some people are discovering new drugs, some people are running CRISPR in the browser, and some people do DNA analysis for pets and animals and humans. So, yeah, it's just fantastic for me to see that. And on the other hand, there are people like Blackshark.ai, a company that actually takes satellite imagery and converts it to three-dimensional space that powers Microsoft Flight Simulator, amongst other things, and can be used for making games and all kinds of fun stuff, and can be used for synthetic data for training new models.
And they use Flyte extensively. And they really scale Flyte, to the point where even I'm like, oh my god, it can work at that scale? Yeah. So it's just been fantastic, and people are so humble and nice and easy to work with. It's just a pleasure to be part of this community and see these amazing usages. Like, I think we knew about the Gojeks and the Lyfts and so on of the world because we worked with them. But with these other companies, I'm getting to live a vicarious life of, like, you know, working at multiple companies.
[00:56:18] Unknown:
Yeah. Yeah. I think there are maybe 2 experiences I had throughout this journey that I love. 1 of them is a company that was part of the various companies, I guess, that helped build the COVID vaccine. And it was amazing to hear their story and how they used Flyte for the processing and some of the simulations. And the other 1 is a couple of companies that started leveraging Flyte to build their own platform and sort of resell that. And I think this is part of what sets our model apart from the typical open core companies, because we ship Flyte as truly open source. You can take Flyte and build a completely new platform on top and call it your own. You don't have to rebuild the workflow engine, you don't have to rebuild all the pieces we already did. The launch plans and, like, all of these things just come out of the box, and you can add your value add on top. The builders appreciate that and the end users see that, and it just creates a win win situation for the entire industry.
[00:57:27] Unknown:
And in your own experiences of building Flyte and bringing it to where it is today, and now founding a company to help support its continued growth and adoption, what are some of the most interesting or unexpected or challenging lessons that you've each learned in the process?
[00:57:41] Unknown:
Again, putting an open source hat on. A commercial entity is sadly required today to back an open source product, and I'll give you an example of why. A few weeks ago, we had a vulnerability that somebody discovered in the product. If it was purely a Linux Foundation project, completely community driven, it would have taken a long time to fix that vulnerability. Looking at the gravity of the problem, we jumped on it, solved it, and got a patch out, and we pinged as much of the community as we could reach to patch their systems live, literally. And this is the benefit of having a company that backs an open source product.
And from Union's point of view, it actually reduces the cost for us to have a sales team and so on. Like, you interviewing us is literally a great sales pitch for Union, potentially saying, hey, this is a product which we are offering for free, but if you're interested, definitely work with Union. And this was a great journey just to see how this has to be built. But the question still remains, how do we make it sustainable? And we are trying. We can do it sustainably if enough people adopt Flyte and if we can keep driving value from that. Hopefully, we are on our way to doing that, and you'll see more about it this year.
[00:59:11] Unknown:
This is my first experience building a company, so pretty much every day is a surprise to me, a learning experience. Everything is new, you know, having to put on different hats, doing sales pitches, writing technical deep dives with security teams, setting up infrastructure. It has all been a learning experience. An amazing journey, and I've loved every minute of it. I hope it never stops being a great learning experience. And as Ketan was saying, the community that we all somehow managed to build around Flyte is just amazing.
It's an amazing set of people who love to help. They always contribute back, and this is really all you want from an open source project. And yeah, wish us luck building the business around it.
[01:00:00] Unknown:
Yeah. 1 shout out to our investors also at this point. From an investor's point of view, this is potentially a crazy thing to do, and they have been really supportive. They care deeply about open source, and that's how I think great open source products get built. Thank you. So we've talked about this a little bit earlier as well. But for people who are interested
[01:00:23] Unknown:
in adopting a new workflow engine, or they wanna be able to streamline their data preparation through to machine learning, what are the cases where Flyte is the wrong choice?
[01:00:33] Unknown:
Great question. Today, I'll put the Union hat on: Union might be the right choice in some of these cases. But putting the Flyte hat back on: if you are a small team that does not really use Kubernetes in production, or does not have much DevOps strength, it might be a bit of a challenge to deploy a Kubernetes cluster and manage it. So that's not great. If you are purely doing query transformations and you do not have any need for machine learning or any kind of complex data processing, then you're probably better off using some of the other tools. And the last part is, if you have real time requirements, then Flyte is, again, not the right tool at the moment. But I'd stress, at the moment.
[01:01:18] Unknown:
And so as you continue to invest in the Flyte project and community and ecosystem, and grow the business at Union, what are some of the things you have planned for the near to medium term, or any particular areas or additional features that you're excited to dig into?
[01:01:35] Unknown:
Just like metadata, we don't think quality should be an afterthought, because here's another analogy. If I am ingesting some data and I store it and then perform a quality check on it, what if I find that there's a problem with the quality? What am I supposed to do? Am I supposed to go back and delete, or somehow roll back, which most systems don't support today? So it's a question to really think about. The way I think about the problem is that quality should be a gate. It's like unit tests. You should not allow data to go through if it doesn't meet the standards; you probably store it in an alternate way, like as raw data, and then perform the quality improvements that you need. So things like that. We have been working with the Great Expectations team. Pretty awesome.
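The "quality as a gate" idea described here can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not Flyte's, Great Expectations', or Pandera's actual API; all names (`validate_record`, `quality_gate`, the quarantine list) are hypothetical:

```python
# Sketch of "quality as a gate": validate records BEFORE they reach the main
# store, diverting failures to a quarantine area instead of storing first and
# trying to roll back later. Names and checks are illustrative only.

def validate_record(record):
    """Return a list of human-readable check failures (empty means valid)."""
    failures = []
    if record.get("user_id") is None:
        failures.append("user_id is missing")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        failures.append("amount must be a non-negative number")
    return failures

def quality_gate(records):
    """Split records into (accepted, quarantined) at ingestion time."""
    accepted, quarantined = [], []
    for record in records:
        failures = validate_record(record)
        if failures:
            # Keep the raw record alongside its failure reasons so it can
            # be repaired and re-ingested later, rather than deleted.
            quarantined.append({"record": record, "failures": failures})
        else:
            accepted.append(record)
    return accepted, quarantined

records = [
    {"user_id": 1, "amount": 9.99},
    {"user_id": None, "amount": 5.0},
    {"user_id": 2, "amount": -3},
]
accepted, quarantined = quality_gate(records)
print(len(accepted), len(quarantined))  # prints: 1 2
```

In a Flyte setting, a check like this would typically run as its own task so downstream tasks only ever see data that passed the gate; libraries like Pandera or Great Expectations would supply the actual schema checks.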
We also are working with Pandera, another awesome open source project. So definitely check it out. More on that to come soon. And then I think we can use this forum to talk about it: we quietly launched a new open source product called UnionML, which is on top of Flyte, but it's under the Union AI OSS umbrella. We are using this umbrella to iterate faster on those. UnionML is an early product that allows you to train a model. It does not talk about workflows and pipelines at all; it actually talks about models and datasets, and it makes it possible to declare a dataset, train a model, and then serve that model in about a hundred lines of code. And this model serving can happen using REST APIs or, you know, streaming.
And so this is a project that we are really excited about and working on, and we are trying to partner with the ML community on it. Our vision is to build a micro web framework like thing for machine learning, where we only provide opinions on where the hooks should go. And we are working with the community to essentially build hooks like Weights & Biases and whylogs and, you know, explainability, etcetera. So we are excited about this, and through the year you should see more. Another thing we are really excited about is how sharing should happen in this world, how collaboration can be driven not just by releasing papers, but by actually usable code, so that all organizations can drive impact really quickly.
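The "micro web framework for machine learning" idea, where you register hooks on a model object rather than hand-writing a pipeline, can be illustrated with a toy decorator sketch. To be clear, this is not UnionML's real API; the `Model` class, its `trainer`/`predictor` decorators, and the toy mean-based "model" are all made up for illustration:

```python
# Toy sketch of a hooks-based, model-centric framework: user code registers
# a trainer and a predictor via decorators, and the framework object handles
# the orchestration. Illustrative only; not UnionML's actual interface.

class Model:
    def __init__(self, name):
        self.name = name
        self._trainer = None
        self._predictor = None
        self._artifact = None  # whatever the trainer returns

    def trainer(self, fn):
        """Decorator: register the training hook."""
        self._trainer = fn
        return fn

    def predictor(self, fn):
        """Decorator: register the prediction hook."""
        self._predictor = fn
        return fn

    def train(self, data):
        self._artifact = self._trainer(data)
        return self._artifact

    def predict(self, x):
        return self._predictor(self._artifact, x)

model = Model("mean_predictor")

@model.trainer
def train(data):
    # The "model" here is just the mean of the training data.
    return sum(data) / len(data)

@model.predictor
def predict(artifact, x):
    return artifact + x

model.train([1.0, 2.0, 3.0])
print(model.predict(0.5))  # prints: 2.5
```

The point of the pattern is that once the hooks are registered, the same object could be handed to a workflow engine for training or to a web server for REST serving without the user writing either of those layers.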
[01:03:53] Unknown:
Are there any other aspects of the Flyte project or what you're building at Union that we didn't discuss yet that you'd like to cover before we close out the show?
[01:04:02] Unknown:
No. So, yeah, again, Union hat on this time. If you have looked at Flyte and you think, oh, how will I do all of this, and you would like an infrastructure partner, we would love to talk to you. We would love to see how we can partner up. We're working with some design partners, and we'd love to see if we can work with you. Now putting the Flyte hat on: as Tobias said to us some time ago, if you tried Flyte 2 years ago, try it again. Give it a try. It is a continuously improving project with an outstanding community. Join the Slack channel, slack.flyte.org.
Ask questions. No question is off limits, so don't be shy. We love when people come in and ask questions, and any kind of criticism is also appreciated. Thank you.
[01:04:49] Unknown:
Alright. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think the aspects of versioning and reproducibility
[01:05:10] Unknown:
are underplayed, and we need to understand the time axis as a data engineering community. You know, data is a living thing, and different snapshots at different times may mean different things about the same dataset. Like, if you look at the number of COVID cases, yesterday's snapshot is different from 2 years ago. So the as-of point of view is very interesting. And I know there are projects solving some of it, but I think we should think more about it, because data is useful when you ask questions of it, and you will ask questions like this, essentially. So
[01:05:48] Unknown:
Alright. Well, thank you both very much for taking the time today to join me and share the work that you're doing at Flyte and at Union. It's definitely a very interesting project and a great ecosystem that you're building around it, and I definitely wish you great success in the business. I appreciate all of the time and energy that each of you have put into both the business and the open source project, and helping to make that capability available to everyone. Thank you for that, and I hope you enjoy the rest of your day. Thank you. Thank you for having us. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Flyte and Guests
The Origin and Development of Flyte
Workflow Orchestration Landscape and Flyte's Position
Flyte's Architecture and User Experience
User Personas and Collaboration in Flyte
Metadata Management and Integration
Union AI: Goals and Business Model
Future Plans and Features for Flyte and Union AI