Summary
As more organizations gain experience with data management and incorporate analytics into their decision making, their next move is to adopt machine learning. In order to make those efforts sustainable, the core capability they need is for data scientists and analysts to be able to build and deploy features in a self-service manner. As a result, the feature store is becoming a required piece of the data platform. To fill that need Kevin Stumpf and the team at Tecton are building an enterprise feature store as a service. In this episode he explains how his experience building the Michelangelo platform at Uber has informed the design and architecture of Tecton, how it integrates with your existing data systems, and the elements that are required for a well-engineered feature store.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Do you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s dataengineeringpodcast.com/talkpython, and don’t forget to thank them for supporting the show.
- You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!
- Your host is Tobias Macey and today I’m interviewing Kevin Stumpf about Tecton and the role that the feature store plays in a modern MLOps platform
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you are building at Tecton and your motivation for starting the business?
- For anyone who isn’t familiar with the concept, what is an example of a feature?
- How do you define what a feature store is?
- What role does a feature store play in the overall lifecycle of a machine learning project?
- How would you characterize the current landscape of feature stores?
- What are the other components that are necessary for a complete ML operations platform?
- At what points in the lifecycle of data does the feature store get integrated?
- What types of data can feature stores manage? (e.g. text vs. image/binary vs. spatial, etc.)
- How is the Tecton platform implemented?
- How has the design evolved since you first began building it?
- How did your work on Uber’s Michelangelo inform your work on Tecton?
- What is the workflow and lifecycle of developing, testing, and deploying a feature to a feature store?
- What aspects of a feature do you monitor to determine whether it has drifted?
- How do you define drift in the context of a feature?
- How does that differ from drift in an ML model?
- How does Tecton handle versioning of features and associating those different versions with the models that are using them?
- What are some of the most interesting, innovative, or unexpected projects that you have seen built with Tecton?
- When is Tecton the wrong choice?
- What do you have planned for the future of the product?
Contact Info
- kevinstumpf on GitHub
- @kevinstumpf on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Tecton
- Uber Michelangelo
- MLOps
- Feature Store
- Blog: What Is A Feature Store
- StreamSQL
- AWS Feature Store
- Logical Clocks
- EMR
- Kotlin
- DynamoDB
- scikit-learn
- TensorFlow
- MLflow
- Algorithmia
- SageMaker
- Feast open source feature store
- Jaeger
- OpenTelemetry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Do you want to get better at Python? Now is an excellent time to take an online course. Whether you're just learning Python or you're looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top notch course for you. If you're just getting started, be sure to check out the Python for Absolute Beginners course. It's like the 1st year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That's dataengineeringpodcast.com/talkpython, and don't forget to thank them for supporting the show.
Your host is Tobias Macey. And today, I'm interviewing Kevin Stumpf about Tecton and the role that the feature store plays in a modern MLOps platform. So, Kevin, can you start by introducing yourself?
[00:01:46] Unknown:
Yeah. For sure. And first of all, thanks very much for having me. Excited to be on here. I'm the cofounder and CTO of Tecton, which is the first enterprise grade feature store.
[00:01:58] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:02] Unknown:
So a couple years ago, I was 1 of the tech leads of Michelangelo, which is Uber's centralized end to end ML platform for operational ML. It got started in 2015, when we built this centralized system to make it really easy for data scientists and data engineers to get a machine learning model into production end to end, covering the entire workflow of cleaning your data, engineering the features, training the model, evaluating it, deploying it to production, and serving it in production. That's what this centralized platform allowed all the data scientists and data engineers to do, and it led to this Cambrian explosion of machine learning at Uber and eventually to thousands of models running in production to drive things like Uber's ETAs, Uber Eats restaurant recommendations, and a bunch of other use cases.
[00:02:56] Unknown:
From that work, you have built out the Tecton platform. You mentioned that it's an enterprise grade feature store. So I'm wondering if you can just describe a bit more about what it is that you're building at Tecton and the motivation for turning it into a business?
[00:03:10] Unknown:
So what we learned with Michelangelo at Uber, and also what we doubled down on as we talked to more enterprises, was really the understanding that building and deploying operational machine learning applications is really hard. Operational machine learning means ML apps that are running in production and typically drive products that the customer directly interacts with. These are models that need to make decisions within just a couple of milliseconds and that need to run at really, really high scale, which is different from offline machine learning models.
And what we've recognized is that building and deploying these applications is really hard because operational ML really consists of 3 different components: an application, an ML model, and data. Let me give you an example here. Let's say an operational ML use case is an Uber Eats recommendation. You've got the application, which needs to make a prediction, and that would be the Uber Eats application or its back end. Then you've got the model, which actually makes the prediction based on a bunch of data about the user and their preferences. And that leads to the 3rd component, which is the data, oftentimes referred to as the features, which are the predictive signals that the model actually makes a prediction on in order to recommend different restaurants to you. And what we've recognized is that building and deploying applications like the mobile app or a back end microservice is a fairly solved problem.
Even building and training ML models and getting them into production has become an increasingly solved problem with MLOps platforms. But what has not really been solved yet is the data problem for machine learning, meaning how do you turn raw data into features that are then fed to a model that's running in production, and how do you feed that data and those features consistently to the ML training pipeline that builds the model in the first place. At Uber, with Michelangelo, we solved the model and the data problem together, and the data problem itself we solved with Michelangelo's component which we call the feature store. And that is exactly what we focus on with Tecton: solving the data problem and all of the data problems around machine learning.
[00:05:37] Unknown:
And before we dig too much into Tekton itself and the ideas and components of a feature store, I'm wondering if you can just give a bit more detail about what a feature actually is in the context of machine learning and maybe a particular illustration of how it differs from just a discrete data point?
[00:05:57] Unknown:
Yeah. Absolutely. So a feature can really be interpreted as high density information that provides a signal to an ML model to make a prediction on, and it would typically be derived off of raw data. It could be aggregated raw data. For instance, for an Uber Eats recommendation, 1 feature that is very predictive of which restaurants to recommend to you is the type of cuisine that you've most frequently ordered from in the last 30 days, or the restaurant that you just clicked on 10 minutes ago. Those are all derived pieces of information that allow the ML model to come up with statistical correlations that then allow it to make a prediction for what you may really like to purchase next.
[00:06:53] Unknown:
To drive that home a bit more, the raw data that those particular features might be pulling from is, you know, click tracking events that are coming in through something like Segment or your customer data platform. And for the case of what purchases you've had in the past 30 days, you're looking at discrete order data so you can see what the restaurant is, and then what is the aggregation of those raw data points into that derived descriptor that you were mentioning.
[00:07:30] Unknown:
That's exactly right.
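To make that aggregation concrete, here is a minimal pandas sketch of deriving the "most frequently ordered cuisine in the last 30 days" feature from raw order events. The `orders` DataFrame, its column names, and the reference date are hypothetical, meant only to illustrate the raw-data-to-feature transformation described above.

```python
import pandas as pd

# Hypothetical raw order events: one row per completed order.
orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "cuisine": ["chinese", "chinese", "thai", "pizza", "pizza"],
    "ordered_at": pd.to_datetime([
        "2020-11-20", "2020-12-01", "2020-12-05", "2020-11-15", "2020-12-07",
    ]),
})

# Keep only the trailing 30-day window relative to "now".
now = pd.Timestamp("2020-12-08")
recent = orders[orders["ordered_at"] >= now - pd.Timedelta(days=30)]

# Feature: the cuisine each user ordered most frequently in the last 30 days.
favorite_cuisine = (
    recent.groupby("user_id")["cuisine"]
    .agg(lambda s: s.value_counts().idxmax())
    .rename("favorite_cuisine_30d")
)
print(favorite_cuisine)
```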
[00:07:32] Unknown:
Digging more into the feature store itself, can you discuss the different elements that go into it that give you the ability to take that raw data and build these derived aggregates that can then be fed into a machine learning model?
[00:07:47] Unknown:
So a feature store really manages all of your data for machine learning, and it makes it really easy for you to run data pipelines that transform the raw data and turn it into feature values. That raw data could come from your Kafka stream or your Kinesis stream, it could come from your data warehouse, like Snowflake or Redshift, or it could come from your data lake, S3, and whatnot. A feature store allows you to manage these transformations, to manage these data pipelines, and to automatically orchestrate and execute them. And then it also manages the storage of these feature values, where it typically makes it possible to store those feature values for offline consumption and for online consumption.
The offline consumption is necessary to drive batch predictions, like offline predictions, and to drive training processes that typically happen offline over large amounts of data. And then the online store is the 1 that is used to serve feature values in production to make these ultra low latency and very high scale types of predictions. And that's where the 3rd part of the feature store comes in: the serving of this feature data, to make that feature data consistently available for training purposes and for inference purposes. On the training side, it's typically very important that a feature store allows you to time travel to any point of time in the past and give a data scientist a snapshot of what the world looked like at any moment in the past, so that the model can learn, based on these individual data points from the past, what the world looked like and what happened, and with that information eventually train the ML model. And so, taken together, the feature store makes it really easy for a data scientist to productionize new features without requiring a ton of support from data engineers and whatnot.
It automates the feature computation, the backfills, and all of the logging around it. And then, of course, it also makes it very easy to share and reuse those features once you've contributed them to a feature store. It also tracks things like the feature versions, the lineage, and all the metadata around it. And then finally, it monitors the health of those feature pipelines in production. That is generally what feature stores do, and what we've layered on top of that core capability and feature set of a feature store.
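The "time travel" behavior described above is essentially a point-in-time join: for each training example, pick the feature value that was known as of that example's timestamp, so no future information leaks into training. A minimal pandas sketch, with hypothetical tables, might look like this.

```python
import pandas as pd

# Hypothetical feature values with the time at which each value became known.
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2020-12-01", "2020-12-05", "2020-12-03"]),
    "favorite_cuisine_30d": ["thai", "chinese", "pizza"],
}).sort_values("feature_ts")

# Hypothetical training events (label rows) with their own timestamps.
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2020-12-02", "2020-12-06", "2020-12-04"]),
    "label": [1, 0, 1],
}).sort_values("event_ts")

# Point-in-time join: for each event, take the latest feature value that
# existed at or before the event timestamp (no future leakage).
training_set = pd.merge_asof(
    events, features,
    left_on="event_ts", right_on="feature_ts",
    by="user_id", direction="backward",
)
print(training_set)
```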
[00:10:23] Unknown:
Yeah. Definitely interested in digging a bit more into Tecton and some of the discrete elements that go into the feature store and how it shapes the interactions of data teams and the life cycle of machine learning. But before we get to that, I'm wondering if you can just give a bit of an overview of how you characterize the current state of the landscape for feature stores, because I know that it's a fairly recent category of product. And so I'm interested in understanding a bit more about your views on the level of maturity and the type of adoption that they're seeing right now.
[00:10:56] Unknown:
It's definitely an up and coming category that I'd say we first coined with Michelangelo a couple of years ago. And since we first blogged about it in 2017, we've seen a lot of other tech companies and other companies build their own in house feature stores. You've seen things at Airbnb, where they've built Zipline, Booking.com built their own feature store, Facebook has its own feature store, and so there are a lot of in house developments, particularly at the large engineering powerhouses. Now looking at what's been happening in the last 3 years, besides Tecton, there are other startups in the space. There's Logical Clocks and there's StreamSQL, to name 2 examples.
And now, since this morning actually, very apt timing, AWS has released its own feature store, which seems to really be a feature repo that allows you to store feature values and serve them in production. You still need to take care of the data preparation, the data cleaning, and all of the running of the data pipelines and whatnot yourself, but they do give you a way to store feature values and serve them consistently to training pipelines and to models that are running in production.
[00:12:11] Unknown:
In terms of the overall life cycle of machine learning operations, the feature store is definitely a core component of it. And you mentioned some of the other aspects as far as model serving and deployment and the monitoring capabilities that are necessary there. I'm wondering if you can just give a bit of an overview of how the feature store fits in and what the additional components are that are necessary to have a fully fledged machine learning operations life cycle?
[00:12:39] Unknown:
So the entire ML workflow typically consists of multiple stages. It begins with fetching the raw data, then cleaning the raw data and doing your feature engineering work to turn it into your derived feature values, then you train your model. Once you've got your trained model, you evaluate it. You back test it and see if it is any good. Then you manage the model artifact, meaning that you store the artifact for reproducibility purposes, for versioning, and whatnot. Then you deploy the model into production, where you have it behind, say, a microservice, and then that microservice is able to actually make predictions in production, and you monitor it.
Now all of these different workflow steps typically require you to have an MLOps platform, which allows you to train your model, manage the artifact, deploy it, and serve it in production, and then you need a feature store to solve the data problems around machine learning that I highlighted earlier. And so with an MLOps platform and a feature store, you have a pretty good setup in order to go through the entire ML life cycle end to end.
[00:13:54] Unknown:
Some of the other interesting aspects of a feature store are the maintenance of the actual pipeline that derives those features and some of the challenges that might come about as far as resource contention, where you want to ensure that everybody's able to build out the features that they want and that the features stay fresh, but that you also don't overtask the platform because of maybe some unoptimized code that's being used to aggregate the information, where perhaps you're going with a brute force loop as opposed to a more mature algorithm that might be able to achieve the same output with fewer cycles. And so I'm curious how you approach some of those types of challenges in things like Tecton for being able to ensure that you have this shared resource that's scalable and accessible, particularly for engineers who maybe aren't focused on the performance characteristics and are just trying to achieve a certain outcome, and they might be going for a naive approach, and just how that factors into the overall development life cycle of these features.
[00:15:00] Unknown:
Yeah. Definitely. So Tecton is a cloud native platform, and it takes full advantage of the horizontal scaling opportunities in the cloud. So with Tecton, you don't have this noisy neighbor type problem that you've just described. The way that we avoid this is that the different data pipelines that generate 1 or multiple features are actually all run on independent data processing clusters. And so what we would do is spin up, on AWS for instance, individual EMR jobs that then run Spark jobs, which process the feature values for a given feature pipeline.
And that 1 Spark cluster is completely independent of any other Spark clusters that may be producing other types of feature values at different freshnesses or that may be producing feature values off of a stream. So all the different pipelines are really completely independent from each other, which avoids the noisy neighbor problem and the resource contention that you mentioned.
[00:16:11] Unknown:
And so for data scientists or analysts who are actually building out these feature pipelines and trying to define the data aggregation to create these concrete features, what does the workflow look like, and what is their interface for being able to actually define those pipelines and manage them?
[00:16:32] Unknown:
So at Tecton, we're big believers that features need to be managed as code, which allows you to bring all the DevOps best practices to machine learning and specifically to machine learning data. And because of that fundamental belief, the interface for defining and managing your features is actually Python files that are laid out in a directory structure of your choosing, where you use Tecton's Python SDK to define individual feature groups. And so you define various different metadata for features, like the name of the feature, the entity it's assigned to, like a user or a transaction or something like that, and the owner in the company who's responsible for it. And then you also define, of course, the transformation code itself, which could be Pandas transformation code, or it could be SQL, or it could be PySpark code, and that is all defined in these Python files on disk. And what that allows you to do is, of course, back these Python files up in a Git repository, which gives you all the typical Git abilities, like doing code reviews and whatnot.
And then when you're happy with the definitions of your features, you would use Tecton's CLI, in a very similar style to Terraform, and run tecton plan, which looks at all the Python files and all the feature definitions, which are now your feature definition goal state, and compares it to what's currently running in production: what are all the feature pipelines that you've already configured that are running day in and day out? And then this plan shows you the delta of your change, like, what are the new features you're about to create? What are the existing features you're about to modify that may actually impact models that are running in production?
And it actually prevents you from making these changes unless you know exactly what you're doing. It is also able to show you expected costs, like, is that actually a change you want to be making, because it could be spinning up pretty expensive data pipelines. And then when you're happy with the plan that you're seeing, you run tecton apply in the CLI and actually apply these changes to the production system. And now Tecton would be running these pipelines for you on an automatic basis and storing the feature values for offline and online serving. And what that tooling allows you to do is integrate it with your existing CI and CD pipelines. So as mentioned earlier, you can use code reviews to have somebody actually approve your changes to the feature repository.
You can write and run unit tests. You can write and run integration tests to ensure that the feature transformations actually all make sense and are valid. And then your CI/CD pipeline would use the CLI to actually roll out those changes to production. And then if you wanted to, you could even monitor the changes, and if anything looks fishy, you could kick off an automatic rollback and whatnot. That is what the primary interface looks like for contributing new features to the feature store. Now on the consumption side, we have 2 main APIs. 1 is, of course, the API for online serving in production, which is a gRPC or a REST interface that you can query to fetch feature values at very low latency and very high scale.
And then there is a Python SDK that you can use in your Jupyter Notebooks, for instance, or on your laptop to fetch historical values from Tecton's offline feature store to generate your training dataset, off of which you're going to train your ML model.
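The exact Tecton SDK API isn't spelled out in this conversation, so the snippet below is only an illustrative sketch of the declarative, features-as-code style described here: a Python file in a repo that declares a feature's name, entity, owner, source, and transformation. Every class and field name is a hypothetical stand-in, not the actual SDK.

```python
# features/user_cuisine.py -- illustrative only; not the actual Tecton SDK.
from dataclasses import dataclass, field

@dataclass
class FeatureDefinition:
    name: str                 # human-readable feature name
    entity: str               # the entity the feature is keyed on, e.g. "user"
    owner: str                # who in the company is responsible for it
    batch_source: str         # where raw data comes from (warehouse table, stream, ...)
    transformation_sql: str   # the transformation, here expressed as SQL
    tags: dict = field(default_factory=dict)

favorite_cuisine_30d = FeatureDefinition(
    name="favorite_cuisine_30d",
    entity="user",
    owner="data-science@example.com",
    batch_source="warehouse.orders",
    transformation_sql="""
        SELECT user_id, MODE(cuisine) AS favorite_cuisine_30d
        FROM warehouse.orders
        WHERE ordered_at >= CURRENT_DATE - INTERVAL '30' DAY
        GROUP BY user_id
    """,
)
```

In the workflow described above, this declarative goal state lives in Git, so code review and CI/CD apply to feature changes, and the Terraform-like CLI (tecton plan, then tecton apply) diffs it against what is running in production before rolling anything out.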
[00:20:15] Unknown:
Digging a bit more into Tecton itself, can you discuss how the overall platform is architected and some of the ways that the design has evolved since you first began working on it?
[00:20:26] Unknown:
So the platform itself is a microservice oriented architecture. The brains of Tecton run on Kubernetes, and the main technologies that we use in our stack are Go for all of the low latency serving, and then we use Kotlin for all of the non latency sensitive back end processes. The SDK itself is implemented in Python. Some of the major changes that we've made in the little over 2 years since we've gotten started are on the customer facing front. 1 big change that we made was that, initially, the feature transformations could only be expressed using Tecton's DSL.
That is something that we did at Michelangelo beforehand as well, and it worked really well, but it is somewhat constraining. It works for a lot of different use cases, but it doesn't work for all use cases. And so we extended that interface to allow customers to also express features using SQL and PySpark directly, or just plain Python Pandas code. Then another change that we made was that, initially, we said, hey, Tecton itself really manages the end to end feature life cycle, and that means that it manages both the feature transformations and all the data pipelines, and it takes care of the serving of feature values.
But there are some customers who are actually quite happy running their own data pipelines using Airflow, using DAGs or whatever it is, and they mostly wanna use Tecton to serve feature values online or offline, or to leverage its monitoring abilities, or have its central catalog. And what we then did was actually separate out the transformation capability and the storage capability, so the customers really have the choice of only using the storage and the serving without the transforms, or using the entire platform fully integrated, where Tecton runs the transformations and pipelines for you as well. 1 of the biggest changes that we made earlier this year was adding this DevOps capability to Tecton that I mentioned earlier, where, initially, you could create features only through a Python SDK in a notebook in an imperative style.
But we said that in order to really support the DevOps best practices, you should be able to define your features in a declarative style in files that you can check into Git, where you can have the entire source of truth of your feature store laid out in a file system and managed in Git. And that's when we moved over to supporting this declarative framework, adding the CLI to run these rollouts and whatnot.
[00:23:09] Unknown:
You invest so much in your data infrastructure, you simply can't afford to settle for unreliable data. Fortunately, there's hope. In the same way that New Relic, Datadog, and other application performance management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo's end to end data observability platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business.
By empowering data teams with end to end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 people will receive a free limited edition Monte Carlo hat. And were there any particular lessons that you learned in the process of building out Michelangelo and interacting with the users at Uber that were useful? And what are some of the assumptions that you built up as a result of that work that proved to be invalid as you exposed the work at Tecton to a broader range of use cases and industries?
[00:24:33] Unknown:
1 of the things that was invalid is the 1 that I just mentioned, where at Michelangelo it was totally fine to leverage a DSL to express feature transformations, and we saw, by interacting with a much broader set of customers with Tecton, that we need to give customers more flexibility and allow them to write their transformations using Pandas code or PySpark code or SQL code, and not just constrain them to a DSL. 1 of the lessons that we learned at Michelangelo that definitely proved to be true was that having a centralized platform which can standardize these workflows really is a tremendous gain in efficiency, because data scientists and data engineers don't need to make as many decisions about how do I define a new feature, or how and where do I run a new data pipeline to run my feature transformation code. There's just 1 central place that you go to, where you know how to use the system, and it takes care of all the heavy lifting for you.
And another thing that we learned, actually just from running Michelangelo in production and managing all of its SLAs, was how important it is to use cloud native infrastructure as much as possible. So for instance, with Tecton, our key value store on AWS is DynamoDB, instead of having Tecton manage its own key value store, so that we can offload this burden to a managed service which deals with supporting strong SLAs. At Uber, for instance, we had a separate team within the company which was managing entire Cassandra clusters for teams that would require a key value store running in production.
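As an illustration of leaning on a managed key-value store for online serving, here is a minimal boto3 sketch of reading a precomputed feature row from a DynamoDB table. The table name and key schema are hypothetical; Tecton's actual storage layout is not described in this conversation.

```python
import boto3

# Hypothetical online feature table keyed by entity id.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("online_feature_store")

def get_online_features(user_id: str) -> dict:
    """Fetch the latest precomputed feature values for one user."""
    response = table.get_item(Key={"user_id": user_id})
    # Missing users simply have no item; return an empty feature dict.
    return response.get("Item", {})

if __name__ == "__main__":
    print(get_online_features("user_123"))
```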
[00:26:33] Unknown:
Digging back into the workflow of the teams who are working with the feature store, both on the data engineering and data producing side and also on the analysis and data consuming side, how do you see the feature store changing the relationship between those different roles within the data organization?
[00:26:58] Unknown:
A feature store really fundamentally empowers the data scientists. They're now able to create new features on their own and go from idea to having them in production without having to depend on other teams. Whereas beforehand, if a data scientist had an idea for a new feature, they would typically have to write up feature engineering code in their Jupyter Notebook, test it offline, and then eventually throw this Jupyter Notebook over the wall to a data engineer or an ML engineer. And then that productionization work, where they would typically reimplement the pipeline in, like, Java or some other production ready language, would have to be prioritized.
The data scientists would have to wait several weeks or sometimes months to actually get this into production. And by the time everything's productionized, the data scientists may even just already be in a different team or have left the company altogether. And so with a feature store, data scientists are now really fundamentally empowered to go end to end from turning the raw data into features, turning those features into a trained model, deploying the model using their MLOps platform into production, and then serving the features in production without having to depend too much on other teams and their priorities.
[00:28:27] Unknown:
On the side of actually connecting to the raw data, I'm wondering if you can dig a bit more into how Tecton, and maybe feature stores generally, interface with things like a data warehouse or a data lake or some source of streaming data, and how you approach the unification of the interface for the end user so they can treat all of those different data sources as a generic store of raw data?
[00:28:56] Unknown:
Yeah. So Tecton itself integrates with data sources that are themselves already basically consolidated data sources. So for instance, a feature store shouldn't be confused with a massive data integrator that integrates across various different MySQL or Oracle databases across tons of different teams and organizations in a company. It really integrates with existing consolidated data sources like your data warehouse, like your centralized data lake, and your centralized streaming infrastructure, whether it's Kafka or Kinesis. And then those different data sources are onboarded onto Tecton, and the data scientists can then, without even having to know whether they're fetching raw data from a stream or from a data warehouse, write very simple SQL code or PySpark code that only deals with the raw data transformation into the feature values that they care about.
And then they can rely on Tecton to actually go back to the appropriate data source to fetch the raw values. To give you 1 example here: if you develop a feature which turns raw data from a stream into feature values, you typically have to be able to also run backfills. And so if you create a new feature, you wanna be able to bootstrap your feature store with historical feature values. Where do you get the raw data from in order to actually process and create these historical feature values?
The streaming infrastructure typically doesn't preserve all of the historical raw data, so you have to go back to a data lake or a data warehouse which persists all of the historical raw data. And with Tecton, you're actually able to tell the system to look in, say, a Hive table for the historical raw data and to look in a Kinesis stream or a Kafka stream for the online data. And so as you create a new feature, Tecton would automatically go back to that Hive table, load all the historical raw data, bootstrap the feature store with the processed feature values, and then have a seamless handoff to running the streaming data pipeline, which processes the raw data from your Kinesis stream or your Kafka stream.
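The backfill-then-stream handoff described here can be sketched roughly as follows. The source names, cutoff handling, and helper functions are hypothetical and only illustrate the idea of bootstrapping from a batch table before switching to the stream at the same point in time.

```python
from datetime import datetime, timezone

# Hypothetical paired sources for one feature: the batch table holds the full
# history, the stream only carries recent events.
BATCH_TABLE = "hive.orders_history"
STREAM_NAME = "orders-events"

def backfill_from_batch(table: str, until: datetime) -> None:
    """Recompute historical feature values from the batch table up to a cutoff."""
    print(f"backfilling features from {table} up to {until.isoformat()}")
    # ... run the same transformation over historical rows and write to the store

def consume_stream(stream: str, start_at: datetime) -> None:
    """Continue computing the same feature from the stream after the cutoff."""
    print(f"processing {stream} starting at {start_at.isoformat()}")
    # ... run the streaming pipeline and keep the online store fresh

# Seamless handoff: backfill first, then pick up the stream at the same cutoff
# so no events are missed or double-counted.
cutoff = datetime.now(timezone.utc)
backfill_from_batch(BATCH_TABLE, until=cutoff)
consume_stream(STREAM_NAME, start_at=cutoff)
```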
[00:31:23] Unknown:
Digging a bit more into the actual manifestation of the feature, you have this set of code that says perform these processes on the raw data to be able to create this derived attribute, where, going back to our earlier example, we have the order history of the last 30 days of this particular user, and so I know that they prefer Chinese food. For the actual training of the model and for its online operation, how does that actually manifest in terms of the API? Are they sending a parameter to the feature to say, here's the customer ID, now give me back their preferred style of cuisine? Or is it something where it uses this piece of code and then generates all the pairings of customer ID and preferred cuisine based on the historical data, and then as new information flows in, it keeps those values updated?
[00:32:18] Unknown:
On the training side, there is a Python SDK where you say, hey, Tecton, give me historical feature values for this set of features. And then you can optionally filter those feature values to a set of user IDs or transaction IDs or whatever entity your features are associated with. You can also filter the feature values that you want the feature store to return to you by a time range. And you can also, of course, say, hey, don't filter by anything for these 17 different features, just give me all the historical data that you have since the dawn of time for all of the users or transactions or whatever it is that you've ever ingested into the feature store. And then the feature store would return a data frame to you on which you do your training to generate your model. And then later on, when your model is running in production, you need to fetch feature values for a given entity, for a given instance on which you wanna make a prediction. So for instance, in the example of the Uber Eats recommendation, you have a user who opens up the app, and that user is associated with a unique ID, and then your back end system, which hosts the model and wants to make a prediction, will call out to Tecton and say, hey, give me feature values for user x, y, and z.
And you would then specify, like, which features do you actually care about. Is it the cuisine that this user has most frequently ordered from in the last 7 days, or is it what that user has just most recently clicked on?
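A minimal sketch of the online retrieval path just described: a serving-side service asks the feature store's REST endpoint for fresh values for one entity and a list of feature names right before calling the model. The endpoint URL, payload shape, and function name are hypothetical assumptions, not Tecton's actual API.

```python
import requests

# Hypothetical feature service endpoint exposed by the feature store.
FEATURE_SERVICE_URL = "https://features.example.com/v1/get-features"

def get_online_features(user_id: str, feature_names: list) -> dict:
    """Fetch fresh feature values for one entity right before a prediction."""
    response = requests.post(
        FEATURE_SERVICE_URL,
        json={"join_keys": {"user_id": user_id}, "features": feature_names},
        timeout=0.1,  # keep the call tightly bounded; serving targets milliseconds
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    # Example: the Uber Eats-style lookup described above.
    features = get_online_features(
        user_id="user_123",
        feature_names=["favorite_cuisine_7d", "last_clicked_restaurant"],
    )
    print(features)
```

The offline path is the mirror image: the Python SDK returns a DataFrame of historical values, optionally filtered by entity IDs and a time range, which becomes the training set.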
[00:33:58] Unknown:
And then going back to another element that you brought up in terms of managing the life cycle of machine learning operations is the concept of monitoring, where I know that there is monitoring that you would do on the actual deployed model to determine things like concept drift. But I also know that with feature definitions, there's the possibility of drift there and that you wanna monitor those. So I'm wondering if you can just give a bit of an overview about what types of information you're looking at when you're monitoring a feature within a feature store and some of the signals that you might be looking at to determine when you might need to update the definition of a model or redefine it entirely?
[00:34:38] Unknown:
So typically, what you wanna make sure is that the statistics of your features do not change too much over time as your model is running in production. So for instance, for numerical features, you would look at the mean or the standard deviation and stuff like that and make sure that those stay within certain bounds, typically within the bounds that you observed when you generated the training dataset in the first place. Because if there's massive drift across any of these statistics, then that would typically mean that you should retrain your model, because the world could have just changed and it just looks different now. And you've trained your model on the state of the world and what it looked like a couple of weeks or months or whatever it is ago.
And if the statistics and the data don't look anymore like what they did when you first trained the model, then your model may just make really poor predictions. And so it's super important to not just monitor the predictions themselves and check whether they're drifting, but to also monitor upstream how the features themselves are behaving and how they're changing over time. Because if you imagine that a model depends, say, on 1,000 features or so, and you were to only monitor the predictions themselves, then if anything looks off, it would be very hard to tell why it looks off and what the root cause is, and you may even notice way too late that something has changed, because it's oftentimes very minuscule, very tiny changes that may only be happening on, say, 1 or 2 features out of the 1,000 features. And so you really wanna be looking at all of the individual features in production and remain confident that their distributions and statistics are not changing too drastically over time.
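Here is a minimal sketch of the kind of per-feature statistics check described above, comparing the live mean of each numerical feature against the mean and standard deviation captured when the training set was generated. The data layout, column names, and the three-sigma threshold are illustrative assumptions, not Tecton's actual checks.

```python
import pandas as pd

def check_feature_drift(training_stats: pd.DataFrame,
                        live: pd.DataFrame,
                        max_shift: float = 3.0) -> list:
    """Flag features whose live mean drifts too far from the training-time mean.

    `training_stats` has one row per feature with 'mean' and 'std' columns;
    `live` is a recent window of served feature values with one column per feature.
    """
    drifted = []
    for feature in training_stats.index:
        baseline_mean = training_stats.loc[feature, "mean"]
        baseline_std = training_stats.loc[feature, "std"] or 1e-9  # guard zero std
        live_mean = live[feature].mean()
        # Alert when the live mean moves more than `max_shift` training
        # standard deviations away from the training mean.
        if abs(live_mean - baseline_mean) > max_shift * baseline_std:
            drifted.append(feature)
    return drifted

# Example with hypothetical numbers.
stats = pd.DataFrame({"mean": [12.0, 300.0], "std": [4.0, 50.0]},
                     index=["orders_last_30d", "minutes_since_last_click"])
live_window = pd.DataFrame({"orders_last_30d": [30, 28, 35],
                            "minutes_since_last_click": [310, 295, 290]})
print(check_feature_drift(stats, live_window))  # -> ['orders_last_30d']
```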
[00:36:34] Unknown:
Another aspect of the feature life cycle is the idea of versioning, where you have the original definition, you maybe then decide based on your monitoring that you need to tweak some of the parameters for this derivation function, and so then you redeploy it. If you have a machine learning model that's in production that was originally trained against the first definition of that feature, or maybe you're creating a slightly different function that has a slightly different intent, how do you associate the models that are running in production with the particular version of the feature that is necessary, doing things like version pinning and then managing releases and rollbacks for the actual feature code itself?
[00:37:20] Unknown:
So in Tecton, feature definitions themselves are immutable. Once you create a new feature and you deploy it to production, it has a unique version associated with it and a unique hash associated with it. And then typically, before you use features in production, you create what we call a feature service. A feature service is really a set of multiple different features that a model in production depends on. Say it's a selection of 10 different features; the feature service points at those 10 features and their respective immutable versions. And this feature service now provides a unique API endpoint that your model, and the microservices that host the model in production, would query in order to fetch the feature values.
And now if you, say, create a new or changed feature, that would be a new, modified feature with its own unique version. And then you would create a unique, separate API endpoint which points at that modified set of features. And once you've trained your model and once you've deployed it, you change that model to fetch feature values from this new, updated API endpoint.
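To illustrate the versioning scheme described here, this is a small sketch of immutable feature versions, identified by a content hash, grouped into feature services that each expose their own endpoint for a model to pin to. The types and naming are hypothetical, not Tecton's actual objects.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: a deployed feature definition is immutable
class FeatureVersion:
    name: str
    transformation_sql: str

    @property
    def version_hash(self) -> str:
        # A content hash uniquely identifies this exact definition.
        body = f"{self.name}:{self.transformation_sql}".encode()
        return hashlib.sha256(body).hexdigest()[:12]

@dataclass(frozen=True)
class FeatureService:
    """The set of pinned feature versions a production model depends on."""
    name: str
    features: tuple  # tuple of FeatureVersion, pinned by content hash

    @property
    def endpoint(self) -> str:
        # Each service gets its own unique API endpoint for the model to query.
        return f"/v1/feature-services/{self.name}"

v1 = FeatureVersion("favorite_cuisine_30d", "SELECT ...")                    # original
v2 = FeatureVersion("favorite_cuisine_30d", "SELECT ... -- tweaked window")  # modified

recs_v1 = FeatureService("eats_recs_v1", (v1,))
recs_v2 = FeatureService("eats_recs_v2", (v2,))  # new endpoint for the retrained model
print(v1.version_hash, v2.version_hash, recs_v2.endpoint)
```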
[00:38:36] Unknown:
For the actual process of getting the models from training based on the generated feature data into production and running against live data that's being generated from your online feature definitions, what is the integration path for you at Tecton for being able to work with the broader ecosystem of platforms that actually manage the serving and monitoring of the models themselves? And maybe some concrete examples of a workflow from defining the model, training it on the feature definition from Tecton, and then actually getting it into production on a hosted platform or a self managed platform?
[00:39:14] Unknown:
So the interfaces that Tecton exposes to integrate with other platforms are very generic, which makes it super easy to use Tecton with your modeling platform of choice or in your production system of choice. And so to give you a concrete example, say you use Databricks notebooks or Jupyter notebooks or EMR notebooks to train your ML models. Here, you would pip install and use Tecton's SDK, which is configured to connect to a specific Tecton cluster, and then use the SDK to generate training data. And then you would, say, use scikit-learn or TensorFlow to train your model, then you would use something like MLflow to package up that model artifact and make it deployable.
You'd also use something like MLflow as your model registry. And then from the notebook, you would communicate with your model serving system, whether it's Algorithmia or Seldon or SageMaker, to actually deploy your wrapped-up model bundle to, so that you now have a model endpoint that is available in production. And then that model endpoint would have to be called by some microservice which actually wants to make a prediction. That microservice would integrate with Tecton's gRPC or REST API to fetch feature values for a given set of features, and then pass those features into the request to the model endpoint on, say, Algorithmia or SageMaker to make the prediction.
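The notebook-side glue described here might look roughly like the sketch below: fetch a training frame, train with scikit-learn, and log the artifact with MLflow so a serving system can deploy it. The `get_historical_features` helper and the column names are hypothetical stand-ins for the feature store SDK call; the scikit-learn and MLflow calls are standard.

```python
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.linear_model import LogisticRegression

def get_historical_features() -> pd.DataFrame:
    """Hypothetical stand-in for the feature store SDK's training-data call."""
    return pd.DataFrame({
        "orders_last_30d": [12, 3, 7, 0],
        "minutes_since_last_click": [5, 600, 42, 1440],
        "label": [1, 0, 1, 0],
    })

training_df = get_historical_features()
X = training_df.drop(columns=["label"])
y = training_df["label"]

with mlflow.start_run():
    model = LogisticRegression().fit(X, y)
    # Package the trained artifact so a serving system (Algorithmia, Seldon,
    # SageMaker, ...) can later deploy it behind an endpoint.
    mlflow.sklearn.log_model(model, artifact_path="model")
```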
[00:40:53] Unknown:
In terms of the actual uses of Tecton and some of the example workflows that you've seen people build, what are some of the most interesting or unexpected or innovative ways that you've seen it used?
[00:41:06] Unknown:
1 interesting use case we didn't foresee ahead of time was that we have some customers who don't only use Tecton to manage their data pipelines for machine learning purposes, but also for heuristic driven applications that are running in production, which want to query some feature about a user and then make a heuristic driven decision based on that. And in those instances, you don't need to fetch the historical data in an offline setting, because instead of training an ML model, you actually have somebody hand implement heuristics that run directly in, say, a microservice. And then they just integrate with Tecton's online serving system to fetch the fresh feature values for a given user or whatever they wanna make a prediction on.
[00:41:53] Unknown:
And in terms of your experience of building out Tecton and working with customers and turning it into a business, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:42:06] Unknown:
1 of the things that we've learned over time is that, of course, lots of different enterprises are very different, and their preferences may be vastly different. You have, for instance, some companies who really, really care about open source, and they really care about being able to customize whatever system they're going to use in production, or they wanna be able to just manage and host it on their own. And other companies really just wanna rely on a company like Tecton and on enterprise software which comes with governance, which comes with strong SLAs, which comes with support and whatnot.
And we really wanna be able to serve the entire market and serve both ends of that spectrum. And because of that, we partnered up with Feast a couple of weeks ago. We published an announcement about that last week, where the main creator of Feast, which is an open source feature store that was created at Gojek, has decided to join Tecton. He has contributed Feast to the Linux Foundation, and we're committed to investing significant resources to make Feast really the best open source feature store out in the industry, and to give enterprises the choice of whether they wanna manage their own open source feature store, run it on their own, maybe customize it, or whether they wanna rely on an enterprise proprietary product like Tecton.
[00:43:34] Unknown:
And so for people who are considering using a feature store and they're looking at open source or managed solutions or trying to understand if a feature store is the right choice for them, what are the cases where Tecton might be the wrong choice?
[00:43:49] Unknown:
Yeah. Definitely. So Feast is an excellent choice for you if you care a lot about hyper customizability, where you can go in there, read the code, make changes to it, or fork it, and if you really also care about just running it on your own infrastructure, maybe even on prem, and you wanna manage and scale it yourself. Tecton is the right choice when you care about having a hosted offering, if you care about having strong SLAs and support, and also, of course, if you care a lot about governance and enterprise security and ACLs and things like disaster recovery and high availability. That's when you would come to Tecton.
[00:44:31] Unknown:
And as you look to the near to medium term of the Tecton platform and the business, what are some of the things that you have planned for the future?
[00:44:40] Unknown:
Over the next couple of months, we'll be announcing support for Tecton not just on AWS, but we'll also be releasing it on GCP as well as Azure. Besides that, we'll also make it much easier for analysts to develop their own features, specifically analysts who are not super comfortable with PySpark or, let's say, Python and Pandas, but folks who really care much more about having a very intuitive and simple to use UI. And then finally, today, Tecton supports batch transformations and streaming transformations as well as real time transformations that are executed on demand in production when you wanna get a feature value.
And we are going to invest much more heavily in that area, where in the future you'll be able to execute entire DAGs of real time features that are executed in production and that can depend on each other.
[00:45:38] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:45:53] Unknown:
I think that really, for me, comes down to end to end data quality monitoring and lineage tracking in an organization. Very similar to how you have things like Jaeger to do end to end request tracing in a distributed microservice system, I think we need something like that to track the life of a piece of data. Where does it originate from? Which systems have touched it? How does the distribution of it change over time? Who are, say, the owners of the different pipeline steps? And something like that gets especially interesting when you're crossing the analytical offline and the operational online stack.
[00:46:32] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing at Tecton and building out feature stores. It's definitely a very interesting area and 1 that I'm excited to see a lot of development on recently. So thank you for all the time and effort you've put into that, and I hope you enjoy the rest of your day. Thank you very much. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Kevin Stumpf and Tecton
Building Michelangelo at Uber
Understanding Features in Machine Learning
Components of a Feature Store
Current Landscape of Feature Stores
ML Operations Lifecycle
Scalability and Resource Management in Tecton
Workflow for Data Scientists and Analysts
Architecture of Tecton
Lessons from Michelangelo and Tecton's Evolution
Impact of Feature Stores on Data Teams
Integration with Data Sources
Feature Manifestation and API Usage
Monitoring and Feature Drift
Feature Versioning and Lifecycle Management
Integration with ML Platforms
Interesting Use Cases of Tecton
Challenges and Lessons in Building Tecton
Choosing the Right Feature Store
Future Plans for Tecton
Biggest Gap in Data Management Tooling