Summary
Machine learning is a process driven by iteration and experimentation, which requires fast and easy access to relevant features of the data being processed. In order to reduce friction in the process of developing and delivering models there has been a recent trend toward building a dedicated feature store. In this episode Simba Khadder discusses his work at StreamSQL building a feature store to make creation, discovery, and monitoring of features fast and easy to manage. He describes the architecture of the system, the benefits of streaming data for machine learning, and how a feature store provides a useful interface between data engineers and machine learning engineers to reduce communication overhead.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host is Tobias Macey and today I’m interviewing Simba Khadder about his views on the importance of ML feature stores, and his experience implementing one at StreamSQL
Interview
- Introduction
- How did you get involved in the areas of machine learning and data management?
- What is StreamSQL and what motivated you to start the business?
- Can you describe what a machine learning feature is?
- What is the difference between generating features for training a model and generating features for serving?
- How is feature management typically handled today?
- What is a feature store and how is it different from the status quo?
- What is the overall lifecycle of identifying useful features, defining and generating them, using them for training, and then serving them in production?
- How does the usage of a feature store impact the workflow of ML engineers/data scientists and data engineers?
- What are the general requirements of a feature store?
- What additional capabilities or tangential services are necessary for providing a pleasant UX for a feature store?
- How is discovery and documentation of features handled?
- What is the current landscape of feature stores and how does StreamSQL compare?
- How is the StreamSQL feature store implemented?
- How is the supporting infrastructure architected and how has it evolved since you first began working on it?
- Why is streaming data such a focal point of feature stores?
- How do you generate features for training?
- How do you approach monitoring of features and what does remediation look like for a feature that is no longer valid?
- How do you handle versioning and deploying features?
- What’s the process for integrating data sources into StreamSQL for processing into features?
- How are the features materialized?
- What are the most challenging or complex aspects of working on or with a feature store?
- When is StreamSQL the wrong choice for a feature store?
- What are the most interesting, challenging, or unexpected lessons that you have learned in the process of building StreamSQL?
- What do you have planned for the future of the product?
Contact Info
- @simba_khadder on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- StreamSQL
- Feature Stores for ML
- Distributed Systems
- Google Cloud Datastore
- Triton
- Uber Michelangelo
- AirBnB Zipline
- Lyft Dryft
- Apache Flink
- Apache Kafka
- Spark Streaming
- Apache Cassandra
- Redis
- Apache Pulsar
- TDD == Test Driven Development
- Lyft presentation – Bootstrapping Flink
- Go-Jek Feast
- Hopsworks
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What advice do you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly Media on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard earned expertise. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar to get you up and running in no time.
With simple pricing, fast networking, S3-compatible object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today, and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Simba Khadder about his views on the importance of ML feature stores and his experience implementing one at StreamSQL. So, Simba, can you start by introducing yourself? Yeah. Hey, I'm Simba Khadder. I'm the CEO and cofounder of StreamSQL. We're building a feature store for machine learning, as you said. And do you remember how you first got involved in the area of machine learning and data management?
[00:01:43] Unknown:
Yeah. So I actually started out working at a small startup back in college, and I used to do a lot of hackathons and had always been really interested in distributed systems. So my first real role was actually at Google, where I was working on Cloud Datastore. I've always loved distributed systems because of how messy they are. Unlike other fields, there's never really a right answer; everything is always a trade off. And on the machine learning end, I actually kind of fell into it because one of my friends asked me to help him. He was on a team of astrophysicists, and they were trying to find a planet on the outskirts of the solar system. They were like, hey, we have all this image data, we need someone to crunch through it, do you have some time to work on that? I just picked it up and learned it on the fly, and a lot of the same things that made me love distributed systems, how there isn't really a right answer, there's lots of messiness, there's a lot of creativity that goes into it, also exist in machine learning. So I kind of fell into it that way and fell in love with it for the same reasons. And so
[00:02:46] Unknown:
in terms of the work that you're doing at StreamSQL, can you give a bit of a description about what you've built there and what motivated you to start the business? Yeah.
[00:02:56] Unknown:
So StreamSQL is an open source feature store, and it enables teams to share, reuse, and discover features across teams and models. Generally, how it works is like this: first, you connect your data sources, whether streaming or batch: Kafka, S3, whatever. You transform and join them as you wish using SQL, and then you define your machine learning features on top of those sources. From there, you can serve those features in production and generate training sets out of them. We also manage all the versioning and monitoring of the features, and you can even include third party features, like embeddings or web data or whatever.
In a nutshell, the mission of it is to help machine learning teams focus on building models rather than ML pipelines. And how I got into it: I was at a startup before this that I founded, called Triton, and we worked with a lot of media companies. We did all kinds of stuff, but we really focused on the B2C subscription space. So all those paywalls and stuff, a lot of those are us. We did a lot of work around personalization, propensity, paywalls, all of this stuff that was powered by machine learning. At our peak, we were handling something like 100,000,000 monthly active users. And, you know, I was looking at the data science teams and seeing how they were doing things, and I realized that most of our time was actually spent building Flink and Spark jobs to build out datasets and get features into production.
And the whole process was just so chaotic and hard to manage. But the problem was that the feature engineering they were doing was actually what was driving the biggest gains, so it made sense. At the time, the term feature store wasn't really a thing; if you Googled feature store, nothing would really come up. I Googled around, and some other companies had talked about similar problems: Uber had Michelangelo, Airbnb had something called Zipline, Lyft had something called Dryft. We were looking for something open source that was good enough for our use case, couldn't find anything, and decided to build it in house.
From there, at one point I realized that this was really something special that would benefit our current clients and all kinds of other people. So I decided to take just that piece of the product and roll it out into its own company, which is now StreamSQL. And before we get too far into
[00:05:14] Unknown:
what you've built there, can you give a bit of a background about what a machine learning feature is and some of the difficulties that people have in working with them in the
[00:05:24] Unknown:
sort of typical deployment paradigm and typical workflow that they have for building machine learning models and serving them in production? So a machine learning feature, it's kind of funny, because it's still a nascent thing and there isn't really a super clear definition for it. But the way I think about it is that a machine learning model is essentially a function: it takes an input or inputs and generates an output. So let's say I'm Spotify and I want to recommend some songs for you. The input will probably just be me, the user, and the output should be the recommendations.
Now if I just give it that, the model doesn't really have much to work with if all I give it is a user ID. So usually what teams do is take that user ID and break it down into who the user is, the user's context, per se. So you might say, hey, what were the last couple of songs this user listened to? What's their favorite genre? All of this gives the model context it can use to make a better recommendation. Each of those things is a feature; their favorite genre could be made into a feature for the model. And, you know, it sounds kind of easy, but it's actually a lot harder than that. For example, let's say I want to tell the model how diverse a user's music taste is. Well, there isn't exactly an equation that defines music taste, so you need to figure out a way to creatively model that and give it to the model itself. That's what feature engineering is, and that's the basic part of what a feature is. Now, you really have to generate a feature in two contexts. First, there's the serving context. For example, I want to recommend a song for a user now; then you just need to know the values of all the features at that moment. In many situations you have streaming data, and, especially if you're Spotify, you need the recommendation now at relatively low latency. You need to know all these features with more or less current values very, very quickly. So usually things are preprocessed: you use something like Flink to process all the streaming data coming in and maintain all the values of the features, so that you can pull them via a lookup rather than trying to generate them at recommendation time. So that's the serving part. That's hard enough, but there's also the training part. Models are trained this way: essentially, you give the model the inputs and you give it the actual output, what actually happened. So, hey, here's all this information about a user at a specific point in time, and here's what they actually chose to listen to next. You give the model the inputs, have it make a recommendation, then give it the actual value, and it changes itself accordingly to try to be better next time. That's how training happens in a nutshell.
Now that means that, from the feature generation side, I not only need to be able to generate the features now, I need to be able to generate features at any point in time in the past. That's a whole other problem, and usually there's a whole other pipeline there. So you end up with all these ML pipelines. Some are streaming, some are batch, some are for serving, some are for training, and it's all kind of broken up across all these different layers in your infrastructure.
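To make the serving-versus-training distinction concrete, here is a minimal pandas sketch of the point-in-time idea described above: when building a training row for a label, the feature is computed using only the events that existed before that label's timestamp. The data and the favorite-genre feature are made up purely for illustration.

```python
import pandas as pd

# Hypothetical event and label data: what each user listened to (events)
# and what they chose next (labels), both timestamped.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 1],
    "genre":   ["jazz", "rock", "pop", "jazz"],
    "ts":      pd.to_datetime(["2020-01-01", "2020-01-05", "2020-01-06", "2020-01-10"]),
})
labels = pd.DataFrame({
    "user_id":   [1, 2],
    "next_song": ["song_a", "song_b"],
    "ts":        pd.to_datetime(["2020-01-07", "2020-01-08"]),
})

def favorite_genre_at(user_id, as_of):
    """Compute the 'favorite genre' feature using only events before `as_of`,
    so training rows never leak information from the future."""
    history = events[(events.user_id == user_id) & (events.ts < as_of)]
    return history.genre.mode().iat[0] if not history.empty else "unknown"

# Build a training set: one row per label, features evaluated at the label's timestamp.
training = labels.assign(
    favorite_genre=[favorite_genre_at(u, t) for u, t in zip(labels.user_id, labels.ts)]
)
print(training)
```

At serving time the same feature would instead be looked up with the current moment as the cutoff, which is why the two contexts so often end up as separate pipelines.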
[00:08:42] Unknown:
And that's kind of the problem that we strive to solve. And so as far as the current approach that most companies use, I know that it's still a sort of burgeoning field where there are more people coming into it and more people building their own solutions or adopting off the shelf ones. So what is the typical approach to handling features and being able to provide them both for training and online contexts?
[00:09:07] Unknown:
Yeah. So in the online context, I usually see one of two things. Either the company has enough data where they preprocess it: it goes through Kafka or whatever for the event bus, then it gets processed by something like Flink or Spark Streaming, and the actual values of the features are kept in something like Cassandra or Redis so that you can pull them very quickly. So that's the streaming side. Then you usually have a whole other side: all the events that come in also go into S3 or HDFS or some sort of file store, and then you have a whole other pipeline on Spark or some other batch processing system that generates all the training data. So that's usually how people do it today. Every single feature, at the very least, gets written twice: once for streaming, once for batch.
And then if you have a mix of streaming and batch sources, you kind of have to double that up again and you end up with four different pipelines. So that's typically how people do it today; it kind of exists in Flink and Spark and whatever else you use there. And then in terms of a feature store, how does that improve the lives of people working on machine learning models, and who are the sort of main downstream consumers of it, and who's responsible for building it and keeping it healthy? Yeah. So in doing this, I've learned how many different teams have to go into the process, because you need a data engineering team, and maybe, depending on how big your company is, an IT team to keep all the infrastructure up. You need someone to actually generate your features; that could be the data scientists themselves, it could be data engineers, it could be a mix of both. And then you have data analysts and other teams that also might add features or have things to do with the model themselves. So the way we think about it is, rather than having everything at such a low level, we allow people to define sources first (which you might otherwise generate with Spark or whatever), and you can do SQL on those sources so that you don't have to go down to that level. You can make queries and write SQL regardless of whether things are streaming or batch; we just try to unify it with a relatively generic layer to materialize views. From there, you can define features in what is essentially JSON; it's just configuration. And that means that you have one feature definition that's used across all contexts, whether it's training or serving or anything else. From that feature, you can actually generate your training datasets: you give it a set of features and a set of labels, the right answers, and it will generate a training set for you. You can ask, hey, what are the values of these features right now, and we'll be able to answer that in near real time.
So the idea is that it lets things exist in one space, rather than being spread across all your infrastructure, and it's defined in a very, very clear way. Beyond that, we have a feature registry so that you can actually browse your features and see basic statistical analysis of each of them. You can also search for features that other teams have used; you can pull a feature from another model. We think of features as the building blocks of your models, and so we built the whole platform to allow you to actually use them in that way rather than thinking of them that way but then having to define them at
[00:12:20] Unknown:
the Spark or Flink level. And with using a feature store, it seems that it provides a concrete interface for data engineers to hand off things to the machine learning engineers, rather than having a blurry line between where the responsibilities of one end and the other begins and having to reach inside each other's workflows to ensure that the overall development and delivery of a machine learning model is able to make it all the way through to production. And I'm wondering what your experience has been in terms of how it modifies your own work doing machine learning workflows and how it impacts the data engineers who are supplying the raw data that is being turned into these features?
[00:13:00] Unknown:
Yeah. So one really cool part, I think, is this: from the data engineer side, a lot of machine learning stuff makes no sense outside of machine learning. For example, let's say I want to know the average price a user has spent on an item. Well, what do I give the model if it's null, if a user has never bought anything? In a database, you just set it as null, but for machine learning you might set it to the median or the mean or something else. And so someone on the data engineering side has to work from this really arcane set of requirements: hey, this is how the feature has to work, I need it generated for training and serving. And then they have to go and implement that. So this is nice because it lets them not have to do that. They can just generate all the tables that are generically needed, like a purchase table or whatever else. The data science team, on the other hand, can just plug all that stuff in. Now they have very nice clean datasets and they can define their features more as configuration. So there's much less code and it's much more oriented toward their workflow.
So all the generic feature engineering techniques are kind of built in. Everyone does the same things, you know: fill in missing values, normalize a number to a range from negative one to one or whatever, remove outliers, all that stuff. We have this built in, so they can think again at the configuration level and quickly add things without having to know Flink, without having to know Spark, and without having to work on that side of the code base. They can work agnostically of each other, which is really nice for both teams.
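As a small illustration of the built-in, configuration-level feature engineering steps mentioned above (fill in missing values, remove outliers, normalize to a range), here is what those transformations look like in plain pandas; the column values and the percentile cutoffs are made-up examples, not StreamSQL's defaults.

```python
import pandas as pd

# Hypothetical raw feature column: average price a user has spent, with nulls
# for users who have never bought anything.
avg_price = pd.Series([12.0, None, 55.0, 8.0, None, 400.0])

# Fill missing values with the median rather than leaving them null,
# since most models cannot take nulls as input.
filled = avg_price.fillna(avg_price.median())

# Clip extreme outliers (here: beyond the 1st/99th percentile, an arbitrary choice).
clipped = filled.clip(filled.quantile(0.01), filled.quantile(0.99))

# Normalize into the [-1, 1] range described above.
lo, hi = clipped.min(), clipped.max()
normalized = 2 * (clipped - lo) / (hi - lo) - 1

print(normalized.tolist())
```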
[00:14:28] Unknown:
And as far as the user experience of the data scientists and machine learning engineers, you mentioned that there's a registry for being able to view the different features that have been defined. But I'm wondering what types of additional user experience improvements are useful, or what additional capabilities are necessary, for an effective feature store to be useful and maintain its overall health and utility within the system? Yeah. So one basic piece is versioning,
[00:14:57] Unknown:
which sounds like it should be a solved problem, versioning features. But the way most people do versioning now is just through Git. As you change the Spark code or whatever, it just gets versioned in Git and you can roll back that way. That can be really messy, especially if you want to use a single version of a feature, or if you're using a feature from another team and you want to make sure you have the same version all the time so that if they change the pipeline it doesn't break your model. So versioning is a big piece; it's something that hasn't really been figured out for ML features, and it's a core component of our feature store. Another piece has to do with monitoring.
Now, features are obviously based on underlying data. Underlying data changes, user behavior might change. For example, right now a lot of people are in shelter in place, so a lot of behavior in terms of buying things has changed dramatically, and lots of models are probably underperforming because they were trained in a different context. Monitoring allows you to see changes happening, understand what's happening, and be able to either retrain or change your feature accordingly to handle those problems. And then the final piece has to do with training set generation. Our training set generation is implicit, which means you just tell us what actually happened, as a stream or a set with maybe a timestamp, and we will generate the features at each timestamp and match them up with the label. So building training sets, which is a core piece of the workflow and of the iteration cycle, is as easy as defining the feature, essentially in JSON, and telling the training set generator, hey, I want to add this feature to this training set, and it just happens. You don't really have to think about where the data is, how to transform it, all these other parts of it; it just becomes way, way faster to iterate. We also backfill streaming data, which is another really nice feature. It means that even if your features are stateful, like average price spent per user, you don't have to wait three months to build up enough data; it will just use historical data to generate the stateful feature. So all of that together comes down to speeding up the iteration cycle for ML teams, especially on the data science side, which is again one of the core mission statements of StreamSQL. And in the discovery piece,
[00:17:13] Unknown:
what are some of the useful pieces of metadata that should be mapped to a given feature? And what are the options for people defining the features for being able to define that metadata in terms of the structure and content?
[00:17:26] Unknown:
Yeah. So one part is just literally the data type. For example, neural networks don't take strings; they only take floats or numbers. So you can filter on that; there's a piece of, does this thing even fit into my model? That's one piece. Another piece is the description, which is just plain text. This is nice because, how many companies exist where there are a million definitions of essentially the same thing, like how many items a user bought, and different databases have different values across the org? It's a really massive mess. There are obviously a lot of tools, Looker and BI tools and other things, trying to fix that problem, and we're a part of that solution as well. If you build a feature around, you know, how many items a user bought, other people can also use that feature later, or add a new version to it, and that creates kind of a source of truth for your features, which is really helpful for the ML workflow, especially when you have multiple teams. Machine learning engineering, and especially machine learning, is such a new field that a lot of the processes are still being figured out. So getting 50 people to work together on one problem is very, very chaotic, and this becomes a place where everyone can work together and collaborate and benefit from each other's work in a much easier way. And digging more into StreamSQL itself, can you talk through how the feature store aspect of it is implemented and the workflow that it provides in terms of being able to define features and then pull them into the models that you're trying to build? Yeah. So first, around deployment: you can go to streamsql.io, and there's a cloud based version that has a free tier.
And then you just choose where your cloud is, so if you're on AWS or Google Cloud, you just tell us and we'll host it there. We also have an open source version that you can use; it has slightly fewer features, but I definitely use it for my local machine learning. And then finally, with a lot of our clients we actually deploy it in their cloud directly, or on prem, or whatever else. So that's the deployment aspect of it. The way it works is three steps. First, you plug in your data. Maybe you're using Google Pub/Sub, so you just say, hey, this is a stream of data, it's Google Pub/Sub, the format is JSON. Once you plug all the data in, you can choose to join and transform all the data that you're getting, and then define your features. Defining features: our main API is in Python, and you just define them in what looks like JSON. That defines your features for you, and then you can use them for training sets and for serving. So you have two main methods. One is generating a training set: you give it a set of features and a label, which again could be a stream or a file that has the correct answers to use for training, and it will generate the features at the point in time of each of those labels to build the training set. For online features, you just call, you know, streamsql get online features with a set of features and entities, like maybe the user ID, and we'll generate that. So it's as easy as: connect your sources, define your features, and start using it for serving online data and generating training datasets.
[00:20:38] Unknown:
And how is the underlying architecture of the feature store implemented as far as being able to pull in the data and integrate it and then being able to
[00:20:49] Unknown:
create and store the features for being able to be served up? Yeah. So the underlying feature store today is built on top of Apache Pulsar, Flink, and Cassandra and Redis. The first layer is where the events come in and where batch data lives. All the events go into Pulsar, and we retain them forever, which lets us regenerate stateful features. We also have S3, if you just want to upload a straight up file, or GCS, or HDFS, whatever. That plugs into Flink. We use Flink to do the SQL transformations to materialize views and also to generate the training datasets. And then the online features are constantly processed as events come in, and the values are stored in either Cassandra or Redis depending on the feature and the size of the feature set. So that's how it works today. It used to be on Kafka, and it was actually a lot messier just because we couldn't retain all of our data. The cool part about Pulsar is it has infinite retention, so every event that comes in we can keep forever, and we can actually offload them to S3 to lower costs and increase scalability. Yeah. So that's the underlying
[00:21:58] Unknown:
architecture today. And you mentioned that you started on Kafka. And I know that you've got a fairly detailed blog post about your motivations and the process of making the migration to using Pulsar. And I'm wondering what are some of the other ways that the feature store has evolved since you first began working on it and some of the original assumptions that you had going into it that have been invalidated or updated as you've continued to build out the capabilities of the platform?
[00:22:26] Unknown:
Yeah. I think a big piece of it, one thing we've learned, is how machine learning is changing over time. Before, most features were pretty simplistic; they were just summations of everything a user did, maybe normalized. Nowadays a lot of people are using embeddings, which are essentially vectors that maintain a lot of data inside of them. You can turn a user and all of their behavior into a single vector. You can do the same for items, and it's very, very common to do it for text. Google has released all kinds of pretrained text embeddings that people use all the time.
BERT is kind of the new hot algorithm, and it also generates text embeddings from your text. We had to learn how to make it flexible enough that you could include all of these external third party features and still obviously do all the simple basic features. So that required a lot of changes. The other piece that was interesting is that one of the biggest problems at first for us was around streaming and batch data, because they forced us to create multiple pipelines. Again, all your past events are stored in maybe GCS or S3, and you'd use Spark or something similar to generate a training dataset; that's one pipeline. And then you'd have Flink or whatever else taking in all the streaming data and generating your online features.
And there's a whole juggling act you have to do there every time you create a stateful feature or want to put a feature in production or whatever else. So that was the core problem we were solving at first for ourselves when we built StreamSQL. And it's obviously a big problem, but we also learned that feature versioning, feature sharing, all this other stuff was kind of second order; they were just things that became obvious we could now do because of how we decided to implement it. And we realized that those value props are actually what made this much more interesting to me, and what made me actually decide to spin it off into its own business, because it kind of becomes a new way to iterate, a new way to do data science and machine learning, and a new way to think about features in general as fundamental building blocks of your models rather than just these inputs that you use to generate outputs.
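As a rough illustration of the embedding idea mentioned at the start of this answer (turning a user and their behavior into a single vector), here is a minimal NumPy sketch that averages the vectors of the items a user interacted with; the item vectors and the averaging scheme are assumptions for illustration, not how any particular system computes them.

```python
import numpy as np

# Hypothetical pretrained item embeddings (e.g. one 4-dimensional vector per song).
item_embeddings = {
    "song_a": np.array([0.9, 0.1, 0.0, 0.3]),
    "song_b": np.array([0.2, 0.8, 0.1, 0.5]),
    "song_c": np.array([0.4, 0.4, 0.7, 0.1]),
}

def user_embedding(listened):
    """Collapse a user's listening history into a single feature vector
    by averaging the embeddings of the items they interacted with."""
    vectors = [item_embeddings[s] for s in listened if s in item_embeddings]
    if not vectors:
        return np.zeros(4)  # fallback for users with no history
    return np.mean(vectors, axis=0)

# This single vector can then be served as one (multi-dimensional) feature.
print(user_embedding(["song_a", "song_c"]))
```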
So that forced us to change the whole model. It made us think about what a feature would look like; it even let us think about, hey, let's say there was no streaming data and we're just working with files and batch data, how can we make this thing valuable? So I think that kind of learning, and thinking in terms of what a feature store really means beyond just abstracting away the underlying
[00:24:59] Unknown:
architecture, is where things got really cool and where we had to make a lot of changes later to fit that vision we have. And as far as the capabilities of the platform, I know one of the things that it supports is being able to monitor features or freeze them in time. And I'm wondering what are some of the influences that can cause a feature to drift? And what are some of the signals or metrics that you look at in your monitoring? And how do you define the thresholds for determining when a certain action or remediation needs to happen? Yeah. So
[00:25:32] Unknown:
different types of features have different levels of sensitivity. If you take the average price users have spent on something, and that uses everything anyone's ever bought and you have a ton of data, it might be very, very hard to change that. If you're just taking the top song of the last week or whatever, that's going to change really, really quickly. So one thing is just first looking at how often these features change, I mean, how often they changed historically. That helps, because the model is trained at a certain point in time. Once the model is trained, it's kind of used to the features looking as they were: having the same sort of standard deviation, a similar mean, a similar statistical look to them. What we're looking for is when things happen that quickly change features that typically don't change, because that means the model has probably never seen anything like that, and it could cause its predictions to become really bad. The easiest example I can think of, because it's very relevant now, is that all of a sudden most of the US went into shelter in place. Well, every e-commerce recommender system will have had to change dramatically. All of the input features will probably have changed pretty dramatically. The way people buy things has changed.
The way people browse, their preference for whether they want to pick up in store or have it delivered, all that stuff changed very, very quickly overnight. And if you were to look at the input features for all those models, they will have changed dramatically, and the model might have just started spitting out garbage. Models are flexible to an extent, but most models will break down if you change a feature too dramatically; it just has never seen it before. So that's what we're looking for. We're keeping track of standard deviations, keeping track of averages, and if we see things shift really quickly in a way that is unusual for that feature, we flag it to the user and they can tell us what they want to do. Depending on the feature, they can just freeze it and say, hey, for now, until we fix this, if we see something unusual, just ignore it or set it to the average or a null value or whatever. Or they can retrain, and they might want to actually change their features entirely; they might say, hey, for our recommender system we want to add a new feature which is just the last three weeks of data, so that it catches the effect of whatever happened. So those are the things that cause feature drift and how people typically handle it: retraining, changing the feature, freezing it in place, or just telling it to ignore the change by adjusting the outlier detection so that values stay within range.
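A minimal sketch of the kind of check described here, flagging a feature whose recent mean or spread moves far from its historical profile; the window sizes and the threshold are made-up illustrative values, not StreamSQL's actual monitoring logic.

```python
import numpy as np

def drifted(history, recent, threshold=3.0):
    """Flag a feature whose recent mean has moved more than `threshold`
    historical standard deviations from the historical mean, or whose
    spread has changed drastically. Purely illustrative values."""
    hist_mean, hist_std = history.mean(), history.std()
    if hist_std == 0:
        return recent.mean() != hist_mean
    mean_shift = abs(recent.mean() - hist_mean) / hist_std
    spread_ratio = recent.std() / hist_std if recent.std() > 0 else 0.0
    return mean_shift > threshold or spread_ratio > threshold or spread_ratio < 1 / threshold

# Example: average basket size suddenly jumps after a behavior change.
history = np.random.default_rng(0).normal(loc=30.0, scale=5.0, size=10_000)
recent = np.random.default_rng(1).normal(loc=55.0, scale=5.0, size=500)
print(drifted(history, recent))  # True: flag it and let the team decide what to do
```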
[00:28:01] Unknown:
And on the versioning side of things, how do you approach being able to
[00:28:09] Unknown:
iterate on the different features and ensure that you don't accidentally introduce errors into an existing machine learning model that's actively using a given feature? Yeah. So when you add a new version, you're typically going to train first. I'll kind of give it away: you usually would not put a new feature into production and start serving it unless you've trained a whole new model on it. There are exceptions; for example, if you add new outlier detection or whatever else, you might be doing that to solve something you're seeing in production. The power of versioning comes in two parts. One piece is the versioning itself. Other teams will depend on that feature, and if I'm on a team and I have a model that depends on the feature, I don't want another team to be able to change it underneath me. I want to be able to depend on the current version, and if they increment it, then I can see that and decide what I want to do based on what changes they made. The other piece is that sometimes a feature might look better in a training set.
So you train on all this data and the feature looks like it's working really well. You put it in production, you A/B test it, maybe against the old model, and realize that it's actually not doing better in production. This is part of what makes machine learning messy: even if a model looks better in training and offline, it might actually do worse online. So it's really common to A/B test your models online even if you know one should be better theoretically. So just being able to roll back is really powerful there. Also, having a clear view of how a model or a feature has changed over time can help people understand the design decisions that were made, why things are normalized the way they are. It's just nice to have your features versioned and all in one place so that you can see what's happened, what's changed, and why the model has changed the way it has, etcetera.
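To illustrate the pinning idea, depending on a specific feature version so another team's change can't break a model underneath you, here is a tiny self-contained sketch with an in-memory registry; the data structures are hypothetical and only meant to show why pinned and floating consumers behave differently.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class FeatureVersion:
    name: str
    version: int
    definition: str

# Illustrative in-memory registry; a real feature store tracks this durably.
registry = {
    ("favorite_genre", 1): FeatureVersion("favorite_genre", 1, "mode(genre) over all history"),
    ("favorite_genre", 2): FeatureVersion("favorite_genre", 2, "mode(genre) over last 30 days"),
}

def resolve(name: str, version: Optional[int] = None) -> FeatureVersion:
    """A model that pins a version keeps getting the same definition even after
    the owning team publishes a new one; an unpinned consumer floats to latest."""
    if version is not None:
        return registry[(name, version)]
    latest = max(v for (n, v) in registry if n == name)
    return registry[(name, latest)]

print(resolve("favorite_genre", version=1))  # pinned: unaffected by version 2
print(resolve("favorite_genre"))             # floating: picks up version 2
```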
[00:29:56] Unknown:
And as far as the integration process for working with data in StreamSQL, you mentioned that the ingest pipeline is built on top of Pulsar as far as being able to get data into a given source. And what is the interface for being able to merge data across different topics as it's being ingested by Pulsar
[00:30:18] Unknown:
and then processed by Flink? Yeah. So you can literally just write SQL as you would expect to. You can set dependencies on two streams and just join them using whatever join you want, and it all runs in Flink. We clean it up, we make it work the way you expect it to, and then we generate your features from that. The other cool part is the ability to join across types. For example, I might have a file of items, every item in my e-commerce store, and a stream of what users are doing, like they bought this item or whatever, and I can actually join that file with the stream, which is a really nice thing to have. It lets data scientists not have to worry about the underlying capabilities of the tools and where these things live and how to join them together. You just write SQL as you would expect to, and we can handle the majority of cases; the average use case, we can probably handle it. If you're doing something really, really specific, then maybe it makes sense to go down to the Flink level. But on average, most data scientists are just trying to join different parts of different streams, or just add certain extra data to a stream.
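As a rough sketch of the kind of query being described, here is a batch stand-in using sqlite3 that joins a purchase-event stream with a static items table; the table names, columns, and data are made up, and a real feature store would of course run this sort of SQL continuously over streaming and file sources rather than over an in-memory database.

```python
import sqlite3

# A batch stand-in purely to illustrate the shape of the SQL described above:
# enriching a stream of user purchase events with a static items file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (item_id TEXT, category TEXT, price REAL);
    CREATE TABLE purchases (user_id TEXT, item_id TEXT, purchased_at TEXT);
    INSERT INTO items VALUES ('i1', 'books', 12.0), ('i2', 'music', 9.5);
    INSERT INTO purchases VALUES ('u1', 'i1', '2020-04-01'), ('u1', 'i2', '2020-04-03');
""")

enriched = conn.execute("""
    SELECT p.user_id, p.item_id, p.purchased_at, i.category, i.price
    FROM purchases AS p
    JOIN items AS i ON p.item_id = i.item_id
""").fetchall()

print(enriched)  # each purchase event enriched with item metadata
```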
[00:31:25] Unknown:
And then as far as the materialization of the features, how are those stored, and how do you handle updates to them and keep track of the different versions in the materialized locations?
[00:31:41] Unknown:
So there's actually a lot that goes into that. You can materialize two things, really. One is a stream, in which case we take in your sources and generate a new stream in Pulsar from it. We give it a name and a version, and it's the same with files or tables: we just generate a table and keep it up to date in that way. So each materialization actually exists to us as if it were a native stream or a native table. The difference is that, from your point of view, you can't directly change it; you can only change its sources, and the change will feed up through it if you change a materialization and create a new materialization, etcetera. And if you have a materialization that depends on other materializations, it's kind of like a game where we just start at the beginning, almost like the Airflow DAG model. We start at the beginning and update everything that needs to get updated; once a stream is set up, we can generate all the streams that depend on it. So there's a process there, but the good thing is that it's really simple, so it doesn't really break. Often, if you try to be smart about it, there are just so many gotchas involved, so we keep it as simple as possible. Every time we create a materialization, we just build it from scratch from the sources, and if something is dependent on it, we eventually build that from scratch too. It's eventually consistent, so we'll let you know when the new version is ready, but until then the value will remain at the old version. That's how we do it today; it's a process that we've built on top of Airflow to make sure all of that can happen. In terms of the selection process for determining which components to include in the overall infrastructure,
[00:33:16] Unknown:
what was your guiding principle for determining build versus buy? And what was the necessary set of capabilities for incorporating into your infrastructure?
[00:33:26] Unknown:
Yeah. I mean, we try to build as little as possible for that layer on the infrastructure side; our value prop isn't there. We often have to hit certain levels of latency and certain levels of capability, but most tools can handle that, like Kafka and Pulsar. You can build a system on top of either; it's just harder, in my experience, to build it with Kafka than it is to build it with Pulsar, for example. So our guiding principle is simplicity. I know when you look at the infrastructure it seems silly because of how many components are put together, but really we do want it to be as simple as possible while maintaining its reliability. And in fact, remember, the feature store itself we built internally not because we wanted to but because we had to, originally, at the old startup where we built it. So I've always been a proponent of: figure out what you do, build that, and use whatever you can off the shelf to get your requirements met. It's like a TDD kind of thinking: hey, this is what my requirements are, how do I get there as fast and easily as possible? And then the other piece is simplicity, because the simpler something is, well, it's like KISS, just keep it simple. So for all of our infrastructure choices, when we decide to switch things out, we just look at what life would be like if we had this other piece of infrastructure and whether it would be simpler.
Not whether it would be faster, not whether it would be able to handle more, unless we hit a point where we need that; usually we just aim toward simplicity. Almost any tool, especially the Apache tools, takes a lot to get to a point where it's just unable to handle what you're throwing at it, especially if you take the time to configure it. And if something is so complicated to configure, that goes back into the simplicity problem again, and then maybe it makes sense to switch it out.
[00:35:12] Unknown:
And in terms of the overall landscape of feature stores, what have you used as reference material for determining how to go about implementing it? And what does the overall landscape look like as far as the availability of feature stores for somebody to be able to pick up and use, and even just prior art that is not necessarily open source but at least has some sort of white paper or reference architecture to look at? Yeah.
[00:35:40] Unknown:
So when we first looked at the problem, we actually landed on a talk that someone from Lyft gave; I think the talk is called Bootstrapping Flink. That gave us an idea of what the problem was and how other people were solving it. It also kind of validated in our heads that there wasn't something off the shelf that we could use that did exactly what we needed, and that we would have to build it if we wanted it. So, yeah, Lyft has Dryft, which is what the Bootstrapping Flink talk is about. Airbnb has something called Zipline, and actually a lot of our decisions were influenced by how Airbnb did their feature store. One thing that makes their feature store really unique, and something that we also do, is that it can generate training datasets implicitly: you just give it a set of labels and it will generate the training set for you. Other feature stores don't usually do that; you can give them a timestamp and they'll tell you the features at that timestamp, but it's your job to generate the training set. So that was another piece that made Airbnb's Zipline really interesting.
To my knowledge, one of the first people to talk about a feature store was Uber. They built something internally called Michelangelo that they've spoken about as well, and I think that's one of the earliest cases of something someone would actually define as a feature store being spoken about publicly. So those exist in the proprietary domain; none of them are open sourced, and none of them are even publicly available. You can't really use them; you can just look at their talks and how they talk about them. In terms of other stuff, Go-Jek has open sourced something called Feast, which you can check out. They handle a lot of the kind of middle layer, like defining features.
You can't currently use it to generate or materialize views, stuff of that sort, discovery, etcetera. So there's that in open source, and there's a company called Hopsworks that has built a feature store and open sourced parts of it, which you can check out. So it's definitely becoming a really hot space. There are a lot of startups raising money now in this space as people are starting to realize that this should be a core piece of machine learning infrastructure.
[00:37:47] Unknown:
And what have you found to be the most challenging or complex aspects of working on or with a feature store as you build out the capabilities of StreamSQL and use it for your own work, for personal use and at Triton?
[00:38:00] Unknown:
Yeah. I think the hardest thing is user experience. I would argue most big data tools have a problem where they have to balance the ability to let you do everything you need to do while also being simple to use if you just want to use them in the most basic ways. So that's always a constant tension: opening up more stuff, making it more tunable, but then making it much harder to use. So a lot of it is getting that developer experience down, because again the goal is to let people iterate faster on machine learning; that's kind of our guiding North Star.
So everything we do is thought about from that angle. That means that sometimes there might not be a way to do a feature if you have a very, very specific, maybe super hyper low latency feature serving requirement; maybe you don't use StreamSQL for it, because we're optimizing for the average use case, which is what the majority of people have, which is just: hey, I have this dataset, I need to be able to generate this feature as fast as Flink can do it by default. That's kind of how we think of it. So the most challenging and unexpected parts have been, again, not so much unexpected, but every time you start designing an API or something like that, you feel like, oh, I think I can do this, I think I have a handle on it, but you always find that there are all these other requirements, all this stuff that gets added on.
And, you know, designing APIs, and just basic things like how to name things, et cetera, are really, really hard. It's one of these unsolved problems where every time you think you've got it, but every time,
[00:39:41] Unknown:
you know, there's always so much to learn in that space and so much iteration to do. And as far as your own experiences, what have been some of the most interesting or challenging or unexpected lessons that you've learned in the process of building StreamSQL?
[00:39:55] Unknown:
I think it just comes down to this: a lot of people think that tools, especially new tools, are pushed by hype, and that if you create enough hype about something, people will just use it. People point fingers at all these technologies that quote unquote only exist because of hype, but I don't actually buy that. I think you actually really need to solve a problem for someone. Someone needs to be able to use your tool and feel like, cool, I love this thing, I would never not use it, I would always use this thing. And I think it's about getting back to the fundamentals of what are you trying to do, are you doing it, and how can you do it best?
You always feel like there are these ways you can shortcut that, but really, I truly believe that over time the best product will win eventually. Certain paradigms will eventually float up; eventually, someone will get it right and it will just work. So I'm kind of optimistic that, over time, the best tools will end up being the tools that most people use, and that we're moving forward constantly, not just moving around quickly.
[00:41:10] Unknown:
for people who are working on providing data to machine learning teams, or working as a machine learning engineer or data scientist, what are the cases where either using a feature store in general, or StreamSQL in particular, is the wrong choice?
[00:41:26] Unknown:
Yeah. So one piece, and this is currently true of StreamSQL and I think of almost every feature store right now, is that they don't really handle image, video, or audio. Images are obviously a very core one, because a lot of machine learning is image processing, so feature stores really aren't in that space. But also with certain types of problems, like image processing, the model itself becomes really, really important, sometimes much more important than the features. So it depends on the problem space. If you feel that the model is actually the most important piece, the one that's going to drive the most performance gains, then a feature store probably isn't going to do that much for you; your features are super simple and all you're doing is constantly iterating on the model. The other piece is, for example, there are certain models that have to be so low latency that they will actually put parts of the model, or the whole model, in the browser. And there are different deployment tools where you're actually splitting a model across many different places, beyond just different servers: part of this model runs in the browser, part of this model runs on the server, etcetera. When you have stuff like that, where you're hyper optimizing for latency or something else, especially latency for feature serving, then a feature store is probably not going to be able to hit the latencies you need; you'd be jumping through hoops to get there. It will be as fast as a lookup in Cassandra and as fast as Flink can process, but if you're at a point where it needs to be perfect or it needs to be blazing fast, then, yeah, it's not the right tool. You should continue building custom; I don't think there's actually any off the shelf tool you could use in that situation.
[00:43:14] Unknown:
And as you continue to use and improve the StreamSQL platform, what do you have planned for the future of the product?
[00:43:22] Unknown:
Yeah. I mean, the core of StreamSQL is allowing feature engineering and feature generation to be a simple and unified process. So, like I said, we're constantly pushing the envelope on how complex the features you generate can be, what kinds of materializations you can make, what kinds of data sources we can handle. Everything we're doing is about expanding the feature set, making it easier to deploy, making it easier to use. But the guiding star is making it such that teams
[00:44:00] Unknown:
as a whole can all work together, working and building on machine learning, specifically the feature sets. Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Yeah, I think
[00:44:22] Unknown:
unifying streaming and batch. I think a lot of processing systems are trying to get there; Flink has a batch API, Spark has streaming, but we're just not there yet. There's so much work to do in that space to really have a unified processing engine, and I think that is a really big problem to solve. Doing it well, and doing it in a way that solves all the needs and is still usable, is where I think Flink and Spark are both kind of getting there. But
[00:44:56] Unknown:
definitely, I wouldn't say that it's as easy as just using one or the other. Alright. Well, thank you very much for taking the time today to join me and discuss the work that you're doing on StreamSQL. It's definitely a very interesting problem space, and as you said, it is becoming increasingly necessary as we move more and more of our application logic to machine learning. So thank you for all of the effort you put in on that front, and I hope you enjoy the rest of your day. Yeah. Thank you for having me. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Call for Contributions
Interview with Simba Khadder Begins
Simba's Background and Journey into Machine Learning
Overview of StreamSQL and its Mission
Understanding Machine Learning Features
Implementing StreamSQL's Feature Store
Handling Feature Drift and Versioning
Infrastructure Choices and Evolution
Challenges and Lessons Learned
Future Plans and Final Thoughts