Summary
Putting machine learning models into production and keeping them there requires investing in well-managed systems that cover the full lifecycle of data cleaning, training, deployment, and monitoring, along with a repeatable and evolvable set of processes to keep it all functional. The term MLOps has been coined to encapsulate these principles, and the broader data community is working to establish a set of best practices and useful guidelines to streamline adoption. In this episode Demetrios Brinkmann and David Aponte share their perspectives on this rapidly changing space and what they have learned from their work building the MLOps community through blog posts, podcasts, and discussion forums.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Your host is Tobias Macey and today I’m interviewing Demetrios Brinkmann and David Aponte about what you need to know about MLOps as a data engineer
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what MLOps is?
- How does it relate to DataOps? DevOps? (is it just another buzzword?)
- What is your interest and involvement in the space of MLOps?
- What are the open and active questions in the MLOps community?
- Who is responsible for MLOps in an organization?
- What is the role of the data engineer in that process?
- What are the core capabilities that are necessary to support an "MLOps" workflow?
- How do the current platform technologies support the adoption of MLOps workflows?
- What are the areas that are currently underdeveloped/underserved?
- Can you describe the technical and organizational design/architecture decisions that need to be made when endeavoring to adopt MLOps practices?
- What are some of the common requirements for supporting ML workflows?
- What are some of the ways that requirements become bespoke to a given organization or project?
- What are the opportunities for standardization or consolidation in the tooling for MLOps?
- What are the pieces that are always going to require custom engineering?
- What are the most interesting, innovative, or unexpected approaches to MLOps workflows/platforms that you have seen?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on supporting the MLOps community?
- What are your predictions for the future of MLOps?
- What are you keeping a close eye on?
Contact Info
- Demetrios
- David
- @aponteanalytics on Twitter
- aponte411 on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- MLOps Community
- Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are by Seth Stephens-Davidowitz (affiliate link)
- MLOps
- DataOps
- DevOps
- The Sequence Newsletter
- Neptune.ai
- Algorithmia
- Kubeflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer friendly data catalog for the modern data stack. Open source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy to consume SaaS product, which has been adopted by several companies. Sign up for the SaaS today at dataengineeringpodcast.com/acryl. That's A-C-R-Y-L. Your host is Tobias Macey. And today, I'm interviewing Demetrios Brinkmann and David Aponte about what you need to know about MLOps as a data engineer. So, Demetrios, can you start by introducing yourself?
[00:01:37] Unknown:
Yeah. I haven't really figured out a good way to talk myself up in these introductions, but I fell into the MLOps world about 2 years ago when a company I was working for went out of business, and I started interviewing different people in MLOps. And so now I run the MLOps community. We've grown to over 9,000 people in Slack, which has been just incredible to see. That's me in a nutshell.
[00:02:08] Unknown:
And, David, how about yourself?
[00:02:10] Unknown:
Awesome. So my name is David Aponte. I am a software engineer at Microsoft and a board member in the MLOps community. I got into the space hands on, working as a data scientist, managing my own models, getting them to production, and then later on working on the machine learning infrastructure side, where I worked very closely with data scientists, helping get their projects to production. I got really interested in the operations around all of that and got in touch with Demetrios. This was around the pandemic: no meetups, you know, everything was online. I found an opportunity to work with him and help out, and a lot has happened since. But right now, I am working at Microsoft doing mostly MLOps work in my day to day.
[00:02:51] Unknown:
And going back to you, Demetrios, you mentioned that you kind of fell into the MLOps community a couple of years ago. So I'm wondering if you can just share a bit more about how you got involved there and what it is about this space that's keeping you interested and motivated.
[00:03:05] Unknown:
I come from a sales background, and I was working at a company that was doing MLOps. And I was on the sales team. And so I was the guy that nobody likes who is pestering you on LinkedIn and probably sending you cold emails or trying to cold call you. So forgive me; I've repented for my sins. Basically, I was working at this company, and we were doing data lineage, really trying to focus on the reproducibility of certain aspects of the data science and machine learning workflow. And when the pandemic hit, right before it hit, the sales calls just dried up completely. And our CEO, he has open source in his blood, and he said, why don't we try this community thing? Maybe that will allow us to talk to people that we wouldn't necessarily be able to get on the phone, because we're a small startup and we don't have the connections.
And so we started it. The company went right out of business about 2 weeks into starting these whole meetups. And I had this moment of, like, should I go and try and look for another job? Should I go and try and do other things? Or do I like this? Do I wanna keep going and see where this can take me? And so I chose the latter, and it's just been a roller coaster ever since. I mean, I've probably learned so much coming from no background to being able to talk with people that I have no business talking to every week, and sometimes 2 or 3 times a week, and then getting to also rub shoulders with people like David and have him teach me along the way too.
[00:04:45] Unknown:
David, do you remember how you first got involved in the area of data? I was actually a teacher before I was in tech, teaching science and math and a little bit of programming. And when I was teaching myself how to program, I was doing some soul searching: what do I wanna do for the future? What skills are in demand? I had studied molecular biology in my undergrad, so I had a craving to work on hard problems. And not to say that teaching is not hard. It is very challenging. But I wanted to be an individual contributor. So I did some reading and searching, and I think 1 of the books that sticks out to me was a book by a data scientist from Google, I think. It's called Everybody Lies or something like that; I forget the exact name of the book. But I just found it so interesting how people were using data and computers to solve problems, to learn something about their customers, to make a product better. And I just, you know, dove into that. And then through that, I learned about machine learning and was just like, okay, you're using all this data to do something cool with it.
And that just led me down the rabbit hole with all the things that are related to that: all the engineering, right, all the math, even some of the science. And that's how I got into the data space, by pure passion, looking forward to what I thought was gonna be really important in the future.
[00:05:55] Unknown:
And so that brings us to the conversation for today, which is MLOps, why data engineers should care about it, and what their role is there. And I'm wondering if we can start by just establishing a definition of what MLOps actually is, and maybe how it relates to some of the other nebulous ideas such as DataOps and DevOps?
[00:06:16] Unknown:
It's a great question. Probably the most common question in this space. Right? And I think you're gonna get different answers to this question depending on who you ask. But my intuition on this is it is where machine learning system development meets machine learning system deployment. It's about systems. You're not working in isolation. It's not just a model. It's a model with data and code. And the interaction between all these different components is what I consider the MLOps space. So it does, in my opinion, involve data operations. It does involve dev operations. Right? But there's this added element of complexity now with the data science being mixed in. And the reason why I think that makes it a little bit more challenging is that it's, a lot of the time, fundamentally research.
We have some proof of concept that we're trying to see if it makes things better. We don't know if it's gonna be better than even a random baseline. That experimental nature makes it very iterative, makes it very challenging. And so take a discipline that we have out there, say, for example, DevOps: how does that help us with the development space? It helps a ton. Right? Because software engineers have been developing things and then shipping them out. Right? But I guess it's hard when you have something that maybe needs lots of manual intervention, lots of validation, lots of different personas being involved as well. It's not just engineers. It's also product managers and domain experts that are involved. And I think all that mixed together, again, where ML system development meets the deployment side, is what makes it challenging. It's like all of these things mixed in. So it's not that it's so different than these other disciplines.
It definitely is built off of them, but I think it's coming into its own as something that stands on its own because of these kind of unique challenges, right, the data, the code, and the model.
[00:07:58] Unknown:
I think 1 of the guests we had at our meetup, Andy McMahon, put it really eloquently when he said, productionizing machine learning is 1 thing, but then once you productionize it, whatever, n plus 1 times, now you're going into the MLOps sphere. And that kinda rang true to me because it's like, yeah, you can get it out once. But once you start making that a process and you start really going through it, that's where you're starting to get into MLOps. And David also said this another time to me: MLOps encompasses DataOps, but DataOps doesn't necessarily encompass MLOps, because you need that extra piece of ML in there.
[00:08:43] Unknown:
The actual scope of MLOps, as with DevOps and DataOps, is cross-cutting and all-encompassing at the organizational level. And given the fact that there are multiple personas involved, I'm curious: if we scope it down to the data engineering position, what do you see as the set of responsibilities for that role, and how does the adoption of MLOps change the actions that you take as a data engineer? And how do the additional considerations, the fact that this data is being used for machine learning workflows and that it will, you know, come full circle and feed back into your source systems, factor into the ways that you think about the work that you're doing as a data engineer?
[00:09:28] Unknown:
It's a good question, because I think it's also changing. Like, it depends on what type of data engineers you're working with. Right? Like, at Microsoft, a lot of the tech stack that they're used to there is SQL, you know, C#, things of that nature. That's 1 thing I would note: sometimes the personas in an MLOps team are learning specific tools that maybe general data engineers may not be familiar with. In my opinion, what a data engineer is responsible for is getting the data to the model, but not only getting the data there; it's also what happens after that. For example, you could store the predictions somewhere. We could set up monitoring on the predictions. And all of that, even though it's outputs from some model, it's really just data that needs to be extracted from some place, maybe transformed and loaded into another environment. So a lot of the same data engineering principles that, you know, a data engineer working on a data warehouse uses are very similar. But now let's say we need to think about working with a data scientist that has a more involved feature engineering pipeline, maybe with a more involved selection process. And so now you have to work closely with the data scientist to help them accomplish their goal. So it's more specific than just, you know, I'm doing some transformations and loading them elsewhere; maybe I need to do that in a very iterative way. And so I think it's a lot of the same stuff, but now you have the domain expertise of machine learning being involved.
And I guess this is super important in the MLOps space, because the model is a function of the data. Right? Everything is dependent upon the data. If the data is bad, the model is bad. So it's not just about, like, you know, the infrastructure. It's all about the quality of it. So there's a need for data engineers to understand the data as well, to help the data scientists find good signal, and it varies, like, in my experience. Some data engineers are more on the pure software engineering side, so their experience with machine learning is they kinda treat it as just some general artifact that needs to be shipped, versus maybe another data engineer that's more on the data science side and is more used to doing the feature transformations and more involved engineering, with respect to not just moving it from 1 place to the other. Again, that's not all that data engineers do, but it's a big part of it. Right? Getting the data to the model in a reliable way, and all the pipelines that are around that, which a data scientist may not be experienced managing.
Right? They maybe are used to the algorithm and the kind of experimental work involved, not necessarily the whole system. And so a data engineer works closely with the data scientist and usually also with ML engineers, if they're involved. Sometimes, actually, they have the same responsibilities. But maybe a data engineer is responsible for all the data that comes to the model, and then the ML engineer is responsible for productionizing that model in some target environment; let's say, like, a web service or maybe even a batch scoring pipeline. It really varies. But, again, just to kinda bring it back to what I was saying earlier, it depends upon the skill set that this data engineer is used to, and exactly what their role is in a team. At Microsoft, again, it feels very simple, getting the data from 1 place to another, but it's super fundamental, because without that, you just won't even have a good model.
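As a rough sketch of that pattern, here is what storing predictions and putting a simple monitor on them might look like; the storage, schema, and threshold are hypothetical stand-ins for whatever warehouse and alerting a team actually uses:

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical sink; in practice this would be a warehouse table or object store.
conn = sqlite3.connect("predictions.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS predictions (
           model_version TEXT, scored_at TEXT, input_id TEXT, score REAL)"""
)

def log_predictions(model_version, scored):
    """Persist model outputs so they can be monitored like any other data."""
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO predictions VALUES (?, ?, ?, ?)",
        [(model_version, now, input_id, score) for input_id, score in scored],
    )
    conn.commit()

def mean_score_moved(expected_mean, tolerance=0.1):
    """A deliberately naive monitor: flag if the average score drifts too far."""
    (mean,) = conn.execute("SELECT AVG(score) FROM predictions").fetchone()
    return mean is not None and abs(mean - expected_mean) > tolerance

log_predictions("v1", [("user-1", 0.82), ("user-2", 0.17)])
if mean_score_moved(expected_mean=0.5):
    print("score distribution moved; someone should take a look")
```

The point is just that once predictions land somewhere queryable, the usual extract, transform, and load habits apply to them too.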
[00:12:30] Unknown:
I love how you said that too. Like, I've heard people say that MLOps, in its current iteration, is basically just glorified ETL in a way. And so being a data engineer, you're kinda set up to be able to crush it in MLOps. It's just adding that little extra sprinkling of flavor on it. But the other thing that I was gonna mention is that MLOps will give you a common framework and a common, like, language that everyone can coalesce around. You can find this area where you're saying, okay, you know what? If I say we need a feature store, a data engineer can understand what that is just as much as a machine learning engineer or a data scientist.
Hopefully, everybody understands what that is. Although, like, maybe feature store isn't the best example, because that is still up for contention. But if you start to say things like, alright, we need a model store, we need reproducibility here, hopefully a data scientist will understand that just as much as a data engineer. It's not exactly the question you had, Tobias, but that's another thing that I wanted to mention there.
[00:13:45] Unknown:
Yeah. It's definitely an interesting and complex space, and there are definitely a number of areas I wanna dig into. But before we get too far down the road, I'm also interested in getting your perspectives on what you see as being the kind of open and active questions in the MLOps community, and what are the pieces that people are still confused about? What do people say are, you know, the 5 or 10 or 10 dozen things that you need to be able to actually say you're doing MLOps? Because, you know, taking DevOps as the example, that took about 10 years before there was any real sort of general industry adoption of DevOps as a set of practices and principles, and what it even was loosely supposed to mean. So I'm curious, like, what is the general industry adoption of MLOps? What are the open questions? What are the pieces that people are struggling with?
[00:14:37] Unknown:
Yeah. 1 thing that comes to mind is the organizational problem in MLOps. Like, we were kinda just discussing what personas should be involved, what type of team players you wanna have, and it varies. And I think related to that is what tools we need. If we're using this, does this count as MLOps? And the other question is, do we get 1 tool that does it all, or 1 tool for each? And these are very practical sorts of concerns. Right? Not so much about the theory of it or what it means. It's more, like, how can it help me do what I need to do and get my models to production reliably, for example.
But it's these organizational, kind of cultural, questions that I see come up the most. That question about what a data engineer's role is in an MLOps team is still open; I think, again, there are different answers to that question. Where does the responsibility of the data scientist end, and the responsibility of the ML engineer begin? These are tough questions. Again, it's such a cheap answer, but the answer is it depends. It depends on the problems that you're working on, the team structure that you have, the resources available.
A very common scenario is that people are just starting; like, maybe a small startup, they don't have a ton of resources. So they have 1 guy that does it all, you know, 1 girl that does it all. So it's like, that's MLOps. You know, they're productionizing machine learning. Maybe it's not as organized. You don't have a bunch of different personas. Maybe you have 1 guy doing it all. But that still counts is my point. And so there's a spectrum of what MLOps looks like, and people have described different maturity models. Usually it means more automation means more mature. I'm not entirely convinced that that's always the best case; I'm actually kinda wrestling with that at work, automating stuff. But it's, again, these sorts of putting-it-into-practice questions. From what I sense, very practical. Like, what do I need to do? Who do I need to hire? How do I structure my team? What are the processes?
[00:16:31] Unknown:
And 1 of our guests on the podcast was actually talking about this, and it's a sentiment that you'll hear echoed all throughout the podcast: I pray for those days when it's just a tech problem. Those are the best kinds of problems that we could have, because when it's a process problem or a people problem, that's much more difficult to figure out a well-thought-out solution for. And I also was gonna mention, like, I just quickly scanned through Slack and picked out a few of, like, the very common questions that come up. You have 1 that's like, do I bake weights into Docker containers?
The classic 1 is, can I use Jupyter Notebooks in production? And there are very, very opinionated people on each side of the field on this 1.
[00:17:21] Unknown:
That's like, if you want to just start a Twitter war, go and say 1 side of that, and you'll see what happens there. Or ask people, be like, yeah, your data scientists need to know Kubernetes. That's another 1 that gets people.
[00:17:32] Unknown:
There's lots of triggering that goes on there. And there are 2 questions that I haven't really gotten a good answer to. 1 is, how do you manage dependencies between all of these different tools that you're using, or frameworks, or whatever? Because that's really complicated. I haven't heard someone say, oh, we do it like this, and it's, like, cake because we set this up. And the other 1 is, how do you create an effective knowledge base for machine learning? So you make sure that there's not that, like, siloed information, where it's 1 person who has the know-how on how to bring the model into production. And if that person doesn't show up for work, or they just move on to the next job, we have to be Sherlock Holmes and figure out, like, what is going on. So those 2 questions, I think, are probably the biggest ones that I haven't seen answered.
[00:18:22] Unknown:
Yeah. I just wanna piggyback on what you said. Yeah. Like, minimizing the bus factor. Right? We wanna make sure that if this person got hit by a bus, we could still run the show and nothing would halt. But that's a hard thing to do, because not everyone has the same skill sets. Right? Like, you know, for example, think about data engineers and the tools that they're used to using. It's a bit different than a data scientist's. It's gonna be a mixed bag. And how to get them all to play nice and be responsible for the same set of systems and components: you run into the same questions over and over again. What's the best way to do this is what I'm seeing. If there's precedent, you could look at what Google is doing, what Microsoft and these big companies are doing, but even then, that doesn't always fit, because the scenario is different. The domain is sometimes really different, and the resources available. Like, this is not always the same as what these big companies are doing.
[00:19:13] Unknown:
The last 1 is, who gets the call when a model goes AWOL at 2 AM? Right? Like, who has to go and sort that out? The CEO. That's a huge question.
[00:19:20] Unknown:
The CEO. Yeah. Related to that is something that we're dealing with: coming up with an on-call rotation where we have data scientists on that on-call rotation. And we gotta think about, okay, how do we enable them and equip them to actually troubleshoot these live site issues effectively without having to learn all these different things that are maybe just not relevant to their future career? Right? Like, we're talking about Kubernetes. Right? Like, yes, some data scientists like learning all these tools, but is it the best use of their time to be doing all of that? And so now we're dealing with the question, okay, but we have a model that's failing.
It may have some statistical, you know, issues that a software engineer may not be in the best position to debug. So we actually need domain experts, the data scientists, in this loop. But how do we do that with these different kinds of skill sets? So these are the sorts of interesting questions that come up.
[00:20:15] Unknown:
Yeah. Definitely seems like reflections of all of the previous iterations of DevOps and DataOps: how do we manage the context propagation across these different areas of expertise while maintaining effective communication, and understanding what the interfaces and handoffs are at the different stages of the life cycle? And some of the things you're talking about, as far as what folks are faced with as challenges in the MLOps space: how do we understand on call, how do we understand observability in the context of ML, how do we work through the deployment factor, what are the pieces that make sense to automate? Those are all things that have been struggled with since we started working on software. And it seems like the answers are probably similar, but there is the, you know, extra degree of uncertainty that is inherent in machine learning and data science, and how do we factor that into the tools that we already have?
Given the fact that there are so many overlaps, I'm curious what you're seeing as the kind of primary personas who are engaging in the communities that you're a part of that are focused on MLOps specifically and, you know, maybe some kind of general categorization of the backgrounds that those people have?
[00:21:33] Unknown:
A majority are people coming from the DevOps sphere, or SREs, trying to figure out how much machine learning they need to know to be able to play in this field. And then you get data scientists that are coming into the DevOps field trying to figure out how much SRE knowledge they need to know to play in this field. And, personally, I think it's a lot easier coming from DevOps or being an SRE and then picking up a bit of machine learning than it is being a data scientist and then having to learn, like, proper coding and all that good stuff. There are a lot of data engineers that are doing the same thing, and, I would say, we also get the occasional analyst. But, again, that's a lot less. For me, those are the main 2 that are coming into the MLOps community. But I don't know; David, maybe you have other ideas.
[00:22:33] Unknown:
No, I think you said it right. There are lots of people coming from, like, a site reliability engineering background, right, where they're used to maintaining systems. I guess you could think about them as, like, systems engineers. You know, they treat the things being deployed as just, like, some general artifact that needs to be shipped and maintained across multiple environments. So, like, they think about it kind of at that high level. So usually, like Demetrios is saying, they're just asking, how much domain expertise do I need to know to get into this space? And there are lots of people. For example, we've spoken to Todd Underwood at Google, who believes you actually don't need a ton of machine learning experience to be effective; a lot of the time, it's just being good at working with systems. Again, sorry if I'm misquoting you, Todd. But I actually agree with that, because I think the domain is something that you can pick up as you're working on the job, and you can always supplement with lots of different things. But if you're responsible for maintaining something, and that's just you, you have to kind of know all this stuff already. It's just hard. Like, I'm thinking about all the random networking issues that come up, and it's like, if you don't understand networking, it's very challenging to actually use the tool that you're trying to set up so that people can do what they need to do. Again, this is a headache: if you're responsible for deploying Kubeflow, for example, a lot of the issues are not machine learning issues. They're these very common, general infrastructure issues. So that usually is a strength, if you have that experience and you have that background. And I've seen that they tend to do well in these MLOps-related teams. And so that makes me feel like it's actually more on that side, more on the engineering side than anything. But maybe that's just because, again, MLOps is about the operations, which is not the entirety of what, you know, machine learning is about. Right? That's just a part of it.
[00:24:17] Unknown:
Yeah. Another interesting element of machine learning and operationalizing it is the question of what production means for a machine learning model, where for a software application, it's very clear that production is when your end user is interacting with it. But who is the end user for a machine learning model? It's very context dependent. It could be an ML model that's powering a recommendation engine that determines what to put on the web page in front of you. It could be a machine learning model that's responsible for determining when there is a high probability of a system failure in your network infrastructure. It could be a machine learning model that's determining what prediction to make for your CEO's business dashboard, to determine how many widgets to buy for next quarter. So I'm curious how folks are thinking about the categorization of different environments and what production means for machine learning.
[00:25:02] Unknown:
Yeah. I think that's related to even the kind of common question: are Jupyter Notebooks acceptable for a production environment? So does a notebook count as a productionized model? Again, I would argue it depends, but my opinion on a lot of these things is that where the model delivers value is what I see as a production environment. Sometimes that delivering of value could actually be manual. For example, there are some models at Microsoft within our team where the outputs of the model are given to reviewers. So they need to take a look at these outputs and see if they make sense. And if they do, then eventually the model will be integrated in a more automated way. So there, it's already in production if your customers are seeing the outputs of it, in my opinion.
You need to have a process to reliably do that. Maybe you don't need to have all the bells and whistles to have it completely automated, but you should be able to reproduce things fairly easily. That's the bare minimum, I think. And usually, that can mean having it in source control somewhere, so that it's not lost, or, you know, it's not the kind of thing that only runs on your computer. And it's that sort of consideration: it's not just about me and what I'm working on here; it needs to work elsewhere for the customers. I think that's when you're in a production environment. And there, again, it depends. Sometimes it is just as simple as, like, here's a CSV, validate this, and that works if that's what you need to do. If it's more automated, then I think that's maybe when it becomes a little bit harder to do that easily.
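A minimal sketch of that bare-minimum reproducibility, assuming a scikit-learn-style workflow; the parameters and file paths are illustrative, and the point is that the configuration and the artifact travel together somewhere shared:

```python
import json
import pickle
import random

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Pin every source of randomness so the run can be repeated elsewhere.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

params = {"n_estimators": 100, "random_state": SEED}

X, y = make_classification(n_samples=500, random_state=SEED)
model = RandomForestClassifier(**params).fit(X, y)

# Record the exact configuration next to the artifact, and keep both in
# source/artifact control, so the model never lives only on one laptop.
with open("params.json", "w") as f:
    json.dump(params, f)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```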
[00:26:29] Unknown:
Do you want to learn how the Joybird data team reduced their time spent building new integrations and managing data pipelines by 93%? Join a live webinar hosted by RudderStack on April 20th, where Joybird's director of analytics, Brett Chawney, will walk you through how retooling their data stack with RudderStack, Snowflake, and Iterable made this possible. Go to rudderstack.com/joybird to register today. Yeah. The other question that has been very nebulous for a long time is, how do you know when it's actually machine learning? Does it have to be a deep learning, you know, convolutional architecture? Is it just a random forest? Is it an expert system that just has a very detailed decision tree? Like, when does it become machine learning, and when is it just a software system?
[00:27:14] Unknown:
Classic question. Yeah. I think Demetrios answered it, but for me, it's, like, when the output is a function of the data. Which, you could argue, a heuristic is too. Right? Like, if-this-else-that, that does something; well, it is a model of sorts. When it becomes machine learning, maybe it's when you have, I guess, the classic optimization: you're trying to optimize something, or you're trying to learn something from the data and then take whatever you learn and apply it to a new dataset. I think it usually involves these artifacts; I guess you could think about them as the training artifacts. But a lot of what has helped me maintain machine learning in production is actually understanding data in production. So it is data. I don't know. I guess it comes down to the algorithms in some respect.
[00:27:57] Unknown:
Yeah. I was just gonna mention: so many times we've heard this, that if you can avoid using machine learning, you should try and do that first, because you're adding so much extra complexity by bringing on the machine learning. And I think about that quite a bit. And I also think about how there are certain use cases, especially now, certain use cases that have been proven out, where it's like, we should definitely use machine learning on this 1. And we need to do it, because if we're not, we're behind in a way.
[00:28:30] Unknown:
1 more thought while we're on this subject: what makes it machine learning? I was thinking, you know, something as simple as, if I wanna make recommendations, I could just sort them, right, by their scores, and then present that. That would work as, like, maybe a baseline, for example. And that's usually what a lot of companies do. Right? They have some baseline to compare against. Usually, maybe it's random; maybe it's some data transformation. But I think where it crosses into the machine learning space, and maybe it's a cheap answer, is when you don't have to explicitly program what you want it to do in some way. Right? Because it learns what it wants to do from the data. So I think that also makes it a bit different versus, like, the sorting example. Right? There, you have to program it to do that, versus if I'm using a model, who knows what may be prioritizing that list. Right? There's no guarantee that it's the same, and that's where we get the randomness. Right? Which also adds that other layer of complexity, especially in the cases where your machine learning model's labels, let's say if you're doing supervised learning, are not readily available. When I was working at a company called BenevolentAI, we were predicting, for example, whether or not this gene would have a successful assay or would be an assay hit, for example. And that is expensive to validate and test, and the answers don't come back for months, sometimes even longer.
Again, it's not just data engineering there. You have to think about the labels and how that relates to everything. And, again, that uncertainty makes it hard to validate if your model is doing what it should be doing, which is, I think, 1 of the other things that makes machine learning a bit different than just general data transformation.
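To make that sorting-versus-learning contrast concrete, a toy sketch with made-up data: the baseline ranking is explicitly programmed, while the learned ranking's weighting of the signals is fit from the data rather than written by hand:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
popularity = rng.random(200)
recency = rng.random(200)
# Synthetic "clicks" that depend on both signals.
clicked = (0.3 * popularity + 0.7 * recency
           + rng.normal(0, 0.1, 200) > 0.6).astype(int)

# Heuristic baseline: explicitly programmed, just sort by popularity.
baseline_order = np.argsort(-popularity)

# Learned alternative: the model discovers how to weight the signals,
# which is what tips this from plain sorting into machine learning.
X = np.column_stack([popularity, recency])
model = LogisticRegression().fit(X, clicked)
learned_order = np.argsort(-model.predict_proba(X)[:, 1])

print("baseline top 5:", baseline_order[:5])
print("learned top 5: ", learned_order[:5])
```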
[00:30:02] Unknown:
Just to be a little facetious here, machine learning is when you get the answer and don't understand the question.
[00:30:08] Unknown:
Yeah. Exactly. In some ways, right, like a black box.
[00:30:13] Unknown:
Yeah. Oh, that is great. Did you just make that 1 up right now? Yes. Dude, we gotta put that on a shirt. That is incredible.
[00:30:22] Unknown:
Yeah. I'm still surprised at how effective some models are. Like, think about a transformer or something like that. Like, how does it actually do what it does? I think that's where ML is a little bit different from some of these other disciplines: maybe there isn't a coherent theory putting all of these things together. Right? Like, if you were to ask someone like Yann LeCun or someone else to explain what deep learning is, I've heard different answers. I've heard someone describe it as, like, almost like a language. And, you know, someone else will talk about it as just a function. I think that's where we don't all agree yet. And maybe MLOps comes into play because, like Demetrios said, it helps us find some common ground. Even if it's not a coherent theory, maybe it's a tool, maybe it's a process, but it's something that unifies these different parts that are involved.
[00:31:08] Unknown:
Digging now into the sort of technological space that's growing up to be able to support this whole effort of MLOps and people starting to adopt these practices, what are some of the core capabilities that you have seen people coalesce around as being required to be effective at building these MLOps functions? And what are some of the kind of standout technologies that you have seen as people are starting to iterate on this problem?
[00:31:38] Unknown:
So it's interesting you mention that, because I have this, like, half-written blog post that I've been trying to finish, but I just haven't been able to get it across the line, on this idea of, like, what is the modern ML stack? Is there such a thing as that? And, actually, my whole blog post is about how there is no such thing as a common ML stack or a modern ML stack. And I also think that there never will be, because of what we kind of touched on earlier, how you play this game of, which of these is not like the other? Then you say: computer vision with autonomous vehicles, computer vision in healthcare, fraud detection, you know, with tabular data, or just go down the list, robotics.
And you start to realize that each use case is so very specific in its needs and what you are trying to optimize for that it's really hard to say, oh, well, you definitely are gonna want this tool as your base and your foundation. So what I'm trying to figure out is, what are the fundamentals that we can build off of? And I like what my old CEO had, when I was at that company that went out of business. He had this, like, manifesto, and it was like the MLOps manifesto. And his idea was that you need something that's reproducible. You need something that is collaborative.
You need something that's continuous. And then I also would add on top of that, you want something that is ethical. The thing is that all of that is very, like, it's like I'm selling thin air. Right? Like, it's really hard to nail down what exactly that is and whether it's right for my use case. But I just think about, if you're trying to figure out computer vision on the edge for an autonomous vehicle, you need much different SLOs than if you're trying to figure out computer vision for some kind of cancer detection for a doctor.
And because you are really looking at something that is using computer vision underneath it all, but you're optimizing for 2 highly different problem sets, you can't really say, oh, there are these core components that you definitely need. Or at least I haven't been able to figure them out, hence why the blog post is not finished. If anyone has ideas, please reach out to me. I'm gonna steal from, if you guys follow it, I think it's called The Sequence; there's, like, a newsletter.
[00:34:13] Unknown:
There was the CEO of an ML experimentation platform who just had an interview there, and I like what he said about what he considers the key components of a robust ML experimentation tool. But I think it applies to actually all MLOps tools. And he says scalability, or a scalable back end. I think that's really important, because the nature of data science is kind of experimental. Like, a lot of times, you start small and then work your way up, or sometimes you need a lot from the very beginning just to even accomplish something. So you need, and this is, I guess, getting down to it, the compute. Right? You need a lot of memory. You need access to certain accelerators. All of that should be very easy to use, and it should allow you to do your work at, like, infinite scale. The other thing, he says, is flexibility and expressiveness. It should make you more productive, I think. A good tool, in my opinion, embeds best practices into it, so it's hard to do the wrong thing and very easy to do the right thing. And usually, that right thing makes you more productive in your workflow, in what you're trying to do. And then lastly, he says, there usually should be some way to visualize or, like, keep track of what's going on. Maybe this is a dashboard; maybe it's just a simple view of all the data that you have, all the assets, but some way to look and peek into what's going on. And this is where we talk about observability a lot. But this is a key component to, you know, model development, when you're doing training, hyperparameter tuning. You wanna know what's going on. You wanna be able to scale, and you wanna be productive.
Model deployment, the same thing. Right? You need to be able to scale. If more customers start using your web service, you want it to be able to, I don't know, for example, add more nodes or something. You wanna be productive in these different areas, and that's why I feel like a lot of tools are kind of trying to become, you know, specialized in this 1 area. And I don't think that the tools that try to do all of them in 1 work well, because, you know, most data warehouses, right, they don't have all these tools mixed into them. Sometimes, like, you know, BigQuery allows you to do some modeling in it; I think that's totally fine. But a data warehouse that maybe has, like, a data visualization tool embedded in it, like Tableau or something that lets you pull your data in: something is going to be not as good as a result of you trying to do all of these different things. What I see as being a trend is there's best in class coming about, and, usually, it's around the productivity for whatever that workflow is. Right? And what the CEO of Neptune was talking about was experimentation: being able to look at your different experiments, and being able to tie them to the 1 model that was produced. So there's that lineage aspect, which I think is important.
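A hand-rolled sketch of that tracking-plus-lineage idea; dedicated tools like Neptune do far more, and the run layout here is entirely invented:

```python
import json
import time
import uuid
from pathlib import Path

def track_run(params, metrics, model_path):
    """Record params, metrics, and the produced artifact under one run id,
    so an experiment can later be tied back to the model it produced."""
    run_id = uuid.uuid4().hex[:8]
    run_dir = Path("runs") / run_id
    run_dir.mkdir(parents=True)
    (run_dir / "run.json").write_text(json.dumps({
        "run_id": run_id,
        "started_at": time.time(),
        "params": params,
        "metrics": metrics,
        "model_artifact": model_path,  # the lineage link back to the model
    }, indent=2))
    return run_id

run_id = track_run(
    params={"learning_rate": 0.01, "epochs": 5},
    metrics={"val_accuracy": 0.91},
    model_path="models/model.pkl",
)
print("logged run", run_id)
```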
Again, I hate to say this again and again, but a common tool is feature stores. And the reason why I think they're so popular is because, again, everything is dependent upon the data. So you need to get your data to your model in a timely fashion, in a reproducible fashion. The transformation side is also hard: a lot of the times, you need to know Spark, or maybe how to, I don't know, write some stored procedures in SQL. So if there's an easy interface to do that, that's also very helpful. And another thing is model deployment. Think of Algorithmia.
I have a training artifact, and I wanna ship it as a web service really easily, or I wanna regularly run it as a batch job. And so I've seen that those tools that are dedicated to model deployment are very popular. So feature stores and model deployment are, like, the 2 biggest ones that come to mind.
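A minimal sketch of that ship-an-artifact-as-a-web-service pattern, assuming Flask is available and a pickled scikit-learn-style model sits at a hypothetical path:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the training artifact once at startup; the path is hypothetical.
with open("models/model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"features": [[0.1, 0.2, 0.3]]}.
    features = request.get_json()["features"]
    return jsonify({"predictions": model.predict(features).tolist()})

if __name__ == "__main__":
    app.run(port=8080)
```

The same artifact could just as easily be loaded in a scheduled batch job; the deployment tools being described mostly automate this wrapping, scaling, and versioning.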
[00:37:22] Unknown:
Yeah. Then monitoring: there's a huge amount of money going into the monitoring space. And I think it still isn't clear if you need a dedicated machine learning monitoring tool, or if you can use something like your DevOps monitoring tool with a few, like, tweaks on it. But there are a lot of companies in this space right now, the machine learning monitoring space, and they probably each have a pitch for why you need their tool. But I was gonna mention something that you said, David, going back to that first part of the question. 1 of my favorite things to ask guests these days, when it comes to them creating their platforms, is: what metrics are they looking at with their platform? Like, what is it that they're looking at to know that the platform is actually doing its job? Is that time to deployment? Then you shorten that, and that's the metric that they're looking at. Or how do they see that the platform is actually bringing more value, and how can they make the platform better?
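One way that DevOps-monitoring-with-a-few-tweaks view can look in practice is a statistical check layered onto an ordinary alerting pipeline; a sketch using a two-sample Kolmogorov-Smirnov test, with an arbitrary threshold:

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(training_sample, live_sample, p_threshold=0.01):
    """Flag drift when live inputs stop looking like the training data.
    A conventional monitoring stack would then page someone, like any alert."""
    statistic, p_value = ks_2samp(training_sample, live_sample)
    return p_value < p_threshold

rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, 5000)
live = rng.normal(0.5, 1.0, 1000)  # simulated shift in production traffic
print("drift detected:", feature_drifted(train, live))
```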
[00:38:35] Unknown:
That's a great question. That's a tough 1 to answer.
[00:38:39] Unknown:
Yeah. And how do you select that metric? Because you always wanna be careful about what behaviors you're incentivizing
[00:38:46] Unknown:
because they're not always going to be the ones that lead to the outcome you're hoping for. I think, where is this from, where they say once you make something a metric, it's almost not as good? I dealt with this a while ago, but, like, as soon as you start observing something, something happens there. It's not as good as what it used to be. Are you going into string theory? You didn't tell me that. No. No. No. Not exactly. Not exactly. No. Yeah. Yeah.
[00:39:16] Unknown:
And in terms of the practice of actually starting to adopt MLOps as a core consideration for the organization, what are the technical and organizational questions at the design and architecture and process and procedure levels that need to be made as you're starting to go on that journey?
[00:39:38] Unknown:
It seems like everyone is probably going to ask the question, should we use Kubeflow?
[00:39:43] Unknown:
That's the first question that everybody is going to ask. And once they get past that question, you can actually get down into the real questions. The other 1, I dealt with when I first started at Microsoft: I was thinking about an architecture that could be applied to any project. And so I came up with kind of, like, a general template project. And that's actually something that I think we look for: how can we standardize all the different things that are going on? That's 1 of the first questions that we were asking. It's like, okay, we have this team that's used to using this tool, doing it this way, and they've been doing it very manually. This team, maybe it's a little bit more automated. And so the first question is, how do we organize all these different ways of working to meet their needs? And then you gotta think about the technology choices that you're gonna use. It depends on, obviously, what resources are available. Do you have the expertise and the time and, I guess, the okay to build your own stuff? Or, if not, can you afford to use a managed service?
Picking the right 1 is also a good question, because, you know, you're building kind of, like, your whole strategy on some company. And if they change, you have to change. And sometimes that's too big of a concern, and they decide, we're gonna do everything on our own so we can stay nice and decoupled. But even then, you deal with challenges, because if you're depending on open source stuff, open source has its own host of challenges, because now if something goes wrong, you are sometimes on your own. Like, you have to figure out how to make it better. And some companies have the expertise to contribute and make it even better. Think of, like, Uber, who is regularly contributing tools in the open source space that they use and develop in house. And it's like, this didn't work for us, so we decided to build this whole new feature to make it work for us, and now we're giving it back. But a lot of companies can't afford that. They're just trying to get their model in production; again, going back to what that means. I want the value of this to get to my customers. I don't care how it happens. I just want it to happen. Then it's like, okay, well, once we have the tools in place that help us do that, what about the processes? Who do we have on this team? What type of person should lead this team? What experience should they have?
These are the things that come up, and usually it's a mixture of kind of what Demetrios is saying: more on the engineering side, with the domain experts kind of being the customers. Like, I see the data scientists as the people that I'm primarily serving. I'm trying to make their work better, their workflows more efficient, their things more scalable and more reliable. They should be involved, because you're building things in the service of some product or some tool, and they need to be very closely linked, not separated. So you also need that, and then, obviously, the business. The most important thing is, are you solving the business problem? Which is, again, I think, related to MLOps, because you're not just doing engineering for the sake of engineering. Right? You're doing it with a very specific purpose in mind, and that needs to be factored into how you think about solving this problem.
[00:42:35] Unknown:
I was just gonna mention 1 thing to follow up on what you said, David. It is so true that, for machine learning engineers, calling or thinking of yourself as an engineer, only an engineer, is probably a bad decision, because machine learning is so close to the business. It's like you have to understand the business side just as much as you understand the engineering side, in my opinion. And I've heard that echoed many times, and that will allow you to ask questions on these design decisions that you potentially wouldn't ask. It's like, wait, what problem are we trying to solve? Like, what business problem? Because I'm sure that everyone listening has had this experience, or at least knows somebody who has had the experience, where a data scientist or a machine learning scientist goes and hacks on something for however long, potentially up to a few months, maybe just a few weeks, and they come back and say, look at this, what I created. Behold, I've got 99% accuracy on this model.
And you see it, and you realize, wait, I don't care what kind of accuracy you've got. You're missing the mark completely. And so that is something that is very, very common in the machine learning world. I mean, it's common in other areas, but I think that with machine learning, you really have to be conscientious of that.
[00:43:56] Unknown:
Yeah. I see that particularly with data scientists. They should know the business side of things. I think it's super important, and it's like you're not just building some solution in the abstract; you're building a solution to a very specific problem, and you have to understand that problem really well. I will say that engineers usually don't need to know it that much. You need to know, like, the overall high level. At least, that's just me. Like, I like to think about the infrastructure and the tech. But, yes, the domain is absolutely important. Right? Like, what are we doing all of this for? If you can't answer that question, it's gonna be hard to answer some of those other questions.
[00:44:31] Unknown:
And a little bit of a tangent, but 1 thing that I think is interesting, bringing it back to, like, data engineering, and then looking at design decisions as you're trying to create your MLOps practice. Again, I'm just throwing this question out there. I don't really have a good answer to it, but it is something that has come up time and time again, and I really find it interesting. And if someone is tackling it in a really innovative way, I would love to hear about it. But it's: how can you give people vision, or enough insight into what they are working on, so that they know the way the data flows? Right? Someone who is a data engineer tinkers with something upstream, and they have no idea what kind of downstream effects that could have when the data scientist is putting that model together. And so that could be someone way upstream just changing something very minuscule to them. They kind of have an idea of, like, okay, well, 2 steps down the line, these are the changes that that's going to create, but they don't understand the full impact of that. And so, back to what we were saying, that's more of a people problem, right? Or more of a process problem. It's not really a tech problem. You can't really solve that with tech, or at least it hasn't been solved yet. So that's another interesting problem that is happening right now.
[00:45:54] Unknown:
Yeah. And continuing on that thread briefly: as the data engineer, you are the person who is responsible for providing the data that is actually going to feed into these machine learning models. And as such, you are the person who is going to either afford or constrain the types of questions that can be answered. Because if you don't have the data, you can't answer some questions. Or if the data is structured in a certain way, or is lacking some necessary context, you're not going to be able to actually ask and answer and experiment with the questions that you might care about. And that's where some of that feedback cycle comes in, where the machine learning engineer or the data scientist says, this is what I'm trying to do. And then the data engineer says, okay, well, I'm going to have to pull in data from this other system, or I'm going to have to, you know, modify the way I'm structuring the data model in the lake or the warehouse, or I'm going to have to feed in additional metadata to propagate context to these different downstream cases. And as the person who is so early in that chain, you're the person who has the greatest force multiplier, particularly as you get into machine learning, where the value of the data is compounded because of the ways in which it's being aggregated and then fed back out into the external world.
[00:47:11] Unknown:
Something that you made me think of there is, like, when you're working with upstream data, this interesting challenge of getting it reliably. This is something that comes up: a lot of the times, our pipelines will fail because some upstream data, like some pipeline, failed. So, something as simple as, well, maybe not simple, but, you know, having a good line of communication with the different teams that you depend upon. It could be something more formal, like an SLA, a formal kind of agreement of what you expect. But these sorts of things come up and do make a difference in the productivity of a data science team. Because, like you said, if you don't have the data, or you don't have access to that data, there's some data that's restricted, or it's in this very private environment, there are all these things you have to do to get it. And usually, a data engineer will be in the best position to do that, because they have the technical side to help: they're thinking about data governance, they're thinking about all these things, while a data scientist is maybe more focused on their particular problem. I just wanna use this data and train my model on it. And that's where it makes me feel like it's important to have these different personas. Because if a data scientist had to worry about all of the things that a data engineer has to worry about, now getting access, and data governance, and, if I'm storing this in, I don't know, some storage container, after how long do I have to delete it to be compliant? These are all sorts of concerns where I don't know if data scientist energy is best spent there. Although they usually are involved in that, we do need people to focus on these sorts of challenges, and that's where I think the data engineer comes in.
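A hedged sketch of turning that informal upstream agreement into a gate in front of a training job; the expected columns and freshness window are made up:

```python
from datetime import datetime, timedelta

import pandas as pd

EXPECTED_COLUMNS = {"user_id", "event_time", "amount"}
MAX_STALENESS = timedelta(hours=6)

def upstream_data_ok(df):
    """Fail fast if the upstream feed broke its informal contract,
    instead of letting the training job die halfway through."""
    if not EXPECTED_COLUMNS.issubset(df.columns):
        return False
    newest = pd.to_datetime(df["event_time"]).max()
    return datetime.utcnow() - newest < MAX_STALENESS

# Toy batch standing in for the upstream team's delivery.
batch = pd.DataFrame({
    "user_id": [1, 2],
    "event_time": [datetime.utcnow(), datetime.utcnow()],
    "amount": [9.99, 5.00],
})
assert upstream_data_ok(batch), "upstream feed is stale or malformed"
```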
[00:48:38] Unknown:
He was talking about how, when he thinks about the data platform in general there, they have a lot of different use cases, but what he is really discerning about is whether someone is a data producer or a data consumer. Depending on which 1 they are, he makes different decisions about how those 2 can play nicely together.
[00:49:03] Unknown:
Yeah. And that feeds into another thing I was going to talk through, which is the interfaces and contracts that you create to compartmentalize and compose the different concerns and responsibilities throughout the full MLOps life cycle as it feeds back into itself. And 1 of the pieces that we've mentioned briefly a few times here, and that has come up in other conversations, is the idea of the feature store as the interface between the data engineer and the data scientist or the ML engineer, because it's a very clear contract: I'm going to provide all of the inputs to this feature store, at which point it's your responsibility to actually build those features and maintain them. That's the clear hand-off for that interface. Or in the case of a data engineer and an analytics engineer, it's going to be the data warehouse as that interface, where you say, I'm going to land everything into the data warehouse, and then you have the power to do your transformations and analyses from there. And then figuring out what additional interfaces are necessary as you try to complete this full feedback cycle: to have a continuous process of improvement for your machine learning systems, get them into production, manage the monitoring and model drift, understand when to retrain and redeploy, and know what additional data sources are necessary to augment the models or build additional models so that they can collaborate in a more systems-oriented approach.
So just curious to explore that as well.
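To illustrate that hand-off, here is a minimal sketch of what such a contract could look like in code. This is not any real feature store's API; the `FeatureStore` protocol and its method names are hypothetical, just making the 2 responsibilities explicit:

```python
from typing import Protocol
import pandas as pd

class FeatureStore(Protocol):
    """A hypothetical minimal contract between the data engineer
    (who lands raw inputs) and the ML side (who reads features)."""

    def write_inputs(self, table: str, df: pd.DataFrame) -> None:
        """Data engineer's side of the contract: land the raw inputs."""
        ...

    def get_features(self, feature_view: str, entity_ids: list[str]) -> pd.DataFrame:
        """ML side of the contract: read point-in-time-correct features."""
        ...
```

Anything that satisfies both methods honors the interface, which is the point: each persona can change its internals without renegotiating the boundary.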
[00:50:33] Unknown:
It's interesting how naturally that's what happens. We separate things by what you know how to do, and, related to that, by what people are passionate about doing. So a lot of the time, data scientists may not be that interested in the infra side of things. They just wanna focus on the fun stuff, the ML. And that sometimes naturally creates that separation, or, like you said, that contract: okay, once it's here, now it's my responsibility. That's a natural way to organize things. But I have seen in my own personal experience that it doesn't always work. There is, again, that feedback, this back and forth between the different personas, and it's just hard to cleanly separate them. I think that sounds nice, but in practice the data engineer will usually be involved, after those features are defined, in maintaining them and making sure they're correct. For example, say the data scientist has some definition of these features, and there's some tool where you provide the definition and then it computes them and puts them where they need to be. How do you know if it's still doing what it needs to do? Maybe you would create specific alerts to see if the distribution shifts. And there, you can't just say, here, the data is there, create your alert on it. Maybe I need to know, okay, why am I seeing this difference in my performance because of this skew? Maybe you're more familiar with the upstream data sources. So we need both parties involved, is what I usually see. And it is nice to have some separation of concerns, where you can depend on this person for being good at this 1 thing, which is usually how I see it play out. Like, I am the data engineer:
I like getting all the data from all these different sources, doing what I need to do with it, and pushing it elsewhere. How people use it, maybe I'm not so concerned about. And that's good, while all the data scientist wants is the data available so that they can do all their fun stuff in a Jupyter Notebook or whatever it is. And I think that's okay, but there will be a point in time, especially when you start getting more and more models in production and more and more pipelines feeding into those models.
Both parties need to know what's going on. It's not helpful to just say, that's not my responsibility, I don't know, you know that. It's okay for it to be like that for some time, but I've seen that it can become a problem if you don't try to share that knowledge and minimize those bottlenecks, whether they're in knowledge or in expertise. It's natural to say, I like this technology, I'm familiar with this set of tools, and I'm gonna do this. But, again, it feels like that won't always work, and I have seen small instances of that. I would love to hear from Demetrios, though. Maybe he's heard otherwise.
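As a rough sketch of the distribution-shift alert David describes, here is 1 common approach: a two-sample Kolmogorov-Smirnov test comparing a feature's training-time distribution against its live values. The threshold and the data are illustrative only:

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_alert(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Flag a feature whose live distribution has drifted away from
    the training-time reference, using a two-sample KS test."""
    result = ks_2samp(reference, live)
    return result.pvalue < alpha  # small p-value => distributions differ

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5_000)  # feature values at training time
live = rng.normal(0.5, 1.0, 5_000)       # same feature in production, shifted
if drift_alert(reference, live):
    print("Distribution shift detected; loop in the upstream owners.")
```

Note that an alert like this tells you *that* the feature shifted, not *why*, which is exactly where the back-and-forth between personas comes in.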
[00:53:10] Unknown:
No, I was just gonna mention, for the visual people out there, that I look at it like this: it's not a picture with a clear separation of 3 different colors. I'm on board fully with what you're saying, David, where it's not a clean separation. It's more of a gradient going from 1 color to the next, so that there's a little bit more crossover in each 1 of these, and you never get into that position, going back to the people part, of saying, that's not my problem, I'll throw it over the fence. We know that that is not the best way to do it, because of all the years that other disciplines have been doing it that way. So I like to look at it as a gradient, not as very clear dividing lines.
[00:54:11] Unknown:
Are you struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end data observability platform. Trusted by the teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes.
Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more. In terms of your experiences working in the MLOps space and engaging with the community, as people are starting to explore what this means and what they need to do to be effective as they bring machine learning systems into production and maintain them over their life cycles, what are some of the most interesting or innovative or unexpected approaches to the workflows or platform solutions that you've seen?
[00:55:32] Unknown:
It's related, but it's kind of a meta conversation. I'm thinking about the importance of efficiency. You think about climate change and how much electricity some of our accelerators use. So now it's not just, oh yeah, we wanna train our models faster; it's important to put a lot of effort into making everything involved that's very expensive more efficient. Training in particular is very expensive. I've heard estimates of how long it takes to train GPT-3 or some of these models, on the order of months to years, requiring hundreds of GPUs. And that just seems unacceptable if that's what's required. And right now you're already kind of seeing it: machine learning is becoming ubiquitous. It's being embedded in everything, all sorts of products. It's everywhere. I think it was Jeff Dean who talked about how, if we keep up at this rate, there aren't gonna be enough data centers to provide enough compute for all these different applications. So something I'm thinking about is that we need to really focus our efforts on making that process more efficient.
It's an interesting thing, because these already seem like relevant problems in the MLOps space: distributed training, more efficient training, scalable training, scalable inference. Right? But now there's this added element of an existential threat: if we don't do this, it's gonna be a problem. And also, from a commercial perspective, it's just too expensive. Think about a small startup trying to rent some VMs. That's a lot of dollars if you're just carelessly training. So this other aspect of understanding how to use the right resources is, I think, an important aspect of the tooling space. Something that comes to mind is being able to take a fraction of a GPU instead of a whole GPU. Lots of systems don't enable that by default; for example, you can only select either 1 or 2, and that's a waste, because a lot of the time with these training jobs you'll see that GPU utilization is quite low. You're wasting all this compute, spending all this time, spending all this money. So I think more efficient resource usage and allocation, at the hardware level and at the software level, is an interesting area that I see becoming more and more relevant, not just from the tech perspective, but even the existential 1.
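As a small illustration of spotting the low utilization David mentions, here is a sketch that polls NVIDIA's management library for per-GPU usage. It assumes the nvidia-ml-py package and an NVIDIA driver are installed, and the 30% threshold is an arbitrary stand-in for "mostly idle":

```python
# Requires: pip install nvidia-ml-py (and an NVIDIA driver on the host).
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    if util.gpu < 30:  # arbitrary threshold for "mostly idle"
        print(f"GPU {i}: only {util.gpu}% utilized; consider sharing or downsizing")
pynvml.nvmlShutdown()
```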
1 more thing. This is random; people out there are probably listening like, oh, what are you talking about, Dave? But as an example, there's a project, I think it's called F*. It's a language that's being used to rewrite the HTTPS stack from scratch. What's cool about the language is that it's a verification language: as it compiles, it formally verifies that the code is correct, like a proof of correctness. And that makes total sense for something like HTTPS, right? Because the whole Internet depends on it. These protocols are so fundamental to our entire infrastructure that if they have vulnerabilities, we're screwed.
Why do I bring it up? Because I think that somehow being mixed into the ML space would be amazing. We don't really have lots of formal proofs of correctness for machine learning, that I have seen, but it would be cool to introduce more formal proofs to machine learning. And I relate that to the MLOps space because of a simple application like security. Security is also becoming a really interesting problem in the MLOps space. I wanna say federated learning, but that's only part of it. There's another name that I forgot. But Generative adversarial networks?
It's related to it. But, I guess, this whole space of, now we're not just learning on data, we need to do it securely, in a way that we can trust, is, I think, a really interesting problem. You're talking about synthetic data? No, no. It's more that I haven't seen anything like this, but the intersection of formal proofs of correctness, or formal verification, and machine learning. There are already machine learning models adjacent to that, right? A lot of mathematicians are exploring areas where machine learning can be used for math. I'm not talking about that. I'm thinking more about the application side of it, actually being used to support some of the infrastructure. Again, this is probably totally out there, but it was just something interesting, related to the question. Yeah, and totally out of left field.
[00:59:39] Unknown:
And not quite answering your question, Tobias, but something that I feel is important to say: a lot of companies think they need to do machine learning, and so they hire a data scientist, when really they need a data engineer. You've probably seen this more than anybody, but we need some way to get that education out there and make sure people understand that a data engineer is probably the foundation you want to start with, as opposed to a data scientist. Because if you don't have those foundational pipes in place, your data scientist isn't going to be able to do much, or you're going to ask them to do things that a data engineer should be doing. And even though we just got done talking about how data scientists should know a little bit of data engineering, and about the gradient view of looking at it, I'll go back on that and just say: when you're at that foundational stage and you're figuring out how to do things, that's 1 thing I will advocate for. Probably hire a data engineer first, before the data scientist, and really see what kind of data you've got, whether it's valuable, and whether you can do something with it.
[01:00:49] Unknown:
In your own experience of working in the MLOps space, and in your case, Demetrios, in particular, helping to foster this community around it, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[01:01:07] Unknown:
I don't know if people find it as funny as I do, but it's how to deal with vendors. It's really rough, because you've got a community that's very engaged, very active, and it is the target audience. I understand; I used to be in sales. Remember, I understand that you wanna go in there and sell to every single person in the community, but you're not making any friends by just going and spamming the community. So it's really trying to figure out: what is spam? How do we deal with spam? And how do I deal with people who are spamming?
It's just been a long road, and at least once a week I have to deal with somebody coming in and putting something where they shouldn't be putting it, or putting it in every channel. And I guess that's my karma for being that guy for 2 years.
[01:02:03] Unknown:
I'm the type of learner who likes to learn things from first principles and work my way up. And so I find myself back in grad school, learning things, studying computer science. A lot of what I deal with in my work is more cultural, but what I think about, and what I'm interested in, are the technical challenges. And I'm learning things from scratch to see how things could be made better using very fundamental ideas. In my opinion, a lot of the biggest innovations in this tech are small insights, small things that make a big difference. So 1 thing I have seen in practice is that big impact can come from learning the fundamentals and knowing them well. Within the MLOps space, it's questions like: what are the things that allow our work to be reproducible? What allows us to communicate things effectively? These seem like assumed things, or givens, but they're not. Learning how to collaborate well is a challenge, and these are the sorts of things I deal with the most. How do I get buy-in for my idea to move to this new tool? How do I get buy-in for this new process that I think will make things better but will require the data scientists to learn this or do that? How do I convince data scientists to do this? It's these kinds of cultural, human problems that I think about in the day to day, but I look for inspiration in basic things, first principles.
So, for example, in the DevOps space there is a lot of literature on the organizational aspect of a DevOps team, and I think that has been helpful. But, again, it's not enough, because there are these unique challenges that keep coming up, things I didn't anticipate. I'm sure it's like that in the data engineering space too, but I've found it especially in the machine learning space. There's always something that comes up that I didn't think about, some challenge I couldn't anticipate. I think that's what makes it super exciting too, though. But it's those unknown human challenges, the ones you can't really predict and just solve with some algorithm or some tool, that take up most of my time and most of my energy.
[01:04:03] Unknown:
Yeah. Maybe I'll retract my last statement so I'm not just shitting all over vendors who piss me off. Because it is a fine line, right? Especially when we have sponsors. But whatever, that's a whole other topic of discussion. I'll use this 1 instead, which is 1 thing that has surprised me, and it's been just an incredible surprise: seeing how many people are so excited for this space that they want to contribute and get involved. And it's not like there's an open source project behind the MLOps community. There's not something that people can go and hack on, as you would normally see with community projects. But there are so many people who are interested in this space moving forward and becoming more mature.
And also so many people who are struggling with it, and they come to this community and they find their little safe haven, and they can talk to people. I think a lot of people needed that over the last 2 years, especially when you couldn't just turn to the person you were working with and ask them, and maybe you don't even have somebody you're working with. So for me, it's been the power of the community, the power of the people in the community to raise their hand, volunteer, and run all kinds of super cool initiatives that I would never be able to do myself, but they are very proactive about it and they get it going. David's 1 of them. He's doing system design reviews, and we get to use the sponsors' money for good things, where he and another guy from the community read these very technical blogs coming out from the Ubers and Airbnbs of the world, and then they create animation videos that distill and break them down into something you can digest in a much easier way than reading through a very dense blog post. That's an incredible initiative. And there are a million other initiatives that I think are so cool. That's probably been the most unexpected thing. When this community started, I would have never thought that there would be so much energy around it.
[01:06:13] Unknown:
As you continue to be engaged with and work within the overall MLOps community and this particular problem space, what are some of your predictions for the near-to-medium-term future of what people are going to be doing? What are the capabilities that are going to be unlocked? How are things going to coalesce or expand? And what are the pieces that you are personally keeping a close eye on?
[01:06:38] Unknown:
I'll say 1 quick 1, and then I'll let David go. I wonder about machine learning as a service, and whether we're going to start seeing more vertical machine learning, like recommender systems as a service, which I've seen starting to happen. Maybe there will be more of that, as opposed to a company trying to bring it in house. That's something I'm really excited about. I wonder how that space will play out, and whether it will see the light of day and become 1 of the ways that we do it.
[01:07:12] Unknown:
As far as the future is concerned, man, this space is full of surprises. Right? I mean, I'm so young. I'm only 30 years old and have been in the space for 5 years now. It's funny, when people ask me about trends, I always think, man, I haven't been here long enough. But a lot has actually happened in these 5 years, and I'm starting to make my own predictions, to start thinking about things. I hinted at this earlier, but I do think that security, which is already important, is gonna become even more important as machine learning becomes even more ubiquitous. Because it's 1 thing to have it shipped and available, but another to do it in a way where we're not exposing certain pieces of information. I think about how data is becoming a much more valuable asset. You think about GDPR and how that's gonna affect our machine learning workloads. I think that's gonna be a big problem later if you're not thinking about it now.
Baking that into your tool afterwards may be a big challenge if it's not a philosophy or something you thought about from the very beginning, from first principles. So I think that's cool. And then the second thing is efficiency. My prediction is that there will be more innovations in the compute space that make it more efficient, easier, and more scalable to train a model. Right now models are really big, especially deep learning models, and that's a bottleneck for a lot of people, because either you have a high performance computing team available and a supercomputer, or else you're dependent upon the cloud, and that's very expensive. So making that more available, accessible, and efficient is where I see things going.
And definitely the community part; I see that being a big innovator. Someone brought this up, I think, in a previous podcast: communities help these technologies thrive and make them better. So I'm curious to see what that's gonna look like in the MLOps space. There are already lots of open source tools, but maybe it's not gonna be something like that. It could be something different. Maybe a group will come up with a coherent theory of what MLOps is, or something like that, and propagate it to the rest of the world. Who knows? But I see the community being a very important factor, not just in MLOps, but in general.
[01:09:19] Unknown:
Yeah. You sparked a thought in my mind too about the EU regulations that are coming. There's proposed EU regulation on AI, and I wonder how that will affect things, how it will affect the way machine learning is done and how data is collected and kept, like you were talking about with GDPR. And then the other idea I had while you were talking is around standardization. Are we going to standardize things? Or is it just going to be like that meme where, yeah, we wanted a new standard, and it just creates another framework that nobody uses? How is that going to look?
That's fascinating to me. I think it was David Arincic who said, 1 time when he came on the podcast, that we need contracts with a small c. Not contracts that we have to force people to use, that back them into a corner, but something where we can find common ground and work from there. Because right now it's very fragmented, as anybody who knows the space will tell you.
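As 1 possible reading of "contracts with a small c", here is a minimal sketch of a lightweight, code-reviewed schema check agreed between a producer and a consumer, rather than a heavyweight enforced standard. The expected columns and dtypes are hypothetical:

```python
import pandas as pd

# A "contract with a small c": a small, shared, easily-amended
# expectation about the data, checked at the hand-off point.
EXPECTED = {"user_id": "int64", "event_ts": "datetime64[ns]", "amount": "float64"}

def validate_contract(df: pd.DataFrame) -> list[str]:
    """Return a list of contract violations; empty means the hand-off is OK."""
    problems = []
    for col, dtype in EXPECTED.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return problems
```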
[01:10:33] Unknown:
Are there any other aspects of the overall space of MLOps, the role of the data engineer in that ecosystem, or the work that you're each doing that we didn't discuss yet that you'd like to cover before we close out the show? Just a related thought: I'm sure there are lots of up-and-coming data engineers listening, thinking about, what do I focus on? What tools are relevant? If I wanna get into the MLOps space, what do I need to do?
[01:10:55] Unknown:
I would say this: skills are usually what help you get far, in my opinion, if you have a broad skill set. It's nice to be focused in some areas, but something as simple as knowing how to write good documentation goes a really long way. So does being able to work with different stakeholders and winsomely communicate your ideas to them. These are things you usually think about after the fact, but I would argue you should start thinking about them now. It will go a long way in your career, especially in a very interdisciplinary space like MLOps, where you have lots of different personas involved. Not everyone's gonna understand what you know, so it's important to help people understand what you know and why they should know what you want them to know. That's something I regularly deal with. In terms of tools, there are so many, but let's start with languages. Python is obviously the workhorse for a lot of data science; it is worth knowing. As an engineer, it may also serve you to learn something a little more low level, something that gives you more manual control over things. I'm thinking C++ or something like that. And I do think that if you know lots of different tools, not all of them, but enough to make you versatile, it makes you more independent. That's a really valuable skill on an MLOps team, where we kind of need people who can do it all in some respect, especially when you're first starting.
That will make you a valuable asset. People do like that: maybe you don't know all these things, but you're interested in learning these new tools. I think that really helps. Get involved in a community too. I would say that's a big thing, because even something as simple as leading a reading group, or participating in reading groups, getting involved in the community, it's hard to describe, but it does help you in your career. I wanna take some more time to think about that, but I know personally it has broadened my view of what I should be concerned about, and that makes things a little easier to navigate, especially as I go into work and deal with these unique challenges.
[01:12:47] Unknown:
Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, I'd like to get your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:13:03] Unknown:
It goes back to what I was talking about earlier: the ideas about how to make things more visual, more transparent, so that people can recognize the butterfly effects they're creating downstream, and if they do 1 thing, they understand the full outcome. That's 1 piece. And then, how to, maybe centralize is not the word, but how to have a knowledge hub. What is the best way of creating a knowledge hub for machine learning as you build the processes out? How does that look?
[01:13:49] Unknown:
We've been working with data for a long time, so there are lots of nice specialized databases and lots of tools and even languages that work well with data. I think it's such a cheap answer, but it's the governance aspect. There are all these new concerns, legal concerns, coming into play that I often think about. There are some tools I have seen; Microsoft, for example, has a lot of custom tools and things where they think about this. But I'm thinking about something like how, in data warehousing, the query language is an important part of the tool. I wanna see something like that for governance: tools that prioritize it. I'm thinking of Barr Moses; I love her philosophy of data and data management and data governance. There are lots of cool tools focused on latency and maybe scalability, but maybe not so much on that management part, the boring stuff that I don't like to think about but just want taken care of. So it's not so much a gap as that I would love to see more of it.
[01:14:44] Unknown:
Alright. Well, thank you both very much for taking the time today to join me and share your experience and perspectives on the overall space of MLOps and how the data engineer fits within that context. I appreciate all of the time and energy that you're each putting into that, and I hope you enjoy the rest of your day. Thank you so much, Tobias. Likewise. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Guests and Their Backgrounds
Defining MLOps and Its Importance
Data Engineer's Role in MLOps
Open Questions and Challenges in MLOps
Personas and Backgrounds in MLOps
What Does Production Mean for ML Models?
Technological Space and Core Capabilities in MLOps
Adopting MLOps: Technical and Organizational Questions
Interfaces and Contracts in MLOps
Interesting Approaches and Innovations in MLOps
Lessons Learned in the MLOps Space
Predictions for the Future of MLOps
Final Thoughts and Gaps in Current Tooling