What Does It Really Mean To Do MLOps And What Is The Data Engineer's Role?

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode.

With our managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Packaderm, and Dagster.

With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.

Go to data engineering podcast.com/linode

today. That's l I n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.

This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer friendly data catalog for the modern data stack. Open source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga, and others.

Acryl Data provides DataHub as an easy to consume SaaS product, which has been adopted by several companies.

Sign up for the SaaS today at data engineering podcast.com/acryl.

That's

acry

l. Your host is Tobias Macy. And today, I'm interviewing Demetrio Brinkman and David Aponte about what you need to know about MLOps as a data engineer. So, Dimitrios, can you start by introducing yourself?

Yeah. I haven't really figured out a good way to really talk myself up in these introductions, but

I fell into the MLOps world about 2 years ago when a company I was working for went out of business, and I was starting to interview different people in MLOps.

And so now I run the MLOps community. We've grown to over 9, 000 people in Slack, which has been

just incredible to see.

That's me in a nutshell.

And, David, how about yourself?

Awesome. So my name is David Aponte. I am a software engineer at Microsoft and a board member in the MLOps community.

I got into the space hands on working as a data scientist, managing my own models, getting them to production, and then later on working on the machine learning infrastructure side

where

I worked very closely with data scientists helping get their projects to production

and got really interested in the operations around all of that and got in touch with Dimitrios. This is around the pandemic.

No meet ups. You know, everything was online and found an opportunity to work with him and help out, and a lot has happened since. But right now, I am working at Microsoft doing mostly MLOps work in my day to day.

And going back to you, Dimitrios, you mentioned that you kind of fell into the MLOps community a couple of years ago. So I'm wondering if you can just share a bit more about

how you got involved there and what it is about this space that's keeping you interested and motivated.

I come from a sales background, and

I was working

at a company that was doing MLOps.

And I was on the sales team. And so I was the guy

that nobody likes who is pestering them on LinkedIn and

probably sending you cold emails or trying to cold call you. And so forgive me. I've repented for my sins. And since then,

basically, I was working at this company. We were doing data lineage

and really trying to focus on the reproducibility

of certain

aspects of the data science and machine learning workflow.

And when the pandemic hit, right before it hit,

the

sales

calls just dried up completely. And our CEO, he has open source in his blood, and he said, Why don't we try this community thing? Maybe that will

allow us to talk to people that we wouldn't necessarily be able to get on the phone because we're a small startup, and we don't have the connections.

And so we started it. The company went right out of business about 2 weeks into starting these whole meetups.

And I had this moment of, like, should I go and try and look for another job? Should I go and try and do other things?

Or

do I like this? Do I wanna keep going and see where this can take me? And so I chose the latter, and it's just been a roller coaster ever since. I mean, I've

probably learned

so much

coming from no background to being able to talk with people that I have no business talking to every week, and sometimes 2 or 3 times a week, and then getting to also rub shoulders with people like David and have him teach me along the way too.

David, do you remember how you first got involved in the area of data? I was actually a teacher before I was in tech, teaching science and math and a little bit of programming. And when I was teaching myself how to program,

I was doing some soul searching. What do I wanna do for the

future? What skills are in demand? I had studied molecular biology in my undergrad, so I had a craving to work on hard problems. And not to say that teaching is not hard. It is very

challenging. But I wanted to be an individual contributor. So I did some reading and searching, and I think 1 of the books that sticks out to me was a book by a data scientist from Google, I think. It's called something Everyone Lies or something like that. I forget the name of the book. But I just found it so interesting how people were using data and computers to solve problems, to learn something about their customers, to

make a product better. And I just, you know, dove into that. And then through that, learned about machine learning, was just like, I guess, okay. You're using all this data to do something cool with it.

And that just led me down the rabbit hole with all the things that are related

to that. All the engineering, right, all the math, even some of the science.

And that's how I got into the data space by pure passion, looking kind of forward to what I thought was gonna be really important in

the future. And so that brings us to the conversation for today, which is MLOps and why data engineers should care about it, what their role is there. And I'm wondering if we can start by just establishing a definition of what MLOps actually is and maybe how it relates to some of the other

nebulous ideas such as DataOps and DevOps?

It's a great question. Probably the most common question in this space. Right? And I think you're gonna get different answers to this question depending on who you ask.

But my intuition on this is it is where machine learning system development meets machine learning system deployment.

It's about systems. You're not working in isolation. It's not just a model. It's a model with data and code. And the interaction between all these different components

is what I consider the MLOps space. So it does, in my opinion, involve data operations. It does involve

dev operations. Right? But this added element of complexity now with the data science being mixed in. And the reason why I think that makes it a little bit more challenging is it's, a lot of the time, fundamentally

research.

We have some proof of concept that we're trying to see if it makes things better. We don't know if it's gonna be better than even a random baseline.

That experimental nature makes it very iterative, makes it very challenging. And so, like, a lot of the disciplines that we have out there, say, for example, MLOps or sorry, DevOps,

how does that help us with the development space? It helps a ton. Right? Because software engineers have been developing things and then shipping them out. Right? But I guess it's hard when you have something that maybe needs

lots of manual intervention, lots of validation,

lots of different personas being involved as well. It's not just engineers. It's also product managers and domain experts

that are involved. And I think all that mixed together,

again, where systems development or ML system development means the deployment side is what makes it challenging. It's like all of these things mixed in. So it's not that it's so different than these other disciplines.

It definitely is built off of them, but I think it's coming into its own as something that stands

on its own because of these kind of unique

challenges, right, the data, the code, and the model.

I think that 1 of the guests we had at our meetup 1 time, Andy McMahon, said really eloquently when he said, productionizing machine learning

is 1 thing. But then

once you productionize it, whatever, n plus 1 times, now you're going into the MLOps sphere.

And

that kinda rang true to me because it's like, yeah, you can get it out once.

But then once you start

making that a process and you start really going through it, that's where you're starting to get into MLOps. And David also said this another time to me, and it was how

MLOps encompasses

DataOps, but DataOps doesn't necessarily

encompass MLOps

because you need that extra piece of ML

in there.

In terms of the actual

scope of MLOps,

as with DevOps and DataOps is that it's

cross cutting and all encompassing at the organizational level.

And given the fact that there are multiple personas involved, I'm curious if we scope it down to the data engineering position,

what do you see as the set of responsibilities

for that role, and how does the adoption of MLOps

sort of

change the actions that you take as a data engineer? And how does the additional considerations

of the fact that this data is being used for machine learning workflows and that it will, you know, come full circle and feedback into your source systems,

how does that factor into the ways that you think about the work that you're doing

as a data engineer? It's a good question because I think it's also changing. Like, it depends on what type of data engineers you're working with. Right? Like, at Microsoft, a lot of the tech stack that they're used to there,

SQL,

you know, c sharp, things of that nature. That's 1 thing I would know is that sometimes the personas in an MLOps team, they're learning specific tools that maybe

general data engineers may not be familiar with. In my opinion, what a data engineer is responsible for is getting the data to the model, but not only getting the data there. It's also what happens after that. For example, let's say, you could even store the prediction somewhere. We could set up monitoring on the predictions. And all of that, even though it's outputs from some model, it's really just a data that needs to be

extracted from some place, maybe transformed and loaded into another environment. So a lot of the same data engineering principles that, you know, a data engineer working on a data warehouse does is very similar. But now let's say if we need to think about working with a data scientist that has maybe a more involved feature engineering pipeline

that maybe does a little bit more involved like a selection process. And so now you have to work closely with the data scientists to help them accomplish their goal. So it's it's more specific than just, you know, I'm doing some transformations and loading them elsewhere. Maybe I need to do that in a very iterative way. And so I think it's a lot of the same stuff, but now you have the domain expertise of machine learning being involved.

And I guess this is super important in the MLOps space because the model is a function of the data. Right? Everything is dependent upon the data. If the data is bad, the model is bad. So it's not just about, like, you know, the infrastructure. It's all about the quality of it. So there's a need for data engineers to understand

the data as well to help the data scientists find good signal,

and it varies, like, in my experience. Some data engineers are more on the pure software engineering side, so their experience with machine learning is they kinda treat it as just like some general

artifact that needs to be shipped versus maybe another data engineer that's more on the data science side and is more used to maybe doing the feature transformations and more involved kind of, I guess, engineering with respect to not just moving it from 1 place to the other. Again, that's not all that data engineers do, but it's a big part of it. Right? Getting the data to the model in a reliable way, all the pipelines that are around that. A data scientist may not be experienced managing.

Right? They maybe are used to the algorithm and the kind of experimental work involved, not necessarily

the whole system. And so a data engineer

works closely with the data scientist and usually also with ML engineers if they're involved. Sometimes sometimes, actually, they have the same responsibilities.

But maybe a data engineer is responsible for all the data that comes to the model, and then the ML engineer is responsible for productionizing that model in some target environment. Let's say, like, a web service or maybe even a batch scoring pipeline.

It really varies. But, again, just to kinda bring it back to what I was saying earlier is that it depends upon the skill set that this data engineer is is used to doing, and I guess exactly what their role is in a team. In Microsoft, it's again, it feels very simple. Getting the data from 1 place to another, but it's super fundamental

because without that, the model just won't even have a good model. I love how you said that too. Like, I've heard people say

that MLOps,

its current iteration is just basically glorified ETL

in a way. And so being a data engineer, you're kinda set up for being able to crush it in MLOps. It's just adding that little extra sprinkling of flavor on it. But the other thing that I was gonna mention

is MLOps will give you

a common

framework and common, like, language

that everyone can

coalesce around. You can find this area where you're saying, okay. You know what? If I say

we need a feature store, a data engineer can understand what that is just as much as a machine learning engineer or a data scientist.

Hopefully, everybody

understands

what that is. Although sometimes, like, maybe Feature store isn't the best example because that is still up for contention. But if you start to say things like, alright. We need

a model store. We need reproducibility

here. Hopefully, a data scientist will understand that just as much as a data engineer.

So that's another thing that it's not exactly the question you had, Tobias, but that's another thing that I wanted to mention on there. Yeah. It's definitely

an interesting and complex space, and there are definitely a number of areas I wanna dig into. But before we get too far down the road,

I'm also interested in getting your perspectives

on what you see as being the

kind of open and active questions in the MLOps community

and what are the pieces that people are still confused about? What do people say are you know, these are the 5 or 10 or 10 dozen things that you need to be able to actually say you're doing MLOps because,

you know, taking DevOps as the example, that took about 10 years before there was any real sort of general industry adoption of DevOps as a set of practices and principles and what it even was

loosely supposed to mean. So I'm curious to see sort of, like, what is the general industry adoption of MLOps? What are the open questions? What are the pieces that people are struggling with?

Yeah. 1 thing that comes to mind is the organizational

problem in MLOps. Like, we kinda were just discussing what persona should be involved, what type of team players do you wanna have, and it varies. And I think related to that is what tools do we need. If we're using this, does this count as MLOps? And the other question is, do we have 1 tool that has them all or get 1 tool for each? And I think these are the sort of questions that like, very practical sort of concerns. Right? Not so much about the theory of it or what it means. It's more so, like, how can it help me

do what I need to do and get my models to production reliably, for example.

But it's these organizational kind of cultural questions that I see come up the most.

That question about what is a data engineer's role in an MLOps team is still an I think, again, there's different answers to that question.

Where does the responsibility of the data scientist end and the responsibility of the ML engineer begin? These are tough questions. Again, it's it's such a cheap answer, but it the answer is it depends. Depends on the problems that you're working on, the team structure that you have,

resources available.

A very common scenario is that people just starting, like maybe a small start up, they don't have a ton of resources. So they have 1 guy that does it all, you know, 1 girl that does it all. So it's like, that's MLOps. You know, they're doing their productionizing machine learning. Maybe it's not as as organized. You don't have a bunch of different personas. Maybe you have 1 guy doing it all. But that still counts is my point. And so there's a spectrum of what MLOps looks like, and people have described different maturity models. You know, usually it means more automation means more mature.

I'm not entirely convinced that that's always the best case. I'm still I'm actually kinda wrestling with that at work automating stuff, but it's kind of, again, these sort of putting it into practice questions. The very again, from what I sense, very practical. Like, what do I need to do? Who do I need to hire? How do I structure my team? Yeah. What are the processes?

And 1 of our guests on the podcast

was talking about actually and this is a sentiment that you'll hear echoed all throughout the podcast. It's like,

I pray for those days when it's just a tech problem.

Those are the best kind of problems that we could have because

when it's a processes or it's a people problem, that's much more difficult to figure out a

well thought out solution for. And so

I also was gonna mention, like, I just quickly scanned through Slack and picked out a few of, like, the very common

questions that come up. You have 1 that's like, do I bake weights into Docker containers?

The classic 1 is,

can I use Jupyter Notebooks in production?

And there are very, very

opinionated people on each side of the field on this 1. That's like, if you want to just start a

Twitter war, go and say 1 side of that, and you'll see what happens there. Or ask people. Be like, yeah. Your data scientists need to know Kubernetes. That's another 1 that gets people

There's lots of trigger in there that goes on. And I find this 1 there's 2 questions that I haven't really gotten a good answer to. 1 is, how do you manage dependencies between all of these different tools that you're using or frameworks or whatever? Because that's really complicated. I haven't heard someone who says, oh, we do it like this, and it's, like, it's cake because we set this up. And the other 1 is,

how do

you create an effective knowledge base for machine learning? And so you make sure that there is not that, like,

siloed information,

and it's 1 person who has that know how on how to bring the model into production. And if that person doesn't show up for work or they just move on to the next job, we have to be Sherlock Holmes and figure out, like, what is going on. So those 2 questions, I think, are probably the biggest that I haven't seen.

Yeah. I just wanna piggyback up what you said. Yeah. Like, minimizing the bus factor. Right? We wanna make sure that if this person got hit by a bus, which we could still run the show and nothing will will halt. But that's a hard thing to do because

not everyone has the same skill sets. Right? Like, you know, for example, think about data engineers and the tools that they're used to using. It's a bit different than a data scientist. It's gonna be a mixed bag. And how to get them all to play nice and be responsible for the same set of

systems and components,

you run into the same questions over and over again. What's the best way to do this is what I'm seeing. If there's precedent, you could look at what Google is doing, what micro these big companies are doing, but even then, that doesn't always fit

because the scenario is different. The domain is sometimes really different, and the resources available. Like, this is not always the same thing as, like, what these big companies are doing. The last 1 is, who gets the call when a model goes

AWOL at 2 AM? Right? Like, who has to go and sort that out? The CEO. That's a huge question.

The

CEO.

Yeah. Related to that is something that we're dealing with is coming up with an on call rotation where we have data scientists on that on call rotation. And we think we gotta think about, okay, how do we

enable them and equip them to actually troubleshoot these live side issues effectively without having to learn all these different things that are just not maybe relevant to their future career. Right? Like, we're talking about Kubernetes. Right? Like, yes, some data scientists, they like learning all these tools, but

is it the best use of their time to be doing all of that? And so now we're dealing with the question, okay, but we have a model that's failing.

It may have some statistical, you know, issues that a software engineer may not be in the best position to debug. So we actually need domain experts, the data scientists in this loop. But how do we do that with these different kind of skill sets? So these are the sort of interesting questions that come up. Yeah. Definitely seems like reflections

of all of the previous iterations of DevOps and DataOps of how do we manage the

context propagation across these different sort of areas of expertise

while maintaining effective communication

and understanding,

you know, what are the sort of interfaces and handoffs at the different stages of the life cycle

and some of what the things you're talking about as far as what folks are faced with as challenges in the MLOps space of,

you know, managing, how do we understand on call, how do we understand observability in the context of ML, how do we work through, you know, the sort of deployment factor? You know, what are the pieces that make sense to automate?

Those are all things that have been struggled with

since we started working on software. And

it seems like

the answers are probably similar,

but there is the, you know, extra degree of uncertainty that is inherent in machine learning and data science, and how do we factor that into the tools that we already have.

Given the fact that there are so many overlaps, I'm curious

what you're seeing as the

kind of primary personas who are engaging in the communities that you're a part of that are focused on MLOps specifically

and, you know, maybe some kind of general categorization of the backgrounds that those people have?

A majority

are

people coming from the DevOps sphere or SREs

trying to then

figure out how much machine learning they need to know to be able to play in this field.

And then you get data scientists that are coming

into the DevOps field trying to figure out how much

SRE knowledge they need to know to play in this field. And,

personally, I think it's a lot easier coming from the DevOps or being an SRE and then picking up a bit of machine learning than it is being a data scientist and then having to learn,

like, proper coding

and all that good stuff. There are a lot of data engineers that are doing the same thing, and there are I would say,

we also will get

the occasional analyst. But, again, like,

that's

a lot less than

for me, the main 2 that are coming into the MLOps community or that I don't know. But, David, maybe you have other ideas.

No. I I think you said it right. There's lots of people coming from, like, site reliability engineering background, right, where they're used to maintaining. I guess you could think about them as, like, system engineers.

You know, they treat the things being deployed as just, like, some general artifact that needs to be

shipped and maintained across multiple environments. So, like, they think about it kind of at that high level. So usually, like Demetrius is saying, they're just how much domain expertise do I need to know to get in this space? And there's lots of people. For example, we've spoken to Todd Underwood at Google where he believes actually you don't need a ton of machine learning experience to be effective. A lot of the times, it's just being good with working with systems. Again, sorry if I'm misquoting you, Todd. But I actually agree with that because I think the domain is something that you can pick up as you're working on the job, and you could always supplement with lots of different things. But

if you're responsible for maintaining something and that's just you, you have to kind of know all this stuff already. It's just hard. Like, I don't know. I'm thinking about, like, all the random networking issues that come up, and it's like, if you don't understand

networking, it's very challenging to actually use the tool that you're trying to set up so that people can do what they need to do. Again, this is a headache that, like, if you're, like, responsible for deploying Kubeflow, for example,

a lot of the issues are not machine learning issues. They're, like, these very kind of common general infrastructure issues. So that usually

is a strength if you have that if you have that experience and you have that background. And I've seen

that they tend to do well in these MLOps related teams. And so that makes me feel like it's actually more

on that side,

more on the engineering side than anything. But maybe that's just because, again, MLOps is about the operations, which is not the entirety of what, you know, machine learning is about. Right? That's just a part of it. Yeah. Another interesting element of

machine learning and operationalizing

it is the question of what does production mean for a machine learning model, where for a software application, it's very clear that production is when your end user is interacting with it. But who is the end user for a machine learning model? It's very context dependent. It could be that it's an ML model that's powering a recommendation engine that determines what to put on the web page in front of you. It could be machine learning model that's responsible for determining when there is a high probability of a system failure in your network infrastructure. It could be a machine learning model that's determining

what prediction to make for your CEO's business dashboard to determine how many widgets to buy for next quarter. So I'm curious how folks are thinking about the categorization

of different environments and what production means for machine learning.

Yeah. I think that's related to, like, even the kind of common question. Are Jupyter Notebooks acceptable for a production environment? So as a notebook,

does that count as a productionized model?

Again, I would argue it depends, but my opinion on a lot of these things is where the model delivers value is how I see it as a production environment.

Sometimes that delivering value could be actually manual. For example, there are some models at Microsoft within our team where the outputs to that model are

given to reviewers. So they need to take a look at these outputs and see if they make sense. And if it does, then eventually the model will be integrated in a more automated way. So there, it's already in production if your customers are seeing the outputs of that, in my opinion.

You need to have a process to reliably do that. Maybe you don't need to have all the bells and whistles to have it completely automated, but you should be able to reproduce things fairly easily. That's like the bare minimum that I think. And usually, it can mean having it in source control somewhere. So that way it's not lost or, you know, it only runs in your computer type of thing. And it's like that sort of

consideration of it's not just about me and you're working on here, it needs to work elsewhere for the customers. I think that's when you're in a production environment.

And there, again, it depends. Sometimes it is just as simple as, like, here's a CSV, validate this, and that works if that's what you need to do. It's more automated, then I think that's maybe when it becomes a little bit harder to

do that easily.

Do you want to learn how the Joybird data team reduced their time spent building new integrations and managing data pipelines by 93%?

Join a live webinar hosted by RudderStack on April 20th, where Joybird's director of analytics, Brett Chawney, will walk you through how retooling their data stack with RudderStack, Snowflake, and Iterable made this possible.

Go to rudderstack.com/joybird

to register today.

Yeah. The other question that has been very nebulous for a long time is, how do you know when it's actually machine learning? Does it have to be a deep learning, you know, convolutional architecture? Is it just a random forest? Is it a expert system that just has a very detailed decision tree? Like, when does it become machine learning, and when is it just a software system?

Classic question. Yeah. I think Dimitris answered, but I think for me, it's like when the output is a function of the data, which you could argue is a heuristic. Right? Like, if this else that, that does something. Right? Well, it is a model of sorts. When it becomes machine learning, maybe it's when you have, I guess, like the classic

optimization. You're trying to optimize something or you're trying to learn something from the data and then take whatever you learn and apply it to a new dataset. I think it usually involves that like, the these artifacts. I guess you could think about them as the training artifacts. But

a lot of what has helped me maintaining

machine learning and production is actually understanding data in production. So it is data. I don't know. I guess it comes down to the algorithms in some respect. Yeah. I was just gonna mention,

so many times we've heard this that it's like, if you can't avoid

using machine learning,

you should try and do that first because you're adding so much

extra complexity

by bringing on the machine learning. And

I think about that quite a bit. And I also think about

how there are certain use cases or there are certain

especially now, certain use cases that have been proven out where it's like,

we should definitely use machine learning on this 1. And we need to do it because

if we're not,

we're behind in a way. 1 more thought while while we're on this subject. What makes a machine learning? I was thinking, you know, like, something as simple as, like, if I if I wanna make recommendations, I could just sort them, right, by their scores

and then present that. That would work as, like, maybe a baseline, for example.

And that's usually what a lot of companies do. Right? They have some baseline to compare. Usually, it's maybe random.

Maybe it's some data transformation. But I think when it becomes into the machine learning spaces, probably

maybe it's like a cheap answer, but when you don't have to explicitly

program what you want it to do in some way. Right? Because it learns

what it wants to do from the data. So I think that also makes it a bit different versus, like, the sorting example. Right? Like, that you have to program it to do that versus if I'm using a model, who knows what may be prioritizing that list. Right? There's no guarantee that it's the same, and that's where we get the randomness. Right? Which also adds that other layer

of complexity, especially in the cases where your machine learning models labels, let's say if you're doing supervised learning, is not readily available. When I was working at a company called Benevolent AI,

we were predicting, for example, whether or not this gene would have a successful assay or would be an assay hit, for example. And that is expensive to validate and test, and the answers don't come back for months, sometimes even longer.

That may again, it's like it's not just data engineering there. You have to think about the labels and how that relates to everything. And, again, that uncertainty makes it hard to validate if your model is doing what it should be doing,

which is, I think, 1 of the other things that makes machine learning a bit different than just general data transformation.

Just to be a little facetious here, machine learning is when you get the answer and don't understand the question.

Yeah. Exactly.

In some ways, right, like a black box.

Yeah. Yeah. Oh, that is great.

Did you just make that 1 up right now? Yes.

Dude, we gotta put that I'm gonna put that on a shirt. That is incredible.

Yeah. Yeah.

I'm still surprised at how effective

some models are. Like, think about a transformer or something like that. Like, how does it actually do what it does?

I think that's where MLOps makes it a little bit different from some of these other ones is maybe there isn't a coherent theory

putting all of these things together. Right? Like, if you were to ask someone

like Yann LeCun or someone else to explain what deep learning is, I've heard different answers. I've heard someone describe it as, like, almost like a language.

And, you know, someone will talk about it as it's just a function or, you know, I think that's where we don't all agree yet. And maybe MLOps comes into play because, like Demetra said, it helps us find some common ground. Even if it's not a coherent theory, maybe it's a tool, maybe it's a process, but it's something that unifies

these different parts that are involved.

Digging now into the sort of technological space that's growing up to be able to support this whole effort of MLOps and people starting to adopt these practices,

what are some of the core capabilities

that you have seen people

coalesce around as being required to be effective at building these MLOps functions?

And what are some of the kind of standout technologies

that you have seen as people are starting to iterate on this problem?

So it's interesting you mentioned that because I have this, like, half written blog post

that I've been trying to finish, but I just

haven't been able to get it across the line on

this idea of, like, what is the modern ML stack? Is there such a thing as that? And, actually, my

whole blog post is about how there is no such thing as a common ML stack or a modern ML stack. And and I also think that there never will be because of what we kind of touched on earlier on how

you play this game. Like, which of these is not like the other? Then you say, computer vision with autonomous vehicle, computer vision in healthcare,

fraud detection,

you know, with tabular data, or

just go down the list, robotics.

And you start to realize that each use case is so very

specific in its needs and what you are trying to optimize for that it's really hard to say, like, oh, well, you definitely are gonna want this tool as your base and your foundation.

So what I'm trying to figure out is what are the fundamentals

that we can build off of. And I like

what my old CEO, when I was at that company that went out of business, he had this, like, manifesto,

and it was like the MLOps manifesto. And his idea

was

that you need something that's reproducible.

You need something that is collaborative.

You need something that's continuous.

And then I also would add on top of that, you want something that is

ethical.

The thing is that all of that is very like

it's like I'm selling new air. Right? Like, it's

really hard to nail down what exactly that is and how is that right for my use case. But I just think about, like, if you're trying to figure out computer vision on the edge for an autonomous vehicle,

you need much different SLOs than if you're trying to figure out computer vision for

some kind of cancer detection for

a doctor.

And

because you are really looking at something that is using computer vision underneath it all, but you're

optimizing for 2 highly different problem sets,

you can't really say, oh, there's these

core components that you definitely need. Or at least I haven't been able to figure them out, hence why the blog post is not finished. If anyone has ideas, please reach out to me. I'm gonna steal from if you guys follow, I think it's called The Sequence. There's, like, a newsletter. There was the CEO of

ML experimentation platform. He just had an interview, and I like what he said about what he considers

the key components of a robust ML experimentation

tool. But I think it applies to actually all MLOps tools. And he says scalability or scalable back end. I think that's really important because

the nature of data science is kind of experimental.

Like, a lot of times, you start small and then work your way up, or sometimes you need a lot from the very beginning just to even accomplish something. So you need and this is, like, I guess, to getting it down to it, the compute. Right? You need a lot of memory. You need access to certain accelerators.

All that should be very easy to use, and it should allow you to do your work infinite like, to infinite scale. Other thing is, like, he says it's flexible and expressiveness. Like, it should make you more productive, I think. If it's a good tool,

in my opinion, embeds

best practices into it. So it's hard to do the wrong thing and very easy to do the right thing. And usually, that right thing makes you more productive in your workflow, what you're trying to do. And then lastly, like he says, like, there usually should be some way to visualize

or, like,

keep track of what's going on. And maybe this is a dashboard. Maybe just a simple view of, like, all the data that you have, all the assets,

but some way to look and peek into what's going on. And this is where we talk about observability a lot. But this is a key component to, you know, model development when you're doing training, hyperparameter tuning. You wanna know what's going on. You wanna be able to scale and you wanna be productive.

Model deployment, the same thing. Right? You need to be able to scale. If if more customers start using your web service, you want it to be able to, I don't know, for example, add more notes or something. You wanna be productive in these different areas, and that's why I feel like a lot of tools are kind of trying to become,

you know, specialized in this 1 area. And I don't think that the tools that try to do all of them in 1

work well because it's like, you know, if you're working with most data warehouses, right, they don't have all these tools mixed into it. Right? Like, sometimes, like, you know, BigQuery allows you to do some modeling in it. I think that's totally fine. But a data warehouse that maybe has, like, a data visualization tool embedded in it. There are, like, Tableau or something that lets you pull your data in, but then it it affects something is going to be not as good as a result of you trying to do all of these different things. What I see as as being a trend is is there's best in class coming about, and, usually,

it's around

the productivity for whatever that workflow is. Right? So and what, the CEO of Neptune was talking about was experimentation, being able to, like, look at your different experiments, be able to tie them to 1 model that was produced. So there's that lineage aspect, I think, is important.

It's again, I hate to say this again and again, but it is a common tool is feature stores. And the reason why I think it's so popular is because, again, everything is dependent upon the data. So you need to get your data to your model in a timely fashion, in a reproducible fashion. It's also hard,

you know, the transformation side. A lot of the times, like, you need to know Spark or maybe how to do, like, I don't know, write some stored procedures in SQL. So if there's an easy interface to do that, that's also very

helpful. And another thing

is model deployment. Think of Algorithmia.

I have a training artifact. I wanna ship it as a web service really easily, or I wanna regularly run it as a batch job. And so I've seen that those tools that are dedicated to

model deployment are very popular. So feature stores and model deployment are, like, the 2 biggest ones that come to mind. Yeah. Then monitoring, there's a huge amount of money going into the monitoring space. And I think it still isn't clear if you need

a dedicated

machine learning monitoring tool or if you can use something that is like your DevOps monitoring tool with with a few, like, tweaks on it. But there's

a lot of companies

in this space right now, the machine learning monitoring space, and they probably have a mouth for why you need their tool. But

I was gonna mention something that you said, David, on top of that when you were talking

back about that first part of the question. And 1 of my favorite things to ask guests these days

when it comes to them creating their platforms

is, what

metrics are they looking at with their platform? Like, what is it that they're looking at to know that the platform is actually doing its job? Is that time to deployment? Then you shorten that, and that's the metric that they're looking at. Or how did they

see

that the platform is actually bringing more value, and how can they make the platform better?

That's a great question.

That's a tough 1 to answer.

Yeah. And how do you select that metric? Because you always wanna be careful about what behaviors you're

incentivizing

because they're not always going to be the ones that lead to the outcome you're hoping for. I think that where is this from where they say once you make something a metric, it's like almost not as good. I dealt with this a while ago, but like as soon as you start observing something, something happens there. It's like it's not as good as what it used to be. Are you going into string theory? You didn't

you didn't tell me that? No.

No. No. Not exactly. Not exactly. No.

Yeah. Yeah.

And in terms of the

practice of actually starting to adopt MLOps

as a

core consideration for the organization,

what are the

technical and organizational

questions at the design and architecture and process and procedure levels that need to be made

as you're starting to go on that journey?

It seems like everyone

is probably going to ask the question, should we use Kubeflow?

That's the first question that everybody is going to ask. And once they get past that question, you can actually get down into the real question. I think the other 1 I dealt with this when I first started at Microsoft. I was thinking about an architecture that could be applied to any project. And so I came up with kind of, like,

a general kind of template project. And that's actually something that I think we'd look for is, like, how can we standardize

all the different things that are going on? That's 1 of the first questions that we were asking. It's like, okay. We have this team that's used to using this tool, doing it this way, and they've been doing it very manually. This team, maybe it's a little bit more automated. And so the first question is how do we organize all these different ways of working

to meet their needs? And then you gotta think about the technology choices that you're gonna use. It depends on, obviously, what resources are available. Are you do you have the

expertise

and the time and, I guess, like, the okay to build your own stuff? Or if not, do you can you afford using a managed service?

Picking the right 1 is also a good question because, you know, you're building kind of, like, your whole strategy on some company. And they change, you have to change. And so

sometimes that's too big of a concern and they decide we're gonna do everything on our own so we could stay nice and decoupled. But then even then, you deal with those challenges because if you're depending on open source stuff,

open source has its own host of challenges because now if something goes wrong, you are on your own sometimes. Like, you have to figure out how to make it better. And some companies have the expertise to contribute and make it even better. Think of, like, Uber who is regularly contributing tools in the open source space that they use in house and and develop. And it's like, this didn't work for us, so we decided to build this whole new feature to make it work for us, and now we're giving it back. But a lot of companies don't, they can't afford

that. They're just trying to get their model in production again, going back to what that means. I want the value of this to to get to my customers. I don't care how it happens. I just want it to happen. Then there's like, okay. Well, how do we

once we have the tools in place that help us do that, what about the processes? Who do we have on this team? What type of person should lead this team? What experience should they have?

These are the things that come up, and and usually it's a mixture of kind of what Demetrius is saying, more on the engineering side with the domain experts kind of being the customers. Like, these I see the data scientists as the people that I'm primarily serving. I'm I'm trying to make their work better, their workflows more efficient, their things more scalable and more reliable. They should be involved because you're building things in the service of some product or some tool, and and they need to be very closely linked, not separated. And so you also need that

and then obviously the business. The most important is, like, are you solving the business problem, which is, again, I think that's related to MLOps because you're not just doing engineering for the sake of engineering. Right? You're doing it with a very specific purpose in mind, and that needs to be factored into how you think about solving this problem. I was just gonna mention 1 thing that I just like to follow-up on what you said, David. It is so true

that machine learning engineers

calling

or thinking of yourself as an engineer, only an engineer, is probably a bad

decision because machine learning is so close to the business.

It's like you have to understand

the business side just as much as you understand the engineering side, in my opinion. And I've heard that echoed many times,

and that will allow you the ask questions on these design decisions that you

potentially wouldn't ask. It's like, wait, what problem are we trying to solve? Like, what business problem? Because

I'm sure that everyone listening has

had this experience or at least knows somebody who has had the experience where a data scientist or a machine learning scientist goes and they hack on something for however long,

potentially up to a few months,

maybe it's just a few weeks,

and they come back and they say, Look at this, what I created. Behold, I've got 99%

accuracy on this model.

And you see it and you realize, Wait, I don't care what kind of accuracy you've got. You're missing the mark completely.

And so that is something that

is very, very common in the machine learning world. And I don't know I mean, it's common in other

areas, but I think that with machine learning, you really have to be conscientious

of that. Yeah. I see that particularly with data scientists. They should know that the business side of things. It's I think it's super important and it makes you

it's like you're not just building some solution in the abstract, like you're building a solution to a very specific problem, and you have to understand that problem really well. I will say that engineers usually don't need to know them that much. It's like you need to know, like, the overall high level. At least that's just me. Like, I like to think about the infrastructure and the tech.

But, yes, the domain is absolutely important. Right? Like, what are we doing all of this for? If you can't answer that question, it's gonna be hard to answer some of those other questions.

And a little bit of a tangent, but 1 thing that I think is interesting bringing it back to, like, data engineering and then

looking at design decisions as you're trying to create your MLOps practice.

Again, I'm just throwing this question out there. I don't really have a good answer to it, but I think it is something that has come up time and time again that I really

find it interesting.

And

if someone

is tackling it in a really

innovative way, I would love to hear about it. But it's how can you give people vision

or enough insight

into what they are working on so that they know because the way the data flows. Right? Someone who is a data engineer and they tinker with something upstream,

they have no idea what kind of downstream effects that could have when the data scientist is putting that model together. And so that could be someone

way upstream just changing something

very minuscule to them,

and they kind of have an idea of like, Okay, well, 2 steps down the line, this is the changes that that's going to create. But they don't understand the full

impact of that. And so back to what we were saying, that's more of a people problem, right? Or that's more of a processes. It's not really a tech problem. You can't really solve that with tech, or at least it hasn't been solved yet. So that's another interesting problem that is happening right now.

Yeah. And continuing on that thread briefly,

as the data engineer,

you are the person who is responsible for providing the data that is actually going to

feed into these machine learning models. And as such, you are the person who is going to either afford or constrain the types of questions that can be answered. Because if you don't have the data, you can't answer some questions. Or if the data is structured in a certain way or

is lacking some necessary context, you're not going to be able to actually ask and answer and experiment with the questions that you might care about.

And that's where some of that feedback cycle comes in where the machine learning engineer or the data scientist says, this is what I'm trying to do. And then the data engineer says, okay. Well, I'm going to have to pull in data from this other system, or I'm going to have to, you know, modify the way I'm structuring the data model in the lake or the warehouse, or I'm going to have to feed in additional metadata to propagate context to these different downstream cases. And

as the person who is so early in that chain, you're the person who has the greatest force multiplier, multiplier, particularly as you get into machine learning where

the value of the data is compounded because of the ways in which it's being aggregated and then fed back out into the external world. Something that you made me think of there is, like, when you're working with

interesting challenge of

getting like, this is something that comes up. A lot of the times, our pipelines will fail because some upstream data, like some pipeline failed.

Like, something as simple as, like not maybe not simple, but, you know, having a good line of communication with the different teams that you depend upon. It could be something more formal like an SLA, a formal kind of agreement of what you expect. But these sort of things come up and do make a difference in the productivity of a data science team because,

like you said, if you don't have the data or you don't have access to that data, there's some data that's restricted or it's in this very private environment. There's, like, all these things you have to do to get it. And usually, a data engineer will be in the best position to do that because they have the technical side to help, like, thinking about, like, data governance. They're thinking about all these things while a data scientist is just maybe more focused on their particular problem. I just wanna use this data and train my model on it. And that's where it makes me feel like it's important to have these different personas because if a data scientist had to worry about all of the things that he has to worry about, now getting access and data governance and if I'm storing this in, like, I don't know, some storage container, after how long do I have to delete it to be compliant? Like, these are all sorts

of concerns that I don't know, like, if data scientist energy should be best spent there. Although they usually are involved in that, we do need people to kind of focus on these sort of challenges, and that's where I think the data engineer comes in. I talked to this guy last week, actually, Gabriel Straub, and he is the, like, head of AI at Ocado

Technologies. And he was talking about when he thinks about the platform, just the data platform in general there,

and they have a lot of different use cases, but what he is

really discerning about is whether or not someone is a data producer or a data consumer.

And

depending on

which 1 they are, he's making different decisions

on that and how those 2 can play nicely together.

Yeah. And that feeds into another thing I was going to talk through is the

interfaces and contracts

that

you create to be able to compartmentalize and compose the different concerns and responsibilities

throughout the full MLOps life cycle and as it feeds back into itself.

And 1 of the pieces that we've mentioned briefly a few times here and has come up in other conversations is the idea of the feature store as the

interface

between the data engineer and the data scientist or the ML engineer because it's a very clear contract of saying, I'm going to provide all of the inputs to this feature store, at which point it's your responsibility to actually build those features and maintain them. And so that's the sort of clear hand off for that interface. Or in the case where you're talking about a data engineer and an analytics engineer, it's going to be the data warehouse as that interface where you say, I'm going to land everything into the data warehouse, and then you have the power to be able to do your transformations and analyses from there. And just figuring out,

like, what are the additional interfaces that are necessary as you are trying to complete this full feedback cycle to be able to have this continuous process of improvement for your machine learning systems, be able to get them into production,

manage the sort of monitoring and model drift, understand when to retrain and redeploy, what additional data sources are necessary to augment the models or build additional models so that they can collaborate in a more, you know, systems oriented approach.

So just curious to explore that as well.

It's interesting how naturally that's what happens. We kind of we separate things by, like, what do you know, what do you know how to do. And related to that, I think, is what people are passionate about doing. So a lot of the times, like,

data scientists may not be that

interested in all the infra side of things. They just wanna focus on, like, the fun stuff, the ML. And so that sometimes naturally creates that, you know, kind of separation or that, like you said, that contract where, okay, once it's here, now it's my responsibility.

That's a natural kind of way to organize things. But

I have seen in my own personal experience that that doesn't always work. There is, again, that feedback, this back and forth between the different personas that

it's just hard to cleanly separate them. I think that sounds nice,

but in practice, the data engineer will usually be involved with after those features are defined, maintaining them and making sure that they're correct. For example, like, let's say if the data scientist, you know, has some definition of these features

and let's say there's some tool that you provide the definition and then then computes them and puts them where it needs to be. But how do you know if it's still doing what it needs to do? Maybe it would create specific alerts to see if it shifts,

the distribution shifts. And there, you can't just say, here. The data is there. Create your alert on it. Maybe you need I need to know, okay. Well, why am I seeing this difference in my performance because of this skew? Can you maybe you're more familiar with the upstream data sources. So it's like we need both parties involved is what I usually see. And it is nice to, like, have some separation of concerns where you can depend on this person for being good at this 1 thing, which is usually how we I see it being kind of played out. Like, I am the data engineer.

They like getting all the data from all these different sources, doing what they need to do with it, and pushing it elsewhere. How people use it, I guess, you know, maybe they're not so concerned about. And that's good, while the data scientist is

all they want is the data available so that they can do all their fun stuff in a Jupyter Notebook or whatever it is. And I think that that's okay, but there will be a point in time, especially when you start getting more and more models in production and more and more

pipelines feeding into those models.

Both parties need to know what's going on. It's not helpful to just say, that's not my responsibility.

I don't know. You know that. It's okay for it to be like that for some time, but I've seen that it can be a problem if you don't try to share that knowledge, you don't try to minimize those bottlenecks, whether it's knowledge or expertise in something. That makes me feel like it's natural to have this

I like this technology. I'm familiar with this set of tools, and I'm gonna do this. But, again, it just feels like that that won't always work. And I have seen small instances of that, but I would love to hear from Dimitris. Maybe he's heard otherwise.

No. I was just gonna mention that

for the visual people out there, I look at it more like if you have

a

picture and then it's clear separation of the different colors, you know, picture with 3 colors.

I'm

onboard fully with what you're saying, David, where it's not a clear separation. It's more of like a gradient

going from 1 color to the next

so that there is a little bit more crossover

in

each 1 of these, and you don't ever get that

position. It's like you're setting yourself up for failure in a way

if

you not only and going back to the people part, like,

that's not my problem or I throw it over the fence. We know that that is not the best way to do it because of all the years in other disciplines that have been doing it that way. So I like to look at it more as a gradient, not just like

very clear,

lines in dividends.

Are you struggling with broken pipelines,

stale dashboards,

missing data?

If this resonates with you, you're not alone.

Data engineers struggling with unreliable data need look no further than Monte Monte Carlo, the leading end to end data observability

platform.

Trusted by the teams at Fox, JetBlue, and PagerDuty,

Monte Carlo solves the costly problem of broken data pipelines.

Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, airflow jobs, and business intelligence tools,

reducing time to detection and resolution from weeks to just minutes.

Monte Carlo also gives you a holistic picture of data health with automatic end to end lineage from ingestion to the BI layer directly out of the box.

Start trusting your data with Monte Carlo today. Go to data engineering podcast.com/montecarlo

to learn more.

In terms of your

experiences

working in the MLOps space or engaging with the community as people are starting to

explore what this means, what they need to do to be able to be effective as they're bringing machine learning systems into production and maintaining them over their life cycles?

What are some of the most interesting or innovative or unexpected approaches to the workflows or platform solutions that you've seen?

It's related, but it's kinda like a meta conversation. But I'm thinking about, like, the importance of efficiency. Like, you know, you think about climate change and how we're using it. Like, some of our accelerators use so much electricity.

So there's actually now it's not just like, oh, yeah. We wanna train our models faster, but it's important now

to

put a lot of effort into making

everything that's involved there that's very expensive more efficient. So training in particular is very expensive. Like, I've heard some estimates of how long it takes to train, like, GPT 3 or some of these models in the order of months to years and requires 100 of GPUs. And that just seems it's like that's unacceptable if that's what's required. And right now,

you're already kinda seeing it. Machine learning is becoming ubiquitous. It's just being embedded in everything, all sorts of products. It's just it's everywhere. And I think it was Jeff Dean that talked about, if we keep up at this rate, there's not gonna be enough data centers to provide enough compute for all these different applications. Something I'm thinking about is we need to really focus our efforts in making that process more efficient.

So it's an interesting thing because it seems like it's already a relevant problem in the MLOps spaces like distributed training, more efficient training,

scalable training, scalable inference. Right? But now it's like with this added element of, like, this existential threat. Like, if we don't do this, gonna be a problem. And also from a commercial perspective, it's just too expensive. Like, think about a small startup trying to rent some VMs. That's

a lot of dollars if you are just, you know,

carelessly training. And so this other aspect of, like,

understanding how to use the right resources is I think an important aspect of the tooling space. Something that comes to mind is, like, being able to take a fraction of a GPU instead of a whole GPU. There's lots of systems that they don't enable you to do that by default where you have to for example, you can only select either 1 or 2, and that's a waste. Because a lot of the times, these training jobs, you'll see the utilization is quite low for GPUs. So, like, you're wasting all this compute,

spending all this time, spending all this money. So I think

more

efficient resource usage and allocation

at the hardware level and at the software level is an in an interesting area

that I see being more and more relevant, especially, like, not just from, like, again, like, the text perspective, but

even more existential.

1 more thing. This is, like, random. This is probably people out there listening, like, oh, what are you talking about, Dave? But as an example, like, there's I think it's called f star. It's a project. It's a language that's trying to rewrite the HTTPS protocol, like, from scratch.

But what's cool about the language is it's a verification language. Like, as it compiles, it, like, formally verifies if it's correct, like a proof of correctness type of thing.

And that makes total sense for, like, something like HTTP. Right? Because the whole Internet depends on that. Like, these protocols are so

fundamental to our entire infrastructure that if that has vulnerabilities,

we're screwed.

Why I bring it up? Because I think that that

somehow being mixed into the ML space would be amazing. Now we don't really have there's not lots of formal proofs of correctness for machine learning that I have seen, but it would be cool to introduce

more formal,

yeah, more formal proofs to machine learning. And I relate that to the the MLOps space because imagine a simple application like security.

Security is also becoming a really interesting problem in the MLOps space, like you think about. I wanna say federated

learning, but that's part of it. But there's another name that I forgot what it was. But Generative adversarial networks?

It's related to it. But, like, I guess, like, this whole space of now we're not just, like, learning on data, but now we need to do it securely, right, in a way that we can trust is, I think, a really interesting problem. You're talking about synthetic data?

No. No. It's just more so, like, I haven't seen anything like this, but the intersection of, like, formal proofs of correctness or formal verification and machine learning. I think there are already machine learning models that do that. Right? They can, like, predict like, a lot of mathematicians are, like, exploring the area where machine learning can be used for math. Not necessarily talking about that. I'm more thinking about the application side of it actually being used to support some of the infrastructure. Again, this is, like, probably totally out there, but it was just something interesting with the question. Right? Yeah. And totally out of left field. And not quite answering

your question, Tawise, but

something that is

what I feel like is important to say is a lot of companies

think that they need to do machine learning, and so they hire a data scientist when really they need a data engineer.

And you've probably seen this more than anybody,

but

some way to get that kind of education

out there

and make sure that people understand

like, yeah, a data engineer is probably the foundation

that you want to start with as opposed to a data scientist.

Because if you don't have these foundational

pipes put in place, your data scientist isn't going to be able to do much, or you're going to ask of them to do things that a data engineer should be doing. And even though we just got done talking about how, oh, yeah. Data scientists should know a little bit of data engineering, and data engineers have gradient view of looking at it. So

I'll go back on that and just say, like, yeah. When you're in that foundational piece and you're figuring out how to do things,

that's 1 thing that I will advocate for. Like,

probably

hire a data engineer first before the data scientist, and really see what kind of data you've got, see

if it's valuable, and if you can do something with it. In your own experience of working in the MLOps space, and in your case, Dimitrios, in particular, helping to foster this community around it, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?

I don't know

if people find it as funny as I do, but

how to deal with vendors?

Like, it's really rough because

you've got a community. It's very engaged. It's very active.

It is the target audience. I understand I used to be in sales. Remember, I understand that you wanna go in there, and you wanna sell to every single person that is in the community, but you're not making any friends by just going and spamming the community. And so really trying to figure out, like, what is spam? What how do we deal with spam?

And how do I deal with people that are spamming?

Like, it's just been a long road, and at least once a week, I have to deal with somebody coming in and

putting something where they shouldn't be putting it or putting it in every channel. And

I guess that's

my karma

for

being that guy for 2 years.

I'm the type of learner. I like to learn things from first principles and work my way up. And so I find like, I'm back in grad school. I'm I'm learning things. I'm I'm studying computer science.

A lot of the things that I

deal with in my work are more cultural, but what I think about and I'm interested are kind of the technical challenges there. And I'm learning things from scratch to see

how things could be made better using very fundamental things. And my opinion, a lot of the biggest innovations in this tech are

small small insights, small little things that make a big difference. And so 1 thing I have seen is in practice where

big impact can come from learning the fundamentals and knowing the fundamentals well. And so I guess within the MLOps space is what are the things that allow our work to be reproducible? What allows us to communicate things effectively? These seem like kind of, like, assumed things or givens, but they're not. You know, learning how to collaborate well is a challenge, and these are the sort of things that I deal with the most. You know, how do I get buy in for my idea to move to this new tool? How do I get buy in for this new process that I think will make things better but will require the data scientist to learn this or have to do that? How do I convince

data scientists to do this? It's like these sort of kind of cultural

human problems that I think about in the day to day, but I look for inspiration

from, like, basic things, first principles.

You know? So for example, like, in the DevOps space, there is a lot of books and literature on the organizational

aspect of a DevOps team, and I think that has been helpful. But,

again, it's like it's not enough because there is these unique challenges that keep coming up, things that I didn't anticipate before, and I just find that I'm sure it's like that in the data engineering space, but I found that especially in, like, the the machine learning space. There's just always something that comes up that that I didn't think about, some challenge that I couldn't anticipate.

I think that's what makes it super exciting too though, but it's those kind of unknown human

challenges that you can't really predict and just solve with some algorithm or some tool

that I find take up most of my time and most of my energy.

Yeah. Maybe I'll retract my last statement so I'm not just shitting all over vendors who

piss me off.

And because it is a fine line, right? Like, especially, oh, if we have sponsors and but whatever. That's a whole another topic of discussion. And I'll use this 1 instead, which is

1 thing that has surprised me, and it's been just an incredible

surprise, is to see how

many people

are so excited

for this space that they want to just contribute, and they wanna get involved. And it's not like there's an open source project behind the MLOps community.

There's not something that people can go and hack on that you would normally see with community projects. But

there's so many people that are interested in this space moving forward and it becoming more mature.

And also so many people that are struggling with

it, and they come to this community and they find like their little safe haven and they can talk to people. And I think a lot of people needed that over the last 2 years, especially

when you couldn't just turn to the person that you were working with and ask them. And maybe you don't even have somebody you're working with. So for me, it's been just the power

of the community, the power of the people in the community to raise their hand, volunteer,

do all kinds of super cool initiatives that I would never be able to do myself,

but they are

very proactive about it and they get it going. Like, I mean, David's 1 of them. He's

doing system design reviews, and we get to

use the sponsor's money for good things where him and another guy from the community, they read these very technical blogs that are coming out from the Ubers and the Airbnbs of the world, and then they create animation videos that try and distill it and break it down

and make it into

something that you can digest,

I mean, in a much easier way than reading through this very dense blog post. So that's an incredible initiative. I mean, there's a million other initiatives that we have that I think are

so cool, and that's been probably the most unexpected thing when

when this community started. I would have never thought that there would be

so much energy around it. As you continue to be engaged with and work with the overall MLOps community in this particular problem space, what are some of your predictions for the near to medium term future of what people are going to be doing? What are the capabilities that are going to be unlocked? How are things going

to coalesce or expand? And what are the pieces that you are personally keeping a close eye on?

I'll say 1 quick 1, and then I'll let David go. Because I wonder about,

like, machine learning as a service,

and if we're going to start seeing more vertical machine learning, like

recommender systems as a service, I've seen starting to happen. And maybe there will be more of that as opposed to

a company trying to bring it in house. And so that's something I'm really excited about. I wonder how that space will play out and if that will

will see the light of day and become

1 of the ways that we do it.

As far as the future is concerned, man, this space is full of surprises. Right? I mean, I'm so young. You know, I'm only 30 years old and have been in the space for 5 years now. And it's funny, like, when people ask me about trends, I always think, like, man, I haven't been here long enough, but a lot of things have actually happened in these 5 years. I'm starting to

make my own predictions to start thinking about things. And I've kinda hinted that earlier, but I do think that security is gonna be

it's already important, it's I think it's gonna become even more important as machine learning becomes even more ubiquitous because

it's 1 thing to have it shipped and and available, but doing it in a way where we're not

exposing certain piece of information. Like, you know, I I think about how data is now becoming a much more valuable asset. You know, you think about GDPR and how that's gonna affect our machine learning workloads. I think that's gonna be a big problem later if you're, like, not thinking about that now.

You know, kind of baking that into your tool afterwards may be a big challenge if that's not a philosophy or a or something that you thought about from the very beginning, first principles.

And so I think that's cool. And then the second thing is efficiency.

My prediction is that there will be more innovations in the compute space that make it more efficient and make it easier to train a model,

make it more scalable to train a model. Right now, models are really big, especially like deep learning models. And that's a bottleneck for a lot of people because either you are,

you know, you have a high performance computing team available, you have a supercomputer, or you that's what you guys do or else you're kind of dependent upon the cloud and that's very expensive. So making that more

available and accessible to people more efficient is where I see things going.

And definitely the community part, I see that being a big innovator. Someone brought this up, I think, in a previous podcast where

communities help these certain technologies thrive and make it better. So curious to see what that's gonna look like in the MLOps space. There's already lots of open source tools. Maybe it's not gonna be something like that. It could be something different. Maybe

kind of a group come up with, like, a coherent theory of what MLOps is or something like that and propagating that to the rest of the world. Who knows? But I see it being interesting, the community being a very important factor. Not just MLOps, but just in general MLOps.

Yeah. You sparked

the thought in my mind too about

the EU regulations

that are coming. There's proposed EU regulation on

AI and how that will affect things, and how that will affect

the way that

machine learning is being done and data is being collected and data is being kept, like you were talking about with GDPR.

And then also the other idea that I had too while you were talking is,

like, around standardization.

Are we going to standardize things? Is it just going to be like that meme where it's like, yeah, we wanted a new standardization,

and then it just creates another

framework that nobody uses. And so

how is that going to look?

That's

fascinating to me. Like, I think it was David Arincic who

said 1 time when he came on the podcast, he was like, we need contracts with a small c. Not like these contracts that we have to force people to use, and they're not, like, going to get people and they're not gonna back them into a corner, but

still something that we can

find a common ground

and work from there.

Because right now, it's very fragmented as anybody who knows the space will tell you.

Are there any other aspects of the overall space of MLOps, the role of the data engineer in that ecosystem, or the work that you're each doing that we didn't discuss yet that you'd like to cover before we close out the show? Just kind of related thought. I'm sure there's lots of up and coming data engineers listening. You're thinking about what do I focus on? What tools are relevant? If I wanna get in the MLOps space, what do I need to do?

I would say that's you. You know, skills are usually what helps you get far, in my opinion, like, if you have a broad skill set. It's nice to be focused in some areas, but having

something as simple as having good documentation, knowing how to write good documentation goes a really long way. Being able to work with different stakeholders and winsomely communicate your ideas to them. These are things that usually you think about after the fact, but I would argue you should start thinking about them now. I think it will go a long way in your career,

especially in a very interdisciplinary

space like MLOps where you have lots of different personas involved. So learning how to not everyone's gonna understand what you know. So it's important to help people understand what you know and why they should know what you want them to know. It's something I regularly deal with. And I would say in terms of tools,

there's so many, but let's start with languages. Python is obviously the workhorse for a lot of data science. It is worth knowing. As an engineer, it may also serve you to learn something a little bit more level, maybe gives you more manual control over things. I'm thinking like c plus plus or something. And I I do think that if you know lots of different tools, not all of them, but enough to make you versatile, it makes you more independent. And I think that's a really valuable skill in an MLOps team when we kinda need people to do it all in some respect, especially when you're first starting.

So that will make you a valuable asset. People do like that where, you know, it's maybe you don't know all these things, but you're interested in learning these new tools. I think that really helps, and that's about it. Get involved in a community too. I would also say that's a big thing because even if it's something as simple as leading a reading group or participating in reading groups, like, this getting involved in the community,

it's hard to describe, but it does help you in your career. I I wanna take some more time to think about that, but I do know personally it has

brought in even what I'm concerned about, and it makes that a little bit easier to navigate, especially as I go into work and deal with these unique challenges.

Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, I'd like to get your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.

Goes back to what I was talking about earlier on the

ideas about

how to give more visual,

how to

make things more transparent so that people can recognize,

like, the butterfly effects that they're creating downstream.

And if they do 1 thing, then they understand

the full

outcome.

That's 1 piece. And then, like, how to not centralize or maybe not centralized is the word, but just how to

have the knowledge hub.

And what is the best way of creating a knowledge hub for machine learning and when you build the processes out? How does that look? We've been working with data for a long time, so there's lots of, like, nice specialized databases

and lots of tools and even languages that work well with data. But I I think it's

such a cheap answer. But, like, you know, like, the governance aspect of it, there is all these new concerns, legal concerns that are coming into play that I think I often think about.

There are some tools that I have seen, like, in Microsoft has a lot of custom tools and stuff that they think about this. But I'm thinking about, like, where that's just like, you know, how data warehousing, there's some query language. That's a important part of, like, that tool.

I wanna see something like that. There are tools that prioritize it. I'm thinking of

Moses Barr, I think is her name. I love, like, kind of her philosophy of data and data management and data governance. I think there's lots of cool tools that are focused on, like, latency and maybe scalability, but maybe not so much on, like, these other kind of it's like that kind of management part, the boring stuff that I don't like to think about, but I just want to be taken care of. So it's not that so much as a gap, but I I would love to see more of that.

Alright. Well, thank you both very much for taking the time today to join me, share your experience and perspectives on the overall space of MLOps and how the data engineer fits within that context. So I appreciate all of the time and energy that you're each putting into that, and I hope you enjoy the rest of your day. Thank you so much, Tobias. Likewise.

For listening. Don't forget to check out our other show, podcast.init@pythonpodcast.com

to learn about the Python language, its community, and the innovative ways it is being used.

And visit the site at data engineering podcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts at data engineering podcast.com

with your story. And to help other people find the show, please leave a review on Itunes and tell your friends and coworkers.

Data Engineering Podcast

Summary

Announcements

Interview

Contact Info

Parting Question

Links