Summary
The Data industry is changing rapidly, and one of the most active areas of growth is automation of data workflows. Taking cues from the DevOps movement of the past decade, data professionals are orienting around the concept of DataOps. More than just a collection of tools, a proper DataOps approach depends on a number of organizational and conceptual changes. In this episode Kevin Stumpf, CTO of Tecton, Maxime Beauchemin, CEO of Preset, and Lior Gavish, CTO of Monte Carlo, discuss the grand vision and present realities of DataOps. They explain how to think about your data systems in a holistic and maintainable fashion, the security challenges that threaten to derail your efforts, and the power of using metadata as the foundation of everything that you do. If you are wondering how to get control of your data platforms and bring all of your stakeholders onto the same page then this conversation is for you.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Your host is Tobias Macey and today I’m interviewing Max Beauchemin, Lior Gavish, and Kevin Stumpf about the real world challenges of embracing DataOps practices and systems, and how to keep things secure as you scale
Interview
- Introduction
- How did you get involved in the area of data management?
- Before we get started, can you each give your definition of what "DataOps" means to you?
- How does this differ from "business as usual" in the data industry?
- What are some of the things that DataOps isn’t (despite what marketers might say)?
- What are the biggest difficulties that you have faced in going from concept to production with a workflow or system intended to power self-serve access to other members of the organization?
- What are the weak points in the current state of the industry, whether technological or social, that contribute to your greatest sense of unease from a security perspective?
- As founders of companies that aim to facilitate adoption of various aspects of DataOps, how are you applying the products that you are building to your own internal systems?
- How does security factor into the design of robust DataOps systems? What are some of the biggest challenges related to security when it comes to putting these systems into production?
- What are the biggest differences between DevOps and DataOps, particularly when it concerns designing distributed systems?
- What areas of the DataOps landscape do you think are ripe for innovation?
- Nowadays, it seems like new DataOps companies are cropping up every day to try and solve some of these problems. Why do you think DataOps is becoming such an important component of the modern data stack?
- There’s been a lot of conversation recently around the "rise of the data engineer" versus other roles in the data ecosystem (i.e. data scientist or data analyst). Why do you think that is?
- What are some of the most valuable lessons that you have learned from working with your customers about how to apply DataOps principles?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while building your respective platforms and businesses?
- What are the industry trends that you are each keeping an eye on to inform you future product direction?
Contact Info
- Kevin
- kevinstumpf on GitHub
- @kevinstumpf on Twitter
- Maxime
- @mistercrunch on Twitter
- mistercrunch on GitHub
- Lior
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Tecton
- Monte Carlo
- Superset
- Preset
- Barracuda Networks
- Feature Store
- DataOps
- DevOps
- Data Catalog
- Amundsen
- OpenLineage
- The Downfall of the Data Engineer
- Hashicorp Vault
- Reverse ETL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
RudderStack is the smart customer data pipeline. Easily build pipelines connecting your whole customer data stack, then make them smarter by ingesting and activating enriched data from your warehouse, enabling identity stitching and advanced use cases like lead scoring and in-app personalization. Start building a smarter customer data pipeline today. Sign up for free at dataengineeringpodcast.com/rudder. Your host is Tobias Macey. And today, I'd like to welcome back Max Beauchemin, Lior Gavish, and Kevin Stumpf to talk about the real world challenges of embracing data ops practices and systems and how to keep things secure as you scale. So, Max, why don't you start by introducing yourself? First, thank you for having me on the show again. I think your audience might know me from previous appearances, but I will still give a very brief intro. My name is Max,
[00:01:45] Unknown:
the original creator of Apache Superset and Apache Airflow. And more recently about 2 years ago, I started a company offering Apache Superset as a service.
[00:01:56] Unknown:
And Lior, how about yourself?
[00:01:58] Unknown:
Hi, everyone. I'm Lior. I'm co founder and CTO of a company called Monte Carlo. I help teams with data observability and data ops in general. Before Monte Carlo, I worked at a cybersecurity company called Barracuda Networks, built machine learning based products for fraud detection,
[00:02:16] Unknown:
and I'm excited to be on the show today. And hi, everyone. My name is Kevin, and I'm the cofounder and CTO of Tecton. Tecton provides an enterprise-ready feature store which solves all the data challenges around productionizing machine learning. And before having founded Tecton, I was at Uber working as an engineering manager and tech lead on Uber's machine learning platform, Michelangelo.
[00:02:39] Unknown:
So you've all been on the show before, so everybody has heard about how you each got into the area of data management. So we'll skip that question for today, and I'll add links to the show notes for people to be able to go back and review your previous appearances. And so before we get too far into the topic at hand today, I'd like to go around again and just get each of your definitions
[00:03:00] Unknown:
of how you think about the term data ops and what that means to you. And so starting with you again, Max, if you wanna share your definition of the term. So I think there's some interesting parallels to be drawn here. So, you know, the term ops gets kind of suffixed onto things to represent, I guess, like the daily, weekly workload around these things. So marketing ops, sales ops, and now data ops, and there's this question of what does that mean in the data landscape or as part of a role or function in the data team. And I don't know. I think there's some interesting parallel to be drawn, maybe what DevOps is to software engineering or maybe QA in some cases too. I don't know exactly how that relates, but I think it's interesting.
In the same way around DevOps and QA, I don't really think it should always be a specialty. Right? Sometimes you're like, oh, quality and DevOps is kind of the responsibility of everyone in the company. So it's like, sometimes it can be a specialty in a large enough organization, but I really think that, you know, people should manage the operational side of their functions. So if you're a data engineer, you should probably, you know, own your own ops. And then maybe in larger companies, there are people that really specialize in driving kind of the operational maintainability, right? Like this idea of making sure that things are working well on a daily, weekly, monthly basis.
People are, like, building less stuff, but making sure that the stuff that's been built works well. Lior, how about you? What do you think about when you hear the term data ops?
[00:04:30] Unknown:
I draw a lot of the parallels from DevOps, as Max said. I agree. I think DataOps is oftentimes a reference to the methodology and the tools and the people that help data teams build reliable data products. Right? You know, data teams are building dashboards and machine learning models and other sorts of analytical products. And DataOps is the process and the tools that help them do that with high quality and high reliability. But also and this is again a corollary with DevOps. It's also about doing that quickly with fast development cycles in a way that's secure, that's compliant, and that meets a lot of the other requirements that the organization has. So it's a bit of a broad term, and I think it's still being shaped by the industry and by the companies that are here today and others. It's definitely something I'm very excited about. To me, DataOps really means bringing the DevOps practices to data management. And what are those DevOps practices? It's, I think, like, if you look at how it's applied to the software engineering practice,
[00:05:41] Unknown:
you have your source code repository. You change a line of code. You check it into a git branch. You get somebody to review a pull request, then it gets merged into master. Now you've got your CI/CD pipeline where your code is being built. It's being tested. It's going to be bundled up in a Docker container, and then it's going to be thrown into your Kubernetes environment where now your container is hosted and runs as a service. So you've got this really nice, fairly automated process that makes it extremely easy to develop and deploy software applications like your microservices and whatnot. And to me, DataOps brings those exact same lessons learned to data management, to the development and deployment of data pipelines, and also to the output of those data pipelines. And that's, to me, where the biggest difference actually comes in with DevOps, where with DevOps, you really develop and manage 1 class of artifact, and that's like your software application, your, say, microservice.
But with data ops, you're really concerned much more with 2 different types of asset categories, if you will. 1 is the data pipelines that turn raw data, or some data, into other data artifacts. And then the other class, or category, of artifact that's being managed with DataOps is the actual output of your data pipeline, which also needs to be tested. It needs to be versioned. You need to manage the lineage and whatnot. And so that's to me where the difference comes in. DevOps just deals with software artifacts, and DataOps deals with data pipeline artifacts and the output of those data pipelines.
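As a rough illustration of that second artifact category, treating a pipeline's output as something you test the way CI tests a software build, here is a minimal Python sketch; the table, columns, and thresholds are hypothetical placeholders, not anyone's actual pipeline.

```python
# A minimal sketch of testing a pipeline's *output* the way CI tests a software artifact.
# The table name, columns, and thresholds are hypothetical placeholders.
import pandas as pd


def check_orders_output(df: pd.DataFrame) -> list[str]:
    """Return a list of failed expectations for the produced table."""
    failures = []
    if df.empty:
        failures.append("output table is empty")
    if df["order_id"].duplicated().any():
        failures.append("primary key order_id is not unique")
    if df["amount"].lt(0).any():
        failures.append("negative order amounts found")
    null_rate = df["customer_id"].isna().mean()
    if null_rate > 0.01:  # tolerate at most 1% missing customer ids
        failures.append(f"customer_id null rate too high: {null_rate:.2%}")
    return failures


if __name__ == "__main__":
    produced = pd.DataFrame(
        {"order_id": [1, 2, 3], "customer_id": ["a", "b", None], "amount": [10.0, 5.5, 3.2]}
    )
    problems = check_orders_output(produced)
    # In a CI/CD pipeline this would fail the build instead of just printing.
    print(problems or "all output checks passed")
```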
[00:07:19] Unknown:
Yeah. 1 of the common terms that you see a lot in the DevOps sort of materials is shifting left where it used to be that, you know, the software engineer was the leftmost portion of the pipeline, and then it would go into QA, and then it would go to operations. And, you know, it was kind of pushing all of the responsibilities of software delivery onto everybody so that, you know, more pieces moved left into the software engineer's responsibility and the people who were traditionally later on in the life cycle became involved earlier in that process as well so that rather than having it be drawn out and you would say, you know, I'm gonna work on this month long cycle of having this artifact, and then I'm gonna hand it off to the QA to see, is it all broken or not? And then once that's done, I'm gonna hand it off to operations to set their hair on fire to try and get this thing to run in production because I never bothered to see if it works anywhere else other than my machine. And I think that there's a similar level of maturity that's happening in the data ecosystem of people realizing that, you know, because of the interconnectedness and all of the complexity that comes into working on data products, you need to have everybody involved earlier in the process and shift a lot of that left where, you know, the person who's building the pipeline needs to understand what are the actual needs of the people downstream who are building the analytical dashboards and just having that complete feedback cycle and maybe sort of shortening the time windows that people are working in and sort of reducing the size of the deliverables so that you can have those tighter feedback loops rather than having these months long projects that end up getting derailed because the question that was asked 3 months ago is no longer relevant to the business today.
[00:08:56] Unknown:
I agree. I think that's 1 of the biggest problems that are being tackled with the data ops practice and the platforms around it, where you can now much more easily develop and deploy your, say, data pipelines or anything around data. And that's, like, 1 of the core problems it solves, but I think it goes beyond that where it now solves other still pretty prevalent problems in the data space around, like, for instance, like, who's the owner of a data artifact that you may be using? What's the version of your data if you're changing it? What's the lineage of it? Like, where does the data come from? What are all the upstream transformations? What are the upstream data sources, and how did they get produced in the 1st place? Does your data look healthy?
Is anything broken here? And that's where Lior's company, Monte Carlo, of course, does a great job helping you out and figuring out if things are broken or even ensuring that your data is being created in standardized ways that are easy to understand, easy to debug, easy to hand off to other types of data engineers. These are all, like, different really, really important problem categories that are tackled with the DataOps
[00:09:57] Unknown:
practices and processes as well as the tooling that's being built around it in the ecosystem. And 1 big kinda subtopic here or thing that's very related is just metadata management. Right? For people to do data ops well, it requires having all this metadata that's well managed and well exchanged and surfaced in the right place at the right time for the right people. So we've seen, you know, the rise of, like, the data catalogs slash, like, metadata search engine move and projects like Amundsen, right, or data portals.
That's kinda 1 window into the metadata that is useful in all sorts of ways, but in part to data ops type people. It's interesting to decompose the nature of what kind of metadata exists and what's most relevant to the data ops function. So there's business metadata, there's lineage metadata, there's operational metadata, like which job ran where. There's, like, social metadata, who's using which asset. And all this information seems highly relevant to data ops. So maybe data ops tooling is really just some sort of, like, nice window to power the right processes
[00:11:07] Unknown:
around all this metadata in the company. You know, with DevOps, it kinda seems like the centralizing force that kind of made it possible was the ubiquity of the CICD pipeline where everybody could look to see where are we in the life cycle of delivering this artifact. And in the data ops analogy, it seems like the metadata system and the data catalog is that same organizing force where everybody can look to see where are we in the life cycle of this project, where is this data asset in relation to the overall flow of data through the system so that everybody can kind of coordinate around that without having to have a huge amount of communication overhead to be able to, you know, send around Slack messages to say, what's happening? Where did this data artifact come from and why? So there's that sort of central piece that everybody needs to have in place for them to really be able to effectively start to adopt, quote, unquote, data ops. I agree with both of you. The idea of collecting
[00:12:04] Unknown:
a lot of metadata and logs and metrics, and I'm borrowing here from the kind of application observability space and making that data available to the right person when they need to do their job is kind of a critical part of DataOps. And I think there's a lot of companies that are trying to capture that metadata for various purposes, and that's very powerful. Right? Kind of taking all that information and applying it to a particular use case, whether it's you know, in our case, it's a lot about figuring out whether your data is healthy. And if it's not, how do you troubleshoot it? But there's also obviously other use cases where you want to figure out which dataset to use in the first place, kind of like a data catalog scenario or which features you should use for your model. Or some of our customers are struggling with compliance and governance challenges where they need to figure out where they have sensitive information and how to make sure it only gets to where it needs to be. So I think data ops will end up collecting all that metadata and organizing it in a way that allows different people and different use cases to, to operate more effectively.
[00:13:12] Unknown:
Yep. And 1 thought here is, you know, you see all these different systems emerging, like data catalogs and data quality centric systems and, like, you know, data ops centric systems that help you operate your pipeline on a daily basis. The slightly more kinda ML engineer oriented version of that, right, which is closer to Tecton, maybe, right? So what I see though is conceptually, we kinda need a metadata data warehouse, right? At Facebook, we had something called data graph, which was kinda this graph database of all of the nodes and the edges in the metadata catalog. Right? Like, what are the objects, the dashboards, the charts, the pipelines, the tables, the data sets, the virtual data sets, the views, and a big inventory of that with edges as things like usage, lineage, ownership, right? So we built that at Facebook. We built that at Airbnb too as a foundational layer to something called data portal. And once you have this, like, metadata all in 1 place, then you can create all the views on top of that and create a window that's specific to someone debugging a pipeline or, you know, someone working on an ML model that needs to have a different view or a different window into that world. Then there's a search engine use case too that's like, hey, I'm looking for, you know, I wanna see how we're doing with bookings at Airbnb.
Then I just wanna type bookings and see if there's any dashboard dataset I might be able to use. So 1 thing that's challenging is we see all these emerging kind of tooling around it without a centralized standard around like how to store and communicate and exchange that metadata. So I wanted to give like a pointer. I know I did it on the previous podcast too, but towards Open Lineage. There's a GitHub project called Open Lineage that's trying to centralize some of the ideas around, like, how to exchange and surface and store metadata around that. And that's, like, the beginning of maybe getting towards, like, a metadata data warehouse standard that different toolings could support. On my end, you know, I'm more on the data consumption layer. Right? But there's a lot of value in providing a certain window into that world. I think, like, you guys have a slightly different window, but it's all the same metadata conceptually.
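To make the idea of exchanging metadata concrete, here is a rough sketch of the kind of run/lineage event a pipeline could emit so catalogs, observability tools, and feature platforms all read the same record. It is loosely modeled on the event shape OpenLineage standardizes; the names and values are illustrative and this is not the exact client API.

```python
# A rough sketch of a run/lineage event that any backend understanding a shared
# format could ingest. Loosely modeled on the OpenLineage event shape; names and
# values here are illustrative.
import json
from datetime import datetime, timezone
from uuid import uuid4

event = {
    "eventType": "COMPLETE",                      # the run finished successfully
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid4())},
    "job": {"namespace": "airflow", "name": "daily_bookings_rollup"},
    "inputs": [{"namespace": "warehouse", "name": "raw.bookings"}],
    "outputs": [{"namespace": "warehouse", "name": "analytics.bookings_daily"}],
    "producer": "https://example.com/my-pipeline",  # hypothetical producer URI
}

print(json.dumps(event, indent=2))
```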
[00:15:28] Unknown:
Yeah. I couldn't agree more. I think where we're heading over time is that at some point down the road, you actually just have, like, this 1 mega DAG that you can look at, and you see all the different data assets that are being produced. You see all the different upstream data sources. You see the transformations in between, and you can zoom out in this gigantic web, and you see, hopefully, mostly, everything is green, and then maybe there are a couple of red things and a couple orange things where you need to zoom in to see, hey. Something's going wrong here. Maybe you need to retry something or maybe you need to debug something, but you get this entire complete DAG of the creation and the consumption of your data assets.
And the boundaries of your DAG are really just the places where maybe you are getting new data in from some data vendor from whom you're purchasing data or maybe some raw events that are being emitted by your mobile application out in production. But then within your infrastructure, within your company, you have this fully connected DAG with full visibility, full monitoring, full lineage information. And today, the reality in pretty much every company is that you only have these partial DAGs that may even be purpose built just for a team and only show the data assets that this 1 team produces and consumes. And it's completely disconnected metadata wise from the other partial DAGs of all the other teams even though implicitly there is actually a physical connection between these DAGs, because 1 team creates some data asset, dumps it onto S3, and then at some cadence, the other team's DAG actually pulls it and then loads the data from that S3 location.
And I think over time, these partial DAGs of today are all going to be more and more connected, and it's becoming more and more transparent. And you get this more global view on what is actually happening in your organization. And, of course, eventually arriving at this nirvana state will require more standardization and actual standards that these different teams and whatnot can use to manage and control their metadata.
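As a toy illustration of what that fully connected DAG buys you, here is a minimal sketch of answering "what is downstream of this broken asset?" with a single graph traversal; the asset names are invented for illustration.

```python
# A toy version of the "one mega DAG" idea: once assets and their edges live in one
# graph, finding everything downstream of an unhealthy table is a single traversal.
# Asset names are invented for illustration.
import networkx as nx

graph = nx.DiGraph()
graph.add_edge("raw.events", "staging.sessions")
graph.add_edge("staging.sessions", "analytics.bookings_daily")
graph.add_edge("analytics.bookings_daily", "dashboard.exec_bookings")
graph.add_edge("analytics.bookings_daily", "ml.feature_store.booking_rate")

broken = "staging.sessions"  # e.g., a freshness or volume anomaly was detected here
impacted = nx.descendants(graph, broken)

print(f"{broken} is unhealthy; downstream assets to flag: {sorted(impacted)}")
```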
[00:17:26] Unknown:
Expanding a little bit on that. Right? People used to talk about data siloing. Right? Like, you have the data silos, and it's kind of an issue. I think we have, like, metadata silos now that we need to break through. You know, it can be solved in different ways. If we all adopted just 1 very standard, easy stack, like a PaaS that manages, you know, metadata consistently, it'd be great because then you would have all the information in 1 place. But what we see is still the divergence and, you know, there's still like, people use a huge patchwork of tooling that stores metadata in different ways without having any, like, standard as to how to exchange it. So the first phase of companies we're seeing in the space are, like, trying to stitch that metadata together. Right? Like, because it's in different formats, there's no standards. So the best you can do is to try to go common denominator and align it and bring it in 1 place. And I was part of teams that tried to do that at Facebook and Airbnb, and it's fairly challenging because not everyone thinks of a task or a pipeline or workflow or a dataset in the same way. And then if you get into the details of, you know, what are the key attributes of a task or task instance, you know, Airflow is opinionated on that. I'm sure all the other tools in the space are equally opinionated. So there's a lot of common denominator, but not always the same. And when you bring them together, it doesn't always jive very well.
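A rough sketch of the "common denominator" stitching described here: mapping each tool's own notion of a task run onto one shared record. The incoming field names are hypothetical stand-ins for whatever each scheduler actually exposes.

```python
# A rough sketch of stitching metadata from tools that each model "a task run"
# differently: map every source into a lowest-common-denominator record.
# The incoming payload fields are hypothetical; real payloads differ per tool.
from dataclasses import dataclass


@dataclass
class TaskRun:
    pipeline: str
    task: str
    status: str      # normalized to "success" | "failed" | "running"
    started_at: str  # ISO 8601 timestamp


def from_scheduler_a(payload: dict) -> TaskRun:
    # e.g., an Airflow-like payload
    return TaskRun(
        pipeline=payload["dag_id"],
        task=payload["task_id"],
        status=payload["state"].lower(),
        started_at=payload["start_date"],
    )


def from_scheduler_b(payload: dict) -> TaskRun:
    # e.g., some other orchestrator with its own vocabulary
    return TaskRun(
        pipeline=payload["flow"],
        task=payload["step"],
        status=payload["result"],
        started_at=payload["started"],
    )
```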
[00:18:48] Unknown:
Yeah. And I agree. It's 1 of the biggest challenges. Like, when we started Monte Carlo, we kind of wished there was this, like, central metadata repository that we could integrate with and get all of the stuff that we need. And we kinda had to reconstruct all of that on our own and kind of build integrations with every individual technology and collect the metadata and reconstruct lineage and do all of that. I'm hoping that the next generation of data ops companies will be luckier than us and can start from the central repository. And then I'm hoping that we, as an industry, can align around a certain standard or around a certain metadata repository that can serve all of us because I think there's so much work to do in solving some of the more specific use cases that rebuilding that metadata system over and over again is probably
[00:19:36] Unknown:
you know, it's going to be a challenge. We won't be making enough progress on the problems that we're trying to solve. I think this generation of companies and tooling and open source products can figure it out. Maybe it's not too late or at least, like, we can set the stage for that to happen. Right? And I think that's what open lineage is trying to do or, like, is a step in that direction. Hopefully, we don't have to wait until the next generation. You know, for us, we're committed to embracing the standards as they emerge on the Apache Superset side, but the standards are still emerging. So there's some challenges there.
[00:20:07] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting. It often takes hours or days. DataFold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. DataFold integrates with all major data warehouses as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of DataFold. Once you sign up and create an alert in DataFold for your company data, they'll send you a cool water flask. Now that we've kind of painted this picture of the sort of data utopia that everybody wishes they had with, you know, visibility into every system and a fully connected graph of metadata to be able to understand where everything is coming from and where it's going. What do you see as the biggest weak points in the kind of availability of
[00:21:24] Unknown:
a road map for companies to get from where they are to where they want to be and the capabilities of the systems that we're building to be able to even realize that goal. I can talk on some of the challenges around building that data graph. Right? Like, it's really hard to build it. Even if you define, like, you know, my nodes are gonna be data objects and my edges are gonna be things like usage and lineage, as soon as you bring in the time component into it, like how this is changing in time, it becomes like a much harder problem if you bring in other things like column lineage and a little bit more, you know, intricate edges and different types of nodes too. It gets very complex very, very quickly. You know, we're loading that into a graph database with a whole history of change over time and even, like, providing an interface for people to navigate the complexity. And this space at just the Airbnb scale and Lyft scale was extremely challenging of a problem. So that's kinda part of the problem, what we're trying to solve here. I think, more generally, in this space, there's some bigger problems too, right, like, in terms of, like, being a data engineer. I wrote an article a long time ago called The Downfall of the Data Engineer that tries to really depict why it sucks to be a data engineer, and that was back in maybe 2016 or so. And I think, like, a lot of the things in this space have not been addressed too. So outside of the metadata management problem, there's, like, some bigger issues to be tackled either with tooling or with culture in some cases.
[00:22:57] Unknown:
To me, like, 1 of the biggest challenges in getting to this utopian place that we discussed is, like, at the end of the day, people have their tasks and they have the problems that they wanna solve. And so they just think about, okay, how do I, in the fastest way possible, somehow solve this problem at hand that I have? And there are some tools that I'm well trained on that I know how to use in order to get my problem out of the way. But if I just solve my problem with the tooling that I'm used to, then, okay, the problem will be solved, but now, for the greater good of the company, so to say, or for the greater good of your future self, it's not going to integrate, say, with these systems that give you this proper metadata visibility that we're all talking about. And so you just implement these 1 off solutions here and there to solve the problem at hand right now with the tools that you're aware of to just move on to the next problem that you can then afterwards tackle. And at Uber, when we built Michelangelo, 1 thing that we saw there was that we needed to provide people with a centralized platform that solves these higher level visibility problems for the company, and to entice people to actually use this platform in the 1st place and to not continue using their own way of solving problems in their 1 off fashion, you had to provide them with a benefit at the end of the day. Right? You had to give them something that would incentivize them to use this platform. In our case, it was the fact that now they could productionize things much, much, much faster.
They would not be as free anymore, so to say, as they were beforehand. They couldn't use whatever tool anymore that they wanted, whatever framework anymore that they wanted. They had to use our standardized platform, but they got certain benefits from it, the rapid productionization, the rapid monitoring, all that type of stuff. And what did the company get from it? It got standardization. It got full central visibility into what's actually going on in the company. And so that's how we were able to create a win win situation at Uber with Michelangelo, and we brought those exact same lessons over to Tecton with our data platform for machine learning, the feature store, where the exact same thing applies, where people are able to productionize their data pipelines for machine learning much more quickly, and they do get these central visibility
[00:25:09] Unknown:
benefits in return for that, beyond, of course, many others. Yeah. It's definitely something that I see a lot of places, both in data and in regular software engineering and in operations, is just being able to get that flywheel spinning in the first place where you have the thing that's in front of you and you just need to get it done. But you know that if you do it in a little bit of a different way or if you try to bite off a little bit more than just this 1 task, that it'll end up getting you further down the road than just finishing this 1 thing. And, eventually, over time, you'll be able to build up this flywheel that will allow you to deliver more things faster than if you just focus on just the task at hand. And just having the kind of rigor and motivation to put in that little bit of extra effort each time is really where things start to pay dividends.
[00:25:52] Unknown:
I think the opportunity that we have right now to actually accomplish this vision is that the world is moving to cloud and specifically to a smaller subset of data technologies, if you will, or at least data stacks. So in a world where we're using, you know, the same set of a small number of warehouses, small number of data lake technologies, small number of feature stores, and small number of BI tools, you can envision a world where it is not too costly to start building that DAG. Right? I think, historically, there's a lot of homegrown stuff, a lot of on prem stuff. It was Wild West, and I'm starting to see a lot more consistency in the golden stack that everyone's using, and that's the opportunity. Right? We can automate a lot of that. As long as we need to manually integrate metadata from every single system, it will just be too costly to ever accomplish
[00:26:50] Unknown:
given the other priorities that every data team has. Going back a little bit to what Kevin was talking about, which is like, you know, this idea of, like, if you wanna provide more guarantees, you have to use a system that has more constraints. So it's like the constraints versus guarantees trade off. 1 thing that's challenging with that too is that there's kind of a maturity life cycle to this where, you know, really senior data engineers really understand that, or people more senior on the data literacy spectrum, I think, understand that it's worth, you know, fitting within some constraints to get more guarantees, but it's not always a really clear thing for everyone.
And, you know, sometimes I talk about how the success of Airflow is in part because it was the right amount of constraints and guarantees for the people at that time. Right? So maybe there were other systems at the time that offered more guarantees but were harder to learn or to use or to get started with, you know, and then maybe that's just a little too much for people. So we have to have these progressive systems too that maybe you can start, you know, maturing into as opposed to maturing outside of them. Another point I wanted to make was, I think, Kevin, you said something about, like, giving the full visibility in the system. So that's, like, the promised land, right, of, like, if everyone uses framework x, then we get full visibility, and it's, like, it's super awesome. Then we have the full picture, and all the metadata is consistent.
But the reality is, like, if you're anywhere short of that, if you're, like, at maybe 80% of the way through the transition to that system, you don't get a linear value from it. So the promised land of, like, if you could have certainty that you have a 100%, the value would be tremendous. But if you have, say, only 80% of your pipeline in that system, the value might be only 20% because you don't have that guarantee that, you know, if you do impact analysis, for instance, like, hey, if I change this pipeline, who else is gonna be affected, you know about some of the things that may be affected, but not the full picture, and that's
[00:28:43] Unknown:
only a portion of value there. Yeah. And that's 1 of the big barriers to adopting various pieces of the tool chain as well is that if you already have something in place and then you say, okay. Well, I wanna use this other tool because it solves this other problem, but they don't fit together the right way. So now I've got half of my stuff in 1 place, half of my stuff in the other place, and now I have to build this other scaffolding to go on top of it to be able to manage them across each other, or you just have to have them be completely siloed, and then you're kind of back in that same issue that we are all trying to escape from of the data silos. And so that's again where some of this kind of evolving standardization really starts to come into play of if we just have ways that everything can communicate together, then we don't have to have as much of a switching cost of bringing on new tooling, you know, even if it's the best thing in the world, if it doesn't talk in the right way to the things that I already have, then it's effectively useless.
[00:29:35] Unknown:
Is there even a hope to fully deprecate any system in the data ecosystem ever? Right? I know that if you selected Airflow, Luigi, or whatever the ancestors, Informatica, you might be running that for a long time still, maybe for the life cycle of your whole data team too. So I don't know. Maybe there's such a promised land and such a great new framework that it will be worth it for everyone to rewrite everything. Kind of like the React, you know, of the data engineering world. Right? That it's like everyone's migrating to this thing, but still there will be a Vue.js, and there will be, you know, websites that just never migrate. We're still running COBOL, so I don't think we have a lot of hope in that regard. The long tail of, you know, old technologies is just a reality too. Right?
[00:30:20] Unknown:
Another thing that often gets shoehorned into software delivery, data ops, is the question of security and how that can be a difficult thing to bite off unless you bake it in from the very first day. And so I'm curious what each of you have seen in terms of experience, both running your own businesses and previous careers of how to factor security more seamlessly into the life cycle of data rather than having to have it be a bolt on or a, you know, secondary concern after you've already built out the rest of your system. And then now you have to, you know, rework your flows or, you know, sacrifice some measure of security because it wasn't thought about early enough in the process?
[00:31:01] Unknown:
Yeah. I think to me, the biggest concern that I have here when we talk about security, that I've been observing, what always worries me, is that you have these 1 off data dumps living on individual people's laptops or on some EC2 instances and whatnot, and they're just put there in order to do your data science work or some other work that you're doing. And then once you're done with the job, you forget about the fact that you've created a clone of the data asset that now flies around. Maybe it has some super sensitive PII information that you definitely don't wanna have floating around in the universe, and that's just the reality in a lot of companies because what's interesting here is that all else being equal, there is this interesting trade off between security and data governance on the 1 side. On the other side, it's innovation, like how quickly are you able to move.
And depending on the organization, like, some people are extremely stringent with what they're doing, and nobody gets access to any data whatsoever unless they go through 6 months long approval processes. As a result of that, they're not innovating quickly. They're not building new products whatsoever. And on the other side, you've got the companies who are going really, really fast, and their culture is 1 of a lot of trust and whatnot, but they're running at risk that at some point, there may be a misuse and abuse of data or a big data leak or whatever it is. And, obviously, there isn't, like, 1 right answer for all the different companies, but it is, to me, a really fascinating trade off to think through and a very fascinating problem to observe when we talk to companies out in the world that are trying to solve their data problems.
And, of course, like, 1 way of getting closer to a solution here is that you do use these platforms, these toolings in the data ops space, which help you to automate your common data related workflows around creating data pipelines, managing your data assets and whatnot, where security is actually built in, where lineage is built in, where it's extremely clear who, let's talk about machine learning really quick, created a certain feature, for what type of use case, for what type of model that's running in production, what's the actual purpose of it, and does that data scientist, and the model as a result of that, should that have access to a certain type of critical information?
Because maybe, for instance, what I've oftentimes seen, certain data is completely legally allowed to be used for fraud use cases, but it is not for marketing use cases. And you can build constraints like this into a central platform that standardizes and that automates the workflows of the internal users in your organization.
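As an illustration of baking that kind of constraint into a central platform, here is a minimal sketch of a purpose-based access check; the datasets, purposes, and policy table are hypothetical, not a real product's policy model.

```python
# A minimal sketch of a purpose-based access check baked into a central platform,
# like the "allowed for fraud but not for marketing" rule mentioned above.
# The policy table and dataset names are hypothetical.
ALLOWED_PURPOSES = {
    "raw.transactions": {"fraud_detection"},
    "analytics.bookings_daily": {"fraud_detection", "marketing", "finance"},
}


def can_use(dataset: str, purpose: str) -> bool:
    """Deny by default: a dataset may only feed use cases it is explicitly approved for."""
    return purpose in ALLOWED_PURPOSES.get(dataset, set())


assert can_use("raw.transactions", "fraud_detection")
assert not can_use("raw.transactions", "marketing")
```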
[00:33:38] Unknown:
I agree with everything that Kevin said, and there's indeed a tension between making data available to everyone so they can run quickly and innovate and keeping it controlled and safe. 1 of the challenges of implementing any sort of strategy around security is that a lot of the different parts of the stack have pretty granular security controls. So if you go into your warehouse or your BI tool or your data lake on AWS or GCP or Azure or whatnot, you might have very powerful tools to define permissions and access control and manage policies and all that. But put all of that together, and it's incredibly challenging to control who really gets to see which dataset and who really gets exposed because those systems are not aligned in terms of managing users and their permissions.
And there's a very complex relationship between the user, the dataset, even columns within the dataset. And I'm not aware of any solution to that, but just pointing out that
[00:34:45] Unknown:
it's incredibly challenging to manage permissions in a modern data stack, even if each particular component has very, very powerful capabilities around that. The point I wanted to make was that, you know, I think security and access information is metadata too in some ways. Right? Like, you know, it's just got a different type of, like, business metadata of who should be able to see what. And, you know, I think what's clear there is that the tooling that we're all building, or at least the people in this room are building and people outside, is, like, we all kinda wanna federate the data access policy into our tool because in the absence of, like, a standard to do it in a central place, then we just need to allow for our users and customers and communities to do it. But I think, like, for the tooling to be able to defer security onto an external system becomes really important. Maybe something like, what's the HashiCorp product, Vault. You know? So I don't know. Maybe you can all integrate with Vault, and you can have your data access policy defined in Vault, and then the tooling kinda talks to Vault to be like, can Joe access this dataset right now?
That would be 1 way to solve the problem. It's fairly challenging to solve. I think I've built a lot of tools that enable solving what is really an organizational problem. Or if you ask, like, okay. Now you can put any data access policy that you want into this tool. Go for it. I built it. You just have to put it in. What we realized then is that people are like, wait, how do we do governance here? Who can access what? And, like, organizations are extremely confused and not very mature. I think even at the most data mature organizations that I've worked at, people didn't really know how to approach and solve this problem. So we just end up with, like, the super sensitive warehouse with PII, the warehouse without PII, and a little finance cluster on the side. And there's, like, 3 or 4 levels of access, despite us having built tooling that enables people to give access at a very, very granular level. Turning things around a bit,
[00:36:46] Unknown:
you've all had storied careers working in the data industry, building fantastic products for other people, and you're now doing the same in your own businesses. And I'm curious how you're applying both the platforms that you're building and the lessons that you've learned to your own work within your companies and some of the ways that that has maybe shined more of a light on some of the challenges of actually doing this particularly for newcomers to the space and just some of the complexity that is inherent to the space and, you know, maybe there's no way out but through.
[00:37:21] Unknown:
I'll just touch on 1 thing that we've added to our own internal processes that is definitely, I think, a core concept of DevOps and maybe not as codified in DataOps, but definitely very relevant there. And that is the idea of canary rollouts. Like, a canary rollout in software engineering is you have your modified software artifact. You throw it also into production. You run some shadow traffic against it, and you see if it's got higher error rates than your production microservice. And if it does, probably a good sign that you've screwed something up and you should roll it back and not do the full on rollout of your canary. We've brought the exact same thing to Tecton into our platform and how we roll out our software.
And it's not just the software itself, but it's actually the data pipelines now too. And so whenever we roll out a new version of Tecton, we need to be super careful that the data pipelines for machine learning that we are managing, that as we upgrade those for our customers, that we don't break things by accident. And so we actually do deploy now canary data pipelines in our customers' environments, and we do now look at the output of those data pipelines and compare that output to the source of truth, the previous system that's been or that's still running in production. And if there are deviations between the output of those 2 different pipelines, then we realize, okay. There's probably an issue here.
Now we need to issue a rollback, and a developer needs to go in there and debug and see what's going on. And that's been extremely helpful in giving our developers confidence to implement things quickly, iterate very quickly, and very safely roll out changes
[00:38:57] Unknown:
to all of our customers who are running on our system to manage these data pipelines for ML in a safe way. It's the idea that democratization of access to data that we mentioned maybe gets in the way of the security thing, but I think it gets in the way of the data ops movement in some ways. And that does not exist as much in software engineering. So perhaps data is a little bit more accessible and is more important to common information workers nowadays. So that means, like, we want for more people to partake in the analytics process. Right? In general, like, we want for more people to create data visualizations on the consumption layer, but I think it's true too that we want more people to speak SQL, learn SQL, derive data sets. And maybe that's not true, but I've been working a lot in that direction of, like, enabling more people to write pipelines and to derive datasets and to just become more sophisticated, to up the level of data literacy.
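A minimal sketch of the output comparison in Kevin's canary rollout description above: run the current and candidate pipelines side by side and roll back on deviation. The chosen metrics and the 1% threshold are illustrative, not Tecton's actual implementation.

```python
# A minimal sketch of a canary comparison for a data pipeline: compare the canary's
# output against production output and roll back if they deviate beyond a tolerance.
# The metrics and the 1% threshold are illustrative, not a recommendation.
import pandas as pd


def outputs_deviate(prod: pd.DataFrame, canary: pd.DataFrame, tolerance: float = 0.01) -> bool:
    """True if the canary output drifts from production beyond the tolerance."""
    if len(canary) == 0 or len(prod) == 0:
        return True
    if abs(len(canary) - len(prod)) / len(prod) > tolerance:
        return True  # row counts diverge
    for col in prod.select_dtypes("number").columns:
        base = prod[col].mean()
        if base != 0 and abs(canary[col].mean() - base) / abs(base) > tolerance:
            return True  # a numeric column's mean shifted too much
    return False


prod_out = pd.DataFrame({"amount": [10.0, 12.0, 11.0]})
canary_out = pd.DataFrame({"amount": [10.0, 12.0, 11.1]})
print("rollback" if outputs_deviate(prod_out, canary_out) else "promote canary")
```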
And that's a move in data that brings in all information workers, and we don't really have that in software engineering, you know, in the same way, right? Like what would be the DevOps movement that's all about kinda rigor and processes and certain way of doing things. Adding like more structure on top of things so we can do things more progressively, more continuously. Like, how does that work in data while we're also trying to bring in, like, just a new wave of people? Is that okay to tell them, like, oh, now if you wanna play that game, you gotta use this very complex framework.
I think it's fundamentally not the same. We haven't really talked about we've talked about what's similar between the DevOps movement, DataOps movement. We haven't talked about what's different. And some things are very different, like the data gravity, like data is heavy and it's hard to move. Whereas software, you know, you can write a unit test suite that, you know, test the whole software in 10 minutes maybe or maybe in an hour, but you can't really write a unit test suite for the data warehouse too. Or maybe you can, you know, but it's a little bit more tricky. You're not gonna get the same level of certainty and confidence that what you're building is not gonna break.
Like, maybe, Lior, you can talk to that too, you know, because maybe I can go off of that?
[00:41:05] Unknown:
Consulted. No. I agree with you. 1 of the challenges with DataOps and and where it's different from DevOps is that we're actually dealing with a lot of stakeholders that are very different anywhere between data engineers that have similar characteristics to software engineers and through data scientists and data analysts and even kind of, business people that want to use the data. And all these people have very different backgrounds, very different skill sets, very different programming languages that they use, and a very different stack. So how do you really implement DataOps across the org in that way, I think, is a whole other level of challenge compared to DevOps.
And then I think you also mentioned that, Max. Testing is, I would argue, a little bit harder than traditional software just because of that added complexity: you do have software and code and logic and you do have infrastructure. And you also, on top of that, have a huge amount of variance in the data. And you might need a really, really big dataset to even capture all that variability. And so that makes testing
[00:42:17] Unknown:
one of the challenges that we've dealt with at Monte Carlo. Aren't there still a lot of things, like, with software? Like, you write a unit test and, you know, you have some fake inputs and you test the output or you have some integration tests, but it's still not a 100% complete test coverage. And so with data and data pipelines, we can accomplish a very similar thing. Right? Like, you can have small mock datasets, and you run them through your pipeline that you're building. And you look at the output, and you have some confidence that you didn't break things and that things are working fine, but you don't have a 100% complete confidence
[00:42:52] Unknown:
the exact same way that you don't have a 100% complete confidence in the typical software engineering world. I think it's true. 1 thing I wanted to bring up too is, like, change management is just harder in data too. So if you have, like, a petabyte scale data warehouse and, like, a gigantic table and you wanna change, I don't know, the logic in there as to, like, how certain fields are managed, maintained, or yeah. Like, really often you need like, a backfill is very different than, like, a database migration script in software. Not that database migration scripts are easy on the OLTP side of things either, but they're easier. At least you don't have to, like, potentially rewrite, you know, petabytes of data when you change some business definition in places. So that's a challenge. Like, for me, 1 of the big data ops challenges I would love to see tackled is, you know, backfilling at scale well, like, having systems that know what to reprocess and that don't necessarily force you to reprocess, that accumulate, you know, changes and then let you decide which subset of things you wanna reprocess, how and when. And do it without having to copy all of your data and explode your operational bill. Because that's a new problem. The problem we had before was just, like, it would run for weeks. Like, I remember running backfills at Facebook that, you know, were just gonna run for, like, 3 or 4 weeks, you know. So we had to get creative. Cost visibility is a whole other thing. And you better pray you didn't have a typo. Yeah.
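A rough sketch of the "know what to reprocess" idea raised here: track which partitions a definition change invalidates and backfill only those, instead of rewriting everything. The daily partition scheme and dates are made up for illustration.

```python
# A rough sketch of selective backfilling: when a business definition changes, mark
# only the partitions the change invalidates and reprocess those.
# The partition scheme and dates are made up for illustration.
from datetime import date, timedelta


def partitions_to_backfill(change_effective: date, last_partition: date) -> list[date]:
    """Return the daily partitions whose logic is stale after a definition change."""
    days = (last_partition - change_effective).days
    return [change_effective + timedelta(days=i) for i in range(days + 1)]


stale = partitions_to_backfill(date(2021, 6, 1), date(2021, 6, 5))
print(f"reprocess {len(stale)} partitions: {stale[0]} .. {stale[-1]}")
# A smarter system would also batch these, prioritize downstream-critical tables,
# and surface the projected compute cost before kicking anything off.
```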
[00:44:13] Unknown:
Cost visibility and monitoring is another one. Yeah. To your point, how do you build a test environment that actually works for you and allows you to ship changes more reliably? Not that it's a solved problem with software, but I think data is even harder given the gravity
[00:44:30] Unknown:
that you talked about. Who's got a staging environment in their data warehouse that they're proud of? You know, they're like, oh, yeah. We have a really good staging and dev environment in our data warehouse. It works perfectly. It has just the right test sets that we need, and it's just amazing.
[00:44:43] Unknown:
Said no 1 ever. Right? I do because I just created the database yesterday. It's got 2 tables. And so in terms of the lessons that you've all learned from building out your own businesses and trying to gather some of the knowledge that you've gained from your careers and from working on various data teams and working with customers now, what are some of the most interesting or unexpected or challenging lessons that you have come across and that you're hoping to carry forward as you continue to build out new capabilities of your systems and contribute to the wider ecosystem of data?
[00:45:24] Unknown:
Some of the most interesting and unexpected lessons that I've learned here, 1, when we talk to companies in the field, what I've oftentimes seen is, like, most people are super overwhelmed by what are even the different tools and systems that they should be using, which ones are trustworthy. And everybody feels that they're light years behind their entire peer group. But they don't realize that, actually, everybody's kind of struggling with the exact same challenges and, like, which tools should they be using? What's the right, proper, new, modern data stack? Which tools should they be using together? And so that's just 1 thing I wanna put out there where, like, don't have too much anxiety that you've fallen behind your peer group. A lot of people are struggling with the exact same challenges, and we still need to do a better job putting, like, the reference stacks out there that you can trust and that you can use and whatnot.
And the other thing I wanna say is, in general, always start really, really small. Don't try to boil the ocean, whichever problem you're trying to solve. Like, find a champion internally. Find a real business use case where you can deliver value fairly quickly.
[00:46:27] Unknown:
Because if you go too big and too large to begin with, it's almost bound to fail. What I'm observing on my side is, you know, I thought a decade ago that platform as a service was gonna be, like, the direction of things, that we're gonna get, like, 1 big cohesive set of tools that are federated to kinda rule them all. I used to say, like, oh, we're in this divergent phase where there's, like, this Cambrian explosion of tooling in the data space, and at some point, we'll consolidate into, like, a standard stack. And then we've seen a little bit of that in some areas, but it's still not very much happening. So what we're seeing is that we're gonna keep adding a lot of innovation and new tools and emerging new technologies.
In that context, it becomes really important for tools like Superset, you know, that I've worked very closely with, or Airflow that I started a while ago, to integrate very well with the rest of the ecosystem. So I think the tools that will succeed are the ones that will play nicely with each other as part of the emerging data stack. And we're gonna have some really strong verticals. Right? Some very strong winners, whether it's open source projects, or commercial open source solutions in some cases, or on the other end of the spectrum, vendor serverless tools like, you know, Snowflake or BigQuery, that all kinda jive well together with, like, you know, metadata exchange and being able to kinda schedule each other and know what's going on across these different tools. So for us, that means our strategy is to be very kinda leaning forward on integrating with the things that people are selecting these days, whether it's, you know, dbt, Airflow, or, you know, Monte Carlo and Tecton, or really any big open source project. We're really leaning forward and figuring out ways that our tools can work very well with the rest of the ecosystem.
[00:48:15] Unknown:
And for us, one of the biggest challenges, or kinda insights, is that if you want to build reliable data products, it sits somewhere at the intersection of code, data, operations, and the metadata that we discussed earlier. So you really need to create a lot of visibility around a lot of different parts of the stack. It's not just about, you know, running some SQL queries on your warehouse. It's really about having that full visibility of what code is running, when, what data is it operating on, what are the infrastructure, networking, and permission structures around that, and, really, who are the people that own this part or that interacted with this part of the system before. So all these things kinda come into solving the data reliability problem, and that's really what's guiding us forward in terms of how we're building our product and our solution.
[00:49:13] Unknown:
Totally. We need that APM for data. Right? And there's a bunch of companies in the space, and I think that's exciting too, just starting to collect observability metrics on all the data tooling and then shooting that into one place. Hopefully, that place can become a hub that other tools can not only, like, ship data to, but also, like, read from to have the full picture.
[00:49:33] Unknown:
Interesting how, you know, there's been all this movement of ETL and ELT to centralize all of our data into the data warehouse. And now that we've got that to, you know, some level of maturity, I'm certainly not gonna call it a solved problem, but we're now starting to see evolutions in the ecosystem where we're using that collection point as a distribution mechanism, with, you know, so-called reverse ETL or operational analytics, where we're pushing from the data warehouse back into some of the SaaS tools that we're pulling data from. You know, you take your customer data from your data warehouse and push it back out to Salesforce or push it back out to HubSpot or what have you. And I think it'll be interesting to see how that same kind of pattern starts to play out in how we're building data pipelines, where we're using these various tools and integrations to be able to generate metadata and centralize it into one location. And now we're starting to see people pull that back out to use that operational metadata to feed back into the pipeline, to inform how the pipeline executes and try to close that loop. And we're certainly seeing it among our customers. We're both collecting metadata and logs and metrics on their behalf,
[00:50:41] Unknown:
and they're also using our APIs to pull that information back into their workflows and pipelines and kind of drive more efficiency and a better experience for the data team around that. So we're very excited about that too, the integration between the tools.
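As a rough illustration of that feedback loop, here is a minimal sketch in which a reverse-ETL step first checks table freshness through a hypothetical observability API before pushing anything out. The endpoint, response fields, table name, and threshold are all assumptions made for the example, not any specific vendor's interface.

```python
import datetime

import requests  # third-party; pip install requests

# Hypothetical sketch of "closing the loop": before a reverse-ETL job pushes
# customer data from the warehouse back out to a CRM, it asks an observability
# service whether the source table is fresh.
OBSERVABILITY_API = "https://observability.example.com/api/v1/tables/analytics.customers"
MAX_STALENESS = datetime.timedelta(hours=6)


def source_table_is_fresh() -> bool:
    resp = requests.get(OBSERVABILITY_API, timeout=10)
    resp.raise_for_status()
    # Assumes the API returns an ISO-8601 timestamp with an explicit offset,
    # e.g. "2021-06-15T08:30:00+00:00".
    last_loaded = datetime.datetime.fromisoformat(resp.json()["last_loaded_at"])
    age = datetime.datetime.now(datetime.timezone.utc) - last_loaded
    return age <= MAX_STALENESS


def sync_customers_to_crm() -> None:
    if not source_table_is_fresh():
        # Skip the push rather than sending stale customer data to the CRM;
        # in practice this is also where you'd alert the owning team.
        print("Source table is stale; skipping the reverse-ETL sync.")
        return
    print("Source table is fresh; pushing customer segments to the CRM.")


if __name__ == "__main__":
    sync_customers_to_crm()
```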
[00:50:58] Unknown:
I think that we've touched on all of the subcategories of metadata, maybe all but one. And I think, like, Lior just kinda touched on it on the side too, which is, like, statistics. Right? So if you look at, you know, I think so far we touched on lineage metadata, business metadata, and I think we touched on operational metadata, like, you know, resource utilization, how long did this job run for, how much does it cost to run. But then the statistical metadata is also really great: being able to summarize what's in the database and access it in a way that's a lot less expensive than having to do a full table scan on petabytes or terabytes.
There's definitely a lot there to power, like, data quality type tooling. I know you guys do that at Monte Carlo too, but it's something we did in previous places. You know? Like, none of this is new. Like, the idea of ELT too, I was doing ELT in 2003 or something like that. Right? Like, loading the data, staging it in a transient or a persistent staging area, and then doing the derivations as ETL. Same thing at Facebook: we would process large analytical pipelines in the warehouse in places like Hive and Presto, and then lift that up to HBase where we could do key-value lookups to look at, you know, total lifetime value, like, how many likes on this particular topic. We were doing that back then. I think now we just see tooling emerging for how to do that in a more structured, streamlined way.
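For a concrete picture of what statistical metadata can look like, here is a minimal sketch that profiles a couple of columns once and stores the summaries in a small metadata table, so downstream tools never have to rescan the source. sqlite3 stands in for a real warehouse, and every table and column name here is illustrative.

```python
import sqlite3

# Minimal sketch of "statistical metadata": profile a table once and persist
# the summary so other tools can answer "what's in this table?" cheaply.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, amount REAL, country TEXT);
    INSERT INTO orders VALUES (1, 19.99, 'US'), (2, NULL, 'CA'), (3, 42.00, 'US');
    CREATE TABLE table_profiles (
        table_name TEXT, column_name TEXT, row_count INTEGER,
        null_count INTEGER, distinct_count INTEGER
    );
    """
)


def profile_column(table: str, column: str) -> None:
    """Compute cheap summary stats for one column and store them as metadata."""
    row = conn.execute(
        f"""
        SELECT COUNT(*),
               SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END),
               COUNT(DISTINCT {column})
        FROM {table}
        """
    ).fetchone()
    conn.execute(
        "INSERT INTO table_profiles VALUES (?, ?, ?, ?, ?)", (table, column, *row)
    )


for col in ("amount", "country"):
    profile_column("orders", col)

# Catalogs and quality checks read the small profile table instead of scanning
# the source table itself.
print(conn.execute("SELECT * FROM table_profiles").fetchall())
```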
[00:52:25] Unknown:
Are there any other aspects of the category of DataOps, or what you're building in your own businesses, or how you're using the platforms that you're building to inform your own internal operations, that we didn't discuss yet that you'd like to cover before we close out the show? I mean, it's nice to be able to build from scratch. Right? So for us, like, we're a small team, a startup with about 50 people.
[00:52:47] Unknown:
We're not carrying over, like, the COBOL equivalent, and we don't have any Informatica. We're probably getting tangled in our own way into, like, whatever's the hot stack of the day. But it's really nice to be able to pick things that are just as a service, right? So if you just get something like BigQuery that can scale to infinity, or like Snowflake, you're just like, I don't need to have a database administrator or a data infra team. I can just kinda sign up for this service, right, and just use it. So we try to be lean, you know, as much as possible. We're gonna bet on whatever is the emerging stack at the moment, and invariably, like, some of it's gonna age, and we'll have to carry it forever or move away from it in the future. This is the world we're in. Let's start a new company and build from scratch again. Enjoy the bit rot.
[00:53:34] Unknown:
There you go. Start clean for the 12th time. Sell off all your bit rot, and you're set to go build a new greenfield.
[00:53:41] Unknown:
That's right.
[00:53:42] Unknown:
For anybody who wants to get in touch with any of you, and for Kevin, who had to leave early, I'll have you all add your preferred contact information to the show notes. And so for Max and Lior, for the both of you who are still around, if you can just each give your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today, and maybe specifically how it pertains to people being able to realize the dream of DataOps and sort of automated data workflows?
[00:54:11] Unknown:
In the past, I think we've built data quality systems, you know, at previous companies, that really helped us do more alerting and data assertions, and I think, like, to see that emerge would be a great contribution to the field, you know, and what we need. And then, you know, a place to centralize, again, you know, operational metadata, so you can have the full picture of everything that's running in your system. That echoes the point that we've made multiple times on the show already.
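As a rough, hypothetical example of the kind of data assertion being described here, the sketch below checks that a freshly loaded table's row count falls within an expected band and raises when it doesn't, so alerting can happen before anyone consumes the data. The table, expected volume, and tolerance are all made up for illustration.

```python
# Hypothetical data assertion: a check that runs right after a load and fails
# loudly when the data violates an expectation.

def assert_row_count_within(rows_loaded: int, expected: int, tolerance: float = 0.2) -> None:
    """Fail if the load deviates more than `tolerance` from the expected volume."""
    lower = expected * (1 - tolerance)
    upper = expected * (1 + tolerance)
    if not (lower <= rows_loaded <= upper):
        raise AssertionError(
            f"orders load anomaly: got {rows_loaded} rows, expected roughly {expected}"
        )


if __name__ == "__main__":
    assert_row_count_within(rows_loaded=10_450, expected=10_000)  # passes quietly
    # assert_row_count_within(rows_loaded=1_200, expected=10_000)  # would raise/alert
```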
[00:54:40] Unknown:
From my perspective, the idea of observability and reliability is important. We need to fix that. It's something that a lot of data teams are struggling with, and that's what we're excited about at Monte Carlo. But on top of that, there's the question of metadata and specifically, I would say, discoverability. So, like, kind of rethinking the data catalog, I think, is a very exciting opportunity. I see that even small teams are struggling to make use of the data that they already have because they can't find it or can't understand the context around it. And there are a lot of exciting teams that are looking into how to make that both more enjoyable and more powerful for data professionals.
And I'm also excited to see the innovation in what Kevin's working on, which is kind of how to take all those wonderful models that we're building and all the great machine learning work that we're doing and take it into actual production systems. That remains incredibly challenging today and incredibly slow in terms of development cycle. And I'm looking forward to seeing how the new stack evolves around that and how we make that something that more and more companies are able to dabble with and to do consistently.
[00:55:53] Unknown:
Well, thank you both, and Kevin, who had to jump, for taking the time today to join me and share all the work that you've been doing and your perspectives on the overall space of DataOps and how we as a community can continue to strive for something better. So thank you for the time and effort you've put into everything you do, and I hope you enjoy the rest of your day. Thanks, Tobias.
[00:56:14] Unknown:
Thank you so much. Pleasure to be on the show again.
[00:56:22] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Welcome
Guest Introductions
Defining DataOps
Challenges in DataOps
Metadata Management
Roadmap to Data Utopia
Security in DataOps
Applying DataOps in Business
Lessons Learned and Future Directions
Final Thoughts and Closing