A Reflection On Data Observability As It Reaches Broader Adoption

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode.

With their new managed database service, you can launch a production ready MySQL,

Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs.

Go to dataengineeringpodcast.com/linode

today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.

Data stacks are becoming more and more complex.

This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating

the quality of the data and causing teams to lose trust.

Sifflet solves this problem by acting as an overseeing layer to the data stack, observing data and ensuring it's reliable from ingestion all the way to consumption.

Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels,

all thanks to over 50 quality checks, extensive column level lineage, and over 20 connectors across the data stack. In addition, data discovery is made easy through Sifflet's information rich data catalog with a powerful search engine and real time health status. Listeners of the podcast will get $2,000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2 week free trial. Find out more at dataengineeringpodcast.com/sifflet

today. That's S-I-F-F-L-E-T.

Your host is Tobias Macey, and today I'm interviewing Barr Moses and Lior Gavish about the state of the market for data observability and their own work at Monte Carlo. So, Barr, can you start by introducing yourself?

Yeah. Sure. It's great to be here again. I guess it's now 2 years later. I think the very first episode that we had with you was in 2020. So it's great to be back. A lot has changed, and some hasn't changed at all. I think it's pretty cool. So, yeah, my name is Barr. I'm the CEO and cofounder of Monte Carlo.

Best way to think about this is, like, a Datadog, AppDynamics,

or New Relic, but for data engineers.

So we help data teams make sure that their data is accurate and reliable so that organizations can actually use data in their various data products.

And, you know, a little bit about my background: I was born and raised in Israel, moved to the Bay Area to study math and stats, and have been stuck here since. I've worked with data teams kind of my entire life, and I was always looking at engineering and was jealous of all the tools and solutions that they had and how few solutions we had in data. And so, inspired by trying to make it easier for data teams, I started Monte Carlo.

And, Lior, how about yourself?

Hi, everybody. I'm Lior. I'm Barr's cofounder at Monte Carlo.

I also grew up in Israel. I started my career as what we would now call a machine learning engineer, though that wasn't exactly what it was called back then. I came to the Bay Area for school and started a company, actually in the cybersecurity space. And for all those who've been doing cybersecurity, you probably know it's a lot about analytics and ML under the hood to do the work. I led the engineering team at a company called Barracuda, and we built products that use machine learning specifically for fraud prevention. And that kind of helped inspire some of the problems that we solve today with Monte Carlo, right? The types of issues that we had to deal with to deliver good service to our customers were oftentimes related to data problems and data reliability issues, and that made me partner up with Barr. I'm excited to be on a show with her today. We don't get together very often.

Yeah. It's definitely great to have you both back. So for folks who haven't listened to your previous appearances, I'll add links in the show notes for that. You've both given a bit of your kind of cliff notes of how you got into data. For the longer details, I'll refer back to those past episodes.

You already kind of preempted my question about the elevator pitch for Monte Carlo. So I guess what I'll ask you now is what are some of the notable changes that you've seen in the overall space for data observability and data quality, and your own reactions to them and the planning around them that you've done at Monte Carlo since the last time we talked?

Yeah. I mean, reflecting on the market, you know, 2 years, 5 years ago, I think it's actually insane how the market has accelerated at a palpable rate, not just for data observability, but for data more globally.

And we see that in various forms. Right? So if we look at sort of maybe the most obvious or most notable is the size of large data infrastructure companies. Right?

So whether it's Databricks with, I think, just over $1 billion in revenue, or Snowflake with 1.2, BigQuery with 1.5 billion, Redshift rumored to be the fastest growing service out of, you know, 100 plus AWS services.

I think that sort of speaks to the strength of the market, in a sense, and how prevalent it is for those technologies to be part of a strong modern data stack, if you will. On the other hand, if you look at programs like "Powered by" or "Built on," those have encouraged data teams to actually build data products. So more and more organizations

actually use data in production or share reports with customers. I think, you know, if I would get on a random call with a data engineer

5 years ago, 10 years ago, not a lot of that data would be actually exposed to customers and not a lot of that data would be in a machine learning model, and not a lot of that data would actually be used.

I think the primary thing that has changed in the last 2 years is how important data has become to organizations today.

And, you know, as a result of that, we're also seeing kind of way stronger need for data quality, for data reliability.

That just wasn't the reality 2 years ago. I mean, when I was just reflecting on it 2 years ago, we didn't even know how we would call the category.

I was definitely like, oh, observability. It's such a difficult word to pronounce.

And here we are. Right? Like, I can get on a call and lots of folks, you know, our customers and prospects,

not only recognize what data observability is, but also actually, you know, have started to think through what it looks like to measure data observability.

And for folks who are unfamiliar, just taking a step back for a second to explain what data observability is, since I can't assume that everyone knows: data observability really kind of takes a page from software observability. And so in software engineering,

it's very sort of traditional, and you kind of have to be crazy to run an engineering team without something like AppDynamics or Datadog

or New Relic. And so, you know, engineers will use those solutions to make sure that their application and infrastructure

are reliable.

And yet data teams, which, you know, produce kind of high stakes data products, oftentimes are not aware of the data actually being wrong or are the last to know about it. They find out about data being inaccurate because someone downstream identified it. Maybe you have, you know, your finance team looking at a report and saying, hey, the number looks wrong. Or maybe a customer that's looking at a particular dashboard says, hey, the price of this product doesn't make sense to me. Or maybe the data that you're feeding a particular model just stopped arriving, for example.

In all of these cases, the data is unreliable. The data team doesn't always know about that. And so the idea of data observability is

to proactively

monitor your data stack end to end to give your data team the confidence to know when data breaks, be the 1st to know about data issues, and to be able to resolve those quickly.

Yeah. Definitely some interesting things to dig into there. One of the pieces is definitely the question of what do you call this space when you're one of the people who's helping to define it. And your comment about the fact that

in the time from when you first started working on this idea to where we are now, there has been an increased prevalence of people actually exposing some of these data products to end users, whether it's embedded analytics or feeding that data back into product features or, as you said, machine learning models.

And I think that maybe one of the reasons that data observability

as a category and as a first order concern for engineering teams

did take so long to

be kind of the obvious answer

is that

up until that point, data was more of an internal process, and so you didn't have the high stakes issue that we have

with applications where it's end user facing. If your application is down for an hour and you're in ecommerce, you're losing out on millions of dollars. So, of course, you wanna make sure that things don't break. Whereas with data, it's, oh, it's just people inside the business. So if things are wrong, we'll just fix it. No problem.

Without really taking that to its logical conclusion of, well, if you don't know that it's broken and people make decisions on that, then that's costing you millions of dollars.

100%. Couldn't have said it better. And I'll just give you a couple of tangible examples of that. Unity, the gaming company, released this a couple of weeks ago. It was actually in the news.

One mistake in their data that was related to their ads actually resulted in a loss of $100 million.

I'll just repeat that. One mistake cost $100 million.

Isn't that sort of a scary proposition? Right? And that's actually not uncommon. I'll give you another example, of a customer that we work with, just to give a little bit more context on what this looks like. One of the customers that we work with is JetBlue, you know, obviously a well known leading international airline with a very strong data team. And the JetBlue team basically manages all the company's data, from bookings to flight times. And so you can think about some of the experiences that are driven by that data. Right?

Say, is my suitcase arriving on time? Do I have the right connection? If I missed my connection, what's the next flight that I can get on? And, you know, how quickly am I getting support from someone online or on a phone call to actually book my next flight? All of that is incredibly data driven. And the team runs one of the biggest dbt instances. They're a very big dbt user. They basically had to go to extraordinary lengths to fix data issues all the time in a very manual way. They actually had this team called eyes on glass, which basically manually refreshed dashboards to make sure that the operations were smooth. And just to be clear, that's a very common reality. We see that with so many data teams. The data is important. It's high stakes data. We need to look at it in a very manual way and make sure that it's accurate.

And if it's wrong, you know, obviously, the implications are not just for the executives who are looking at the reports and making decisions based on them, but also for, you know, actual people who are flying, like you and me. And so, actually, we started working with the JetBlue team and set up data observability as a way to have coverage across their stack, understand when data is breaking, and actually ensure that both the operations and customer support data are up to date and the dashboards are accurate.

So for example, if there's a bug in a particular model that causes, say, a downstream table to be truncated, Monte Carlo can actually send an alert and provide the right tools to help identify the root cause and debug it in a timely manner before there's impact downstream.
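To make the truncated table example a bit more concrete, here is a minimal sketch of a volume check in Python. It is not Monte Carlo's implementation; the function name `looks_truncated`, the `max_drop` parameter, and the idea of comparing against the median of recent row counts are all assumptions made for illustration.

```python
from statistics import median

def looks_truncated(recent_row_counts, current_row_count, max_drop=0.5):
    """Hypothetical volume check: return True if the current row count fell
    by more than `max_drop` (a fraction) relative to the median of recent
    row counts for the same table."""
    if not recent_row_counts:
        return False  # no history yet, so nothing to compare against
    baseline = median(recent_row_counts)
    if baseline == 0:
        return False  # an empty baseline cannot "drop" any further
    drop = (baseline - current_row_count) / baseline
    return drop > max_drop
```

In practice the row counts would come from warehouse metadata rather than a hand-built list, and a check like this would be one of many signals feeding an alert.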

So that's just an example. You know, I think the number and importance of use cases for data observability has really accelerated and changed in the last few years.

Tobias, you mentioned that part of the drive for data observability comes from the fact that there are many more outward facing, customer facing consequences to bad data. I think the other thing that's driving this, and the biggest change that we've seen, is just the growth of data teams and the growth in complexity of data systems.

We saw that in the last 2 years, and it really drove the need to move from testing and monitoring to observability. Right?

So the traditional approach to data reliability is to add tests in various places in your data stack, and to perhaps put some monitoring in place, right, to track a small set of metrics, small just by virtue of a human having to write those tests and those monitoring rules.

And actually, to date, while a lot of technologies call themselves data observability, what they really do is that: they allow people to define tests on top of their data, put monitoring rules there, and get alerted when they break. What we've seen in the last few years, though, has really vindicated the observability approach versus the testing and monitoring approach.

It's just that the huge complexity of data systems forces you to monitor things at scale. Right? So you definitely wanna do the monitoring piece and the testing piece. And Monte Carlo obviously provides a lot of capabilities around that to get very granular about monitoring the specific metrics in one table or another, and to monitor all the different

statistics

and distribution

metrics about your data, and that's critical.

What we've also seen though is the importance of

tracking the entire

production dataset. Right? All of the different tables that lead

into those critical tables where you add testing

and explicit monitoring. This is really the only way to scale a reliability program, because

when you move from

10 tables to 100 tables to 1,000 tables, and when you move from 1 data engineer to 5 data engineers, to eventually 50 or 500,

it's no longer

possible to do everything

manually. Right? And

if you look at those critical tables that drive your critical features or your critical

dashboards or whatnot,

the problems that emerge there

are actually typically a result of something that happened way upstream. Right? It might be a data source, you know, that's 10 or 20 stages removed upstream

that changed in some unexpected way or a pipeline

that's far removed from those tables that broke or a change in logic

or in transformations

that happens way upstream. Now, if you're only monitoring

that key critical asset that you have, you're going to send the alert to the wrong person. Right? Because the problem actually happened, you know, to another person, maybe another team.
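The point about alerts landing on the wrong team can be illustrated with a small lineage walk. This is only a sketch under assumptions: the `lineage` dictionary, the set of `anomalous` tables, and the function name `upstream_root_causes` are hypothetical, and real lineage would be extracted from query logs or tooling rather than written by hand.

```python
from collections import deque

def upstream_root_causes(lineage, anomalous, start):
    """Walk upstream from `start` and return, sorted, the anomalous tables
    whose direct upstream parents all look healthy. Those are the likeliest
    places where the problem originated.

    lineage:   dict mapping a table name to its direct upstream tables
    anomalous: set of table names where an anomaly was detected
    """
    seen = {start}
    queue = deque([start])
    roots = []
    while queue:
        table = queue.popleft()
        parents = lineage.get(table, [])
        if table in anomalous and not any(p in anomalous for p in parents):
            roots.append(table)
        for parent in parents:
            if parent not in seen:  # guards against revisiting shared parents
                seen.add(parent)
                queue.append(parent)
    return sorted(roots)


# Hypothetical lineage: a dashboard built on a cleaned table built on raw data.
lineage = {
    "dashboard_revenue": ["orders_clean"],
    "orders_clean": ["orders_raw", "customers"],
    "orders_raw": [],
    "customers": [],
}
anomalous = {"dashboard_revenue", "orders_clean", "orders_raw"}
print(upstream_root_causes(lineage, anomalous, "dashboard_revenue"))
```

With metrics collected across the whole pipeline, the alert can go to the owner of `orders_raw` instead of whoever owns the dashboard.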

You're really going to confuse them about how to solve the problem. It would take them ages to investigate and go step by step backward or upstream

to find the root of the issue. And you're also going to drastically

delay the time that you detect the problem. Right? You're only going to find it when it impacts a critical asset. And then you're going to worry about

figuring out where it broke, and then fixing it, and then backfilling. And that's a very, very costly

process for the organization. And so, really, the only

way, at least that we're aware of, to

allow teams to scale that effort and to quickly get

to why things broke and not just like something's broken,

is through that concept of observability, of proactively

capturing

metrics from all across your pipelines, across the entire stack. That's something that has kind of proved itself out with the over 150

customers that we work with. And when you do these things,

you really need to think about scalability and performance, right?

If you do that in

a naive way,

you're going to get a really, really high bill on your Snowflake, or BigQuery, or Redshift, or whatnot.

That's an unpleasant surprise typically. You need to think about how you collect metadata effectively, about how you collect metrics effectively, about how you really leverage

the specific

capabilities in each of those platforms

to really do this observability thing at scale. And I think Monte Carlo invested a lot in building that over the last 2 years or so, and that has led to phenomenal outcomes. The other side of it, I think, and this is something we've been preaching for a while now, but now, you know, it's come to fruition.

Thinking about reliability goes beyond just anomaly detection. Right? It goes beyond just,

hey, here's a metric in one of my tables. How do I get alerted when it deviates from normal? Right? That's critical. Monte Carlo does it. It's an important part. But to really get to reliability in data products, you need to think about not just how you detect problems, but also about

how you solve them. Right? And we talked about it. The ability to look at things upstream, to understand lineage and dependencies.
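As a rough illustration of the "metric deviates from normal" piece mentioned here, the following sketch flags a value that sits far from the historical mean. It is a plain z-score check chosen for simplicity, not a description of how Monte Carlo detects anomalies; `is_anomalous` and its `threshold` default are assumptions.

```python
from statistics import mean, stdev

def is_anomalous(history, value, threshold=3.0):
    """Hypothetical z-score check: True if `value` is more than `threshold`
    sample standard deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # a perfectly flat history makes any change notable
    return abs(value - mu) / sigma > threshold
```

A check like this only says that something looks off. The lineage and dependency information discussed here is what turns that signal into something a team can act on.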

And it's also about getting better over time and preventing issues from happening. Right? It's about looking at your past performance, understanding

what are some foundational changes that you can make to your platform to make it more reliable? What are some foundational changes you can make to your process to make it more reliable?

And this is something that's been, again, learned on the DevOps side over a long period of time. And we've kind of seen it happening and taking shape over the last couple of years at Monte Carlo, and that's been a very exciting journey.

I think that it's definitely worth calling out that for a lot of people, the kind of earth shattering moment is just being able to know something is wrong, because a lot of people don't even have that piece, or it's just like, it went through, the job ran, I don't know. And just being able to know, yes, something is wrong here, that's just monumental. Just going from 0 to 1 is an amazing step for a lot of people, and, you know, then everything else beyond that is just gravy. You know? If I can know why it broke and how it broke and what I can do to fix it, that's amazing. But just knowing that it broke in the first place is, you know, light years ahead of where I was.

And so in terms of your kind of early efforts as you were developing the product and figuring out what your go to market looks like, you were very active in kind of content production and trying to own the narrative around data quality and, you know, putting out a lot of resources about helping people understand

what are the different elements of this space beyond just data breaks sometimes, you know, getting more into the details of, like, yes, data breaks. Here's how. Here's why. Here's how to think about it. And I'm wondering if you can just kind of summarize

your initial focus of how you thought about putting out that messaging and

the ways that you kind of shaped that narrative as you iterated towards your general release, and the types of reactions and feedback that you got as you went through that journey of kind of figuring out how do you talk to people about this problem?

Yeah. For sure. And, you know, I agree with you. I think the first moment is when you

know that there was a problem. Right? You identify that there's something, this light bulb goes on, and you're like, oh, I had no idea about that. Right? And a lot of the value is in that. And I think the next iteration or phase of that, something that customers and folks very quickly ask themselves, is, like, why?

Where? How? Right? Oftentimes, actually, people tell us that just knowing that something is broken, but not being able to answer those other questions, is really frustrating for teams, and it can become alert fatigue or kind of noise, if you will. You know, so I think that's an important point that comes up a lot in our discussions with folks in the industry. But going to, you know, your question on content and how we've thought about what folks care about.

You know, definitely, in the early days, you know, maybe a couple years ago,

you know, as I mentioned before, there was a question of, like, how do we even call this thing? Right? And what resonates most with folks, and what does it even mean to have data observability?

Right? So we kind of coined the term data downtime,

you know, use a lot of sort of you know, what we heard from customers to help define what we call the 5 pillars of data observability.

So those 5 pillars,

freshness,

schema,

distribution,

volume,

and lineage, and having sort of an automated way to have a strong understanding of those 5 pillars across your data stack. So wherever your data is, data warehouse,

data lake, BI solution,

That, I think, is kind of the core of what we started with in the data observability category. And we wrote a lot of content to help folks kind of actually understand

what that means. And our approach has always been, you know,

very customer focused or customer driven. We try to focus on that more than anything. And so we try to kind of hang out wherever our customers hang out, which was, for example, Medium or podcasts like this one or LinkedIn.

And then, you know, write content that's actually approachable and easy to consume. So, you know, we heard a lot of feedback that there's a lot of content out there that's actually, like, really technical and really hard to relate to. And so we focused on content that actually focuses on storytelling and focuses on, honestly, a lot of fundamental questions that our customers had, and a lot of things that are top of mind for them. So, you know, I'm reflecting on the last couple years. Data mesh was a big deal at a certain point. How to build a data team, how to go from 1 to 50 in your data team, how to, you know, set up SLAs, SLOs, SLIs. What do those even mean? How do I set up data contracts?

A lot of those are things that, you know, we've written about. Today, I think a lot of the questions that folks ask us are, how do we prove the value of a data team? How do we connect the value of the data team to the commercial reality that our business is asking us for? What are the metrics that we should be using in order to measure the success of our data observability efforts? All of those are things that are very much top of mind for our customers. And so, you know, we write content about those topics targeted at folks who are curious about those, like data engineers or heads of data engineering. You know, I think another thing that has changed dramatically, as Lior mentioned, with several hundreds of customers today, is that we actually found that some of our customers are starting to write about us. They're starting to write about data observability. And so, you know, we obviously included many of them in our content. You know, I think ShiftKey was another one that we just released a few weeks ago, and I know SeatGeek is in the works. So content has always been a big focus for us, and as part of that, we're actually publishing the first O'Reilly book on the topic of data quality. So we've done a bunch of courses and classes with O'Reilly, and they asked us to write the first book on this. We're very honored to do that. And again, it's targeted at helping folks in the industry who have questions about how to build data teams, what data tools should I use,

and how do I build a reliable data stack.

So, you know, today, we have, like, over, you know, 30,000 or so subscribers

and we're planning a big conference this fall. And, again, it's all with the goal of helping data teams, wherever they are, answer

these questions so that they can continue to deliver reliable data products.

Data teams are increasingly under pressure to deliver.

According to a recent survey by Ascend.io, 95%

reported being at or overcapacity.

With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation.

In fact, while only 3.5%

report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation,

orchestration, and observability.

Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture,

as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark and can be deployed in AWS, Azure, or GCP.

Go to dataengineeringpodcast.com/ascend

and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer.

And then the

other interesting direction

to dig into is what you were mentioning about the

rapid evolution of the space and the rapid adoption of this idea of data observability and the growth of, you know, engineering teams

identifying that as a problem, identifying the fact that they need a solution for that. And I'm wondering if you can

highlight what you see as some of the main

motivating factors that have led to the widespread growth in

understanding of data observability and

kind of understanding the problems that it solves and how to know when you've hit that point that it is a problem for you.

I think what we've been seeing, Tobias, and we touched on it a little bit earlier in this talk, so I'm happy to expand, is just the growth and investment in data, and the growth of data teams,

and the more mission criticality of data applications. Right? All these things together

are

making data problems more prevalent. Right? The more you build, the more problems you're going to have with larger teams and harder coordination.

And then the other thing is the consequences of those

data problems

are increasingly

affecting bottom line at the end of the day. Right? Whether it's a product that's exposed externally or a machine learning model that

makes many decisions every day or

whether it's a dashboard that drives really key decisions for the company. Right? And so

as all these things are going, teams are also

recognizing the need to take a more,

let's say structured approach to data reliability, and as part of that also acquiring and putting in place the tools and the processes to do that. Right? And again, it starts with testing. Right? If you're a dbt user, you probably start to write tests, and then extend that into observability and other tools in your stack to really control and measure the reliability of data, improve it over time, and hold teams accountable to that.

Just to add on what's changed,

we actually commissioned a survey on the state of data quality by Wakefield Research. I think we surveyed over 300 data engineers

and found some really interesting things. Like, for example,

over 60%

of respondents actually said that data quality is way worse today than it was the same time last year. I thought that was, like, one interesting data point. I'd say the second, and this is maybe the good news, is that over 90% of respondents said that they are now actively investing in data quality or data observability solutions. I think if you commissioned

the same survey, like, 3 or 5 years ago, it would be a far cry from these results.

And maybe tied to what was said earlier,

we also asked folks, like, what percentage of your revenue is impacted by bad data? Which percent of the company's revenue? I was actually shocked to find that

north of 25%

of the company's revenue is tied to bad data or to impact of data. Again, I think, you know, if you were to look at this several years ago, the results would not even be close to that, and so I was shocked by these numbers, and I think they also speak to how critical data is and how critical it is that the data is actually accurate.

Yeah. And I think it's interesting, your observation, that so many people said, oh, my data quality is so much worse than it was a year ago. And I think it's, you know, along with the conversation we were having earlier about not even knowing when there are problems, it's probably tied to the question of, you know, maybe they think about it as, well, a year ago, I didn't get any alerts for my data. Now I get it at least once a week

where it's like, yeah. My data is worse because I actually know how bad it is.

Yeah. For sure. You know, I think for folks who actually don't have something in place, and it doesn't have to be Monte Carlo, you know, whatever it is,

if you don't have sort of a solution that's, like, data observability,

there's, like, this big clock ticking.

And every minute that goes by, there's, like, some significant impact on your customer, on your business as a result of data downtime. So even if you're unaware of it, even if you're not getting that, like, weekly alert, in reality,

businesses are impacted by that. Data teams, data products are

impacted by bad data. And so even for folks who are not aware of it, I'm convinced that there's just, you know, hundreds of data teams out there that are hit by it, but have not thought about it proactively yet. And I think that's the big gap. It's not necessarily

not having data related to your products, but rather not being aware of it. And starting with that awareness is, I think, so critical.

Another interesting aspect of this market is that

at roughly the same time that

you were

getting ready to release your product, there were I can think of at least 3 other businesses that were in what could easily be considered

either the same or very similar kind of product categories.

And it's definitely one of those points in time where

everybody has been dealing with this problem long enough that enough people decided that they were going to do something about it, and it just happened that they all did it at the same time. But maybe there's some element of the kind of level of attention in the venture capital market that had something to do with that too. And I'm just curious what you think are some of the catalysts that led to so many different companies

attacking this category from their own particular directions in such a close time frame and rapid succession.

That's definitely true in data observability.

Honestly, it's a pretty common pattern in a lot of different software markets, and I can only guess as to why it happens. I think one aspect of it, as you point out, Tobias, is the venture capital community

that shifts its focus into one space or another. And, you know, it's more likely to fund companies in a given space. And I think that was definitely the case for data observability in the last 2 or 3 years. I think another aspect is just

teams getting inspired

by one another. You know, maybe 1 or 2 teams start, and then a third team notices the traction, the excitement around the space, and maybe, you know, pivots into it or starts a new company

in it. And so I think you see that happening as well.

And perhaps the

third aspect is just

kind of the reasons why data observability became

important

during that time. And I think there's a lot of smart teams out there, and they're all thinking about problems. As those problems emerge and become more and more important for the industry,

a lot of smart people think about them

around the same time.

There's also a confluence of technology

here that enabled the space. I think the transition

into cloud data warehouses and data lakes and data lakehouses, or whatever you call them, and the emergence of dbt,

all of these technical trends have made

data observability

possible. And so

for all of these factors combined,

it's typical that you see several teams operating in a similar space

around the same time,

and data observability, I think, is better for it. I think we all make each other better. Excited to be part of that space. Yeah. And I would say, you know, just adding to that,

yeah, I think the last few years have been pivotal and important for data. Right? And I think COVID-19 and the acceleration of remote work have helped that. So, you know, just to give another example, Vimeo, 1 of our customers,

it's a video hosting platform.

I think, you know, they have more than 200,000,000 users,

and they use data quite extensively. They have customer data, marketing data, product usage data. I mean, they literally have billions of streaming events ingested per day, and they use that to make decisions

on, for example, which customers need more video bandwidth at any given minute. You know, what type of devices are they using?

And actually, using data, they were able to kind of sustain growth in COVID-19 and actually open totally new revenue channels

by leveraging data.

And so they actually sort of doubled down on including SLAs for data, making sure that data is reliable throughout the life cycle.

And, you know, I think for lots of folks, this was just a similar reality. Right? A reality where

data teams are accountable for data products,

and they spend 20 to 80% of their time making sure that that data is reliable and accurate. Like, the worst thing that you can do, the worst thing that can happen to you is that the data is wrong. And I think, you know, companies invested so much in building,

you know, top notch

data infrastructure

and a really awesome stack, but then at the end of the day, the data's wrong. And so when folks are facing that, I think, like, this and last year was money time for data teams. And data teams were like, okay, we've invested so much in having, like, the best data warehouses, the best data lake, the best ETL, the best BI. We have all this awesome infrastructure.

Now let's actually make use of the data. And then when you turn to make use of the data, the data is inaccurate, then all of your efforts have basically been moot. And I think that's a very, very frustrating

reality for folks. And when that happens, folks immediately ask themselves, well, how can we regain trust in the data? I think that's sort of 1 of the biggest pushes I've seen.

In the growth of data observability,

there has been more widespread visibility and understanding

of the fact that it's even an option.

And for people who were some of your customers when you first launched and started buying into this idea of data quality, data observability

as the initial set of products and tools were being released. You know, those are definitely the very early adopters, the forward looking engineers, the people who are constantly on the lookout for how can I do things better, faster, cheaper?

And now that we have hit a point where

the market has kind of agreed upon the terminology,

agreed upon the value proposition,

I'm wondering what you see as the

ways that customers are approaching this area have shifted,

and the ways that their motivations

and the ways that they come to this conclusion have changed from when you first started working on the product?

I think what's changed is just,

first of all, awareness. Right? Like you mentioned, in the early days, it was a bunch of early adopters, very visionary people that understood

why,

you know, why it makes sense

to take some concepts from, you know, from DevOps and apply them to the world of data. Today, I think we're seeing increased recognition in the industry that

data reliability,

not just quality, but reliability

is a foundational

part of a data strategy and of a data

stack. We're seeing a lot more

companies that actually

have a stated

objective or strategy

around

solving data reliability

at scale. Right? And sometimes it's driven by

going public and knowing that you're going to be sharing your numbers with the street.

Sometimes it's driven by

new products and capabilities

that are very customer facing

that are driven by data.

Sometimes it's driven by, you know, it's sad to say, but like massive failures or problems that affected the entire company and that resulted

in the team taking a proactive approach to data observability.

Well, we're definitely seeing it coming into the mainstream.

It's no longer

just tech visionaries. 1 of the biggest changes we've seen is also in sectors, right? It's no longer just tech companies that are doing data observability.

It's companies in every sector that you can think about, whether it's media or manufacturing

or CPG,

or even car companies or educational institutions, right? And we've seen all of them

adopting data observability

and making it a central part of their data stack. And it's just great to see how the industry is so rapidly adopting this practice, which we believe is critical.

As you have

evolved your own capabilities,

and the ecosystem around you has evolved, the specific

concerns or the level of detail that people need to be able to solve their increasingly complex systems has changed.

How has that driven your overall thinking about

the role that you play in people's data platforms and the direction of the product that you've taken as a result?

That's a really good question. And

what's happening is on kind of 2 fronts, actually. 1 front is the ability to

cover complex stacks. Right? And our customers use

multiple solutions

within their stack. You know, typically starting from a data warehouse or a data lake, but then they have the BI tools on top of it, and they have

orchestration tools like dbt or Airflow or Dagster.

And they have streaming infrastructure that feeds into

the data platform typically.

And

our product has really evolved in its ability to cover all of those different pieces and correlate information. Right? Because

when you have

a data problem,

if you only know what's going on with your data,

you're limited. Right? And we touched upon it. Right? Like there could be a million different

reasons why it broke. And if you don't have that information from across the stack about how all the different assets tie together, and about how they work together, and what issues you've seen in every single part of the stack, you're going to find it really, really hard to deal with data problems and to fix them. Right? And so,

of course it's critical to monitor

the metrics, but where it really

shines is the ability to pull together that information. And so for example, today in Monte Carlo, when you go and look at a data problem,

you will automatically see a map of everything that's upstream

of the table that's impacted,

including

things like dbt errors or test failures that happened anywhere upstream. Right? Or schema changes that happened

or other data anomalies. Right? And that ability to take all of that information together is really what

drives a lot of value for our customers. And that's really, really unique.

The other dimension that we saw evolving really nicely is that transition from, you know, from just monitoring and detecting issues to actually

helping

teams resolve them

and even prevent them. Right? And again, it's all about, you know,

it's not enough to monitor, like,

you know, your 10 most important tables. You actually need to have deep visibility

across your entire stack. Make sure

that alerts go to the right person, the person that owns the thing and can actually

act on fixing it. It's critical to give them all the relevant information, right? Whether it's from dbt or from their query engine, the user, or the BI layer, and really help them solve it fast because we've seen it time and time again. If you alert the wrong person, if you inundate them with information

that they can't act on, you know, your reliability program is just not going to be successful. And so

Monte Carlo invested a lot in, you know, getting the right alert to the right person, making it very, very meaningful,

and giving them all the context from around the stack to actually

go ahead and act on it and solve it. And that's something that's very, very unique in the market.

The other thing

is, you know, going beyond, you know, detecting problems and solving them

to actually making the system better. Right? And this is also an area where Monte Carlo invested in actually foundationally

changing the architecture and the system so that you actually have fewer problems and fewer issues.

And Monte Carlo invested a lot in creating

usage information, metadata, and statistics that help teams understand

where problems are happening,

how to simplify their architectures,

sometimes how to even reduce cost and complexity

in their systems.

And as you do all of these things,

you actually get

fewer incidents. And we've had some amazing success stories where companies literally within 6

months dramatically reduced

the number of issues that they had

by actually learning from the past, by understanding

the structure of their systems and by simplifying it honestly.

And so

being able to address all these

3 pieces of detecting,

solving,

and preventing issues

is probably the most dramatic change. And then being able to do it across the stack is something that made some of our customers extremely successful.

Yeah. The question about the dialogue that it opens in terms of understanding

how to

evolve your own data architecture and data stack with the information that you get back from the observability platform is where you beat me to the punch. I'm wondering if you can maybe speak a bit more to some of the types of optimizations

or evolutions

in the kind of data architecture and data platform that some of the insights that Monte Carlo and observability solutions provide can inform and direct?

Oh, yeah. Absolutely.

It actually starts with mapping

all the assets that you have and understanding

how they are created on 1 hand and how they are used on the other hand. Right? For example,

1 simple thing that you could do is start looking at,

okay, what are all the assets today I have that I'm investing a lot of resources in to create,

and that are actually rarely or never used, never read. Right?

And start deprecating those. Right? We've seen a lot of our customers doing that. What you accomplish when you do that is twofold. A, you save on costs. Right? There's no point paying for all that compute if you're not gonna use it. But the other thing is you're actually reducing

your data debt. Right? You're

reducing the number of tables where things can go wrong. You're reducing the number of options for data consumers

to get their data. You're actually streamlining and simplifying

the architecture. So that's 1 example of how you could use usage data and statistics

to actually make the platform

better.
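
As a rough illustration of the unused-asset check described here, a minimal Python sketch might look like the following. The `TableUsage` record, its fields, the sample tables, and the thresholds are all hypothetical; they are assumptions made for the example and do not represent Monte Carlo's product or API.

```python
from dataclasses import dataclass


@dataclass
class TableUsage:
    # All fields are hypothetical: a real platform would source them
    # from warehouse query logs and billing metadata.
    name: str
    monthly_compute_cost: float  # assumed cost to build the table, e.g. in dollars
    reads_last_90_days: int      # assumed count of queries that read the table


def deprecation_candidates(tables, min_cost=0.0, max_reads=0):
    """Return tables that cost at least min_cost to build but were read
    at most max_reads times, ordered from most to least expensive."""
    unused = [
        t for t in tables
        if t.monthly_compute_cost >= min_cost and t.reads_last_90_days <= max_reads
    ]
    return sorted(unused, key=lambda t: t.monthly_compute_cost, reverse=True)


# Made-up sample data for illustration only.
tables = [
    TableUsage("orders_daily", 120.0, 450),
    TableUsage("legacy_sessions", 300.0, 0),
    TableUsage("tmp_export_2019", 45.0, 0),
]

candidates = deprecation_candidates(tables)
print([t.name for t in candidates])
```

Sorting by cost puts the tables with the largest potential savings first, which matches the twofold benefit mentioned above: lower compute spend and fewer tables where things can go wrong.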

Another example is you could start looking at your

lineage

and see that certain things are duplicated. Right? Maybe you have a set

of dashboards

that are trying to give visibility into the same area of the business or the same set of metrics,

but they're actually reading from

2 versions of the same data. So you can actually see and moreover, you can

lay over

the reliability

data about those 2 datasets. Right? So if you have 2 datasets that are used to measure users or revenue or whatnot, you can actually see which 1 of them is more

reliable,

And you can

actually refactor the system to use a single dataset instead of 2 that are trying to accomplish the same thing. And so that level of understanding of the platform

with the ability to understand reliability

over time really gives you a lot of power

in simplifying

your architecture.

The biggest challenge with modern data systems is understanding what data you have, where it is located and who is using it.

Select Star's data discovery platform solves that out of the box with a fully automated catalog that includes lineage from where the data originated all the way to which dashboards rely on it and who is viewing them every day.

Just connect it to your dbt,

Snowflake, Tableau, Looker or whatever you are using and Select Star will set everything up in just a few hours.

Go to dataengineeringpodcast.com/selectstar

today to double the length of your free trial and get a swag package when you convert to a paid plan.

The other

interesting thing to talk about is the question of understanding

the value that a product like Monte Carlo brings to

the data platform and a data team and the ways that organizations

think about

measuring and tracking the overall return on investment

and the types of top level indicators that they're looking at to understand,

you know, how much effort to put into something like a Monte Carlo

and, you know, how that will impact the overall business.

You know, I can speak a little to what folks or data teams are doing today.

You know, I think there are kind of

a few different ways to think about measurement goals. 1 is in terms of kind of internal measurement. So

service level agreements, service level objectives, and service level indicators.

But that honestly is probably like a step forward from where folks are at today. I think 1 of the things that folks really wanna understand at the very basic level is time to detection.

So how quickly do I find out about a problem in my data? For most data teams, this can be days, weeks, months. Can we actually get that down to hours?

You know, some customers that we work with, Choozle, for example, I think reduced that by 90%, basically.

The second pretty fundamental metric is time to resolution. So once there's an incident or data problem, how quickly do we resolve that? And, again, that could be months.

And when there's a clock ticking, you know, and every minute or hour is millions of dollars, time to resolution is really critical. And then finally, the thing that's kind of on that thread is actually the

reduction of data incidents overall.

So, you know, to the discussion earlier, if we can identify how to

improve the data stack or how to work with sources that are more reliable,

you know, generally sort of deliver on reliable data products,

we can actually sort of proactively

reduce those data incidents to begin with. And so, you know, those are some of the things folks start out with. And then over time, they look at, you know, measures like

a freshness

SLI or a volume SLI

and, you know, more kind of metrics that go towards

reducing data downtime overall and actually increasing data reliability at the same time.
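
To make the metrics named here concrete, the sketch below computes mean time to detection, mean time to resolution, and a simple freshness SLI in Python. The incident record shape (`occurred`, `detected`, `resolved`), the 24-hour freshness threshold, and all sample timestamps are assumptions for illustration, not definitions taken from the conversation or from any particular tool.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_hours(deltas):
    """Average a list of timedeltas and express the result in hours."""
    return mean(d.total_seconds() for d in deltas) / 3600


def incident_metrics(incidents):
    """incidents: list of dicts with 'occurred', 'detected', 'resolved' datetimes
    (a hypothetical record shape)."""
    time_to_detection = [i["detected"] - i["occurred"] for i in incidents]
    time_to_resolution = [i["resolved"] - i["detected"] for i in incidents]
    return {
        "mean_time_to_detection_hours": mean_hours(time_to_detection),
        "mean_time_to_resolution_hours": mean_hours(time_to_resolution),
    }


def freshness_sli(last_updates, checked_at, max_age):
    """Fraction of tables whose last update is no older than max_age."""
    fresh = sum(1 for t in last_updates if checked_at - t <= max_age)
    return fresh / len(last_updates)


# Made-up incidents for illustration only.
incidents = [
    {"occurred": datetime(2022, 1, 3, 8), "detected": datetime(2022, 1, 3, 12),
     "resolved": datetime(2022, 1, 3, 18)},
    {"occurred": datetime(2022, 1, 10, 9), "detected": datetime(2022, 1, 10, 11),
     "resolved": datetime(2022, 1, 10, 13)},
]
metrics = incident_metrics(incidents)

# Made-up table update times, checked against an assumed 24-hour threshold.
now = datetime(2022, 1, 11, 0)
last_updates = [now - timedelta(hours=h) for h in (1, 30, 2, 5)]
sli = freshness_sli(last_updates, now, timedelta(hours=24))
print(metrics, sli)
```

Tracking these as trends over time, rather than as single numbers, is what lets a team see whether detection and resolution are actually getting faster.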

And then on that question of kind of understanding the value, understanding the benefits of data observability,

you know, we talked a little bit about the kind of data architecture and data platform aspect, but there's also the way that it can inform

how and where engineers spend their time and think about

investing

their kind of efforts to be able to,

you know, build systems that are more reliable, but also be able to build data assets that are going to be more reliable and are going to be used. And I'm curious how you've seen some of the information that you're providing

influencing

the

aggregate behavior

of engineers and teams in organizations that are very data focused?

100%. So great question. So when it comes to sort of ROI or kind of, you know, what's in it for you from data observability,

there's actually a set of 2 pretty simple ways to look at it. 1 is, like, money saved by reduction of data incidents,

and for that you can look at how many incidents you have and what's the cost of each for the organization.

And then the second is data engineering time spent. And we have some data on both of those. So from our surveys and research, we know that data teams spend,

for sure, north of 40%

of their workday actually on data quality, and we hear numbers anywhere between 30 to 80%.

We also have heard that the average organization

experiences something like 50 or so data related incidents per month.

That actually adds up to literally hundreds of hours for a data team. And so for folks who look at the ROI, it's typically on those 2 measures. And we've worked with so many customers, like, you know, I mentioned Choozle and Vimeo and ASICS and Auth0 and Fox and CNN and many, many others where

you know, that's the reality, where you're able to both reduce the number of data incidents and you're also able to reduce the time that your data team spends on data incidents overall. So that's typically how folks think about it. That's, I would say, the easiest way to measure. There are obviously other benefits, right? Like all the things that you could be doing with the time that you spent triaging your data incidents. Right? So new products that could be launched or other revenue generating activities that your data team could be working on but isn't. That's obviously a lot harder to measure. I would add that as sort of a qualitative

aspect of the ROI as well.
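
The two quantitative measures described here can be turned into back-of-the-envelope arithmetic. In the Python sketch below, only the figure of roughly 50 incidents per month comes from the conversation; the hours per incident, the hourly cost, and the reduction rate are placeholder assumptions that each team would replace with its own numbers.

```python
def monthly_incident_hours(incidents_per_month, hours_per_incident):
    """Engineering hours spent on data incidents in a month."""
    return incidents_per_month * hours_per_incident


def monthly_savings(incidents_per_month, hours_per_incident, hourly_cost, reduction):
    """Estimated cost avoided per month if a fraction `reduction` (0 to 1)
    of incident hours is eliminated."""
    if not 0 <= reduction <= 1:
        raise ValueError("reduction must be between 0 and 1")
    hours = monthly_incident_hours(incidents_per_month, hours_per_incident)
    return hours * hourly_cost * reduction


# 50 incidents per month is the figure quoted above; everything else is assumed.
hours = monthly_incident_hours(incidents_per_month=50, hours_per_incident=4)
savings = monthly_savings(
    incidents_per_month=50,
    hours_per_incident=4,   # assumption
    hourly_cost=100.0,      # assumption
    reduction=0.5,          # assumption
)
print(hours, savings)
```

This only captures the measurable part. The qualitative side, such as the products a team could have shipped with that time, does not fit in a formula like this.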

In your experiences

of building Monte Carlo and working with your customers and

engaging in the broader conversations around data quality, data observability that are happening in the industry and in the ecosystem.

What are some of the most interesting or innovative or unexpected ways either that you've seen Monte Carlo used specifically

or ways that you have seen

data observability

incorporated into

data teams' overall objectives and workflows?

I think, you know, in terms of how

people

work,

we've seen teams that have done really cool things. Some teams

have created really,

really

sophisticated

and granular reporting about their data quality.

They actually

use data that Monte Carlo provides, and we provide it through data shares or through

APIs.

And they have built really, really granular

dashboards that allow them to

track the quality of their assets or their teams

in very, very sophisticated ways. And they actually use that to run their weekly operational reviews. And that's

that's something that we really, really enjoy

seeing.

At the more

development workflow level, we've seen teams

basically using

observability

to better understand the impact of changes they're about to make. Right? They are no longer

changing a field or deprecating it and hoping for the best. They can actually

tell at a very granular level. In fact, at the field level, what the downstream

impact is going to be. They can tell exactly what tables and fields are going to be impacted and what reports and what applications

are going to be

influenced. And then do that proactive communication, make sure that everybody's advised about the change and that everybody's ready

for the change and that it doesn't break their systems

as they roll it out to production. And so it really kind of changed the way teams work, and that's been very gratifying.
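
The field-level impact analysis described here amounts to a walk over a lineage graph. The following Python sketch shows one way that could work, assuming lineage is available as a simple mapping from each field to the fields and reports that read from it. The graph contents, the naming scheme, and the `downstream_impact` helper are hypothetical and do not reflect how Monte Carlo stores or exposes lineage.

```python
from collections import deque


def downstream_impact(lineage, start):
    """Return every node reachable from `start` in breadth-first order,
    excluding `start` itself. `lineage` maps a node to its direct consumers."""
    seen = {start}
    impacted = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for consumer in lineage.get(node, []):
            if consumer not in seen:  # guard against duplicates and cycles
                seen.add(consumer)
                impacted.append(consumer)
                queue.append(consumer)
    return impacted


# Hypothetical field-level lineage: field -> direct downstream fields or reports.
lineage = {
    "raw.users.email": ["staging.users.email"],
    "staging.users.email": [
        "marts.customers.contact_email",
        "marts.marketing.email_hash",
    ],
    "marts.customers.contact_email": ["dashboard:customer_360"],
    "marts.marketing.email_hash": [
        "dashboard:campaign_reach",
        "dashboard:customer_360",
    ],
}

impacted = downstream_impact(lineage, "raw.users.email")
print(impacted)
```

The resulting list is the set of owners to notify before changing or deprecating the field, which is the proactive communication step mentioned above.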

That's why we started Monte Carlo,

you know, to make the lives of data engineers better and easier and less stressful.

And so these sort of things make me really happy. You know, a couple of years ago, 1 of our early customers said something like, you know, the only tabs that I use is Gmail, BigQuery, and Monte Carlo.

And that became our bar for what an excellent product looks like. Right? We think data observability will be so critical to data engineers that they won't be able to do their jobs without that.

And, you know, I would say I've been surprised by how quickly the industry is adopting that. It's really not uncommon for us to speak with folks today where

data observability is just so ingrained in their stack. They don't really operate their data stack without something like this. So I think that's really powerful.

In your experience of

building the platform, building the product, engaging with the conversation around

observability, what are some of the most interesting or unexpected or challenging lessons that you have each learned in the process?

I think 1 of the things that I think a lot about is focus,

both for us at Monte Carlo, but also for our customers. Like, you know, if I put myself in the shoes of our customers,

they have a lot going on.

They can be doing

a ton of different things. They have a ton of asks.

They probably have a bottleneck of things that might take them about a year to get to, if they're lucky. Right? And so being really thoughtful about what are the 1 or 2 things that you can really do that move the needle

is a very important decision, both for data teams and for us internally.

And I think my kind of personal journey has been, you know, there's a lot of noise. There's a lot of things that we could do. But at the end of the day, remembering the focus on our customers, remembering

how do we allow someone, you know, a data team,

both to make their lives easier and also to deliver a great experience. Right? Whether it's working with a company that is in the medical space or, you know, in InsurTech and, you know, has influence on the ability to deliver loans. Right? Making sure that, you know, folks can actually get credit cards and mortgages. And if the data that's powering that is wrong,

people's lives are impacted. Right?

And so focusing on that and focusing on how do we improve

those experiences through reliable data and having a relentless

focus on that and on what matters

has been 1 of the most important things that I've experienced.

It's actually pretty similar. The 1 thing I learned is always listen to customers.

We all come with preconceptions

about what the important problems are and how to solve them, which we bring in from our own set of experiences and challenges and organizations that we worked in. But ultimately

every organization is unique and every

customer is unique.

And you have to really

listen deeply and understand how to solve people's problems.

Right? And 1 example of that that I always find curious is, you know, oftentimes people say like, Okay, let's do testing and monitoring. Right? You know, just pick the 10 assets that are most important and add a bunch of tests on them. And if you really listen to

a lot of teams,

you'll actually learn that most teams can't even say which assets are most important to them. Like, what are those 10 tables that you're going to really monitor?

And that was a really important insight for Monte Carlo. Right? It's true for decentralized teams. It's even truer with centralized teams. People don't always know out of the box

where to even start. Right? And so

building Monte Carlo in a way that allows teams to do that and gives them that information

was actually 1 of the most important aspects of Monte Carlo. And I don't think we would

have guessed that from our, you know, at least not from my personal experience.

It's something that we really learned from working with data teams out there and understanding what are their challenges.

As you continue to build and grow the Monte Carlo platform

and, you know, continue to explore the

possibilities and requirements of observability

for data platforms and data products. What are some of the things you have planned for the near to medium term, and what are some of the areas that you're excited to dig into?

There's several areas that we're focused on.

1 that, you know, was a cornerstone of our strategy so far is continuing to work on our ability to cover as much of the stack as possible. And 1 recent release around that was the ability to support

the Databricks Delta

ecosystem

that they're building, and we just launched this and are continuing to invest in making that integration as good as it can be and give people

the most powerful Databricks solution out there. We're going to continue to invest in covering additional pieces of the stack. Streaming is top of mind, and we definitely want to be able to do more for our customers there. We're getting a lot of demand around that. And I think the other side of it, going back to the detect, resolve, prevent framework,

we're going to launch a lot more capabilities

around helping teams actually prevent incidents and, over time,

foundationally improve the reliability of their systems and catch problems as early as possible

so that they have the least amount of impact on their data products. And so

this is where we're focused on top of, obviously,

streamlining and improving

all the core capabilities that we've built so far. We love working in partnership with our customers. And so today and forever,

you know, a large amount of our bandwidth

is dedicated to solving

the myriad of feature requests

that we get for our solution.

Well, for anybody who wants to get in touch with each of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.

I mean, I will say I'm biased, but, obviously, I think observability is a big 1.

So if anyone is investing in a modern data stack, I think, you know, you better make sure that that's reliable.

But other than that, you know, 1 of the areas that I'm excited about is sort of around

folks who are enabling

other data teams to actually build data products in an easier way. So

there's a team that I think they're called patch.tech

that's allowing, kind of, the creation of APIs on top of Snowflake and making that easier.

Pretty excited about that. I think a lot of technologies around how to bridge that gap between

the data platform and the production system. Right? And Patch is 1 example both Barr and I are pretty excited about. And then there's other companies, you know, basically helping make the products of the data platform

more usable

where the data is needed. Another example of that, by the way, is the data activation or reverse ETL companies like Hightouch.

I think they're

basically

helping teams to make data more valuable for the organization and get it used in more than just, you know, the context of the data platform. And so that's really exciting.

I'm also

kinda curious to see what happens with the BI space. Right? With Google taking out Looker, which was the innovator in the space. It'll be interesting to see what's the kinda next generation

and what innovations we're going to see around that.

Well, thank you both very much for taking the time today to join me and for all of your efforts in building the Monte Carlo product and helping to

engage in and

collectively evolve the overall space of data observability and data quality

and helping people build more reliable and resilient data systems. So I appreciate all the time and energy that you and your teams are putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. Thank you for, you know, I think 1 of the best podcasts in data, certainly in data engineering. So thanks for having us and I hope we'll see you in a couple years again for V2 of What's New with Data Observability.

Thank you for listening.

Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast,

which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com.

Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it.

Email hosts@dataengineeringpodcast.com

with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends

and coworkers.
