Summary
For analytics and machine learning projects to be useful, they require a high degree of data quality. To ensure that your pipelines are healthy, you need a way to make them observable. In this episode Barr Moses and Lior Gavish, co-founders of Monte Carlo, share the leading causes of what they refer to as data downtime and how it manifests. They also discuss methods for gaining visibility into the flow of data through your infrastructure, how to diagnose and prevent potential problems, and what they are building at Monte Carlo to help you maintain your data’s uptime.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Barr Moses and Lior Gavish about observability for your data pipelines and how they are addressing it at Monte Carlo.
Interview
- Introduction
- How did you get involved in the area of data management?
- How did you come up with the idea to found Monte Carlo?
- What is "data downtime"?
- Can you start by giving your definition of observability in the context of data workflows?
- What are some of the contributing factors that lead to poor data quality at the different stages of the lifecycle?
- Monitoring and observability of infrastructure and software applications is a well understood problem. In what ways does observability of data applications differ from "traditional" software systems?
- What are some of the metrics or signals that we should be looking at to identify problems in our data applications?
- Why is this the year that so many companies are working to address the issue of data quality and observability?
- How are you addressing the challenge of bringing observability to data platforms at Monte Carlo?
- What are the areas of integration that you are targeting and how did you identify where to prioritize your efforts?
- For someone who is using Monte Carlo, how does the platform help them to identify and resolve issues in their data?
- What stage of the data lifecycle have you found to be the biggest contributor to downtime and quality issues?
- What are the most challenging systems, platforms, or tool chains to gain visibility into?
- What are some of the most interesting, innovative, or unexpected ways that you have seen teams address their observability needs?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building the business and technology of Monte Carlo?
- What are the alternatives to Monte Carlo?
- What do you have planned for the future of the platform?
Contact Info
- Visit www.montecarlodata.com to learn more about our data reliability platform.
- Or reach out directly to barr@montecarlodata.com — happy to chat about all things data!
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Monte Carlo
- Monte Carlo Platform
- Observability
- Gainsight
- Barracuda Networks
- DevOps
- New Relic
- Datadog
- Netflix RAD Outlier Detection
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Their comprehensive data-level security, auditing, and de-identification features eliminate the need for time-consuming manual processes, and their focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta, that's I M M U T A, and get a 14 day free trial. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster.
With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode, that's L I N O D E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Barr Moses and Lior Gavish about observability for your data pipelines and how they are addressing it at Monte Carlo Data. So Barr, can you start by introducing yourself?
[00:01:54] Unknown:
Yeah. Hi, everyone. Great to be here. My name is Barr Moses. I'm the CEO and cofounder of Monte Carlo. We are, as you mentioned, a data reliability and observability company. We are dedicated to accelerating the world's adoption of data by eliminating data downtime. Super excited to talk more about that topic and all this great stuff. Previously, I was a VP at Gainsight, which is a customer success platform, where I worked with enterprise customers on using data as a competitive advantage, and I have been in data for the last decade or so. And, Lior, how about yourself?
[00:02:27] Unknown:
Hi, everyone. I'm Lior. I'm Barr's cofounder and the CTO of Monte Carlo Data. I focus mostly on product and engineering at Monte Carlo. Before that, I was SVP of engineering at a cybersecurity company called Barracuda, where I did a lot of work around fraud detection using machine learning and experienced a lot of the pains that today we solve with the Monte Carlo solution.
[00:02:52] Unknown:
And going back to you, Barr, do you remember how you first got involved in the area of data management?
[00:02:56] Unknown:
Yeah. Definitely. So I was actually born and raised in Israel, and at the age of 18, I was drafted into the Israeli Air Force. Did a lot of stuff with data there. Among other things, I was a commander of a data analyst unit, and so worked with sort of real-time data for IDF operations. I moved to the Bay Area about a decade ago, and after that I was a consultant at Bain and Company. Worked mostly with Fortune 500 companies on their data strategy and operations. And then at Gainsight, I led the team that was responsible for our customer data and worked with our customers on helping them unlock their customer data. I have been fortunate to work with data and data management in a number of different environments.
[00:03:38] Unknown:
And, Lior, do you remember how you got involved in data management?
[00:03:41] Unknown:
Totally. So, originally, I actually worked a lot on machine learning. I did it as early as the 2000s, at some of the earliest machine learning startups. Then later on, when I came to the Bay Area, which was also about 10 years ago, I started a company in the security space. We used analytics and data and machine learning to help detect fraud problems. The startup was acquired by Barracuda Networks, which was back then a public security company. And there, we actually expanded our team quite a bit, ended up being around 80 engineers and product managers, all working on products that heavily rely on data and data pipelines. So that's kind of how I got a lot of exposure to challenges and opportunities with data management, and that's where I really got excited about the idea of starting a company with Barr around that.
[00:04:35] Unknown:
When you were mentioning the work that you're doing at Monte Carlo, Barr, you used the term data downtime. I'm wondering if you can dig into that a little bit and some of the ways that it manifests in data workflows.
[00:04:47] Unknown:
Yeah. Definitely. So starting with the definition of data downtime. Data downtime is basically periods of time when data is missing, inaccurate, or otherwise erroneous. And data downtime today affects all data-driven companies, regardless of industry. It's a very costly problem for many data teams. In fact, companies lose millions of dollars per year just firefighting bad data issues. It impacts organizations small to large, and governments as well. Actually, not too long ago, it was published that the government sent no less than $1.4 billion in checks to dead people because of data downtime. On top of that, data teams waste north of 30% of their time troubleshooting data issues. Now, beyond the impact it has, why do we actually think this is a problem worth thinking about, and where does the name data downtime actually come from?
Actually, I'll draw a corollary to application downtime. Application downtime is something that everyone is familiar with. So I'll take you back a couple of decades, maybe 20 years or so. If a company's website was down, probably nobody noticed. Right? Like, not sure if anyone was looking at your website. Application downtime wasn't a big deal because no one was really watching. But today, that sounds ridiculous. Right? If your website is down, it's a huge deal. If your application is down, it's a huge deal. And so there's this whole massive industry that has emerged to help manage application downtime, and there are entire teams, DevOps teams, that manage application downtime.
If you look at the data industry in particular, I think the same thing is happening in data. So if you think about it today, or maybe a few years ago, you could get away with data being wrong in some places, perhaps some issues in some areas of your data stack. Maybe some reports are wrong. Maybe some tables in your data warehouse are not quite accurate. Maybe one or two sources have failed. Until recently, you could maybe get away with that. But today and, you know, in the next 5 to 10 years, I don't think that's gonna work anymore. I don't think we're gonna be able to have these prolonged periods of data downtime. And so I think in the same way that in the last couple of decades we've evolved as an industry to manage application downtime,
Now we need to evolve as a data industry to manage data downtime. And that's really kind of how we think about data downtime as a concept and where the term comes from.
[00:07:22] Unknown:
And another term that I've seen used in some of the blog posts and other material that you've been putting out is the idea of observability, which, as you mentioned, is something that has come about from a lot of the work happening with DevOps and application reliability. And I'm wondering what that means in the context of data workflows and data applications, and some of the types of information that you're monitoring to be able to gain observability into what might be causing these downtime issues for data applications and data analysis?
[00:07:55] Unknown:
Right. You're spot on. So going back to the analogy between data downtime and application downtime, observability has emerged as a fast growing area to support DevOps teams in managing application downtime. Basically, DevOps teams use observability to help manage their applications and gain visibility into the health of their applications, to basically optimize uptime. Right? In order to do that, there's a standard set of metrics and methodologies that they use; just a few examples are measuring performance, latency, CPU, etcetera. But there's a whole set of methodologies around observability that are very, very well known in software and have really developed in the last couple of decades.
Now if you take that idea and you apply it to data observability, it actually makes a lot of sense. Right? If you look at the data stack today, it's a complex system where you basically have anywhere between 5 to 50 or 1,500 different kinds of steps or systems, and it's really, really challenging to get visibility into the health of your data in that system, certainly in an automated way. And so we think that data observability is basically an organization's ability to really understand the health of the data in its system, and by understanding that, you can ultimately help reduce and eventually eliminate data downtime. And so in order to create that, you really need end-to-end data observability, all the way from when you first generate and ingest data, throughout the transformations, and all the way to analytics, where your machine learning models, your data scientists, and end users are actually consuming your data. So in order to really ensure the reliability and health of your data, you need to think about what we call end-to-end data observability.
And that includes instrumentation, monitoring of your data, analytics of the metrics that you're looking at, and remediation workflows when there are issues. All these different components together help create the reliability of your data health.
[00:10:02] Unknown:
And in terms of the contributing factors that lead to these issues of poor data quality and data downtime, what are the ways that that manifests in the different stages of the overall data life cycle? And are there any stages in particular that have an outsized impact on the downstream analysis or the utility of the data?
[00:10:26] Unknown:
Yes. Absolutely. There's a bunch of different factors that play into this, and, unfortunately, there's no silver bullet here. Right? So of the factors contributing to what we call data downtime, the first is actually the idea that companies are increasingly using very many sources of data to feed their analyses. It starts with various internal applications that send their operational data to the data team for analysis, continues with information from various SaaS platforms that serve different parts of the business, like Salesforce or Zendesk or others, and then continues with information that companies pull from third-party data sources to feed their products and their analyses.
The fact that there are so many data sources creates a lot of complexity, because every single one of these data sources can change in unexpected ways, ways that the data team does not control and may not have visibility into. So, for example, if one of the engineers building one of the applications that the company uses makes a change to that application, maybe even a small change like adding a field to the schema, the data team is impacted by that. And it's really, really hard to keep track of all the changes that are happening across the organization and outside of it and adapt the analysis accordingly. So that's definitely a contributing factor. And we hear about it all the time. An engineer made a change or a marketer made a change or sales made a change or a third party made a change, and we didn't know about it. And so our pipelines broke. So that's definitely a big factor here. Another factor that's increasingly an issue is that the analyses are increasingly complex.
People are building pipelines that sometimes have dozens of steps from source to data product, right, to the dashboard or to the machine learning model. And every single one of those steps can materially impact the reliability of the data. So any change made to any of these steps can break the system in unexpected ways. And we see that happening all the time as well. The third factor is actually the fact that data teams are growing in size. So it's not uncommon anymore to see data teams that can have more than 100 people or even 1,000 people building the data stack, building pipelines, building analyses. And at that point, it becomes impossible for any one person to really fully understand the system end to end and to really understand the implications of a change on all the downstream consumers.
And so all of these things together can create a lot of issues with data, and that's why, essentially, observability is so important. Right? So the idea of being able to monitor the system end to end, collect metrics that can help teams understand the health of the system and the dependencies in the system, and then a set of tools that allow them to collaborate around it and understand how they're impacted by other teams and how they impact their downstream consumers.
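As a rough illustration of the upstream schema changes described here (an engineer quietly adding or renaming a field in an application's schema), the following is a minimal sketch of how a team might detect schema drift on a warehouse table. The DB-API cursor, the snapshot file, and the table names are hypothetical assumptions for the example, and this is not Monte Carlo's implementation; it only shows the shape of the check.

```python
"""Hypothetical schema-drift check: compare a table's current columns
against the last snapshot we stored on disk."""
import json
from pathlib import Path

SNAPSHOT_DIR = Path("schema_snapshots")  # one JSON file per table (illustrative)


def fetch_schema(cursor, schema: str, table: str) -> dict:
    """Return {column_name: data_type} as the table exists right now.
    `cursor` is any DB-API cursor against a warehouse exposing information_schema."""
    cursor.execute(
        """
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = %s AND table_name = %s
        """,
        (schema, table),
    )
    return {name: dtype for name, dtype in cursor.fetchall()}


def diff_schema(previous: dict, current: dict) -> dict:
    """Summarize added, removed, and retyped columns."""
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "retyped": sorted(
            col for col in set(previous) & set(current)
            if previous[col] != current[col]
        ),
    }


def check_for_drift(cursor, schema: str, table: str) -> dict:
    """Compare the live schema to the stored snapshot and persist the new one."""
    snapshot_file = SNAPSHOT_DIR / f"{schema}.{table}.json"
    current = fetch_schema(cursor, schema, table)
    previous = json.loads(snapshot_file.read_text()) if snapshot_file.exists() else current
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    snapshot_file.write_text(json.dumps(current, indent=2))
    return diff_schema(previous, current)
```

A real system would also attribute the change to the upstream application and notify the downstream owners, but the diff itself is the core signal.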
[00:13:40] Unknown:
For observability and monitoring of, quote, unquote, traditional software applications, that's, for the most part, a fairly well understood problem, and there are a number of solutions that have been built out for that. And there are a number of ways that you can compose those different tool chains together. I'm wondering what you have found to be some of the ways that monitoring and observability for data applications and data systems differs from the more traditional software applications that we might be familiar with deploying and maintaining?
[00:14:10] Unknown:
Yeah. So there's definitely many similarities, but a few key differences. Starting out with the tooling and the mechanics, you know, we said this before, but I would say in this area, in the data space, we're probably about 10 or 20 years behind software engineering. And just now teams are starting to embrace things like automation and testing and monitoring and all these things that have been really the bread and butter for DevOps. One of the key differences is actually the personas involved.
So when you think about data observability, the folks who are both involved and responsible and, you know, want to understand things like what needs to be monitored, you know, what data assets depend on others, when there's a change somewhere in the system, what are the data assets that are impacted, and should I care about that change at all? Do I need to know about that? Those people are folks who, you know, can include a variety of roles starting from engineers who are building a data platform to data engineers who are responsible for the ETLs and pipelines. It could be data scientists who are actually using data for research or for modeling.
It could be data analysts who are creating reports and generating insights to drive the business, or it could even be business users or executives in specific functions like marketing or sales or support who are actually consuming the data, perhaps even on a daily basis, to make decisions. So, really, anyone who creates or consumes data needs to be involved, which is obviously very different from traditional software systems. I think probably the second difference is the outcome and impact of data downtime versus application downtime. I remember I was speaking to the head of data at an ecommerce company, and he told me, you know, when the data on our website is wrong, it's actually as bad or even worse than when our website is down. I think in today's world, where data really drives our applications, when there is data downtime, it's incredibly costly for the organization, and it impacts a totally different set of people. And then I would say finally, there are the measurements and metrics specifically that you'd want to monitor and analyze for data observability.
We actually had to kind of define what those are. And so there are five pillars that we believe in: freshness, distribution, volume, schema, and lineage, which together give you a holistic view of observability. And those are obviously very different from what you'd measure in traditional software.
[00:16:48] Unknown:
And that's also compounded by the fact that all of these systems are reliant on software that we write and deploy for being able to manipulate these data flows and handle transformations and data loading. And so in addition to those five pillars that you mentioned for the health and well-being of the data, we also need to correlate that with the observability metrics of the systems and platforms that we're using to process the data. And so I'm curious what you have found to be some of the particular challenges of that, or whether the existing observability systems are able to handle that correlation because of the fact that it's all a time domain problem.
[00:17:26] Unknown:
It's a great question. Honestly, we found that correlating it is a huge challenge. I'm not even sure it's been fully resolved in the application observability space. What we needed to do first is actually create observability in the data that flows through those systems. So as you mentioned, at the end of the day, data systems are software systems, and there's great tooling to track software and infrastructure. The gap that we see is the ability to track the data that flows through the system. Right? The idea of garbage in, garbage out. If you can't track the data as it goes through the different steps of a pipeline, you might end up with data downtime and with broken data products.
And so most of our efforts are around building visibility into that part so that teams can actually correlate it with application observability, as you suggested. And our approach has been very similar to how a lot of engineers have used tools like New Relic or Datadog. So we built the equivalent of that for data pipelines, right, the idea of one platform where you can aggregate a lot of observability metrics about the data that's flowing through the stack, and then use that information to monitor for issues, to detect various health issues, and then to troubleshoot and resolve them very quickly.
What we also found was that actually having observability can prevent a lot of the data issues in the first place. An example of that is, as Barr mentioned, one of the observability pillars that we strongly believe in is lineage, which is the idea of being able to understand the dependencies between different data assets. So, like, starting from maybe a dashboard that an analyst has built in a BI tool, going back into the tables that are queried by that dashboard, then the tables that feed those tables upstream, etcetera, etcetera, all the way to the source. Now, what we found is that the fact that we add that visibility and allow teams to understand how different assets depend on each other actually allows them to be responsible, right? And make changes to their system in a way that doesn't break any downstream consumers, which allows them to move faster and maintain a higher degree of reliability in the data products that they produce.
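To make the lineage idea concrete, here is a minimal sketch, with made-up asset names, of how a dependency graph between tables and dashboards can be walked downstream to find everything a change could impact. It illustrates the concept only and is not Monte Carlo's implementation.

```python
"""Toy downstream impact analysis over a lineage graph."""
from collections import defaultdict, deque


def build_downstream_index(edges: list[tuple[str, str]]) -> dict[str, set[str]]:
    """edges are (upstream_asset, downstream_asset) pairs."""
    index = defaultdict(set)
    for upstream, downstream in edges:
        index[upstream].add(downstream)
    return index


def impacted_assets(index: dict[str, set[str]], changed: str) -> set[str]:
    """Breadth-first walk downstream from the changed asset."""
    seen, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for child in index.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    # Hypothetical assets: raw table -> derived tables -> BI dashboard.
    edges = [
        ("raw.orders", "analytics.daily_orders"),
        ("analytics.daily_orders", "analytics.revenue_summary"),
        ("analytics.revenue_summary", "looker.revenue_dashboard"),
    ]
    print(impacted_assets(build_downstream_index(edges), "raw.orders"))
```

Running it shows that a change to `raw.orders` ultimately touches the revenue dashboard, which is exactly the kind of heads-up that keeps a schema change from silently breaking a downstream consumer.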
[00:20:01] Unknown:
In terms of the observability of systems for software applications, it sometimes requires instrumenting different areas of the code to be able to understand what's happening at the different points, figure out things like timing or using things such as tracing where you have a sample of the data as it propagates throughout the different stages of the system. Is there a similar analogy to data observability where you need to instrument the data applications or insert some sort of tracing record that you can track as it propagates through the different layers of the data workflow?
[00:20:36] Unknown:
Totally. So one of the main ideas behind the Monte Carlo platform is that part of instrumentation, right, the idea of connecting and integrating with the stack. And much like application reliability, the idea is to connect to the existing stack. Don't ask people to migrate to another solution. It needs to work with the stuff that people already use, that people already have in place. And by pulling information and metrics from those systems, we're able to create that observability. Right? So to your question about tracing and adding records, at least in the data observability case, we're able to do a lot of the tracking without introducing any changes to the data system.
And the way we do that is we're able to automatically track the history of how data has been used. So, for example, we can track historical records of SQL queries on the data. And by examining those records, examining how the data was used and how it was written, we're able to create what is the equivalent of a trace in the system, fully automatically, without having to modify the records.
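A toy version of that idea, assuming a simplified query log and naive regex matching, might look like the sketch below. A production system would read the warehouse's query-history views and use a proper SQL parser; the regexes and sample queries here are simplifications, not the actual technique used by Monte Carlo.

```python
"""Toy lineage extraction from a list of INSERT INTO ... SELECT statements."""
import re
from collections import defaultdict

TARGET_RE = re.compile(r"insert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def lineage_from_queries(query_log: list[str]) -> dict[str, set[str]]:
    """Map each target table to the set of source tables observed feeding it."""
    lineage = defaultdict(set)
    for sql in query_log:
        target = TARGET_RE.search(sql)
        if not target:
            continue  # not a write we can attribute
        sources = set(SOURCE_RE.findall(sql)) - {target.group(1)}
        lineage[target.group(1)] |= sources
    return lineage


if __name__ == "__main__":
    # Invented query history for illustration.
    history = [
        "INSERT INTO analytics.daily_orders "
        "SELECT * FROM raw.orders o JOIN raw.customers c ON o.cid = c.id",
        "INSERT INTO analytics.revenue_summary "
        "SELECT order_date, SUM(total) FROM analytics.daily_orders GROUP BY order_date",
    ]
    for target, sources in lineage_from_queries(history).items():
        print(target, "<-", sorted(sources))
```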
[00:21:53] Unknown:
And in terms of the platform that you've built for being able to collect and analyze and display this observability information and the potential error points in the systems, or raise alerts to the relevant teams, how have you architected that overall system?
[00:22:18] Unknown:
The way we think about solving this problem, sort of as Lior mentioned, is with end-to-end observability. So I think in the past, the common thinking around this was garbage in, garbage out. Right? Which meant that whatever data you ingest, you need to really think about its quality, because that's going to determine the quality of your data going forward. However, with the complexity of the tech stack today, that is no longer sufficient. You can ingest data that's perfectly clean, but then something might change along the way, which might cause your data to be corrupted, and you'll end up consuming data that's actually bad quality. And so because of that, we actually think about the different places in which data downtime can occur. It can happen upon ingestion. It can happen in the data warehouse if you're updating a table.
It can happen if you're actually migrating to a new data warehouse. It can happen in any of your Looker or Tableau dashboards. And because data downtime can happen anywhere across that stack, when we think about how to solve this problem, we need to think about it in an end-to-end way. It doesn't stop at garbage in, garbage out. That's first. Now second, what drives the approach here is the five pillars that we talked about. So I'll talk a little bit about each of them. Freshness, what does freshness actually mean? Freshness helps us determine how up to date your tables are. So if there's a table that you expect to see new data in on an hourly or daily basis, and then suddenly data doesn't arrive, there's potentially a freshness problem that you'll want to look into. The second pillar is distribution, which is probably a collection of a number of different aspects of the data itself.
One of them is whether the data is within an accepted range. So if you've typically seen a specific field be within, say, the numbers 5 to 50, and then suddenly it's at 1,000, a number it never reached before, that's probably an issue with that particular field that you want to be aware of. The third pillar is volume, the completeness of your data tables. If you expect a certain number of rows or a certain file size and it's wildly different today, you want to know about that. The fourth is schema, and that can be any sort of changes being made to your data: if a column is removed or changed, if a table is deprecated, etcetera.
And then finally, lineage really helps bring all of this together. Lineage is where you're basically mapping your upstream and downstream dependencies. So for any given data asset, you can know here are all the different sources and transformations upstream that are feeding this particular data asset, and here's everything downstream that's relying on it. Lineage actually helps us bring the whole story of these five pillars together, so we can paint a picture and understand, okay, there's potentially a freshness problem here upstream, which is related to or impacting this distribution problem in the table downstream, which relates to this report in particular, which is being consumed very often. And so thinking about these five pillars together gives us a good view of data observability, and applying that end to end to your data stack is really how we think about solving this problem of data reliability.
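For a rough sense of what checks against three of those pillars could look like, here is a minimal sketch. The thresholds and metric histories are invented for the example; a real system would learn these baselines from metadata and query logs rather than hard-coding them, and this is not Monte Carlo's detection logic.

```python
"""Illustrative freshness, volume, and distribution checks."""
from datetime import datetime, timedelta
from statistics import mean, stdev
from typing import Optional


def freshness_alert(last_loaded_at: datetime, expected_every: timedelta,
                    now: Optional[datetime] = None) -> bool:
    """Flag a table that normally updates on a schedule but has gone quiet."""
    now = now or datetime.utcnow()
    return now - last_loaded_at > 2 * expected_every  # generous grace period


def volume_alert(recent_row_counts: list[int], today: int, tolerance: float = 0.5) -> bool:
    """Flag when today's row count is wildly different from the recent average."""
    baseline = mean(recent_row_counts)
    return abs(today - baseline) > tolerance * baseline


def distribution_alert(history: list[float], value: float, z_threshold: float = 4.0) -> bool:
    """Flag a field value that sits far outside its historical range."""
    if len(history) < 2:
        return False
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(value - mu) / sigma > z_threshold


if __name__ == "__main__":
    # A daily table that has not loaded for three days.
    print(freshness_alert(datetime.utcnow() - timedelta(days=3), timedelta(days=1)))  # True
    # Roughly 10k rows per day, but only 3,100 arrived today.
    print(volume_alert([10_200, 9_800, 10_050], today=3_100))  # True
    # A field that has always been between 5 and 50 suddenly shows 1,000.
    print(distribution_alert([5.0, 12.0, 48.0, 33.0, 27.0], value=1_000.0))  # True
```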
[00:25:46] Unknown:
And as far as integrating into a customer's data stack, are there any particular systems that are easier to get information from, or any that are particularly difficult to gain visibility into? I know things like Kafka, for instance, have historically made it somewhat difficult to understand what's actually sitting resident in a given topic.
[00:26:10] Unknown:
Yeah. Totally. There's definitely places where it's easier or harder. I'd say at a high level, the more cloud based and modern the system is, the more likely it is to be observable using APIs that we all like to use, and using modern models of permissions, logging, and metrics. The more you go into the legacy area, on-prem technologies and older versions of warehousing or BI tools or pipelines, the harder it becomes. Right? Like, these systems may not have an API that you can really use, or it might be very limited. Those systems tend to be less integrated with source control and more modern development life cycle tools.
And they also tend to differ quite a bit across deployments. Right? So just to contrast, if you're using a technology like Snowflake or BigQuery, everyone has the exact same version of it, the exact same deployment. It's all very consistent. Whereas if you look at older technologies that the customer may have installed on their own and managed on their own, you might find a wide range of versions and a wide range of deployment models. And that's definitely more challenging in terms of connecting to it and pulling out all the information that's necessary in order to create end-to-end observability. Having said that, it is possible to connect to most of those systems, and we've really spent a lot of our time on making sure that we can very easily connect to all these systems and also collect all the metadata and the metrics that we need in order to create observability. And this is indeed an area where we invested a lot of resources and where we add a lot of value to teams, because doing that instrumentation is definitely a challenge and definitely requires a lot of engineering investment.
[00:28:10] Unknown:
Today's episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS based monitoring and analytics platform for cloud scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between data engineering, operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. And if you start a trial and install Datadog's agent, they'll send you a free t-shirt. Another element of data systems, beyond just the functional aspects of being able to move information from point A to point B or query it and answer a given question, is the so-called nonfunctional aspects of it, such as data governance, compliance issues, and security.
I'm wondering if there are any aspects of the observability of the system or any of the information that you're collecting that can help feed into those goals as well.
[00:29:14] Unknown:
Yeah. Absolutely. So I would say you basically listed the top challenges in data today. I think all of these areas, data compliance, data security, data governance, data reliability, data quality, are burning issues. And they've become more and more of a burning issue in recent years for a number of reasons. Maybe the primary one is actually how we use data today, which is very different from how we used data in the past. So, you know, in the past, maybe a decade ago or so, people would look at their data maybe once a quarter, when the finance team is reporting on numbers or something like that. And so there was a very small number of people looking at a pretty small dataset and a handful of reports that they used to look at the business and make decisions.
The world we live in today is incredibly different. We have very many data sources, very many people looking at data, and they are looking at it all the time, in real time. And more and more companies are moving to that model. And because of that drastic change, we need to ensure the highest bar for that data being secure, regulated, and governed. The other thing we were lacking is automation in all of these areas. And so when you think about automating data reliability and data quality, there are certainly aspects of security and compliance and governance which are very relevant, and which sort of get automated as well. So all of these topics are topics that we think a lot about and that are very, very near and dear to our customers.
We see that every day.
[00:30:58] Unknown:
And for somebody who is onboarding onto Monte Carlo and starting to integrate your system into their data platforms, what is the overall workflow of being able to get up and running? And what are some of the concrete outcomes of getting set up with Monte Carlo in terms of being able to gain better visibility or set up alerting and pager rotations for critical issues or some of the other aspects of going from having little visibility into their systems or using some homegrown solution that they may have cobbled together from various other platforms to using Monte Carlo and getting a more holistic view of how data is propagating throughout their platform?
[00:31:44] Unknown:
Yeah. In terms of implementing Monte Carlo, it's actually fairly easy, and we've spent a lot of time making it very, very seamless for our customers. So, essentially, the way it works is we would create read-only API keys or users on the various parts of the data stack, starting from BI tools like Tableau or Looker, through the data warehouse, whether it's Redshift or Snowflake or BigQuery or another, and all the way down to data lakes and even streams. Alright? So from the user's perspective, they would create an API key for us, and that's pretty much it. The moment we have those API keys,
all the rest is pretty automated. We automatically collect all the metadata, all the metrics, all the logs that we need in order to create observability. And so many of our customers can actually get started in pretty much less than 20 minutes. And then we do the heavy lifting of instrumentation, benchmarking the system, understanding what a normal baseline might look like, and then exposing all of that to the customer through our dashboards and notifications. And maybe Barr can tell us a little bit about how customers are experiencing it and how they are deriving value out of it.
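As a very rough sketch of the kind of read-only metadata sweep described here, the snippet below pulls table-level metadata from a warehouse's information schema and accumulates it as a baseline. The cursor, the columns queried (ROW_COUNT and LAST_ALTERED exist in Snowflake's INFORMATION_SCHEMA.TABLES; other warehouses differ), and the JSON storage are assumptions for illustration, not Monte Carlo's actual onboarding process.

```python
"""Hypothetical read-only metadata sweep plus a rolling baseline."""
import json
import time
from pathlib import Path

BASELINE_FILE = Path("table_baselines.json")  # illustrative storage location


def collect_table_metadata(cursor, database: str) -> list[dict]:
    """One row of metadata per table, using only read-only access."""
    cursor.execute(
        f"""
        SELECT table_schema, table_name, row_count, last_altered
        FROM {database}.information_schema.tables
        WHERE table_type = 'BASE TABLE'
        """
    )
    return [
        {"schema": s, "table": t, "row_count": rc, "last_altered": str(la)}
        for s, t, rc, la in cursor.fetchall()
    ]


def append_to_baseline(snapshot: list[dict]) -> None:
    """Accumulate snapshots so later checks can compare against history."""
    history = json.loads(BASELINE_FILE.read_text()) if BASELINE_FILE.exists() else []
    history.append({"collected_at": time.time(), "tables": snapshot})
    BASELINE_FILE.write_text(json.dumps(history, indent=2))
```

Repeating a sweep like this on a schedule is what makes the freshness and volume checks shown earlier possible, since they need a history to compare against.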
[00:33:07] Unknown:
There are three key areas. The first is actually being able to automatically know about and detect data downtime. So in the past, you typically found out about data downtime from a consumer of the data downstream, like a data scientist or a data analyst or a business user, or potentially you even heard about it from a customer who found a problem with your data and your product. Right? That's something that happens as well. The first thing is to actually know about data downtime and be the first to know about those issues. And so by having observability at your fingertips, you're able to understand and identify data downtime before it propagates downstream and impacts your business and your customers, which can be very costly.
That's the first item. The second is, actually, once there's an issue, being able to resolve it in a matter of minutes instead of what can easily take weeks or months today. We see many data teams that oftentimes find out about a data issue a few weeks or a month after it happened, only to then spend a few more weeks or months trying to figure out what the root cause is. Now the problem is that the longer the problem lingers, the more time it takes to actually resolve it and to spend time potentially backfilling the data, etcetera. And so the second key thing here is being able to resolve in minutes.
And then the third is actually preventing these issues to begin with. So when you have this data at your fingertips, and you actually have this visibility into the health of your data in an automated, scalable way, you can actually prevent all of these issues to begin with. And that's one of the most interesting things that we've seen. And I think that if we continue on this track, we should get to a place, as an industry, where we can actually trust our data. I know that sounds like sort of a crazy concept today. People have gotten used to the idea that the data is broken here and broken there, and, you know, such is life, bad data exists. I don't believe in that. I really think that we need to get to a place in the world where we can trust our data, and we can use it to make decisions in real time.
And I think data observability is the key to getting us there.
[00:35:21] Unknown:
In terms of either horror stories from situations that you found yourselves in at previous organizations where data downtime became a problem, or homegrown solutions that you've seen built to address those types of problems, what are some of the most interesting or innovative or unexpected ways that you've seen teams using Monte Carlo or their own systems to identify and address these observability and downtime issues?
[00:35:50] Unknown:
Yeah. For sure. There are so many data downtime disasters, all the way from teams that just wake up every morning and get started on the daily data fire drill, to a CEO that would go around the office and put a sticky note on monitors and reports saying, this data is wrong, to a customer that left the vendor because the data was wrong, resulting in material impact on that company, or a company that was at a compliance risk because the financial data they were tracking was inaccurate.
So definitely, if you've been in data, you for sure have a data disaster that you can tell about or that you remember. And, yeah, in our experience, we've spoken with hundreds of data teams about this problem, and we've actually collected all the different ways in which teams deal with this and how they think about it, and plotted what you could call a data downtime maturity curve, or journey. And we found that there are four key stages that a typical company goes through. The first stage is what you'd call the reactive stage, where really you don't quite have anything to ensure the quality and reliability of your data. And I think that's something that probably a lot of listeners can resonate with, but, you know, it's very hard to know what data can be trusted.
It's very hard to know what data is important and what's not, what's used and what's not. And, really, you are playing this game of whack-a-mole, trying to figure out where the next data fire drill is gonna come from and how quickly you can resolve it. This can happen even on a daily or weekly basis. It oftentimes happens at the same time that organizations try to become more data driven. So it's actually a good problem to have. It means that you are doing the right things with data, but you still haven't gotten to the part of figuring out data reliability yet. And that's when people start moving into the proactive stage, where data teams actually start putting in some probably manual testing or checks. It can be something pretty basic, like looking at row counts, which starts to give data teams some visibility into some symptoms of what happens when data goes wrong. Obviously, the challenge there is that you will probably spend a lot of time in postmortems to figure out, for every fire drill, what particular test or monitor you need to apply now. But starting to think about this in a proactive way is the second stage and an important shift in a company's life.
The third stage is the automated stage, where companies actually start hacking their own solutions, and that's where it gets really interesting. People start building various solutions for this, and there are definitely different approaches here, but it certainly requires a dedicated engineering work stream or a decision to invest in this. And then there's the last stage, which is scalable, where really you have the Netflixes and Airbnbs of the world, which have dedicated specific resources to figuring out a very scalable way to provide data quality and reliability, oftentimes across their data stack. So just as an example, Netflix has something called RAD, which is their anomaly detection, which uses RPCA (robust principal component analysis) to detect anomalies.
There's actually an interesting blog post about this that they released, and that's probably an example of an organization that's very mature in how they think about meeting their observability needs. I think, overall, we're seeing more and more data teams definitely realizing that they need to start thinking about this, and also looking for more automatic ways so that their teams can actually be relieved of these things and work on projects that are revenue generating or more core to their own business.
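Netflix's RAD relies on robust PCA, which is beyond the scope of a short example, but to give a flavor of the "automated" stage, here is a much simpler robust outlier check on a daily row-count series using the median and median absolute deviation. The sample series and threshold are invented; this is a stand-in sketch, not the RAD algorithm.

```python
"""Simple robust outlier detection on a metric time series (modified z-score)."""
from statistics import median


def robust_outliers(series: list[float], threshold: float = 3.5) -> list[int]:
    """Return indices whose modified z-score exceeds the threshold."""
    med = median(series)
    mad = median(abs(x - med) for x in series)  # median absolute deviation
    if mad == 0:
        return []  # series is essentially flat; nothing to flag
    return [
        i for i, x in enumerate(series)
        if abs(0.6745 * (x - med) / mad) > threshold
    ]


if __name__ == "__main__":
    # Invented daily row counts; day 4 is a suspicious drop.
    daily_rows = [10_100, 9_950, 10_150, 10_050, 2_400, 10_120, 10_080]
    print(robust_outliers(daily_rows))  # only index 4 (the 2,400 drop) is flagged
```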
[00:39:49] Unknown:
Another interesting thing about this particular point in time is that it seems that this is the year that everybody has decided to build a data quality or data reliability company, where I can think of at least 3 or 4 different businesses that have launched in the past few months. I'm curious what it is about this particular point in time in the industry, or the economy, or the maturity of organizations, or the maturity of the platforms that we have available to us, that has led to that particular outgrowth and this particular focus on it being a business need that needs to be solved, such that there's so much opportunity for that many companies?
[00:40:27] Unknown:
Yeah. It's definitely the year of data observability. I'd agree with you on that. And I think it's probably a combination of a few different things that we've talked about. Right? One is the drastic change that we're seeing in the last few years in regards to how we work with data, moving from a small number of people who work with a small subset of data to many people in the organization using data all the time. And actually, I think COVID-19 has accelerated this to a certain degree. We see companies relying on data way more than in the past. Actually, interestingly, DoorDash recently posted a great article on their data platform.
And we're seeing more and more teams in this day and age thinking about what a great data platform can look like. The second, I would say, is that, as you mentioned, the data stack itself and its consumers have evolved. The way we used to work a few decades ago just doesn't cut it anymore. People would tell me a few years ago, yeah, you know, I have 3 or 4 sets of eyes on every single report before it leaves our team. We always have someone who's looking at the numbers and checking them every day. That just doesn't cut it anymore today. So I think, in some ways, technology needs to evolve as well. And then I'd say the third reason that I see is just that the data industry has progressed at a breathtaking rate in the last few years. You can see that in large acquisitions like Looker and Tableau, and Snowflake's IPO just a few weeks ago is another example. We're obviously seeing a lot of exciting changes and exciting times in the data industry at large.
[00:41:59] Unknown:
And in your own experience of building the Monte Carlo business and the technology platform to facilitate visibility into customers' data workflows and the issues that crop up there, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:42:18] Unknown:
You know, I'd say probably one of the most interesting is that we're creating a category, right, the data reliability, data observability category. And we're solving a very big problem that has a broad set of capabilities. But in the end, you really need to listen to your customers and to what works in order to figure that out. A few specific things that we found out: one is that, actually, language matters. What you call a problem, and how you think about a problem, really makes a difference when speaking to people. And so we've heard that data downtime and data observability are things that people intuitively understand and get. There weren't that many words or that much language to describe this. And so we actually had to go through the process of thinking through what is the language that describes this problem? How do we think about this? How do our customers think about it? And then bridge that with terms like data downtime. I would say the second is that data reliability is a very, very broad problem, but if you actually break it down into specific pain points that people really resonate with, that's the first step to tackling it. Specific problems like getting a call from your CEO at 7 AM on Monday morning because the report looks wrong. That's something very, very specific that people understand and have experienced if they're in data. And so, by starting to lay out what the specific pain points are, we can start tackling this problem and think about what a solution to it can be. And the third, I would say, is that we've been fortunate to work with some great customers and to really think about this category and this platform alongside them. They are probably the most important part in this journey. And so, when thinking about this problem, we worked with customers from very early on. The goal was to get something in their hands and then iterate with them in solving this problem. And I think these are some of the key things that have been most interesting and also most fun to work on as part of the Monte Carlo journey.
[00:44:22] Unknown:
Do you have any of your own insights on that, Lior?
[00:44:25] Unknown:
Sure. Yeah. I think the thing that struck me the most was that data is always very specific to every company, and every data incident or data issue is very unique. Right? Like, each one is like a snowflake, and everyone has their own problems. What I was really amazed by is that there are actually a lot of common threads between companies, and we keep seeing the same things break over and over again. And so by really focusing on those particular problems that keep happening, which you can kind of tease out just from the observability pillars that we mentioned earlier, you can actually go and solve a pretty broad range of data incidents, right? Automatically detect them, help teams resolve them quickly, and help prevent them in the first place. And to me, that was surprising, because when we first started, we were wondering, is it actually possible to solve this problem across different companies, different teams, different types of datasets?
And to me, it was great to find out that the answer is a resounding yes. So, yeah, that's my lesson from 2 years of working on this problem.
[00:45:38] Unknown:
For people who are running their own data platforms and who have their own workflows that they're trying to manage and gain visibility into, are there any cases where Monte Carlo might be the wrong choice, or where there might be some alternative approach that would work better for them?
[00:45:55] Unknown:
Yeah. So I would say definitely one alternative is to actually not use your data at all, or to not rely on data. That's definitely one approach that you could take. The other alternative is to actually try and build in house. We see data teams who decide to focus on this as part of their platform and put work into building it. There's definitely a cost-benefit question of building something like this from scratch. It definitely requires a unique set of capabilities, ranging from engineers to data engineers to data scientists and more. And so I think oftentimes, for some companies, it can be challenging to justify building something that's complicated and nuanced like this from the ground up. But certainly, if that's something that companies have an appetite for, that's an option as well.
[00:46:42] Unknown:
And as you continue to build out the platform and work with customers, are there any initial assumptions that you had going into this or any personal biases as to what you thought the proper solution was going to be that have had to be updated or that have been challenged in the process of bringing on more different workloads or different problem domains or industries?
[00:47:08] Unknown:
Yeah. I would say our customers challenge our assumptions every day. And so if there's something that we've learned, it is that we can think together about what we believe is a potential solution, but then build it as fast as humanly possible, put it in the hands of a customer, and see how they actually interact with it. There's also a big difference between what a customer says they want and what they actually do. So in a sense, I feel like a company focusing on data observability like us is really running a series of experiments with early customers to validate and invalidate all of the assumptions that we had. One of them is what Lior spoke about before, which is, what are the different aspects of this problem and how do we attack them? In a way, if you think about each of the data observability pillars, we've had a certain set of assumptions about each of them: which pillar matters more and why, and how to think about the solution for each. And so I would say, literally, our customers challenge us on those and surprise us on those every single day.
[00:48:14] Unknown:
One example that springs to mind, which surprised me, was that we always came in thinking that people want to know when things break, and that was true. But what also surprised us is that people really like to know when things got resolved. Apparently, in a lot of companies, you hear about things breaking. You know, some alerts might go off or someone might announce it quite widely, but nobody ever tells you when things are fixed and can be used again. And so that was an interesting learning from working with actual customers: when we're able to automatically detect that there's an issue and then automatically notify the stakeholders when it's resolved,
that's actually a very powerful thing that people really appreciate and get a lot of value out of.
[00:49:01] Unknown:
And along those lines, for software applications, there's often the concept of a status page that you can go to if you start to experience problems with something like GitHub or Slack to see, is this down? What is it that's causing this problem? Do you have anything analogous to that for the data domain, where you provide a status page of this is causing a problem, these are some of the downstream impacts, this is what our expected time to resolution might be, or this is when the resolution happened, and have something like the concept of an uptime graph?
[00:49:31] Unknown:
Yeah, a hundred percent. If you think about what goes into preventing and mitigating data downtime, there are a few key metrics that we think about. One of them is the number of incidents. How many incidents do you have? What's their severity? What's their impact? That's one important metric. The second is time to detection. How quickly have you actually found out about an issue? How quickly have you identified it? How quickly was the person responsible for it notified? There are some organizations where, in order to find the person responsible, it can take fifteen or more pings back and forth on Slack to figure out who the right person is and who should be on the hook for this data incident.
So that's the second big metric, time to detection. The third is time to resolution. From the minute that you have a data downtime issue until it's resolved, how long does that take? Is it a matter of months, weeks, minutes? And obviously, reducing both time to detection and time to resolution, as well as reducing the number of incidents overall, is what should help you increase your data uptime. I'm also starting to see companies who are actually measuring their data downtime or uptime. And I think that one day we're going to have something like five nines of data uptime, and that's something that any data team will need to monitor, think about, and optimize for.
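To make those metrics concrete, here is a minimal sketch (not Monte Carlo's product, just an illustration of the metrics described above) that approximates total data downtime as the sum, over incidents, of time to detection plus time to resolution, and expresses uptime as the remaining fraction of the reporting period. The Incident fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One data incident; field names are hypothetical for this sketch."""
    time_to_detection_min: float   # minutes from when the issue occurred to when it was detected
    time_to_resolution_min: float  # minutes from detection to resolution


def data_downtime_minutes(incidents: list[Incident]) -> float:
    """Total downtime: each incident contributes its detection time plus its resolution time."""
    return sum(i.time_to_detection_min + i.time_to_resolution_min for i in incidents)


def data_uptime_pct(incidents: list[Incident], period_minutes: float) -> float:
    """Uptime as a percentage of the reporting period (e.g. 99.999 would be 'five nines')."""
    return 100.0 * (1.0 - data_downtime_minutes(incidents) / period_minutes)


# Example: two incidents over a 30-day month.
incidents = [
    Incident(time_to_detection_min=120, time_to_resolution_min=240),
    Incident(time_to_detection_min=15, time_to_resolution_min=45),
]
month_minutes = 30 * 24 * 60  # 43,200 minutes
print(data_downtime_minutes(incidents))                     # 420 minutes of data downtime
print(round(data_uptime_pct(incidents, month_minutes), 3))  # ~99.028% data uptime
```

Under this rough definition, "five nines" over a 30-day month would allow for well under a minute of total data downtime.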
[00:51:01] Unknown:
That reminds me of the old joke that you can have as many nines of uptime as you want, but I get to choose where the decimal point goes.
[00:51:09] Unknown:
Exactly.
[00:51:10] Unknown:
So yeah. Absolutely. And so as you continue to build out the platform and plan for future features or future use cases, what are some of the things that you're most excited by, or trends in the industry that you're keeping an eye on, or new capabilities that you want to add?
[00:51:29] Unknown:
Yeah. So I think it's definitely an exciting time in the market, and there are a number of exciting things that we're building out. One of them, to go just a little bit beyond what we talked about today, is key assets, which is a new way to think about what data matters and what data is outdated or no longer relevant. We're going to be publishing a blog post about that pretty soon. But taking a step back, our mission is really to give data teams unwavering confidence to trust their data, so that they're not paged by their CEO in the middle of the night or forced to reckon with a broken report during a company all-hands meeting. That means data pipelines are trustworthy and reliable.
Data analysts can use their dashboards and know what data they can trust, where it's coming from, and who else is using it. It also means that data governance is automated, easy, and painless, and ultimately that companies can make better decisions with data. I really think this is the most important trend in the data industry. If you think about it, we have more and more people making decisions based on data, but also starting to use machine learning models to make decisions for them, or on their behalf, with data. And so making sure that we have reliability and trust in data is key to enabling almost all of the other trends in the industry that we'd like to see going forward.
[00:52:51] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, starting with you again, Barr, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:53:08] Unknown:
A hundred percent, I would vote for data observability. Check out some of our recent blogs on this. But I think data engineering is an area that's moving fast and needs to catch up to software engineering, and I think observability and reliability is one of the key areas to do that. More broadly, figuring out how to automate data engineering workflows is something that, in the last year or so, has really started to bear fruit. For us, that means automatically detecting and preventing data issues. I'm optimistic about the future of this space. I've seen a lot of different interesting areas with a lot of potential.
Excited about the future.
[00:53:47] Unknown:
And, Lior, how about yourself? I think one of the biggest challenges in data is actually collaboration. This is something that's been worked on a lot in the software engineering world; there's a lot of tooling around it, and the industry is getting increasingly mature with it. But with data, a lot of these problems remain unsolved. To deliver really great data products, you need to have data engineers, you have to have analysts, you have to have data scientists, and all of them have to work in concert.
And how do you allow all of these different groups and individuals to work together at scale to deliver great products that are both useful for the business and also reliable and trusted? Data observability is obviously a big part of that.
[00:54:38] Unknown:
Well, thank you both again for taking the time today to join me and discuss the work that you've been doing with Monte Carlo and helping to promote the concept of observability for data workflows. It's definitely a very problematic area, and one that it's good to get some strong focus on from yourself and others in the industry. So thank you again for all of your time and effort on that, and I hope you enjoy the rest of your day. Thanks, Tobias. It was fun. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Guests and Monte Carlo Data
Understanding Data Downtime
Data Observability Explained
Factors Contributing to Data Downtime
Challenges in Data Observability
End-to-End Data Observability
Integrating Monte Carlo into Data Stacks
Customer Onboarding and Benefits
Data Downtime Maturity Curve
Rise of Data Quality Companies
Monte Carlo's Approach and Lessons Learned
Future Trends and Exciting Developments