Summary
The accuracy and availability of data have become critically important to the day-to-day operation of businesses. Similar to the practice of site reliability engineering as a means of ensuring consistent uptime of web services, there has been a new trend of building data reliability engineering practices in companies that rely heavily on their data. In this episode Egor Gryaznov explains how this practice manifests from a technical and organizational perspective and how you can start adopting it in your own teams.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you'll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
- Your host is Tobias Macey and today I’m interviewing Egor Gryaznov, co-founder and CTO of Bigeye, about the ideas and practices of data reliability engineering and how to integrate it into your systems
Interview
- Introduction
- How did you get involved in the area of data management?
- What does the term "Data Reliability Engineering" mean?
- What is encompassed under the umbrella of Data Reliability Engineering?
- How does it compare to the concepts from site reliability engineering?
- Is DRE just a repackaged version of DataOps?
- Why is Data Reliability Engineering particularly important now?
- Who is responsible for the practice of DRE in an organization?
- What are some areas of innovation that teams are focusing on to support a DRE practice?
- What are the tools that teams are using to improve the reliability of their data operations?
- What are the organizational systems that need to be in place to support a DRE practice?
- What are some potential roadblocks that teams might have to address when planning and implementing a DRE strategy?
- What are the most interesting, innovative, or unexpected approaches/solutions to DRE that you have seen?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Data Reliability Engineering?
- Is Data Reliability Engineering ever the wrong choice?
- What do you have planned for the future of Bigeye, especially in terms of Data Reliability Engineering?
Contact Info
- Find us at bigeye.com or reach out to us at hello@bigeye.com
- You can find Egor on LinkedIn or email him at egor@bigeye.com
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Egor Gryaznov, co-founder and CTO of Bigeye, about the ideas and practices of data reliability engineering and how to integrate them into your systems. So, Egor, can you start by introducing yourself? Sure. Thanks for having me on, Tobias.
[00:02:11] Unknown:
I'm Egor. I am the co-founder and CTO of Bigeye. We are building a data observability platform. And what we're gonna be talking about today, data reliability engineering, is really where we want to go with the product and enable customers to be able to have a solid data reliability engineering platform
[00:02:32] Unknown:
regardless of the size of their data team. For folks who haven't heard the episode that you did previously about the work that you're doing at Bigeye, can you remind us about how you got involved in the area of data management?
[00:02:43] Unknown:
Definitely. So I describe myself as a software engineer that has just been doing data things my whole career. I started off writing MapReduce jobs back when Hadoop was just getting started. And from there, I got into data warehousing, built out warehouse infrastructure, particularly using Vertica, built in-house ETL tooling, did data modeling, was an early adopter of Looker back in the day. And in late 2014, I joined Uber as one of the first data engineers in the company. Uber at the time was starting to scale out their analytics platform. They were going from one Postgres replica supporting the whole company to a proper data stack.
So I did sort of the same things, except at 10x the scale and 10x the pace: set up their Vertica clusters, built internal ETL tooling, made sure that all the data was flowing, set up visualization, did star schema data modeling. And there, we realized that in order to scale a data platform to a company of that size and that amount of disparate data, we needed to build a lot of internal tooling to manage that data. And so Uber, as many other large tech companies tend to do, had an internal team of engineers building out internal tooling around their data infrastructure.
So these are things like data catalogs, data quality tools, lineage tracing, communicating data outages. And when my co-founder, Kyle, and I started talking to people in the industry, we realized that everyone is facing the same problems everywhere else as well. And the lack of tooling available in the market made it really hard for data teams to spin up and scale quickly because they would have to hire people in order to build all this stuff internally. And so that was really the impetus for starting Bigeye: to provide the sort of tooling that we wish we had in 2014 and 2015 to data teams that are just getting started today. So focusing on the term data reliability engineering,
[00:04:58] Unknown:
and you mentioned also that the platform that you're building at Bigeye is around adding observability to the data stack. And I'm wondering if you can just give your definition of what is encompassed by that phrase of data reliability engineering and maybe some of the ways that it bears relation to the observability aspects of the sort of data platform and data pipelines?
[00:05:21] Unknown:
I see data reliability engineering as a natural extension to the data team in the same way that SRE was a natural extension to software engineering teams. Data reliability engineering means treating data quality like it is an engineering problem. It's applying practices and using tools in order to ensure that the data stays fit for use across every single application in the business without losing the velocity of your data team in your data environment. If you look at SRE, it came out of a need to scale how software was built. Before, when you had bare metal servers, it would take a long time to build an application, deploy it, go to your server closet, reconnect the network cables, and then that's how things worked. And so things moved slower.
With the advent of AWS, all of a sudden, any developer across the globe could push a button, get a database, get a web server, have some APIs. And now this proliferation of applications led to the next problem of how you know that these applications are actually running correctly. This is where SRE came into play. SRE had the goal of creating the processes and tooling necessary to know that your applications are running correctly, are actually available, and working as intended. If you look at the data landscape today, data infrastructure is very easy. It's swipe a credit card, get Snowflake. There's your data warehouse. Swipe a credit card, get Fivetran. There's your ETL tool. And pushing more and more data into a central data warehouse is easier than ever. This now takes teams days instead of months as it used to. Because it's so much easier now to collect all this data and start using it, you have this large responsibility on the shoulders of fewer and fewer data engineers who are managing a larger number of data pipelines.
And in order to know that the data is good and reliable, you need similar sorts of practices and tools that we have in SRE, but applied to the data landscape. And this is where data reliability engineering comes in. Now you asked a great question, which is how does data observability map into this concept of data reliability engineering? And in my mind, data observability is a necessary prerequisite for a great data reliability engineering platform. You can't start ensuring the quality of your data and know that it's reliable and be able to detect that and communicate that without first monitoring what your data looks like.
And data observability encompasses all of these concepts around monitoring, alerting, setting up the instrumentation, making sure that at any point in time, you know what the state of your data is and what it looks like. And once you have that observability in place, then you can start building the other pieces of the data reliability engineering platform, such as defining SLAs around your deliverables, such as your datasets and your dashboards, running better incident management processes, having runbooks for resolving issues that have occurred in the past where you can take very concrete steps in order to fix them. And, obviously, the holy grail of any SRE or DRE movement is automating all of this and then saying, as soon as an issue arises, can we have the system push the right buttons in order to fix the problem without having any human intervention?
And so data observability is a necessary piece, but it's not the whole piece of the puzzle. And there's a lot more to data reliability engineering, but you can't even get to those more advanced topics without first knowing what your data looks like. So you mentioned how observability is sort of the base need for being able to build out a data reliability
[00:09:51] Unknown:
capacity within an organization or a data team. And I'm wondering if you can talk through some of the other aspects of data reliability engineering, both the technical and social and organizational aspects.
[00:10:05] Unknown:
So let's break this down into those two different pieces: the technical aspects, which would be the tooling and how you even do data reliability engineering, and then the social ones, which are typically how you structure a data team, or really a whole business and organization, in order to leverage data and participate in best practices around data reliability. From a tooling perspective, the first thing that is necessary, as we mentioned earlier, is data observability, understanding what your data looks like. Historically, data observability has been built in house. You've had teams run SQL queries, collect some values, and push them into something like Datadog, and then have Datadog alert them when something's going wrong.
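For readers following along, a minimal sketch of that do-it-yourself pattern might look like the following, assuming a Postgres-compatible warehouse (via psycopg2) and a StatsD-style metrics agent (via the statsd package); the table, column, and metric names here are hypothetical:

```python
# Hypothetical DIY data observability check: run a SQL query against the
# warehouse, turn the result into a metric, and ship it to a metrics backend
# that can alert on it (Datadog, StatsD, etc.).
import psycopg2                      # assumes a Postgres-compatible warehouse
from statsd import StatsClient       # assumes a StatsD-style metrics agent

def emit_freshness_metric(dsn: str, table: str) -> None:
    """Measure hours since the table last received data and emit it as a gauge."""
    query = f"""
        SELECT EXTRACT(EPOCH FROM (NOW() - MAX(updated_at))) / 3600.0
        FROM {table}
    """
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(query)
            hours_stale = cur.fetchone()[0] or 0.0

    # The alerting rules live in the metrics backend, e.g.
    # "page the on-call if hours_stale > 24".
    statsd = StatsClient("localhost", 8125)
    statsd.gauge(f"data.freshness_hours.{table}", hours_stale)

if __name__ == "__main__":
    emit_freshness_metric("dbname=warehouse user=analytics", "orders")
```

A scheduler such as cron or Airflow would run a script like this on an interval, which is roughly the homegrown setup the conversation describes.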
Obviously, with tools such as Bigeye today, this is now available out of the box, and teams don't have to take this on on their own. Usually, once you know the state of your data, you want to then surface this information to the rest of the business. And this is where data catalogs come into play. If you look at things like Alation, Stemma, and Collibra, data catalogs are a great place for the whole business to see what is the state of their data, but also a great place to start enforcing SLAs around your data products. If you think about every table, every dashboard, every report, every ML model as a data product, then a data catalog becomes a clear place for the whole business to understand what is the state and what is available for each one of those products.
Finally, the last tool that would be necessary is really around runbooks and incident management. Sadly, there isn't much available today, but it is an interesting area to explore. If you look at PagerDuty and Opsgenie and the like from the SRE world, they're geared around individual incidents that get recorded historically. And you can comment on them and write notes in the attached runbooks that will tell you, this is how this problem was resolved. Sadly, today in the data world, this turns into a guessing game, where you go on Slack, find the last person who touched this dataset, and ask them, well, how did you solve this problem last time? Have you ever seen this? And that's just not a great way to scale out incident management. And so the last missing piece for data reliability engineering, in my mind, is having a holistic way of expressing what issues have happened in the past and how they got resolved, and then, obviously, potentially automating the resolution of some of them.
From an organizational perspective, the tooling for data reliability engineering is actually the easy part. And the hard part for most businesses is the actual organizational aspect, or the organizational systems that need to be in place for data reliability engineering to even exist. I've talked before about the two types of data users, the data producers and the data consumers. Producers are your data and ML engineers who are building datasets and data products, and then your consumers are usually the analysts in the business and the data scientists who are doing something with those data products. Now data producers need a good way of understanding who their consumers are, and that is typically the hardest part about data reliability engineering inside of a business. If you have a large organization that has thousands of people, you could have different consumers of varying degrees of technical knowledge and expertise all using the same datasets that you're producing.
And if the producers don't know who these people are and how to communicate to them in a way that they will understand, then it's really hard to have this data reliability, because the consumers will not be able to understand when something goes wrong. Or if they do understand that something is wrong, they might not understand why or how they should be working around it. The second hard part about the organizational systems necessary is the problem of empowering the users to act on and be notified about reliability issues.
Today, especially in a remote working environment, we have this problem of email and Slack overload, where I don't think I've cleared my Slack notifications in the last three months. There's always something unread. There's always something going on. Having the tools in place to tell you about data issues is a necessary prerequisite to data reliability. But if these notifications aren't going to the right people, you just end up with alert blindness and alert fatigue, where the people getting notified about the issues don't actually know how to deal with them and may just end up ignoring them.
So these alerts need to be relevant. The data team needs to be able to set up groups of people that care about particular data products and notify them, and only them, about any issues with those products. And those people should then be empowered to actually go and do something about the issue. If you get a notification that's not relevant to you, that you can't act on, that's
[00:15:53] Unknown:
no better and probably worse than just not getting a notification at all. You mentioned that there are some parallels between data reliability engineering and site reliability engineering from the sort of web application and services ecosystem. I'm wondering what are some of the ways that the concepts map between the two domains and some of the aspects of data reliability engineering that are unique to the problems that exist in the sort of data management and data analytics ecosystem.
[00:16:23] Unknown:
A lot of the core principles of site reliability engineering, or SRE, are still applicable in data reliability engineering. Data observability, which we covered a little bit earlier, is very similar to system observability, which in SRE land is provided by Datadog, New Relic, AppDynamics, and the like. There's also the notion of having runbooks and having incidents. When an application goes down, you have an issue. And that issue then needs to be resolved and needs to be documented. Data reliability engineering has the same concept. Finally, one of the largest ideas in SRE is the notion of SLAs, or service level agreements, where each service has an allowable amount of downtime and a way to measure that downtime and when it is actually available or not.
Data products should behave a lot like applications, should have their own SLAs that describe when that data product is available. And if it's not, how long has it not been available for? And then you can start summarizing that by quarter, by year, to see how often things are going down, how often I can trust my data to be up to date and reliable. Now talking about some of the differences, applications all have the same sorts of metrics that you are monitoring about them. You have your classic latencies and QPS, how many requests per second it's serving, memory consumption, CPU utilization.
These encompass the vast majority of metrics that people care about from an application perspective. And just applying those four, you can cover 90 to 95% of applications and know the general state of what is going on with them. Error rates would actually probably be the fifth one there. The hard part about data is that data is so disparate. There are so many different things represented by each dataset that you can't have the same metrics representing every single data product. If you have a dataset that you expect to update daily, you might want to measure how often it is updating. You might wanna measure how many records it is loading.
But then once you start drilling into it, you have different fields and different columns that are available to you. And those all have their own unique properties that you might want to measure, which are different across each dataset and each column. For example, if you have an identifier for a user in your system, you might expect that identifier to be unique. But that measure doesn't actually apply across the board, because different columns have different data in them. You might have a column storing a ZIP code, and you might wanna check that this is actually a valid ZIP code or a valid email address. So the hard part about doing data reliability engineering is that there are so many things that could go wrong with the data, and every incident is probably going to be very unique, as opposed to SRE, where finding out about incidents is pretty easy and the hard part is actually figuring out how to debug it.
Now I'm not saying that debugging data problems is any easier. We've all gone through our own pain of that one problem that you just can't quite figure out. But detecting issues and knowing exactly where they're going wrong is much harder on the data side than it is in SRE.
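To make the column-level idea concrete, here is a hedged sketch of a few checks of the kind Egor describes, expressed as SQL generated from Python; the table and column names are made up, and the regex operator assumes a Postgres-style warehouse:

```python
# Hypothetical column-level reliability checks. Each function returns a SQL
# query whose single numeric result can be compared against an expectation;
# table and column names are illustrative only.

def duplicate_id_count(table: str, id_col: str) -> str:
    """IDs that appear more than once; the expectation is usually zero."""
    return f"""
        SELECT COUNT(*) FROM (
            SELECT {id_col} FROM {table}
            GROUP BY {id_col} HAVING COUNT(*) > 1
        ) dupes
    """

def null_fraction(table: str, col: str) -> str:
    """Fraction of rows where the column is NULL."""
    return f"""
        SELECT AVG(CASE WHEN {col} IS NULL THEN 1.0 ELSE 0.0 END) FROM {table}
    """

def invalid_zip_fraction(table: str, zip_col: str) -> str:
    """Fraction of rows whose value doesn't look like a 5-digit US ZIP code."""
    return f"""
        SELECT AVG(CASE WHEN {zip_col} !~ '^[0-9]{{5}}$' THEN 1.0 ELSE 0.0 END)
        FROM {table}
    """

# These queries would run on a schedule, and their results would be compared
# against per-column expectations, which differ for every dataset.
checks = {
    "users.user_id uniqueness": duplicate_id_count("users", "user_id"),
    "users.email null rate": null_fraction("users", "email"),
    "users.zip validity": invalid_zip_fraction("users", "zip_code"),
}
```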
[00:20:16] Unknown:
To your point of the sort of service level agreements that exist in SRE and some of the ways that it maps into data reliability engineering, there are other concepts too that are present in terms of service level objectives and service level indicators, and I'm wondering how those map to some of the ways that we can think about the reliability of our data platforms and the datasets that we're working with.
[00:20:40] Unknown:
The other two concepts that you just mentioned make up that trifecta of SRE building blocks. You have service level indicators, or SLIs, which are the actual measurements and numbers that you're tracking to measure the performance of your application. You then have SLOs, or service level objectives, which are the agreements that the application owner is trying to meet based on those SLIs that they are collecting. And then finally, you have SLAs, or service level agreements, which are the contract between the application owner and the user of the application.
Now I talked about SLAs in data reliability engineering because data itself is very bimodal, where, as I mentioned before, you have the producers and the consumers. And, fundamentally, every dataset has an owner, and there are then users of that dataset that care about the state of it. So SLAs are probably one of the most important concepts to take into data reliability engineering. If we break down the other concepts, SLIs, or those numbers that we're measuring, are really the output of data observability. We are trying to measure the state of the data using some metrics and some measurements that you can then create SLOs around, which are, to put it bluntly, the threshold beyond which something is incorrect.
Let's take an example here. If I have a dataset that has user information and I have a column for email addresses, I might expect that some users don't have an email address. Maybe users can sign up with a phone number instead. But, usually, there's around 5 to 6% of users that sign up with a phone number. Everyone else signs up with an email address. This means that the email address column would be null around 5 to 6% of the time on average. Now that 5 to 6% of nulls is the SLI. It's the indicator. It's the number that we are measuring the performance of this column on.
Now on top of that, we need to set an objective. What is the SLO for that percent of nulls? Now you might have a business expectation, which is that we would expect no more than 10% of users to sign up with a phone number. You might have an objective to say, we want to cancel phone number sign-ups and only make them available to a limited subset of users, so we want to drive that down, so let's have only 2% of nulls in this column. Or this might just be, we want to maintain the status quo. It should always be around 5 to 6% because that's normal for us. Now you would then take those two concepts, that threshold of, let's keep it around normal, so 5 to 6%, and that SLI, that constant measurement of your nulls, and say, whenever the percent of nulls in my email column goes over 5%, then that is a violation of this SLA, because we promised you, the user of our data, that this data would remain pretty consistent over time.
And so if our email sign-up system breaks and there are all of a sudden only phone subscriptions and 100% nulls in your email column, then you are violating your SLA, and your customers need to know about that. But the only way that you would know that is by measuring that percent of nulls and setting those thresholds and those bounds for what is an acceptable range for that SLA.
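As a rough illustration of that SLI/SLO/SLA flow, not of any particular product's implementation, a sketch using the email-column numbers from the example might look like this:

```python
# A minimal sketch of the SLI -> SLO -> SLA flow from the example above.
# Numbers mirror the email-column scenario; names are illustrative.
from dataclasses import dataclass

@dataclass
class NullRateSLO:
    column: str
    max_null_fraction: float   # the objective, e.g. "stay at or below 6% nulls"

def evaluate(slo: NullRateSLO, observed_null_fraction: float) -> bool:
    """Return True if the SLO is met; False is an SLA-relevant breach that
    should be surfaced to the consumers of this dataset."""
    return observed_null_fraction <= slo.max_null_fraction

email_slo = NullRateSLO(column="users.email", max_null_fraction=0.06)

# SLI: the measured null fraction for today, produced by the observability layer.
todays_null_fraction = 0.05          # ~5% of users signed up with a phone number
print(evaluate(email_slo, todays_null_fraction))   # True: within the objective

# If the email sign-up flow breaks and the column goes to 100% nulls,
# the check fails and the SLA violation should be communicated downstream.
print(evaluate(email_slo, 1.0))                    # False: SLA breach
```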
[00:24:42] Unknown:
Another concept that has been going around in the data ecosystem recently is the idea of DataOps and some of the ways that it does and does not map to the concept of DevOps from the application landscape. And I'm wondering what your thoughts are on some of the differences and distinctions between DRE versus DataOps and whether DRE is just a sort of rebranding of those same concepts, for whatever the purpose of that might be. I don't think that DRE is a rebranding of DataOps.
[00:25:17] Unknown:
I think that DRE is a subset of what DataOps is meant to provide. If you think about what the output of DRE is, it's the knowledge that your data is reliable and is fit for use. DataOps is a much broader notion of how you manage your datasets holistically. How do people know what's available? How do you manage access controls to your datasets and data products? What are the different teams that require access to it? In what formats do they require access? How do you manage the whole ecosystem? That is all part of data operations. Data reliability engineering gives you the necessary inputs to have a great data operations organization, because you can then say, the following datasets haven't been healthy for months, and we should probably deprecate them and people should stop using them. That's the goal of DataOps, to make that decision. But DRE gives you the inputs that you need in order to even make that decision in the first place.
So if you think about it as really layers in a pyramid, you have the foundational data observability layer, which is you need to know what's happening in your data ecosystem at all times. Then building on top of that, you have data reliability engineering where you say, we can take these inputs and actually create the processes and the workflows necessary in order to know whether our data is reliable and fit for use. And then you build on top of DRE, your data operations layer, and say, now that we know what the state of our data is, how do we make these decisions about what data should be used, where it's available, and who's using it for what? In terms of the actual practice of data reliability engineering,
[00:27:22] Unknown:
who is typically responsible for actually building out the practices, identifying the SLOs and SLAs, and actually implementing the technologies and techniques and processes that are necessary to be able to actually make this a part of the organizational fabric? This really depends on the size of the organization that we're talking about.
[00:27:44] Unknown:
So I've worked in companies that are large. I've worked in companies that are small. And here at Bigeye, we also have a wide range of customers, ranging from a one-person data engineering shop to Instacart, which has a 12-person central core data engineering team. And at smaller companies, we usually see DRE being distributed to the teams that are actually using the data. If you have a small central data engineering team, maybe it's one or two people that are managing the warehouse and the infrastructure and the ETL tooling, they typically don't have the time to actually do DRE all on their own, but they will usually provide the tools to the rest of the business in order to enable them to set up their own SLAs and help the whole business define their quality and DRE practices.
At a larger company, you get into a world where there's a large central core data engineering team that is able to say, we will provide DRE tooling to the whole business, but we will also take it upon ourselves to set the SLAs for all of the core datasets and all of the core measurements that we care about. So they will measure the basic concepts of freshness and row counts and distribution of your numerics, and they will track that across all of the fundamental datasets for the business. Now there could be some auxiliary datasets that are used by a separate team, like the growth team or the marketing team. They might have their own datasets that don't fall into the domain of the central data engineering team. And those would then be the responsibility of that individual team.
But this core central data engineering team would be responsible for DRE across all the main fundamental datasets.
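One hypothetical way a central team might express that split of responsibility is a simple declarative config; the dataset names, owners, and thresholds below are illustrative only:

```python
# A hedged sketch of how a central data engineering team might declare the
# baseline checks it owns for core datasets, while leaving team-specific
# datasets to the teams that own them.

CORE_DATASET_CHECKS = {
    "warehouse.orders": {
        "owner": "core-data-eng",
        "freshness_max_hours": 24,        # must update at least daily
        "min_daily_row_count": 10_000,    # expected load volume
        "numeric_distribution": ["order_total"],   # track drift on key numerics
    },
    "warehouse.users": {
        "owner": "core-data-eng",
        "freshness_max_hours": 24,
        "min_daily_row_count": 1_000,
        "numeric_distribution": ["lifetime_value"],
    },
    # Auxiliary datasets like marketing attribution sit outside the central
    # team's domain and are monitored by the teams that produce and use them.
    "warehouse.marketing_attribution": {
        "owner": "growth-team",
        "freshness_max_hours": 6,
        "min_daily_row_count": 500,
        "numeric_distribution": [],
    },
}
```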
[00:29:55] Unknown:
Digging more into some of the technical aspects of how to actually implement the measurement and alerting and correction of these different measurements that we're doing to ensure that the data that is under our control and under our responsibility is accurate and, you know, has sufficient quality based on the commitments that we've made to the downstream consumers. What are some of the tools that are necessary, you know, in terms of broad categories or even specific instances to be able to build out these practices? And what are some of the areas of innovation that teams are focusing on to be able to support those overall reliability practices?
[00:30:39] Unknown:
We can break down the problem of DRE into a couple of core concepts. Obviously, we've talked quite a bit about data observability, just knowing what is actually going on in your system as being the first step. Obviously, there's tools like Bigeye for this. Historically, a lot of teams have been fairly clever around custom built solutions for this, running SQL queries, emitting the results back into a table, into tools such as Datadog. People get very creative with these sorts of tools. Once you know what your data looks like with the data observability, the next step is then setting the expectations and communicating any issues that arise.
Setting the expectations has typically been the hardest part for monitoring and for DRE, because setting the SLOs for your datasets involves a deep understanding of what the business expects out of them. Data producers typically don't have that depth of knowledge. They can tell you the basics: we expect this dataset to come in daily or hourly, and this dataset usually has somewhere around 10,000 records a day. Past that, they don't usually understand what's inside the dataset and how it's being used. And so it's hard to set these expectations around what indicates an issue and what doesn't.
The largest area of innovation here is the ability for either the business to come in and say, this is my expectation of this data. I expect this column to have a specific format, to be in a specific range, to have only certain values in it. Or the second approach, which is much more scalable, is automating the setting of those expectations. If your data looks a certain way historically, it probably should look the same way going forward, especially if you're basing your machine learning models on it, which expect more or less consistency across their features over time.
Bigeye is actually doing a lot of work around understanding how we are able to set these expectations and these thresholds automatically for users based on the historical performance of the data and what it looks like, so that users don't get overwhelmed thinking about what their expectations of the data are. Finally, there's the communication piece, going back to the users and telling them that something is actually going wrong. Most tools provide a way for you to push notifications. Email has been the classic. Slack, obviously.
Some teams use PagerDuty or Opsgenie in order to actually wake somebody up if a pipeline is doing something wrong. I think the next step in the evolution of this notification is going to be something similar to a notification hub. If you look at something like Jira, every issue is a task in Jira, and it's assigned to you. And you can log in to Jira and see everything that's assigned to you, in priority order, and what actually needs to happen. Notifying users about issues in their data is moving in the same direction, where there are often just too many things going on at the same time to know what to respond to. And so you need a central notification hub for data issues that is prioritized and that has clear actions and next steps assigned to it, either by attaching runbooks or by seeing what happened last time, how this issue got resolved previously, and who did that.
That is going to be the next step in making DRE much more manageable within a company. And I haven't seen that yet, but I definitely think that that's the right next step for teams building out their DRE practice.
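A minimal sketch of the "learn the expectation from history" idea mentioned above, with an arbitrary rolling window and a three-sigma band rather than whatever Bigeye actually does, could look like this:

```python
# Auto-setting expectations from history: derive an acceptable range for a
# metric from its recent values rather than asking the business to hand-pick
# thresholds. The window size and 3-sigma band are illustrative choices.
from statistics import mean, stdev

def auto_thresholds(history: list[float], sigmas: float = 3.0) -> tuple[float, float]:
    """Return (lower, upper) bounds based on the historical mean and spread."""
    mu = mean(history)
    sd = stdev(history) if len(history) > 1 else 0.0
    return mu - sigmas * sd, mu + sigmas * sd

# e.g. daily row counts for the last two weeks (made-up numbers)
daily_row_counts = [9800, 10150, 9950, 10300, 10020, 9890, 10110,
                    10240, 9970, 10060, 10180, 9930, 10090, 10010]

low, high = auto_thresholds(daily_row_counts)
todays_count = 4200                       # a suspiciously small load
if not (low <= todays_count <= high):
    print(f"row count {todays_count} is outside the learned range [{low:.0f}, {high:.0f}]")
```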
[00:34:58] Unknown:
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data in motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand lets you identify data quality issues and their root causes from a single dashboard. With Databand, you'll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand today to sign up for a free 30-day trial and to take control of your data quality.
As far as organizations and teams that are looking to implement some of these DRE practices and, you know, maybe they already have some of the technical underpinnings, what are some of the potential roadblocks that they might run into as they're starting to plan out the organizational aspects and working with their customers to understand what their SLAs should be, and just some of the complexities that arise as a team is starting to adopt some of these principles and practices?
[00:36:11] Unknown:
I think complexities is the right word to use when talking about roadblocks to DRE. Because the biggest problem that companies face is that they underestimate the complexity of their data and their organization. As I mentioned before, applications are pretty straightforward. It's easy to measure the basics, and all applications have the same basics. What teams don't understand is that every single piece of data is different and has its own complexity and its own use and its own intricacies that only a few people in an organization might know about. And not only that, if you look at the same dataset, different teams would actually care about different aspects of that dataset.
You might have one team caring about how many users are subscribing to your service week over week, whereas another one cares about how long they take to convert. And that might be represented in the same dataset itself. And so not having that understanding of the complexity usually leads teams to build something that's fairly generic but naive and is only measuring the basics of reliability without actually going in-depth enough to be useful to the rest of the organization. Second, if we're talking about smaller companies, they're usually much more concerned about moving faster as an organization, and doing that by growing the number of datasets that they have available and what kinds of models they are training off of these datasets and what business value they are providing.
This definitely drives the business forward. But if the data itself isn't reliable, then having more data in your warehouse isn't actually going to make the business any better. It's just really dumping more trash into a pile that is already not very useful. And these teams are usually slower to adopt DRE practices because they think that they will slow them down. However, the important thing to understand is that knowing the state of your data and knowing that it's reliable and actually usable will move the business forward faster, because there will be less time spent by people asking about the state of their data, whether it's healthy, and second-guessing themselves when building their reports and their models. And this will drive the business forward faster than just adding more and more datasets, if you take that time to slow down and implement these processes and tools to ensure the reliability of all of your data.
[00:39:06] Unknown:
As teams are adopting some of these DRE practices, what are some of the sort of conceptual gaps that they might be running into if they're not already familiar with SRE principles or they don't have a dedicated SRE team in their organization and just some of the sort of learning barriers and the sort of, like, ramp up process of being able to build out these capacities internally?
[00:39:32] Unknown:
The interesting thing to understand about DRE is that it almost becomes a responsibility of anybody using the data. SRE has turned into its own branch of engineering. You have engineers that are called site reliability engineers. And they don't write software. They just operate and manage the tools and processes in order to ensure the SRE best practices. On the flip side, DRE is everybody's problem. If you're using a dataset, if you're producing a dataset, you should care about its quality. And so I think that forcing a single person or a single group to ensure data reliability across your whole organization is a massive pitfall, and it becomes a lot harder to actually implement a good DRE practice in your organization if you're trying to approach it that way. I think the second pitfall that people run into is not thinking about the consumer of the data.
At the end of the day, measuring reliability is useful only if you are then telling the people that are using the data about what the state of that data is. If you're measuring it and then cleaning it up quietly and then saying, oh, yes, there were those four hours of downtime, but we caught that and we documented it, and we know how to resolve it next time. If you're not telling the users of the data that there was that downtime, if they're not aware that the data was bad for that period of time, then they might be making some business decisions that are incorrect now because they just didn't know about the state of the data. And so the important part about DRE is communication.
And it's not about ego, and it's not about always having the answers and having the cleanest dataset at all times. It's just about being explicitly transparent about what is the state of the world right now, what is usable and what's not, and making sure that the rest of the business knows that as well. Digging more into that communication aspect and being able to surface the information
[00:41:46] Unknown:
to the end users of the data, what are some of the sort of useful protocols that already exist? What are some of the areas of improvement that are necessary in the overall ecosystem of data tooling and particularly for, you know, dashboarding tools or tools that are being used by the data consumers that we still need to develop to allow for broadcasting this information to the people who are using it at the point that they're accessing the data? At the point that they're accessing the data is the most important part of this communication.
[00:42:20] Unknown:
This is the biggest problem with sending emails and Slack messages. They get ignored, and they're not relevant at that point in time. I might see a Slack message two hours later that says, hey, by the way, that users table was broken. I might say, great, well, I already finished all my work on it, so now I have to go redo all my work again. The improvements that I would want to see in this communication space are putting this information where the end users are accessing the data itself, whether that be a data catalog, if somebody's browsing through the catalog trying to find what report they should use. In the catalog, you should be able to surface this information and say, you shouldn't use this report right now because there's an underlying table that has a quality issue. Or even better, if you have ad hoc querying, which most teams do, and you're starting to write your query, in an ideal scenario, what I would want is my IDE to just tell me that the query that I'm writing right now is not going to execute correctly because the table that I'm referencing is having an issue right now. And so I think there are a lot of interesting ways that we can surface reliability information at the point of consumption that just haven't really been invented yet. Data catalogs are getting there, and integrating tools such as Bigeye with your data catalog is important in order to surface this information as close as possible to the user. But I think there's a lot of interesting innovation that can happen around pushing it even closer to the user, surfacing this in the query editor or in the dashboard somehow, to say, here's a big stop sign on this panel in this dashboard because the underlying data is bad. You shouldn't even be looking at this. Otherwise, you're gonna make some bad decisions.
That is really the next step for communicating
[00:44:21] Unknown:
to the end user. To that end, there's been a lot of work recently in the overall space of lineage tracking and being able to automate the generation of lineage and understanding, at the various points, that a pipeline stage failed and that that contributes to the downstream lineage, and being able to propagate that information both upstream and downstream across the overall lineage graph. And I'm wondering if there are any recent developments that you've seen in that space that have been focusing on being able to propagate that information to the data catalogs, to the BI dashboard, to the data warehouse consumers and the queries that are being executed there, and some of the additional work that's necessary. I know that OpenLineage is one of the efforts underway there, and I know that there are other approaches that are being taken.
[00:45:12] Unknown:
I see lineage as a building block for all of this functionality to exist. But lineage alone doesn't solve this problem. Lineage is necessary to understand what the whole data landscape looks like, what datasets depend on which other ones, what data products depend on which other datasets. Once you have that lineage graph, then it becomes easy to say, if dataset A fails, I know everything that's downstream of dataset A, and I can start notifying there. The problem with using lineage alone is that you're not gonna have your users staring at this lineage graph 24/7 to understand what is going wrong with their data and whether or not they can use a table. Users are just gonna go straight to the tool that they want to use, whether that's the catalog or the BI tool. And using lineage in order to propagate these data quality issues and convey the information about the reliability of data products is then the onus of the downstream tool. Let's talk about data catalogs and BI tools for a second.
If you have a data catalog that is already capturing your lineage, that data catalog can use that lineage information in order to surface reliability information downstream of wherever any issue has happened, on the catalog page itself. If you look at a BI tool, the BI tool can be doing the same thing. If the BI tool already has the reliability information and the lineage information, it can combine the two and say, I know that there is a problem on my users table, and I know that this dashboard uses the users table, so that means this dashboard has an issue.
Knowing the lineage graph alone and annotating the lineage graph alone isn't going to help you. This information needs to be combined and presented at the final usage point for the data consumer, whether that is a catalog, a BI tool, ML modeling platform, whatever it is.
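To illustrate the mechanics, here is a small hypothetical sketch of walking a lineage graph to find every downstream asset that a catalog or BI tool could badge when an upstream dataset has an issue; the graph and dataset names are made up:

```python
# Use a lineage graph to flag everything downstream of a broken dataset, so a
# catalog or BI tool can warn consumers at the point of use.
from collections import deque

# dataset -> datasets/dashboards that read directly from it
LINEAGE = {
    "raw.users": ["warehouse.users"],
    "warehouse.users": ["dashboards.signups", "ml.churn_features"],
    "warehouse.orders": ["dashboards.revenue"],
}

def downstream_of(dataset: str, lineage: dict[str, list[str]]) -> set[str]:
    """Breadth-first walk to find every asset affected by an upstream issue."""
    affected, queue = set(), deque([dataset])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

# If raw.users has a quality issue, a catalog or BI tool could badge these:
print(downstream_of("raw.users", LINEAGE))
# contains: warehouse.users, dashboards.signups, ml.churn_features
```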
[00:47:26] Unknown:
And so in terms of your experience of building Bigeye and working with your customers and just speaking with other practitioners in the space and from your own history as a data engineer, what are some of the most interesting or innovative or unexpected approaches and solutions that you have seen to implementing DRE strategies and technologies?
[00:47:49] Unknown:
Usually, a lot of the innovative solutions come out of necessity and depend on the actual ecosystem of the organization that's implementing them. If I look back at my time at Uber, a lot of the tooling that we built took into account how our data team was structured, how the organization was using the data, what infrastructure were we using. We were big Presto users, and so partitions were a big thing. And so a lot of our notifications and tooling was built around the notion of having daily partitions being updated. I think a lot of the interesting innovation that comes out of companies is interesting because it's so hyper specific and solves a very, very niche problem because that would be the problem for that business specifically.
At Bigeye, we are building a platform that enables any user at any size company to start doing data observability and building out their DRE practice within the organization. But we're doing that by providing a platform that anybody can leverage regardless of what their infrastructure looks like and regardless of what their data looks like, and then start building on top of it, rather than being extremely innovative for a very
[00:49:14] Unknown:
hyper specific problem that somebody might be facing. In your own experience of working in the space and building a product that serves as a component in this overall ecosystem of data reliability engineering, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:49:31] Unknown:
The most interesting thing to me is that everybody starts by thinking about how to prevent bad data from getting into their systems. Usually, the first question that we get is, how do we implement circuit breakers? If you look at the Intuit post, Intuit talks about circuit breakers, where when data goes bad, they stop the whole pipeline so no bad data propagates downstream. I think the interesting thing about that is that nobody really thinks about the processes or the resolution after the fact. They want to stop bad data from getting in, but what if that issue wasn't actually critical?
What if the process should just continue? That is just a little bit weird, and we can work around it, or we should just notify somebody. Maybe there's too many test events, and we don't expect that many test events, but we ran a large load test on our system. And we're already filtering out all the test events. Should we stop the pipeline? Probably not. But we should let the users downstream know that this is actually happening. And so it's important to think about the process that will allow you to know that something's going wrong and the steps that you can take towards resolution rather than simply saying, no. Stop the world right now because this data is bad. And for teams that are considering implementing data reliability
[00:51:02] Unknown:
as a technical and organizational strategy, what are some of the cases where that might be the wrong choice and they might be better suited just going with a sort of less sophisticated approach or just using the existing mitigation strategies that they have for observing and addressing data quality problems?
[00:51:19] Unknown:
There is an argument to be made that data is just vastly different from software, and we can't take the same principles of SRE and DevOps and just translate them one to one. And I would agree with that, but I think it's important to use the principles of SRE as a blueprint. They're a great foundation for something that has worked for decades now for software. And we, as data users and data people, should use those existing, well-known, well-thought-out principles as a foundation for how to have reliable systems.
Now we don't always need the whole shebang. You don't need a full-on workflow that is automating all the resolution and performing actions in your system and, going back to circuit breakers, doing circuit breakers everywhere. But taking the fundamental principles and applying them as you see fit in your organization lets you take those initial first steps. Start monitoring your data. Start building out the observability systems. Create these contracts with your stakeholders. Create these SLAs. As you start building out these best practices and start implementing some of them, you might realize that there's a point in your organization where you've done enough, and you've solved the 99% problem, and everything else is a one-off use case that you can resolve ad hoc.
But you need to start using these principles and implementing them in order to have any sort of reliability in your organization.
[00:53:11] Unknown:
As you continue to build out the Bigeye product and look to the near to medium term future of its capabilities and the ways that you interact with your customers and market it to newcomers, what are some of the things that you have planned, particularly as it pertains to the space of data reliability engineering?
[00:53:29] Unknown:
I'm glad you asked. We're actively scaling out our platform. We actually just raised $45,000,000 in a Series B, and a lot of that is going to go towards building out more functionality around data reliability engineering within the platform. We already have the basics of observability, and we want to start creating the rest of that stack of tooling that's necessary to have a great DRE practice within an organization, including SLAs and incident management and runbooks. DRE is also about having all of these tools work together, and something else that we're looking into is broadening our integrations so that DRE is more accessible to more organizations.
And even if you already have tools in place that solve parts of the DRE problem, you can still use Bigeye with all of your existing tooling, and it'll work flawlessly.
[00:54:34] Unknown:
Are there any other aspects of the data reliability engineering ecosystem and practices and principles and some of the sort of technological and social requirements that go into it that we didn't discuss yet that you'd like to cover before we close out the show? I think the most important thing to
[00:54:51] Unknown:
take away from this is that the tooling can only get you so far, and DRE really is about a mindset and about having the best practices in your organization. And starting early and being an advocate for the reliability of your data and measuring your data quality and actually communicating it to the broader organization. That can start at any point in a company's life cycle. And the sooner that companies get there and data teams start thinking about reliability,
[00:55:19] Unknown:
the better off they will be in the long term. Well, for anybody who wants to follow along with you and get in touch and keep up to date with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think that
[00:55:38] Unknown:
data infrastructure is extremely mature now. As I mentioned before, between Snowflake and BigQuery and Redshift on the warehouse side and Fivetran on the ETL side, it's very easy to get data infrastructure set up. And I think the biggest gap is really the whole DataOps landscape. And I'm not just talking about DRE, but I really am talking about that broader landscape of how do you know what data you have, who's accessing it, when, for what. All of this is difficult to measure today, and I think there's gonna be a lot of innovation in the tooling around DataOps. And to piggyback on that, I think there are a lot of really interesting projects out there that are sometimes solving multiple different problems at the same time.
And the best way to address a lot of these issues is actually with best-of-class tools that are really easy to set up and use and that integrate well together, rather than having one giant monolith that attempts to solve everything but doesn't do any one thing exceptionally well.
[00:56:49] Unknown:
Alright. Well, thank you again for taking the time today to join me and share the work that you're doing and sharing your perspective on the overall data reliability engineering space and some of the ways that it is emerging and growing. It's definitely very exciting to see some of these practices and principles start to spill over into the data ecosystem. So I definitely look forward to seeing more of that, and I appreciate your contributions to the space. And I hope you enjoy the rest of your day. Thanks for having me on, Tobias. It was a pleasure. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Welcome
Interview with Egor Gryaznov: Data Reliability Engineering
Defining Data Reliability Engineering
Technical and Organizational Aspects of Data Reliability
Parallels and Differences Between SRE and DRE
SLIs, SLOs, and SLAs in Data Reliability
DataOps vs. Data Reliability Engineering
Implementing DRE Practices
Tools and Innovations in DRE
Roadblocks and Complexities in DRE Adoption
Conceptual Gaps and Learning Barriers
Lineage Tracking and Data Reliability
Innovative Approaches to DRE
Lessons Learned in Building Bigeye
When DRE Might Not Be the Right Choice
Future Plans for Bigeye
Final Thoughts on DRE
Closing Remarks and Contact Information