Summary
Data assets and the pipelines that create them have become critical production infrastructure for companies. This brings with it a requirement for reliability and uptime management similar to what is expected of application infrastructure. In this episode Francisco Alberini and Mei Tao share their insights on what incident management looks like for data platforms and the teams that support them.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world’s first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
- Your host is Tobias Macey and today I’m interviewing Francisco Alberini and Mei Tao about patterns and practices for incident management in data teams
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing some of the ways that an "incident" can manifest in a data system?
- At a high level, what are the steps and participants required to bring an incident to resolution?
- The principle of incident management is familiar to application/site reliability teams. What is the current state of the art/adoption for these practices among data teams?
- What are the signals that teams should be monitoring to identify and alert on potential incidents?
- Alerting is a subjective and nuanced practice, regardless of the context. What are some useful practices that you have seen and enacted to reduce alert fatigue and provide useful context in the alerts that do get sent?
- Another aspect of this problem is the proper routing of alerts to ensure that the right person sees and acts on it. How have you seen teams deal with the challenge of delivering alerts to the right people?
- When there is an active incident, what are the steps that you commonly see data teams take to understand the cause and scope of the issue?
- How can teams augment their systems to make incidents faster to resolve?
- What are the most interesting, innovative, or unexpected ways that you have seen teams approach incident response?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on incident management strategies?
- What are the aspects of incident management for data teams that are still missing?
Contact Info
- Mei
- Francisco
- @falberini on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Monte Carlo
- Segment
- Segment Protocols
- Redshift
- Airflow
- dbt
- The Goal by Eliyahu Goldratt
- Data Mesh
- PagerDuty
- OpsGenie
- Grafana
- Prometheus
- Sentry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer friendly data catalog for the modern data stack. Open source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy to consume SaaS product, which has been adopted by several companies. Sign up for the SaaS today at dataengineeringpodcast.com/acryl. That's A-C-R-Y-L.
[00:01:26] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Francisco Alberini and Mei Tao about patterns and practices for incident management in data teams. So Francisco, can you start by introducing yourself? Hi. Thanks so much, Tobias, for having us here today. So I joined Monte Carlo last year. I was our first product hire here on the team. Prior to that, spent 5 and a half years at Segment where I started on the success side and spent a couple years as a PM focused on Protocols, which was the data quality management tool for Segment. And, yeah, I'm just thrilled to be able to be here today to talk about the things we've learned from our conversations over the last year with data engineers.
[00:02:03] Unknown:
And, Mei, how about yourself? Hi. My name is Mei. I am the 2nd product hire at Monte Carlo. Joined the company in August last year. I actually have a statistics background. My first job in life was writing Python code for the Berkeley stats department for their research projects. Deviated a little bit into, you know, banking and investing before I found my way back to being a data platform manager at a logistics tech startup right before Monte Carlo. That's where I experienced a whole series of pains of data engineering teams and was exploring potential solutions and discovered Monte Carlo, and the rest is history.
[00:02:41] Unknown:
Going back to you, Francisco, do you remember how you got involved in working in the area of data? Yeah. So a lot of it started when I joined Segment, actually.
[00:02:49] Unknown:
1 of the first projects that I was, I guess, assigned as a success engineer was a huge migration across multiple Redshift clusters for our customers. So we basically had the solution where we would stand up Redshift clusters for our customers and basically host the data on their behalf. And we did a huge migration multiple times for our customers across different Redshift clusters and then ultimately to a self hosted version. So that was 1 of my first introductions to the deep world of data. A lot of, you know, dual screening, looking, you know, row by row type of spot checking of data. And then as I spent more time with Segment, a lot more interaction with customers and how they use data, how they actually set up the data collection piece and making sure that that was of high quality, and then, of course, down to the warehouse level. So, yeah, that was my introduction.
[00:03:34] Unknown:
And, Mei, you mentioned a little bit about how you first got started working in the space of data. I'm curious what it is about this problem domain that keeps you interested.
[00:03:43] Unknown:
So at my last company, I actually went through the process of, you know, moving our platform from a batch-processing data stack into real time streaming because we were providing logistics solutions to these big shippers that are really dependent on our timely updates of the status of shipments. Gone through the whole process of moving to real time streaming and also went through a lot of migrations. And I realized my data engineering teams were actually spending a lot of time firefighting and responding to the requests from the execs team, from the BI team, from the operations team. They weren't really able to focus a lot of their productive time on actually building, you know, our real time streaming platform, building our price experimentation platform. Started looking into, you know, what are some of the solutions that we can actually deploy and found these, like, emerging trends of ETL tools, reverse ETL tools, observability solutions, and found it fascinating and eventually jumped to, you know, the vendor side
[00:04:46] Unknown:
and devoting my life to building some of these solutions. You mentioned that you've each had some experience working with customers and also, I'm sure, being on the operator side of these different types of data incidents. And I'm wondering if you can just start by describing some of the ways that an incident in the context of a data system might manifest and possibly some of the differences to incident management that folks from the application development side might be familiar with?
[00:05:14] Unknown:
Yeah. I will say the main difference is that in addition to, you know, the common causes in the software engineering and application engineering world, where you have issues that emerge from code changes as well as operational environment issues, in the data world, you have this additional source of quality problems coming from just data issues themselves. So that creates more complexity and more trickiness around resolving some of these, like, data issues. So to drill a little bit deeper into that, operational issues that we see in the data flow, a lot of it is, like, a pipeline failure, like an Airflow DAG that was failing to run because of permission issues, or a dbt job failure, or any of these, perhaps, like a Google Cloud function pipeline that your team set up to run some of these pipelines.
Any hiccups in that result in downstream tables not populating. So that's on the operational side. It could also be code issues that occur that have to do with the query code itself. So maybe in your dbt model, you added a limit 500, which resulted in the number of rows downstream being significantly reduced. The last area is the data issues, which are the trickiest, because it could either be a data quality problem where, you know, there's a high percentage of nulls in your values, or it could be, you know, actual changes in your business environment. So, like, a really common 1 that we see is schema changes. If you change a column type from string to integer, that can break some code that generates, you know, the downstream tables. You also get 1 time events, like perhaps your marketing team is running a marketing campaign, which results in a lot of changes in traffic data and distribution of channels data, so a lot of detectors fire off and the distribution of your downstream table completely shifts.
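To make those 3 buckets a bit more concrete, here is a minimal Python sketch (not Monte Carlo's implementation) of the kind of checks an automated detector might run against a single table: a volume drop like the accidental limit 500, and a null-rate spike. The table history, column name, and thresholds are all hypothetical.

```python
from datetime import date, timedelta

# Hypothetical daily metrics for one table: row counts and null rates per
# load date (in practice these would come from information_schema queries
# or query logs rather than a hard-coded dict).
history = {
    date(2022, 3, 1): {"row_count": 98_000, "null_rate_email": 0.01},
    date(2022, 3, 2): {"row_count": 101_500, "null_rate_email": 0.02},
    date(2022, 3, 3): {"row_count": 500, "null_rate_email": 0.40},  # today
}

def detect_anomalies(history, today, lookback_days=7,
                     drop_threshold=0.5, null_threshold=0.10):
    """Flag volume drops (e.g. an accidental LIMIT 500 in a dbt model)
    and null-rate spikes (a classic data quality issue)."""
    past = [m for d, m in history.items()
            if today - timedelta(days=lookback_days) <= d < today]
    if not past:
        return []
    baseline_rows = sum(m["row_count"] for m in past) / len(past)
    current = history[today]
    alerts = []
    if current["row_count"] < baseline_rows * drop_threshold:
        alerts.append(f"row_count dropped to {current['row_count']} vs ~{int(baseline_rows)} baseline")
    if current["null_rate_email"] > null_threshold:
        alerts.append(f"null rate on email is {current['null_rate_email']:.0%}")
    return alerts

print(detect_anomalies(history, date(2022, 3, 3)))
```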
So these are some of the most common issues that we've seen that result in data downtime. Do you have anything to add, Francisco?
[00:07:18] Unknown:
No. Yeah. I think that covers it. Just those 3 buckets we found in our conversations really account for the majority of what we call incidents or data incidents, And we really built our platform around those 3 buckets in a way to understand, like, what kind of an incident this is and then how do I very quickly address it. But I think we'll get more into that as we go through. Yeah. And to the point of resolution
[00:07:39] Unknown:
at a high level before we get into more of the details later, what are kind of the various steps involved or the stages of an incident and the different participants that are necessary to be able to bring an incident from I have a problem to I have resolved it? So I think that the first 1 is actually getting notified about the incident, and there are many ways to get notified about an incident.
[00:08:00] Unknown:
The most common 1 that we've heard is you hear about it from a downstream consumer that says, this number looks weird. Why? That's usually not the best way that you wanna get notified about an incident. So a lot of what we've been seeing in the market is just like this idea of automated anomaly detection, which takes the idea of getting notified about something by an actual consumer and replaces it with getting notified by a system that's looking for actual issues. So step 1 is get notified. And then I think from there, the next step that we like to really focus on is actually understanding the impact of an incident. So is this really bad? Should I, you know, get out of bed right now to go fix this incident, or can I wait until tomorrow, or does no 1 care? And then in that case, I can just kind of leave it alone and go about doing more impactful work. So I think understanding in terms of the stakeholders there is who are the downstream consumers of this data. Is there anyone consuming the data? And if so, then making a judgment call about the impact.
After the impact, understanding the root cause. So let's say you do say, yeah. This is actually a real issue. I'm gonna go notify the people that are gonna be impacted by this. I next wanna actually investigate the root cause. So that's kind of the 3rd phase. To do that, there's lots of things you can do. A lot of times what we hear from customers is they then just go to all the different systems where something could have broken. So they go to Airflow, check the logs, they start querying the warehouse to understand the actual more detailed version of what is actually happening in the data. Are there any type of correlations across other fields? Maybe it's a marketing campaign that ran, which caused this issue. Maybe it's something else. And then, of course, understanding upstream. Are there any issues? Like, was there a recent schema change that occurred? So we can look at something like lineage for that. After we've come to the point of understanding our root cause, we try and resolve it. We may need to rerun some tasks, backfill data, update code if we need to, whatever it may be, and then, after that, notify downstream users that this issue has been resolved.
The last piece, which is often forgotten, is the idea of documenting the postmortem of what actually happened so that the next time we can avoid it. Those are the general kind of high level phases that we see resolution or incident resolution going through. From a
[00:10:02] Unknown:
sort of broad level and from your perspective as a vendor that is working in this space of being able to help teams identify when they have an issue, what are you seeing as the general rate of adoption of some of these incident management practices among data teams as compared to application teams, maybe even within the same organization?
[00:10:25] Unknown:
It really varies, which I think is probably the obvious answer. You know, I was giving it some thought before, and I think I could kind of summarize it into, like, 3 real core types of teams that we're seeing in terms of how customers are adopting these different principles and kind of how much they're investing, and I've come up with some fancy names for them. But the first 1 is the idea of, like, the goal. I don't know if you ever read Eliyahu Goldratt's book. It's called The Goal, and it's all about operational efficiency. So it's, like, defining these really, really crisp processes so that you're continually improving the efficiency of the system overall. And we have a bunch of customers that fall into that. They're just so heavily invested. They have distributed teams focused on different parts of the stack, thinking about how data mesh fits on top of that. 1 of our customers at Red Ventures, his name is Brandon Biedel, and he's kind of like the perfect example of this. We just released a blog post with him. But he's just done so much to define processes and get everyone aligned around what are the SLAs.
You know, who takes ownership of an incident? Who should be the responsible party? How do we make sure that that issue is addressed and that it's communicated to the organization? All of those questions. So that's the first 1. And the second 1 is what I call the boss mode team, and these are usually smaller data teams, and they're less focused on process because, for the most part, any person on that team probably has the entire data pipeline in their head. And I was actually talking to a customer recently, and I was telling him, or asking him, like, how do you think about incident resolution? And he basically said, like, listen. I've been here for 4 years. Like, I could probably tell you, when I see an incident, what line of code is, like, causing that. What is actually causing that particular incident from happening? So, like, for me, it's very easy to then go and resolve the issue. Like, I know exactly what to do. And not that this person was, you know, super full of themselves. Like, it was a humble representation of reality and that, like, I don't need to spend a lot of time investing in process because I can fix things really quickly. What's more important is that I get notified of an issue very quickly. Like, that's what I want. I want to get notified as soon as something happens so that I can go fix it. And then the last 1 is more of, like, these teams that have a lot of breadth and not a ton of resourcing. So by breadth, I mean, like, in the hundreds of thousands of tables that they're trying to manage and think about how do we improve the quality of this system. So they could develop a lot of process, but the reality is that's gonna require a lot of resourcing. So they'd need to have a ton of team members and to formalize that process. So the kind of slight variation there is they're thinking more about how do we kind of reduce the complexity of what we've built here or what's been built over time. All these tables that no one's using, how do I remove them? So that kind of reduces the scope of what we're responsible for. How do I quantify the complexity of our pipelines today so that I can present that to our leadership and say, how do we reduce this complexity? And we have a bunch of customers that are thinking about it that way. Hopefully, that's a helpful representation of the kinds of teams and how we're seeing some of the trends there. Definitely.
[00:13:20] Unknown:
And in terms of the organizations that do have more process in place where you don't necessarily have 1 or a few people on the team who know everything about everything in the data platform because of the sort of relatively small scope or the relatively long tenure of the people, When you do have to start introducing more process and practices and formalities around incident management, what are some of the sort of organizational and technical challenges that those teams have to deal with as they start to define and implement those practices?
[00:13:56] Unknown:
Actually, 1 example that comes to mind is how do you structure the team in a way that it's clear who's responsible for what, so when incidents happen, there's clear ownership of who should go and fix it. So 1 of our customers that is an online marketplace company, they have this data mesh architecture, and they structure their team around domains. So they have a team that's responsible for the marketing data, a team responsible for finance data, and another team responsible for their customer service data. And they have a data engineer and a data analyst each responsible for each area. So when there are data issues that happen, they know that if it's, like, a freshness or volume type of issue, it's the data engineer that's responsible for looking into it. If there is a field distribution type of issue, then it's the data analyst's job to take ownership over that issue. So that's, like, kind of how they structure their organizational structure to respond to the increasing number of data issues that are happening and streamline their workflow.
[00:15:02] Unknown:
Are you looking for a structured and battle tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all of your questions? Join Pipeline Academy, the world's first data engineering boot camp. Learn in small groups with like minded professionals for 9 weeks part time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus, they have ask-me-anythings with world class guest speakers every week. The next cohort starts in April of 2022.
Visit dataengineeringpodcast.com/academy today and apply now. And then 1 of the perennial challenges of any sort of alerting or incident identification and management is, 1, the problem of alert fatigue and making sure you don't have too many signals or too much information that drowns out the signal. And then the other question is just what are the actual signals that you need to be worried about, or what are the sources of information that are going to carry the most weight for you to be able to identify that I actually have a problem here versus just here's some additional information and context that maybe is useful when you're digging into the problem, but isn't going to tell you whether or not you have an actual issue. And so what are some of the ways that data teams need to be thinking about at what points of their system they need to introduce these monitors and the methods that they use for being able to send and notify on those pieces of information?
[00:16:41] Unknown:
I kind of think of it as the leading signals and the lagging signals. You really don't want to have any lagging signals, which are usually the downstream data consumers, you know, seeing a broken report and coming to you and being like, can you fix it? Or the data products that you have are experiencing data quality issues and resurfacing back to the data engineering teams. Also already lagging signals are if your data team's already spending, you know, 20, 30% of their time, similar to my old data team, trying to firefight these data issues instead of being able to spend the time on revenue generating features, like improving your experimentation platform and improving your streaming platform.
And those are probably already late signs that you should be investing in, you know, building some monitoring system and getting real time alerts. And some of the earlier signs could be your dataset size, the number of tables you have already growing over a certain size; more than 50 tables is already a significant size for you to warrant some kind of monitoring on top. Or the complexity of your data stack, the number of data sources that you have, the variety of data sources you have, how often your data sources are changing, or the number of data pipelines that you have set up feeding your tables.
And then the last 1 is somewhat more obvious: the size of your data engineering team. Initially, when you probably have, like, 1 or 2 people, you can coordinate among yourselves and have clear ownership of who's monitoring what. When you grow to a little bit bigger than that, if you have, like, 5 people in your data engineering team, that's usually a sign of the team growing large enough that you need more automation and systems in place to ensure data quality. And on the other side, the consumer side, if the number of your data consumers, which can include anyone from the BI team to your, you know, financial analysts, marketing analysts, and exec team, is growing over time, that's also a sign of when you should start building the initial building blocks for this monitoring system.
[00:18:49] Unknown:
In terms of the actual alerting aspect of it, it's definitely very sort of subjective and based on the context and capabilities of the team and can be nuanced depending on those leading and lagging indicators and sort of what are the highest value signals that you're capturing. And I'm curious what you see as some of the useful practices for being able to introduce alerting to a data team or an organization that has some level of data capacity and being able to build feedback loops to understand which alerts are useful, how to tune those alerts, understanding when you actually just need to eliminate a source of information altogether so that you don't overwhelm the people who are responsible for being able to actually find and fix these high impact issues?
[00:19:42] Unknown:
Yeah. That's a great question. Alert fatigue is actually 1 of the biggest problems that we've seen with data engineering teams. They're really overwhelmed with the large number of issues that are emerging. And oftentimes, they're like, what do I do with, you know, this huge number of alerts? So a few things that could help. 1 is bringing these issues to where users already are. So that can mean, you know, setting up Slack channels. So you get it within the workspace that you're actively participating in. And then 2 is some kind of categorization for those data incidents, whether it's, like, you know, incidents that are all related to a certain set of tables or incidents that are of a certain nature. They're all, you know, pipeline delays, or they're all, you know, field distribution issues. Categorizing them together so you have some sense of high level priority, to know when you come to this list of issues that they're all the same type and can have similar root causes.
You can also send the similar category of incidents to the same channel to help you prioritize and organize the workflow for these incidents. The third would be assigning some kind of priority to help you tackle certain issues first and then being able to, you know, sometimes ignore some of the other issues. So 1 thing that we've seen a lot of our customers do is marking the key tables that they have. If this is, like, a very high stakes dataset, we tag that as a key table. And whenever there's an issue that surfaces with this table, we know that it takes priority over the others. And for some of these, like, higher priority incidents, you can also set reminders on these incidents. So if I miss it 1 time, I get another reminder until, you know, I fix it and update the status so it doesn't show up anymore.
And for some of the lower priority incidents that are less time sensitive or lower stakes, you can set up a notification in more of a digest format. So if there's a lot of schema changes that happen with my tables, and I don't necessarily care about every 1 of them, I just want to be aware of it at some point, then I can set up maybe, like, a daily or weekly digest. So I review the same type of issues just once, and you don't have to invest too much time into it. Yeah. So those are some of the best practices that we've seen our customers and a lot of the data engineering teams apply, helping improve the ability to prioritize workflow and reduce alert fatigue.
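As a rough illustration of the categorize, prioritize, and digest ideas described here, below is a small Python sketch of an alert router. The incident shape, channel names, and the key_table tag are invented for the example; a real setup would hang this off whatever the monitoring tool actually emits.

```python
from collections import defaultdict

# Hypothetical incidents as produced by a monitoring system.
incidents = [
    {"table": "analytics.orders", "type": "freshness", "key_table": True},
    {"table": "staging.events_raw", "type": "schema_change", "key_table": False},
    {"table": "analytics.revenue", "type": "field_distribution", "key_table": True},
    {"table": "staging.clickstream", "type": "schema_change", "key_table": False},
]

def route(incident):
    """Key tables go to a dedicated channel immediately; low-stakes
    categories (like schema changes) are batched into a daily digest."""
    if incident["key_table"]:
        return "#data-incidents-critical"      # real-time alert
    if incident["type"] == "schema_change":
        return "daily-digest"                  # reviewed once a day
    return "#data-incidents"                   # default channel

routed = defaultdict(list)
for inc in incidents:
    routed[route(inc)].append(inc)

for destination, items in routed.items():
    print(destination, [i["table"] for i in items])
```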
[00:22:18] Unknown:
Yeah. I would add, I think, that this idea of getting the right message about the right incident to the right person in the right place is, like, 1 of the things that our customers invest a lot of time in. It's a real kind of force multiplier, if you will. If you don't get that right, then the whole system effectively becomes useless. And I think, as kind of just like a rule of thumb, what we've seen work well is, like, if a customer, or an individual user, is getting around 5 incidents a day, that's manageable. But once you get into 15, 20, 30 incidents in a day, it becomes effectively not manageable. And in the words of 1 of our success team members, like, we would basically annihilate the customer's ability to actually address the issues. So we are really careful about that. We spend a lot of time thinking, how do we make sure our customers are able to manage the workload of incidents that are being generated, that they are useful, that they are actionable, and that we're not inundating them with things that they can't handle.
[00:23:10] Unknown:
Some rules of thumb there; the 5 incidents a day is a really good 1 that we found to work really well. Beyond just being able to say, this is a problem. This needs to be addressed. There's also the question of routing those notifications to the appropriate user at the appropriate time. So if it's an issue that is high impact, but, you know, the batch job isn't gonna run for another 24 hours, you don't necessarily need to page somebody at 3 in the morning. Or, you know, if the incident happens and, you know, it's something deep in the code of the data ingest pipeline, you don't necessarily wanna send that notification to a, you know, BI engineer who's just writing the dashboard. So I'm curious how you've seen teams approach that question of alert routing, you know, paging and notification, and just how you see them handling that kind of on call rotation and alert delivery question.
[00:24:03] Unknown:
Yeah. I can touch on that 1. I'm curious what you think, Mei, too. Like, the way that we structured it is we try and give customers as many levers and kind of slices on the incident feed as possible so that they can align with their internal organizational structure. So it's like, if they do have an on call rotation, they create a Slack channel where that on call person is responsible for the incidents that are flowing through that particular channel. And then we can do a lot of sub filtering on top of that. So I think Mei mentioned the idea of filtering on a group of tables. So these, you know, 150 tables, super high impact. I want anything that happens here to go to this particular Slack channel, and that's gonna be higher in the priority list for that on call person because they know, okay. Yeah. Those are the important tables. We even have 1 customer that has, like, some financial data that's really important and needs to be perfect. And I think they have a dedicated Slack channel that everyone knows about. If there is an incident in this channel, like, everyone stop what you're doing and go very quickly address that. So just that idea of filtering, making sure that you have instructions kind of within Slack or whatever, you know, kind of collaboration tool you use to be advised of those issues happening very quickly, and you know how to prioritize them. And then I think there's another 1 that we mentioned too, this idea of an importance score. That's something that we can pass through. So we say this is a key asset. This table is super important based on a bunch of different factors we look at. We're helping customers look at that as a filtering mechanism for what they should be addressing or not. But, yeah, I think those are some of the big ways that we've seen people figure out how to get messages to the right place and the right team. And then a lot of times individuals themselves are setting up notifications for areas that they're responsible for. So it's, get this emailed to me, or I actually just wanna get a digest of what's happening. You know, at the end of the day, tell me the 10 things that happened in these tables. I don't need to hear about them in real time. Yeah. Another interesting 1 I've seen is 1 of our customers, PagerDuty,
[00:25:54] Unknown:
they have this team responsible for their data ops called the DataDuty. For their high priority reports, like a few, you know, executive reports, financial reports, they actually set up PagerDuty themselves as the alerting mechanism, and they apply this escalation policy that goes from, you know, data engineer to data engineering manager and all the way up to the executives when there's an issue with those high level, high priority reports. That makes sure that for these, like, most important assets, alerts and reminders on them go to the most important people until it's addressed.
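For readers who want to wire up something similar, here is a hedged sketch of triggering a PagerDuty incident from a data freshness check using PagerDuty's public Events API v2. The routing key, report name, and SLA are placeholders, and the escalation from engineer to manager to exec lives in the PagerDuty escalation policy rather than in this code.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"  # placeholder from a PagerDuty service integration

def trigger_data_incident(summary, source, severity="critical", dedup_key=None):
    """Send a trigger event to PagerDuty's Events API v2. The escalation
    policy (data engineer -> manager -> exec) is configured on the
    PagerDuty side, not in this payload."""
    payload = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key:
        payload["dedup_key"] = dedup_key  # lets repeated checks update one incident
    req = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Hypothetical usage: the executive revenue report is stale past its SLA.
# trigger_data_incident(
#     summary="analytics.exec_revenue_daily has not updated in 26h (SLA: 24h)",
#     source="warehouse-freshness-check",
#     dedup_key="freshness:analytics.exec_revenue_daily",
# )
```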
[00:26:31] Unknown:
And on that point of using some of these, what folks would view as DevOps tools that are primarily oriented towards application development and application operations, what have you seen as the applicability of systems such as PagerDuty or things like Sentry or, you know, maybe more kind of generic application oriented monitoring systems such as Prometheus and Grafana to the domain of data and being able to feed those systems and integrate them with a Monte Carlo or some of the other more data observability oriented tool chains to be able to have this holistic view of both the sort of performance and capacity of the underlying infrastructure layered in with the data domain and the kind of information modeling aspects of the problem space?
[00:27:25] Unknown:
We have plenty of customers that actually use all the integrations you just mentioned. We also integrate with Opsgenie. I have seen some customers push data into Grafana. So the way we see it is, like, we wanna be agnostic to the actual tool that the teams use to manage the incident feed, and make sure that we're integrating into whatever system they already currently use. We're huge proponents of people using what they already use to kind of get in front of the team and make sure that they're properly triaging and assigning issues out to everyone on the team. The additional piece that might be differentiated is, the way we look at it is, how do we actually help customers resolve the issue? And I think that's where we're making some investments to help people say, like, hey. What is the single pane of glass where I can actually see all the different components that could be causing this issue? And making it easier for people instead of having to run around to all the different component systems to figure out, is it Airflow? Is it something in Snowflake? Is it something upstream in the raw data? Kind of where is that issue happening? Instead of having to run around, we'd say, here's a single pane of glass. Here are all the clues that we've been able to ingest to help you figure out where the root cause is actually happening. So that's the 1 place where some of the PagerDutys of the world, while incredible tools, it's harder for them to bring in that context. We're really excited about those types of integrations, and we have tons of customers using them. To that point of being able
[00:28:43] Unknown:
to have that holistic view of I have an issue to now I'm able to identify the source of the issue, figure out the steps to resolution, and then indicate to the end user, you know, this is resolved. Curious what you see as a common process and approach that an individual or a team might take to be able to bring that incident from, this is a new incident, we need to address it, through to resolution and some of the kind of debugging tasks and the pieces of information that they need and the context that's required and some of the ways that they might approach that as a new and developing data team who doesn't necessarily have all of the integrations in place, and then how that might differ with a team that has developed that muscle. They're more practiced at being able to handle incidents, and maybe they're already using a data observability platform or a data catalog to view that sort of lineage and impact graph?
[00:29:41] Unknown:
For the data engineering teams that are a little bit more early on, without the stack in place to, you know, have an observability platform or have the lineage tools in place, it's a little bit scrappier. Oftentimes, instead of, you know, having a tool that gives them a 1 pane of glass view, they will go into the different operational systems. So for example, if there's a delay in updates for 1 of their tables, they will have to go directly into their Snowflake to see, you know, when it was last updated and who updated it, and who are the users that are querying that table. If they need to investigate further, they will also look at what query is, you know, populating this table, and what's the upstream table to this table. And they go 1 level upstream to see if there's any freshness issue in that.
Then they will go into the external data source to see even further upstream if it's an API change that resulted in, you know, the data being ingested changing volume or changing update frequency. But for the teams that have more of a solution stack, they will be able to have 1, you know, pane of glass view. So for both investigating the root cause of an incident as well as, you know, assessing whether they need to investigate the root cause of an incident or whether they can ignore it, it's easier for them to have an initial glance of, you know, how many tables are impacted in this incident and, you know, how many queries are run in these tables and what are the downstream BI reports that are associated with these tables and whether those reports actually have any engagement, any adoption or views with them that warrant, you know, an investigation process. And then when they go into it, they're able to leverage, for example, a lineage tool that they can use to see what tables and reports are downstream as well as which tables are upstream. So they can quickly identify what could be a potential source of the issue depending on the type of incident that they're faced with. For example, if there's a high percentage null rate for 1 of their data fields, oftentimes, they will go into their data warehouse to pull those rows that have the null values and then see if these rows are attributed to a certain channel or if there's any clue that provides a reason for what could have resulted in these values.
So those are the more common things that we see that teams have applied when, you know, going through the process of investigating an issue.
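For the scrappier, manual version of that investigation, these are the kinds of ad hoc Snowflake queries being described, shown here as Python string constants. The database, schema, table, and column names are made up; the INFORMATION_SCHEMA columns used (LAST_ALTERED, ROW_COUNT) are standard Snowflake.

```python
# A couple of ad-hoc queries an engineer might run when a table looks stale;
# INFORMATION_SCHEMA.TABLES exposes LAST_ALTERED and ROW_COUNT in Snowflake.
FRESHNESS_SQL = """
SELECT table_schema,
       table_name,
       last_altered,
       row_count
FROM   analytics.information_schema.tables
WHERE  table_schema = 'MARTS'
  AND  table_name   = 'ORDERS'
"""

# Going one level upstream: compare the last few days of loads to see whether
# the source feed changed volume (assumes a hypothetical LOADED_AT column).
VOLUME_SQL = """
SELECT DATE_TRUNC('day', loaded_at) AS load_day,
       COUNT(*)                     AS rows_loaded
FROM   analytics.raw.orders_landing
WHERE  loaded_at >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 1
"""

# With the snowflake-connector-python package (connection args are placeholders):
# import snowflake.connector
# conn = snowflake.connector.connect(account="...", user="...", password="...")
# for row in conn.cursor().execute(FRESHNESS_SQL):
#     print(row)
```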
[00:32:14] Unknown:
So now your modern data stack is set up. How is everyone going to find the data they need and understand it? Select Star is a data discovery platform that automatically analyzes and documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who's using it in the company, and how they're using it, all the way down to the SQL queries. Best of all, it's simple to set up and easy for both engineering and operations teams to use. With Select Star's data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets.
Try it out for free and double the length of your free trial at dataengineeringpodcast.com/selectstar. You'll also get a swag package when you continue on a paid plan. As far as the sort of more formalized incident management processes in the operation space, I've usually seen it where there's 1 person who is put in charge of actually resolving the incident and driving the remediation. There's another person who's there as the observer and the communicator, somebody who's responsible for taking notes so that all of that information can then be compiled into a postmortem after the fact to be able to use as a feedback mechanism to fix and preemptively resolve these problems to keep them from happening in the future.
And I'm curious how that plays out in this context of data teams, maybe if there are particular roles in the group that generally fall into 1 of those responsibilities during the incident, and just some of the communication aspects around data incidents and how they might differ from an operational context where you're not necessarily trying to fix a site being down and hemorrhaging millions of dollars in the event of a large ecommerce site, but you are trying to prevent a potentially very impactful and problematic issue with your data that can lead to these larger incidents, but, you know, maybe not have the exact same kind of timeliness component to it? What we've seen a lot of times is history does repeat itself, unfortunately,
[00:34:20] Unknown:
in terms of when it comes to these data incidents. And I think, to your point there, some of the incidents might not have the same kind of urgency or potential impact. We still want to invest, and I think we're seeing customers want to, like, understand what that issue is. If it keeps coming up, even if it's just a nagging issue, teams wanna be aware of those patterns and think, okay. How do we now, like, address that issue and stop that issue from happening in the future, even if it's small and easy to fix? There is a big piece of documentation there. So I would say it's not even about the roles, but more about, like, just kind of the mechanism that we have as a team to document when an issue occurs.
I mean, on the product side, as a product manager, 1 thing that we love to do is a change log. So we wanna know what has changed. And as a PM, if every week I have to figure out, okay, how am I gonna capture the change log this week? Should I use a Google Doc? Should I go use some new tool? Should I try and put it into, I don't know, a Jira ticket? I don't know. That kind of friction means that I probably won't do it. What we're trying to do is, like, how do we reduce the friction and make it so that everyone knows, like, after an incident happens, this is the way you do a postmortem, and it's these 3 components, super clear. There's no question. I think removing that friction is actually the most impactful thing that we can do to help teams get in the habit of doing these postmortems. Because the reality is, like, the postmortem is probably the least fun part of the actual resolution process. Like, it's exciting. I'm resolving the issue. I fixed it. Great. Hurrah. And now it's like, I don't wanna spend time writing this down. So what we're thinking about is, like, how do we remove that friction? That's what I would say is the most important piece.
[00:35:47] Unknown:
And then in terms of the additional components that are useful beyond just saying, okay. I have some signal that lets me know I have a problem. What are some of the additional capabilities or tools or automation or systems augmentations that they'll build in to be able to decrease the time to resolution so that they can find and fix problems more rapidly and be more proactive about identifying potential failures?
[00:36:17] Unknown:
1 thing that I've seen a lot of data teams either are working on building or are planning to invest time into is building this lineage tool. You can construct it with, you know, a dbt tool or leverage other vendors, but also some teams are investing their internal data engineering resources to build that tool. And specifically, if you can invest to build a lineage tool that gives you visibility into field level dependencies, that's even more powerful. Common use cases we've seen from customers are: they want to deprecate a column and want to know what are the downstream dependencies, what are the reports that will break based on that change, as well as, if there's a specific issue with a field in a report, how do we track down, you know, which field upstream is the source of that anomaly? A tool like field level lineage is something that will be able to power those use cases.
Shameless plug, we actually built this tool at Monte Carlo, and a couple of my engineering colleagues and I recently published an article about how you can build a field lineage tool on InfoQ.
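As a toy illustration of what field level lineage buys you (this is not the implementation from that article), a lineage graph can be held as a simple adjacency map and walked to answer the "what breaks downstream of this field" question. The field names below are invented.

```python
from collections import deque

# Field-level lineage as a directed graph: an edge A -> B means field B is
# derived from field A. All names are hypothetical.
lineage = {
    "raw.orders.amount": ["staging.orders.amount_usd"],
    "staging.orders.amount_usd": ["marts.revenue.daily_revenue"],
    "marts.revenue.daily_revenue": ["dashboards.exec_kpis.revenue"],
    "raw.orders.customer_id": ["staging.orders.customer_id"],
}

def downstream(field, graph):
    """Everything impacted if this field were deprecated or started
    producing bad values (simple breadth-first traversal)."""
    seen, queue = set(), deque([field])
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(downstream("raw.orders.amount", lineage))
# -> {'staging.orders.amount_usd', 'marts.revenue.daily_revenue', 'dashboards.exec_kpis.revenue'}
```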
[00:37:26] Unknown:
Happy to share the link if folks are interested. Another practical thing that I've seen teams do, especially if they're writing tests in a SQL based environment. So let's say I wanna write a SQL command that checks for a particular field always being not null. The thing that I've seen people do is to write the test, but also to include within the test all of the other columns within that table. So when the test runs, basically, yes, it's checking for the nullness of that particular column. But in addition, we're pulling in the context of, like, for those null values, what other field values exist within that particular set of records. And we've seen people do that with our SQL rules tool. I think that could be done with, like, a Great Expectations environment, even dbt tests. So I think that idea of, like, when you're writing a test, how do you think about the context that will be helpful when you now need to resolve that issue that's been caught by the test? That seems to be really effective as kind of another simple hack, not very complicated, but simple.
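A minimal sketch of that test-plus-context pattern, assuming a hypothetical orders table and an email column under test; the same shape works as a dbt test or a Great Expectations expectation with a row sample attached, and the SQL dialect here is warehouse-style.

```python
# The check still fails the pipeline on null emails, but it also pulls the
# surrounding columns so the alert itself carries root-cause context.
# Table and column names are invented for illustration.
NOT_NULL_WITH_CONTEXT_SQL = """
SELECT order_id,
       customer_id,
       channel,
       created_at,
       email          -- the column under test
FROM   analytics.marts.orders
WHERE  created_at >= DATEADD(day, -1, CURRENT_TIMESTAMP())
  AND  email IS NULL
LIMIT  100
"""

def run_not_null_test(cursor):
    """Log the offending rows (with their context columns) before failing,
    so whoever gets paged can start debugging from the alert itself."""
    cursor.execute(NOT_NULL_WITH_CONTEXT_SQL)
    rows = cursor.fetchall()
    if rows:
        for row in rows:
            print("null email:", row)
        raise ValueError(f"{len(rows)} rows with a null email in the last day")
```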
[00:38:18] Unknown:
Once you have resolved an incident, 1 of the things that you mentioned as some of the post resolution steps is running backfills, and I know that that can be challenging if you haven't already planned that in when you're designing the initial pipeline. And so I'm wondering if you can talk to some of the considerations and strategies that are necessary as you're designing and building out these pipelines in the first place to be able to properly handle backfill operations for the case where you do have a problem, you do need to regenerate the data, and you don't want to introduce new problems as a result of trying to fix the old problem?
[00:38:58] Unknown:
Yeah. This is a tricky 1. I think as a PM, I probably am not the most qualified; maybe Mei has some better ideas here. But 1 thing that I've seen, I mean, I came from a world at Segment, which was all about streaming, and the idea of, like, backfilling data was nearly impossible. It was kind of like a 1 way stream of events. And once those events were pushed into a destination, it was very difficult to undo any of that. But I do think, at least from what I've heard, there is this idea of, like, building the pipelines with some kind of, like, temp tables in between so that you can do some rollbacks. Like, there's a lot of things that I think we've heard customers doing. I will be the first to say I'm not the expert in that. I don't know, Mei, if you have additional thoughts.
[00:39:35] Unknown:
Not super familiar with, like, best practices around this. Yeah, Tobias, if you have any ideas around that. So, I mean, definitely,
[00:39:44] Unknown:
1 of the things related to backfilling is that it's usually related to working with batch datasets and dealing with blocks of time. And so in my own work, I've got 1 pipeline, for example, that the way that it's written, it runs on a daily basis, but the code itself is kind of naive to the timeliness of the data. It will just run a SQL query, dump out whatever the contents of the database are at that point in time. And so if you try to run a backfill to say, oh, I missed 5 days' worth of data. I need to fill in those 5 days. It's not actually going to fill in that missing 5 days' worth of data. It's just gonna generate whatever new data it happens to have available. And so as you're designing the job, you need to be thinking about that sort of time window aspect of it. So when you're writing the queries, you want to make sure that you are passing in specific date ranges so that you're only selecting and working with the data that is pertinent to that specific window of time.
And that gives you also the flexibility of being able to adapt that same task for different granularities where you can say, okay. I care about being able to populate this downstream system with information on a daily basis, this other system, or for this other task, I actually care more about the weekly or the monthly aggregates. And then because you already have information about those time windows when you need to do a backfill, you can just generate the appropriate time windows for those windows that you missed, and you don't have to be concerned about saying, okay. Well, actually, I'm pulling in the most recent data, not the older data. And so just as you're designing the pipelines upfront, understanding that backfilling is something that will have to happen no matter how careful you are. And so being very deliberate about having time and windowing be a first class consideration in all of the jobs that you're building.
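Here is a small sketch of that windowing idea: the extract takes an explicit half-open window, so the scheduled daily run and a 5 day backfill are the same code path. The table, column, connection, and parameter style are placeholders and depend on your driver.

```python
from datetime import date, timedelta

# The extract is parameterized by an explicit half-open window
# [window_start, window_end) instead of "whatever is in the table now",
# so a scheduled run and a backfill run share one code path.
# The %(name)s paramstyle is pyformat; adjust for your database driver.
EXTRACT_SQL = """
SELECT *
FROM   app.orders
WHERE  created_at >= %(window_start)s
  AND  created_at <  %(window_end)s
"""

def run_window(cursor, window_start, window_end):
    cursor.execute(EXTRACT_SQL, {"window_start": window_start, "window_end": window_end})
    return cursor.fetchall()

def backfill(cursor, missing_days):
    """Re-run the exact daily windows that were missed; the same helper
    could be called with weekly or monthly windows for other granularities."""
    for day in missing_days:
        rows = run_window(cursor, day, day + timedelta(days=1))
        print(day, len(rows), "rows")  # load step omitted in this sketch

# Example: five missed days.
# backfill(cursor, [date(2022, 3, 1) + timedelta(days=i) for i in range(5)])
```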
[00:41:40] Unknown:
Yeah. No. I've seen also the idea of appending columns with the timestamp of the time at which the data arrived, and that's another thing that I've seen work well. So it's like, to your point about time, understanding which different timestamps you have for when the event was received, potentially when it was pushed into the actual warehouse, when it was potentially
[00:41:58] Unknown:
removed or ETLed or ELTed. Those are also considerations I've seen work well in my lived experience there. Yeah. And in the event that you're working with, for instance, a set of tables that don't have any conception of time, there's no date time column in there for you to be able to segment across, another option is being able to use some extra, maybe key value, system to be able to say, okay. In this batch, I processed these IDs, you know, this range of IDs from this database table. And so then in your job definition, you can say, okay. When I need to backfill, I'm going to query this key value store to see what are the IDs that I've already processed and then query against, you know, what are the gaps to say, okay. I only wanna process the IDs that I haven't already worked on. And so that way, you're making sure you're not duplicating effort.
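A toy version of that ID checkpoint approach, with an in-memory dict standing in for whatever key value store or control table you actually use:

```python
# A tiny checkpoint store standing in for Redis/DynamoDB/a control table:
# it remembers which IDs each job has already processed.
checkpoints = {}  # job_name -> set of processed IDs

def process_batch(job_name, candidate_ids, process_fn):
    """Process only IDs we have not seen before, then record them, so a
    backfill can be re-run safely without duplicating work."""
    done = checkpoints.setdefault(job_name, set())
    todo = [i for i in candidate_ids if i not in done]
    for record_id in todo:
        process_fn(record_id)
    done.update(todo)
    return todo

# First run handles IDs 1-5; the "backfill" run only picks up the gap (6-8).
process_batch("orders_export", range(1, 6), lambda i: None)
print(process_batch("orders_export", range(1, 9), lambda i: None))  # [6, 7, 8]
```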
Another question that's interesting to dig into in this space of incident management, incident resolution, and kind of the organizational and technical practices around it is, in your experience of working with your customers and the various teams that they're involved with, what are some of the sort of gaps in experience or knowledge or understanding that you've encountered that have been useful to address as you're helping them adopt these practices of incident management, data observability, and just kind of raising the general awareness of how to approach it and the relative importance of having that capability in the team?
[00:43:27] Unknown:
The 1 thing that comes to mind there is just, like, even the concept of SLAs, SLOs, and SLIs, which I think in the software engineering world is fairly standard, and people probably have a good understanding of that measurement model. And I've had many conversations with folks in the data world that have never heard of that before. And, you know, obviously, it's no fault of their own. I think the data engineers that we speak to a lot of times are focused on the hair on fire problem of, like, oh, I have to, you know, spend most of my day trying to respond to consumers or data consumers saying something is wrong. I don't have time to go research best practices in the DevOps world. So I think just that concept of, like, defining what an SLI is and then being able to have some kind of objectives that you've set as a team and then an agreement, which is kind of the holy grail, I think that's an exciting area. Another 1 is severity levels. That's, again, a similarly standard practice in the DevOps SRE world. This is a SEV 1, SEV 0, SEV 2.
Similarly, not a common practice among data teams, at least in many of the ones that we've spoken to. Yes, there are many that do think about the world that way, but I would say the majority of folks don't yet think about it that way. So I think those are just really helpful things that we can borrow from the DevOps world and keep in mind as we build out systems for data, which I will also note is significantly more complicated. I think we've touched on that multiple times here, but there's a lot of additional complexity to building an SRE or DevOps system for data. But I guess that's why we're here. It's exciting.
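For teams who have not met the SLI/SLO vocabulary before, a data flavored example is easy to state: the SLI is the measurement, the SLO is the target you agree on as a team. The deadline and numbers below are invented.

```python
# SLI: the measured fraction of the last 30 daily loads that landed before
# an agreed 6am deadline. SLO: the objective that this stays at or above 95%.
landed_on_time = [True] * 28 + [False] * 2   # hypothetical 30-day history

sli = sum(landed_on_time) / len(landed_on_time)
slo = 0.95

print(f"freshness SLI = {sli:.1%}, SLO = {slo:.0%}, met = {sli >= slo}")
```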
[00:44:52] Unknown:
In your work with these teams, what are some of the most interesting or innovative or unexpected ways that you've seen them approach this overall problem of incident response, remediation, and mitigation?
[00:45:02] Unknown:
I think there's 1 example that comes to mind. We have 1 customer that became really obsessed with this idea of wanting to classify every incident that Monte Carlo generated, and we have a mechanism where you can classify things as investigating or resolved or not an actual issue or even a false positive where there's an actual issue with the underlying detector logic itself. Those are really helpful signals for us. But what this customer did was, they were like, I want us to basically classify every single incident that comes through. He actually went and built a Slack bot that would remind people and kind of, like, chase people down if an incident wasn't classified. So I think people were just getting inundated with, like, you didn't classify the incident. It's like, okay. Yeah. Okay. I'll get to it. That 1 was pretty interesting and innovative.
And then there's another 1. 1 customer we've been talking to, they happen to have a lot of sprawl, so, like, in the hundreds of thousands of tables. And, I kind of alluded to this at the very beginning, but this idea of how do we think about, like, simplifying this? How do we reduce the complexity of this? And this is, like, a next level type of analysis of the problem. This particular customer was, like, basically using our lineage nodes to kind of, like, compute a complexity score for all the different schemas that existed within their warehouse and then being able to use that as a prioritization model. So say, like, hey. This is super complicated. What are, like, the ways that I could potentially reduce the complexity by removing tables that aren't being used or that have been around for eons and no 1 knows what they do? But anytime that this person would try and go and say, look, I wanna go remove this table because I don't think anyone's using it, they were like, if I do this, though, what's impacted? It's an innovative approach to thinking: like, you could even think about the process, or I could think about reducing the complexity so that when I do build a process, it's actually gonna work. I thought that that was a really interesting perspective and really, like, pushing the bounds of what's possible. Like, there's not easy ways to, you know, build a graph database on top of lineage today, especially when you have hundreds of thousands of tables, and 1 that could even be comprehensible. Like, as a human, obviously, I can't store 150,000 things in my head and look at a graph and understand it. Maybe some folks can. I certainly can't. But, anyways, that's, I think, another example of an interesting thing that we've seen customers do. Yeah. Another 1 that I've seen is a data engineer building a visual pipeline of their entire data flow.
[00:47:26] Unknown:
This task is upstream of this other task, and if it's running, you know, an hour late, then it has a downstream impact on all these other tasks downstream of that. Being able to completely visualize that, and to, you know, pinpoint 1 task and see the impact of that 1 task on all the other pipelines that have a dependency on it, is a super cool internal tool that I've seen. Another 1 I've seen is data engineering teams being creative with the way that they tag assets. So 1 company actually parsed their own dbt models and used the parsed artifacts to determine whether a table is important or not, then assigned those key asset tags to those tables and used that to route the incident alerts to the right teams.
So it's a hacky way to reduce the noise and the number of incidents that they're receiving, and to help surface some of the most important issues for their teams to focus on.
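For a sense of what that dbt-artifact trick can look like, here is a hedged sketch that reads target/manifest.json, counts how many models depend directly on each model, and flags heavily-depended-on models as key assets. The manifest fields shown match recent dbt versions but should be verified against your version, and the threshold is an arbitrary placeholder.

```python
"""
Sketch: derive "key asset" tags from dbt's compiled manifest by counting
direct downstream models for each model in the project.
"""
import json
from collections import Counter

with open("target/manifest.json") as f:
    manifest = json.load(f)

downstream_counts = Counter()

for node_id, node in manifest["nodes"].items():
    if node.get("resource_type") != "model":
        continue
    # depends_on.nodes lists the upstream nodes this model selects from.
    for upstream_id in node.get("depends_on", {}).get("nodes", []):
        downstream_counts[upstream_id] += 1

KEY_ASSET_THRESHOLD = 5  # arbitrary cutoff for this sketch

key_assets = {
    node_id: count
    for node_id, count in downstream_counts.items()
    if count >= KEY_ASSET_THRESHOLD
}

# These identifiers (e.g. "model.my_project.orders") can then be tagged in a
# catalog or observability tool so alerts on them route to the owning team.
for node_id, count in sorted(key_assets.items(), key=lambda kv: -kv[1]):
    print(f"{node_id}: {count} direct downstream models")
```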
[00:48:26] Unknown:
And in your own experience of working with these teams and helping to understand the overall process of incident management, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:48:40] Unknown:
I think the interesting thing that I've seen in this space is that, collectively, all the teams are exploring the best practices going forward, since it is a new paradigm that we are building. Every team is eager to learn the best practices that other teams are exploring and try to, you know, pick up things that they could apply from other companies. We oftentimes hear customers asking, oh, how are your other customers using field-level lineage? How are your other customers routing the alerts to, you know, make sure that the right team gets the right incidents? So I think it's, like, the collaborative
[00:49:18] Unknown:
effort and eagerness to, you know, learn across teams that's interesting in this space. I was gonna say, I think 1 of the more interesting lessons is just the impact of notification fatigue. I know we touched on that for quite a bit earlier in the podcast. But the idea of, like, not sending too many messages, I wouldn't have thought that would be so important. And, like, there is this very clear threshold after which the notifications lose all value. So I think that was 1 of the most unexpected things in the work that we've done with our customers, just coming to that realization of how important and critical this is.
[00:49:50] Unknown:
And for data teams who are just starting down this path of working with incident management, or who have reached a certain level of sophistication, what do you see as the missing pieces for being able to help them identify and resolve problems as fast as possible, and some of the principles or practices that would make this a simpler problem to adopt and solve?
[00:50:28] Unknown:
Yeah. I think the incident catching, identification, and notification process has pretty much been built out by different vendors and different tools in the space. But what's really still missing, just given the complexity of the issues that could happen, is the root cause analysis piece. Because it could be, you know, any of a variety of external systems that data engineering teams deploy. It could also be, you know, all kinds of types of issues, like a code change, or, you know, a data pipeline failure, or, across a humongous number of types of data sources, a format change, a data type change, or a values change that can result in these incidents.
So there hasn't really been any automated solution to help teams resolve these issues, or at least identify clues that can help them resolve them fast. So I think that's the main missing piece in the stack today.
[00:51:28] Unknown:
I'd add the piece of deeply understanding the downstream consumers. And I think, given that data is, like, proliferating across organizations, everyone's talking about how, you know, all these folks within the organization are actually using data to do their job. The question of, like, okay, this particular table is broken, and not being able to deeply understand exactly who in the organization is being impacted by that, that I think continues to be a big gap for us. And we spend a lot of time there; we actually have quite a few tools to start understanding, okay, which downstream BI dashboards were affected? Who was looking at those dashboards?
But that extends further. It's like, which data scientist potentially ran a model on that particular data? And, like, did that model have an output that was then used by the customer success team to, you know, address a particular customer, or by sales folks to actually go and reach out to a prospect, because the model had this output based on the bad data? So it's like, what is the ripple effect of data being proliferated and these decisions being made across the organization? So just mapping that out, I think, is a huge opportunity. Like, what actually happened? Who was impacted by this? It reminds me of what 1 of our customers said: the fact that they don't have that means that as a data engineering team, anytime something happens, they basically have to notify half the company that something broke. Whether or not any of those people were impacted by it, they're still saying, like, hey, just so you know, this thing happened. We were able to address it, but, you know, there's this potential ramification of the data having been broken for this amount of time. And, like, that's just an incredibly painful thing to do as a team. So, like, imagine having to just constantly tell your org, hey, this thing messed up. Like, oh, yep, this thing messed up again. You're just kind of, like, eroding the trust that you built with the organization as a data team. So I think, yeah, we really want to be able to help customers narrow down who was impacted by this and only tell those people that there's this potential thing that happened and, like, this is what you should do as a result. There's a huge value there, I think, that we can create, and I think it's a gap today.
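A small sketch of that "who was actually impacted?" idea: walk the lineage graph downstream from the broken table and collect only the owners of assets that sit below it, so the notification goes to them and nobody else. The lineage edges and owner mapping here are invented examples; in practice both would come from your metadata and BI tools.

```python
"""
Sketch: targeted impact notification from table-level lineage.
"""
import networkx as nx

lineage = nx.DiGraph([
    ("analytics.orders", "analytics.revenue_daily"),
    ("analytics.revenue_daily", "dashboard.exec_revenue"),
    ("analytics.orders", "model.churn_score"),
    ("model.churn_score", "dashboard.cs_accounts_at_risk"),
    ("analytics.users", "dashboard.marketing_funnel"),
])

owners = {
    "dashboard.exec_revenue": "#exec-reporting",
    "dashboard.cs_accounts_at_risk": "#customer-success",
    "dashboard.marketing_funnel": "#marketing",
}


def impacted_owners(broken_table: str) -> set:
    # nx.descendants returns everything reachable downstream of the node.
    downstream = nx.descendants(lineage, broken_table)
    return {owners[node] for node in downstream if node in owners}


# Only the exec-reporting and customer-success channels get pinged;
# marketing never hears about it because their assets aren't downstream.
print(impacted_owners("analytics.orders"))
```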
[00:53:20] Unknown:
Are there any other aspects of the overall space of incident management and how it manifests for data teams and in data platforms that we didn't discuss yet that you'd like to cover before we close out the show? There's 1 that we hear a lot, and it's this idea of going back to reducing the complexity,
[00:53:38] Unknown:
just being able to quantify and map data that is not being used. So, like, which tables has no 1 looked at? You know, that simple question might actually be easy to figure out, but we find that it's not something that data engineers are necessarily doing on their own. So I think that continues to be an area of, like, how can we reduce the entropy that is occurring within our data systems, versus continuing to allow kind of anyone to create any table anywhere in this common data warehouse that everyone can access. What we're seeing is, like, the need to create more structure and control around the proliferation of schemas, where, you know, every single person that joins creates a new schema under their name. They might have tables in there that are populating core downstream processes even though they're in their own user-namespaced schema.
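As a hedged example of answering that "which tables has no 1 looked at?" question, the sketch below assumes a Snowflake warehouse, where the ACCOUNT_USAGE share's ACCESS_HISTORY view records which objects each query touched. The view and column names should be checked against current documentation, other warehouses expose similar query or audit logs, and run_query is a placeholder for whatever client executes the SQL.

```python
"""
Sketch: find tables with no reads in the last 90 days, assuming Snowflake's
ACCOUNT_USAGE views. Treat the exact view/column names as assumptions.
"""

UNUSED_TABLES_SQL = """
with accessed as (
    select distinct upper(obj.value:"objectName"::string) as table_name
    from snowflake.account_usage.access_history,
         lateral flatten(input => direct_objects_accessed) as obj
    where query_start_time >= dateadd('day', -90, current_timestamp())
)
select t.table_catalog, t.table_schema, t.table_name
from snowflake.account_usage.tables t
left join accessed a
  on upper(t.table_catalog || '.' || t.table_schema || '.' || t.table_name)
     = a.table_name
where t.deleted is null
  and a.table_name is null  -- nothing has read this table in 90 days
order by t.table_catalog, t.table_schema, t.table_name
"""


def run_query(sql: str):
    # Hypothetical helper: execute with your warehouse client of choice
    # and return the result rows.
    raise NotImplementedError


if __name__ == "__main__":
    for row in run_query(UNUSED_TABLES_SQL):
        print(row)
```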
That person then leaves the company, and then what do you do? Can I just delete this schema or not? Is it potentially gonna cause a catastrophic failure down the pipeline? So I think just that idea of defining really clear practices of, like, how do we,
[00:54:36] Unknown:
as data teams, ensure that people are stewarding data in the right way is also a really exciting area for all of us. And from what I'm hearing talking to customers, it feels pretty early, but also really exciting. That would be it for me. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question for each of you, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:55:06] Unknown:
The ability to kind of query and quantify the complexity of a lineage graph, and to slice and dice what is actually happening within lineage across the end-to-end pipeline, from, you know, raw data ingest, through the data lake if you have one, through the warehouse, all the way down to the actual BI layer. And just being able to understand all of the different relationships across each of the nodes within each of those subsystems and how they all work together. That's an area that we're making good progress towards, but it continues to be a gap. There's tons of opportunity
[00:55:40] Unknown:
within that. And once that's mapped, there's, like, endless possibility in my mind. Yeah. As I mentioned, I think for incident management, most of the, you know, anomaly detection, the notifying of teams, and some of the workflows kicked off for dealing with the incidents have already been built out. And the piece that is really missing is the root cause analysis piece. I talked about it already, so it's the same response.
[00:56:05] Unknown:
Well, thank you both very much for taking the time today to join me and share your experiences of working with teams to help them adopt and implement incident management practices and helping them understand how to better manage data uptime. So I appreciate all of the time and effort you've put into that, and I hope you enjoy the rest of your day. Thanks so much, Tobias. This was a really, really fun conversation. Yeah. I'm looking forward to continuing to learn from you. Yeah. Thanks for the insights on, you know, backfilling data as well. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introductions
Francisco's Journey into Data Engineering
Mei's Interest in Data Engineering
Common Data Incidents and Their Causes
Steps and Participants in Incident Resolution
Adoption of Incident Management Practices
Organizational and Technical Challenges
Alert Fatigue and Signal Identification
Introducing and Tuning Alerts
Alert Routing and On-Call Rotations
Integrating DevOps Tools with Data Observability
Incident Resolution Process for New and Mature Teams
Roles and Communication in Incident Management
Designing Pipelines for Backfill Operations
Gaps in Knowledge and Experience
Innovative Approaches to Incident Management
Challenging Lessons Learned
Missing Pieces in Incident Management
Reducing Data Complexity
Biggest Gaps in Data Management Tools