Summary
The accuracy and availability of data have become critically important to the day-to-day operation of businesses. Similar to the practice of site reliability engineering as a means of ensuring consistent uptime of web services, there has been a new trend of building data reliability engineering practices in companies that rely heavily on their data. In this episode Egor Gryaznov explains how this practice manifests from a technical and organizational perspective and how you can start adopting it in your own teams.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you'll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
- Your host is Tobias Macey and today I’m interviewing Egor Gryaznov, co-founder and CTO of Bigeye, about the ideas and practices of data reliability engineering and how to integrate it into your systems
Interview
- Introduction
- How did you get involved in the area of data management?
- What does the term "Data Reliability Engineering" mean?
- What is encompassed under the umbrella of Data Reliability Engineering?
- How does it compare to the concepts from site reliability engineering?
- Is DRE just a repackaged version of DataOps?
- Why is Data Reliability Engineering particularly important now?
- Who is responsible for the practice of DRE in an organization?
- What are some areas of innovation that teams are focusing on to support a DRE practice?
- What are the tools that teams are using to improve the reliability of their data operations?
- What are the organizational systems that need to be in place to support a DRE practice?
- What are some potential roadblocks that teams might have to address when planning and implementing a DRE strategy?
- What are the most interesting, innovative, or unexpected approaches/solutions to DRE that you have seen?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Data Reliability Engineering?
- Is Data Reliability Engineering ever the wrong choice?
- What do you have planned for the future of Bigeye, especially in terms of Data Reliability Engineering?
Contact Info
- Find us at bigeye.com or reach out to us at hello@bigeye.com
- You can find Egor on LinkedIn or email him at egor@bigeye.com
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Egor Gryaznov, co-founder and CTO of Bigeye, about the ideas and practices of data reliability engineering and how to integrate them into your systems. So, Egor, can you start by introducing yourself? Sure. Thanks for having me on, Tobias.
[00:02:11] Unknown:
I'm Egor. I am the co-founder and CTO of Bigeye. We are building a data observability platform. And what we're gonna be talking about today, data reliability engineering, is really where we want to go with the product and enable customers to be able to have a solid data reliability engineering platform
[00:02:32] Unknown:
regardless of the size of their data team. For folks who haven't heard the episode that you did previously about the work that you're doing at Bigeye, can you remind us about how you got involved in the area of data management?
[00:02:43] Unknown:
Definitely. So I describe myself as a software engineer that has just been doing data things my whole career. I started off writing MapReduce jobs back when Hadoop was just getting started. And from there, I got into data warehousing, built out warehouse infrastructure, particularly using Vertica, built in-house ETL tooling, did data modeling, was an early adopter of Looker back in the day. And in late 2014, I joined Uber as one of the first data engineers in the company. Uber at the time was starting to scale out their analytics platform. They were going from one Postgres replica supporting the whole company to a proper data stack.
So I did sort of the same things, except at 10x the scale and 10x the pace: set up their Vertica clusters, built internal ETL tooling, made sure that all the data was flowing, set up visualization, did star schema data modeling. And there, we realized that in order to scale a data platform to a company of that size and that amount of disparate data, we needed to build a lot of internal tooling to manage that data. And so Uber, as many other large tech companies tend to do, had an internal team of engineers building out internal tooling around their data infrastructure.
So these are things like data catalogs, data quality tools, lineage tracing, communicating data outages. And when my co-founder, Kyle, and I started talking to people in the industry, we realized that everyone is facing the same problems everywhere else as well. And the lack of tooling available in the market made it really hard for data teams to spin up and scale quickly because they would have to hire people in order to build all this stuff internally. And so that was really the impetus for starting Bigeye: to provide the sort of tooling that we wish we had in 2014 and 2015 to data teams that are just getting started today. So focusing on the term data reliability engineering,
[00:04:58] Unknown:
and you mentioned also that the platform that you're building at Bigeye is around adding observability to the data stack. And I'm wondering if you can just give your definition of what is encompassed by that phrase of data reliability engineering and maybe some of the ways that it bears relation to the observability aspects of the sort of data platform and data pipelines?
[00:05:21] Unknown:
I see data reliability engineering as a natural extension to the data team in the same way that SRE was a natural extension to software engineering teams. Data reliability engineering means treating data quality like it is an engineering problem. It's applying practices and using tools in order to ensure that the data stays fit for use across every single application in the business without losing the velocity of your data team in your data environment. If you look at SRE, it came out of a need to scale how software was built. Before, when you had bare metal servers, it would take a long time to build an application, deploy it, go to your server closet, reconnect the network cables, and then that's how things worked. And so things moved slower.
With the advent of AWS, all of a sudden, any developer across the globe could push a button, get a database, get a web server, have some APIs. And now this proliferation of applications led to the next problem of how you know that these applications are actually running correctly. This is where SRE came into play. SRE had the goal of creating the processes and tooling necessary to know that your applications are running correctly, are actually available, and working as intended. If you look at the data landscape today, data infrastructure is very easy. It's swipe a credit card, get Snowflake. There's your data warehouse. Swipe a credit card, get Fivetran. There's your ETL tool. And pushing more and more data into a central data warehouse is easier than ever. This now takes teams days instead of months as it used to. Because it's so much easier now to collect all this data and start using it, you have this large responsibility on the shoulders of fewer and fewer data engineers who are managing a larger number of data pipelines.
And in order to know that the data is good and reliable, you need similar sorts of practices and tools that we have in SRE, but applied to the data landscape. And this is where data reliability engineering comes in. Now you asked a great question, which is how does data observability map into this concept of data reliability engineering? And in my mind, data observability is a necessary prerequisite for a great data reliability engineering platform. You can't start ensuring the quality of your data and know that it's reliable and be able to detect that and communicate that without first monitoring what your data looks like.
And data observability encompasses all of these concepts around monitoring, alerting, setting up the instrumentation, making sure that at any point in time, you know what the state of your data is and what it looks like. And once you have that observability in place, then you can start building the other pieces of the data reliability engineering platform, such as defining SLAs around your deliverables, such as your datasets and your dashboards, running better incident management processes, having runbooks for resolving issues that have occurred in the past where you can take very concrete steps in order to fix them. And, obviously, the holy grail of any SRE or DRE movement is automating all of this and then saying, as soon as an issue arises, can we have the system push the right buttons in order to fix the problem without having any human intervention?
And so data observability is a necessary piece, but it's not the whole piece of the puzzle. And there's a lot more to data reliability engineering, but you can't even get to those more advanced topics without first knowing what your data looks like. So you mentioned how observability is sort of the base need for being able to build out a data reliability
[00:09:51] Unknown:
capacity within an organization or a data team. And I'm wondering if you can talk through some of the other aspects of data reliability engineering, both the technical and social and organizational aspects.
[00:10:05] Unknown:
So let's break this down into those two different pieces: the technical aspects, which would be the tooling and how you even do data reliability engineering, and then the social ones, which are typically how you structure a data team, or really a whole business and organization, in order to leverage data and participate in best practices around data reliability. From a tooling perspective, the first thing that is necessary, as we mentioned earlier, is data observability, understanding what your data looks like. Historically, data observability has been built in house. You've had teams run SQL queries, collect some values, and push them into something like Datadog, and then have Datadog alert them when something's going wrong.
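For readers following along, a minimal sketch of that do-it-yourself pattern might look like the following, assuming a Postgres-compatible warehouse (via psycopg2) and a StatsD-style metrics agent (via the statsd package); the table, column, and metric names here are hypothetical:

```python
# Hypothetical DIY data observability check: run a SQL query against the
# warehouse, turn the result into a metric, and ship it to a metrics backend
# that can alert on it (Datadog, StatsD, etc.).
import psycopg2                      # assumes a Postgres-compatible warehouse
from statsd import StatsClient       # assumes a StatsD-style metrics agent

def emit_freshness_metric(dsn: str, table: str) -> None:
    """Measure hours since the table last received data and emit it as a gauge."""
    query = f"""
        SELECT EXTRACT(EPOCH FROM (NOW() - MAX(updated_at))) / 3600.0
        FROM {table}
    """
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(query)
            hours_stale = cur.fetchone()[0] or 0.0

    # The alerting rules live in the metrics backend, e.g.
    # "page the on-call if hours_stale > 24".
    statsd = StatsClient("localhost", 8125)
    statsd.gauge(f"data.freshness_hours.{table}", hours_stale)

if __name__ == "__main__":
    emit_freshness_metric("dbname=warehouse user=analytics", "orders")
```

A scheduler such as cron or Airflow would run a script like this on an interval, which is roughly the homegrown setup the conversation describes.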
Obviously, with tools such as Bigeye today, this is now available out of the box, and teams don't have to take this on on their own. Usually, once you know the state of your data, you want to then surface this information to the rest of the business. And this is where data catalogs come into play. If you look at things like Alation, Stemma, and Collibra, data catalogs are a great place for the whole business to see what is the state of their data, but also a great place to start enforcing SLAs around your data products. If you think about every table, every dashboard, every report, every ML model as a data product, then a data catalog becomes a clear place for the whole business to understand what is the state and what is available for each one of those products.
Finally, the last tool that would be necessary is really around runbooks and incident management. Sadly, there isn't much available today, but it is an interesting area to explore. If you look at PagerDuty and Opsgenie and the like from the SRE world, they're geared around individual incidents that get recorded historically. And you can comment on them and write notes in the attached runbooks that will tell you, this is how this problem was resolved. Sadly, today in the data world, this turns into a guessing game, where you go on Slack, find the last person who touched this dataset, and ask them, well, how did you solve this problem last time? Have you ever seen this? And that's just not a great way to scale out incident management. And so the last missing piece for data reliability engineering, in my mind, is having a holistic way of expressing what issues have happened in the past and how they got resolved, and then, obviously, potentially automating the resolution of some of them.
From an organizational perspective, the tooling for data reliability engineering is actually the easy part. And the hard part for most businesses is the actual organizational aspect, or the organizational systems that need to be in place for data reliability engineering to even exist. I've talked before about the two types of data users, the data producers and the data consumers. Producers are your data and ML engineers who are building datasets and data products, and then your consumers are usually the analysts in the business and the data scientists who are doing something with those data products. Now data producers need a good way of understanding who their consumers are, and that is typically the hardest part about data reliability engineering inside of a business. If you have a large organization that has thousands of people, you could have different consumers of varying degrees of technical knowledge and expertise all using the same datasets that you're producing.
And if the producers don't know who these people are and how to communicate to them in a way that they will understand, then it's really hard to have this data reliability, because the consumers will not be able to understand when something goes wrong. Or if they do understand that something is wrong, they might not understand why or how they should be working around it. The second hard part about the organizational systems necessary is the problem of empowering the users to act on and be notified about reliability issues.
Today, especially in a remote working environment, we have this problem of email and Slack overload, where I don't think I've cleared my Slack notifications in the last three months. There's always something unread. There's always something going on. Having the tools in place to tell you about data issues is a necessary prerequisite to data reliability. But if these notifications aren't going to the right people, you just end up with alert blindness and alert fatigue, where the people getting notified about the issues don't actually know how to deal with them and may just end up ignoring them.
So these alerts need to be relevant. The data team needs to be able to set up groups of people that care about particular data products and notify them, and only them, about any issues with those products. And those people should then be empowered to actually go and do something about the issue. If you get a notification that's not relevant to you, that you can't act on, that's
[00:15:53] Unknown:
no better and probably worse than just not getting a notification at all. You mentioned that there are some parallels between data reliability engineering and site reliability engineering from the sort of web application and services ecosystem. I'm wondering what are some of the ways that the concepts map between the two domains and some of the aspects of data reliability engineering that are unique to the problems that exist in the sort of data management and data analytics ecosystem.
[00:16:23] Unknown:
A lot of the core principles of site reliability engineering, or SRE, are still applicable in data reliability engineering. Data observability, which we covered a little bit earlier, is very similar to system observability, which in SRE land is provided by Datadog, New Relic, AppDynamics, and the like. There's also the notion of having runbooks and having incidents. When an application goes down, you have an issue. And that issue then needs to be resolved and needs to be documented. Data reliability engineering has the same concept. Finally, one of the largest ideas in SRE is the notion of SLAs, or service level agreements, where each service has an allowable amount of downtime and a way to measure that downtime and when it is actually available or not.
Data products should behave a lot like applications, should have their own SLAs that describe when that data product is available. And if it's not, how long has it not been available for? And then you can start summarizing that by quarter, by year, to see how often things are going down, how often I can trust my data to be up to date and reliable. Now talking about some of the differences, applications all have the same sorts of metrics that you are monitoring about them. You have your classic latencies and QPS, how many requests per second it's serving, memory consumption, CPU utilization.
These encompass the vast majority of metrics that people care about from an application perspective. And just applying those four, you can cover 90 to 95% of applications and know the general state of what is going on with them. Error rates would actually probably be the fifth one there. The hard part about data is that data is so disparate. There are so many different things represented by each dataset that you can't have the same metrics representing every single data product. If you have a dataset that you expect to update daily, you might want to measure how often it is updating. You might wanna measure how many records it is loading.
But then once you start drilling into it, you have different fields and different columns that are available to you. And those all have their own unique properties that you might want to measure, which are different across each dataset and each column. For example, if you have an identifier for a user in your system, you might expect that identifier to be unique. But that measure doesn't actually apply across the board, because different columns have different data in them. You might have a column storing a ZIP code, and you might wanna check that this is actually a valid ZIP code or a valid email address. So the hard part about doing data reliability engineering is that there are so many things that could go wrong with the data, and every incident is probably going to be very unique, as opposed to SRE, where finding out about incidents is pretty easy and the hard part is actually figuring out how to debug it.
Now I'm not saying that debugging data problems is any easier. We've all gone through our own pain of that one problem that you just can't quite figure out. But detecting issues and knowing exactly where they're going wrong is much harder on the data side than it is in SRE.
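To make the column-level idea concrete, here is a hedged sketch of a few checks of the kind Egor describes, expressed as SQL generated from Python; the table and column names are made up, and the regex operator assumes a Postgres-style warehouse:

```python
# Hypothetical column-level reliability checks. Each function returns a SQL
# query whose single numeric result can be compared against an expectation;
# table and column names are illustrative only.

def duplicate_id_count(table: str, id_col: str) -> str:
    """IDs that appear more than once; the expectation is usually zero."""
    return f"""
        SELECT COUNT(*) FROM (
            SELECT {id_col} FROM {table}
            GROUP BY {id_col} HAVING COUNT(*) > 1
        ) dupes
    """

def null_fraction(table: str, col: str) -> str:
    """Fraction of rows where the column is NULL."""
    return f"""
        SELECT AVG(CASE WHEN {col} IS NULL THEN 1.0 ELSE 0.0 END) FROM {table}
    """

def invalid_zip_fraction(table: str, zip_col: str) -> str:
    """Fraction of rows whose value doesn't look like a 5-digit US ZIP code."""
    return f"""
        SELECT AVG(CASE WHEN {zip_col} !~ '^[0-9]{{5}}$' THEN 1.0 ELSE 0.0 END)
        FROM {table}
    """

# These queries would run on a schedule, and their results would be compared
# against per-column expectations, which differ for every dataset.
checks = {
    "users.user_id uniqueness": duplicate_id_count("users", "user_id"),
    "users.email null rate": null_fraction("users", "email"),
    "users.zip validity": invalid_zip_fraction("users", "zip_code"),
}
```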
[00:20:16] Unknown:
To your point of the sort of service level agreements that exist in SRE and some of the ways that it maps into data reliability engineering, there are other concepts too that are present in terms of service level objectives and service level indicators, and I'm wondering how those map to some of the ways that we can think about the reliability of our data platforms and the datasets that we're working with.
[00:20:40] Unknown:
The other two concepts that you just mentioned make up that trifecta of SRE building blocks. You have service level indicators, or SLIs, which are the actual measurements and numbers that you're tracking to measure the performance of your application. You then have SLOs, or service level objectives, which are the agreements that the application owner is trying to meet based on those SLIs that they are collecting. And then finally, you have SLAs, or service level agreements, which are the contract between the application owner and the user of the application.
Now I talked about SLAs in data reliability engineering because data itself is very bimodal, where, as I mentioned before, you have the producers and the consumers. And, fundamentally, every dataset has an owner, and there are then users of that dataset that care about the state of it. So SLAs are probably one of the most important concepts to take into data reliability engineering. If we break down the other concepts, SLIs, or those numbers that we're measuring, are really the output of data observability. We are trying to measure the state of the data using some metrics and some measurements that you can then create SLOs around, which are, to put it bluntly, the threshold beyond which something is incorrect.
Let's take an example here. If I have a dataset that has user information and I have a column for email addresses, I might expect that some users don't have an email address. Maybe users can sign up with a phone number instead. But, usually, there's around 5 to 6% of users that sign up with a phone number. Everyone else signs up with an email address. This means that the email address column would be null around 5 to 6% of the time on average. Now that 5 to 6% of nulls is the SLI. It's the indicator. It's the number that we are measuring the performance of this column on.
Now on top of that, we need to set an objective. What is the SLO for that percent of nulls? Now you might have a business expectation, which is that we would expect no more than 10% of users to sign up with a phone number. You might have an objective to say, we want to cancel phone number sign-ups and only make them available to a limited subset of users, so we want to drive that down, so let's have only 2% of nulls in this column. Or this might just be, we want to maintain the status quo. It should always be around 5 to 6% because that's normal for us. Now you would then take those two concepts, that threshold of, let's keep it around normal, so 5 to 6%, and that SLI, that constant measurement of your nulls, and say, whenever the percent of nulls in my email column goes over 5%, then that is a violation of this SLA, because we promised you, the user of our data, that this data would remain pretty consistent over time.
And so if our email sign-up system breaks and there are all of a sudden only phone subscriptions and 100% nulls in your email column, then you are violating your SLA, and your customers need to know about that. But the only way that you would know that is by measuring that percent of nulls and setting those thresholds and those bounds for what is an acceptable range for that SLA.
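As a rough illustration of that SLI/SLO/SLA flow, not of any particular product's implementation, a sketch using the email-column numbers from the example might look like this:

```python
# A minimal sketch of the SLI -> SLO -> SLA flow from the example above.
# Numbers mirror the email-column scenario; names are illustrative.
from dataclasses import dataclass

@dataclass
class NullRateSLO:
    column: str
    max_null_fraction: float   # the objective, e.g. "stay at or below 6% nulls"

def evaluate(slo: NullRateSLO, observed_null_fraction: float) -> bool:
    """Return True if the SLO is met; False is an SLA-relevant breach that
    should be surfaced to the consumers of this dataset."""
    return observed_null_fraction <= slo.max_null_fraction

email_slo = NullRateSLO(column="users.email", max_null_fraction=0.06)

# SLI: the measured null fraction for today, produced by the observability layer.
todays_null_fraction = 0.05          # ~5% of users signed up with a phone number
print(evaluate(email_slo, todays_null_fraction))   # True: within the objective

# If the email sign-up flow breaks and the column goes to 100% nulls,
# the check fails and the SLA violation should be communicated downstream.
print(evaluate(email_slo, 1.0))                    # False: SLA breach
```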
[00:24:42] Unknown:
Another concept that has been going around in the data ecosystem recently is the idea of DataOps and some of the ways that it does and does not map to the concept of DevOps from the application landscape. And I'm wondering what your thoughts are on some of the differences and distinctions between DRE versus DataOps and whether DRE is just a sort of rebranding of those same concepts, for whatever the purpose of that might be. I don't think that DRE is a rebranding of DataOps.
[00:25:17] Unknown:
I think that DRE is a subset of what DataOps is meant to provide. If you think about what the output of DRE is, it's the knowledge that your data is reliable and is fit for use. DataOps is a much broader notion of how you manage your datasets holistically. How do people know what's available? How do you manage access controls to your datasets and data products? What are the different teams that require access to it? In what formats do they require access? How do you manage the whole ecosystem? That is all part of data operations. Data reliability engineering gives you the necessary inputs to have a great data operations organization, because you can then say, the following datasets haven't been healthy for months, and we should probably deprecate them and people should stop using them. That's the goal of DataOps, to make that decision. But DRE gives you the inputs that you need in order to even make that decision in the first place.
So if you think about it as really layers in a pyramid, you have the foundational data observability layer, which is you need to know what's happening in your data ecosystem at all times. Then building on top of that, you have data reliability engineering where you say, we can take these inputs and actually create the processes and the workflows necessary in order to know whether our data is reliable and fit for use. And then you build on top of DRE, your data operations layer, and say, now that we know what the state of our data is, how do we make these decisions about what data should be used, where it's available, and who's using it for what? In terms of the actual practice of data reliability engineering,
[00:27:22] Unknown:
who is typically responsible for actually building out the practices, identifying the SLOs and SLAs, and actually implementing the technologies and techniques and processes that are necessary to be able to actually make this a part of the organizational fabric? This really depends on the size of the organization that we're talking about.
[00:27:44] Unknown:
So I've worked in companies that are large. I've worked in companies that are small. And here at Bigeye, we also have a wide range of customers, ranging from a one-person data engineering shop to Instacart, which has a 12-person central core data engineering team. And at smaller companies, we usually see DRE being distributed to the teams that are actually using the data. If you have a small central data engineering team, maybe it's one or two people that are managing the warehouse and the infrastructure and the ETL tooling, they typically don't have the time to actually do DRE all on their own, but they will usually provide the tools to the rest of the business in order to enable them to set up their own SLAs and help the whole business define their quality and DRE practices.
At a larger company, you get into a world where there's a large central core data engineering team that is able to say, we will provide DRE tooling to the whole business, but we will also take it upon ourselves to set the SLAs for all of the core datasets and all of the core measurements that we care about. So they will measure the basic concepts of freshness and row counts and distribution of your numerics, and they will track that across all of the fundamental datasets for the business. Now there could be some auxiliary datasets that are used by a separate team, like the growth team or the marketing team. They might have their own datasets that don't fall into the domain of the central data engineering team. And those would then be the responsibility of that individual team.
But this core central data engineering team would be responsible for DRE across all the main fundamental datasets.
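One hypothetical way a central team might express that split of responsibility is a simple declarative config; the dataset names, owners, and thresholds below are illustrative only:

```python
# A hedged sketch of how a central data engineering team might declare the
# baseline checks it owns for core datasets, while leaving team-specific
# datasets to the teams that own them.

CORE_DATASET_CHECKS = {
    "warehouse.orders": {
        "owner": "core-data-eng",
        "freshness_max_hours": 24,        # must update at least daily
        "min_daily_row_count": 10_000,    # expected load volume
        "numeric_distribution": ["order_total"],   # track drift on key numerics
    },
    "warehouse.users": {
        "owner": "core-data-eng",
        "freshness_max_hours": 24,
        "min_daily_row_count": 1_000,
        "numeric_distribution": ["lifetime_value"],
    },
    # Auxiliary datasets like marketing attribution sit outside the central
    # team's domain and are monitored by the teams that produce and use them.
    "warehouse.marketing_attribution": {
        "owner": "growth-team",
        "freshness_max_hours": 6,
        "min_daily_row_count": 500,
        "numeric_distribution": [],
    },
}
```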
[00:29:55] Unknown:
Digging more into some of the technical aspects of how to actually implement the measurement and alerting and correction of these different measurements that we're doing to ensure that the data that is under our control and under our responsibility is accurate and, you know, has sufficient quality based on the commitments that we've made to the downstream consumers. What are some of the tools that are necessary, you know, in terms of broad categories or even specific instances to be able to build out these practices? And what are some of the areas of innovation that teams are focusing on to be able to support those overall reliability practices?
[00:30:39] Unknown:
We can break down the problem of DRE into a couple of core concepts. Obviously, we've talked quite a bit about data observability, just knowing what is actually going on in your system as being the first step. Obviously, there's tools like Bigeye for this. Historically, a lot of teams have been fairly clever around custom built solutions for this, running SQL queries, emitting the results back into a table, into tools such as Datadog. People get very creative with these sorts of tools. Once you know what your data looks like with the data observability, the next step is then setting the expectations and communicating any issues that arise.
Setting the expectations has typically been the hardest part for monitoring and for DRE, because setting the SLOs for your datasets involves a deep understanding of what the business expects out of them. Data producers typically don't have that depth of knowledge. They can tell you the basics: we expect this dataset to come in daily or hourly, and this dataset usually has somewhere around 10,000 records a day. Past that, they don't usually understand what's inside the dataset and how it's being used. And so it's hard to set these expectations around what indicates an issue and what doesn't.
The largest area of innovation here is the ability for either the business to come in and say, this is my expectation of this data. I expect this column to have a specific format, to be in a specific range, to have only certain values in it. Or the second approach, which is much more scalable, is automating the setting of those expectations. If your data looks a certain way historically, it probably should look the same way going forward, especially if you're basing your machine learning models on it, which expect more or less consistency across their features over time.
Bigeye is actually doing a lot of work around understanding how we are able to set these expectations and these thresholds automatically for users based on the historical performance of the data and what it looks like, so that users don't get overwhelmed thinking about what their expectations of the data are. Finally, there's the communication piece, going back to the users and telling them that something is actually going wrong. Most tools provide a way for you to push notifications. Email has been the classic. Slack, obviously.
Some teams use PagerDuty or Opsgenie in order to actually wake somebody up if a pipeline is doing something wrong. I think the next step in the evolution of this notification is going to be something similar to a notification hub. If you look at something like Jira, every issue is a task in Jira, and it's assigned to you. And you can log in to Jira and see everything that's assigned to you, in priority order, and what actually needs to happen. Notifying users about issues in their data is moving in the same direction, where there are often just too many things going on at the same time to know what to respond to. And so you need a central notification hub for data issues that is prioritized and that has clear actions and next steps assigned to it, either by attaching runbooks or by seeing what happened last time, how this issue got resolved previously, and who did that.
That is going to be the next step in making DRE much more manageable within a company. And I haven't seen that yet, but I definitely think that that's the right next step for teams building out their DRE practice.
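A minimal sketch of the "learn the expectation from history" idea mentioned above, with an arbitrary rolling window and a three-sigma band rather than whatever Bigeye actually does, could look like this:

```python
# Auto-setting expectations from history: derive an acceptable range for a
# metric from its recent values rather than asking the business to hand-pick
# thresholds. The window size and 3-sigma band are illustrative choices.
from statistics import mean, stdev

def auto_thresholds(history: list[float], sigmas: float = 3.0) -> tuple[float, float]:
    """Return (lower, upper) bounds based on the historical mean and spread."""
    mu = mean(history)
    sd = stdev(history) if len(history) > 1 else 0.0
    return mu - sigmas * sd, mu + sigmas * sd

# e.g. daily row counts for the last two weeks (made-up numbers)
daily_row_counts = [9800, 10150, 9950, 10300, 10020, 9890, 10110,
                    10240, 9970, 10060, 10180, 9930, 10090, 10010]

low, high = auto_thresholds(daily_row_counts)
todays_count = 4200                       # a suspiciously small load
if not (low <= todays_count <= high):
    print(f"row count {todays_count} is outside the learned range [{low:.0f}, {high:.0f}]")
```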
[00:34:58] Unknown:
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data in motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand lets you identify data quality issues and their root causes from a single dashboard. With Databand, you'll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand today to sign up for a free 30-day trial and to take control of your data quality.
As far as organizations and teams that are looking to implement some of these DRE practices and, you know, maybe they already have some of the technical underpinnings, what are some of the potential roadblocks that they might run into as they're starting to plan out the organizational aspects and working with their customers to understand what their SLAs should be, and just some of the complexities that arise as a team is starting to adopt some of these principles and practices?
[00:36:11] Unknown:
I think complexities is the right word to use when talking about roadblocks to DRE. Because the biggest problem that companies face is that they underestimate the complexity of their data and their organization. As I mentioned before, applications are pretty straightforward. It's easy to measure the basics, and all applications have the same basics. What teams don't understand is that every single piece of data is different and has its own complexity and its own use and its own intricacies that only a few people in an organization might know about. And not only that, if you look at the same dataset, different teams would actually care about different aspects of that dataset.
You might have one team caring about how many users are subscribing to your service week over week, whereas another one cares about how long they take to convert. And that might be represented in the same dataset itself. And so not having that understanding of the complexity usually leads teams to build something that's fairly generic but naive and is only measuring the basics of reliability without actually going in-depth enough to be useful to the rest of the organization. Second, if we're talking about smaller companies, they're usually much more concerned about moving faster as an organization, and doing that by growing the number of datasets that they have available and what kinds of models they are training off of these datasets and what business value they are providing.
This definitely drives the business forward. But if the data itself isn't reliable, then having more data in your warehouse isn't actually going to make the business any better. It's just really dumping more trash into a pile that is already not very useful. And these teams are usually slower to adopt DRE practices because they think that they will slow them down. However, the important thing to understand is that knowing the state of your data and knowing that it's reliable and actually usable will move the business forward faster, because there will be less time spent by people asking about the state of their data, whether it's healthy, and second-guessing themselves when building their reports and their models. And this will drive the business forward faster than just adding more and more datasets, if you take that time to slow down and implement these processes and tools to ensure the reliability of all of your data.
[00:39:06] Unknown:
As teams are adopting some of these DRE practices, what are some of the sort of conceptual gaps that they might be running into if they're not already familiar with SRE principles or they don't have a dedicated SRE team in their organization and just some of the sort of learning barriers and the sort of, like, ramp up process of being able to build out these capacities internally?
[00:39:32] Unknown:
The interesting thing to understand about DRE is that it almost becomes a responsibility of anybody using the data. SRE has turned into its own branch of engineering. You have engineers that are called site reliability engineers. And they don't write software. They just operate and manage the tools and processes in order to ensure the SRE best practices. On the flip side, DRE is everybody's problem. If you're using a dataset, if you're producing a dataset, you should care about its quality. And so I think that forcing a single person or a single group to ensure data reliability across your whole organization is a massive pitfall, and it becomes a lot harder to actually implement a good DRE practice in your organization if you're trying to approach it that way. I think the second pitfall that people run into is not thinking about the consumer of the data.
At the end of the day, measuring reliability is useful only if you are then telling the people that are using the data about what the state of that data is. If you're measuring it and then cleaning it up quietly and then saying, oh, yes, there were those four hours of downtime, but we caught that and we documented it, and we know how to resolve it next time. If you're not telling the users of the data that there was that downtime, if they're not aware that the data was bad for that period of time, then they might be making some business decisions that are incorrect now because they just didn't know about the state of the data. And so the important part about DRE is communication.
And it's not about ego, and it's not about always having the answers and having the cleanest dataset at all times. It's just about being explicitly transparent about what is the state of the world right now, what is usable and what's not, and making sure that the rest of the business knows that as well. Digging more into that communication aspect and being able to surface the information
[00:41:46] Unknown:
to the end users of the data, what are some of the sort of useful protocols that already exist? What are some of the areas of improvement that are necessary in the overall ecosystem of data tooling and particularly for, you know, dashboarding tools or tools that are being used by the data consumers that we still need to develop to allow for broadcasting this information to the people who are using it at the point that they're accessing the data? At the point that they're accessing the data is the most important part of this communication.
[00:42:20] Unknown:
This is the biggest problem with sending emails and Slack messages. They get ignored, and they're not relevant at that point in time. I might see a Slack message two hours later that says, hey, by the way, that users table was broken. I might say, great, well, I already finished all my work on it, so now I have to go redo all my work again. The improvements that I would want to see in this communication space are putting this information where the end users are accessing the data itself, whether that be a data catalog, if somebody's browsing through the catalog trying to find what report they should use. In the catalog, you should be able to surface this information and say, you shouldn't use this report right now because there's an underlying table that has a quality issue. Or even better, if you have ad hoc querying, which most teams do, and you're starting to write your query, in an ideal scenario, what I would want is my IDE to just tell me that the query that I'm writing right now is not going to execute correctly because the table that I'm referencing is having an issue right now. And so I think there are a lot of interesting ways that we can surface reliability information at the point of consumption that just haven't really been invented yet. Data catalogs are getting there, and integrating tools such as Bigeye with your data catalog is important in order to surface this information as close as possible to the user. But I think there's a lot of interesting innovation that can happen around pushing it even closer to the user, surfacing this in the query editor or in the dashboard somehow, to say, here's a big stop sign on this panel in this dashboard because the underlying data is bad. You shouldn't even be looking at this. Otherwise, you're gonna make some bad decisions.
That is really the next step for communicating
[00:44:21] Unknown:
to the end user. To that end, there's been a lot of work recently in the overall space of lineage tracking and being able to automate the generation of lineage and understanding, at the various points, that a pipeline stage failed and that that contributes to the downstream lineage, and being able to propagate that information both upstream and downstream across the overall lineage graph. And I'm wondering if there are any recent developments that you've seen in that space that have been focusing on being able to propagate that information to the data catalogs, to the BI dashboard, to the data warehouse consumers and the queries that are being executed there, and some of the additional work that's necessary. I know that OpenLineage is one of the efforts underway there, and I know that there are other approaches that are being taken.
[00:45:12] Unknown:
I see lineage as a building block for all of this functionality to exist. But lineage alone doesn't solve this problem. Lineage is necessary to understand what the whole data landscape looks like, what datasets depend on which other ones, what data products depend on which other datasets. Once you have that lineage graph, then it becomes easy to say, if dataset A fails, I know everything that's downstream of dataset A, and I can start notifying there. The problem with using lineage alone is that you're not gonna have your users staring at this lineage graph 24/7 to understand what is going wrong with their data and whether or not they can use a table. Users are just gonna go straight to the tool that they want to use, whether that's the catalog or the BI tool. And using lineage in order to propagate these data quality issues and convey the information about the reliability of data products is then the onus of the downstream tool. Let's talk about data catalogs and BI tools for a second.
If you have a data catalog that is already capturing your lineage, that data catalog can use that lineage information in order to surface reliability information downstream of wherever any issue has happened, on the catalog page itself. If you look at a BI tool, the BI tool can be doing the same thing. If the BI tool already has the reliability information and the lineage information, it can combine the two and say, I know that there is a problem on my users table, and I know that this dashboard uses the users table, so that means this dashboard has an issue.
Knowing the lineage graph alone and annotating the lineage graph alone isn't going to help you. This information needs to be combined and presented at the final usage point for the data consumer, whether that is a catalog, a BI tool, ML modeling platform, whatever it is.
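To illustrate the mechanics, here is a small hypothetical sketch of walking a lineage graph to find every downstream asset that a catalog or BI tool could badge when an upstream dataset has an issue; the graph and dataset names are made up:

```python
# Use a lineage graph to flag everything downstream of a broken dataset, so a
# catalog or BI tool can warn consumers at the point of use.
from collections import deque

# dataset -> datasets/dashboards that read directly from it
LINEAGE = {
    "raw.users": ["warehouse.users"],
    "warehouse.users": ["dashboards.signups", "ml.churn_features"],
    "warehouse.orders": ["dashboards.revenue"],
}

def downstream_of(dataset: str, lineage: dict[str, list[str]]) -> set[str]:
    """Breadth-first walk to find every asset affected by an upstream issue."""
    affected, queue = set(), deque([dataset])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

# If raw.users has a quality issue, a catalog or BI tool could badge these:
print(downstream_of("raw.users", LINEAGE))
# contains: warehouse.users, dashboards.signups, ml.churn_features
```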
[00:47:26] Unknown:
And so in terms of your experience of building Bigeye and working with your customers and just speaking with other practitioners in the space and from your own history as a data engineer, what are some of the most interesting or innovative or unexpected approaches and solutions that you have seen to implementing DRE strategies and technologies?
[00:47:49] Unknown:
Usually, a lot of the innovative solutions come out of necessity and depend on the actual ecosystem of the organization that's implementing them. If I look back at my time at Uber, a lot of the tooling that we built took into account how our data team was structured, how the organization was using the data, what infrastructure were we using. We were big Presto users, and so partitions were a big thing. And so a lot of our notifications and tooling was built around the notion of having daily partitions being updated. I think a lot of the interesting innovation that comes out of companies is interesting because it's so hyper specific and solves a very, very niche problem because that would be the problem for that business specifically.
At Bigeye, we are building a platform that enables any user at any size company to start doing data observability and building out their DRE practice within the organization. But we're doing that by providing a platform that anybody can leverage regardless of what their infrastructure looks like and regardless of what their data looks like, and then start building on top of it, rather than being extremely innovative for a very
[00:49:14] Unknown:
hyper specific problem that somebody might be facing. In your own experience of working in the space and building a product that serves as a component in this overall ecosystem of data reliability engineering, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:49:31] Unknown:
The most interesting thing to me is that everybody starts by thinking about how to prevent bad data from getting into their systems. Usually, the first question that we get is, how do we implement circuit breakers? If you look at the Intuit post, Intuit talks about circuit breakers, where when data goes bad, they stop the whole pipeline so no bad data propagates downstream. I think the interesting thing about that is that nobody really thinks about the processes or the resolution after the fact. They want to stop bad data from getting in, but what if that issue wasn't actually critical?
What if the process should just continue? That is just a little bit weird, and we can work around it, or we should just notify somebody. Maybe there's too many test events, and we don't expect that many test events, but we ran a large load test on our system. And we're already filtering out all the test events. Should we stop the pipeline? Probably not. But we should let the users downstream know that this is actually happening. And so it's important to think about the process that will allow you to know that something's going wrong and the steps that you can take towards resolution rather than simply saying, no. Stop the world right now because this data is bad. And for teams that are considering implementing data reliability
[00:51:02] Unknown:
as a technical and organizational strategy, what are some of the cases where that might be the wrong choice and they might be better suited just going with a sort of less sophisticated approach or just using the existing mitigation strategies that they have for observing and addressing data quality problems?
[00:51:19] Unknown:
There is an argument to be made that data is just vastly different from software, and we can't take the same principles of SRE and DevOps and just translate them one to one. And I would agree with that, but I think it's important to use the principles of SRE as a blueprint. They're a great foundation for something that has worked for decades now for software. And we, as data users and data people, should use those existing, well-known, well-thought-out principles as a foundation for how to have reliable systems.
Now we don't always need the whole shebang. You don't need a full-on workflow that is automating all the resolution and performing actions in your system and, going back to circuit breakers, doing circuit breakers everywhere. But taking the fundamental principles and applying them as you see fit in your organization lets you take those initial first steps. Start monitoring your data. Start building out the observability systems. Create these contracts with your stakeholders. Create these SLAs. As you start building out these best practices and start implementing some of them, you might realize that there's a point in your organization where you've done enough, and you've solved the 99% problem, and everything else is a one-off use case that you can resolve ad hoc.
But you need to start using these principles and implementing them in order to have any sort of reliability in your organization.
[00:53:11] Unknown:
As you continue to build out the Bigeye product and look to the near to medium term future of its capabilities and the ways that you interact with your customers and market it to newcomers, what are some of the things that you have planned, particularly as it pertains to the space of data reliability engineering?
[00:53:29] Unknown:
I'm glad you asked. We're actively scaling out our platform. We actually just raised $45,000,000 in a Series B, and a lot of that is going to go towards building out more functionality around data reliability engineering within the platform. We already have the basics of observability, and we want to start creating the rest of that stack of tooling that's necessary to have a great DRE practice within an organization, including SLAs and incident management and runbooks. DRE is also about having all of these tools work together, and something else that we're looking into is broadening our integrations so that DRE is more accessible to more organizations.
And even if you already have tools in place that solve parts of the DRE problem, you can still use Bigeye with all of your existing tooling, and it'll work flawlessly.
[00:54:34] Unknown:
Are there any other aspects of the data reliability engineering ecosystem and practices and principles and some of the sort of technological and social requirements that go into it that we didn't discuss yet that you'd like to cover before we close out the show? I think the most important thing to
[00:54:51] Unknown:
take away from this is that the tooling can only get you so far, and DRE really is about a mindset and about having the best practices in your organization. And starting early and being an advocate for the reliability of your data and measuring your data quality and actually communicating it to the broader organization. That can start at any point in a company's life cycle. And the sooner that companies get there and data teams start thinking about reliability,
[00:55:19] Unknown:
the better off they will be in the long term. Well, for anybody who wants to follow along with you and get in touch and keep up to date with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think that
[00:55:38] Unknown:
data infrastructure is extremely mature now. As I mentioned before, between Snowflake and BigQuery and Redshift on the warehouse side and Fivetran on the ETL side, it's very easy to get data infrastructure set up. And I think the biggest gap is really the whole DataOps landscape. And I'm not just talking about DRE, but I really am talking about that broader landscape of how do you know what data you have, who's accessing it, when, for what. All of this is difficult to measure today, and I think there's gonna be a lot of innovation in the tooling around DataOps. And to piggyback on that, I think there are a lot of really interesting projects out there that are sometimes solving multiple different problems at the same time.
And the best way to address a lot of these issues is actually with best-of-class tools that are really easy to set up and use and that integrate well together, rather than having one giant monolith that attempts to solve everything but doesn't do any one thing exceptionally well.
[00:56:49] Unknown:
Alright. Well, thank you again for taking the time today to join me and share the work that you're doing and sharing your perspective on the overall data reliability engineering space and some of the ways that it is emerging and growing. It's definitely very exciting to see some of these practices and principles start to spill over into the data ecosystem. So I definitely look forward to seeing more of that, and I appreciate your contributions to the space. And I hope you enjoy the rest of your day. Thanks for having me on, Tobias. It was a pleasure. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Welcome
Interview with Egor Gryaznov: Data Reliability Engineering
Defining Data Reliability Engineering
Technical and Organizational Aspects of Data Reliability
Parallels and Differences Between SRE and DRE
SLIs, SLOs, and SLAs in Data Reliability
DataOps vs. Data Reliability Engineering
Implementing DRE Practices
Tools and Innovations in DRE
Roadblocks and Complexities in DRE Adoption
Conceptual Gaps and Learning Barriers
Lineage Tracking and Data Reliability
Innovative Approaches to DRE
Lessons Learned in Building Bigeye
When DRE Might Not Be the Right Choice
Future Plans for Bigeye
Final Thoughts on DRE
Closing Remarks and Contact Information