Summary
One of the oldest aphorisms about data is "garbage in, garbage out", which is why the current boom in data quality solutions is no surprise. With the growth in projects, platforms, and services that aim to help you establish and maintain control of the health and reliability of your data pipelines, it can be overwhelming to stay up to date with how they all compare. In this episode Egor Gryaznov, CTO of Bigeye, joins the show to explore the landscape of data quality companies, the general strategies that they are using, and what problems they solve. He also shares how his own product is designed and the challenges that are involved in building a system to help data engineers manage the complexity of a data platform. If you are wondering how to get better control of your own pipelines and the traps to avoid, then this episode is definitely worth a listen.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Your host is Tobias Macey and today I’m interviewing Egor Gryaznov about the state of the industry for data quality management and what he is building at Bigeye.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by sharing your views on what attributes you consider when defining data quality?
- You use the term "data semantics" – can you elaborate on what that means?
- What are the driving factors that contribute to the presence or lack of data quality in an organization or data platform?
- Why do you think now is the right time to focus on data quality as an industry?
- What are you building at Bigeye and how did it get started?
- How does Bigeye help teams understand and manage their data quality?
- What is the difference between existing data quality approaches and data observability?
- What do you see as the tradeoffs for the approach that you are taking at Bigeye?
- What are the most common data quality issues that you’ve seen and what are some more interesting ones that you wouldn’t expect?
- Where do you see Bigeye fitting into the data management landscape? What are alternatives to Bigeye?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Bigeye being used?
- What are some of the most interesting homegrown approaches that you have seen?
- What have you found to be the most interesting, unexpected, or challenging lessons that you have learned while building the Bigeye platform and business?
- What are the biggest trends you’re following in data quality management?
- When is Bigeye the wrong choice?
- What do you see in store for the future of Bigeye?
Contact Info
- You can email Egor about anything data-related
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Bigeye
- Uber
- A/B Testing
- Hadoop
- MapReduce
- Apache Impala
- One Kings Lane
- Vertica
- Mode
- Tableau
- Jupyter Notebooks
- Redshift
- Snowflake
- PyTorch
- Tensorflow
- DataOps
- DevOps
- Data Catalog
- DBT
- SRE Handbook
- Article About How Uber Applied SRE Principles to Data
- SLA == Service Level Agreement
- SLO == Service Level Objective
- Dagster
- Delta Lake
- Great Expectations
- Amundsen
- Alation
- Collibra
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy analytics in the cloud. Their comprehensive data level security, auditing, and deidentification features eliminate the need for time consuming manual processes, and their focus on data and compliance team collaboration empowers you to deliver quick and valuable analytics on the most sensitive data to unlock the full potential of your cloud data platforms.
Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta. That's I-M-M-U-T-A. Your host is Tobias Macey. And today, I'm interviewing Egor Gryaznov about the state of the industry for data quality management and what he is building at Bigeye. So, Egor, can you start by introducing yourself?
[00:01:56] Unknown:
Sure. Thanks for having me, Tobias. It's a big honor to be here. This is my favorite show to listen to during my morning runs. So my name is Egor, and I am the cofounder and CTO of Bigeye. We help data engineering and analytics teams monitor the quality of their data at scale. Before Bigeye, I spent 4 and a half years at Uber working on pretty much all things data, from data warehousing infrastructure to building an in-house A/B testing and analytics product and really everything in between.
[00:02:27] Unknown:
And do you remember how you got involved in data management?
[00:02:30] Unknown:
Yeah. So the first time I started working in data was back in 2012, when Hadoop was just getting started. Most of my job consisted of writing MapReduce jobs, making that work, and just having that process be less painful. I remember when Impala was released, and that was a huge deal because it was the first major SQL-on-Hadoop project, and it would make working with Hadoop a lot easier. In terms of data warehousing, I actually got into that at a company called One Kings Lane, where I worked on setting up their data warehouse as well as all of the ETL tooling involved. That's where I really learned about the wide range of use cases that data users have, because there are so many different cases from marketing to web analytics to product analytics that all had to be answered and resolved using data in the warehouse.
And how difficult it really is to build a platform and a system that is generic enough to address all of those. Taking all of those learnings, I joined Uber back in 2014 as one of the first dedicated data engineers. My team was tasked with making the data warehouse work. Uber at that time was migrating from a Postgres replica to Vertica as their primary data warehouse, and we were maintaining and building all of the tooling around that to help scale with the company's growth. While at Uber, I got to explore really the whole data landscape, not just from the eyes of somebody building pipelines and setting up infrastructure, but also someone building products on top of the data. Sometimes that was data I didn't control, such as with the experimentation platform, where I built an analytics tool, but the data that we were using to show the metrics was often produced by the internal mobile teams or other teams within the company. And I didn't necessarily know where it was coming from or how it was being generated.
[00:04:33] Unknown:
In terms of the overall attributes of data quality, can you start by giving your views on what are the attributes that you consider when defining what data quality means, and some of the impacts that either high or low data quality can have on downstream uses of that information?
[00:04:51] Unknown:
Data quality, at the end of the day, is being able to vouch for the correctness of your data. And more so, it's being able to use the data in a meaningful way for the business. It's how fit the data is to be used. So as somebody who owns a data product, I want to make sure that my users trust the data that I am showing them in order to make the decisions that they need to make. This applies to both sides of the aisle. If you look at the data landscape, you have, on one hand, your data producers, who are your data engineers or other engineers within the company. They want to provide data for analysts and data scientists to work with, and they want those analysts and data scientists to trust their processes.
On the other side, you have data consumers who are these analysts, data scientists who want to build products, build reports, build dashboards, machine learning models that the business can use. And they want the business users to trust the results and trust that they are making the best possible decision with that data. In terms of attributes of measuring this, you have your common ones such as latency and schema. And over the years, you've had a lot of companies and surveys try to define some number of measures of data quality. But at the end of the day, I feel like if you don't have trust in the data from the users of it, then that data is not high quality, and it won't actually be used by the business.
And a large part of what users are looking for is understanding the semantics of the data.
[00:06:50] Unknown:
And in terms of the semantics, I'm wondering if you can discuss a bit more about what that means and some of the elements of the ways that the data is being used that translate into those semantics and how that might differ based on the industry or the specifics of the organization or the goals of how that data is going to be used?
[00:07:14] Unknown:
So data types, by default, are very generic, and they don't always convey the meaning of the data itself. So when you look at the schema of a table in a database, for example, you can see that it's either a string or it's a number. But you don't really understand how that will be used or what that number or that string even means. The semantics is about imbuing the data with information about how it will be used and what it actually means to the user. A couple of good examples of this are strings in a database. The string type is generic. It'll accept any number of characters, and I'm sure everyone has had cases when some mysterious characters or mysterious strings appeared in their dataset before.
But without understanding how you're going to use that column later, you won't really know what should be in there and whether or not that field is of high quality. For example, if something should always be a stock ticker, an email, or an internal identifier, that is the semantic of the column, but the actual representation of it can't really convey that. The same actually goes for numeric columns: if you have a summary table and you have a column that represents a count, it's an integer column in the database, but you know that the column can't be negative, because it's a count of something. You can never have a negative count of something.
And this is actually an interesting story: there was some COVID data that we were looking at and testing our product on, and the summary table had negative counts. Now, we found out that these were corrections in the data, but because we identified that the semantic was that this should be a count of something, we could quickly understand that by having a negative value in that field, something looked wrong, and we should investigate further.
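To make that idea concrete, here is a minimal sketch of what a semantic check like "a count column is never negative" could look like when run from Python. The table and column names (covid_daily_summary, case_count) are hypothetical, and this is only an illustration of the concept, not Bigeye's implementation.

```python
import sqlite3


def check_non_negative(conn, table: str, column: str) -> int:
    """Return the number of rows that violate the 'a count is never negative' semantic."""
    query = f"SELECT COUNT(*) FROM {table} WHERE {column} < 0"
    violations = conn.execute(query).fetchone()[0]
    if violations > 0:
        print(f"{table}.{column}: {violations} rows have a negative count -- investigate")
    return violations


# Self-contained demo with SQLite; any DB-API connection to a warehouse works the same way.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid_daily_summary (report_date TEXT, case_count INTEGER)")
conn.executemany(
    "INSERT INTO covid_daily_summary VALUES (?, ?)",
    [("2020-11-01", 120), ("2020-11-02", -4)],  # -4 is a correction row, like in the story above
)
check_non_negative(conn, "covid_daily_summary", "case_count")  # flags 1 violation
```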
[00:09:18] Unknown:
In terms of the quality of the data and the effectiveness of the semantics, what are some of the driving factors that contribute to the presence or the lack of quality in an organization or a data platform, whether that's technological or structural or based on the maturity of the overall capabilities of that organization for being able to work with data?
[00:09:43] Unknown:
I think there are a lot of factors here. And I feel that a lot of the factors aren't necessarily technical in nature; they are more organizational and more about the mindset of how you approach the data and how you approach data quality. We recently revisited the SRE book and what quality means in software, to draw the corollaries to the data world. They talk a lot about embracing risk, knowing that things will go wrong and that something could fail, and then acknowledging that and planning for those inevitable failures.
And in the data world, sometimes you just set up a pipeline and you say, great, the data's flowing, everything's going to be great. And that lack of understanding that something could go wrong in the future could lead to problems that you didn't expect. There's also a whole notion of being able to measure the quality. In software and SRE, this is pretty straightforward: if a server is up, then it's up; if the application is responding and the latency is below a certain value, then that's good. In data, it's a little bit trickier. You need to be able to define what you want to measure, how you want to measure it, and then constantly measure and monitor it for any changes.
And having that ability to define what you expect from your data and monitor it and then expose that information in a meaningful way that somebody can take action off of it is really what's going to drive the adoption of data quality within your organization. And we honestly think that the biggest barrier that most teams face to getting started and getting data quality off the ground is not being able to measure the quality in the first place.
[00:12:00] Unknown:
The overall, if you want to call it, revolution or evolution of the use of data in larger quantities and for an increasing number of purposes has been going on for the better part of the last one or two decades. But I've noticed that this year in particular, there has been an upswing in the number of tools and products and companies that are focused on data quality as a primary concern of an overall data product or data platform, and something that is being viewed as a critical component now where it may not have been before. So I'm wondering what you see as some of the drivers in the industry or in the availability of technologies that make this the particular time when the focus on data quality is coming to the fore.
[00:12:50] Unknown:
I think that the revolution in data over the last decade or so, as you mentioned, has been this huge driving factor for new technologies in the infrastructure and end user landscape. Before, when you had data, you had a bunch of flat files and you maybe loaded those into a database, and you could run queries on them. But this isn't very friendly and not very accessible to users. If you fast forward to the last 7 years, what you see is an easy way to set up the actual infrastructure. It's easy to get a data warehouse going, whether it's Redshift or Snowflake or BigQuery. You put in a credit card, you get a database. And it's fairly straightforward now to get data into it with tools like Fivetran and Segment and others that are pushing the data into one place.
Now that the data is in one place and easily accessible, it seems like tools moved directly to the use of data. So within the last 5 or 6 years, you had an explosion of BI tools and machine learning tools that you can run directly on your warehouse. You have Mode and Tableau and Jupyter Notebooks for exploration, data visualization, building reports, and getting useful insights out of the data. And on the machine learning side, you have tools like PyTorch and TensorFlow, where it's pretty easy to stand up a machine learning platform on top of the data that is now in your easy-to-set-up warehouse, whether that's Redshift, Snowflake, or whatever else.
The part that is missing in that, we can call it, hierarchy is the middle layer, which is what has really been called the DataOps landscape recently, partially as a mirror to the DevOps landscape that evolved about 15 or 20 years ago. Once you have easy data access and tools that can provide that access and expose it to more and more users, the difficult part is managing the understanding and the expectations of that data. It's hard when somebody looks at a dashboard or a report and says, something here looks wrong because I have a gut feel for the business, but I don't really understand how the data is stored or where it's coming from, so I don't have a good understanding of why this dashboard looks wrong. And today, there's no easy way to debug this other than pushing it down to the data engineer or the analyst and saying, go investigate this. And now these issues take hours of digging through SQL and pipelines to figure out what's actually going on.
This DataOps tool chain is meant to enable data users to have an easier time working with their data, whether it's understanding where it's coming from, where it's located, how to use it, or what to expect of it. And so that's why you're seeing a lot of these companies now pop up that are focused around data quality. Data cataloging is another huge space in the DataOps landscape today. And that's why this area is growing so quickly.
[00:16:23] Unknown:
Yeah. I think that your point about the availability of the tools is definitely one of the big ones. In the past 10 or 15 years, the organizations that were working with data at any sort of scale were on the leading edge: they had the staff on hand to handle all the complexities of the infrastructure and the tooling around it, and they had the sophistication to build these products, but there wasn't as much widespread education on how to use the data. And so it wasn't being as widely accessed within an organization and instead was relegated to the specialists who were dealing with the data for their particular purposes. But now we have all of these self-service platforms, and data has become more ubiquitous and has come to be a distinguishing factor in the overall success of an organization.
There are more users within the business who need to be able to access the data, and so it brings in these data education requirements, and it brings in these requirements of how to convey trust to the user in understanding how and why their data is accurate or not accurate and what that means for their own use cases for that information.
[00:17:44] Unknown:
Yeah. Definitely. And I think it's exciting to see businesses become more data driven as much of a buzzword as that is. I think using data to make business decisions is the right way to go. But you can't make good decisions unless you understand the data that is going into those decisions.
[00:18:06] Unknown:
And so as you mentioned, there are a number of businesses that have been starting up to try to provide access to information that elevates the trustworthiness of this data and brings data quality into the core workflow of producing these data assets. And one of those companies is Bigeye, which is what you're building. I'm wondering if you can give a bit of an overview of your approach to this problem and how the business got started.
[00:18:35] Unknown:
Bigeye is building an automated data monitoring platform that alerts you when your data changes and helps provide a clearer picture of what's going on in your data landscape, in order to help debug any issues quickly and understand when there are problems. The story behind Bigeye is that my cofounder, Kyle, and I met at Uber working on the experimentation analytics tool that I mentioned earlier. We ran into a lot of the problems that you would normally see in a full stack data team, where Kyle was the data scientist who wanted the results and wanted to build some dashboards, and I was the data engineer who was putting together the pipelines and getting the machinery working.
And when something would go wrong in Kyle's dashboard, the product manager would go to Kyle and say, something's wrong with the dashboard, this metric shouldn't have moved. Kyle then says, well, my statistics are alright, let me go to Egor and ask him what's actually happening in the data. And now you have this multi-day round trip of what's going on and why the dashboard looks the way that it does. That inspired us to build a lot of tooling to monitor what is going on in the pipeline, quickly expose that, quickly show which metrics are moving, and then try to figure out why before somebody comes and asks us about it.
Kyle actually went off and became the product manager for an internal team that worked on the sorts of tools that you would consider DataOps today. At Uber, there was a data quality system called Trust and a data catalog called Databook; there are some blog posts about those. Those tools were an extension of what we had felt working together on the experimentation team, applied to the rest of the company. We want to make sure that users within the company understand the quality of their data. They can go to the catalog and understand where that data is coming from, what it looks like, and how other people use it.
And that would unlock velocity within the data teams, because people could very quickly find out about the data that they're using rather than relying on tribal knowledge, word-of-mouth, or just manually looking at a table and the data that's located there. Learning from that experience, we decided to tackle the problem that we faced ourselves, but we wanted to do so in a more generic and scalable manner. What we learned was that users want an easy way to set this up and an easy way to monitor it, and they don't want to go through the repeated headache of doing this multiple times for the same tables. And so we tried to take those learnings and apply them to the product that we're building now at Bigeye.
[00:22:02] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours or days. DataFold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. DataFold integrates with all major data warehouses as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of DataFold. Once you sign up and create an alert in DataFold for your company data, they'll send you a cool water flask. There are a number of other ways that folks are addressing data quality problems, where they might be using the data warehouse as the focal point for understanding whether content being landed in the warehouse meets a certain set of criteria, or they might be embedding it very early in the pipeline, where it lives with the data integration tool chain for handling schema compliance for inbound events or data as it's being extracted from sources, or it might be in the business intelligence layer, where you have things like Looker with their data modeling tools for ensuring that the information conforms to the model that's being fed into a dashboard.
I'm wondering if you can just give your overall assessment of the landscape of data quality tools and the approaches being taken and what you see as the trade offs for how you're managing it at Bigeye?
[00:24:02] Unknown:
There are a couple of approaches that we see commonly in the industry today. If you look back, data quality tools historically focused more on data cleansing and data cleanup than on monitoring of the data. You would get data that comes in, and you know that a field is supposed to be not null, and there are 20 null fields, so let's drop them all before we insert them. And this works, but then you don't know what you're dropping; you don't know what you're actually missing. So what happened after that is that data quality checks were pushed down to the data processing layer and the ingestion layer, where you could define them in tools like DBT. They have a testing framework, which is great, but this still requires you to write manual checks in SQL, and you have to write them for every single pipeline. And as a data engineer, I know I'm personally sometimes too lazy to write a check, and I say I'll come back to it. And then I never come back to it, and it becomes stale or is never monitored at all.
The more common approach today seems to be rule based, where you have a dataset and you can define the rules that you expect to apply to that data, check those rules for consistency, and make sure that your data is correct. At Bigeye, what we're doing is taking that to the next step, inspired by a lot of work in the APM world, such as Datadog and New Relic, where they collect metrics about applications and about servers and can monitor those metrics for the health of said application. Bigeye collects metrics about your data and then lets you define which metrics you want to monitor to define the quality of your data. For example, if something should never be null, then you can collect a "how often is this field null" metric, and if that's ever greater than 0, then send me an alert.
The trade-offs here really boil down to: do you want continuous monitoring, or do you want to check it at ingest? Having continuous monitoring means that there's a slightly increased load on the warehouse, and there's this extra step involved where the data is already loaded and now you're checking it after the fact. That being said, this takes the load off of the ingest layer and is more in line with what ELT proposes, which is take the data, load it somewhere, and then transform it and do all your operations on top of it. In this world, Bigeye is already living in the place that you're loading the data into, so consistently monitoring that load makes much more sense than monitoring it before you ingest it.
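As a rough illustration of the metric-style check described above, here is a small sketch of a "how often is this field null" metric with an alert threshold. The table and column names are hypothetical, `conn` is assumed to be any connection object exposing an `execute()` method (as in sqlite3), and this is not Bigeye's actual implementation.

```python
from datetime import datetime, timezone


def null_fraction(conn, table: str, column: str) -> float:
    """Collect a 'how often is this field null' metric for one column in the warehouse."""
    row = conn.execute(
        f"SELECT AVG(CASE WHEN {column} IS NULL THEN 1.0 ELSE 0.0 END) FROM {table}"
    ).fetchone()
    return row[0] or 0.0


def monitor_nulls(conn, table: str, column: str, threshold: float = 0.0, alert=print):
    """Run on a schedule: record the metric and alert whenever it crosses the threshold."""
    value = null_fraction(conn, table, column)
    point = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "metric": f"{table}.{column}.null_fraction",
        "value": value,
    }
    if value > threshold:
        alert(f"ALERT {point['metric']} = {value:.4f} exceeds threshold {threshold}")
    return point
```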
[00:27:23] Unknown:
As far as the data quality issues that you've seen, I'm wondering what you have found to be some of the common causes of it or common ways that it manifests and some elements of the pipeline as far as gathering the information or structuring the information that engineers should be keeping an eye out for in order to prevent downstream problems?
[00:27:50] Unknown:
The most common quality issues are data latency or freshness, row counts (a table doesn't load, is empty, or has fewer records than you expect, or in the odd case has more records because you joined to a table that has duplicates), and null and empty field checks, or completeness. These are common checks that many systems enable out of the box. They make sense for baseline coverage of your data, so you know that at least some data is there on time and it has more than 0 records. The more interesting quality issues that we've seen were all semantic-based issues. Going back to that notion of semantics, you expect a field to contain only phone numbers.
And all of a sudden, you see ZIP codes in your phone number field, because the UI that you're entering the data through changed the field order, and now your data load is loading in a different order. These are more deceptive issues that are harder to find with those common checks, and you really have to understand what the data means and how it is going to be used. Another interesting issue that we've seen is a column that was supposed to represent dollar amounts, except it was always being rounded to the nearest dollar, which obviously threw off reporting and made for very suspicious numbers.
You can't really catch these sorts of things in the ETL, because this is really where the bug happens. If you have a bug in your ETL, you're not going to catch it until after the data is being loaded. But monitoring for this after the fact is really important. When it comes to building more resilient pipelines so that things like this don't happen, it's important to understand how this data is going to be used and what it is going to be used for. For straightforward pass-through or easy transformations, it's easy to encode that in the pipeline. But for anything that requires some sort of business logic or business knowledge, it's good to clarify that upfront and then encode it into your pipeline. If you see a phone number field that isn't exactly 10 characters long or doesn't have a +1 in front of it, then maybe you should throw some sort of warning in the logs and monitor that to make sure that you're processing the data that you expect and you're outputting the data that your user would expect.
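Here is a minimal sketch of the kind of "warn in the logs" check described above, applied to the phone-number example. The field names, the record layout, and the "+1 followed by ten digits" expectation are all illustrative assumptions, not a prescribed format.

```python
import logging
import re

logger = logging.getLogger("pipeline.quality")

# Hypothetical semantic expectation: US phone numbers stored as "+1" followed by ten digits.
PHONE_PATTERN = re.compile(r"^\+1\d{10}$")


def validate_phone(value, row_id=None) -> bool:
    """Warn (rather than fail) when a field doesn't match the semantics we expect of it."""
    if value is None or PHONE_PATTERN.match(value):
        return True
    logger.warning("row %s: phone_number %r does not look like a +1 phone number", row_id, value)
    return False


records = [
    {"id": 1, "phone_number": "+15551234567"},
    {"id": 2, "phone_number": "94103"},  # a ZIP code slipped into the phone field
]
suspect_rows = [r for r in records if not validate_phone(r["phone_number"], r["id"])]
```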
[00:30:38] Unknown:
And so in terms of the responsibility and priority of integrating data quality into the overall life cycle of data, I'm wondering what you have seen as being the breakdown of who owns that aspect and what the drivers are for ensuring that it is included in the definition of done for any aspect of building data pipelines or producing analyses or data products?
[00:31:05] Unknown:
At the end of the day, the person who cares about the quality of the data is going to be the person using the data. Most of the time, the responsibility of defining and managing data quality falls on the data consumer. Because they are the ones who will be building the dashboard, and they are the ones who will be the first line of defense when an executive comes and says this dashboard is wrong. What's going on with it? The problem here is that a lot of data consumers do not have the right sort of access and tooling that would allow them to push down this data quality knowledge to a layer of the producers or to the data itself.
And if you go back to the data producer, the data engineer building the pipelines, their incentives are typically to get the pipeline done, move on to the next one, and only come back to it if somebody says that it's broken or if they need to change the logic of the pipeline. It's important to have a middle ground where both sides can communicate about their expectations of the data and set this contract. If you look to the SRE handbook, they have a notion of SLAs and SLOs that talk about setting a contract with your application.
The application is expected to perform in a certain way. It's important to get both the data consumer and the data producer in the same room and get them on the same page about what the contract of this data is, what this data should look like, how it will be used, and how we define when this dataset is of high quality. The follow-up to that then becomes: how do you encode these data quality rules? This should also be in a centralized system that is accessible by both sides, so the data producers can know whether the data they're producing is of high quality, and the data consumers can see whether the data they're accessing is of high quality, and if not, what is going on and whether someone is responding to it.
So really the burden should fall on both sides, but in a way that is centralized, where everybody can get on the same page very quickly and be able to react to things as they come up, rather than playing a game of telephone where the data consumer has to go to the data producer, and the data producer has to go back to the consumer, back and forth, until they agree whether or not something is going on and what is happening.
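To illustrate what such a shared, machine-readable contract could look like, here is a minimal sketch. The structure, field names, and thresholds are assumptions for the sake of the example; the point is only that both producers and consumers read and check against the same definition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataContract:
    """A single shared definition that both the data producer and the data consumer
    can read and check against. Fields and thresholds here are illustrative only."""
    table: str
    freshness_hours: float                 # data must land within this many hours (an SLO-style target)
    min_row_count: int                     # a daily load below this row count is suspicious
    non_null_columns: List[str] = field(default_factory=list)


# Example contract the two sides might agree on for a hypothetical orders table.
orders_contract = DataContract(
    table="analytics.orders_daily",
    freshness_hours=6,
    min_row_count=1_000,
    non_null_columns=["order_id", "customer_id", "order_total"],
)
```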
[00:33:59] Unknown:
Another aspect to the overall question of data quality and the health of pipelines and the visibility into the trustworthiness of the data is the support that's built into the tooling that's being used at each of the different stages of the life cycle. And there has been an increase in the focus on that as a primary concern and a design element for the systems that are being used to collect and process information, notably things like Great Expectations as far as a code-first approach, or with Dagster being able to embed expectations into the different sections of compute to make sure that you can see what the assumptions are around data and whether or not they're being violated. And then also with things like Delta Lake, with the similar approach of expectations being built into the table definitions.
I'm wondering what you have seen as far as some of the most noteworthy ways of surfacing the concern of data quality within these different tools and within the user experience of data engineers and data analysts? And what are some of the cases where the tooling, in your view, needs to be improved or where these design considerations need to be built into the tools and, you know, either refactored or revisited with the existing infrastructure and the overall approach of the industry at large?
[00:35:31] Unknown:
The tooling breaks down into 2 camps. You have the tooling that allows you to set these definitions and expectations in the pipeline and in the code. You mentioned Great Expectations; I would put DBT in a similar place, where the expectation lives with the pipeline. The difficult part about having expectations live in code is that changing them becomes hard, and it also becomes tedious to get everyone on the same page, because often the consumers of the data, your analysts and your data scientists, aren't going to go dig through a bunch of pipelines to figure out what your expectations are.
The second set of tooling defines the quality rules outside of the pipelines and on the data itself. If you look at the most primitive example of this, it's the data catalog where you have descriptions of the columns and of the tables that you are using, and they're stored in a central location that anybody can go and read and some people will update. This is useful to a human, but it's not useful to the actual tooling used by data engineers because they would still need to translate what's there into their pipelines if they want resilient and reliable pipelines. It's important for tools that define quality checks in a central place to be able to expose that information both in a human accessible and human readable way, but also in a way that the code can reference the same checks and the same logic and allow engineers to implement that logic in their tools and assert that their pipelines are producing the data that the users expect.
Bigeye does this by providing a UI for centralizing the definitions and then exposing APIs so that you can pull those definitions into your processing layer and run the same assertions on the data you're producing that are already defined in that central location.
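The sketch below shows the general shape of that pattern: pull centrally defined checks over an API and apply them in the processing layer. The endpoint path, payload shape, and check names are invented for illustration and are not a documented Bigeye API; `df` is assumed to be a pandas DataFrame.

```python
import requests  # assumes a plain REST endpoint serving check definitions as JSON; not a documented Bigeye API


def fetch_checks(api_base: str, table: str, token: str):
    """Pull the centrally defined checks for a table from the shared system (hypothetical endpoint)."""
    resp = requests.get(
        f"{api_base}/checks",
        params={"table": table},
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. [{"column": "order_total", "check": "non_negative"}, ...]


def assert_in_pipeline(df, checks):
    """Apply the same assertions in the processing layer (df is a pandas DataFrame)."""
    for check in checks:
        if check["check"] == "non_negative":
            bad = df[df[check["column"]] < 0]
            assert bad.empty, f"{check['column']} has {len(bad)} negative values"
```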
[00:37:50] Unknown:
As you have been building out the Bigeye product, onboarding customers, and working with them to understand their requirements, I'm wondering what you have seen as some of the most interesting or innovative or unexpected ways that they're using the Bigeye platform.
[00:38:05] Unknown:
When we set out to build Bigeye, we expected teams to make an individual decision on how they would want to resolve a data quality issue, because you can't tell whether a data quality issue exists because the business has changed or because the data is actually wrong. What we've seen is that a lot of the time, when there is an alert in Bigeye, there is actually a data quality issue, and the team thinks that this is an unexpected data problem and they want to resolve it at the data layer. One of the most interesting ways that we've seen Bigeye used in this case is triggering automation in their tools to fix or roll back the data that is bad.
An example of this is you have an alert in Bigeye that says this column has a non-zero number of nulls. It triggers an alert, and that alert hits a webhook in the infrastructure layer, which triggers an ETL that takes all of the rows that are null and moves them into a side table as a quarantine table. So now the investigation that happens after the fact can be done on that specific quarantine table, because you know all of the bad rows are in a single place. So it's very interesting to see data quality monitoring not just being used as an alerting mechanism, but also as a hook back into the infrastructure layer and into the ETL orchestration layer, actually performing automated actions off of the alerts that are coming from the data quality system.
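A minimal sketch of the quarantine step itself might look like the function below, which would be invoked by whatever receives the alert webhook. Table and column names are illustrative, and `conn` is assumed to be a connection object exposing `execute()` and `commit()` (as in sqlite3).

```python
def quarantine_null_rows(conn, table: str, column: str) -> int:
    """Move rows where `column` is NULL into a quarantine side table so the follow-up
    investigation can happen in one place. Names here are illustrative."""
    quarantine = f"{table}_quarantine"
    # Create an empty copy of the source table the first time this runs.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {quarantine} AS SELECT * FROM {table} WHERE 1 = 0")
    conn.execute(f"INSERT INTO {quarantine} SELECT * FROM {table} WHERE {column} IS NULL")
    moved = conn.execute(f"DELETE FROM {table} WHERE {column} IS NULL").rowcount
    conn.commit()
    return moved
```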
[00:39:53] Unknown:
And a corollary to the number of businesses that are addressing data quality in a product-oriented fashion: one of the reasons that is a viable option is because there are so many people who are identifying this as a need and who are building their own homegrown approaches to addressing data quality, trying to identify issues with it and solve for it. So I'm wondering what you have seen as some of the most interesting homegrown platforms for addressing this within an organization's engineering group, or even things that you yourself have done before building Bigeye?
[00:40:33] Unknown:
The most common homegrown solution that we've seen, and honestly that I have built multiple times at this point, is taking a SQL query, putting it on a cron schedule, running it, and then outputting the results on some dashboard. You walk in in the morning, you look at the dashboard: does the graph look okay? This is obviously extremely manual. You're copy-pasting SQL queries around, you're running a cron job on some random box in EC2, and it's completely unscalable. We've seen interesting variations of that where teams have taken that SQL query and then output all of the data to Datadog, so it lives in the same place as their infrastructure metrics and they can monitor data quality and infrastructure in the same place. The most extreme example of this was a data scientist that we've met who ran all of his checks and then put all of those into one giant data quality table, which he then pivoted into multiple dashboards and presented those as, effectively, data products on the quality of his data.
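The Datadog variation just described might look roughly like this: a scheduled query whose result is pushed as a gauge next to infrastructure metrics. It assumes the `datadog` Python client with a local DogStatsD agent; the metric name, table, and schedule are illustrative.

```python
from datadog import initialize, statsd  # assumes the `datadog` client and a local DogStatsD agent

initialize(statsd_host="127.0.0.1", statsd_port=8125)


def report_row_count(conn, table: str) -> int:
    """Run the quality query and push the result to Datadog, next to infrastructure metrics.
    `conn` is any connection object exposing execute(), as in sqlite3."""
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    statsd.gauge("data_quality.row_count", count, tags=[f"table:{table}"])
    return count

# Invoked from cron, e.g.: */30 * * * * python report_row_count.py
```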
The most extreme example of a homegrown solution that is nontechnical that we've seen is literally putting a single person on call for data quality, and their whole job is to monitor dashboards and reports. And if they see something, investigate it and fix the data quality problem. This obviously doesn't scale, but it is a very interesting homegrown approach to data quality.
[00:42:15] Unknown:
The if you see something, say something of data engineering.
[00:42:18] Unknown:
Exactly.
[00:42:20] Unknown:
And so in terms of the overall data management and data quality landscape, what are some of the biggest trends that you're following and that you find most interesting?
[00:42:35] Unknown:
Data cataloging and discovery have been top of mind for quite a while. If you look at the need of most data teams, it's to access the data and build something useful with it quickly. And the only way that you can build something with data is if you understand what exists there. From that perspective, data catalogs and data discovery have been a huge trend in the space recently. If you look at the open sourcing of Amundsen from Lyft, as well as other enterprise data catalogs such as Alation and Collibra, they've all been coming more into focus recently as teams have to deal with larger and larger amounts of data.
The other trend that is most interesting to me is the migration from ETL to ELT, as well as the migration from warehouses to lakes and back to warehouses again. It's very interesting to see the pendulum swing in the data space as teams go to a more centralized data model and then realize that that's a lot of overhead. I just want to manage my own team's data, and they break out into data marts and data warehouses again. It feels like history repeats itself every decade. And it's interesting to see that cycle repeat now.
[00:44:05] Unknown:
For folks who are trying to gain control of data quality in their own data pipelines and are considering using Bigeye, what are the cases where it's the wrong choice and they would be better suited with a homegrown solution or some other off-the-shelf product?
[00:44:21] Unknown:
Bigeye operates by querying data out of a database, whether that be a warehouse or Presto on top of S3. If your data is not in a queryable format, if it's very nested JSON that's completely unstructured and you're doing completely offline processing in Jupyter Notebooks, then Bigeye might not be the right tool for you, and you might want to look into other ways to monitor that data. That being said, if you have unstructured data that you're trying to use for something, typically structuring it is the first step. And if you're structuring it, you're putting it in a queryable format, which typically means SQL, in which case Bigeye would work for you once you get to that step.
[00:45:14] Unknown:
And as you continue to iterate on the product and explore different ways of identifying data quality issues, patterns in how these problems surface, and ways that data is being used, what do you have planned for the future of the Bigeye product and the business around it?
[00:45:33] Unknown:
A lot of the focus has been on introducing more automation into Bigeye, whether that's on setup time for the metrics or on setting thresholds automatically rather than manually. And we want to improve the intelligence of the product. Because we are capturing these metrics, we can do a lot of interesting things around predicting how a metric will behave, as well as identifying other parts of your system that you might want to apply the same metrics to. So taking all of this metadata that we capture about your database, as well as the metadata from the metrics that you're already collecting, and allowing you to quickly set up new checks and new tests across your whole data warehouse in as little as an hour is really exciting to us.
[00:46:30] Unknown:
For anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:46:45] Unknown:
The two biggest gaps that I see in data management today are the ability to encode domain expertise and information into the tooling, and making it easier to tie tools together. To that first point, if you look at things like Salesforce, or Carta for cap tables, there's a lot of domain expertise that's actually baked into the tools because it's so required. Data tools today are very unopinionated, because a lot of patterns haven't really been set in stone yet by the community. But there's still a lot of tribal knowledge, and there are a lot of common patterns that you can see across teams that could make tools more opinionated in order to make them more useful for everybody.
And to that second point, the same tools are frequently used together, and there's still no good way of tying tools together. It seems like every data engineering team has to rebuild the same things from scratch and reinvent the wheel. So I'm excited to see a lot more integrations in the data space between tools that are commonly used, so that when I'm starting a new project, I don't have to read all the API docs and build all the piping manually.
[00:48:13] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you've been doing with Bigeye and your experiences working with the data quality market. It's definitely a very interesting and important area of focus. So I appreciate all the time and energy you've put into that, and I hope you enjoy the rest of your day. Thanks, Tobias. It was great to be on the show. Have a great day
[00:48:37] Unknown:
yourself.
[00:48:39] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Egor Gryaznov and Bigeye
Defining Data Quality
Factors Affecting Data Quality
Industry Trends in Data Quality
Overview of Bigeye's Approach
Common Data Quality Issues
Tooling and Data Quality
Customer Use Cases and Homegrown Solutions
Future of Bigeye and Data Management Trends
Biggest Gaps in Data Management Tooling