Summary
Data quality is a concern that has been gaining attention alongside the rising importance of analytics for business success. Many solutions rely on hand-coded rules for catching known bugs, or statistical analysis of records to detect anomalies retroactively. While those are useful tools, it is far better to prevent data errors before they become an outsized issue. In this episode Gleb Mezhanskiy shares some strategies for adding quality checks at every stage of your development and deployment workflow to identify and fix problematic changes to your data before they get to production.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
- Your host is Tobias Macey and today I’m interviewing Gleb Mezhanskiy about strategies for proactive data quality management and his work at Datafold to help provide tools for implementing them
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what you are building at Datafold and the story behind it?
- What are the biggest factors that you see contributing to data quality issues?
- How are teams identifying and addressing those failures?
- How does the data platform architecture impact the potential for introducing quality problems?
- What are some of the potential risks or consequences of introducing errors in data processing?
- How can organizations shift to being proactive in their data quality management?
- How much of a role does tooling play in addressing the introduction and remediation of data quality problems?
- Can you describe how Datafold is designed and architected to allow for proactive management of data quality?
- What are some of the original goals and assumptions about how to empower teams to improve data quality that have been challenged or changed as you have worked through building Datafold?
- What is the workflow for an individual or team who is using Datafold as part of their data pipeline and platform development?
- What are the organizational patterns that you have found to be most conducive to proactive data quality management?
- Who is responsible for identifying and addressing quality issues?
- What are the most interesting, innovative, or unexpected ways that you have seen Datafold used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datafold?
- When is Datafold the wrong choice?
- What do you have planned for the future of Datafold?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Datafold
- Autodesk
- Airflow
- Spark
- Looker
- Amundsen
- dbt
- Dagster
- Change Data Capture
- Delta Lake
- Trino
- Presto
- Parquet
- Data Quality Meetup
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Special Guest: Gleb Mezhanskiy.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy analytics in the cloud. Their comprehensive data-level security, auditing, and de-identification features eliminate the need for time-consuming manual processes, and their focus on data and compliance team collaboration empowers you to deliver quick and valuable analytics on the most sensitive data to unlock the full potential of your cloud data platforms.
Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta. That's I-M-M-U-T-A. RudderStack is the smart customer data pipeline. Easily build pipelines connecting your whole customer data stack, then make them smarter by ingesting and activating enriched data from your warehouse, enabling identity stitching and advanced use cases like lead scoring and in-app personalization. Start building a smarter customer data pipeline today. Sign up for free at dataengineeringpodcast.com/rudder. Your host is Tobias Macey. And today, I'm interviewing Gleb Mezhanskiy about strategies for proactive data quality management and his work at DataFold to help provide tools for implementing them. So, Gleb, can you start by introducing yourself? Thanks, Tobias, for having me. My name is Gleb. I'm currently CEO and cofounder of DataFold. We're building a data observability platform
[00:02:30] Unknown:
that helps data teams build data products faster and with higher confidence. Before building DataFold, I was a data practitioner and was doing data engineering, data science, analytics. So a lot of what we're building is informed by my personal experiences and pain points. And do you remember how you first got involved in the area of data management? Yeah. That was back in 2014. I joined Autodesk's newly formed consumer group that had a portfolio of over 20 B2C creativity tools. As a 1 man data platform, I was then tasked to centralize all analytics around this portfolio of 20 apps.
And the great part about that was that, you know, as a 1 man data platform, I got to choose all the tools I wanted and put together my stack. But it's also worth mentioning that although it was not so long ago, in 2014, we lived in a pretty much completely different world, data tools wise. So Airflow wasn't released yet. I think Looker and Snowflake had just raised their series B. Spark was bleeding edge, just the first release. And so, tooling wise, I think the approaches that we used back then were quite different from what is mainstream today.
[00:03:45] Unknown:
Yeah. It's definitely always crazy to look back at some of the timelines because the overall data tooling space has been moving so fast that if you look at what's out there today, it's just impossible to even remember what came out when and how long ago it was available. Because as you said, you know, 2014, that's only 7 years ago, but it's a complete lifetime and an entire paradigm shift away from where we are right now with the overall data landscape. Yeah. Absolutely. And a huge shift in the problem space as well. So I think what's top of mind for data teams today is a very different set of problems than what we were facing back then. Yeah. I think at that point, it was still just a matter of, I need to get this data from here to over there, and I need to make sure that it doesn't error out halfway through. And now we're sort of moving up the pyramid of, you know, the hierarchy of needs to where, you know, data observability is actually 1 of the concerns for data teams now that wasn't even on the table 7 years ago. Exactly.
And so in terms of what you're building at DataFold, can you give a bit of a background and overview about what it is that you're creating and some of the story behind what motivated you to launch this company? Yeah. Absolutely. I think to tell the story of DataFold, I should also
[00:04:55] Unknown:
give a little bit of background on my path in data engineering. So after starting, you know, building the data platform at Autodesk, I then moved to Lyft where, at the time when I joined, we had a 15 person data team that over the course of the next 3 years grew to over a 300 person org. And so with almost, you know, 20 x expansion of the team, exponential growth of the business, and, of course, the data volume and complexity, that all created tremendous pressure on infrastructure and tooling. And so I initially started as a data analyst building data products such as BI reports and forecasting machine learning models. And I very quickly realized that the available tooling was really not suited to tackle the problems that were rapidly emerging due to the growth of the team and the data. So I switched my focus from building data products to building tools that enabled data developers and data scientists to build those products, because the complexity of the data, the reliability of the data, and the speed of development that we had in the data team were quickly becoming bottlenecks for the business growth.
And so 1 of the key, I guess, pivotal moments for me to start focusing on tooling was when, as a data engineer being on call, so basically responsible for taking care of all incidents, I had to ship a very small incremental change to 1 of the core jobs that were building analytical datasets. And I made just a very tiny change, about 4 lines of SQL. And I did some testing. I got a code review from my teammates, and I shipped it. I merged it, rebuilt the entire DAG. And the next day, we discovered that there was a huge data incident going on. So, basically, all analytics was stopped because it was apparent that a huge portion of the data was missing.
And what was crazy is that it took us about 6 hours to realize that that data incident was related to the change that I made the previous night. And even for me, the person who made the change, it wasn't at all apparent that the data incident was related to it. Right? And the most scary part is that I followed the process that existed and I used the tools that existed, but even still, I was able to make such a mistake that led to a really bad outcome for the business. So it took us the full next day to clean it up and to relaunch all the processes and to get all the data pipelines back on track. And so the realization that 1 person, you know, making a small change can bring down, you know, the entire platform at a large company with huge business impact was 1 of the pivotal moments for me to start focusing on building tools, first internally at Lyft and then eventually starting DataFold to help solve these problems for everyone.
So back at Lyft, just to give you kind of a sense of what we were building, we built a framework on top of Airflow that enhanced the developer experience and helped build more testable pipelines. We also built real time anomaly detection based on Apache Flink. We also built an early version of a data catalog that was the predecessor to Amundsen, which is now open source. And so all of these projects really impacted how the entire data org was building data products. And then the realization was: if that's something that Lyft needs at its scale, the rest of the data community likely suffers from the same issues, but few have the resources to build so many different tools in house, even though they also need them. So that's kind of my personal experience that led to the creation of DataFold.
And I guess the macro reason or the bigger why of why I decided to start a company building a data observability tool and data quality tooling is that, obviously, data is eating the world. And I think we're just at the beginning of seeing how data products disrupt industries, all industries in the world. And we kind of started talking about that at the beginning, of how the data environment is different right now from, let's say, 7 or 10 years ago. And so over the last 5 to 7 years, we really solved a lot of fundamental problems of how we store data, how we collect it. We now have really fast, almost limitless, scalable databases. Right? We have great BI tools, visualization capabilities. We have ML infra.
But the problem that emerged over the last few years is, now that companies have limitless capacity to accumulate and produce data, how do we deal with this complexity? How do we tackle the problems of data quality when we're dealing with, you know, tens of thousands of tables and millions of columns at an average size company? And so I think that by solving those problems for the people, for data teams who are using and developing data day to day, we can then make a really huge impact on the world in general. So that's the bigger why behind DataFold.
[00:10:25] Unknown:
Yeah. It's definitely a huge problem that a lot of teams are dealing with, because there's this explosion in tooling, and it's moving so fast that it's hard to keep up with. And so you're just trying to build systems and keep them moving and deal with all the different data sources. And now that data integration is a lot easier with tools like Fivetran or the Stitch ecosystem, where anybody can say, oh, I'm going to connect this data source into my data warehouse, you know, now data teams are just being completely swamped. And so even just keeping track of what data exists has become an entire tooling problem, and, you know, entire companies are being launched just on that 1 problem. So the fact that managing the quality of any 1 of those data sources can have, you know, such an outsized impact, as you mentioned, with just changing 4 lines of SQL destroying the entire productivity of the company for a day, is, you know, definitely a huge financial burden, particularly for companies that aren't set up to handle it. And so in terms of those data quality problems, I'm curious what you see as being the biggest factors that will actually contribute to incidents of, you know, quality problems or pipeline failures or, you know, some of these outsized impacts that can happen from a small change?
[00:11:37] Unknown:
Yeah. Absolutely. I think data quality is right now becoming as big of a problem space as software quality. So it's enormous. And I don't think there is any single, you know, framework or tool or solution to really solving it for even a not very large company. And so I think with such big problems, it's always helpful to try to break them down into a few dimensions, then it becomes more manageable. And 1 way to look at the data quality problems is to look at what are the sources of those problems. So 1 is obviously operational issues. Right? Say, our data producing jobs are delayed, infrastructure failures, there are errors.
There is certain queuing in the system, so data is not available. It's not computed. I think this problem space is more understood right now and probably easier to manage given the maturity of infrastructure. Another failure scenario is when the data that we rely on changes. So examples of that could be vendors that we use to ingest the data not complying with expectations. Other teams making changes to their data sources and causing impact. Or there can also be change in the business. So fundamental changes in the world that also get reflected in the data. And the 3rd big category of data quality problems arises from us, data engineers, data developers, making changes to our data products. So making changes to the code that processes data, be that SQL, Python, Scala, or other frameworks, changes to the business logic that exists in business intelligence tools. So right now, many of those tools also contain a lot of logic in terms of how data is computed and presented, changes to ML model definitions. So right now, data driven companies, so companies that are really relying on data for making decisions by humans and by machines, they typically deal with code bases that are used to process data, which are comparable in size to their actual software products. So it's a tremendous amount of complexity, probably tens of thousands or hundreds of thousands of lines of code. And it's also very rapidly evolving. Right? Because to be really data driven, we not only need to build out all the infrastructure and all the models and the star schema. We also have to rapidly iterate on it to keep up with the business demands and with the new challenges that the growth goals pose.
And within that framework, right, operational issues, changes to the data, and changes to the data processing code, I think the latter area right now is probably the least studied and the least understood. And I think it's somewhat natural. Right? Because the first step when you are dealing with data quality issues is think, well, I wanna at least know whenever they happen. But I think to really tackle the problem, we'd have to pay closer attention to how do we work with the data, what is our development process, what is our change management process. And then for solving that and solidifying that, we can then achieve better
[00:14:50] Unknown:
data quality. To your point with the parallels about the software quality space, there are a lot of tools like linters and unit tests and, you know, static code analysis for both potential bugs and security implications. And because of the fact that data platforms are not a static system, there is no single point in time snapshot that is going to accurately represent the entire system as there can be with code, although with code, it gets complex as well. How are we able to map some of those same concepts from software quality management into the data space and deal with the sort of dynamism of real world data cleanliness issues and how they impact the actual systems that are dealing with processing them, to create these preventative maintenance systems?
[00:15:37] Unknown:
So I think that 1 of the bigger trends that we see right now in the data world is the application of what are now considered standard software development practices to the data workflow. And some of the ways in which we can solidify our data development process are to, 1, bring version control. Right? So version control of everything, starting with your ETL code that ingests data and processes data. Also, version control for things like even BI dashboards. Because if we think about this, in the current world where companies have entire meetings structured around dashboards to make decisions about investments or about killing or doubling down on a particular feature, the stakes of making the wrong decision based on data are really high. And so all data products, no matter whether they are executive facing or, you know, going into production, should have version control, because that enables, 1, very clear reproducibility of whatever is the state of this code. It also enables a very clean and visible change management process because we can cleanly delineate between the previous version and the new version.
And it also allows for more seamless collaboration between the teams. Because right now, data products are not built by a single person or even a single team. We have probably dozens of people collaborating on every one, especially at larger companies. I think the second important aspect of the development process that we're seeing coming from software engineering into the data world is having good visibility of changes. So whenever we are making a change to, let's say, a pipeline that transforms data, we have to really understand what this change entails, both for us as a team and also for our downstream stakeholders and consumers. And so in the software world, we are typically doing this through regression testing. Right? So we are running unit tests. We're running regression tests. We are, potentially, in the world of microservices, exposing a tiny bit of traffic to the new service and observing it and seeing what happens.
And so in the data space, we now also have similar frameworks, for example, assertions, such as validating that a given column on a dataset is unique or not null. These are very helpful to validate business assumptions about the data and can run both during the development process and in production. But I think 1 of the still missing aspects that we are trying to close in this development process, and in particular in having the visibility into the changes, is understanding, like, what is the full impact analysis of the change that I'm making? And that can go into simple questions such as, what is the number of rows that a particular dataset will produce? Will I have any drifts in the features?
Am I going to break any dashboards because I may have removed a column or renamed a column? And so having this visibility is really paramount for a reliable change management process. And then the 3rd component that I think is also rapidly making its way into the data world is continuous integration and continuous deployment. And similarly to how it helps software teams be more agile, make smaller incremental changes, and then ship them faster in a reliable way, in the data world, we see almost a renaissance of CI, where we see data teams investing in automated testing procedures.
So for example, whenever someone checks in code that transforms the data or even controls the layout of a dashboard, there is an automatic process that runs tests, maybe builds a staging dataset, and then even maybe automatically merges this code and deploys it to the ETL orchestrator. So that really helps make sure that whatever is the change management process, it's not only available to people, but it's also automatically enforced. Right? And there's no change that bypasses certain testing, which is required.
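As a rough illustration of the kind of assertion described above — a check that a column is unique and not null, run automatically against a staging dataset before a change is merged — here is a minimal Python sketch. The table and column names, the use of SQLite as a stand-in warehouse, and the idea of failing a CI build on violations are assumptions for the example, not the API of any particular tool mentioned in this episode.

```python
import sqlite3


def check_unique_not_null(conn, table, column):
    """Assert that `column` in `table` has no NULLs and no duplicate values.

    Returns a dict of failure counts; an empty dict means the check passed.
    This mirrors the kind of assertion a CI job could run against a staging
    copy of a dataset before the change that built it is merged.
    """
    failures = {}

    null_count = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()[0]
    if null_count:
        failures["null_values"] = null_count

    dup_count = conn.execute(
        f"""
        SELECT COUNT(*) FROM (
            SELECT {column} FROM {table}
            WHERE {column} IS NOT NULL
            GROUP BY {column}
            HAVING COUNT(*) > 1
        )
        """
    ).fetchone()[0]
    if dup_count:
        failures["duplicate_values"] = dup_count

    return failures


if __name__ == "__main__":
    # Tiny in-memory example standing in for a staging dataset built by CI.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders_staging (order_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders_staging VALUES (?, ?)",
        [(1, 9.99), (2, 14.50), (2, 14.50), (None, 3.00)],
    )

    result = check_unique_not_null(conn, "orders_staging", "order_id")
    if result:
        # In a real CI pipeline this would fail the build and block the merge.
        raise SystemExit(f"Data assertions failed: {result}")
    print("All assertions passed")
```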
[00:19:47] Unknown:
As far as the tooling and the platforms and the sort of impact that they can have on data quality issues, what are some of the ways that they can contribute to the occurrence of data quality issues as far as the systems that you're building, the way that your data platform is architected, and some of the design considerations that teams should be thinking about as they're planning out their data platform or as they're starting to think about introducing new systems or new processes?
[00:20:16] Unknown:
So as a big believer in great workflows, I think that the best way tools can support reliable data and help data teams ensure high data quality is to really facilitate those strong workflows. And to give you an example, we talked about version control. We talked about testing and CI. So we see that certain tools that we now consider part of the modern data stack, for example, dbt for SQL transformations or tools like Dagster for general purpose data pipelines and tasks, they come with those features and frameworks already built in. So they already facilitate version control. They have built in testing frameworks that make it really easy for data developers to write tests and run them as part of the pipeline.
And documentation frameworks that help both keep documentation close to the code, which is always great, but also serve that documentation in a nice UI that can be consumed by, not necessarily the data developers, but data users. And very importantly, they have separate production and staging and development environments. That also is a very important concept for making sure that the change management process is reliable.
[00:21:32] Unknown:
As far as the potential consequences, we have addressed some of that where, you know, if you have a wrong column or the data is old, it can potentially lead to costly decisions that end up being based on incorrect assumptions around the data that's available. And so how can organizations start to shift to being more proactive in their data quality management and start to instill the understanding at the business level that it's worth the investment and the time and energy that it takes the engineering team to create these systems for proactive management, and also how to instill the sort of level of care and diligence that's needed?
[00:22:25] Unknown:
So I think probably the first step that's important in an organization is, you know, recognizing that there is a problem and getting buy in to solve it. Unfortunately, we still see that some teams, you know, live with data quality issues as a status quo. Right? And so we have to recognize that there's a problem and that we are able to improve it. I think the second important aspect is understanding what are the root causes of the issues. So probably trying to classify them and see what are the areas that are most risky and most impactful.
And, again, I'd like to emphasize proactive data quality management through improving the development process over more, like, post factum monitoring. Because as appealing as the idea of post factum data monitoring is, the kind of black box solutions that just tell me when my data is wrong, it's quite hard to rely on that alone to improve data quality. Because by the time that you identify that there is an issue in production, the damage is already done. Right? So the stakeholders probably already looked at the dashboards showing wrong information. Machine learning models already ingested the wrong data and skewed their results.
Another problem is that by the time the data is already in production, it can be really hard to identify the root cause, because with multistage data pipelines, corrupted data propagates really fast and it becomes ubiquitous. And the other aspect, which is more organizational, is that with data quality issues that are already in production, to fix them, you have to fight the organizational momentum. Right? You have to advocate for people to stop whatever they're doing and go back and fix them, as opposed to working on the new things, which is always an uphill battle. That's why I strongly advocate for data teams and companies to really look into the preventative ways to address data quality, because then all of those issues are taken care of. And so in terms of how to think about improving the process, I think an important aspect is to understand what are the current inefficiencies of the process. So is the bottleneck in the ability to ship, let's say, data? Right? Do the teams need better frameworks for shipping data products faster? So sometimes a team would need to, let's say, switch to a more agile framework like dbt, which comes with a lot of the data quality toolkit features already.
Right? But assuming that the basic infrastructure and tooling is already in place, I would start with planning out the change management process. So what are the steps that are required in order to make a change to a data product, be that, you know, a SQL job or a BI dashboard, and then introducing visibility tools. So how can we make sure that those tasks are executed, that we have a full understanding of the changes that we're making, and then making sure that these processes are enforced.
[00:25:34] Unknown:
As far as what you're building at DataFold, I'm wondering if you can talk through some of the design and features that you are building in and some of the architectural aspects of the system that allow it to enable some of this proactive data quality management of finding and fixing, you know, data quality issues and data bugs before they actually go out into a production context?
[00:25:58] Unknown:
Yeah. Absolutely. So we call DataFold a data observability platform. And by observability, we mean that we help data teams discover and understand their data, how it works, what the distribution of the data is, where it comes from, where it goes, and also verify and test it. And so while there are multiple features that I won't go into in detail right now, the really key pieces of the platform that help enable a reliable change management process are Datadiff and the column level lineage engine. So Datadiff is a tool that analyzes changes in the data and provides a visual report across multiple dimensions and with various degrees of granularity.
So you can think of it as git diff for data or, you know, a Microsoft Word diff, but for your datasets. So whenever you want to compare 2 datasets, it gives you a view into how they are different, both in terms of individual rows and also on a statistical level in terms of the distributions. And so how does Datadiff fit into those workflows that we discussed? For 1, it helps you automate regression testing because you can compare the before and after state of your data product. For example, you can compare the production version of your dataset with the development version of the dataset built with the new code that you're about to merge. And so that helps you answer questions such as, what is going to happen to the data? Are there any unintended changes to, you know, the number of rows, the percentage of nulls? Are we going to cause feature drifts by changing distributions of particular dimensions?
Are we going to cause BI tools to fail because we renamed or misplaced columns? So Datadiff helps answer those questions without writing any SQL or without doing any manual checks. And the way it fits into the workflow is essentially automating what most teams do right now, but manually. So we spoke to some really senior data engineers at public companies to learn that sometimes they spend up to a week testing a single change to a really important SQL job if that job, for example, powers the financial reporting, because the stakes of making a regression are super high. And the majority of the time in that week goes into writing arbitrary ad hoc SQL queries that are essentially comparing things and validating things to make sure that there are no regressions. So Datadiff essentially takes out that manual part of the work.
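For a rough sense of what those ad hoc comparison queries look like when they are automated, here is a minimal Python sketch that diffs a production table against a development build of the same table: row counts, per-column null counts, and primary keys present on only one side. The table and column names are made up for the example, SQLite stands in for the warehouse, and a real diff tool goes much further (statistical distributions, value-level diffs); this is not Datadiff's implementation.

```python
import sqlite3


def diff_tables(conn, prod_table, dev_table, key, columns):
    """Compare a production table against a dev build of the same table.

    Reports row counts, per-column null counts, and primary keys that exist
    on only one side -- roughly the checks an engineer would otherwise write
    as ad hoc SQL before merging a change.
    """
    report = {}
    for label, table in (("prod", prod_table), ("dev", dev_table)):
        rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        nulls = {
            col: conn.execute(
                f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL"
            ).fetchone()[0]
            for col in columns
        }
        report[label] = {"rows": rows, "nulls": nulls}

    # Keys that appear in one table but not the other.
    report["keys_only_in_prod"] = conn.execute(
        f"SELECT COUNT(*) FROM {prod_table} p "
        f"WHERE NOT EXISTS (SELECT 1 FROM {dev_table} d WHERE d.{key} = p.{key})"
    ).fetchone()[0]
    report["keys_only_in_dev"] = conn.execute(
        f"SELECT COUNT(*) FROM {dev_table} d "
        f"WHERE NOT EXISTS (SELECT 1 FROM {prod_table} p WHERE p.{key} = d.{key})"
    ).fetchone()[0]
    return report


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rides_prod (ride_id INTEGER, fare REAL)")
    conn.execute("CREATE TABLE rides_dev (ride_id INTEGER, fare REAL)")
    conn.executemany("INSERT INTO rides_prod VALUES (?, ?)",
                     [(1, 12.0), (2, 7.5), (3, 20.0)])
    conn.executemany("INSERT INTO rides_dev VALUES (?, ?)",
                     [(1, 12.0), (2, None)])

    print(diff_tables(conn, "rides_prod", "rides_dev", "ride_id",
                      ["ride_id", "fare"]))
```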
And then aside from testing the regressions between production and development, data diff can also be helpful in identifying drifts in the data in production because we can compare state of the data today versus yesterday or, let's say, after a job run versus before job run and identify any anomalies. So are there any unexpected consequences? So that's more of an autonomous anomaly detection piece. But back to the development workflow, like I said, the second component is column level lineage. So what is lineage? It's essentially an interactive map of the dependencies in your data ecosystem that essentially shows you for a given column where does the data go and where it comes from. So if we look at a particular dashboard, we can immediately answer a question.
So for a given metric, how is it computed, and what are the columns that are feeding the data into this metric? And we can see that, for example, a particular column is a combination of 2 upstream columns. It's some operator or it's a CASE WHEN statement. So we can trace those dependencies up and down. And while there are multiple uses for column level lineage, the 1 that's relevant for a reliable change management process is doing the impact analysis. Right? So whenever we are changing, let's say, a SQL job and we have the data diff that shows us what is the impact on a particular table, the next thing we can do with column level lineage is understand what are the potential downstream consequences that we haven't accounted for of making a given change. For example, if we change the definition of a given metric, for example, conversion, with column level lineage, we can immediately identify all the downstream jobs, all the dashboards, all machine learning models that are using this metric.
So we can, 1, potentially do impact analysis there, or we can also proactively reach out to stakeholders, to owners of those data products and data users and tell them about the anticipated change. So together, these 2 tools facilitate the full understanding of the impact you're making when you're introducing changes to the data processing code. And through that, we can dramatically reduce the chance of making errors and also save a lot of time for data developers that otherwise would go into manual testing.
[00:31:02] Unknown:
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14 day trial.
Another interesting element of the sort of data quality question is that with particularly organizations that have their own in house software teams, a lot of the data is going to be coming from operational database systems that are owned and managed by a team that is distinct from the data team and and that has their own priorities and their own release cadences and their own ideas about what database design should be and how to evolve it. And then there are also things like customer event tracking where you have a tracking pixel or, you know, set of JavaScript on a website that is going to have some event schema that's coming in. And so then you have to deal with pulling those events in and, you know, convert them into a database table and deal with downstream transformations there. And, you know, not even factoring in the 3rd party SaaS platform data that you need to pull in. You're just within the scope of data sources that are within the entire control of your organization, but not necessarily owned by the data team. How do you sort of popularize or build an organizational contract between the different stakeholders and data owners about how to manage change propagation through the different systems, you know, maybe starting in software systems or event tracking to, you know, how that impacts the business dashboard that your CEO is looking at tomorrow?
[00:33:01] Unknown:
Yeah. Absolutely. It's a huge problem, and it's typically a big pain point for every company that we spoke with that is really data driven and building lots of data products. I think the first step is, again, to acknowledge and to say that the change management process for data sources, be that events or operational data stores that are copied to your warehouse, should also be reliable, and to equip the teams that are owning those sources with full visibility into the impact of the changes that they are making. And then in the world of event tracking, we are seeing the emergence of tools that are specifically focused on reliable definition and change management of those event schemas. So they are called instrumentation trackers or schema planners.
So the idea of those tools is that you have a central repository for defining events. So what is an event? What are the properties that are sent alongside the events? And then whenever engineers implement those events, there is an automatic validation against the spec to ensure that both during development and in production, whatever instrumentation generates, whatever data comes out of those sources as part of the tracking, it conforms to the original spec. And all the changes are also version controlled, and all the data developers who use those events, the data consumers, and the engineers who instrument those events are all on the same page.
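To make the idea of validating instrumentation against a central spec concrete, here is a small sketch using the open source jsonschema library, assuming events are emitted as JSON. The event name, its properties, and the notion of keeping the spec in a version-controlled repository are illustrative assumptions, not the API of any particular schema planner mentioned here.

```python
from jsonschema import ValidationError, validate

# A centrally defined, version-controlled spec for one tracking event.
SIGNUP_COMPLETED_SPEC = {
    "type": "object",
    "properties": {
        "user_id": {"type": "string"},
        "plan": {"type": "string", "enum": ["free", "pro", "enterprise"]},
        "signup_ts": {"type": "string"},
    },
    "required": ["user_id", "plan", "signup_ts"],
    "additionalProperties": False,
}


def validate_event(event: dict) -> None:
    """Raise if the event payload does not conform to the agreed spec.

    The same check can run in the client SDK during development and in the
    ingestion pipeline in production, so producers and consumers stay in sync.
    """
    validate(instance=event, schema=SIGNUP_COMPLETED_SPEC)


if __name__ == "__main__":
    validate_event({"user_id": "u_123", "plan": "pro",
                    "signup_ts": "2021-06-01T12:00:00Z"})
    try:
        # Wrong enum value and missing required field: rejected before it
        # ever reaches the warehouse.
        validate_event({"user_id": "u_123", "plan": "gold"})
    except ValidationError as err:
        print(f"Rejected event: {err.message}")
```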
I think, speaking about kind of the interoperability of the tools and how we can piece them together, those tools mostly see just the world up until the point where those events land in the warehouse. And this is where a tool like DataFold can come in, because we have the visibility all the way from the raw event sources to the ultimate data consumers. So by plugging these tools together, you can also ensure a reliable change management process for those sources. And as far as the operational stores that are oftentimes copied using change data capture into warehouses, this is a somewhat more complex problem because it's a fairly kind of low level infrastructure process to copy the data from operational stores, and there is a big amount of variability in terms of how companies implement it. So some use vendors, some use open source CDC methods, some use batch copies.
And so whatever it is that the team is using, I think the key part is to, again, make sure that before any change is made to the original source or to the source schema, there is an impact analysis performed that clearly shows what is gonna be the impact of the change. Because sometimes you can remove a column and no 1 cares. And sometimes you change a slight definition and there's a huge data incident. So understanding the difference between these 2 scenarios is key. Again, I think column level lineage is the fundamental instrument and source of information for that, but how exactly it plugs in to the change management process for operational data stores highly depends on how the company implements it. To that point too of column level lineage, a lot of systems will look at that from the data warehouse perspective.
[00:36:22] Unknown:
But it's definitely an interesting question to think about, how can we propagate some of that information and extend the visibility of these data tooling systems into the operational stores and the applications so that it becomes part of the application development life cycle to be able to view and analyze the downstream impacts and not just have that be a responsibility of the data engineers and data analysts?
[00:36:46] Unknown:
Absolutely. I think the cool thing is that with the emergence of the ELT pattern, we shifted from doing a lot of in flight transformation of all data before it lands in the warehouse to the pattern of doing 1 to 1 copies of whatever is in your operational stores. So I think the prevalent pattern right now is to copy your entire schema from the transactional store such as Postgres or MySQL into your warehouse as is. And so if that is the case, then having lineage in your warehouse that shows you downstream usage of those copies effectively can be translated to the ultimate raw sources in your operational store, which makes the entire visibility pipeline much easier.
But if you have more complex scenarios, then, basically, there is also an option to extend your lineage graph to those sources, but that increases complexity massively.
[00:37:45] Unknown:
For organizations that aren't necessarily using a cloud data warehouse and are more in sort of the data lake paradigm, where they have data in S3 in Parquet format and they're dealing with partitioned datasets there. And, you know, they might be using Trino or Presto on top of it, or they're using Delta Lake or Hudi or, you know, the plethora of tools that are arising in that space. What additional challenges or complexities does that pose to, you know, systems like what you're building with DataFold to be able to add the level of insight and introspection that's necessary, that is, you know, relatively straightforward in a vertically integrated data warehouse stack, but is not necessarily as sort of cohesive in these data lake environments?
[00:38:30] Unknown:
I think to answer it, it may be worthwhile to take a look, you know, under the hood of how column level lineage is constructed. So fundamentally, to have a reliable bottom up column level lineage map of your data ecosystem, we have to first obtain the code, so basically the DDL and DML code. So the code that defines the schema of your datasets and the recipes for how those datasets are created. And in the SQL world, that means SQL queries that are creating datasets or modifying them and SQL queries that are consuming datasets. And by then doing static analysis of that code, so decomposing it into an AST representation and then piecing it back into the global graph of dependencies, we can then understand how data is produced and how it's consumed, no matter what happens in that SQL, no matter how complex your queries are and whether you're using correlated subqueries or CASE WHEN statements or renames.
A proper column lineage engine should piece it back together, which we do at DataFold. Now if you're using a data lake approach and still relying on a SQL based engine such as Presto or Spark SQL or Hive, there fundamentally isn't more complexity than building a lineage graph for a, basically, self contained warehouse such as, you know, Redshift or BigQuery or Snowflake. It's just a matter of making sure that you collect those SQL logs. However, when it comes to other scenarios for how data is built, for example, using PySpark or Scala Spark or frameworks such as Apache Beam, where the language of how data is transformed is not SQL, that massively increases the complexity because those languages have massively more powerful syntax than SQL. And so in those scenarios, we have to either connect to the underlying fundamental representations of jobs, so taking a look at how those engines compile whatever is their domain specific language for defining those transformations into the primitive operations, and then using that to augment the graph. But in any case, that probably increases the complexity of building lineage. But as long as we stay in the SQL world, piecing back the entire lineage graph is fairly straightforward.
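As a toy illustration of that static analysis step — parsing the SQL that builds a dataset into an AST and pulling out which upstream tables and columns feed it — here is a small sketch using the open source sqlglot parser. It is not how DataFold's lineage engine is implemented; a real engine has to resolve aliases, subqueries, and SELECT * across the whole query history to assemble the global graph. The CREATE TABLE AS SELECT statement and all names in it are made up for the example.

```python
import sqlglot
from sqlglot import exp


def column_dependencies(create_sql: str):
    """Extract a crude lineage fragment from a CREATE TABLE ... AS SELECT:
    the target table, the source tables, and the (qualifier, column) pairs
    referenced anywhere in the query body."""
    tree = sqlglot.parse_one(create_sql)

    # The table being created, taken from the CREATE clause itself.
    create = tree.find(exp.Create)
    target = create.this.find(exp.Table).name

    # Tables the statement reads from (skip the table it creates).
    sources = {t.name for t in tree.find_all(exp.Table) if t.name != target}

    # Columns referenced anywhere in the SELECT body, with their qualifier.
    columns = {(c.table or "?", c.name) for c in tree.find_all(exp.Column)}
    return target, sources, columns


if __name__ == "__main__":
    sql = """
    CREATE TABLE daily_conversion AS
    SELECT s.ds,
           SUM(CASE WHEN o.status = 'completed' THEN 1 ELSE 0 END) * 1.0
               / COUNT(DISTINCT s.session_id) AS conversion
    FROM sessions s
    LEFT JOIN orders o ON o.session_id = s.session_id
    GROUP BY s.ds
    """
    target, sources, columns = column_dependencies(sql)
    print("target:", target)
    print("source tables:", sources)
    print("columns read:", columns)
```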
[00:40:58] Unknown:
How much attention are you paying to efforts such as Open Lineage to try and create more of an open standard of how to think about and represent and integrate with these lineage graphs, particularly for non SQL systems that have their own sort of custom transformation logic? And how much potential, you know, positive impact do you see with more systems starting to adopt and flesh out that standard or anything else that might be arising in the space?
[00:41:28] Unknown:
Yeah. In general, I'm a strong believer in interoperability between data tools. And I think that's 1 of the core principles of the modern data stack, that tools are increasingly specialized, but at the same time more interoperable and more modular, which allows companies to piece together the stack by choosing the tool which is best in every particular vertical. And so I think standards like Open Lineage are really important in defining how particular types of metadata are shared between the tools. And the way I think a tool like DataFold can be integrated into a larger data ecosystem using Open Lineage is by providing the fundamental lineage information. So, basically, the dependency graph that is then shared using the Open Lineage standard with other tools.
Right now, we already have integrations with data catalogs such as Amundsen and DataHub. So anyone who is using them can ingest column level lineage information from DataFold using a GraphQL API and then load it into the data catalog. I think with Open Lineage, that'll be even easier once it's adopted more widely in the ecosystem. Because once you have information in a standard, you can then reuse it across multiple tools. And like you said, you can also use this standard to piece together different sources for lineage. Right? So for example, you may use DataFold to obtain all the lineage information from your SQL warehouses, and you may then plug in the lineage graph from systems like Spark and Beam, again, using Open Lineage to construct the global graph of dependencies.
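For context on what sharing lineage through the Open Lineage standard roughly amounts to, here is a simplified sketch of a run event built as a Python dict: one job reading one dataset and writing another. The field set is abridged from the spec, the namespaces and names are made up, and real events carry additional facets (schema, column lineage, data quality metrics) and are normally emitted through an OpenLineage client rather than assembled by hand.

```python
import json
from datetime import datetime, timezone
from uuid import uuid4

# A pared-down OpenLineage run event: a transformation job that reads
# one dataset and produces another.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/my-pipeline",  # who emitted the event
    "run": {"runId": str(uuid4())},
    "job": {"namespace": "warehouse", "name": "build_daily_conversion"},
    "inputs": [{"namespace": "warehouse", "name": "raw.sessions"}],
    "outputs": [{"namespace": "warehouse", "name": "analytics.daily_conversion"}],
}

# A catalog or lineage backend would receive events like this over HTTP and
# merge the input/output edges into its global dependency graph.
print(json.dumps(event, indent=2))
```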
[00:43:11] Unknown:
Going back to the organizational aspects of data quality management, in your experience, who has typically been responsible for identifying and addressing data quality issues? And do you think that the current state of affairs is sufficient or beneficial, or do you think that there needs to be a shift in how data quality is sort of owned and operated at the organizational level?
[00:43:40] Unknown:
So I think, naturally, the responsibility for maintaining high data quality falls on the teams that own the data. And typically, that's analytics engineering or data engineering teams that have the largest surface area with the data products and therefore become responsible for the end to end data reliability. And then it's common for them to pass this responsibility on to software engineering teams. So for example, the ultimate stakeholder or user of data, such as, let's say, a financial team or analytical team, would expect the data engineering team to provide them with high quality data, and then the data engineering team would build or collaborate with other teams that are involved in the process of creating datasets to make sure that data is reliable across the entire pipeline.
I think what is currently missing is clear contracts between the teams on who's responsible and what are the ways that teams can collaborate to ensure data quality. Because, like you said, especially with the raw data sources such as operational data, which is typically owned by completely different teams, and sometimes dozens of teams if we're talking about a large company with a microservice architecture, there need to be clear contracts about who is responsible for what and how the entire process for maintaining data quality and managing change is conducted.
So I think 1 of the changes we'll see in the future also is the emergence of top level key results or KPIs at a more organizational level that will measure data reliability and data quality at an organizational scale. And then the various teams that participate in the creation of data products will be responsible for their parts, their contribution to those KPIs, and will be held accountable in a more formal setting. Whereas right now, it's a more ad hoc process where teams are more reactive to certain quality issues, and there isn't a very clear understanding of how exactly to measure or to set those goals.
[00:45:48] Unknown:
In terms of the experience that you've had building Dataflow and working with your end users and talking to people in the industry, what are some of the initial ideas or assumptions that you had about how data quality is managed, the sources of issues, you know, the organizational aspects of it that you have had to, you know, reform and that have been challenged or changed as you worked through this overall problem space and built the tooling and technologies to help support teams who are trying to improve the visibility and quality of their data?
[00:46:24] Unknown:
Yeah. So I think 1 of the interesting realizations that we had after going to market with our solution was that, initially, given our experience working at large companies and on large data teams, my assumption was that what we're building, tooling for reliable change management and testing automation and observability, would be most useful and most sought after by really large companies with complex data ecosystems and large data teams. And what we realized is that while the overall impact of bad data quality is probably indeed larger at those companies, these issues are felt by increasingly younger companies. So we've had customers as small as, you know, a 1 person data team at a post seed stage startup that already starts to feel the data quality issues. So, overall, I think that the challenges of maintaining data reliability and quality have shifted from large companies, you know, upstream, earlier in the company life cycle. That was 1 of the realizations.
So I think the second 1 was that even maybe 3 or 5 years ago, data teams, or data engineers as individual contributors, used to have much more flexibility in choosing their tools. And, you know, even back in my days of doing data engineering, there was a lot of freedom to, you know, go try this tool or that tool and kind of iterate fast on making choices there. And there was a lot of adoption of data tools. Whereas, I think, these days, because companies have become increasingly more protective of their data, given the sensitivity and the complexity of their ecosystems, the decisions of what is the data stack and what is the approach and tooling for each step in the stack are increasingly more centralized and are made higher up in the organization.
So I think those are 2 primary takeaways that we had, you know, going with DataFold to the market.
[00:48:37] Unknown:
In terms of ways that you've seen DataFold deployed, what are some of the most interesting or unexpected or innovative ways that you've seen it used?
[00:48:46] Unknown:
Yeah. So initially, we built DataFold to automate data quality testing and increase data observability. But 1 of the popular use cases that we've seen for our tooling such as column level lineage and diff was to accelerate migrations toward more modern data stack or just in general across tools. For example, if you are migrating your ETL from, let's say, a legacy warehouse to a new warehouse, no matter what they are, 1 of the most time consuming parts of that process is to validate the before and after state of data because, ultimately, your stakeholders don't wanna deal with discrepancies. Right? They want to make sure that whatever they're seeing, which is served from your new warehouse or your new ETL framework, is the same data that they used to see from your legacy system. Or if they're not seeing the same data, they want you to be able to fully explain those discrepancies.
And when we were doing migrations at Lyft, that was probably 80% of the time spent on the overall migration effort. And so what was interesting was to see Datadiff being adopted for those use cases, basically accelerating the migration through faster validation of the datasets transferred to new warehouses or to new ETL frameworks.
[00:50:11] Unknown:
And in your experience of building and growing the company, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:50:20] Unknown:
So, you know, being a data tool in 2021 with, like I said, increasing focus on data protection and security, we had to pay a lot of attention to making sure that our solution is secure. And for a very large number of customers, even larger than we expected, that meant being able to deploy our solution on premise, which for a younger company with, you know, fewer engineers, brings lots of challenges. Right? Because we have to not only maintain 1 SaaS solution that is scalable. We also have to be able to quickly deploy the entire distributed application into customer environments and do it securely, quickly, and also in a way which allows us to maintain it with minimal overhead as well. So that was, I think, 1 of the hardest technical problems that we had to tackle.
[00:51:16] Unknown:
In terms of people who are looking at DataFold and thinking about how they're going to manage their data quality and try to be more proactive instead of reactive, what are the cases where DataFold is the wrong choice and they might be better served with other frameworks or in house tooling or just organizational patterns?
[00:51:35] Unknown:
Yeah. I think that DataFold is built with the modern data stack philosophy, and it's also optimized to integrate seamlessly with modern warehouses such as Redshift, BigQuery, Snowflake, and modern data lake systems like Presto and Spark. It is probably gonna be an uphill battle to use a system like DataFold with a more legacy data stack based on, let's say, older systems that are based on Hadoop, Hive, and even more kind of proprietary data frameworks. And if your organization is in the process of either establishing your data stack from scratch, so you're still in the process of setting up a data warehouse and BI tools, the more fundamental blocks of the stack, or you're in the process of migrating from legacy systems to the modern data stack, it's probably too early for you to adopt DataFold because in the hierarchy of needs, DataFold will not be able to solve your immediate challenges.
And I think the second group of use cases is for companies that are not necessarily data driven. So the importance that they give to analytics is not as high. DataFold probably also won't be able to, you know, bring lots of value, because our value proposition is to help ensure data reliability and data quality. So if that's not the topmost priority, we won't be able to naturally generate a lot of impact. And I think, finally, there has to be a mandate for change and improvement that exists in the organization.
So if there is a status quo where the data is broken and everyone is fine with living in this painful world of broken data, but without necessarily plans or KPIs or OKRs to improve it, then, again, solutions for data quality and tools like DataFold or others probably won't be able to help much. So it's very important to have the right incentives and motivation within the organization to actually address those problems.
[00:53:42] Unknown:
And as you continue to build out DataFold and work in the space of data quality management and try to stay up to date with all of the rapid shifts in the data ecosystem, what are some of the things that you have planned for the near to medium term?
[00:53:57] Unknown:
So for the near to medium term, we are going to focus on making DataFold even more interoperable with other parts of the modern data stack. So integrating with the popular BI tools and increasing the integrations with popular ETL frameworks such as Dagster and others, basically, to be able to provide a more holistic picture into data quality, both as part of the change management process and for sort of in production autonomous data monitoring. And if I were to zoom out and think about the fast forward future, more long term plans for DataFold, what I would really want to happen is for us to be able to automate 80% of what the current data prep or analytics engineering workflow is today.
Because if you look at it, most of it is not a creative process. It's not writing code. It's actually dealing with simple but really painful questions of understanding your data, understanding the edge cases, understanding the data quality issues, or fixing data quality issues. It's reading the code to understand dependencies. And so through providing better observability, we can not only solve data quality, but we can also accelerate the entire workflow of building data products. And, ultimately, I think that we can go as far as not only helping teams to ensure the quality of their datasets, but even to create high quality datasets in the first place. Because as a data observability tool, we are uniquely positioned to collect and process very valuable metadata that basically gives us an understanding of how data is linked, how it's produced, how it's consumed, what is the semantic meaning of every single data point, which puts us in a very strong position to build lots of useful tools to really accelerate the workflows.
[00:55:55] Unknown:
Are there any other aspects of the work that you're doing at DataFold or the overall space of data quality management and strategies for being proactive in preventing data quality issues that we didn't discuss yet that you'd like to cover before we close out the show? I'd like to say that,
[00:56:11] Unknown:
you know, as well versed in the space as we are, we realize that data quality is a very young topic and young space overall, both in terms of tools, but even in terms of understanding of what are the approaches and solutions to solving these problems. And so I think 1 of the key ways we, as data practitioners, can contribute to solving that and helping each other is through sharing the knowledge. And we at DataFold, and I personally, have been hosting the Data Quality Meetup, which is a quarterly online gathering for data practitioners to discuss the best ways, tools, and solutions for data quality management.
And so we invite everyone to both contribute with lightning talks. So tell us about the ways in which you have tackled data quality problems in your organization, or what are the cool tools or frameworks that you've built or extended to help solve these problems, and also to just come and learn and disseminate the knowledge within your organization.
[00:57:19] Unknown:
And if you don't already have it, it would probably be interesting to add data quality war stories, where you have a sequence of lightning talks about all the things that went wrong and ways that you failed, because it's always fun hearing about some of the non obvious ways that things can go wrong.
[00:57:34] Unknown:
Yes. Absolutely.
[00:57:35] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:49] Unknown:
Part of me wants to, you know, talk about more data quality tooling and testing. But this is, I think, less interesting because it's on our road map. We're gonna build this. It's gonna be great, and it's gonna be very helpful. So let me talk about something that I don't think we're gonna build, but that probably needs to be built. It doesn't make sense that building fundamental datasets like star schemas takes so much time and effort, basically, just to piece together raw data into slightly more usable representations of business entities. I think that this process is ripe for more automation, which should come from really deep understanding of how the data works, from maybe semantic or graph technologies that would help connect the, you know, dozens and hundreds of disparate data sources, events, OLTP sources, third party vendors into a more cohesive view of the data.
And we sort of scratch this area with customer data platforms, right, that kind of give you the unified view of the customer. But the pitfall, I think, those tools fell into was focusing too much on marketing and using this data for marketing automation. Whereas I think that similar approaches to unifying the data views can be used across your entire data stack to build star schemas, to build machine learning feature sets, and ultimately to make building data products easier. So to whoever could make sense of my fairly high level desire or proposal, if you think that'd be exciting to build, reach out to me. I'd love to brainstorm and discuss it. Yeah. It's definitely an interesting proposition
[00:59:32] Unknown:
and 1 that I can wholeheartedly agree with, that there's a lot of time and effort that goes into data modeling that could potentially be automated, particularly with the progression that we've made with semantic graph technologies and being able to do entity extraction and entity resolution. So definitely an interesting thing to think about. So, definitely, if anybody's working on that, reach out to me too. I'd love to talk about it. Awesome. So thank you again for taking the time today to join me and share the work that you've been doing at DataFold and your insights and experience on how to be more proactive about data quality management. It's definitely a very interesting and relevant and necessary space. So I appreciate all of the time and effort you're putting into it, and I hope you enjoy the rest of your day. Thank you so much, Tobias, for inviting me to the show. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Gleb's Background and Journey in Data Engineering
Biggest Factors Contributing to Data Quality Issues
Applying Software Quality Practices to Data Quality
Tooling and Platforms Impact on Data Quality
Design and Features of DataFold
Managing Data Quality Across Different Data Sources
Organizational Responsibility for Data Quality
Lessons Learned from Building DataFold
Challenges in Building and Growing DataFold
Future Plans for DataFold and Data Quality Management
Closing Thoughts and Call to Action