Summary
Data quality is a concern that has been gaining attention alongside the rising importance of analytics for business success. Many solutions rely on hand-coded rules for catching known bugs, or statistical analysis of records to detect anomalies retroactively. While those are useful tools, it is far better to prevent data errors before they become an outsized issue. In this episode Gleb Mezhanskiy shares some strategies for adding quality checks at every stage of your development and deployment workflow to identify and fix problematic changes to your data before they get to production.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
- Your host is Tobias Macey and today I’m interviewing Gleb Mezhanskiy about strategies for proactive data quality management and his work at Datafold to help provide tools for implementing them
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what you are building at Datafold and the story behind it?
- What are the biggest factors that you see contributing to data quality issues?
- How are teams identifying and addressing those failures?
- How does the data platform architecture impact the potential for introducing quality problems?
- What are some of the potential risks or consequences of introducing errors in data processing?
- How can organizations shift to being proactive in their data quality management?
- How much of a role does tooling play in addressing the introduction and remediation of data quality problems?
- Can you describe how Datafold is designed and architected to allow for proactive management of data quality?
- What are some of the original goals and assumptions about how to empower teams to improve data quality that have been challenged or changed as you have worked through building Datafold?
- What is the workflow for an individual or team who is using Datafold as part of their data pipeline and platform development?
- What are the organizational patterns that you have found to be most conducive to proactive data quality management?
- Who is responsible for identifying and addressing quality issues?
- What are the most interesting, innovative, or unexpected ways that you have seen Datafold used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datafold?
- When is Datafold the wrong choice?
- What do you have planned for the future of Datafold?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Datafold
- Autodesk
- Airflow
- Spark
- Looker
- Amundsen
- dbt
- Dagster
- Change Data Capture
- Delta Lake
- Trino
- Presto
- Parquet
- Data Quality Meetup
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Special Guest: Gleb Mezhanskiy.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy analytics in the cloud. Their comprehensive data-level security, auditing, and de-identification features eliminate the need for time-consuming manual processes, and their focus on data and compliance team collaboration empowers you to deliver quick and valuable analytics on the most sensitive data to unlock the full potential of your cloud data platforms.
Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta. That's I-M-M-U-T-A. RudderStack is the smart customer data pipeline. Easily build pipelines connecting your whole customer data stack, then make them smarter by ingesting and activating enriched data from your warehouse, enabling identity stitching and advanced use cases like lead scoring and in-app personalization. Start building a smarter customer data pipeline today. Sign up for free at dataengineeringpodcast.com/rudder. Your host is Tobias Macey. And today, I'm interviewing Gleb Mezhanskiy about strategies for proactive data quality management and his work at DataFold to help provide tools for implementing them. So, Gleb, can you start by introducing yourself? Thanks, Tobias, for having me. My name is Gleb. I'm currently CEO and cofounder of DataFold. We're building a data observability platform
[00:02:30] Unknown:
that helps data teams build data products faster and with higher confidence. Before building DataFold, I was a data practitioner and was doing data engineering, data science, analytics. So a lot of what we're building is informed by my personal experiences and pain points. And do you remember how you first got involved in the area of data management? Yeah. That was back in 2014. I joined Autodesk's newly formed consumer group that had a portfolio of over 20 B2C creativity tools. As a 1 man data platform, I was then tasked to centralize all analytics around this portfolio of 20 apps.
And the great part about that was that, you know, as a 1 man data platform, I got to choose all the tools I wanted and put together my stack. But it's also worth mentioning that although it was not so long ago, in 2014, we lived in a pretty much completely different world, data tools wise. So Airflow wasn't released yet. I think Looker and Snowflake had just raised their series B. Spark was bleeding edge, just the first release. And so, tooling wise, I think the approaches that we used back then were quite different from what is mainstream today.
[00:03:45] Unknown:
Yeah. It's definitely always crazy to look back at some of the timelines because the overall data tooling space has been moving so fast that if you look at what's out there today, it's just impossible to even remember what came out when and how long ago it was available. Because as you said, you know, 2014, that's only 7 years ago, but it's a complete lifetime and an entire paradigm shift away from where we are right now with the overall data landscape. Yeah. Absolutely. And a huge shift in the problem space as well. So I think what's top of mind for data teams today is a very different set of problems than what we were facing back then. Yeah. I think at that point, it was still just a matter of, I need to get this data from here to over there, and I need to make sure that it doesn't error out halfway through. And now we're sort of moving up the pyramid of, you know, the hierarchy of needs to where, you know, data observability is actually 1 of the concerns for data teams now that wasn't even on the table 7 years ago. Exactly.
And so in terms of what you're building at DataFold, can you give a bit of a background and overview about what it is that you're creating and some of the story behind what motivated you to launch this company? Yeah. Absolutely. I think to tell the story of DataFold, I should also
[00:04:55] Unknown:
give a little bit of background on my path in data engineering. So after starting, you know, building the data platform at Autodesk, I then moved to Lyft where, at the time when I joined, we had a 15 person data team that over the course of the next 3 years grew to over a 300 person org. And so with almost, you know, 20 x expansion of the team, exponential growth of the business, and, of course, the data volume and complexity, that all created tremendous pressure on infrastructure and tooling. And so I initially started as a data analyst building data products such as BI reports and forecasting machine learning models. And I very quickly realized that the available tooling was really not suited to tackle the problems that were rapidly emerging due to the growth of the team and the data. So I switched my focus from building data products to building tools that enabled data developers and data scientists to build those products, because the complexity of the data, the reliability of the data, and the speed of development that we had in the data team were quickly becoming bottlenecks for the business growth.
And so 1 of the key, I guess, pivotal moments for me to start focusing on tooling was when, as a data engineer being on call, so basically responsible for taking care of all incidents, I had to ship a very small incremental change to 1 of the core jobs that were building analytical datasets. And I made just a very tiny change, about 4 lines of SQL. And I did some testing. I got a code review from my teammates, and I shipped it. I merged it, rebuilt the entire DAG. And the next day, we discovered that there was a huge data incident going on. So, basically, all analytics was stopped because it was apparent that a huge portion of the data was missing.
And what was crazy is that it took us about 6 hours to realize that that data incident was related to the change that I made the previous night. And even for me, the person who made the change, it wasn't at all apparent that the data incident was related to it. Right? And the most scary part is that I followed the process that existed and I used the tools that existed, but even still, I was able to make such a mistake that led to a really bad outcome for the business. So it took us the full next day to clean it up and to relaunch all the processes and to get all the data pipelines back on track. And so the realization that 1 person, you know, making a small change can bring down, you know, the entire platform at a large company with huge business impact was 1 of the pivotal moments for me to start focusing on building tools, first internally at Lyft and then eventually starting DataFold to help solve these problems for everyone.
So back at Lyft, just to give you kind of a sense of what we were building, we built a framework on top of Airflow that enhanced the developer experience and helped build more testable pipelines. We also built real time anomaly detection based on Apache Flink. We also built an early version of a data catalog that was the predecessor to Amundsen, which is now open source. And so all of these projects really impacted how the entire data org was building data products. And then the realization was: if that's something that Lyft needs at its scale, the rest of the data community likely suffers from the same issues, but few have the resources to build so many different tools in house, even though they also need them. So that's kind of my personal experience that led to the creation of DataFold.
And I guess the macro reason or the bigger why of why I decided to start a company building a data observability tool and data quality tooling is that, obviously, data is eating the world. And I think we're just at the beginning of seeing how data products disrupt industries, all industries in the world. And we kind of started talking about that at the beginning, of how the data environment is different right now from, let's say, 7 or 10 years ago. And so over the last 5 to 7 years, we really solved a lot of fundamental problems of how we store data, how we collect it. We now have really fast, almost limitless, scalable databases. Right? We have great BI tools, visualization capabilities. We have ML infra.
But the problem that emerged over the last few years is, now that companies have limitless capacity to accumulate and produce data, how do we deal with this complexity? How do we tackle the problems of data quality when we're dealing with, you know, tens of thousands of tables and millions of columns at an average size company? And so I think that by solving those problems for the people, for data teams who are using and developing data day to day, we can then make a really huge impact on the world in general. So that's the bigger why behind DataFold.
[00:10:25] Unknown:
Yeah. It's definitely a huge problem that a lot of teams are dealing with, because there's this explosion in tooling, and it's moving so fast that it's hard to keep up with. And so you're just trying to build systems and keep them moving and deal with all the different data sources. And now that data integration is a lot easier with tools like Fivetran or the Stitch ecosystem, where anybody can say, oh, I'm going to connect this data source into my data warehouse, you know, now data teams are just being completely swamped. And so even just keeping track of what data exists has become an entire tooling problem, and, you know, entire companies are being launched just on that 1 problem. So the fact that managing the quality of any 1 of those data sources can have, you know, such an outsized impact, as you mentioned, with just changing 4 lines of SQL destroying the entire productivity of the company for a day, is, you know, definitely a huge financial burden, particularly for companies that aren't set up to handle it. And so in terms of those data quality problems, I'm curious what you see as being the biggest factors that will actually contribute to incidents of, you know, quality problems or pipeline failures or, you know, some of these outsized impacts that can happen from a small change?
[00:11:37] Unknown:
Yeah. Absolutely. I think data quality is right now becoming as big of a problem space as software quality. So it's enormous. And I don't think there is any single, you know, framework or tool or solution to really solving it for even a not very large company. And so I think with such big problems, it's always helpful to try to break them down into a few dimensions, then it becomes more manageable. And 1 way to look at the data quality problems is to look at what are the sources of those problems. So 1 is obviously operational issues. Right? Say, our data producing jobs are delayed, infrastructure failures, there are errors.
There is certain queuing in the system, so data is not available. It's not computed. I think this problem space is more understood right now and probably easier to manage given the maturity of infrastructure. Another failure scenario is when the data that we rely on changes. So examples of that could be vendors that we use to ingest the data not complying with expectations. Other teams making changes to their data sources and causing impact. Or there can also be change in the business. So fundamental changes in the world that also get reflected in the data. And the 3rd big category of data quality problems arises from us, data engineers, data developers, making changes to our data products. So making changes to the code that processes data, be that SQL, Python, Scala, or other frameworks, changes to the business logic that exists in business intelligence tools. So right now, many of those tools also contain a lot of logic in terms of how data is computed and presented, changes to ML model definitions. So right now, data driven companies, so companies that are really relying on data for making decisions by humans and by machines, they typically deal with code bases that are used to process data, which are comparable in size to their actual software products. So it's a tremendous amount of complexity, probably tens of thousands or hundreds of thousands of lines of code. And it's also very rapidly evolving. Right? Because to be really data driven, we not only need to build out all the infrastructure and all the models and the star schema. We also have to rapidly iterate on it to keep up with the business demands and with the new challenges that the growth goals pose.
And within that framework, right, operational issues, changes to the data, and changes to the data processing code, I think the latter area right now is probably the least studied and the least understood. And I think it's somewhat natural. Right? Because the first step when you are dealing with data quality issues is think, well, I wanna at least know whenever they happen. But I think to really tackle the problem, we'd have to pay closer attention to how do we work with the data, what is our development process, what is our change management process. And then for solving that and solidifying that, we can then achieve better
[00:14:50] Unknown:
data quality. To your point with the parallels about the software quality space, there are a lot of tools like linters and unit tests and, you know, static code analysis for both potential bugs and security implications. And because of the fact that data platforms are not a static system, there is no single point in time snapshot that is going to accurately represent the entire system as there can be with code, although with code, it gets complex as well. How are we able to map some of those same concepts from software quality management into the data space and deal with the sort of dynamism of real world data cleanliness issues and how they impact the actual systems that are dealing with processing them, to create these preventative maintenance systems?
[00:15:37] Unknown:
So I think that 1 of the bigger trends that we see right now in the data world is the application of what are now considered standard software development practices to the data workflow. And some of the ways in which we can solidify our data development process are to, 1, bring version control. Right? So version control of everything, starting with your ETL code that ingests data and processes data. Also, version control for things like even BI dashboards. Because if we think about this, in the current world where companies have entire meetings structured around dashboards to make decisions about investments or about killing or doubling down on a particular feature, the stakes of making the wrong decision based on data are really high. And so all data products, no matter whether they are executive facing or, you know, going into production, should have version control, because that enables, 1, very clear reproducibility of whatever is the state of this code. It also enables a very clean and visible change management process because we can cleanly delineate between the previous version and the new version.
And it also allows for more seamless collaboration between the teams. Because right now, data products are not built by a single person or even a single team. We have probably dozens of people collaborating on every one, especially at larger companies. I think the second important aspect of the development process that we're seeing coming from software engineering into the data world is having good visibility of changes. So whenever we are making a change to, let's say, a pipeline that transforms data, we have to really understand what this change entails, both for us as a team and also for our downstream stakeholders and consumers. And so in the software world, we are typically doing this through regression testing. Right? So we are running unit tests. We're running regression tests. We are, potentially, in the world of microservices, exposing a tiny bit of traffic to the new service and observing it and seeing what happens.
And so in the data space, we now also have similar frameworks, for example, assertions, such as validating that a given column on a dataset is unique or not null. These are very helpful to validate business assumptions about the data and can run both during the development process and in production. But I think 1 of the still missing aspects that we are trying to close in this development process, and in particular in having the visibility into the changes, is understanding, like, what is the full impact analysis of the change that I'm making? And that can go into simple questions such as, what is the number of rows that a particular dataset will produce? Will I have any drifts in the features?
Am I going to break any dashboards because I may have removed a column or renamed a column? And so having this visibility is really paramount for a reliable change management process. And then the 3rd component that I think is also rapidly making its way into the data world is continuous integration and continuous deployment. And similarly to how it helps software teams be more agile, make smaller incremental changes, and then ship them faster in a reliable way, in the data world, we see almost a renaissance of CI, where we see data teams investing in automated testing procedures.
So for example, whenever someone checks in code that transforms the data or even controls the layout of a dashboard, there is an automatic process that runs tests, maybe builds a staging dataset, and then even maybe automatically merges this code and deploys it to the ETL orchestrator. So that really helps make sure that whatever is the change management process, it's not only available to people, but it's also automatically enforced. Right? And there's no change that bypasses certain testing, which is required.
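As a rough illustration of the kind of assertion described above — a check that a column is unique and not null, run automatically against a staging dataset before a change is merged — here is a minimal Python sketch. The table and column names, the use of SQLite as a stand-in warehouse, and the idea of failing a CI build on violations are assumptions for the example, not the API of any particular tool mentioned in this episode.

```python
import sqlite3


def check_unique_not_null(conn, table, column):
    """Assert that `column` in `table` has no NULLs and no duplicate values.

    Returns a dict of failure counts; an empty dict means the check passed.
    This mirrors the kind of assertion a CI job could run against a staging
    copy of a dataset before the change that built it is merged.
    """
    failures = {}

    null_count = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()[0]
    if null_count:
        failures["null_values"] = null_count

    dup_count = conn.execute(
        f"""
        SELECT COUNT(*) FROM (
            SELECT {column} FROM {table}
            WHERE {column} IS NOT NULL
            GROUP BY {column}
            HAVING COUNT(*) > 1
        )
        """
    ).fetchone()[0]
    if dup_count:
        failures["duplicate_values"] = dup_count

    return failures


if __name__ == "__main__":
    # Tiny in-memory example standing in for a staging dataset built by CI.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders_staging (order_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders_staging VALUES (?, ?)",
        [(1, 9.99), (2, 14.50), (2, 14.50), (None, 3.00)],
    )

    result = check_unique_not_null(conn, "orders_staging", "order_id")
    if result:
        # In a real CI pipeline this would fail the build and block the merge.
        raise SystemExit(f"Data assertions failed: {result}")
    print("All assertions passed")
```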
[00:19:47] Unknown:
As far as the tooling and the platforms and the sort of impact that they can have on data quality issues, what are some of the ways that they can contribute to the occurrence of data quality issues as far as the systems that you're building, the way that your data platform is architected, and some of the design considerations that teams should be thinking about as they're planning out their data platform or as they're starting to think about introducing new systems or new processes?
[00:20:16] Unknown:
So as a big believer in great workflows, I think that the best way tools can support reliable data and help data teams ensure high data quality is to really facilitate those strong workflows. And to give you an example, we talked about version control. We talked about testing and CI. So we see that certain tools that we now consider part of the modern data stack, for example, dbt for SQL transformations or tools like Dagster for general purpose data pipelines and tasks, they come with those features and frameworks already built in. So they already facilitate version control. They have built in testing frameworks that make it really easy for data developers to write tests and run them as part of the pipeline.
And documentation frameworks that help both keep documentation close to the code, which is always great, but also serve that documentation in a nice UI that can be consumed by, not necessarily the data developers, but data users. And very importantly, they have separate production and staging and development environments. That also is a very important concept for making sure that the change management process is reliable.
[00:21:32] Unknown:
As far as the potential consequences, we have addressed some of that where, you know, if you have a wrong column or the data is old, it can potentially lead to costly decisions that end up being based on incorrect assumptions around the data that's available. And so how can organizations start to shift to being more proactive in their data quality management and start to instill the understanding at the business level that it's worth the investment and the time and energy that it takes the engineering team to create these systems for proactive management, and also how to instill the sort of level of care and diligence that's needed?
[00:22:25] Unknown:
So I think probably the first step that's important in an organization is, you know, recognizing that there is a problem and getting buy in to solve it. Unfortunately, we still see that some teams, you know, live with data quality issues as a status quo. Right? And so we have to recognize that there's a problem and that we are able to improve it. I think the second important aspect is understanding what are the root causes of the issues. So probably trying to classify them and see what are the areas that are most risky and most impactful.
And, again, I'd like to emphasize proactive data quality management through improving the development process over more, like, post factum monitoring. Because as appealing as the idea of post factum data monitoring is, the kind of black box solutions that just tell me when my data is wrong, it's quite hard to rely on that alone to improve data quality. Because by the time that you identify that there is an issue in production, the damage is already done. Right? So the stakeholders probably already looked at the dashboards showing wrong information. Machine learning models already ingested the wrong data and skewed their results.
Another problem is that by the time the data is already in production, it can be really hard to identify the root cause, because with multistage data pipelines, corrupted data propagates really fast and it becomes ubiquitous. And the other aspect, which is more organizational, is that with data quality issues that are already in production, to fix them, you have to fight the organizational momentum. Right? You have to advocate for people to stop whatever they're doing and go back and fix them, as opposed to working on the new things, which is always an uphill battle. That's why I strongly advocate for data teams and companies to really look into the preventative ways to address data quality, because then all of those issues are taken care of. And so in terms of how to think about improving the process, I think an important aspect is to understand what are the current inefficiencies of the process. So is the bottleneck in the ability to ship, let's say, data? Right? Do the teams need better frameworks for shipping data products faster? So sometimes a team would need to, let's say, switch to a more agile framework like dbt, which comes with a lot of the data quality toolkit features already.
Right? But assuming that the basic infrastructure and tooling is already in place, I would start with planning out the change management process. So what are the steps that are required in order to make a change to a data product, be that, you know, a SQL job or a BI dashboard, and then introducing visibility tools. So how can we make sure that those tasks are executed, that we have a full understanding of the changes that we're making, and then making sure that these processes are enforced.
[00:25:34] Unknown:
As far as what you're building at DataFold, I'm wondering if you can talk through some of the design and features that you are building in and some of the architectural aspects of the system that allow it to enable some of this proactive data quality management of finding and fixing, you know, data quality issues and data bugs before they actually go out into a production context?
[00:25:58] Unknown:
Yeah. Absolutely. So we call DataFold a data observability platform. And by observability, we mean that we help data teams discover and understand their data, how it works, what the distribution of the data is, where it comes from, where it goes, and also verify and test it. And so while there are multiple features that I won't go into in detail right now, the really key pieces of the platform that help enable a reliable change management process are Datadiff and the column level lineage engine. So Datadiff is a tool that analyzes changes in the data and provides a visual report across multiple dimensions and with various degrees of granularity.
So you can think of it as git diff for data or, you know, a Microsoft Word diff, but for your datasets. So whenever you want to compare 2 datasets, it gives you a view into how they are different, both in terms of individual rows and also on a statistical level in terms of the distributions. And so how does Datadiff fit into those workflows that we discussed? For 1, it helps you automate regression testing because you can compare the before and after state of your data product. For example, you can compare the production version of your dataset with the development version of the dataset built with the new code that you're about to merge. And so that helps you answer questions such as, what is going to happen to the data? Are there any unintended changes to, you know, the number of rows, the percentage of nulls? Are we going to cause feature drifts by changing distributions of particular dimensions?
Are we going to cause BI tools to fail because we renamed or misplaced columns? So Datadiff helps answer those questions without writing any SQL or without doing any manual checks. And the way it fits into the workflow is essentially automating what most teams do right now, but manually. So we spoke to some really senior data engineers at public companies to learn that sometimes they spend up to a week testing a single change to a really important SQL job if that job, for example, powers the financial reporting, because the stakes of making a regression are super high. And the majority of the time in that week goes into writing arbitrary ad hoc SQL queries that are essentially comparing things and validating things to make sure that there are no regressions. So Datadiff essentially takes out that manual part of the work.
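For a rough sense of what those ad hoc comparison queries look like when they are automated, here is a minimal Python sketch that diffs a production table against a development build of the same table: row counts, per-column null counts, and primary keys present on only one side. The table and column names are made up for the example, SQLite stands in for the warehouse, and a real diff tool goes much further (statistical distributions, value-level diffs); this is not Datadiff's implementation.

```python
import sqlite3


def diff_tables(conn, prod_table, dev_table, key, columns):
    """Compare a production table against a dev build of the same table.

    Reports row counts, per-column null counts, and primary keys that exist
    on only one side -- roughly the checks an engineer would otherwise write
    as ad hoc SQL before merging a change.
    """
    report = {}
    for label, table in (("prod", prod_table), ("dev", dev_table)):
        rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        nulls = {
            col: conn.execute(
                f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL"
            ).fetchone()[0]
            for col in columns
        }
        report[label] = {"rows": rows, "nulls": nulls}

    # Keys that appear in one table but not the other.
    report["keys_only_in_prod"] = conn.execute(
        f"SELECT COUNT(*) FROM {prod_table} p "
        f"WHERE NOT EXISTS (SELECT 1 FROM {dev_table} d WHERE d.{key} = p.{key})"
    ).fetchone()[0]
    report["keys_only_in_dev"] = conn.execute(
        f"SELECT COUNT(*) FROM {dev_table} d "
        f"WHERE NOT EXISTS (SELECT 1 FROM {prod_table} p WHERE p.{key} = d.{key})"
    ).fetchone()[0]
    return report


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rides_prod (ride_id INTEGER, fare REAL)")
    conn.execute("CREATE TABLE rides_dev (ride_id INTEGER, fare REAL)")
    conn.executemany("INSERT INTO rides_prod VALUES (?, ?)",
                     [(1, 12.0), (2, 7.5), (3, 20.0)])
    conn.executemany("INSERT INTO rides_dev VALUES (?, ?)",
                     [(1, 12.0), (2, None)])

    print(diff_tables(conn, "rides_prod", "rides_dev", "ride_id",
                      ["ride_id", "fare"]))
```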
And then aside from testing the regressions between production and development, data diff can also be helpful in identifying drifts in the data in production because we can compare state of the data today versus yesterday or, let's say, after a job run versus before job run and identify any anomalies. So are there any unexpected consequences? So that's more of an autonomous anomaly detection piece. But back to the development workflow, like I said, the second component is column level lineage. So what is lineage? It's essentially an interactive map of the dependencies in your data ecosystem that essentially shows you for a given column where does the data go and where it comes from. So if we look at a particular dashboard, we can immediately answer a question.
So for a given metric, how is it computed, and what are the columns that are feeding the data into this metric? And we can see that, for example, a particular column is a combination of 2 upstream columns. It's some operator or it's a CASE WHEN statement. So we can trace those dependencies up and down. And while there are multiple uses for column level lineage, the 1 that's relevant for a reliable change management process is doing the impact analysis. Right? So whenever we are changing, let's say, a SQL job and we have the data diff that shows us what is the impact on a particular table, the next thing we can do with column level lineage is understand what are the potential downstream consequences that we haven't accounted for of making a given change. For example, if we change the definition of a given metric, for example, conversion, with column level lineage, we can immediately identify all the downstream jobs, all the dashboards, all machine learning models that are using this metric.
So we can, 1, potentially do impact analysis there, or we can also proactively reach out to stakeholders, to owners of those data products and data users and tell them about the anticipated change. So together, these 2 tools facilitate the full understanding of the impact you're making when you're introducing changes to the data processing code. And through that, we can dramatically reduce the chance of making errors and also save a lot of time for data developers that otherwise would go into manual testing.
[00:31:02] Unknown:
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14 day trial.
Another interesting element of the sort of data quality question is that with particularly organizations that have their own in house software teams, a lot of the data is going to be coming from operational database systems that are owned and managed by a team that is distinct from the data team and and that has their own priorities and their own release cadences and their own ideas about what database design should be and how to evolve it. And then there are also things like customer event tracking where you have a tracking pixel or, you know, set of JavaScript on a website that is going to have some event schema that's coming in. And so then you have to deal with pulling those events in and, you know, convert them into a database table and deal with downstream transformations there. And, you know, not even factoring in the 3rd party SaaS platform data that you need to pull in. You're just within the scope of data sources that are within the entire control of your organization, but not necessarily owned by the data team. How do you sort of popularize or build an organizational contract between the different stakeholders and data owners about how to manage change propagation through the different systems, you know, maybe starting in software systems or event tracking to, you know, how that impacts the business dashboard that your CEO is looking at tomorrow?
[00:33:01] Unknown:
Yeah. Absolutely. It's a huge problem, and it's typically a big pain point for every company that we spoke with that is really data driven and building lots of data products. I think the first step is, again, to acknowledge and to say that the change management process for data sources, be that events or operational data stores that are copied to your warehouse, should also be reliable, and to equip the teams that are owning those sources with full visibility into the impact of the changes that they are making. And then in the world of event tracking, we are seeing the emergence of tools that are specifically focused on reliable definition and change management of those event schemas. So they are called instrumentation trackers or schema planners.
So the idea of those tools is that you have a central repository for defining events. So what is an event? What are the properties that are sent alongside the events? And then whenever engineers implement those events, there is an automatic validation against the spec to ensure that both during development and in production, whatever instrumentation generates, whatever data comes out of those sources as part of the tracking, it conforms to the original spec. And all the changes are also version controlled, and all the data developers who use those events, the data consumers, and the engineers who instrument those events are all on the same page.
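To make the idea of validating instrumentation against a central spec concrete, here is a small sketch using the open source jsonschema library, assuming events are emitted as JSON. The event name, its properties, and the notion of keeping the spec in a version-controlled repository are illustrative assumptions, not the API of any particular schema planner mentioned here.

```python
from jsonschema import ValidationError, validate

# A centrally defined, version-controlled spec for one tracking event.
SIGNUP_COMPLETED_SPEC = {
    "type": "object",
    "properties": {
        "user_id": {"type": "string"},
        "plan": {"type": "string", "enum": ["free", "pro", "enterprise"]},
        "signup_ts": {"type": "string"},
    },
    "required": ["user_id", "plan", "signup_ts"],
    "additionalProperties": False,
}


def validate_event(event: dict) -> None:
    """Raise if the event payload does not conform to the agreed spec.

    The same check can run in the client SDK during development and in the
    ingestion pipeline in production, so producers and consumers stay in sync.
    """
    validate(instance=event, schema=SIGNUP_COMPLETED_SPEC)


if __name__ == "__main__":
    validate_event({"user_id": "u_123", "plan": "pro",
                    "signup_ts": "2021-06-01T12:00:00Z"})
    try:
        # Wrong enum value and missing required field: rejected before it
        # ever reaches the warehouse.
        validate_event({"user_id": "u_123", "plan": "gold"})
    except ValidationError as err:
        print(f"Rejected event: {err.message}")
```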
I think, speaking about kind of the interoperability of the tools and how we can piece them together, those tools mostly see just the world up until the point where those events land in the warehouse. And this is where a tool like DataFold can come in, because we have the visibility all the way from the raw event sources to the ultimate data consumers. So by plugging these tools together, you can also ensure a reliable change management process for those sources. And as far as the operational stores that are oftentimes copied using change data capture into warehouses, this is a somewhat more complex problem because it's a fairly kind of low level infrastructure process to copy the data from operational stores, and there is a big amount of variability in terms of how companies implement it. So some use vendors, some use open source CDC methods, some use batch copies.
And so whatever it is that the team is using, I think the key part is to, again, make sure that before any change is made to the original source or to the source schema, there is an impact analysis performed that clearly shows what is gonna be the impact of the change. Because sometimes you can remove a column and no 1 cares. And sometimes you change a slight definition and there's a huge data incident. So understanding the difference between these 2 scenarios is key. Again, I think column level lineage is the fundamental instrument and source of information for that, but how exactly it plugs in to the change management process for operational data stores highly depends on how the company implements it. To that point too of column level lineage, a lot of systems will look at that from the data warehouse perspective.
[00:36:22] Unknown:
But it's definitely an interesting question to think about, how can we propagate some of that information and extend the visibility of these data tooling systems into the operational stores and the applications so that it becomes part of the application development life cycle to be able to view and analyze the downstream impacts and not just have that be a responsibility of the data engineers and data analysts?
[00:36:46] Unknown:
Absolutely. I think the cool thing is that with the emergence of the ELT pattern, we shifted from doing a lot of in flight transformation of all data before it lands in the warehouse to the pattern of doing 1 to 1 copies of whatever is in your operational stores. So I think the prevalent pattern right now is to copy your entire schema from the transactional store such as Postgres or MySQL into your warehouse as is. And so if that is the case, then having lineage in your warehouse that shows you downstream usage of those copies effectively can be translated to the ultimate raw sources in your operational store, which makes the entire visibility pipeline much easier.
But if you have more complex scenarios, then, basically, there is also an option to extend your lineage graph to those sources, but that increases complexity massively.
[00:37:45] Unknown:
For organizations that aren't necessarily using a cloud data warehouse and are more in sort of the data lake paradigm, where they have data in S3 in Parquet format and they're dealing with partitioned datasets there. And, you know, they might be using Trino or Presto on top of it, or they're using Delta Lake or Hudi or, you know, the plethora of tools that are arising in that space. What additional challenges or complexities does that pose to, you know, systems like what you're building with DataFold to be able to add the level of insight and introspection that's necessary, that is, you know, relatively straightforward in a vertically integrated data warehouse stack, but is not necessarily as sort of cohesive in these data lake environments?
[00:38:30] Unknown:
I think to answer it, it may be worthwhile to take a look, you know, under the hood of how column level lineage is constructed. So fundamentally, to have a reliable bottom up column level lineage map of your data ecosystem, we have to first obtain the code, so basically the DDL and DML code. So the code that defines the schema of your datasets and the recipes for how those datasets are created. And in the SQL world, that means SQL queries that are creating datasets or modifying them and SQL queries that are consuming datasets. And by then doing static analysis of that code, so decomposing it into an AST representation and then piecing it back into the global graph of dependencies, we can then understand how data is produced and how it's consumed, no matter what happens in that SQL, no matter how complex your queries are and whether you're using correlated subqueries or CASE WHEN statements or renames.
A proper column lineage engine should piece it back together, which we do at DataFold. Now if you're using a data lake approach and still relying on a SQL based engine such as Presto or Spark SQL or Hive, there fundamentally isn't more complexity than building a lineage graph for a, basically, self contained warehouse such as, you know, Redshift or BigQuery or Snowflake. It's just a matter of making sure that you collect those SQL logs. However, when it comes to other scenarios for how data is built, for example, using PySpark or Scala Spark or frameworks such as Apache Beam, where the language of how data is transformed is not SQL, that massively increases the complexity because those languages have massively more powerful syntax than SQL. And so in those scenarios, we have to either connect to the underlying fundamental representations of jobs, so taking a look at how those engines compile whatever is their domain specific language for defining those transformations into the primitive operations, and then using that to augment the graph. But in any case, that probably increases the complexity of building lineage. But as long as we stay in the SQL world, piecing back the entire lineage graph is fairly straightforward.
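As a toy illustration of that static analysis step — parsing the SQL that builds a dataset into an AST and pulling out which upstream tables and columns feed it — here is a small sketch using the open source sqlglot parser. It is not how DataFold's lineage engine is implemented; a real engine has to resolve aliases, subqueries, and SELECT * across the whole query history to assemble the global graph. The CREATE TABLE AS SELECT statement and all names in it are made up for the example.

```python
import sqlglot
from sqlglot import exp


def column_dependencies(create_sql: str):
    """Extract a crude lineage fragment from a CREATE TABLE ... AS SELECT:
    the target table, the source tables, and the (qualifier, column) pairs
    referenced anywhere in the query body."""
    tree = sqlglot.parse_one(create_sql)

    # The table being created, taken from the CREATE clause itself.
    create = tree.find(exp.Create)
    target = create.this.find(exp.Table).name

    # Tables the statement reads from (skip the table it creates).
    sources = {t.name for t in tree.find_all(exp.Table) if t.name != target}

    # Columns referenced anywhere in the SELECT body, with their qualifier.
    columns = {(c.table or "?", c.name) for c in tree.find_all(exp.Column)}
    return target, sources, columns


if __name__ == "__main__":
    sql = """
    CREATE TABLE daily_conversion AS
    SELECT s.ds,
           SUM(CASE WHEN o.status = 'completed' THEN 1 ELSE 0 END) * 1.0
               / COUNT(DISTINCT s.session_id) AS conversion
    FROM sessions s
    LEFT JOIN orders o ON o.session_id = s.session_id
    GROUP BY s.ds
    """
    target, sources, columns = column_dependencies(sql)
    print("target:", target)
    print("source tables:", sources)
    print("columns read:", columns)
```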
[00:40:58] Unknown:
How much attention are you paying to efforts such as Open Lineage to try and create more of an open standard of how to think about and represent and integrate with these lineage graphs, particularly for non SQL systems that have their own sort of custom transformation logic? And how much potential, you know, positive impact do you see with more systems starting to adopt and flesh out that standard or anything else that might be arising in the space?
[00:41:28] Unknown:
Yeah. In general, I'm a strong believer in interoperability between data tools. And I think that's 1 of the core principles of the modern data stack, that tools are increasingly specialized, but at the same time more interoperable and more modular, which allows companies to piece together the stack by choosing the tool which is best in every particular vertical. And so I think standards like Open Lineage are really important in defining how particular types of metadata are shared between the tools. And the way I think a tool like DataFold can be integrated into a larger data ecosystem using Open Lineage is by providing the fundamental lineage information. So, basically, the dependency graph that is then shared using the Open Lineage standard with other tools.
Right now, we already have integrations with data catalogs such as Amundsen and DataHub. So anyone who is using them can ingest column level lineage information from DataFold using a GraphQL API and then load it into the data catalog. I think with Open Lineage, that'll be even easier once it's adopted more widely in the ecosystem. Because once you have information in a standard, you can then reuse it across multiple tools. And like you said, you can also use this standard to piece together different sources for lineage. Right? So for example, you may use DataFold to obtain all the lineage information from your SQL warehouses, and you may then plug in the lineage graph from systems like Spark and Beam, again, using Open Lineage to construct the global graph of dependencies.
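For context on what sharing lineage through the Open Lineage standard roughly amounts to, here is a simplified sketch of a run event built as a Python dict: one job reading one dataset and writing another. The field set is abridged from the spec, the namespaces and names are made up, and real events carry additional facets (schema, column lineage, data quality metrics) and are normally emitted through an OpenLineage client rather than assembled by hand.

```python
import json
from datetime import datetime, timezone
from uuid import uuid4

# A pared-down OpenLineage run event: a transformation job that reads
# one dataset and produces another.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/my-pipeline",  # who emitted the event
    "run": {"runId": str(uuid4())},
    "job": {"namespace": "warehouse", "name": "build_daily_conversion"},
    "inputs": [{"namespace": "warehouse", "name": "raw.sessions"}],
    "outputs": [{"namespace": "warehouse", "name": "analytics.daily_conversion"}],
}

# A catalog or lineage backend would receive events like this over HTTP and
# merge the input/output edges into its global dependency graph.
print(json.dumps(event, indent=2))
```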
[00:43:11] Unknown:
Going back to the organizational aspects of data quality management, in your experience, who has typically been responsible for identifying and addressing data quality issues? And do you think that the current state of affairs is sufficient or beneficial, or do you think that there needs to be a shift in how data quality is sort of owned and operated at the organizational level?
[00:43:40] Unknown:
So I think, naturally, the responsibility for maintaining high data quality falls on the teams that own the data. And typically, that's analytics engineering or data engineering teams that have the largest surface area with the data products and therefore become responsible for the end to end data reliability. And then it's common for them to pass this responsibility on to software engineering teams. So for example, the ultimate stakeholder or user of data, such as, let's say, a financial team or analytical team, would expect the data engineering team to provide them with high quality data, and then the data engineering team would build or collaborate with other teams that are involved in the process of creating datasets to make sure that data is reliable across the entire pipeline.
I think what is currently missing is clear contracts between the teams on who's responsible and what are the ways that teams can collaborate to ensure data quality. Because, like you said, especially with the raw data sources such as operational data, which is typically owned by completely different teams, and sometimes dozens of teams if we're talking about a large company with a microservice architecture, there need to be clear contracts about who is responsible for what and how the entire process for maintaining data quality and managing change is conducted.
So I think 1 of the changes we'll see in the future also is the emergence of top level key results or KPIs at a more organizational level that will measure data reliability and data quality at an organizational scale. And then the various teams that participate in the creation of data products will be responsible for their parts, their contribution to those KPIs, and will be held accountable in a more formal setting. Whereas right now, it's a more ad hoc process where teams are more reactive to certain quality issues, and there isn't a very clear understanding of how exactly to measure or to set those goals.
[00:45:48] Unknown:
In terms of the experience that you've had building Dataflow and working with your end users and talking to people in the industry, what are some of the initial ideas or assumptions that you had about how data quality is managed, the sources of issues, you know, the organizational aspects of it that you have had to, you know, reform and that have been challenged or changed as you worked through this overall problem space and built the tooling and technologies to help support teams who are trying to improve the visibility and quality of their data?
[00:46:24] Unknown:
Yeah. So I think 1 of the interesting realizations that we had after going to market with our solution was that, initially, given our experience working at large companies and on large data teams, my assumption was that what we're building, tooling for reliable change management and testing automation and observability, would be most useful and most sought after by really large companies with complex data ecosystems and large data teams. And what we realized is that while the overall impact of bad data quality is probably indeed larger at those companies, these issues are felt by increasingly younger companies. So we've had customers as small as, you know, a 1 person data team at a post seed stage startup that already starts to feel the data quality issues. So, overall, I think that the challenges of maintaining data reliability and quality have shifted from large companies, you know, upstream, earlier in the company life cycle. That was 1 of the realizations.
So I think the second 1 was that even maybe 3 or 5 years ago, data teams, or data engineers as individual contributors, used to have much more flexibility in choosing their tools. And, you know, even back in my days of doing data engineering, there was a lot of freedom to, you know, go try this tool or that tool and kind of iterate fast on making choices there. And there was a lot of adoption of data tools. Whereas, I think, these days, because companies have become increasingly more protective of their data, given the sensitivity and the complexity of their ecosystems, the decisions of what is the data stack and what is the approach and tooling for each step in the stack are increasingly more centralized and are made higher up in the organization.
So I think those are 2 primary takeaways that we had, you know, going with DataFold to the market.
[00:48:37] Unknown:
In terms of ways that you've seen DataFold deployed, what are some of the most interesting or unexpected or innovative ways that you've seen it used?
[00:48:46] Unknown:
Yeah. So initially, we built DataFold to automate data quality testing and increase data observability. But 1 of the popular use cases that we've seen for our tooling such as column level lineage and diff was to accelerate migrations toward more modern data stack or just in general across tools. For example, if you are migrating your ETL from, let's say, a legacy warehouse to a new warehouse, no matter what they are, 1 of the most time consuming parts of that process is to validate the before and after state of data because, ultimately, your stakeholders don't wanna deal with discrepancies. Right? They want to make sure that whatever they're seeing, which is served from your new warehouse or your new ETL framework, is the same data that they used to see from your legacy system. Or if they're not seeing the same data, they want you to be able to fully explain those discrepancies.
And when we were doing migrations at Lyft, that was probably 80% of the time spent on the overall migration effort. And so what was interesting was to see Datadiff being adopted for those use cases, basically accelerating the migration through faster validation of the datasets transferred to new warehouses or to new ETL frameworks.
[00:50:11] Unknown:
And in your experience of building and growing the company, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:50:20] Unknown:
So, you know, being a data tool in 2021 with, like I said, increasing focus on data protection and security, we had to pay a lot of attention to making sure that our solution is secure. And for a very large number of customers, even larger than we expected, that meant being able to deploy our solution on premise, which for a younger company with, you know, fewer engineers, brings lots of challenges. Right? Because we have to not only maintain 1 SaaS solution that is scalable. We also have to be able to quickly deploy the entire distributed application into customer environments and do it securely, quickly, and also in a way which allows us to maintain it with minimal overhead as well. So that was, I think, 1 of the hardest technical problems that we had to tackle.
[00:51:16] Unknown:
In terms of people who are looking at DataFold and thinking about how they're going to manage their data quality and try to be more proactive instead of reactive, what are the cases where DataFold is the wrong choice and they might be better served with other frameworks or in house tooling or just organizational patterns?
[00:51:35] Unknown:
Yeah. I think that DataFold is built with the modern data stack philosophy, and it's also optimized to integrate seamlessly with modern warehouses such as Redshift, BigQuery, Snowflake, and modern data lake systems like Presto and Spark. It is probably gonna be an uphill battle to use a system like DataFold with a more legacy data stack based on, let's say, older systems that are based on Hadoop, Hive, and even more kind of proprietary data frameworks. And if your organization is in the process of either establishing your data stack from scratch, so you're still in the process of setting up a data warehouse and BI tools, the more fundamental blocks of the stack, or you're in the process of migrating from legacy systems to the modern data stack, it's probably too early for you to adopt DataFold because in the hierarchy of needs, DataFold will not be able to solve your immediate challenges.
And I think the second group of use cases is for companies that are not necessarily data driven. So the importance that they give to analytics is not as high. DataFold probably also won't be able to, you know, bring lots of value, because our value proposition is to help ensure data reliability and data quality. So if that's not the topmost priority, we won't be able to naturally generate a lot of impact. And I think, finally, there has to be a mandate for change and improvement that exists in the organization.
So if there is a status quo where the data is broken and everyone is fine with living in this painful world of broken data, but without necessarily plans or KPIs or OKRs to improve it, then, again, solutions for data quality and tools like DataFold or others probably won't be able to help much. So it's very important to have the right incentives and motivation within the organization to actually address those problems.
[00:53:42] Unknown:
And as you continue to build out DataFold and work in the space of data quality management and try to stay up to date with all of the rapid shifts in the data ecosystem, what are some of the things that you have planned for the near to medium term?
[00:53:57] Unknown:
So for the near to medium term, we are going to focus on making DataFold even more interoperable with other parts of the modern data stack. So integrating with the popular BI tools and increasing the integrations with popular ETL frameworks such as Dagster and others, basically, to be able to provide a more holistic picture into data quality, both as part of the change management process and for sort of in production autonomous data monitoring. And if I were to zoom out and think about the fast forward future, more long term plans for DataFold, what I would really want to happen is for us to be able to automate 80% of what the current data prep or analytics engineering workflow is today.
Because if you look at it, most of it is not a creative process. It's not writing code. It's actually dealing with simple but really painful questions of understanding your data, understanding the edge cases, understanding the data quality issues, or fixing data quality issues. It's reading the code to understand dependencies. And so through providing better observability, we can not only solve data quality, but we can also accelerate the entire workflow of building data products. And, ultimately, I think that we can go as far as not only helping teams to ensure the quality of their datasets, but even to create high quality datasets in the first place. Because as a data observability tool, we are uniquely positioned to collect and process very valuable metadata that basically gives us an understanding of how data is linked, how it's produced, how it's consumed, what is the semantic meaning of every single data point, which puts us in a very strong position to build lots of useful tools to really accelerate the workflows.
[00:55:55] Unknown:
Are there any other aspects of the work that you're doing at DataFold or the overall space of data quality management and strategies for being proactive in preventing data quality issues that we didn't discuss yet that you'd like to cover before we close out the show? I'd like to say that,
[00:56:11] Unknown:
you know, as well versed in the space as we are, we realize that data quality is a very young topic and young space overall, both in terms of tools, but even in terms of understanding of what are the approaches and solutions to solving these problems. And so I think 1 of the key ways we, as data practitioners, can contribute to solving that and helping each other is through sharing the knowledge. And we at DataFold, and I personally, have been hosting the Data Quality Meetup, which is a quarterly online gathering for data practitioners to discuss the best ways, tools, and solutions for data quality management.
And so we invite everyone to both contribute with lightning talks. So tell us about the ways in which you have tackled data quality problems in your organization, or what are the cool tools or frameworks that you've built or extended to help solve these problems, and also to just come and learn and disseminate the knowledge within your organization.
[00:57:19] Unknown:
And if you don't already have it, it would probably be interesting to add data quality war stories, where you have a sequence of lightning talks about all the things that went wrong and ways that you failed, because it's always fun hearing about some of the non obvious ways that things can go wrong.
[00:57:34] Unknown:
Yes. Absolutely.
[00:57:35] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:49] Unknown:
Part of me wants to, you know, talk about more data quality tooling and testing. But this is, I think, less interesting because it's on our road map. We're gonna build this. It's gonna be great, and it's gonna be very helpful. So let me talk about something that I don't think we're gonna build, but that probably needs to be built. It doesn't make sense that building fundamental datasets like star schemas takes so much time and effort, basically, just to piece together raw data into slightly more usable representations of business entities. I think that this process is ripe for more automation, which should come from really deep understanding of how the data works, from maybe semantic or graph technologies that would help connect the, you know, dozens and hundreds of disparate data sources, events, OLTP sources, third party vendors into a more cohesive view of the data.
And we sort of scratch this area with customer data platforms, right, that kind of give you the unified view of the customer. But the pitfall, I think, those tools fell into was focusing too much on marketing and using this data for marketing automation. Whereas I think that similar approaches to unifying the data views can be used across your entire data stack to build star schemas, to build machine learning feature sets, and ultimately to make building data products easier. So to whoever could make sense of my fairly high level desire or proposal, if you think that'd be exciting to build, reach out to me. I'd love to brainstorm and discuss it. Yeah. It's definitely an interesting proposition
[00:59:32] Unknown:
and 1 that I can wholeheartedly agree with, that there's a lot of time and effort that goes into data modeling that could potentially be automated, particularly with the progression that we've made with semantic graph technologies and being able to do entity extraction and entity resolution. So definitely an interesting thing to think about. So, definitely, if anybody's working on that, reach out to me too. I'd love to talk about it. Awesome. So thank you again for taking the time today to join me and share the work that you've been doing at DataFold and your insights and experience on how to be more proactive about data quality management. It's definitely a very interesting and relevant and necessary space. So I appreciate all of the time and effort you're putting into it, and I hope you enjoy the rest of your day. Thank you so much, Tobias, for inviting me to the show. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Gleb's Background and Journey in Data Engineering
Biggest Factors Contributing to Data Quality Issues
Applying Software Quality Practices to Data Quality
Tooling and Platforms Impact on Data Quality
Design and Features of DataFold
Managing Data Quality Across Different Data Sources
Organizational Responsibility for Data Quality
Lessons Learned from Building DataFold
Challenges in Building and Growing DataFold
Future Plans for DataFold and Data Quality Management
Closing Thoughts and Call to Action