Summary
One of the oldest aphorisms about data is "garbage in, garbage out", which is why the current boom in data quality solutions is no surprise. With the growth in projects, platforms, and services that aim to help you establish and maintain control of the health and reliability of your data pipelines, it can be overwhelming to stay up to date with how they all compare. In this episode Egor Gryaznov, CTO of Bigeye, joins the show to explore the landscape of data quality companies, the general strategies that they are using, and what problems they solve. He also shares how his own product is designed and the challenges that are involved in building a system to help data engineers manage the complexity of a data platform. If you are wondering how to get better control of your own pipelines and the traps to avoid, then this episode is definitely worth a listen.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Your host is Tobias Macey and today I’m interviewing Egor Gryaznov about the state of the industry for data quality management and what he is building at Bigeye.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by sharing your views on what attributes you consider when defining data quality?
- You use the term "data semantics" – can you elaborate on what that means?
- What are the driving factors that contribute to the presence or lack of data quality in an organization or data platform?
- Why do you think now is the right time to focus on data quality as an industry?
- What are you building at Bigeye and how did it get started?
- How does Bigeye help teams understand and manage their data quality?
- What is the difference between existing data quality approaches and data observability?
- What do you see as the tradeoffs for the approach that you are taking at Bigeye?
- What are the most common data quality issues that you’ve seen and what are some more interesting ones that you wouldn’t expect?
- Where do you see Bigeye fitting into the data management landscape? What are alternatives to Bigeye?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Bigeye being used?
- What are some of the most interesting homegrown approaches that you have seen?
- What have you found to be the most interesting, unexpected, or challenging lessons that you have learned while building the Bigeye platform and business?
- What are the biggest trends you’re following in data quality management?
- When is Bigeye the wrong choice?
- What do you see in store for the future of Bigeye?
Contact Info
- You can email Egor about anything data-related
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Bigeye
- Uber
- A/B Testing
- Hadoop
- MapReduce
- Apache Impala
- One Kings Lane
- Vertica
- Mode
- Tableau
- Jupyter Notebooks
- Redshift
- Snowflake
- PyTorch
- Tensorflow
- DataOps
- DevOps
- Data Catalog
- DBT
- SRE Handbook
- Article About How Uber Applied SRE Principles to Data
- SLA == Service Level Agreement
- SLO == Service Level Objective
- Dagster
- Delta Lake
- Great Expectations
- Amundsen
- Alation
- Collibra
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy analytics in the cloud. Their comprehensive data level security, auditing, and deidentification features eliminate the need for time consuming manual processes, and their focus on data and compliance team collaboration empowers you to deliver quick and valuable analytics on the most sensitive data to unlock the full potential of your cloud data platforms.
Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta. That's I-M-M-U-T-A. Your host is Tobias Macey. And today, I'm interviewing Egor Gryaznov about the state of the industry for data quality management and what he is building at Bigeye. So, Egor, can you start by introducing yourself?
[00:01:56] Unknown:
Sure. Thanks for having me, Tobias. It's a big honor to be here. This is my favorite show to listen to during my morning runs. So my name is Egor, and I am the cofounder and CTO of Bigeye. We help data engineering and analytics teams monitor the quality of their data at scale. Before Bigeye, I spent 4 and a half years at Uber working on pretty much all things data, from data warehousing infrastructure to building an in-house A/B testing and analytics product and really everything in between.
[00:02:27] Unknown:
And do you remember how you got involved in data management?
[00:02:30] Unknown:
Yeah. So the first time I started working in data was back in 2012, when Hadoop was just getting started. Most of my job consisted of writing MapReduce jobs, making that work, and just having that process be less painful. I remember when Impala was released, and that was a huge deal because it was the first major SQL-on-Hadoop project, and it would make working with Hadoop a lot easier. In terms of data warehousing, I actually got into that at a company called One Kings Lane, where I worked on setting up their data warehouse as well as all of the ETL tooling involved. That's where I really learned about the wide range of use cases that data users have, because there are so many different cases from marketing to web analytics to product analytics that all had to be answered and resolved using data in the warehouse.
And how difficult it really is to build a platform and a system that is generic enough to address all of those. Taking all of those learnings, I joined Uber back in 2014 as one of the first dedicated data engineers. My team was tasked with making the data warehouse work. Uber at that time was migrating from a Postgres replica to Vertica as their primary data warehouse, and we were maintaining and building all of the tooling around that to help scale with the company's growth. While at Uber, I got to explore really the whole data landscape, not just from the eyes of somebody building pipelines and setting up infrastructure, but also someone building products on top of the data. Sometimes that was data I didn't control, such as with the experimentation platform, where I built an analytics tool, but the data that we were using to show the metrics was often produced by the internal mobile teams or other teams within the company. And I didn't necessarily know where it was coming from or how it was being generated.
[00:04:33] Unknown:
In terms of the overall attributes of data quality, can you start by giving your views on what are the attributes that you consider when defining what data quality means, and some of the impacts that either high or low data quality can have on downstream uses of that information?
[00:04:51] Unknown:
Data quality, at the end of the day, is being able to vouch for the correctness of your data. And more so, it's being able to use the data in a meaningful way for the business. It's how fit the data is to be used. So as somebody who owns a data product, I want to make sure that my users trust the data that I am showing them in order to make the decisions that they need to make. This applies to both sides of the aisle. If you look at the data landscape, you have, on one hand, your data producers, who are your data engineers or other engineers within the company. They want to provide data for analysts and data scientists to work with, and they want those analysts and data scientists to trust their processes.
On the other side, you have data consumers who are these analysts, data scientists who want to build products, build reports, build dashboards, machine learning models that the business can use. And they want the business users to trust the results and trust that they are making the best possible decision with that data. In terms of attributes of measuring this, you have your common ones such as latency and schema. And over the years, you've had a lot of companies and surveys try to define some number of measures of data quality. But at the end of the day, I feel like if you don't have trust in the data from the users of it, then that data is not high quality, and it won't actually be used by the business.
And a large part of what users are looking for is understanding the semantics of the data.
[00:06:50] Unknown:
And in terms of the semantics, I'm wondering if you can discuss a bit more about what that means and some of the elements of the ways that the data is being used that translate into those semantics and how that might differ based on the industry or the specifics of the organization or the goals of how that data is going to be used?
[00:07:14] Unknown:
So data types, by default, are very generic, and they don't always convey the meaning of the data itself. So when you look at the schema of a table in a database, for example, you can see that it's either a string or it's a number. But you don't really understand how that will be used or what that number or that string even means. The semantics is about imbuing the data with information about how it will be used and what it actually means to the user. A couple of good examples of this are strings in a database. The string type is generic. It'll accept any number of characters, and I'm sure everyone has had cases when some mysterious characters or mysterious strings appeared in their dataset before.
But without understanding how you're going to use that column later, you won't really know what should be in there and whether or not that field is of high quality. For example, if something should always be a stock ticker, an email, or an internal identifier, that is the semantic of the column, but the actual representation of it can't really convey that. The same actually goes for numeric columns: if you have a summary table and you have a column that represents a count, it's an integer column in the database, but you know that the column can't be negative, because it's a count of something. You can never have a negative count of something.
And this is actually an interesting story: there was some COVID data that we were looking at and testing our product on, and the summary table had negative counts. Now, we found out that these were corrections in the data, but because we identified that the semantic was that this should be a count of something, we could quickly understand that by having a negative value in that field, something looked wrong, and we should investigate further.
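To make that idea concrete, here is a minimal sketch of what a semantic check like "a count column is never negative" could look like when run from Python. The table and column names (covid_daily_summary, case_count) are hypothetical, and this is only an illustration of the concept, not Bigeye's implementation.

```python
import sqlite3


def check_non_negative(conn, table: str, column: str) -> int:
    """Return the number of rows that violate the 'a count is never negative' semantic."""
    query = f"SELECT COUNT(*) FROM {table} WHERE {column} < 0"
    violations = conn.execute(query).fetchone()[0]
    if violations > 0:
        print(f"{table}.{column}: {violations} rows have a negative count -- investigate")
    return violations


# Self-contained demo with SQLite; any DB-API connection to a warehouse works the same way.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid_daily_summary (report_date TEXT, case_count INTEGER)")
conn.executemany(
    "INSERT INTO covid_daily_summary VALUES (?, ?)",
    [("2020-11-01", 120), ("2020-11-02", -4)],  # -4 is a correction row, like in the story above
)
check_non_negative(conn, "covid_daily_summary", "case_count")  # flags 1 violation
```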
[00:09:18] Unknown:
In terms of the quality of the data and the effectiveness of the semantics, what are some of the driving factors that contribute to the presence or the lack of quality in an organization or a data platform, whether that's technological or structural or based on the maturity of the overall capabilities of that organization for being able to work with data?
[00:09:43] Unknown:
I think there are a lot of factors here. And I feel that a lot of the factors aren't necessarily technical in nature; they are more organizational and more about the mindset of how you approach the data and how you approach data quality. We recently revisited the SRE book and what quality means in software, to draw the corollaries to the data world. They talk a lot about embracing risk, knowing that things will go wrong and that something could fail, and then acknowledging that and planning for those inevitable failures.
And in the data world, sometimes you just set up a pipeline and you say, great, the data's flowing, everything's going to be great. And that lack of understanding that something could go wrong in the future could lead to problems that you didn't expect. There's also a whole notion of being able to measure the quality. In software and SRE, this is pretty straightforward: if a server is up, then it's up; if the application is responding and the latency is below a certain value, then that's good. In data, it's a little bit trickier. You need to be able to define what you want to measure, how you want to measure it, and then constantly measure and monitor it for any changes.
And having that ability to define what you expect from your data and monitor it and then expose that information in a meaningful way that somebody can take action off of it is really what's going to drive the adoption of data quality within your organization. And we honestly think that the biggest barrier that most teams face to getting started and getting data quality off the ground is not being able to measure the quality in the first place.
[00:12:00] Unknown:
The overall, if you want to call it, revolution or evolution of the use of data in larger quantities and for an increasing number of purposes has been going on for the better part of the last one or two decades. But I've noticed that this year in particular, there has been an upswing in the number of tools and products and companies that are focused on data quality as a primary concern of an overall data product or data platform, and something that is being viewed as a critical component now where it may not have been before. So I'm wondering what you see as some of the drivers in the industry or in the availability of technologies that make this the particular time when the focus on data quality is coming to the fore.
[00:12:50] Unknown:
I think that the revolution in data over the last decade or so, as you mentioned, has been this huge driving factor for new technologies in the infrastructure and end user landscape. Before, when you had data, you had a bunch of flat files and you maybe loaded those into a database, and you could run queries on them. But this isn't very friendly and not very accessible to users. If you fast forward to the last 7 years, what you see is an easy way to set up the actual infrastructure. It's easy to get a data warehouse going, whether it's Redshift or Snowflake or BigQuery. You put in a credit card, you get a database. And it's fairly straightforward now to get data into it with tools like Fivetran and Segment and others that are pushing the data into one place.
Now that the data is in one place and easily accessible, it seems like tools moved directly to the use of data. So within the last 5 or 6 years, you had an explosion of BI tools and machine learning tools that you can run directly on your warehouse. You have Mode and Tableau and Jupyter Notebooks for exploration, data visualization, building reports, and getting useful insights out of the data. And on the machine learning side, you have tools like PyTorch and TensorFlow, where it's pretty easy to stand up a machine learning platform on top of the data that is now in your easy-to-set-up warehouse, whether that's Redshift, Snowflake, or whatever else.
The part that is missing in that, we can call it, hierarchy is the middle layer, which is what has really been called the DataOps landscape recently, partially as a mirror to the DevOps landscape that evolved about 15 or 20 years ago. Once you have easy data access and tools that can provide that access and expose it to more and more users, the difficult part is managing the understanding and the expectations of that data. It's hard when somebody looks at a dashboard or a report and says, something here looks wrong because I have a gut feel for the business, but I don't really understand how the data is stored or where it's coming from, so I don't have a good understanding of why this dashboard looks wrong. And today, there's no easy way to debug this other than pushing it down to the data engineer or the analyst and saying, go investigate this. And now these issues take hours of digging through SQL and pipelines to figure out what's actually going on.
This DataOps tool chain is meant to enable data users to have an easier time working with their data, whether it's understanding where it's coming from, where it's located, how to use it, or what to expect of it. And so that's why you're seeing a lot of these companies now pop up that are focused around data quality. Data cataloging is another huge space in the DataOps landscape today. And that's why this area is growing so quickly.
[00:16:23] Unknown:
Yeah. I think that your point about the availability of the tools is definitely one of the big ones. In the past 10 or 15 years, the organizations that were working with data at any sort of scale were on the leading edge: they had the staff on hand to handle all the complexities of the infrastructure and the tooling around it, and they had the sophistication to build these products, but there wasn't as much widespread education on how to use the data. And so it wasn't being as widely accessed within an organization and instead was relegated to the specialists who were dealing with the data for their particular purposes. But now we have all of these self-service platforms, and data has become more ubiquitous and has come to be a distinguishing factor in the overall success of an organization.
There are more users within the business who need to be able to access the data, and so it brings in these data education requirements, and it brings in these requirements of how to convey trust to the user in understanding how and why their data is accurate or not accurate and what that means for their own use cases for that information.
[00:17:44] Unknown:
Yeah. Definitely. And I think it's exciting to see businesses become more data driven as much of a buzzword as that is. I think using data to make business decisions is the right way to go. But you can't make good decisions unless you understand the data that is going into those decisions.
[00:18:06] Unknown:
And so as you mentioned, there are a number of businesses that have been starting up to try to provide access to information that elevates the trustworthiness of this data and brings data quality into the core workflow of producing these data assets. And one of those companies is Bigeye, which is what you're building. I'm wondering if you can give a bit of an overview of your approach to this problem and how the business got started.
[00:18:35] Unknown:
Bigeye is building an automated data monitoring platform that alerts you when your data changes and helps provide a clearer picture of what's going on in your data landscape, in order to help debug any issues quickly and understand when there are problems. The story behind Bigeye is that my cofounder, Kyle, and I met at Uber working on the experimentation analytics tool that I mentioned earlier. We ran into a lot of the problems that you would normally see in a full stack data team, where Kyle was the data scientist who wanted the results and wanted to build some dashboards, and I was the data engineer who was putting together the pipelines and getting the machinery working.
And when something would go wrong in Kyle's dashboard, the product manager would go to Kyle and say, something's wrong with the dashboard, this metric shouldn't have moved. Kyle then says, well, my statistics are alright, let me go to Egor and ask him what's actually happening in the data. And now you have this multi-day round trip of what's going on and why the dashboard looks the way that it does. That inspired us to build a lot of tooling to monitor what is going on in the pipeline, quickly expose that, quickly show which metrics are moving, and then try to figure out why before somebody comes and asks us about it.
Kyle actually went off and became the product manager for an internal team that worked on the sorts of tools that you would consider DataOps today. At Uber, there was a data quality system called Trust and a data catalog called Databook; there are some blog posts about those. Those tools were an extension of what we had felt working together on the experimentation team, applied to the rest of the company. We want to make sure that users within the company understand the quality of their data. They can go to the catalog and understand where that data is coming from, what it looks like, and how other people use it.
And that would unlock velocity within the data teams, because people could very quickly find out about the data that they're using rather than relying on tribal knowledge, word-of-mouth, or just manually looking at a table and the data that's located there. Learning from that experience, we decided to tackle the problem that we faced ourselves, but we wanted to do so in a more generic and scalable manner. What we learned was that users want an easy way to set this up and an easy way to monitor it, and they don't want to go through the repeated headache of doing this multiple times for the same tables. And so we tried to take those learnings and apply them to the product that we're building now at Bigeye.
[00:22:02] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours or days. DataFold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. DataFold integrates with all major data warehouses as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of DataFold. Once you sign up and create an alert in DataFold for your company data, they'll send you a cool water flask. There are a number of other ways that folks are addressing data quality problems, where they might be using the data warehouse as the focal point for understanding whether content being landed in the warehouse meets a certain set of criteria, or they might be embedding it very early in the pipeline, where it lives with the data integration tool chain for handling schema compliance for inbound events or data as it's being extracted from sources, or it might be in the business intelligence layer, where you have things like Looker with their data modeling tools for ensuring that the information conforms to the model that's being fed into a dashboard.
I'm wondering if you can just give your overall assessment of the landscape of data quality tools and the approaches being taken and what you see as the trade offs for how you're managing it at Bigeye?
[00:24:02] Unknown:
There are a couple of approaches that we see commonly in the industry today. If you look back, data quality tools historically focused more on data cleansing and data cleanup than on monitoring of the data. You would get data that comes in, and you know that a field is supposed to be not null, and there are 20 null fields, so let's drop them all before we insert them. And this works, but then you don't know what you're dropping; you don't know what you're actually missing. So what happened after that is that data quality checks were pushed down to the data processing layer and the ingestion layer, where you could define them in tools like DBT. They have a testing framework, which is great, but this still requires you to write manual checks in SQL, and you have to write them for every single pipeline. And as a data engineer, I know I'm personally sometimes too lazy to write a check, and I say I'll come back to it. And then I never come back to it, and it becomes stale or is never monitored at all.
The more common approach today seems to be rule based, where you have a dataset and you can define the rules that you expect to apply to that data, check those rules for consistency, and make sure that your data is correct. At Bigeye, what we're doing is taking that to the next step, inspired by a lot of work in the APM world, such as Datadog and New Relic, where they collect metrics about applications and about servers and can monitor those metrics for the health of said application. Bigeye collects metrics about your data and then lets you define which metrics you want to monitor to define the quality of your data. For example, if something should never be null, then you can collect a "how often is this field null" metric, and if that's ever greater than 0, then send me an alert.
The trade-offs here really boil down to: do you want continuous monitoring, or do you want to check it at ingest? Having continuous monitoring means that there's a slightly increased load on the warehouse, and there's this extra step involved where the data is already loaded and now you're checking it after the fact. That being said, this takes the load off of the ingest layer and is more in line with what ELT proposes, which is take the data, load it somewhere, and then transform it and do all your operations on top of it. In this world, Bigeye is already living in the place that you're loading the data into, so consistently monitoring that load makes much more sense than monitoring it before you ingest it.
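As a rough illustration of the metric-style check described above, here is a small sketch of a "how often is this field null" metric with an alert threshold. The table and column names are hypothetical, `conn` is assumed to be any connection object exposing an `execute()` method (as in sqlite3), and this is not Bigeye's actual implementation.

```python
from datetime import datetime, timezone


def null_fraction(conn, table: str, column: str) -> float:
    """Collect a 'how often is this field null' metric for one column in the warehouse."""
    row = conn.execute(
        f"SELECT AVG(CASE WHEN {column} IS NULL THEN 1.0 ELSE 0.0 END) FROM {table}"
    ).fetchone()
    return row[0] or 0.0


def monitor_nulls(conn, table: str, column: str, threshold: float = 0.0, alert=print):
    """Run on a schedule: record the metric and alert whenever it crosses the threshold."""
    value = null_fraction(conn, table, column)
    point = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "metric": f"{table}.{column}.null_fraction",
        "value": value,
    }
    if value > threshold:
        alert(f"ALERT {point['metric']} = {value:.4f} exceeds threshold {threshold}")
    return point
```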
[00:27:23] Unknown:
As far as the data quality issues that you've seen, I'm wondering what you have found to be some of the common causes of it or common ways that it manifests and some elements of the pipeline as far as gathering the information or structuring the information that engineers should be keeping an eye out for in order to prevent downstream problems?
[00:27:50] Unknown:
The most common quality issues are data latency or freshness, row counts (a table doesn't load, is empty, or has fewer records than you expect, or in the odd case has more records because you joined to a table that has duplicates), and null and empty field checks, or completeness. These are common checks that many systems enable out of the box. They make sense for baseline coverage of your data, so you know that at least some data is there on time and it has more than 0 records. The more interesting quality issues that we've seen were all semantic-based issues. Going back to that notion of semantics, you expect a field to contain only phone numbers.
And all of a sudden, you see ZIP codes in your phone number field, because the UI that you're entering the data through changed the field order, and now your data load is loading in a different order. These are more deceptive issues that are harder to find with those common checks, and you really have to understand what the data means and how it is going to be used. Another interesting issue that we've seen is a column that was supposed to represent dollar amounts, except it was always being rounded to the nearest dollar, which obviously threw off reporting and made for very suspicious numbers.
You can't really catch these sorts of things in the ETL, because this is really where the bug happens. If you have a bug in your ETL, you're not going to catch it until after the data is being loaded. But monitoring for this after the fact is really important. When it comes to building more resilient pipelines so that things like this don't happen, it's important to understand how this data is going to be used and what it is going to be used for. For straightforward pass-through or easy transformations, it's easy to encode that in the pipeline. But for anything that requires some sort of business logic or business knowledge, it's good to clarify that upfront and then encode it into your pipeline. If you see a phone number field that isn't exactly 10 characters long or doesn't have a +1 in front of it, then maybe you should throw some sort of warning in the logs and monitor that to make sure that you're processing the data that you expect and you're outputting the data that your user would expect.
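Here is a minimal sketch of the kind of "warn in the logs" check described above, applied to the phone-number example. The field names, the record layout, and the "+1 followed by ten digits" expectation are all illustrative assumptions, not a prescribed format.

```python
import logging
import re

logger = logging.getLogger("pipeline.quality")

# Hypothetical semantic expectation: US phone numbers stored as "+1" followed by ten digits.
PHONE_PATTERN = re.compile(r"^\+1\d{10}$")


def validate_phone(value, row_id=None) -> bool:
    """Warn (rather than fail) when a field doesn't match the semantics we expect of it."""
    if value is None or PHONE_PATTERN.match(value):
        return True
    logger.warning("row %s: phone_number %r does not look like a +1 phone number", row_id, value)
    return False


records = [
    {"id": 1, "phone_number": "+15551234567"},
    {"id": 2, "phone_number": "94103"},  # a ZIP code slipped into the phone field
]
suspect_rows = [r for r in records if not validate_phone(r["phone_number"], r["id"])]
```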
[00:30:38] Unknown:
And so in terms of the responsibility and priority of integrating data quality into the overall life cycle of data, I'm wondering what you have seen as being the breakdown of who owns that aspect and what the drivers are for ensuring that it is included in the definition of done for any aspect of building data pipelines or producing analyses or data products?
[00:31:05] Unknown:
At the end of the day, the person who cares about the quality of the data is going to be the person using the data. Most of the time, the responsibility of defining and managing data quality falls on the data consumer. Because they are the ones who will be building the dashboard, and they are the ones who will be the first line of defense when an executive comes and says this dashboard is wrong. What's going on with it? The problem here is that a lot of data consumers do not have the right sort of access and tooling that would allow them to push down this data quality knowledge to a layer of the producers or to the data itself.
And if you go back to the data producer, the data engineer building the pipelines, their incentives are typically to get the pipeline done, move on to the next one, and only come back to it if somebody says that it's broken or if they need to change the logic of the pipeline. It's important to have a middle ground where both sides can communicate about their expectations of the data and set this contract. If you look to the SRE handbook, they have a notion of SLAs and SLOs that talk about setting a contract with your application.
The application is expected to perform in a certain way. It's important to get both the data consumer and the data producer in the same room and get them on the same page about what the contract of this data is, what this data should look like, how it will be used, and how we define when this dataset is of high quality. The follow-up to that then becomes: how do you encode these data quality rules? This should also be in a centralized system that is accessible by both sides, so the data producers can know whether the data they're producing is of high quality, and the data consumers can see whether the data they're accessing is of high quality, and if not, what is going on and whether someone is responding to it.
So really the burden should fall on both sides, but in a way that is centralized, where everybody can get on the same page very quickly and be able to react to things as they come up, rather than playing a game of telephone where the data consumer has to go to the data producer, and the data producer has to go back to the consumer, back and forth, until they agree whether or not something is going on and what is happening.
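To illustrate what such a shared, machine-readable contract could look like, here is a minimal sketch. The structure, field names, and thresholds are assumptions for the sake of the example; the point is only that both producers and consumers read and check against the same definition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataContract:
    """A single shared definition that both the data producer and the data consumer
    can read and check against. Fields and thresholds here are illustrative only."""
    table: str
    freshness_hours: float                 # data must land within this many hours (an SLO-style target)
    min_row_count: int                     # a daily load below this row count is suspicious
    non_null_columns: List[str] = field(default_factory=list)


# Example contract the two sides might agree on for a hypothetical orders table.
orders_contract = DataContract(
    table="analytics.orders_daily",
    freshness_hours=6,
    min_row_count=1_000,
    non_null_columns=["order_id", "customer_id", "order_total"],
)
```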
[00:33:59] Unknown:
Another aspect to the overall question of data quality and the health of pipelines and the visibility into the trustworthiness of the data is the support that's built into the tooling that's being used at each of the different stages of the life cycle. And there has been an increase in the focus on that as a primary concern and a design element for the systems that are being used to collect and process information, notably things like Great Expectations as far as a code-first approach, or with Dagster being able to embed expectations into the different sections of compute to make sure that you can see what the assumptions are around data and whether or not they're being violated. And then also with things like Delta Lake, with the similar approach of expectations being built into the table definitions.
I'm wondering what you have seen as far as some of the most noteworthy ways of surfacing the concern of data quality within these different tools and within the user experience of data engineers and data analysts? And what are some of the cases where the tooling, in your view, needs to be improved or where these design considerations need to be built into the tools and, you know, either refactored or revisited with the existing infrastructure and the overall approach of the industry at large?
[00:35:31] Unknown:
The tooling breaks down into 2 camps. You have the tooling that allows you to set these definitions and expectations in the pipeline and in the code. You mentioned Great Expectations; I would put DBT in a similar place, where the expectation lives with the pipeline. The difficult part about having expectations live in code is that changing them becomes hard, and it also becomes tedious to get everyone on the same page, because often the consumers of the data, your analysts and your data scientists, aren't going to go dig through a bunch of pipelines to figure out what your expectations are.
The second set of tooling defines the quality rules outside of the pipelines and on the data itself. If you look at the most primitive example of this, it's the data catalog where you have descriptions of the columns and of the tables that you are using, and they're stored in a central location that anybody can go and read and some people will update. This is useful to a human, but it's not useful to the actual tooling used by data engineers because they would still need to translate what's there into their pipelines if they want resilient and reliable pipelines. It's important for tools that define quality checks in a central place to be able to expose that information both in a human accessible and human readable way, but also in a way that the code can reference the same checks and the same logic and allow engineers to implement that logic in their tools and assert that their pipelines are producing the data that the users expect.
Bigeye does this by providing a UI for centralizing the definitions and then exposing APIs so that you can pull those definitions into your processing layer and run the same assertions on the data you're producing that are already defined in that central location.
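The sketch below shows the general shape of that pattern: pull centrally defined checks over an API and apply them in the processing layer. The endpoint path, payload shape, and check names are invented for illustration and are not a documented Bigeye API; `df` is assumed to be a pandas DataFrame.

```python
import requests  # assumes a plain REST endpoint serving check definitions as JSON; not a documented Bigeye API


def fetch_checks(api_base: str, table: str, token: str):
    """Pull the centrally defined checks for a table from the shared system (hypothetical endpoint)."""
    resp = requests.get(
        f"{api_base}/checks",
        params={"table": table},
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. [{"column": "order_total", "check": "non_negative"}, ...]


def assert_in_pipeline(df, checks):
    """Apply the same assertions in the processing layer (df is a pandas DataFrame)."""
    for check in checks:
        if check["check"] == "non_negative":
            bad = df[df[check["column"]] < 0]
            assert bad.empty, f"{check['column']} has {len(bad)} negative values"
```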
[00:37:50] Unknown:
As you have been building out the Bigeye product, onboarding customers, and working with them to understand their requirements, I'm wondering what you have seen as some of the most interesting or innovative or unexpected ways that they're using the Bigeye platform.
[00:38:05] Unknown:
When we set out to build Bigeye, we expected teams to make an individual decision on how they would want to resolve a data quality issue, because you can't tell whether a data quality issue exists because the business has changed or because the data is actually wrong. What we've seen is that a lot of the time, when there is an alert in Bigeye, there is actually a data quality issue, and the team thinks that this is an unexpected data problem and they want to resolve it at the data layer. One of the most interesting ways that we've seen Bigeye used in this case is triggering automation in their tools to fix or roll back the data that is bad.
An example of this is you have an alert in Bigeye that says this column has a non-zero number of nulls. It triggers an alert, and that alert hits a webhook in the infrastructure layer, which triggers an ETL that takes all of the rows that are null and moves them into a side table as a quarantine table. So now the investigation that happens after the fact can be done on that specific quarantine table, because you know all of the bad rows are in a single place. So it's very interesting to see data quality monitoring not just being used as an alerting mechanism, but also as a hook back into the infrastructure layer and into the ETL orchestration layer, actually performing automated actions off of the alerts that are coming from the data quality system.
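A minimal sketch of the quarantine step itself might look like the function below, which would be invoked by whatever receives the alert webhook. Table and column names are illustrative, and `conn` is assumed to be a connection object exposing `execute()` and `commit()` (as in sqlite3).

```python
def quarantine_null_rows(conn, table: str, column: str) -> int:
    """Move rows where `column` is NULL into a quarantine side table so the follow-up
    investigation can happen in one place. Names here are illustrative."""
    quarantine = f"{table}_quarantine"
    # Create an empty copy of the source table the first time this runs.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {quarantine} AS SELECT * FROM {table} WHERE 1 = 0")
    conn.execute(f"INSERT INTO {quarantine} SELECT * FROM {table} WHERE {column} IS NULL")
    moved = conn.execute(f"DELETE FROM {table} WHERE {column} IS NULL").rowcount
    conn.commit()
    return moved
```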
[00:39:53] Unknown:
And a corollary to the number of businesses that are addressing data quality in a product-oriented fashion: one of the reasons that is a viable option is because there are so many people who are identifying this as a need and who are building their own homegrown approaches to addressing data quality, trying to identify issues with it and solve for it. So I'm wondering what you have seen as some of the most interesting homegrown platforms for addressing this within an organization's engineering group, or even things that you yourself have done before building Bigeye?
[00:40:33] Unknown:
The most common homegrown solution that we've seen, and honestly that I have built multiple times at this point, is taking a SQL query, putting it on a cron schedule, running it, and then outputting the results on some dashboard. You walk in in the morning, you look at the dashboard: does the graph look okay? This is obviously extremely manual. You're copy-pasting SQL queries around, you're running a cron job on some random box in EC2, and it's completely unscalable. We've seen interesting variations of that where teams have taken that SQL query and then output all of the data to Datadog, so it lives in the same place as their infrastructure metrics and they can monitor data quality and infrastructure in the same place. The most extreme example of this was a data scientist that we've met who ran all of his checks and then put all of those into one giant data quality table, which he then pivoted into multiple dashboards and presented those as, effectively, data products on the quality of his data.
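The Datadog variation just described might look roughly like this: a scheduled query whose result is pushed as a gauge next to infrastructure metrics. It assumes the `datadog` Python client with a local DogStatsD agent; the metric name, table, and schedule are illustrative.

```python
from datadog import initialize, statsd  # assumes the `datadog` client and a local DogStatsD agent

initialize(statsd_host="127.0.0.1", statsd_port=8125)


def report_row_count(conn, table: str) -> int:
    """Run the quality query and push the result to Datadog, next to infrastructure metrics.
    `conn` is any connection object exposing execute(), as in sqlite3."""
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    statsd.gauge("data_quality.row_count", count, tags=[f"table:{table}"])
    return count

# Invoked from cron, e.g.: */30 * * * * python report_row_count.py
```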
The most extreme example of a homegrown solution that is nontechnical that we've seen is literally putting a single person on call for data quality, and their whole job is to monitor dashboards and reports. And if they see something, investigate it and fix the data quality problem. This obviously doesn't scale, but it is a very interesting homegrown approach to data quality.
[00:42:15] Unknown:
The if you see something, say something of data engineering.
[00:42:18] Unknown:
Exactly.
[00:42:20] Unknown:
And so in terms of the overall data management and data quality landscape, what are some of the biggest trends that you're following and that you find most interesting?
[00:42:35] Unknown:
Data cataloging and discovery have been top of mind for quite a while. If you look at the need of most data teams, it's to access the data and build something useful with it quickly. And the only way that you can build something with data is if you understand what exists there. From that perspective, data catalogs and data discovery have been a huge trend in the space recently. If you look at the open sourcing of Amundsen from Lyft, as well as other enterprise data catalogs such as Alation and Collibra, they've all been coming more into focus recently as teams have to deal with larger and larger amounts of data.
The other trend that is most interesting to me is the migration from ETL to ELT, as well as the migration from warehouses to lakes and back to warehouses again. It's very interesting to see the pendulum swing in the data space as teams go to a more centralized data model and then realize that that's a lot of overhead. I just want to manage my own team's data, and they break out into data marts and data warehouses again. It feels like history repeats itself every decade. And it's interesting to see that cycle repeat now.
[00:44:05] Unknown:
For folks who are trying to gain control of data quality in their own data pipelines and are considering using Bigeye, what are the cases where it's the wrong choice and they would be better suited with a homegrown solution or some other off-the-shelf product?
[00:44:21] Unknown:
Bigeye operates by querying data out of a database, whether that be a warehouse or Presto on top of S3. If your data is not in a queryable format, if it's very nested JSON that's completely unstructured and you're doing completely offline processing in Jupyter Notebooks, then Bigeye might not be the right tool for you, and you might want to look into other ways to monitor that data. That being said, if you have unstructured data that you're trying to use for something, typically structuring it is the first step. And if you're structuring it, you're putting it in a queryable format, which typically means SQL, in which case Bigeye would work for you once you get to that step.
[00:45:14] Unknown:
And as you continue to iterate on the product and explore different ways of identifying data quality issues, patterns in how these problems surface, and ways that data is being used, what do you have planned for the future of the Bigeye product and the business around it?
[00:45:33] Unknown:
A lot of the focus has been on introducing more automation into Bigeye, whether that's on setup time for the metrics or on setting thresholds automatically rather than manually. And we want to improve the intelligence of the product. Because we are capturing these metrics, we can do a lot of interesting things around predicting how a metric will behave, as well as identifying other parts of your system that you might want to apply the same metrics to. So taking all of this metadata that we capture about your database, as well as the metadata from the metrics that you're already collecting, and allowing you to quickly set up new checks and new tests across your whole data warehouse in as little as an hour is really exciting to us.
[00:46:30] Unknown:
For anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:46:45] Unknown:
The two biggest gaps that I see in data management today are the ability to encode domain expertise and information into the tooling, and making it easier to tie tools together. To that first point, if you look at things like Salesforce, or Carta for cap tables, there's a lot of domain expertise that's actually baked into the tools because it's so required. Data tools today are very unopinionated, because a lot of patterns haven't really been set in stone yet by the community. But there's still a lot of tribal knowledge, and there are a lot of common patterns that you can see across teams that could make tools more opinionated in order to make them more useful for everybody.
And to that second point, the same tools are frequently used together, and there's still no good way of tying tools together. It seems like every data engineering team has to rebuild the same things from scratch and reinvent the wheel. So I'm excited to see a lot more integrations in the data space between tools that are commonly used, so that when I'm starting a new project, I don't have to read all the API docs and build all the piping manually.
[00:48:13] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you've been doing with Bigeye and your experiences working with the data quality market. It's definitely a very interesting and important area of focus. So I appreciate all the time and energy you've put into that, and I hope you enjoy the rest of your day. Thanks, Tobias. It was great to be on the show. Have a great day
[00:48:37] Unknown:
yourself.
[00:48:39] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Egor Gryaznov and Bigeye
Defining Data Quality
Factors Affecting Data Quality
Industry Trends in Data Quality
Overview of Bigeye's Approach
Common Data Quality Issues
Tooling and Data Quality
Customer Use Cases and Homegrown Solutions
Future of Bigeye and Data Management Trends
Biggest Gaps in Data Management Tooling