Summary
Data quality control is a requirement for being able to trust the reports and machine learning models that rely on the information you curate. Rules-based systems are useful for validating known requirements, but with the scale and complexity of data in modern organizations it is impractical, and often impossible, to manually create rules for all potential errors. The team at Anomalo is building a machine-learning-powered platform for identifying and alerting on anomalous and invalid changes in your data so that you aren’t flying blind. In this episode founders Elliot Shmukler and Jeremy Stanley explain how they have architected the system to work with your data warehouse and let you know about the critical issues hiding in your data without overwhelming you with alerts.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription
- The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
- Your host is Tobias Macey and today I’m interviewing Elliot Shmukler and Jeremy Stanley about Anomalo, a data quality platform aiming to automate issue detection with zero setup
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Anomalo is and the story behind it?
- Managing data quality is ostensibly about building trust in your data. What are the promises that data teams are able to make about the information in their control when they are using Anomalo?
- What are some of the claims that cannot be made unequivocally when relying on data quality monitoring systems?
- types of data quality issues identified
- utility of automated vs programmatic tests
- Can you describe how the Anomalo system is designed and implemented?
- How have the design and goals of the platform changed or evolved since you started working on it?
- What is your approach for validating changes to the business logic in your platform given the unpredictable nature of the system under test?
- model training/customization process
- statistical model
- seasonality/windowing
- CI/CD
- With any monitoring system the most challenging thing to do is avoid generating alerts that aren’t actionable or helpful. What is your strategy for helping your customers avoid alert fatigue?
- What are the most interesting, innovative, or unexpected ways that you have seen Anomalo used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Anomalo?
- When is Anomalo the wrong choice?
- What do you have planned for the future of Anomalo?
Contact Info
- Elliot
- Jeremy
- @jeremystan on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Elliot Shmukler and Jeremy Stanley about Anomalo, a data quality platform aiming to automate issue detection with zero setup. So, Elliot, can you start by introducing yourself?
[00:02:07] Unknown:
Hi, Tobias. Thank you so much for having us. I'm Elliot Shmukler, the CEO and cofounder of Anomalo.
[00:02:13] Unknown:
And Jeremy, how about yourself? Yeah, Tobias. Excited to be here with you today. I'm Jeremy Stanley. I'm the CTO and cofounder of Anomalo.
[00:02:20] Unknown:
And going back to you, Elliot, do you remember how you first got involved in the area of data?
[00:02:24] Unknown:
Yeah, absolutely. I've been a growth leader for many years. So I led growth teams at places like LinkedIn, Wealthfront, and Instacart. And as many of your listeners probably know, consumer growth and consumer growth strategies are all quantitative in nature. And so from the earliest days of my career, I had to be great at getting to the data I needed and using all the tools to make sure that we were growing.
[00:02:51] Unknown:
And, Jeremy, what's your background in data?
[00:02:53] Unknown:
I have been a data scientist, using data to make decisions or using data to build and deploy machine learning products. And I can remember even, you know, 10, 15 years ago coming up against, you know, huge collections of data in a data warehouse with no context and no background and, you know, no transparency into the quality of that data and having to figure it out as I went. I've always been
[00:03:18] Unknown:
a very deep user of data and focused on making sure we can get the most out of it. And so that brings us to what we're talking about today and the work that you're doing at Anomalo. I'm wondering if you can just give a bit of an overview about what it is that you're building there and some of the story behind how it came to be and why you decided that this is the problem that you wanted to spend your time and energy on.
[00:03:39] Unknown:
Yeah. Absolutely, Tobias. So Anomalo is a data quality platform, and our goal is to empower data scientists, data analysts, and data engineers to very quickly detect and resolve issues with data quality. The background is that Jeremy and I came from companies where we were constantly trying to use data to make decisions and build products and improve our products and grow faster. And one of the biggest issues we encountered is, you know, the more data you collect and the more you try to use it, the more data quality issues you find. The more you encounter situations where data is missing, or inconsistent between periods because some definition of a particular metric or a particular field in the data has changed, or is otherwise corrupted or stale.
And so we had a number of these situations in our career, and that was the inspiration for building something that will actually find these issues for us without us having to do the work to find them, and without us being surprised when our models break, or our dashboards are wrong, or we simply can't get to the right data when we need
[00:04:57] Unknown:
to. In terms of data quality, there are a number of different ways of thinking about that and a number of different axes and elements of quality. And at the highest level, the objective is to establish and quantify and maintain trust in the data that you're working with. I'm wondering if you can speak to the types of guarantees and promises that data teams are able to make and maintain and validate when they're using the Anomalo platform.
[00:05:28] Unknown:
Yeah, absolutely. So I'll go through a few of them and, Jeremy, chime in as well. Number one, is your data fresh? Right? So we've seen a bunch of situations in our career where datasets fail to update or data fails to arrive for a given day. And that's not always something that's obvious to the folks using that data. So first and foremost, if you're monitoring your data through Anomalo, we will tell you if new data was expected to arrive at this time and did not. And so you can be confident that you're going to know whether your data is fresh. Number two, is your data complete? Right? Very often, you have issues in various parts of the modern data stack that cause you to miss certain data. Maybe you have a bug introduced into how you ingest the data, how you collect the events, or maybe you had a breakdown in a part of your pipeline that didn't include all the data. Or maybe you're getting feeds from third parties, and today that third-party feed is missing some things that it usually has.
And so we will actually look for evidence that your data is incomplete or that something that was in the data historically is now missing. You know, to give you an example, Tobias, back when Jeremy and I worked at Instacart together, we actually had an issue of this kind where, because of an engineering bug, we stopped collecting, you know, behavioral events on our Android app. If you looked at our volume of events or you looked at our overall metrics, you know, it wasn't obvious that anything had moved. But if you looked deeper in the data, you could see that events from the Android platform that were there before were now gone. And so one of the promises that we make when we monitor data with Anomalo is that we will detect these situations where your data is missing.
That's foundational. On top of that, we also help you understand whether metrics that you've defined on this dataset have unusual movements, right? That tends to be pretty important. You're going to use metrics to monitor your business or make decisions. And so we help you understand when those metrics move in unusual ways. And then on top of all that, if you want to establish some very fixed parameters for your data, you know, we don't encourage this. We don't encourage establishing rules for your data because we detect so many things automatically through Anomalo, so many of the most common issues. But if you want to establish some rules, for example, because you want 100% certainty that a particular thing is true about your data, maybe that's important for compliance purposes, then we let you do that as well. So fundamentally, you know, once you have your data monitored with Anomalo, you can be sure that it's fresh and you can be sure that it's complete and nothing is missing there.
And then further, you can make sure that the metrics powered by that data are reasonable and that whatever deterministic rules you want to establish for that data are also met.
[00:08:32] Unknown:
One of the things that's always interesting about data quality platforms is kind of where they focus and what the integration points are into the platform. And one of the things that you mentioned is the situation where your Android application was no longer sending events and being able to identify that. But at that point, it's, you know, kind of too late to resolve the immediate issue until you have gone through another engineering, you know, fix, build, deploy cycle. And so, you know, there's always the question of, at what point do you try to detect these quality issues to be able to remediate them as fast as possible? And I'm curious if you can talk to the focus that you've settled on with Anomalo and maybe some of the ways that people can and should think about augmenting what Anomalo is doing with other platforms or systems to be able to get a more holistic view of the end-to-end quality of data, where you might be able to identify, earlier, before you get into production with that Android app, that these events are no longer being generated and propagated?
[00:09:40] Unknown:
Yeah, absolutely, Tobias. So what we've chosen to do is focus on data that's in your data warehouse, right? We see, a lot of the time, that the folks who are setting up the modern data stack, becoming data-driven organizations, are centralizing all their data in a cloud data warehouse, you know, a warehouse like Snowflake, that's becoming very popular. And so everything from your raw data, you know, the raw events ingested from your apps, the feeds from the third parties, to your sort of refined and more usable data ends up in the data warehouse. So as a first step, we chose to focus on connecting to your data warehouse and letting you monitor anything inside your data warehouse with ease, right? But you're absolutely right.
One of the things that I've always championed in the various roles that I've had is that data and data collection are as important as the correctness of the code, right, underlying your products. So if you're building an Android app, for example, of course, you're going to have unit tests and reviews and all these kinds of things to make sure your app is actually functioning correctly. But actually you should have unit tests for data too, right? Am I emitting events, right? When I view this page and I'm supposed to emit an event on a page view, does that actually happen? It's been amazing to me how often those are missed, you know, in modern development because data is kind of sometimes seen as an afterthought, as not as essential to the functioning of the product. But in modern organizations, it is. Because you're going to try to make decisions based on that data. You're going to try to prioritize future features based on that data. That data may underlie results of experiments that you're running. So I think it's very important, and hopefully your audience is treating it as such, that, you know, data and data collection is, you know, a production-level, code-level issue, right, for these teams, where it needs to be tested and verified.
[00:11:42] Unknown:
Tobias, one thing I would add is I like to use the metaphor of a factory. And, you know, if you're gonna test the quality of a good, you know, manufactured at a factory, the most critical test you can do is at the end of the line. Right, when the product's completely assembled and it should be entirely functional, and you can say, you know, does it meet the specifications you would want? We chose to focus on the, you know, data warehouse because that's that final complex data product that's being consumed by the analysts, the data scientists, you know, people in product. And there's so much complexity upstream of that. And, yes, you can and you should test components of that complexity where you can. Right? Some won't be possible to test. You'll have external things happening, you know, with data vendors or SaaS platforms or other tools. You'll have people making manual decisions, you know, that can't easily be tested.
But then, also, all of the code that your organization is changing and shipping, all of the transformations that are happening in the data warehouse, they can all interact in very complex ways. Right? The individual data bits that are collected at the top of the funnel end up going through incredibly complex transformations. And so the ways those interactions can take place introduce a lot of complexity and a lot of risk of data quality issues. And the only way to really effectively catch those is to test at that final, you know, point of consumption. And, in addition, that's where you also have the most context. You can look at the other data around the issue to understand, you know, what might have happened and, you know, what's being affected.
[00:13:16] Unknown:
Going back to the question of guarantees and being able to validate the promises about the data that you're making, what are some of the promises that can't be made unequivocally based solely on the perspective from the data warehouse, or from the sort of statistical anomaly detection that you're doing at Anomalo based on the information that you do have, and where do you need to go to some external system or some additional manual validation or other sort of process control methods for being able to make and maintain those promises?
[00:13:50] Unknown:
Yeah. So you can think about the Anomalo system as having this, you know, machine learning driven, and we can talk more about what's behind that, but fully automated approach to identifying the most common, you know, issues that are happening in your data and the most common degradations of your data. It's doing that by sampling data from your data warehouse and, you know, learning the distribution and patterns in that data and identifying if there are meaningful, you know, statistically meaningful changes that happen. So there are some limitations there. It can identify that a column that was previously 10% null, you know, and has been, you know, typically 10% null, maybe with some seasonal fluctuations, if that suddenly becomes 20% null, that's statistically significant, and you would want to receive that notification.
But if it was typically 10% null, and you had a single additional unexpected null value, you know, certainly, an algorithm could detect that. But if it was, you know, so finely tuned to be able to detect that, it would detect so many other things you wouldn't care about that you would drown in noise. So, typically, when we think about our fully automated approaches, they're looking for statistically significant changes, or they're looking for changes where you went from a situation where you had, for example, no null values in a column. That would be meaningful. And so the way I like to think about it is it can protect you from the most common types of cases, but there's no free lunch. It's not gonna be able to find every meaningful change.
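To make that distinction concrete, here is a minimal sketch, not Anomalo's actual test (which isn't spelled out in the episode), of how a two-proportion z-test separates a real shift in a column's null rate from a single stray null:

```python
from math import sqrt
from statistics import NormalDist

def null_rate_shift_significant(nulls_before, n_before, nulls_after, n_after, alpha=0.001):
    """Two-proportion z-test: did the column's null rate change more than chance allows?"""
    p1, p2 = nulls_before / n_before, nulls_after / n_after
    pooled = (nulls_before + nulls_after) / (n_before + n_after)
    se = sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if se == 0:                     # both samples entirely null or entirely non-null
        return False, 1.0
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha, p_value

# 10% null yesterday vs 20% null today on 10,000-row samples: clearly flagged.
print(null_rate_shift_significant(1_000, 10_000, 2_000, 10_000))
# A single additional null value: not flagged, because alerting here would just create noise.
print(null_rate_shift_significant(1_000, 10_000, 1_001, 10_000))
```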
And so that's where, to Elliot's point earlier, the key metrics come in. On the key metrics side, when you say, I actually care a great deal about this specific statistic, this specific metric, you're making a really pointed value judgment about what columns you care about and what rows you care about. Right? The specific rows and the specific columns that go into that metric matter a lot to you, and so you wanna pay very close attention to it. And so it's gonna be more sensitive to changes in that very specific part of the table. And, similarly, if you define a validation rule, if you say, you know, this column should simply never be null, then that's the strongest way of asserting something. Right? It's using your own independent expectation and understanding of the data to make that assertion, which is different from looking for a regression, a, you know, change happening in the data. Yeah. I'll add just a couple of things to this, Tobias.
[00:16:12] Unknown:
You know, one truth of systems like ours, statistical and machine learning based systems, is that we're detecting changes in your data. Right? So we've encountered a few situations where the data has been wrong from day one, right? We're not going to spot that. We might recognize when that wrongness has changed or when it got corrected through our models, but we're not going to spot data that has never been correct, right, because we just don't know. And so often we can judge statistical changes in the data, but we can't tell you whether the data correctly reflects, you know, the real world, right, outside.
[00:16:51] Unknown:
So that speaks to the kind of question of the utility of automated versus manually defined or programmatic tests of data. And I'm wondering if you can speak to some of the ways that you and your customers think about the application of Anomalo versus using tools such as Great Expectations and kind of the overlap between the two and when to use which one?
[00:17:14] Unknown:
Our perspective, Tobias, is that you need both, right? You need both automated tests on your data, statistical tests, like the ones at the core of our product, and more manually defined, you know, rules and focus areas to sort of refine that monitoring and fine-tune that monitoring. The issue is it's actually impossible to be on either end of that spectrum. We need to combine both, and I'll tell you why. If, for example, all you're going to do is the manual monitoring, like you can do very effectively with Great Expectations and other tools, it's just a tremendous amount of work for someone to write all of those rules. And not just write them, they have to maintain those rules over time.
As your company launches new products and new geographies and new platforms, right? Those rules have to change. The other issue that we've seen, and we've seen this firsthand when we were at places like Instacart where we had a rules-based system to monitor data quality, is that rules can only protect you against things that you can anticipate. You're not going to write a rule for something you can't imagine happening down the line. But guess what? Things you can't imagine happening happen all the time in data. And so relying 100% on a rules-based system means you're going to miss these unknown unknowns, right, or unanticipated issues in your data.
And because of the level of work you have to do to write all those rules, you can't possibly cover your entire data warehouse, your entire collection of datasets, with this manual approach. So the way we work with our customers is we say, look, the automated approach, the core of Anomalo, should be your foundation. It should be something that you can easily apply to every table in your data warehouse without any work, right? And now you've guaranteed yourself sort of a base level of quality for every table. But for tables that are critical to you, that have your most important data, please go ahead and use Anomalo's other tools, you know, the more manual tools of defining rules and metrics, to kind of fine-tune our monitoring.
So now what we've done is we've given customers, you know, a base level of coverage across their entire data warehouse, right, with protection against unanticipated issues and unknown unknowns in their data. And we've allowed them as a result to focus their manual efforts on just the most critical areas where they make sure to get it perfectly right. And so that's a powerful combination. You know, neither approach in isolation, we think, solves the data quality problem for modern teams. But if you combine the two approaches, and if you combine them in one place like Anomalo does, where they can feed off each other and help each other, then I think you have something pretty powerful.
[00:20:14] Unknown:
And so can you give a bit of an overview about the technical implementation and design of the Anomalo platform and some of the ways that you have approached the automated discovery and alerting on some of these data quality problems that people experience as their data does change and evolve in their data warehouses and their kind of data platforms?
[00:20:38] Unknown:
There's a few different components. The first one is just making sure the data is fresh and that the volume of records in a given table is what you would expect. And so we have a platform that is polling, you know, the data warehouse to constantly monitor, you know, have new records arrived. And then once the records have arrived, is the row count what you would expect? And this is useful both because you can define SLAs for when you want your data to arrive and be alerted if you're missing the last 15 minutes of data for a time period or you have a significant drop in row counts. You know, there's value in that in and of itself, but it's also great because that is a system that gates running all of the other checks that Anomalo runs. So we don't go in and do a whole bunch of statistics and machine learning and execution of these rules on data that's actually just incomplete. It's one great way to avoid false positives. So that's the first part of the system.
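As a rough illustration of that gating step, the sketch below checks freshness against an SLA and compares today's row count to a robust historical baseline before any of the heavier checks run; the specific thresholds and the median baseline are assumptions for the example, not Anomalo's implementation:

```python
from datetime import datetime, timedelta, timezone
from statistics import median

def should_run_deeper_checks(latest_record_ts, daily_row_counts, todays_count,
                             freshness_sla=timedelta(hours=2), min_ratio=0.5):
    """Gate the expensive checks: skip them if the data is stale or clearly incomplete."""
    fresh = datetime.now(timezone.utc) - latest_record_ts <= freshness_sla
    typical = median(daily_row_counts)                  # robust baseline from recent history
    complete_enough = todays_count >= min_ratio * typical
    return fresh and complete_enough

# Data landed 30 minutes ago and today's volume is in line with recent history,
# so it is safe to run the statistical checks without tripping false positives.
print(should_run_deeper_checks(
    latest_record_ts=datetime.now(timezone.utc) - timedelta(minutes=30),
    daily_row_counts=[98_000, 102_000, 97_500, 101_200, 99_800],
    todays_count=100_400,
))
```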
From there, we go in and we sample records from the table. And this is in order to run our machine learning algorithms. And so, you know, the way to think about this is imagine you have a table, and each day, you're sampling, say, 10,000 rows randomly from that table. What Anomalo is doing is looking for drift in those samples of data. And a simple way to imagine how you might do that is suppose you had a random sample of data from today and a random sample of data from yesterday. If you could build a machine learning model that could predict which day each record came from, then something about the data is different. Right? Whatever it was that the machine learning algorithm was able to use to make that prediction accurately, you know, is indicating what has changed from one day to the next.
Now the obvious answer would be, well, the date is, you know, yesterday on the one sample and today on the other. And so that's a very trivial feature that an algorithm could use to distinguish the two days of data. And so what Anomalo does is it learns what all of the features are, and it identifies the ones that are always predictive of time and removes them. And it dampens the columns that are very chaotic. For example, if you're doing marketing and you have a marketing dataset, you might have new campaigns constantly starting in that dataset. And so the campaign IDs or the campaign metadata are constantly changing. It's very chaotic.
It's not a data quality issue. That's an operational change. And so we have a layer on top of this algorithm that's looking for, you know, those repeated chaotic operations and is dampening them. So you're left with just the unexpected changes happening in the data. We then use explainability algorithms like Shapley to, you know, interpret what data in particular, what rows, what columns, what values in particular are causing, you know, this change, this drift in the data. And it's really powerful because we can use that to summarize for the end user. You know, here is the collection of columns that are affected.
You know, this is the nature of the change. Here are visualizations that help to explain it. And we can go even further than that. We can actually identify, well, what characterizes the rows that are changed? You know, is it Android only? Is it this geography or this partner or this provider only? So we provide what we call a root cause analysis into understanding what has actually changed in the data. So that's the basis of the machine learning. And, you know, this can detect, you know, drift in columns, changes in distribution, even changes in the relationships between columns.
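Here is a condensed sketch of that two-sample idea: label yesterday's rows 0 and today's rows 1, train a gradient-boosted classifier, and treat an AUC well above 0.5 as evidence of drift. Anomalo additionally strips trivially time-predictive features, dampens chaotic columns, and explains results with Shapley values; this sketch substitutes permutation importance for the explanation step, and the table and column names are invented for the example:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score

def detect_drift(sample_yesterday: pd.DataFrame, sample_today: pd.DataFrame):
    """Can a model tell yesterday's sample from today's? If so, the data has drifted."""
    X = pd.get_dummies(pd.concat([sample_yesterday, sample_today], ignore_index=True))
    y = np.r_[np.zeros(len(sample_yesterday), dtype=int), np.ones(len(sample_today), dtype=int)]

    model = HistGradientBoostingClassifier(max_iter=50)
    auc = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()  # ~0.5 means no detectable drift

    # Which columns let the model tell the days apart? (A stand-in for Shapley-based explanations.)
    model.fit(X, y)
    imp = permutation_importance(model, X, y, n_repeats=3, random_state=0)
    drivers = pd.Series(imp.importances_mean, index=X.columns).sort_values(ascending=False)
    return auc, drivers

rng = np.random.default_rng(0)
yesterday = pd.DataFrame({"amount": rng.normal(50, 10, 5_000),
                          "platform": rng.choice(["ios", "android"], 5_000)})
today = yesterday.copy()
today["platform"] = "ios"            # e.g. the Android events silently stop arriving

auc, drivers = detect_drift(yesterday, today)
print(round(auc, 2))                 # well above 0.5: drift detected
print(drivers.head(3))               # the platform columns dominate the explanation
```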
And then we have special versions of this that are looking for things like duplicate data, looking for increases in null values or zero values, to highlight those in particular since they're often very concerning to users. After that, we have the key metrics and the validation rules. These would be set up by the user and all executed, you know, in parallel against the warehouse, pulling samples of data and, you know, time series out, building those time series models, and producing visualizations. I think one thing I wanna generally highlight is we spend a lot of time and energy producing visualizations. We probably have a hundred different types in the product.
And we do that because explaining these issues is at least as important as finding them, making sure that the user can quickly understand, is this something I expected? You know, is this something that's concerning? You know, where did it happen? Who do I need to talk to? All of that needs to happen through a visualization and an analysis of the data. And so we do a lot of that in a very automated way as well. As far as the machine learning elements of the system,
[00:25:25] Unknown:
because of the fact that you're trying to detect anomalies in the system, it makes sense to take a statistical approach. But at the same time, one of the challenges of machine learning is that it can be unpredictable. And so I'm curious what your process has been for being able to validate the logic and the statistical models of those machine learning approaches to ensure that you are able to effectively identify and alert on the problematic elements of the data that's under test.
[00:25:52] Unknown:
It's actually a really fun part of what we do, you know, fun for someone with a data science and machine learning background like me. Maybe not fun for everybody, but we have a chaos library for data. And so, you know, this is about 30 different classes of issues that we can introduce into datasets. And we actually do this in the data warehouse. We'll create temporary tables that introduce these issues and mask the underlying data. And the types of chaos that we can introduce, you know, range from simple things like, you know, null value increases or dropping rows to things that are more nuanced, like I'm gonna shuffle the values in a column or I'm going to swap values in a certain way, you know, duplicate data, change the distribution in some small way.
And so we have a whole library of these chaos operations. And what we do is we have a bunch of benchmark datasets. Many of them are public datasets. We've also had customers contribute data to our benchmark datasets. And we run a huge barrage of chaos operations at all of these datasets, you know, introducing one specific issue for some fraction of the data, you know, oftentimes a very small fraction of the data. And then we run our machine learning algorithm to ask the question, well, can it recover that issue? Does it identify it, and does it correctly characterize it? Ultimately, the algorithms behind the scenes produce an anomaly score, you know, how severe is this issue, and identify where the issue is happening. And so we can look at that as, essentially, a prediction for how unexpected a given day's worth of data is, and use that to compare, you know, specificity and sensitivity and performance of the algorithm against this chaos benchmark set of data.
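A toy version of that chaos-testing loop, reusing the `detect_drift` sketch from above, might look like the following; the three injected issues are only examples of the roughly 30 classes Jeremy mentions, and the data and thresholds are invented:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# A clean benchmark table in which amount genuinely depends on platform.
platform = rng.choice(["ios", "android"], 5_000)
amount = np.where(platform == "ios", rng.normal(60, 10, 5_000), rng.normal(40, 10, 5_000))
clean = pd.DataFrame({"platform": platform, "amount": amount})

def inject_nulls(df, column, fraction):
    """Chaos operation: blank out a fraction of one column's values."""
    out = df.copy()
    out.loc[out.sample(frac=fraction, random_state=1).index, column] = np.nan
    return out

def inject_scale(df, column, factor, fraction):
    """Chaos operation: corrupt a fraction of one column by scaling it."""
    out = df.copy()
    idx = out.sample(frac=fraction, random_state=1).index
    out.loc[idx, column] *= factor
    return out

def inject_shuffle(df, column):
    """Chaos operation: shuffle one column, breaking its relationship to the others."""
    out = df.copy()
    out[column] = rng.permutation(out[column].to_numpy())
    return out

# Benchmark: can the detector recover each injected issue? An AUC near 0.5 means it
# was missed; the further above 0.5, the more clearly it was recovered.
for name, corrupted in {
    "10% nulls in amount": inject_nulls(clean, "amount", 0.10),
    "20% of amounts scaled 10x": inject_scale(clean, "amount", 10.0, 0.20),
    "platform column shuffled": inject_shuffle(clean, "platform"),
}.items():
    auc, drivers = detect_drift(clean, corrupted)    # detect_drift: see the earlier sketch
    print(f"{name}: AUC = {auc:.2f}, top driver = {drivers.index[0]}")
```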
[00:27:39] Unknown:
With the model selection itself, I'm wondering how you have approached the architectural elements of the models to be able to make them scalable and maintainable. And you mentioned explainability as well. And so I'm curious how you have worked through that design and selection process to ensure that it is as powerful and useful as possible while still being maintainable over the long run and being able to maybe deal with cold start, where the client has a, you know, small sampling of data, and so you need to be able to start making inferences without having a huge corpus?
[00:28:18] Unknown:
Great question. So there's a few things there. You know, one is, you know, we are using gradient boosting decision trees when we build these models. We're not using deep learning models. And, you know, that's actually pretty important because we're working with predominantly structured data. We do take JSON, and we automatically structure that JSON data, and it can also be fed into these algorithms. But the nice thing about the gradient boosting decision trees is they have a relatively, you know, bounded runtime, and so they're not gonna take a tremendous amount of time. They can also work on relatively small amounts of data.
And, you know, we have a process for training them such that they don't overfit even on relatively small datasets. One of the key components is, you know, what features are we generating? And this is a fascinating part of building a product like Anomalo. We have to deploy Anomalo oftentimes in a VPC, you know, in environments we may not even be able to log in to and see. You know, if it's a health care customer that's very sensitive to privacy and HIPAA compliance, we may not even be able to log in and see. And so this is an algorithm that's running and deployed in lots of different environments and has to encounter arbitrary data. And so the, you know, most important thing we do is the feature generation process.
And, you know, given a structured set of data, how are features automatically generated? And so, you know, we have a whole library of tools for automatically generating features that are relevant for detecting these types of anomalies, you know, given the, you know, the chaos testing strategy that I described. And, originally, we started down a greedy path of let's keep adding, you know, new features and complexity in the model such that we got, you know, ever higher ROC scores on detecting chaos. And at a certain point, we realized that we'd created a monster that, you know, we were now able to detect things. But even with all the advanced visualizations, natural language explanations that we have, in the end, a user would look at it and go, well, I just don't know what to do with that explanation.
You know, you have found something that is maybe statistically significant but not really interpretable to me. And so we've had to pare back the types of features that we include and ensure that any feature will always have a meaningful explanation to the end user, even at the expense of losing some of, you know, the performance of the actual algorithm itself. The last thing you mentioned is, what about the cold start problem? And so another thing that we built on top of this is an online learning process. Right? Every day, the algorithm's actually rebuilt.
And so we retrain the algorithm to relearn the new patterns. We store a tremendous amount of metadata coming out of that run, and we build up time series of these model runs. And for the cold start problem, we actually set thresholds that start out at an extremely high value, and they gradually decay down to what we learn to be an accurate threshold for the algorithm given what we're actually observing in the data. And so that way we can begin and, you know, have this run on day one. And it won't send any notifications. It's going to be kinda cautious, and it's going to begin to learn and observe the pattern of changes occurring in the data over time.
And then as it becomes confident over the first few weeks, it'll be able to identify really serious changes. And then eventually, you know, all unexpected changes, it will be able to identify accurately.
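The exact cold-start scheme isn't described, but the idea Jeremy outlines can be sketched as a threshold that starts out very conservative and decays toward a level learned from the table's own history of anomaly scores:

```python
import numpy as np

def alert_threshold(score_history, start=0.99, target_quantile=0.995, warmup_days=21):
    """Begin with a very high (cautious) threshold, then decay toward a learned one.

    score_history is the list of daily anomaly scores (0..1) observed so far for this
    table. With little history the threshold stays near start, so the system watches
    quietly; as runs accumulate it converges to a high quantile of the observed scores.
    """
    if not score_history:
        return start
    learned = float(np.quantile(score_history, target_quantile))
    weight = min(len(score_history) / warmup_days, 1.0)   # ramps from 0 to 1 over the warm-up
    return (1 - weight) * start + weight * learned

# Day 3 of monitoring: still close to the conservative starting threshold.
print(round(alert_threshold([0.42, 0.55, 0.48]), 3))
# After a month of history: the threshold reflects this table's own behavior.
print(round(alert_threshold(list(np.random.default_rng(1).beta(2, 5, 30))), 3))
```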
[00:31:49] Unknown:
In terms of that time series aspect of it, and you alluded to this a little bit earlier, there's the question of seasonality in the patterns of the data that you're working with, where that might be seasonality over the period of a day or weeks or years. And I'm curious how you approach being able to identify those patterns and understand them appropriately. And then also the question of windowing: over what period of time do I care about these anomalous events? Do I only care if an anomaly is happening within these hour boundaries? Do I care about days, weeks? Like, at what point does the window become either too large or too small?
[00:32:26] Unknown:
Good question. So in terms of seasonality, the underlying machine learning algorithm itself, the way we've constructed how it works with the data, it's able to account for short-term seasonality in the gradient boosting decision tree algorithm. And so it's able to look for day-of-week seasonality or hour-of-day seasonality, those kinds of, you know, very, very short term patterns. We then control for additional seasonality, even things like day of month, day of year, you know, holiday effects, you know, changes in trend, with all of the time series algorithms that we run on top of the metadata coming out of the machine learning algorithm. So seasonality was one of the most important things to get right, and it actually required a couple of these different approaches in conjunction to do well. So that's the answer to the seasonality.
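As a small illustration of the short-term part, one could hand the drift model calendar features like these so that, say, a quiet Sunday isn't mistaken for drift; the exact feature set Anomalo derives isn't public, and the longer-horizon effects (day of month, holidays, trend) are handled by the time-series models layered on top, as described above:

```python
import pandas as pd

def add_seasonality_features(df: pd.DataFrame, ts_column: str) -> pd.DataFrame:
    """Derive short-term seasonal features the drift model can condition on."""
    out = df.copy()
    ts = pd.to_datetime(out[ts_column])
    out["day_of_week"] = ts.dt.dayofweek     # weekly seasonality
    out["hour_of_day"] = ts.dt.hour          # intra-day seasonality
    # Drop the raw timestamp: it trivially separates one day's sample from the next.
    return out.drop(columns=[ts_column])

events = pd.DataFrame({
    "created_at": ["2022-03-04 09:15:00", "2022-03-05 23:40:00", "2022-03-06 02:05:00"],
    "amount": [12.0, 30.0, 7.0],
})
print(add_seasonality_features(events, "created_at"))
```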
Mostly, we operate daily. That's where we see customers finding the most value for this kind of fully automated system. And there's a couple of reasons for that. One is you've gotta have a human in the loop in a lot of these evaluations. Someone needs to look at the visualizations and make the assessment. Is this something I expected? Is it unexpected? Is it, you know, significant? Who do I bring in? And so, ultimately, you want the system to operate at the same decision scale, right, as the humans. And, typically, daily is the right cadence in most cases.
You can certainly identify hourly changes on a daily basis. But if you want to run this every hour, it's possible, but it does introduce more risk of false positives. Now you have, you know, 24 opportunities each day to potentially get an alert. And so we recommend doing that only in kind of extreme situations when you really care very deeply about something, and you're going to be responding in real time. Right? That's the key component that you're gonna have someone making a decision or acting in that kind of real time moment. You certainly can run it on slower time scales, and what you're looking for typically is slow drift in your data. Right? A leak of some sort. And so instead of a sudden sharp change, maybe you've had something that has slowly drifted or deteriorated.
And so what looks very slow day to day might look very sharp on a weekly basis. And so you can expand that time scale if you want to find those things. Oftentimes, we'll support customers doing that with metrics because you wanna be a little bit more focused on exactly the kind of drift you wanna care about versus looking for arbitrary drift in your business.
[00:34:49] Unknown:
And another interesting element of being able to manage these anomaly detection scenarios is the question of being able to use something like Anomalo in the CI/CD workflows of managing the changes in the data processing and data generation utilities that you're running, and some of the ways that you can be proactive about identifying, okay, this change in my transformation logic is actually causing this problem in data, before I actually push that to run against the production environment.
[00:35:23] Unknown:
Absolutely, Tobias. It's actually one of the most common requests we get from teams once they've deployed Anomalo: how can I make sure the staging table that I'm about to promote into production is good, and how can I tie the checks more deeply into my CI/CD pipelines or other data pipelines? And we've designed the product in a way where you can kind of combine our UI and our API in good ways. So, for example, you know, you can open up our UI to anyone who cares about data, anyone in your organization that's going to designate tables that we should monitor or set up rules and define metrics. But then through the API, you can do things like cloning, you know, the types of rules and metrics that are pointed at a production table, and just applying them to your staging table, to the table in progress, right, that you're building.
And so you're actually able to very easily answer questions like, If I promote this dataset to its final destination, what alerts will fire? Right? And then you can make a decision whether that's okay or whether you should pause that process, right, and try to figure it out before that data goes live. So that's a great way we see customers using the API to accomplish what you're describing.
[00:36:45] Unknown:
We also have some, you know, advanced validation rules that allow you to diff tables, and that diff can be done in a deterministic way. Right? I'm gonna define the primary keys, and I want to identify, you know, are there any row differences, duplicate rows, you know, any changes between my prod and staging version on a row level or down to individual values that are different. And then we'll apply our statistical analysis to help summarize where those changes are happening. And so I think sometimes we get overconfident writing tests for the transformation without realizing, well, what are all of the implications in the data? And so doing a diff can help you check for the actual implications and changes in the data. And we're going to be releasing a machine learning technique to automatically diff tables even when you don't have a primary key constraint, something that can work across databases, work across different samples of data, that can be pretty useful in this context as well.
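For the deterministic case, a bare-bones pandas sketch of a primary-key diff is shown below; a production implementation like the one described would push this down into warehouse SQL and handle composite keys, type mismatches, and sampling, and the tables here are invented:

```python
import pandas as pd

def diff_tables(prod: pd.DataFrame, staging: pd.DataFrame, key: str) -> dict:
    """Deterministic diff on a primary key: missing rows, duplicate keys, changed values."""
    report = {
        "only_in_prod": sorted(set(prod[key]) - set(staging[key])),
        "only_in_staging": sorted(set(staging[key]) - set(prod[key])),
        "duplicate_keys_in_staging": staging.loc[staging[key].duplicated(), key].tolist(),
    }
    both = prod.merge(staging, on=key, suffixes=("_prod", "_staging"))
    changed = {}
    for col in prod.columns.drop(key):
        mask = both[f"{col}_prod"] != both[f"{col}_staging"]
        if mask.any():
            changed[col] = both.loc[mask, key].tolist()   # keys whose values differ
    report["changed_values"] = changed
    return report

prod = pd.DataFrame({"order_id": [1, 2, 3], "status": ["paid", "paid", "refunded"]})
staging = pd.DataFrame({"order_id": [1, 2, 2, 4], "status": ["paid", "pending", "pending", "paid"]})
print(diff_tables(prod, staging, "order_id"))
```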
[00:37:46] Unknown:
The StreamSets DataOps platform is the world's first single platform for building smart data pipelines across hybrid and multi-cloud. Amp up your productivity with an easy-to-navigate interface and hundreds of prebuilt connectors, and get new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you're up and running, your smart data pipelines are resilient to data drift, those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, a single pane of glass for operating and monitoring all of your data pipelines gives you the full transparency and control you desire for your data operations.
Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners that subscribe to StreamSets' professional tier will receive 2 months free after their first month. Another interesting element of this product, and you started to dig into this a little bit with the question of explainability, is the communications aspect of being able to say, okay, I've detected this statistical anomaly, but now I need to be able to describe the problem to somebody who can do something about it. And so I'm wondering if you can talk to some of the ways that you have worked through the types of information that you need to be able to convey and the types of questions that need to be answered: What does it actually mean at the broader picture? Where is this data being used? You know, what is the appropriate resolution for this? Do I need to go back to the source system and reexport and rebuild all these tables? Do I need to contact my dev team because they've, you know, messed up the schema of this event structure? And, you know, kind of managing the collaboration and organizational aspects of data quality?
[00:39:43] Unknown:
Yeah. So in terms of the information that we provide, at the very top level, we're gonna be summarizing, you know, the issue in natural language and, you know, saying what is the specific statistic, what is the specific count, right, of issues that have occurred. That's kind of the most simplistic summary, if you will. We then always visualize that and then put that in historical context. So, you know, it's always important to be able to understand, you know, what's been the history of changes in this statistic or, you know, in this value, in this distribution, and how is this different or unique from that? And also put it in the context of confidence intervals produced by models. Right? How unexpected is this change, really? And so we have visualizations that help to explain that.
All of that is just to give you a sense of, you know, severity and historical context for the issue. We then dig deeper into, well, where in the data did it occur. And we have a pretty, you know, interesting approach to that, this root cause analysis, where we will take a sample of bad data and a sample of good data and do statistical analysis on those two samples to be able to identify, for every segment (where a segment is, you know, something you can imagine as a WHERE SQL filter on the table), how indexed is that segment to the bad data versus the good data? And, you know, as we correlate and cluster and rank those, it will give someone a very clear road map into exactly where in the data the issue is occurring. And that's often, you know, super important to be able to understand, you know, what part of the business, what process upstream, you know, what team might be responsible or engaged in having caused this issue or needing to root cause and diagnose it.
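A stripped-down version of that good-sample-versus-bad-sample comparison could rank segments by how over-represented they are among the bad rows; the clustering, correlation handling, and ranking Anomalo actually does are more involved, and the data here is made up:

```python
import numpy as np
import pandas as pd

def rank_segments(bad: pd.DataFrame, good: pd.DataFrame, min_support: int = 20) -> pd.DataFrame:
    """Which column = value segments are over-represented among the bad rows?"""
    rows = []
    for col in bad.columns:
        if bad[col].dtype != object:                 # keep the sketch to categorical segments
            continue
        for value in bad[col].dropna().unique():
            n_bad = int((bad[col] == value).sum())
            if n_bad < min_support:
                continue
            share_bad = n_bad / len(bad)
            share_good = ((good[col] == value).sum() + 1) / (len(good) + 1)   # smoothed
            rows.append({"segment": f"{col} = {value}",
                         "share_of_bad": round(share_bad, 3),
                         "share_of_good": round(float(share_good), 3),
                         "lift": round(share_bad / share_good, 2)})
    return pd.DataFrame(rows).sort_values("lift", ascending=False)

rng = np.random.default_rng(3)
good = pd.DataFrame({"platform": rng.choice(["ios", "android"], 2_000),
                     "partner": rng.choice(["acme", "globex", "initech"], 2_000)})
bad = pd.DataFrame({"platform": ["android"] * 900 + ["ios"] * 100,
                    "partner": ["acme"] * 800 + ["globex"] * 200})
print(rank_segments(bad, good).head(4))   # partner = acme and platform = android float to the top
```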
We send all of this information as an alert into Slack. And so, you know, the communication happens in Slack or Microsoft Teams; that's typically the point of consumption. And our best practice with customers is for them to set up multiple different team channels and have each team channel subscribe to, you know, a subset of tables in their data warehouse that that team really cares about. And so they'll get all of the, you know, text descriptions and the key visualization, you know, who created, who last edited the check, those kinds of context, piped directly into their team channel. They can click in to learn more and view the history of the data and, you know, this kind of root cause analysis of the check, and then have a conversation about what the root cause is and what steps they should take next. Yeah. And, Tobias, we found
[00:42:17] Unknown:
this sort of routing of the alerts and the visualizing and explaining of what's going on and the root causing to actually be as important as the detection of the issue. I mean, I'm sure you and your audience have experienced other alerting systems, and very often you just get alerts, which just build up because it takes so long to investigate each one and to dig into what's happening. And so we didn't want that to be an outcome of using Anomalo. We wanted you to be able to evaluate an alert from Anomalo in 10 seconds, right? And kind of flag it as, you know, this is an issue, I need to resolve it, or I need to send this to my teammate, or, this is okay, I'm going to ignore this for now. And so we focus quite a bit on making sure the right people see the alert, making sure we give you the best summary at the moment of consuming the alert. And then if you want to investigate that, we give you that sort of statistical root cause that we're able to compute so you can figure out, you know, the next steps to take in investigating.
We'll even generate a SQL query for you. This is one of the fun features where we'll generate a SQL query that you can paste into your favorite SQL client, connect to your data warehouse, and get the bad data out, right? So you can just consume it yourself as you investigate. And so a lot of our customers really appreciate that we're saving them a lot of time, not just in detecting issues, but also in doing the investigation.
[00:43:43] Unknown:
And another element of the kind of communications piece of this is the question of alerting. And one of the problems that any vendor who's working on generating alerts runs into is the problem of alert fatigue and being able to actually send information that is useful and actionable and isn't just going to be ultimately ignored by somebody. And so I'm curious how you approach that problem and some of the ways that you have built in feedback systems for your users to be able to say, I don't actually care about this kind of alert, so don't even bother sending it anymore, or I do care about this alert, but not for this piece of data, and just some of the kind of nuance that goes into that process?
[00:44:25] Unknown:
Yeah. Great question, Tobias. And this is an area where we focused quite a bit, and we've approached it on multiple levels to make sure that we're not creating alert fatigue. One of the most important things is just routing the alerts to the right people, right? It's hard to get alert fatigue over something that you really care about, because you're going to keep wanting to make sure that it's okay. And so some of the routing functionality that we've built into Anomalo, being able to route particular tables directly to their teams, or even particular checks that we're running directly to a particular team that may only care about that specific metric or that specific rule and not the entire table.
You know, that's been a pretty big part of it. Two is we actually have an intelligent layer built into our systems for when it encounters a duplicate alert, right, when it sent an alert on this issue before, and now it's about to send the same alert again. Right? You may have seen, and I'm sure your audience has seen, a lot of alerting systems where once a condition happens, they just keep sending you that alert again and again and again and they build up. We actually have a little bit of intelligence where we try to figure out, if we sent you that alert before and we're about to send it to you again, should we do it? Right? And we base that on, kind of, you know, have you taken action to resolve this issue? Or how fast do you take action to resolve issues?
And so we'll actually decrease the cadence of certain alerts automatically if they continue to come up. And then third, and this is an area where of course we're investing quite a bit, is the feedback level. So today we take in some, you know, implicit feedback of whether you corrected the issue or not. But we're also focused on our roadmap on getting explicit feedback, and on having flows that allow folks to say, you know, I don't care about this, or this is too sensitive, I only want issues of greater magnitude, you know, for this particular element, to generate alerts.
[00:46:32] Unknown:
So that's a big part of it too. And one of the things that our customers really love about Anomalo is how easy it is to create these checks and to edit and configure them and how flexible the system is. And so it's, you know, just a couple of mouse clicks away on any alert to go in and change a WHERE SQL filter to pull out data that the alert shouldn't apply to, or to change the confidence interval to make it less likely to alert, or to reduce the severity of the alert so that it isn't as important or likely to alert you. You know, that's just a few. There's probably a hundred different types of configuration changes that you can make that will reduce the likelihood of being alerted.
And, you know, making that really easy for users to do is another big part of this. Given the fact that you
[00:47:22] Unknown:
are running and you're getting these feedback mechanisms for being able to understand, okay, this error is actually interesting and useful, this one's not. How are you working those larger feedback cycles into your product to be able to say, okay, these data issues aren't really all that interesting in the majority case, so we're not going to turn it on by default. Or, you know, this is a type or class of error in data that we didn't anticipate that's being reported by users, or this is the type of check that's being created, and so now we're going to work on being able to actually validate that from a statistical perspective, and just managing that product cycle.
[00:48:06] Unknown:
You already summarized it very well. You know, when we first started, we began with just the unsupervised machine learning detecting drift. And so there wasn't a lot of additional user feedback. And it was through discussions with users that we realized, you know, what else did they care about? We created the system that allowed them to come in and, you know, create these metrics, create the validation rules. And we've just learned a tremendous amount from our design partners early in our venture, and now through all of our customers that are using the product. We'll observe, you know, one customer setting up a lot of checks in a manual process and recognize that that's entirely something we can automate.
And so, you know, we've gone from, in the beginning, having just one fully automated check to now we have eight, you know, fully automated checks, each one often finding multiple different types of issues. And that's been following, you know, what are the kind of most obvious, consistent things that our customers want to have entirely automated. And I think that's a special part of Anomalo, is that we have the, you know, engineering skill and the machine learning skill to be able to automate some of these complex processes. It's not always easy to do, especially when these are running, you know, on arbitrary sets of data in arbitrary environments. How do you come up with something that's gonna be robust enough that it will work for customers? And we've been able to tackle that again and again.
[00:49:26] Unknown:
In your work of designing and building the Anomalo system and working with your design partners and your early customers, what are some of the most interesting or innovative or unexpected ways that you've seen the product applied?
[00:49:38] Unknown:
Yeah. I'll start with a couple. So one was we really began with a focus on data quality and, you know, identifying what I would call an anomaly. Right? Hence, the name. An anomaly being, you know, a sudden sharp change in the data, something structural happening in the data generation process. But we realized, and our customers realized, that we have a system that's, you know, running these tests on the data every day, and it could be used for other purposes. And so an interesting example came to us from one of our early customers where, you know, they wanted to find IP addresses that were suddenly spiking and hitting their website.
And another customer wanted to find email addresses that were sending large batches of suspicious email. And, really, what these were doing was identifying outliers. And, you know, we thought about creating a general process for identifying outliers, but it's very difficult to do, right? What makes an outlier special? Every dataset has outliers. And so working with these customers, we realized if we gave them some guardrails where they could express, you know, what they care about as an outlier, it could actually be really valuable. And so the structure of that is you can go and identify an entity. And so an entity could be an IP address or an entity could be, you know, an account sending email, and you define a statistic for that entity.
And we then automatically identify if there's ever suspicious behavior that suddenly significantly changes the distribution of that statistic, and you suddenly have, you know, an IP address that's far more of your traffic than you would have expected given the past history or seasonality, or suddenly a new account sending more suspicious email than you would expect. And we then explain that for them. And so that's become, you know, a whole unexpected use case for our product. Another good one I would throw out is actually closer to the machine learning use case. We end up working sometimes with machine learning teams who collect, you know, a tremendous number of features and are publishing these features, and they want to detect drift in them. And it turns out that the same approaches that we're using to look for data quality issues are also really good to apply to unexpected feature drift.
Another good 1 I would throw out is actually closer to the machine learning use case. We end up working sometimes with machine learning teams who collect, you know, a tremendous number of features and are publishing these features, and they want to detect drift in them. And it turns out that the same approaches that we're using to look for data quality issues are also really good at catching unexpected feature drift. Oftentimes, that is a result of some upstream data quality issue. Right? You suddenly have nulls in a feature that you didn't expect, and now your machine learning model goes off the rails. The same algorithms we're using to look for that in the structured data upstream can also be applied to the feature store, insofar as you're replicating those features to your data warehouse, taking the logs of the features used in production and sending those into your data warehouse. And so we've had a number of customers also set up that kind of feature monitoring.
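Along the same lines, here is a deliberately simplified sketch of one kind of feature drift check, assuming production feature logs have been replicated to a warehouse table with hypothetical feature_x and snapshot_date columns. It only looks at a jump in a single feature's null rate against its recent history, which is just one of the many signals a real monitoring system would combine.

```python
# A deliberately simplified sketch of one feature drift check (illustrative
# only). It assumes production feature logs have been replicated to the
# warehouse as a table with hypothetical "feature_x" and "snapshot_date"
# columns, and alerts when the latest null rate jumps well above its
# historical average.
import pandas as pd


def null_rate_drift(
    features: pd.DataFrame,
    col: str = "feature_x",
    date_col: str = "snapshot_date",
    tolerance: float = 0.05,
) -> bool:
    # Null rate of the feature per snapshot date (groupby sorts the dates).
    daily_null_rate = features.groupby(date_col)[col].apply(lambda s: s.isna().mean())
    latest = daily_null_rate.iloc[-1]
    baseline = daily_null_rate.iloc[:-1].mean()
    # Alert if the latest snapshot's null rate exceeds history by the tolerance.
    return bool(latest > baseline + tolerance)
```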
[00:52:11] Unknown:
In your work of building the product and the business, what are some of the most interesting or unexpected or challenging lessons that you've each learned in the process?
[00:52:20] Unknown:
I think, you know, 1 of the most important things, and maybe this 1 is obvious, is you have to listen to your customers and meet them where they are, right? As Jeremy mentioned, we started perhaps naively with this idea that we're going to have this great, amazing machine learning model that's going to solve everything, is magically going to find all the issues and rank them by severity and importance. And it just wasn't enough. It wasn't something that, out of the gate, our customers could trust, because machine learning models are, you know, sometimes difficult to understand, sometimes black boxes where you don't know what's going to come out of them. And so we've had to build a much, much larger product than we expected to really help our customers, you know, in addition to obtaining trust in their data, obtain trust in Anomalo, and be able to direct us based on their knowledge of what was important and fine-tune what we can do. And as Jeremy mentioned, it's no longer 1 machine learning model; now there's many, covering different elements and having different focus areas that we know are important. So, you know, having a great idea but also finding where your customers are and meeting them there was a pretty important process to get to where we are.
[00:53:32] Unknown:
I think, in my experience, 1 lesson was identifying, you know, who the customer even would be and what would characterize the customers. Initially, we weren't sure, you know, what industries would make the most sense for Anomalo. And in practice, that actually hasn't mattered much. You know, pretty much any company that has a data warehouse, you know, is gonna have data quality issues if they're using the data, and the data is going to be important enough that they ought to pay attention to them. And so we've ended up working with a huge range of customers, from financial services and payments to health care, identity management, publishing, e commerce, marketplaces.
Just about, you know, any company, you name it, today, there's an appetite to be data driven, and they're collecting data, and they need these kinds of data quality, you know, monitoring solutions. I think the next key realization was, well, how mature should they be? And, you know, we did spend time working early on with, you know, companies that had selected a data warehouse but actually weren't using it yet. And that was 1 of the key things that we had to identify and understand: they really need to have invested in, you know, building processes on top of the data. It needs to really matter to them. And as soon as that has happened, they start to experience these data quality issues, and it makes more sense for a system like Anomalo to come into play.
[00:54:49] Unknown:
For people who are interested in being able to manage the quality of their data and understand when they do have these issues and these anomalous events in their records, what are some of the cases where Anomalo is the wrong choice, and maybe they're better suited to building out their own internal systems or processes or falling back on rules based systems?
[00:55:08] Unknown:
1 example is we have talked to companies doing, you know, pharma drug development, and, you know, they will collect 10,000 observations that are very unique for, you know, 15 patients in 1 large study. And there could be very important and significant data quality issues there, but it's very difficult to use a system like Anomalo to find them, because, really, the only way to find them is to have, you know, some scientist or, you know, product person or engineer state, you know, this is what I expect of the data, and it should conform with that, because it's too small a sample size. It's not being updated regularly. It's this kind of, you know, maybe very valuable single trove of data that is then static. And so that's not a great use case for Anomalo. We tend to want to work with data that is, in some sense, live and being, you know, somewhat continuously updated. It can be even just weekly, but it should be, you know, continuing to arrive as a part of operating the business.
[00:56:12] Unknown:
Yeah. There are companies out there where their core dataset is human entered data, right, where someone manually typed entries into a CRM system, right? And that's also typically not a dataset we're going to be great at, because, you know, it's not a great match for the kind of statistical learning that we do. So we would think of Anomalo as something that's great for anyone that has essentially code collected data, right, that's being collected continuously, because that's a great fit for the type of dataset where we can find lots of meaningful issues.
[00:56:50] Unknown:
Yeah. Or data from, you know, partners or external providers that's arriving to them. So someone else is code collecting it and then passing it to you. That's right. It's almost even better because, you know, then you have much less control, and that can be swapped out from underneath you in lots of different ways. Yeah. Hard to fix an issue in a third party's data pipeline.
[00:57:09] Unknown:
So you better be able to detect it on your side.
[00:57:13] Unknown:
And as you continue to build and evolve the Anomalo product and business, what are some of the things you have planned for the near to medium term, or any projects that you're particularly excited to dig into?
[00:57:23] Unknown:
Yeah, absolutely, Tobias. I think I look at our product as enabling sort of 3 activities in dealing with data quality. Number 1 is detection, right? Can we find the issues? If we can't find the issues, then of course there's nothing you can do about them. Number 2 is root causing of those issues. And number 3 is resolving those issues. And so 1 of the big focus areas for us this year is moving down that stack. You know, we believe we're great at detection. We are amazing at root causing, though of course we could always get better and improve, and there are other situations that we're working on to provide, you know, a deeper root cause for some of the issues.
And then we're just starting to think about how do we help you orchestrate the resolution of the issue? How do we help you document it, so that you can resolve the issue, and also so that if someone comes back later on and tries to understand what happened to the data on a given day, there's some record of the issue occurring and being resolved.
[00:58:33] Unknown:
Are there any other aspects of the Anomalo product or the overall space of data quality and its detection that we didn't discuss yet that you'd like to cover before we close out the show?
[00:58:44] Unknown:
I think we've covered quite a bit, Tobias.
[00:58:47] Unknown:
Thank you for the detailed questions. I would agree. I think we've gone over a lot, and we definitely are excited to continue to evolve and iterate on what we've created, and that's gonna mean making it more accurate, you know, more insightful, finding those features that customers love and making them easier, and broadcasting them across the customer base. So a lot of exciting innovation to come this year.
[00:59:00] Unknown:
Alright. Well, for anybody who wants to follow along with you and keep up to date with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspectives on what you see as being the biggest gap in the tooling or technology for data management today. I'll tell you something, Tobias, that I'm pretty excited about.
[00:59:29] Unknown:
It's been a gap for a long time and it's now being filled, which is kind of the metric store, the metrics layer that's being added to the stack, that a couple of companies are doing. I battled for many years the issue of metrics being defined differently by different people and never lining up when I'm trying to replicate a particular metric, you know, trying to go to the tribal knowledge of the folks around me to figure out how that metric is actually defined. And so 1 thing that I'm pretty excited about, a gap that is being filled by some folks, is this idea of having a metrics layer that has the official definitions of all your metrics, an easy way to replicate them, and a way to slice and dice them in any scenario that you might need.
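As a rough illustration of that idea, and not a description of any particular vendor's product, a metrics layer can be thought of as a central registry of official metric definitions that get compiled into SQL wherever they are needed. The registry, metric, table, and column names below are all hypothetical.

```python
# A rough, hypothetical sketch of a metrics layer: official metric
# definitions live in one central registry and are compiled into SQL on
# demand, so every consumer replicates the same definition. The registry,
# metric, table, and column names here are made up for illustration.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Metric:
    name: str
    table: str
    expression: str                 # SQL aggregate that defines the metric
    dimensions: tuple = field(default_factory=tuple)


METRICS = {
    "weekly_active_users": Metric(
        name="weekly_active_users",
        table="analytics.events",
        expression="COUNT(DISTINCT user_id)",
        dimensions=("event_week", "country"),
    ),
}


def compile_metric_sql(metric_name: str, group_by: list) -> str:
    """Render one official metric definition as SQL, sliced by the requested dimensions."""
    metric = METRICS[metric_name]
    unknown = [d for d in group_by if d not in metric.dimensions]
    if unknown:
        raise ValueError(f"Unsupported dimensions for {metric.name}: {unknown}")
    dims = ", ".join(group_by)
    return (
        f"SELECT {dims}, {metric.expression} AS {metric.name}\n"
        f"FROM {metric.table}\n"
        f"GROUP BY {dims}"
    )


# Every dashboard or notebook asking for this metric gets the same definition.
print(compile_metric_sql("weekly_active_users", ["event_week", "country"]))
```

The point of the pattern is that every consumer of weekly_active_users gets exactly the same definition, rather than each team re-deriving it from tribal knowledge.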
[01:00:14] Unknown:
There's so much happening today. I think I'm almost as excited to see it just mature and find out what actually sticks and, you know, what becomes really widespread in both the data management ecosystem and in the machine learning tooling ecosystem as well. There's a tremendous amount of experimentation, a lot of exciting ideas. I think, ultimately, you know, we wanna see more companies be data driven more efficiently, have faster cycles, and be able to effectively make decisions and ship products that have some meaningful impact on their business using data.
And so I think there's so many things happening right now that it can be almost confusing to people trying to start a new data stack and decide what they should choose and what they shouldn't. And so I'm almost looking for some clarity and consistency around that and then a new wave of innovation on top of that foundational layer of what has actually worked from this recent revolution.
[01:01:13] Unknown:
Alright. Well, thank you both very much for taking the time today to join me and share the work that you're doing at Anomalo. It's definitely a very interesting problem domain. It's great to see the different people who are taking different stabs at it, and I'm definitely interested in the machine learning and statistical approach that you're exploring. So I appreciate all the time and energy that you're putting into that, and I hope you each enjoy the rest of your day. Yeah. Thank you, Tobias. This is great. Thanks for having us, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show and share your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Anomalo with Elliot and Jeremy
The Importance of Data Quality
Guarantees and Promises of Data Quality
Focus and Integration Points of Anomalo
Technical Implementation and Design
Anomalo in CI/CD Workflows
Communication and Alerting
Customer Use Cases and Feedback
When Anomalo is Not the Right Choice
Future Plans for Anomalo
Exciting Trends in Data Management