Summary
Data quality control is a requirement for being able to trust the reports and machine learning models that rely on the information you curate. Rules-based systems are useful for validating known requirements, but with the scale and complexity of data in modern organizations it is impractical, and often impossible, to manually create rules for all potential errors. The team at Anomalo is building a machine-learning-powered platform for identifying and alerting on anomalous and invalid changes in your data so that you aren’t flying blind. In this episode founders Elliot Shmukler and Jeremy Stanley explain how they have architected the system to work with your data warehouse and let you know about the critical issues hiding in your data without overwhelming you with alerts.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription
- The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
- Your host is Tobias Macey and today I’m interviewing Elliot Shmukler and Jeremy Stanley about Anomalo, a data quality platform aiming to automate issue detection with zero setup
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Anomalo is and the story behind it?
- Managing data quality is ostensibly about building trust in your data. What are the promises that data teams are able to make about the information in their control when they are using Anomalo?
- What are some of the claims that cannot be made unequivocally when relying on data quality monitoring systems?
- types of data quality issues identified
- utility of automated vs programmatic tests
- Can you describe how the Anomalo system is designed and implemented?
- How have the design and goals of the platform changed or evolved since you started working on it?
- What is your approach for validating changes to the business logic in your platform given the unpredictable nature of the system under test?
- model training/customization process
- statistical model
- seasonality/windowing
- CI/CD
- With any monitoring system the most challenging thing to do is avoid generating alerts that aren’t actionable or helpful. What is your strategy for helping your customers avoid alert fatigue?
- What are the most interesting, innovative, or unexpected ways that you have seen Anomalo used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Anomalo?
- When is Anomalo the wrong choice?
- What do you have planned for the future of Anomalo?
Contact Info
- Elliot
- Jeremy
- @jeremystan on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Elliot Shmukler and Jeremy Stanley about Anomalo, a data quality platform aiming to automate issue detection with zero setup. So, Elliot, can you start by introducing yourself?
[00:02:07] Unknown:
Hi, Tobias. Thank you so much for having us. I'm Elliot Shmukler, the CEO and cofounder of Anomalo.
[00:02:13] Unknown:
And Jeremy, how about yourself? Yeah, Tobias. Excited to be here with you today. I'm Jeremy Stanley. I'm the CTO and cofounder of Anomalo.
[00:02:20] Unknown:
And going back to you, Elliot, do you remember how you first got involved in the area of data?
[00:02:24] Unknown:
Yeah, absolutely. I've been a growth leader for many years. So I led growth teams at places like LinkedIn, Wealthfront, and Instacart. And as many of your listeners probably know, consumer growth and consumer growth strategies are all quantitative in nature. And so from the earliest days of my career, I had to be great at getting to the data I needed and using all the tools to make sure that we were growing.
[00:02:51] Unknown:
And, Jeremy, what's your background in data?
[00:02:53] Unknown:
I have been a data scientist, using data to make decisions or using data to build and deploy machine learning products. And I can remember even, you know, 10, 15 years ago coming up against, you know, huge collections of data in a data warehouse with no context and no background and, you know, no transparency into the quality of that data and having to figure it out as I went. I've always been
[00:03:18] Unknown:
a very deep user of data and focused on making sure we can get the most out of it. And so that brings us to what we're talking about today and the work that you're doing at Anomalo. I'm wondering if you can just give a bit of an overview about what it is that you're building there and some of the story behind how it came to be and why you decided that this is the problem that you wanted to spend your time and energy on.
[00:03:39] Unknown:
Yeah. Absolutely, Tobias. So Anomalo is a data quality platform, and our goal is to empower data scientists, data analysts, and data engineers to very quickly detect and resolve issues with data quality. The background is that Jeremy and I came from companies where we were constantly trying to use data to make decisions and build products and improve our products and grow faster. And one of the biggest issues we encountered is, you know, the more data you collect and the more you try to use it, the more data quality issues you find. The more you encounter situations where data is missing, or inconsistent between periods because some definition of a particular metric or a particular field in the data has changed, or is otherwise corrupted or stale.
And so we had a number of these situations in our career, and that was the inspiration for building something that will actually find these issues for us without us having to do the work to find them, and without us being surprised when our models break, or our dashboards are wrong, or we simply can't get to the right data when we need
[00:04:57] Unknown:
to. In terms of data quality, there are a number of different ways of thinking about that and a number of different axes and elements of quality. And at the highest level, the objective is to establish and quantify and maintain trust in the data that you're working with. I'm wondering if you can speak to the types of guarantees and promises that data teams are able to make and maintain and validate when they're using the Anomalo platform.
[00:05:28] Unknown:
Yeah, absolutely. So I'll go through a few of them and, Jeremy, chime in as well. Number one, is your data fresh? Right? So we've seen a bunch of situations in our career where datasets fail to update or data fails to arrive for a given day. And that's not always something that's obvious to the folks using that data. So first and foremost, if you're monitoring your data through Anomalo, we will tell you if new data was expected to arrive at this time and did not. And so you can be confident that you're going to know whether your data is fresh. Number two, is your data complete? Right? Very often, you have issues in various parts of the modern data stack that cause you to miss certain data. Maybe you have a bug introduced into how you ingest the data, how you collect the events, or maybe you had a breakdown in a part of your pipeline that didn't include all the data. Or maybe you're getting feeds from third parties, and today that third-party feed is missing some things that it usually has.
And so we will actually look for evidence that your data is incomplete or that something that was in the data historically is now missing. You know, to give you an example, Tobias, back when Jeremy and I worked at Instacart together, we actually had an issue of this kind where, because of an engineering bug, we stopped collecting, you know, behavioral events on our Android app. If you looked at our volume of events or you looked at our overall metrics, you know, it wasn't obvious that anything had moved. But if you looked deeper in the data, you could see that events from the Android platform that were there before were now gone. And so one of the promises that we make when we monitor data with Anomalo is that we will detect these situations where your data is missing.
That's foundational. On top of that, we also help you understand whether metrics that you've defined on this dataset have unusual movements, right? That tends to be pretty important. You're going to use metrics to monitor your business or make decisions. And so we help you understand when those metrics move in unusual ways. And then on top of all that, if you want to establish some very fixed parameters for your data, you know, we don't encourage this. We don't encourage establishing rules for your data because we detect so many things automatically through Anomalo, so many of the most common issues. But if you want to establish some rules, for example, because you want 100% certainty that a particular thing is true about your data, maybe that's important for compliance purposes, then we let you do that as well. So fundamentally, you know, once you have your data monitored with Anomalo, you can be sure that it's fresh and you can be sure that it's complete and nothing is missing there.
And then further, you can make sure that the metrics powered by that data are reasonable and that whatever deterministic rules you want to establish for that data are also met.
[00:08:32] Unknown:
One of the things that's always interesting about data quality platforms is kind of where they focus and what the integration points are into the platform. And one of the things that you mentioned is the situation where your Android application was no longer sending events and being able to identify that. But at that point, it's, you know, kind of too late to resolve the immediate issue until you have gone through another engineering, you know, fix, build, deploy cycle. And so, you know, there's always the question of, at what point do you try to detect these quality issues to be able to remediate them as fast as possible? And I'm curious if you can talk to the focus that you've settled on with Anomalo and maybe some of the ways that people can and should think about augmenting what Anomalo is doing with other platforms or systems to be able to get a more holistic view of the end-to-end quality of data, where you might be able to identify, earlier, before you get into production with that Android app, that these events are no longer being generated and propagated?
[00:09:40] Unknown:
Yeah, absolutely, Tobias. So what we've chosen to do is focus on data that's in your data warehouse, right? We see, a lot of the time, that the folks who are setting up the modern data stack, becoming data-driven organizations, are centralizing all their data in a cloud data warehouse, you know, a warehouse like Snowflake, that's becoming very popular. And so everything from your raw data, you know, the raw events ingested from your apps, the feeds from the third parties, to your sort of refined and more usable data ends up in the data warehouse. So as a first step, we chose to focus on connecting to your data warehouse and letting you monitor anything inside your data warehouse with ease, right? But you're absolutely right.
One of the things that I've always championed in the various roles that I've had is that data and data collection are as important as the correctness of the code, right, underlying your products. So if you're building an Android app, for example, of course, you're going to have unit tests and reviews and all these kinds of things to make sure your app is actually functioning correctly. But actually you should have unit tests for data too, right? Am I emitting events, right? When I view this page and I'm supposed to emit an event on a page view, does that actually happen? It's been amazing to me how often those are missed, you know, in modern development because data is kind of sometimes seen as an afterthought, as not as essential to the functioning of the product. But in modern organizations, it is. Because you're going to try to make decisions based on that data. You're going to try to prioritize future features based on that data. That data may underlie results of experiments that you're running. So I think it's very important, and hopefully your audience is treating it as such, that, you know, data and data collection is, you know, a production-level, code-level issue, right, for these teams, where it needs to be tested and verified.
[00:11:42] Unknown:
Tobias, one thing I would add is I like to use the metaphor of a factory. And, you know, if you're gonna test the quality of a good, you know, manufactured at a factory, the most critical test you can do is at the end of the line. Right, when the product's completely assembled and it should be entirely functional, and you can say, you know, does it meet the specifications you would want? We chose to focus on the, you know, data warehouse because that's that final complex data product that's being consumed by the analysts, the data scientists, you know, people in product. And there's so much complexity upstream of that. And, yes, you can and you should test components of that complexity where you can. Right? Some won't be possible to test. You'll have external things happening, you know, with data vendors or SaaS platforms or other tools. You'll have people making manual decisions, you know, that can't easily be tested.
But then, also, all of the code that your organization is changing and shipping, all of the transformations that are happening in the data warehouse, they can all interact in very complex ways. Right? The individual data bits that are collected at the top of the funnel end up going through incredibly complex transformations. And so the ways those interactions can take place introduce a lot of complexity and a lot of risk of data quality issues. And the only way to really effectively catch those is to test at that final, you know, point of consumption. And, in addition, that's where you also have the most context. You can look at the other data around the issue to understand, you know, what might have happened and, you know, what's being affected.
[00:13:16] Unknown:
Going back to the question of guarantees and being able to validate the promises about the data that you're making, what are some of the promises that can't be made unequivocally based solely on the perspective from the data warehouse, or from the sort of statistical anomaly detection that you're doing at Anomalo based on the information that you do have, and where do you need to go to some external system or some additional manual validation or other sort of process control methods for being able to make and maintain those promises?
[00:13:50] Unknown:
Yeah. So you can think about the Anomalo system as having this, you know, machine learning driven, and we can talk more about what's behind that, but fully automated approach to identifying the most common, you know, issues that are happening in your data and the most common degradations of your data. It's doing that by sampling data from your data warehouse and, you know, learning the distribution and patterns in that data and identifying if there are meaningful, you know, statistically meaningful changes that happen. So there are some limitations there. It can identify that a column that was previously 10% null, you know, and has been, you know, typically 10% null, maybe with some seasonal fluctuations, if that suddenly becomes 20% null, that's statistically significant, and you would want to receive that notification.
But if it was typically 10% null, and you had a single additional unexpected null value, you know, certainly, an algorithm could detect that. But if it was, you know, so finely tuned to be able to detect that, it would detect so many other things you wouldn't care about that you would drown in noise. So, typically, when we think about our fully automated approaches, they're looking for statistically significant changes, or they're looking for changes where you went from a situation where you had, for example, no null values in a column. That would be meaningful. And so the way I like to think about it is it can protect you from the most common types of cases, but there's no free lunch. It's not gonna be able to find every meaningful change.
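To make that distinction concrete, here is a minimal sketch, not Anomalo's actual test (which isn't spelled out in the episode), of how a two-proportion z-test separates a real shift in a column's null rate from a single stray null:

```python
from math import sqrt
from statistics import NormalDist

def null_rate_shift_significant(nulls_before, n_before, nulls_after, n_after, alpha=0.001):
    """Two-proportion z-test: did the column's null rate change more than chance allows?"""
    p1, p2 = nulls_before / n_before, nulls_after / n_after
    pooled = (nulls_before + nulls_after) / (n_before + n_after)
    se = sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if se == 0:                     # both samples entirely null or entirely non-null
        return False, 1.0
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha, p_value

# 10% null yesterday vs 20% null today on 10,000-row samples: clearly flagged.
print(null_rate_shift_significant(1_000, 10_000, 2_000, 10_000))
# A single additional null value: not flagged, because alerting here would just create noise.
print(null_rate_shift_significant(1_000, 10_000, 1_001, 10_000))
```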
And so that's where, to Elliot's point earlier, the key metrics come in. On the key metrics side, when you say, I actually care a great deal about this specific statistic, this specific metric, you're making a really pointed value judgment about what columns you care about and what rows you care about. Right? The specific rows and the specific columns that go into that metric matter a lot to you, and so you wanna pay very close attention to it. And so it's gonna be more sensitive to changes in that very specific part of the table. And, similarly, if you define a validation rule, if you say, you know, this column should simply never be null, then that's the strongest way of asserting something. Right? It's using your own independent expectation and understanding of the data to make that assertion, which is different from looking for a regression, a, you know, change happening in the data. Yeah. I'll add just a couple of things to this, Tobias.
[00:16:12] Unknown:
You know, one truth of systems like ours, statistical and machine learning based systems, is that we're detecting changes in your data. Right? So we've encountered a few situations where the data has been wrong from day one, right? We're not going to spot that. We might recognize when that wrongness has changed or when it got corrected through our models, but we're not going to spot data that has never been correct, right, because we just don't know. And so often we can judge statistical changes in the data, but we can't tell you whether the data correctly reflects, you know, the real world, right, outside.
[00:16:51] Unknown:
So that speaks to the kind of question of the utility of automated versus manually defined or programmatic tests of data. And I'm wondering if you can speak to some of the ways that you and your customers think about the application of Anomalo versus using tools such as Great Expectations and kind of the overlap between the two and when to use which one?
[00:17:14] Unknown:
Our perspective, Tobias, is that you need both, right? You need both automated tests on your data, statistical tests, like the ones at the core of our product, and more manually defined, you know, rules and focus areas to sort of refine that monitoring and fine-tune that monitoring. The issue is it's actually impossible to be on either end of that spectrum. We need to combine both, and I'll tell you why. If, for example, all you're going to do is the manual monitoring, like you can do very effectively with Great Expectations and other tools, it's just a tremendous amount of work for someone to write all of those rules. And not just write them, they have to maintain those rules over time.
As your company launches new products and new geographies and new platforms, right? Those rules have to change. The other issue that we've seen, and we've seen this firsthand when we were at places like Instacart where we had a rules-based system to monitor data quality, is that rules can only protect you against things that you can anticipate. You're not going to write a rule for something you can't imagine happening down the line. But guess what? Things you can't imagine happening happen all the time in data. And so relying 100% on a rules-based system means you're going to miss these unknown unknowns, right, or unanticipated issues in your data.
And because of the level of work you have to do to write all those rules, you can't possibly cover your entire data warehouse, your entire collection of datasets, with this manual approach. So the way we work with our customers is we say, look, the automated approach, the core of Anomalo, should be your foundation. It should be something that you can easily apply to every table in your data warehouse without any work, right? And now you've guaranteed yourself sort of a base level of quality for every table. But for tables that are critical to you, that have your most important data, please go ahead and use Anomalo's other tools, you know, the more manual tools of defining rules and metrics, to kind of fine-tune our monitoring.
So now what we've done is we've given customers, you know, a base level of coverage across their entire data warehouse, right, with protection against unanticipated issues and unknown unknowns in their data. And we've allowed them as a result to focus their manual efforts on just the most critical areas where they make sure to get it perfectly right. And so that's a powerful combination. You know, neither approach in isolation, we think, solves the data quality problem for modern teams. But if you combine the two approaches, and if you combine them in one place like Anomalo does, where they can feed off each other and help each other, then I think you have something pretty powerful.
[00:20:14] Unknown:
And so can you give a bit of an overview about the technical implementation and design of the Anomalo platform and some of the ways that you have approached the automated discovery and alerting on some of these data quality problems that people experience as their data does change and evolve in their data warehouses and their kind of data platforms?
[00:20:38] Unknown:
There's a few different components. The first one is just making sure the data is fresh and that the volume of records in a given table is what you would expect. And so we have a platform that is polling, you know, the data warehouse to constantly monitor, you know, have new records arrived. And then once the records have arrived, is the row count what you would expect? And this is useful both because you can define SLAs for when you want your data to arrive and be alerted if you're missing the last 15 minutes of data for a time period or you have a significant drop in row counts. You know, there's value in that in and of itself, but it's also great because that is a system that gates running all of the other checks that Anomalo runs. So we don't go in and do a whole bunch of statistics and machine learning and execution of these rules on data that's actually just incomplete. It's one great way to avoid false positives. So that's the first part of the system.
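As a rough illustration of that gating step, the sketch below checks freshness against an SLA and compares today's row count to a robust historical baseline before any of the heavier checks run; the specific thresholds and the median baseline are assumptions for the example, not Anomalo's implementation:

```python
from datetime import datetime, timedelta, timezone
from statistics import median

def should_run_deeper_checks(latest_record_ts, daily_row_counts, todays_count,
                             freshness_sla=timedelta(hours=2), min_ratio=0.5):
    """Gate the expensive checks: skip them if the data is stale or clearly incomplete."""
    fresh = datetime.now(timezone.utc) - latest_record_ts <= freshness_sla
    typical = median(daily_row_counts)                  # robust baseline from recent history
    complete_enough = todays_count >= min_ratio * typical
    return fresh and complete_enough

# Data landed 30 minutes ago and today's volume is in line with recent history,
# so it is safe to run the statistical checks without tripping false positives.
print(should_run_deeper_checks(
    latest_record_ts=datetime.now(timezone.utc) - timedelta(minutes=30),
    daily_row_counts=[98_000, 102_000, 97_500, 101_200, 99_800],
    todays_count=100_400,
))
```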
From there, we go in and we sample records from the table. And this is in order to run our machine learning algorithms. And so, you know, the way to think about this is imagine you have a table, and each day, you're sampling, say, 10,000 rows randomly from that table. What Anomalo is doing is looking for drift in those samples of data. And a simple way to imagine how you might do that is suppose you had a random sample of data from today and a random sample of data from yesterday. If you could build a machine learning model that could predict which day each record came from, then something about the data is different. Right? Whatever it was that the machine learning algorithm was able to use to make that prediction accurately, you know, is indicating what has changed from one day to the next.
Now the obvious answer would be, well, the date is, you know, yesterday on the one sample and today on the other. And so that's a very trivial feature that an algorithm could use to distinguish the two days of data. And so what Anomalo does is it learns what all of the features are, and it identifies the ones that are always predictive of time and removes them. And it dampens the columns that are very chaotic. For example, if you're doing marketing and you have a marketing dataset, you might have new campaigns constantly starting in that dataset. And so the campaign IDs or the campaign metadata are constantly changing. It's very chaotic.
It's not a data quality issue. That's an operational change. And so we have a layer on top of this algorithm that's looking for, you know, those repeated chaotic operations and is dampening them. So you're left with just the unexpected changes happening in the data. We then use explainability algorithms like Shapley to, you know, interpret what data in particular, what rows, what columns, what values in particular are causing, you know, this change, this drift in the data. And it's really powerful because we can use that to summarize for the end user. You know, here is the collection of columns that are affected.
You know, this is the nature of the change. Here are visualizations that help to explain it. And we can go even further than that. We can actually identify, well, what characterizes the rows that are changed? You know, is it Android only? Is it this geography or this partner or this provider only? So we provide what we call a root cause analysis into understanding what has actually changed in the data. So that's the basis of the machine learning. And, you know, this can detect, you know, drift in columns, changes in distribution, even changes in the relationships between columns.
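Here is a condensed sketch of that two-sample idea: label yesterday's rows 0 and today's rows 1, train a gradient-boosted classifier, and treat an AUC well above 0.5 as evidence of drift. Anomalo additionally strips trivially time-predictive features, dampens chaotic columns, and explains results with Shapley values; this sketch substitutes permutation importance for the explanation step, and the table and column names are invented for the example:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score

def detect_drift(sample_yesterday: pd.DataFrame, sample_today: pd.DataFrame):
    """Can a model tell yesterday's sample from today's? If so, the data has drifted."""
    X = pd.get_dummies(pd.concat([sample_yesterday, sample_today], ignore_index=True))
    y = np.r_[np.zeros(len(sample_yesterday), dtype=int), np.ones(len(sample_today), dtype=int)]

    model = HistGradientBoostingClassifier(max_iter=50)
    auc = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()  # ~0.5 means no detectable drift

    # Which columns let the model tell the days apart? (A stand-in for Shapley-based explanations.)
    model.fit(X, y)
    imp = permutation_importance(model, X, y, n_repeats=3, random_state=0)
    drivers = pd.Series(imp.importances_mean, index=X.columns).sort_values(ascending=False)
    return auc, drivers

rng = np.random.default_rng(0)
yesterday = pd.DataFrame({"amount": rng.normal(50, 10, 5_000),
                          "platform": rng.choice(["ios", "android"], 5_000)})
today = yesterday.copy()
today["platform"] = "ios"            # e.g. the Android events silently stop arriving

auc, drivers = detect_drift(yesterday, today)
print(round(auc, 2))                 # well above 0.5: drift detected
print(drivers.head(3))               # the platform columns dominate the explanation
```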
And then we have special versions of this that are looking for things like duplicate data, looking for increases in null values or zero values, to highlight those in particular since they're often very concerning to users. After that, we have the key metrics and the validation rules. These would be set up by the user and all executed, you know, in parallel against the warehouse, pulling samples of data and, you know, time series out, building those time series models, and producing visualizations. I think one thing I wanna generally highlight is we spend a lot of time and energy producing visualizations. We probably have a hundred different types in the product.
And we do that because explaining these issues is at least as important as finding them, making sure that the user can quickly understand, is this something I expected? You know, is this something that's concerning? You know, where did it happen? Who do I need to talk to? All of that needs to happen through a visualization and an analysis of the data. And so we do a lot of that in a very automated way as well. As far as the machine learning elements of the system,
[00:25:25] Unknown:
because of the fact that you're trying to detect anomalies in the system, it makes sense to take a statistical approach. But at the same time, one of the challenges of machine learning is that it can be unpredictable. And so I'm curious what your process has been for being able to validate the logic and the statistical models of those machine learning approaches to ensure that you are able to effectively identify and alert on the problematic elements of the data that's under test.
[00:25:52] Unknown:
It's actually a really fun part of what we do, you know, fun for someone with a data science and machine learning background like me. Maybe not fun for everybody, but we have a chaos library for data. And so, you know, this is about 30 different classes of issues that we can introduce into datasets. And we actually do this in the data warehouse. We'll create temporary tables that introduce these issues and mask the underlying data. And the types of chaos that we can introduce, you know, range from simple things like, you know, null value increases or dropping rows to things that are more nuanced, like I'm gonna shuffle the values in a column or I'm going to swap values in a certain way, you know, duplicate data, change the distribution in some small way.
And so we have a whole library of these chaos operations. And what we do is we have a bunch of benchmark datasets. Many of them are public datasets. We've also had customers contribute data to our benchmark datasets. And we run a huge barrage of chaos operations at all of these datasets, you know, introducing one specific issue for some fraction of the data, you know, oftentimes a very small fraction of the data. And then we run our machine learning algorithm to ask the question, well, can it recover that issue? Does it identify it, and does it correctly characterize it? Ultimately, the algorithms behind the scenes produce an anomaly score, you know, how severe is this issue, and identify where the issue is happening. And so we can look at that as, essentially, a prediction for how unexpected a given day's worth of data is, and use that to compare, you know, specificity and sensitivity and performance of the algorithm against this chaos benchmark set of data.
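A toy version of that chaos-testing loop, reusing the `detect_drift` sketch from above, might look like the following; the three injected issues are only examples of the roughly 30 classes Jeremy mentions, and the data and thresholds are invented:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# A clean benchmark table in which amount genuinely depends on platform.
platform = rng.choice(["ios", "android"], 5_000)
amount = np.where(platform == "ios", rng.normal(60, 10, 5_000), rng.normal(40, 10, 5_000))
clean = pd.DataFrame({"platform": platform, "amount": amount})

def inject_nulls(df, column, fraction):
    """Chaos operation: blank out a fraction of one column's values."""
    out = df.copy()
    out.loc[out.sample(frac=fraction, random_state=1).index, column] = np.nan
    return out

def inject_scale(df, column, factor, fraction):
    """Chaos operation: corrupt a fraction of one column by scaling it."""
    out = df.copy()
    idx = out.sample(frac=fraction, random_state=1).index
    out.loc[idx, column] *= factor
    return out

def inject_shuffle(df, column):
    """Chaos operation: shuffle one column, breaking its relationship to the others."""
    out = df.copy()
    out[column] = rng.permutation(out[column].to_numpy())
    return out

# Benchmark: can the detector recover each injected issue? An AUC near 0.5 means it
# was missed; the further above 0.5, the more clearly it was recovered.
for name, corrupted in {
    "10% nulls in amount": inject_nulls(clean, "amount", 0.10),
    "20% of amounts scaled 10x": inject_scale(clean, "amount", 10.0, 0.20),
    "platform column shuffled": inject_shuffle(clean, "platform"),
}.items():
    auc, drivers = detect_drift(clean, corrupted)    # detect_drift: see the earlier sketch
    print(f"{name}: AUC = {auc:.2f}, top driver = {drivers.index[0]}")
```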
[00:27:39] Unknown:
With the model selection itself, I'm wondering how you have approached the architectural elements of the models to be able to make them scalable and maintainable. And you mentioned explainability as well. And so I'm curious how you have worked through that design and selection process to ensure that it is as powerful and useful as possible while still being maintainable over the long run and being able to maybe deal with cold start, where the client has a, you know, small sampling of data, and so you need to be able to start making inferences without having a huge corpus?
[00:28:18] Unknown:
Great question. So there's a few things there. You know, one is, you know, we are using gradient boosting decision trees when we build these models. We're not using deep learning models. And, you know, that's actually pretty important because we're working with predominantly structured data. We do take JSON, and we automatically structure that JSON data, and it can also be fed into these algorithms. But the nice thing about the gradient boosting decision trees is they have a relatively, you know, bounded runtime, and so they're not gonna take a tremendous amount of time. They can also work on relatively small amounts of data.
And, you know, we have a process for training them such that they don't overfit even on relatively small datasets. One of the key components is, you know, what features are we generating? And this is a fascinating part of building a product like Anomalo. We have to deploy Anomalo oftentimes in a VPC, you know, in environments we may not even be able to log in to and see. You know, if it's a health care customer that's very sensitive to privacy and HIPAA compliance, we may not even be able to log in and see. And so this is an algorithm that's running and deployed in lots of different environments and has to encounter arbitrary data. And so the, you know, most important thing we do is the feature generation process.
And, you know, given a structured set of data, how are features automatically generated? And so, you know, we have a whole library of tools for automatically generating features that are relevant for detecting these types of anomalies, you know, given the, you know, the chaos testing strategy that I described. And, originally, we started down a greedy path of let's keep adding, you know, new features and complexity in the model such that we got, you know, ever higher ROC scores on detecting chaos. And at a certain point, we realized that we'd created a monster that, you know, we were now able to detect things. But even with all the advanced visualizations, natural language explanations that we have, in the end, a user would look at it and go, well, I just don't know what to do with that explanation.
You know, you have found something that is maybe statistically significant but not really interpretable to me. And so we've had to pare back the types of features that we include and ensure that any feature will always have a meaningful explanation to the end user, even at the expense of losing some of, you know, the performance of the actual algorithm itself. The last thing you mentioned is, what about the cold start problem? And so another thing that we built on top of this is an online learning process. Right? Every day, the algorithm's actually rebuilt.
And so we retrain the algorithm to relearn the new patterns. We store a tremendous amount of metadata coming out of that run, and we build up time series of these model runs. And for the cold start problem, we actually set thresholds that start out at an extremely high value, and they gradually decay down to what we learn to be an accurate threshold for the algorithm given what we're actually observing in the data. And so that way we can begin and, you know, have this run on day one. And it won't send any notifications. It's going to be kinda cautious, and it's going to begin to learn and observe the pattern of changes occurring in the data over time.
And then as it becomes confident over the first few weeks, it'll be able to identify really serious changes. And then eventually, you know, all unexpected changes, it will be able to identify accurately.
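The exact cold-start scheme isn't described, but the idea Jeremy outlines can be sketched as a threshold that starts out very conservative and decays toward a level learned from the table's own history of anomaly scores:

```python
import numpy as np

def alert_threshold(score_history, start=0.99, target_quantile=0.995, warmup_days=21):
    """Begin with a very high (cautious) threshold, then decay toward a learned one.

    score_history is the list of daily anomaly scores (0..1) observed so far for this
    table. With little history the threshold stays near start, so the system watches
    quietly; as runs accumulate it converges to a high quantile of the observed scores.
    """
    if not score_history:
        return start
    learned = float(np.quantile(score_history, target_quantile))
    weight = min(len(score_history) / warmup_days, 1.0)   # ramps from 0 to 1 over the warm-up
    return (1 - weight) * start + weight * learned

# Day 3 of monitoring: still close to the conservative starting threshold.
print(round(alert_threshold([0.42, 0.55, 0.48]), 3))
# After a month of history: the threshold reflects this table's own behavior.
print(round(alert_threshold(list(np.random.default_rng(1).beta(2, 5, 30))), 3))
```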
[00:31:49] Unknown:
In terms of that time series aspect of it, and you alluded to this a little bit earlier, there's the question of seasonality in the patterns of the data that you're working with, where that might be seasonality over the period of a day or weeks or years. And I'm curious how you approach being able to identify those patterns and understand them appropriately. And then also the question of windowing: over what period of time do I care about these anomalous events? Do I only care if an anomaly is happening within these hour boundaries? Do I care about days, weeks? Like, at what point does the window become either too large or too small?
[00:32:26] Unknown:
Good question. So in terms of seasonality, the underlying machine learning algorithm itself, the way we've constructed how it works with the data, it's able to account for short-term seasonality in the gradient boosting decision tree algorithm. And so it's able to look for day-of-week seasonality or hour-of-day seasonality, those kinds of, you know, very, very short term patterns. We then control for additional seasonality, even things like day of month, day of year, you know, holiday effects, you know, changes in trend, with all of the time series algorithms that we run on top of the metadata coming out of the machine learning algorithm. So seasonality was one of the most important things to get right, and it actually required a couple of these different approaches in conjunction to do well. So that's the answer to the seasonality.
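As a small illustration of the short-term part, one could hand the drift model calendar features like these so that, say, a quiet Sunday isn't mistaken for drift; the exact feature set Anomalo derives isn't public, and the longer-horizon effects (day of month, holidays, trend) are handled by the time-series models layered on top, as described above:

```python
import pandas as pd

def add_seasonality_features(df: pd.DataFrame, ts_column: str) -> pd.DataFrame:
    """Derive short-term seasonal features the drift model can condition on."""
    out = df.copy()
    ts = pd.to_datetime(out[ts_column])
    out["day_of_week"] = ts.dt.dayofweek     # weekly seasonality
    out["hour_of_day"] = ts.dt.hour          # intra-day seasonality
    # Drop the raw timestamp: it trivially separates one day's sample from the next.
    return out.drop(columns=[ts_column])

events = pd.DataFrame({
    "created_at": ["2022-03-04 09:15:00", "2022-03-05 23:40:00", "2022-03-06 02:05:00"],
    "amount": [12.0, 30.0, 7.0],
})
print(add_seasonality_features(events, "created_at"))
```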
Mostly, we operate daily. That's where we see customers finding the most value for this kind of fully automated system. And there's a couple of reasons for that. One is you've gotta have a human in the loop in a lot of these evaluations. Someone needs to look at the visualizations and make the assessment. Is this something I expected? Is it unexpected? Is it, you know, significant? Who do I bring in? And so, ultimately, you want the system to operate at the same decision scale, right, as the humans. And, typically, daily is the right cadence in most cases.
You can certainly identify hourly changes on a daily basis. But if you want to run this every hour, it's possible, but it does introduce more risk of false positives. Now you have, you know, 24 opportunities each day to potentially get an alert. And so we recommend doing that only in kind of extreme situations when you really care very deeply about something, and you're going to be responding in real time. Right? That's the key component that you're gonna have someone making a decision or acting in that kind of real time moment. You certainly can run it on slower time scales, and what you're looking for typically is slow drift in your data. Right? A leak of some sort. And so instead of a sudden sharp change, maybe you've had something that has slowly drifted or deteriorated.
And so what looks very slow day to day might look very sharp on a weekly basis. And so you can expand that time scale if you want to find those things. Oftentimes, we'll support customers doing that with metrics because you wanna be a little bit more focused on exactly the kind of drift you wanna care about versus looking for arbitrary drift in your business.
[00:34:49] Unknown:
And another interesting element of being able to manage these anomaly detection scenarios is the question of being able to use something like Anomalo in the CI/CD workflows of managing the changes in the data processing and data generation utilities that you're running, and some of the ways that you can be proactive about identifying, okay, this change in my transformation logic is actually causing this problem in data, before I actually push that to run against the production environment.
[00:35:23] Unknown:
Absolutely, Tobias. It's actually one of the most common requests we get from teams once they've deployed Anomalo: how can I make sure the staging table that I'm about to promote into production is good, and how can I tie the checks more deeply into my CI/CD pipelines or other data pipelines? And we've designed the product in a way where you can kind of combine our UI and our API in good ways. So, for example, you know, you can open up our UI to anyone who cares about data, anyone in your organization that's going to designate tables that we should monitor or set up rules and define metrics. But then through the API, you can do things like cloning, you know, the types of rules and metrics that are pointed at a production table, and just applying them to your staging table, to the table in progress, right, that you're building.
And so you're actually able to very easily answer questions like, If I promote this dataset to its final destination, what alerts will fire? Right? And then you can make a decision whether that's okay or whether you should pause that process, right, and try to figure it out before that data goes live. So that's a great way we see customers using the API to accomplish what you're describing.
[00:36:45] Unknown:
We also have some, you know, advanced validation rules that allow you to diff tables, and that diff can be done in a deterministic way. Right? I'm gonna define the primary keys, and I want to identify, you know, are there any row differences, duplicate rows, you know, any changes between my prod and staging version on a row level or down to individual values that are different. And then we'll apply our statistical analysis to help summarize where those changes are happening. And so I think sometimes we get overconfident writing tests for the transformation without realizing, well, what are all of the implications in the data? And so doing a diff can help you check for the actual implications and changes in the data. And we're going to be releasing a machine learning technique to automatically diff tables even when you don't have a primary key constraint, something that can work across databases, work across different samples of data, that can be pretty useful in this context as well.
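For the deterministic case, a bare-bones pandas sketch of a primary-key diff is shown below; a production implementation like the one described would push this down into warehouse SQL and handle composite keys, type mismatches, and sampling, and the tables here are invented:

```python
import pandas as pd

def diff_tables(prod: pd.DataFrame, staging: pd.DataFrame, key: str) -> dict:
    """Deterministic diff on a primary key: missing rows, duplicate keys, changed values."""
    report = {
        "only_in_prod": sorted(set(prod[key]) - set(staging[key])),
        "only_in_staging": sorted(set(staging[key]) - set(prod[key])),
        "duplicate_keys_in_staging": staging.loc[staging[key].duplicated(), key].tolist(),
    }
    both = prod.merge(staging, on=key, suffixes=("_prod", "_staging"))
    changed = {}
    for col in prod.columns.drop(key):
        mask = both[f"{col}_prod"] != both[f"{col}_staging"]
        if mask.any():
            changed[col] = both.loc[mask, key].tolist()   # keys whose values differ
    report["changed_values"] = changed
    return report

prod = pd.DataFrame({"order_id": [1, 2, 3], "status": ["paid", "paid", "refunded"]})
staging = pd.DataFrame({"order_id": [1, 2, 2, 4], "status": ["paid", "pending", "pending", "paid"]})
print(diff_tables(prod, staging, "order_id"))
```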
[00:37:46] Unknown:
The StreamSets DataOps platform is the world's first single platform for building smart data pipelines across hybrid and multi-cloud. Amp up your productivity with an easy-to-navigate interface and hundreds of prebuilt connectors, and get new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you're up and running, your smart data pipelines are resilient to data drift, those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, a single pane of glass for operating and monitoring all of your data pipelines gives you the full transparency and control you desire for your data operations.
Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners that subscribe to StreamSets' professional tier will receive 2 months free after their first month. Another interesting element of this product, and you started to dig into this a little bit with the question of explainability, is the communications aspect of being able to say, okay, I've detected this statistical anomaly, but now I need to be able to describe the problem to somebody who can do something about it. And so I'm wondering if you can talk to some of the ways that you have worked through the types of information that you need to be able to convey and the types of questions that need to be answered: What does it actually mean at the broader picture? Where is this data being used? You know, what is the appropriate resolution for this? Do I need to go back to the source system and reexport and rebuild all these tables? Do I need to contact my dev team because they've, you know, messed up the schema of this event structure? And, you know, kind of managing the collaboration and organizational aspects of data quality?
[00:39:43] Unknown:
Yeah. So in terms of the information that we provide, at the very top level, we're gonna be summarizing, you know, the issue in natural language and, you know, saying what is the specific statistic, what is the specific count, right, of issues that have occurred. That's kind of the most simplistic summary, if you will. We then always visualize that and then put that in historical context. So, you know, it's always important to be able to understand, you know, what's been the history of changes in this statistic or, you know, in this value, in this distribution, and how is this different or unique from that? And also put it in the context of confidence intervals produced by models. Right? How unexpected is this change, really? And so we have visualizations that help to explain that.
All of that is just to give you a sense of, you know, severity and historical context for the issue. We then dig deeper into, well, where in the data did it occur. And we have a pretty, you know, interesting approach to that, this root cause analysis, where we will take a sample of bad data and a sample of good data and do statistical analysis on those two samples to be able to identify, for every segment (where a segment is, you know, something you can imagine as a WHERE SQL filter on the table), how indexed is that segment to the bad data versus the good data? And, you know, as we correlate and cluster and rank those, it will give someone a very clear road map into exactly where in the data the issue is occurring. And that's often, you know, super important to be able to understand, you know, what part of the business, what process upstream, you know, what team might be responsible or engaged in having caused this issue or needing to root cause and diagnose it.
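A stripped-down version of that good-sample-versus-bad-sample comparison could rank segments by how over-represented they are among the bad rows; the clustering, correlation handling, and ranking Anomalo actually does are more involved, and the data here is made up:

```python
import numpy as np
import pandas as pd

def rank_segments(bad: pd.DataFrame, good: pd.DataFrame, min_support: int = 20) -> pd.DataFrame:
    """Which column = value segments are over-represented among the bad rows?"""
    rows = []
    for col in bad.columns:
        if bad[col].dtype != object:                 # keep the sketch to categorical segments
            continue
        for value in bad[col].dropna().unique():
            n_bad = int((bad[col] == value).sum())
            if n_bad < min_support:
                continue
            share_bad = n_bad / len(bad)
            share_good = ((good[col] == value).sum() + 1) / (len(good) + 1)   # smoothed
            rows.append({"segment": f"{col} = {value}",
                         "share_of_bad": round(share_bad, 3),
                         "share_of_good": round(float(share_good), 3),
                         "lift": round(share_bad / share_good, 2)})
    return pd.DataFrame(rows).sort_values("lift", ascending=False)

rng = np.random.default_rng(3)
good = pd.DataFrame({"platform": rng.choice(["ios", "android"], 2_000),
                     "partner": rng.choice(["acme", "globex", "initech"], 2_000)})
bad = pd.DataFrame({"platform": ["android"] * 900 + ["ios"] * 100,
                    "partner": ["acme"] * 800 + ["globex"] * 200})
print(rank_segments(bad, good).head(4))   # partner = acme and platform = android float to the top
```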
We send all of this information as an alert into Slack. And so, you know, the communication happens in Slack or Microsoft Teams; that's typically the point of consumption. And our best practice with customers is for them to set up multiple different team channels and have each team channel subscribe to, you know, a subset of tables in their data warehouse that that team really cares about. And so they'll get all of the, you know, text descriptions and the key visualization, you know, who created, who last edited the check, those kinds of context, piped directly into their team channel. They can click in to learn more and view the history of the data and, you know, this kind of root cause analysis of the check, and then have a conversation about what the root cause is and what steps they should take next. Yeah. And, Tobias, we found
[00:42:17] Unknown:
this sort of routing of the alerts and the visualizing and explaining of what's going on and the root causing to actually be as important as the detection of the issue. I mean, I'm sure you and your audience have experienced other alerting systems, and very often you just get alerts, which just build up because it takes so long to investigate each one and to dig into what's happening. And so we didn't want that to be an outcome of using Anomalo. We wanted you to be able to evaluate an alert from Anomalo in 10 seconds, right? And kind of flag it as, you know, this is an issue, I need to resolve it, or I need to send this to my teammate, or, this is okay, I'm going to ignore this for now. And so we focus quite a bit on making sure the right people see the alert, making sure we give you the best summary at the moment of consuming the alert. And then if you want to investigate that, we give you that sort of statistical root cause that we're able to compute so you can figure out, you know, the next steps to take in investigating.
We'll even generate a SQL query for you. This is one of the fun features where we'll generate a SQL query that you can paste into your favorite SQL client, connect to your data warehouse, and get the bad data out, right? So you can just consume it yourself as you investigate. And so a lot of our customers really appreciate that we're saving them a lot of time, not just in detecting issues, but also in doing the investigation.
[00:43:43] Unknown:
And another element of the kind of communications piece of this is the question of alerting. And one of the problems that any vendor who's working on generating alerts runs into is the problem of alert fatigue and being able to actually send information that is useful and actionable and isn't just going to be ultimately ignored by somebody. And so I'm curious how you approach that problem and some of the ways that you have built in feedback systems for your users to be able to say, I don't actually care about this kind of alert, so don't even bother sending it anymore, or I do care about this alert, but not for this piece of data, and just some of the kind of nuance that goes into that process?
[00:44:25] Unknown:
Yeah. Great question, Tobias. And this is an area where we focused quite a bit, and we've approached it on multiple levels to make sure that we're not creating alert fatigue. One of the most important things is just routing the alerts to the right people, right? It's hard to get alert fatigue over something that you really care about, because you're going to keep wanting to make sure that it's okay. And so some of the routing functionality that we've built into Anomalo, being able to route particular tables directly to their teams, or even particular checks that we're running directly to a particular team that may only care about that specific metric or that specific rule and not the entire table.
You know, that's been a pretty big part of it. Two is we actually have an intelligent layer built into our systems for when it encounters a duplicate alert, right, when it sent an alert on this issue before, and now it's about to send the same alert again. Right? You may have seen, and I'm sure your audience has seen, a lot of alerting systems where once a condition happens, they just keep sending you that alert again and again and again and they build up. We actually have a little bit of intelligence where we try to figure out, if we sent you that alert before and we're about to send it to you again, should we do it? Right? And we base that on, kind of, you know, have you taken action to resolve this issue? Or how fast do you take action to resolve issues?
And so we'll actually decrease the cadence of certain alerts automatically if they continue to come up. And then third, and this is an area where of course we're investing quite a bit, is the feedback level. So today we take in some, you know, implicit feedback of whether you corrected the issue or not. But we're also focused on our roadmap on getting explicit feedback, and on having flows that allow folks to say, you know, I don't care about this, or this is too sensitive, I only want issues of greater magnitude, you know, for this particular element, to generate alerts.
[00:46:32] Unknown:
So that's a big part of it too. And one of the things that our customers really love about Anomalo is how easy it is to create these checks and to edit and configure them and how flexible the system is. And so it's, you know, just a couple of mouse clicks away on any alert to go in and change a WHERE SQL filter to pull out data that the alert shouldn't apply to, or to change the confidence interval to make it less likely to alert, or to reduce the severity of the alert so that it isn't as important or likely to alert you. You know, that's just a few. There's probably a hundred different types of configuration changes that you can make that will reduce the likelihood of being alerted.
And, you know, making that really easy for users to do is another big part of this. Given the fact that you
[00:47:22] Unknown:
are running and you're getting these feedback mechanisms for being able to understand, okay, this error is actually interesting and useful, this one's not. How are you working those larger feedback cycles into your product to be able to say, okay, these data issues aren't really all that interesting in the majority case, so we're not going to turn it on by default. Or, you know, this is a type or class of error in data that we didn't anticipate that's being reported by users, or this is the type of check that's being created, and so now we're going to work on being able to actually validate that from a statistical perspective, and just managing that product cycle.
[00:48:06] Unknown:
You already summarized it very well. You know, when we first started, we began with just the unsupervised machine learning detecting drift. And so there wasn't a lot of additional user feedback. And it was through discussions with users that we realized, you know, what else did they care about? We created the system that allowed them to come in and, you know, create these metrics, create the validation rules. And we've just learned a tremendous amount from our design partners early in our venture, and now through all of our customers that are using the product. We'll observe, you know, one customer setting up a lot of checks in a manual process and recognize that that's entirely something we can automate.
And so, you know, we've gone from, in the beginning, having just one fully automated check to now we have eight, you know, fully automated checks, each one often finding multiple different types of issues. And that's been following, you know, what are the kind of most obvious, consistent things that our customers want to have entirely automated. And I think that's a special part of Anomalo, is that we have the, you know, engineering skill and the machine learning skill to be able to automate some of these complex processes. It's not always easy to do, especially when these are running, you know, on arbitrary sets of data in arbitrary environments. How do you come up with something that's gonna be robust enough that it will work for customers? And we've been able to tackle that again and again.
[00:49:26] Unknown:
In your work of designing and building the Anomalo system and working with your design partners and your early customers, what are some of the most interesting or innovative or unexpected ways that you've seen the product applied?
[00:49:38] Unknown:
Yeah. I'll start with a couple. So one was we really began with a focus on data quality and, you know, identifying what I would call an anomaly. Right? Hence, the name. An anomaly being, you know, a sudden sharp change in the data, something structural happening in the data generation process. But we realized, and our customers realized, that we have a system that's, you know, running these tests on the data every day, and it could be used for other purposes. And so an interesting example came to us from one of our early customers where, you know, they wanted to find IP addresses that were suddenly spiking and hitting their website.
And another customer wanted to find email addresses that were sending large batches of suspicious email. And, really, what these were doing was identifying outliers. And, you know, we thought about creating a general process for identifying outliers, but it's very difficult to do, right? What makes an outlier special? Every dataset has outliers. And so working with these customers, we realized if we gave them some guardrails where they could express, you know, what they care about as an outlier, it could actually be really valuable. And so the structure of that is you can go and identify an entity. And so an entity could be an IP address or an entity could be, you know, an account sending email, and you define a statistic for that entity.
And we then automatically identify if there's ever suspicious behavior that suddenly significantly changes the distribution of that statistic, and you suddenly have, you know, an IP address that's far more of your traffic than you would have expected given the past history or seasonality, or suddenly a new account sending more suspicious email than you would expect. And we then explain that for them. And so that's become, you know, a whole unexpected use case for our product. Another good one I would throw out is actually closer to the machine learning use case. We end up working sometimes with machine learning teams who collect, you know, a tremendous number of features and are publishing these features, and they want to detect drift in them. And it turns out that the same approaches that we're using to look for data quality issues are also really good to apply to unexpected feature drift.
Another good 1 I would throw out is actually closer to the machine learning use case. We end up working sometimes with machine learning teams who collect, you know, a tremendous number of features and are publishing these features, and they want to detect drift in them. And it turns out that the same approaches that we're using to look for data quality issues are also really good at catching unexpected feature drift. Oftentimes, that is a result of some upstream data quality issue. Right? You suddenly have nulls in a feature that you didn't expect, and now your machine learning model goes off the rails. The same algorithms we're using to look for that in the structured data upstream can also be applied to the feature store, insofar as you're replicating those features to your data warehouse, taking the logs of the features used in production and sending those into your data warehouse. And so we've had a number of customers also set up that kind of feature monitoring.
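Along the same lines, here is a deliberately simplified sketch of one kind of feature drift check, assuming production feature logs have been replicated to a warehouse table with hypothetical feature_x and snapshot_date columns. It only looks at a jump in a single feature's null rate against its recent history, which is just one of the many signals a real monitoring system would combine.

```python
# A deliberately simplified sketch of one feature drift check (illustrative
# only). It assumes production feature logs have been replicated to the
# warehouse as a table with hypothetical "feature_x" and "snapshot_date"
# columns, and alerts when the latest null rate jumps well above its
# historical average.
import pandas as pd


def null_rate_drift(
    features: pd.DataFrame,
    col: str = "feature_x",
    date_col: str = "snapshot_date",
    tolerance: float = 0.05,
) -> bool:
    # Null rate of the feature per snapshot date (groupby sorts the dates).
    daily_null_rate = features.groupby(date_col)[col].apply(lambda s: s.isna().mean())
    latest = daily_null_rate.iloc[-1]
    baseline = daily_null_rate.iloc[:-1].mean()
    # Alert if the latest snapshot's null rate exceeds history by the tolerance.
    return bool(latest > baseline + tolerance)
```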
[00:52:11] Unknown:
In your work of building the product and the business, what are some of the most interesting or unexpected or challenging lessons that you've each learned in the process?
[00:52:20] Unknown:
I think, you know, 1 of the most important things, and maybe this 1 is obvious, is you have to listen to your customers and meet them where they are, right? As Jeremy mentioned, we started perhaps naively with this idea that we're going to have this great, amazing machine learning model that's going to solve everything, is magically going to find all the issues and rank them by severity and importance. And it just wasn't enough. It wasn't something that, out of the gate, our customers could trust, because machine learning models are, you know, sometimes difficult to understand, sometimes black boxes where you don't know what's going to come out of them. And so we've had to build a much, much larger product than we expected to really help our customers, you know, in addition to obtaining trust in their data, obtain trust in Anomalo, and be able to direct us based on their knowledge of what was important and fine-tune what we can do. And as Jeremy mentioned, it's no longer 1 machine learning model; now there's many, covering different elements and having different focus areas that we know are important. So, you know, having a great idea but also finding where your customers are and meeting them there was a pretty important process to get to where we are.
[00:53:32] Unknown:
I think, in my experience, 1 lesson was identifying, you know, who the customer even would be and what would characterize the customers. Initially, we weren't sure, you know, what industries would make the most sense for Anomalo. And in practice, that actually hasn't mattered much. You know, pretty much any company that has a data warehouse, you know, is gonna have data quality issues if they're using the data, and the data is going to be important enough that they ought to pay attention to them. And so we've ended up working with a huge range of customers, from financial services and payments to health care, identity management, publishing, e commerce, marketplaces.
Just about, you know, any company, you name it, today, there's an appetite to be data driven, and they're collecting data, and they need these kinds of data quality, you know, monitoring solutions. I think the next key realization was, well, how mature should they be? And, you know, we did spend time working early on with, you know, companies that had selected a data warehouse but actually weren't using it yet. And that was 1 of the key things that we had to identify and understand: they really need to have invested in, you know, building processes on top of the data. It needs to really matter to them. And as soon as that has happened, they start to experience these data quality issues, and it makes more sense for a system like Anomalo to come into play.
[00:54:49] Unknown:
For people who are interested in being able to manage the quality of their data and understand when they do have these issues and these anomalous events in their records, what are some of the cases where Anomalo is the wrong choice, and maybe they're better suited to building out their own internal systems or processes or falling back on rules based systems?
[00:55:08] Unknown:
1 example is we have talked to companies doing, you know, pharma drug development, and, you know, they will collect 10,000 observations that are very unique for, you know, 15 patients in 1 large study. And there could be very important and significant data quality issues there, but it's very difficult to use a system like Anomalo to find them, because, really, the only way to find them is to have, you know, some scientist or, you know, product person or engineer state, you know, this is what I expect of the data, and it should conform with that, because it's too small a sample size. It's not being updated regularly. It's this kind of, you know, maybe very valuable single trove of data that is then static. And so that's not a great use case for Anomalo. We tend to want to work with data that is, in some sense, live and being, you know, somewhat continuously updated. It can be even just weekly, but it should be, you know, continuing to arrive as a part of operating the business.
[00:56:12] Unknown:
Yeah. There are companies out there where their core dataset is human entered data, right, where someone manually typed entries into a CRM system, right? And that's also typically not a dataset we're going to be great at, because, you know, it's not a great match for the kind of statistical learning that we do. So we would think of Anomalo as something that's great for anyone that has essentially code collected data, right, that's being collected continuously, because that's a great fit for the type of dataset where we can find lots of meaningful issues.
[00:56:50] Unknown:
Yeah. Or data from, you know, partners or external providers that's arriving to them. So someone else is code collecting it and then passing it to you. That's right. It's almost even better because, you know, then you have much less control, and that can be swapped out from underneath you in lots of different ways. Yeah. Hard to fix an issue in a third party's data pipeline.
[00:57:09] Unknown:
So you better be able to detect it on your side.
[00:57:13] Unknown:
And as you continue to build and evolve the Anomalo product and business, what are some of the things you have planned for the near to medium term, or any projects that you're particularly excited to dig into?
[00:57:23] Unknown:
Yeah, absolutely, Tobias. I think I look at our product as enabling sort of 3 activities in dealing with data quality. Number 1 is detection, right? Can we find the issues? If we can't find the issues, then of course there's nothing you can do about them. Number 2 is root causing of those issues. And number 3 is resolving those issues. And so 1 of the big focus areas for us this year is moving down that stack. You know, we believe we're great at detection. We are amazing at root causing, though of course we could always get better and improve, and there are other situations that we're working on to provide, you know, a deeper root cause for some of the issues.
And then we're just starting to think about how do we help you orchestrate the resolution of the issue? How do we help you document it, so that you can resolve the issue, and also so that if someone comes back later on and tries to understand what happened to the data on a given day, there's some record of the issue occurring and being resolved.
[00:58:33] Unknown:
Are there any other aspects of the Anomalo product or the overall space of data quality and its detection that we didn't discuss yet that you'd like to cover before we close out the show?
[00:58:44] Unknown:
I think we've covered quite a bit, Tobias.
[00:58:47] Unknown:
Thank you for the detailed questions. I would agree. I think we've gone over a lot, and we definitely are excited to continue to evolve and iterate on what we've created, and that's gonna mean making it more accurate, you know, more insightful, finding those features that customers love and making them easier, and broadcasting them across the customer base. So a lot of exciting innovation to come this year.
[00:59:00] Unknown:
Alright. Well, for anybody who wants to follow along with you and keep up to date with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspectives on what you see as being the biggest gap in the tooling or technology for data management today. I'll tell you something, Tobias, that I'm pretty excited about.
[00:59:29] Unknown:
It's been a gap for a long time and it's now being filled, which is kind of the metric store, the metrics layer that's being added to the stack, that a couple of companies are doing. I battled for many years the issue of metrics being defined differently by different people and never lining up when I'm trying to replicate a particular metric, you know, trying to go to the tribal knowledge of the folks around me to figure out how that metric is actually defined. And so 1 thing that I'm pretty excited about, a gap that is being filled by some folks, is this idea of having a metrics layer that has the official definitions of all your metrics, an easy way to replicate them, and a way to slice and dice them in any scenario that you might need.
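As a rough illustration of that idea, and not a description of any particular vendor's product, a metrics layer can be thought of as a central registry of official metric definitions that get compiled into SQL wherever they are needed. The registry, metric, table, and column names below are all hypothetical.

```python
# A rough, hypothetical sketch of a metrics layer: official metric
# definitions live in one central registry and are compiled into SQL on
# demand, so every consumer replicates the same definition. The registry,
# metric, table, and column names here are made up for illustration.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Metric:
    name: str
    table: str
    expression: str                 # SQL aggregate that defines the metric
    dimensions: tuple = field(default_factory=tuple)


METRICS = {
    "weekly_active_users": Metric(
        name="weekly_active_users",
        table="analytics.events",
        expression="COUNT(DISTINCT user_id)",
        dimensions=("event_week", "country"),
    ),
}


def compile_metric_sql(metric_name: str, group_by: list) -> str:
    """Render one official metric definition as SQL, sliced by the requested dimensions."""
    metric = METRICS[metric_name]
    unknown = [d for d in group_by if d not in metric.dimensions]
    if unknown:
        raise ValueError(f"Unsupported dimensions for {metric.name}: {unknown}")
    dims = ", ".join(group_by)
    return (
        f"SELECT {dims}, {metric.expression} AS {metric.name}\n"
        f"FROM {metric.table}\n"
        f"GROUP BY {dims}"
    )


# Every dashboard or notebook asking for this metric gets the same definition.
print(compile_metric_sql("weekly_active_users", ["event_week", "country"]))
```

The point of the pattern is that every consumer of weekly_active_users gets exactly the same definition, rather than each team re-deriving it from tribal knowledge.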
[01:00:14] Unknown:
There's so much happening today. I think I'm almost as excited to see it just mature and find out what actually sticks and, you know, what becomes really widespread in both the data management ecosystem and in the machine learning tooling ecosystem as well. There's a tremendous amount of experimentation, a lot of exciting ideas. I think, ultimately, you know, we wanna see more companies be data driven more efficiently, have faster cycles, and be able to effectively make decisions and ship products that have some meaningful impact on their business using data.
And so I think there's so many things happening right now that it can be almost confusing to people trying to start a new data stack and decide what they should choose and what they shouldn't. And so I'm almost looking for some clarity and consistency around that and then a new wave of innovation on top of that foundational layer of what has actually worked from this recent revolution.
[01:01:13] Unknown:
Alright. Well, thank you both very much for taking the time today to join me and share the work that you're doing at Anomalo. It's definitely a very interesting problem domain. It's great to see the different people who are taking different stabs at it, and I'm definitely interested in the machine learning and statistical approach that you're exploring. So I appreciate all the time and energy that you're putting into that, and I hope you each enjoy the rest of your day. Yeah. Thank you, Tobias. This is great. Thanks for having us, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show and share your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Anomalo with Elliot and Jeremy
The Importance of Data Quality
Guarantees and Promises of Data Quality
Focus and Integration Points of Anomalo
Technical Implementation and Design
Anomalo in CI/CD Workflows
Communication and Alerting
Customer Use Cases and Feedback
When Anomalo is Not the Right Choice
Future Plans for Anomalo
Exciting Trends in Data Management