Summary
If your business metrics looked weird tomorrow, would you know about it first? Anomaly detection focuses on identifying those outliers for you, so that you are the first to know when a business-critical dashboard isn't right. Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. Andrew Maguire got tired of solving that problem for each of the different roles he has ended up in, so he created the open source Anomstack project. In this episode he shares what it is, how it works, and how you can start using it today to get notified when the critical metrics in your business aren't quite right.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Your host is Tobias Macey and today I'm interviewing Andrew Maguire about his work on the Anomstack project and how you can use it to run your own anomaly detection for your metrics
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Anomstack is and the story behind it?
- What are your goals for this project?
- What other tools/products might teams be evaluating while they consider Anomstack?
- In the context of Anomstack, what constitutes a "metric"?
- What are some examples of useful metrics that a data team might want to monitor?
- You put in a lot of work to make Anomstack as easy as possible to get started with. How did this focus on ease of adoption influence the way that you approached the overall design of the project?
- What are the core capabilities and constraints that you selected to provide the focus and architecture of the project?
- Can you describe how Anomstack is implemented?
- How have the design and goals of the project changed since you first started working on it?
- What are the steps to getting Anomstack running and integrated as part of the operational fabric of a data platform?
- What are the sharp edges that are still present in the system?
- What are the interfaces that are available for teams to customize or enhance the capabilities of Anomstack?
- What are the most interesting, innovative, or unexpected ways that you have seen Anomstack used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Anomstack?
- When is Anomstack the wrong choice?
- What do you have planned for the future of Anomstack?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Anomstack Github repo
- Airflow Anomaly Detection Provider Github repo
- Netdata
- Metric Tree
- Semantic Layer
- Prometheus
- Anodot
- Chaos Genius
- Metaplane
- Anomalo
- PyOD
- Airflow
- DuckDB
- Anomstack Gallery
- Dagster
- InfluxDB
- TimeGPT
- Prophet
- GreyKite
- OpenLineage
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Miro: ![Miro Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/1JZC5l2D.png) Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at [dataengineeringpodcast.com/miro](https://www.dataengineeringpodcast.com/miro).
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up to date. With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products.
Whether it's real-time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free. Your host is Tobias Macey, and today I'm interviewing Andrew Maguire about his work on the Anomstack project and how you can use it to run your own anomaly detection for your metrics. So, Andrew, can you start by introducing yourself?
[00:01:34] Unknown:
Yeah. Hi. Thanks for having me on. I'm a big fan of the show. So I am the analytics and ML lead at Netdata. And Netdata is an open source observability company, primarily with a focus on metrics and some logs type functionality as well. And that's kind of the root source of where I've been coming from. You know, anomaly detection is a big part of the Netdata project itself. And so kind of as part of that, Anomstack is a sort of side project I've been working on lately.
[00:02:04] Unknown:
And do you remember how you first got started working in data?
[00:02:08] Unknown:
Yeah. Well, 1 of the first ways I actually got introduced to data was in insurance. Funny enough, insurance is, like, 1 of the oldest kind of big data industries around. So that was in insurance, where I kinda first fell in love with data. And then in terms of, like, the data management stuff, it's been a typical kind of journey, you know, going from companies where you're the only data guy or the 1st data guy, and then you have to obviously stand up the data infrastructure to get the data you need, to do the data science, to do the ML. And so that's pretty much where a lot of my data management exposure is coming from, kinda out of necessity as you're trying to build stuff and use the data. You know?
[00:02:51] Unknown:
And now bringing us to the Anomstack project, you said that some of its origin comes from the work that you're doing at Netdata. But I'm wondering if you can just give an overview about what it is that you've built, some of the story behind how it came to be, and why you decided that you wanted to make it as accessible and approachable as possible.
[00:03:10] Unknown:
Yeah. So, primarily, it's because I've had to build versions of this in every job I've been in for the last 10 years. And it's always been kind of custom every time, you know, very specific to whatever infrastructure or data stack you're using. But nowadays there's a lot of open source projects and tools that we can build on. And I just felt like the time is right now to actually, you know, save myself from building it the next time for the next 5 years. I should just build a project that I can open source and see if I can get some contributions around. And so the idea there is this is focusing on, you know, smaller teams, smaller data operations.
Give them a simple way to just bring their metrics and get really, you know, decent anomaly detection out of the box, basically.
[00:03:56] Unknown:
And in terms of the term metrics, given your background at Netdata, that makes me think about metrics from an operations and infrastructure standpoint: what is the CPU load, what is the available memory. But the term metrics in the data ecosystem has also become overloaded with this idea of the semantic layer and business metrics. And what does it mean for somebody to be defining a metric in the context of Anomstack and the ways that it can be applied?
[00:04:28] Unknown:
Yeah. So, actually, metric trees is another thing I've seen recently. There's a lot of talk around metric trees and, you know, building these relationships onto metrics. The main goal is simplicity. And so there are lots of different metric concepts in the observability space, but we're not using that here necessarily. So the definition of a metric basically is a row in your data frame, or a row in the metrics table in your database, where it's literally just a metric name, timestamp, and value, and that's it. So that's kind of the idea; this makes it really easy for users. That's all a user has to produce, these 3 fields. You know? And so we're not going too fancy in terms of, like, complex metrics definitions, because that just adds kind of a little bit more of a ramp for people to actually, you know, use the system. So there's pros and cons to each, of course, like in observability. And, you know, you have all these concepts and tools like Prometheus and, you know, the different types of metrics and how you work dimensions in and stuff like that. But for our case, for Anomstack, the idea is just keep it as simple as possible, basically, to begin with.
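(For illustration, here is a rough sketch of what a metric batch in that simple three-column shape might look like as a pandas DataFrame; the metric names and values are made up, and the column names are assumptions rather than Anomstack's exact schema.)

```python
import pandas as pd

# Hypothetical example of the simple "long" metric format described above:
# one row per observation, just a name, a timestamp, and a value.
metrics = pd.DataFrame(
    [
        {"metric_name": "daily_sales", "metric_timestamp": "2023-11-20", "metric_value": 1250.0},
        {"metric_name": "daily_sales", "metric_timestamp": "2023-11-21", "metric_value": 1314.0},
        {"metric_name": "signups", "metric_timestamp": "2023-11-21", "metric_value": 42.0},
    ]
)
metrics["metric_timestamp"] = pd.to_datetime(metrics["metric_timestamp"])
print(metrics)
```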
[00:05:36] Unknown:
And that also makes it very flexible, because if you don't necessarily have a constrained definition of what that metric can be and what it's supposed to mean, then that means that everybody can map it to whatever semantic attribute they want in order to determine what are the anomalies and how does that impact whatever it is that I'm trying to measure. Yeah. And this is kind of actually something that I have got on the road map for the project, is to extend it a little bit so that when you're defining the metric, you also
[00:06:04] Unknown:
define some metadata. Obviously, the first thing being, like, a metric description, say. And the idea there is, if we could do that, even if you just had a useful description, that would help a lot with, like, the saliency of the anomalies. Because an anomaly is an anomaly, but whether it's something you care about or not is a different question. And so if we can get some of this metadata, like, maybe things like priority p 1, p 2, or, you know, whatever different tags you want, you can obviously then route alerts differently. But actually, longer term, I'm thinking that this could be something that large language models could use as well. So if they had this kind of rich metadata that they could make sense of, that could also be useful. You might say, oh, what's my anomalies in the sales today? And, like, the fact that you have all this stuff in the descriptions would make that a lot easier. So previously, all the semantic stuff was good, but there's a lot of overhead to maintain it. You have to, you know, agree on your structure upfront and implement it. Whereas if we just allow some kind of free-texty, more higher level stuff, there's definitely roles where I think language models could help make sense of it as well in terms of, you know, sorting through the metrics. Yeah. And giving you some human level understanding about, is this something that you actually care about? Yeah. That's always the problem, because oftentimes with systems like this, you end up with just thousands of metrics. And the idea is, you know, we want metrics to just be like cattle. You don't have to think about them. They're not special. You know, just produce your metrics, metrics, metrics. And that's great, because then you have all these metrics, but then the problem can be how do you make sense of it when you maybe have, you know, 100 alerts a day, and maybe 50 of those alerts are on metrics that are, you know, nice to know, but they're not that important.
And so it's things like that where, if you could have each of these alerts be like a little insight snippet, you could actually maybe have a language model make sense of it. Or ultimately, longer term, if you had a sort of feedback loop on top of a system like Anomstack where you give thumbs up, thumbs down, to sort of try and start measuring saliency of, like, okay, what do people care about more than average? That then kinda could become a whole different layer on top of it, but that's an open problem. I don't think anyone's really solved that yet, to be honest. Absolutely. And also, even if some single metric is anomalous,
[00:08:10] Unknown:
it maybe doesn't matter unless it's correlated with another anomaly in a different metric, and it's that conjunction of anomalies across different metric series, or maybe even across different service boundaries, that will let you know, oh, hey, there's actually something really wrong here. You need to do something about it. Yeah. And that's something that I've seen some
[00:08:32] Unknown:
people do really well. So Anodot is another tool I've used in the past for anomaly detection. And they do a really good job of this, where they stack all the alerts together, so each alert becomes like a stack of alerts, and then you can really quickly see, based on the map, like, okay, what's making up this batch of alerts, basically. That's something that I would like to add in the future as well; it actually could be really interesting.
[00:08:55] Unknown:
And you mentioned that 1 of your objectives of building this project and releasing it as open source is so that you don't have to build it again in whatever future role you have. I'm wondering if you can just give an overview about what are the core objectives that you have, what are the things that you would like to see come out of this project, and some of the direction that you'd like to see it taken in. Yeah. So the main objective is just to
[00:09:18] Unknown:
have a nice, easy, open source solution for people to get good anomaly detection on, typically, business metrics is what I have in my head here. And, you know, lower overhead. So if you're someone that's, you know, not necessarily an infrastructure engineer, you're just technical enough to maybe bring your own SQL to define the metrics, or you can define custom Python functions to define the metrics as well. The idea is, like, you could be a business analyst who could actually just bring your metrics and then stand this up yourself; you know, it's just a Docker container. So that's the main idea: keep it as easy as possible for, like, smaller teams who either can't afford bigger, expensive SaaS solutions or, you know, don't necessarily have the time or expertise to, like, build their own custom solution. They can just use a tool like this and get decent enough anomaly detection on all their metrics out of the box. That's the main aim.
[00:10:12] Unknown:
And for people who are interested in being able to get these alerts and understand, okay, I've got lots of metrics, I don't wanna have to care about them and keep a close eye on them, I just want something to let me know when there are things going wrong. What are some of the other tools or products that they might be evaluating when they come across Anomstack, and what are the aspects of Anomstack that might sway them in its favor?
[00:10:36] Unknown:
Yeah. So there's a couple of different solutions here, a couple of different approaches. There's, like, vendors who I've actually used in the past. Anodot is probably the biggest and the oldest player here. Like, they have really good anomaly detection across all types of metrics, and lots of other stuff. It was a few years ago when I used them, and they've done a lot since as well. And so these are, like, you know, services that you pay for in an enterprise setting. They are a bit expensive and there's a bit of configuration involved, but once they're up and running, they're good. And then there's also lots of, like, newer SaaS type startups in the kind of modern data stack space and era that we're in. So Chaos Genius is another 1 there that I've actually been looking at recently. That's pretty good and pretty cool. But there's also then the other approach here. A lot of the data warehouses now are starting to build some of these ML features into their stack themselves. So, like, Snowflake, BigQuery, they all actually now typically have their own anomaly detection functions and, you know, ML functions where you can train models and save models just within your SQL. So that's another option as well. If you're using a platform like this, it's a little bit easier now to try and roll your own, because you can do a lot of it in SQL itself. And then there's also vendors like Metaplane, which is actually 1 I've used as well. Metaplane's pretty cool. It's a little bit more focused on the data engineering, data ops side of the metrics, but you can kind of tweak some of these things to also cover business metrics as well.
[00:12:05] Unknown:
And digging more into that concept of the business metrics and being able to generate alerts and detect when there are anomalies, I guess that's another vague term that might be worth digging further into: that idea of anomalies and what makes something actually anomalous. Is it just because it is 2 standard deviations away from the mean? Is it because there's some specific rule that you have that this value can never exceed this threshold? I'm wondering what are some of the specific types of anomalies that you're looking to address and alert on, and some of the ways that people need to be thinking about how to understand
[00:12:41] Unknown:
when something is actually anomalous versus just a little bit weird. Yeah. That's a good point. And I'm a little bit obsessed with anomaly detection, to be honest, because it's 1 of those areas of machine learning and data science where there's still a lot of art and science involved. So, like, there's a lot of subjective decisions as to, like, well, does this look anomalous to you? It does to me. And, you know, that sort of thing. It's not as easy as, like, just doing something like regression or classification where you have a simple metric like accuracy. In anomaly detection, you don't have any metrics like this that you can use as a source of truth, so it's a little bit subjective.
And so that's 1 of the reasons why we use good defaults, basically. So we're using PyOD, which is an open source project around anomaly detection. And, basically, we have defaults there to use as flexible a model by default as possible. So it's using, you know, best practice, standard, sensible things around feature preprocessing. And then it's using, like, a PCA based anomaly detection model, which is more flexible and will cover more types of anomalies. As opposed to, say, if you just have single spikes, you know, they're the obvious ones that people always think of. But sometimes, instead of a single spike, is it a strange little squiggle that's changed recently? Or is it an increase in trend? The idea is to cast the net as wide as possible. And so that's why we're using PyOD with, like, a general, flexible model underneath. But then, of course, if you're a user, you can define your own preprocessing function or you can define your own model as well. So, if you wanted to, you can extend it. Maybe, for instance, say, okay, this metric here is daily sales, and you actually know that there's a big impact from whether it's the weekend or a weekday, say, or even time of day. So you could actually build your own preprocessing function to say, okay, when it's the weekend, I want is_weekend equals 1, and when it's during the week, is_weekend equals 0. And you can then pass that through to the model to use as a feature. So it can depend a lot on exactly how you want to do it. But the idea here with the Anomstack approach is to use as general and sensible a default as possible that will cover all metrics reasonably well. And then if you want to go more complex, you can. But, yeah, it can get quite subjective and complicated in terms of, you know, what is an anomaly or not.
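(As a rough illustration of the kind of default being described, the sketch below fits a PyOD PCA-based detector on a single metric's history, with a hand-rolled preprocessing step that adds an is_weekend feature. The column names, feature choices, and parameters are assumptions for the example, not Anomstack's actual templates.)

```python
import numpy as np
import pandas as pd
from pyod.models.pca import PCA  # PCA-based outlier detector from PyOD

def preprocess(df: pd.DataFrame) -> np.ndarray:
    """Turn a single metric's time series into a simple feature matrix."""
    feats = pd.DataFrame()
    feats["value"] = df["metric_value"]
    feats["lag_1"] = df["metric_value"].shift(1)  # previous observation
    feats["is_weekend"] = (df["metric_timestamp"].dt.dayofweek >= 5).astype(int)
    return feats.dropna().to_numpy()

# Hypothetical hourly history for one metric.
history = pd.DataFrame({
    "metric_timestamp": pd.date_range("2023-11-01", periods=500, freq="H"),
    "metric_value": np.random.normal(100, 5, 500),
})

X = preprocess(history)
model = PCA(contamination=0.01)  # flag roughly the top 1% most unusual points
model.fit(X)

scores = model.decision_function(X)  # higher score = more anomalous
labels = model.predict(X)            # 1 = anomaly, 0 = normal
print(f"{labels.sum()} points flagged out of {len(labels)}")
```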
[00:15:13] Unknown:
Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains, even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first 3 Miro boards are free when you sign up today at dataengineeringpodcast.com/miro.
That's 3 free boards at dataengineeringpodcast.com/miro. Digging now into the idea of metrics definition and identifying what are the metrics that you should care about, what are the metrics that are useful to be alerted on: what are some of the ways that data teams or operations teams should be approaching that question and thinking about how do I decide what are the metrics that are actually going to matter? What are the ones that will give me a useful signal that something needs to be addressed and is going to have some sort of business impact, versus just, hey, it might be neat to know about this thing?
[00:16:31] Unknown:
Yeah. Typically, it's, you know, what metrics are you reporting to your senior management, basically? Start with them. So there's typically business metrics, you know, where you'd start, and they're obviously, like, headline business metrics like users, payments, sign ins, depending on your business. They're usually pretty obvious domain north stars. And then there's also, like, technical metrics as well. So we sometimes use a lot of technical metrics for, like, the underneath-the-hood stuff, the health of the app itself and things like that. But, generally, it should be obvious. And if it's not obvious, then it's probably a question of, okay, well, maybe this isn't a metric I should use. Kind of the way I think about things, though, is that everything has become a time series as we have more and more data, and, you know, metrics are becoming just more and more commonplace. So it's okay to have lots and lots of metrics. It's just that you wanna have, like, a priority 1 level of metrics, priority 2 level of metrics. So you can kind of embrace the messiness of, like, okay, we've got loads of metrics across all these other types of business objectives, secondary objectives; we'll put them in a different bucket than your main kind of executive level metrics.
And those obviously then get a special route when they go off, versus when all the other metrics go off. Because then it's like, okay, well, if a p 1 metric alerts, I want that to go straight into Slack or I want that to email me straight away. But then I also want all the other metrics that are lower priority or lower interest, and maybe every now and then I wanna just open up that inbox and browse through those, kinda read the newspaper as such. And that's very useful as well, then, because if you have a good anomaly detection system, it almost becomes like a BI tool in that sense, in that it's actually uncovering insights. And then it becomes more just about the UI and UX. Like, can you quickly scan 50 alerts and see, oh, there's 1 thing there that actually might be interesting?
That's, like, gold dust if you can get that in terms of an insight. Because otherwise, you would have had to preconfigure a dashboard, and maybe it's in some dashboard in the 2nd tab, 3rd chart down, in the 2nd quarter of the page. And, you know, you have to get so lucky that your eyeball happens to land on that chart. That's just not really a scalable approach to analytics, especially in this day and age when there's just so much more data. So that's the other flip side of it as well. It's more about sort of how you route the insights that you get from these tools.
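(A minimal sketch of the routing idea described here, assuming hypothetical priority tags and notifier functions; this is illustrative, not Anomstack's alerting code.)

```python
from dataclasses import dataclass

@dataclass
class Alert:
    metric_name: str
    priority: str  # e.g. "p1" for executive-level metrics, "p2" for nice-to-know
    message: str

def send_to_slack(alert: Alert) -> None:
    # Stand-in for a real Slack webhook or email call.
    print(f"[slack] {alert.metric_name}: {alert.message}")

def send_digest_email(alerts: list[Alert]) -> None:
    # Stand-in for a daily "newspaper" digest of lower-priority anomalies.
    print(f"[digest] {len(alerts)} lower-priority anomalies to browse later")

def route(alerts: list[Alert]) -> None:
    digest = []
    for alert in alerts:
        if alert.priority == "p1":
            send_to_slack(alert)   # headline metric: notify immediately
        else:
            digest.append(alert)   # everything else goes into the digest
    if digest:
        send_digest_email(digest)

route([
    Alert("payments", "p1", "payments dropped sharply in the last hour"),
    Alert("blog_pageviews", "p2", "pageviews slightly above normal"),
])
```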
[00:18:58] Unknown:
And before we dig too much further into the implementation of Anomstack, another thing that I noticed as I was reviewing the project is that you put in a lot of effort to make it as easy to get up and running, get started with, and evaluate as possible, including out of the box pipelines for Dagster and having GitHub Codespaces available. I forget what the other options were, but it was just very much a case of, I really want you to use this thing. And I'm wondering what was the impetus for putting in all of that effort, and what are some of the ways that that focus on making it easy to adopt and easy to test out influenced the overall design of the project and the ways that you were thinking about how to architect it so that it was easy to adopt and implement?
[00:19:45] Unknown:
Yeah. So the main kind of consideration there is to try and keep it as easy as possible, in the sense that it's not overengineered at all. Basically, under the hood, when you look into it, everything is like a Pandas data frame that's moving around. So I kinda wanted to build it for a version of myself maybe 10 years ago. Back then, I had to, like, stand up my own Airflow VM and, you know, come up with all the data engineering part of it. If I can actually just, you know, Docker Compose up and then just focus on the SQL and the metrics, then I'd be really happy. And that's kind of what the aim is here, is that you can easily run it through Docker or even, you know, serverless. Dagster Cloud is really cool as well, the way they have an integration on GitHub, and it'll just automatically deploy to Dagster Cloud. So you don't even have any sort of operations. Then you can just focus on a PR to add new metrics, or as your metrics evolve; it's all kind of a GitOps type approach. And the idea there is, ideally, and it's still quite early on in the project, I've only been working on it kind of a month or 2, the plan is to have users that actually use it, but that could also then become contributors as well. You know? And so lower the barrier to contribution as much as possible as well. So that's why all the concepts are very straightforward and very simple.
And that's the idea, to actually have users that can use it. And then, so, like, if they wanna make an improvement, for sure, yeah, get involved, make a PR. It'd be great. You know? So that's the idea, to actually have users and contributors.
[00:21:11] Unknown:
In terms of the implementation, and as you were defining the scope of the project and thinking through, okay, I want to have this open source anomaly detection stack so that I don't have to rebuild it over and over again: what are the core capabilities and constraints that you were focused on that informed the final implementation of what you have built so far?
[00:21:35] Unknown:
Yeah. So I actually originally started with an anomaly detection provider in Airflow. So we use Airflow, and I built an anomaly detection Airflow provider package that's also in the Airflow registry with the Astronomer folks. And that works. So if you're using Airflow, that's 1 approach. But I was thinking as I was doing it, well, this kind of depends on Airflow, and it's a bit silly for people to have to then stand up Airflow to do anomaly detection. So I wanted something that's more standalone. And I was also aware, at the time, that with a lot of these data orchestration tools, there's so many options, and they're all great now. So the approach there was, okay, I wanna have a flexible enough, general, simple orchestration tool, and then also use, you know, PyOD to do all the ML stuff. So it's basically putting all the ingredients together into this little app approach that's fairly easy to stand up, fairly easy to reason about.
And that's the main aim, to actually have as few moving parts as possible and just get what we need for decent enough, you know, anomaly detection alerts into your inbox. That's the north star.
[00:22:51] Unknown:
And now as far as the actual implementation, the architecture, I'm wondering if you can describe how you implemented Anomstack and some of the ways that you optimized for these particular design constraints that you mentioned.
[00:23:06] Unknown:
Yeah. So I had a look at a few different orchestration platforms, basically. And it was a good excuse: I'd been aware of Dagster, but I hadn't really used it that much. I'd mostly been used to Airflow and, you know, other things like serverless options in GCP and AWS. And so I had a look at Dagster, and actually Dagster seemed almost perfect. Dagster have an approach called software defined assets, which is a really interesting approach. But actually, a step underneath that is basically just jobs. And a job is the core kind of building block here. So when a user defines their metrics, a metric batch basically, Anomstack will just trigger 6 jobs, or 4 main jobs. Like, there's a job to ingest, a job to train, a job to score, and a job to alert. And so the main concept here is you bring your configuration, and then the tool itself will do the orchestration and then also use, you know, PyOD for the ML stuff as well. So it's mainly like putting together these recipes of different ingredients that are already out there in the ecosystem.
And that's kind of what the culmination is.
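(For a rough sense of what a handful of jobs per metric batch looks like in Dagster terms, here is a heavily simplified sketch using Dagster's op/job API. The op bodies, names, and threshold are placeholders, not Anomstack's actual implementation.)

```python
import pandas as pd
from dagster import job, op

@op
def ingest() -> pd.DataFrame:
    """Run the user's ingest SQL / Python function and return name, timestamp, value rows."""
    return pd.DataFrame({
        "metric_name": ["daily_sales"],
        "metric_timestamp": [pd.Timestamp.utcnow()],
        "metric_value": [1250.0],
    })

@op
def train(metrics: pd.DataFrame) -> pd.DataFrame:
    """Fit an anomaly detection model (e.g. a PyOD detector) on recent history."""
    return metrics  # placeholder: a real job would persist a trained model

@op
def score(metrics: pd.DataFrame) -> pd.DataFrame:
    """Score the latest observations and append anomaly scores."""
    scored = metrics.copy()
    scored["metric_score"] = 0.0  # placeholder score
    return scored

@op
def alert(scored: pd.DataFrame) -> None:
    """Fire a notification for any metric whose score crosses a threshold."""
    for _, row in scored[scored["metric_score"] > 0.8].iterrows():
        print(f"anomaly: {row['metric_name']}")

@job
def metric_batch_job():
    alert(score(train(ingest())))
```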
[00:24:13] Unknown:
From the time that you first started building this project to where you are now, I'm wondering what are some of the ways that the overall goals and implementation have evolved and maybe some of the dead ends that you explored and ultimately discarded?
[00:24:27] Unknown:
Yeah. Actually, 1 of the dead ends, it was kinda jokey, but we've implemented an LLM alert job itself. So instead of the PyOD ML models for the anomaly detection, we actually have an LLM alert job that you can enable, which basically just sends the data to GPT and asks it, does it look anomalous? And I was kinda more so curious because it was a good example of, like, where the limits are in terms of language models, because I wanted to see how actually useful it can be, you know, in getting sense back from the language model. And time series data is still a bit at the edge of what LLMs are really able to do well. So it was kind of fun playing around with that. There's been a lot of iterations: I started with as minimal an approach as possible, send the data to the LLM, see what it gets back, and it wasn't even kind of understanding the time series. Like, it couldn't even get the order of the data itself.
And so there's been a few iterations of that, playing around with prompt engineering to actually give it all the hints it needs to do it. And it's actually kinda surprisingly working. It works technically, and it makes sense. But when you look at it then and take a higher level picture as a human, it's actually not that useful at all, because the anomalies that the LLM comes up with are, technically, often, they are anomalies, but they're not anomalies that you would care about as a human if you're eyeballing the data. And so it's tricky. It was fun to do all that, and I kinda mainly did that just as a sort of a joke almost. But that was something that I think is kinda interesting to see, but it's definitely not on by default; I've just turned them all off today. It's an optional kinda job that you can turn on. It's, yeah, a little bit of a dead end. I don't think it's as useful as you might think it is.
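(Purely to illustrate the idea, not the project's actual prompt or code, an "ask the LLM if this looks anomalous" check might look roughly like the sketch below, using the OpenAI Python client; the model choice and prompt wording are assumptions.)

```python
import json
from openai import OpenAI  # assumes the openai package and an OPENAI_API_KEY are set up

client = OpenAI()

def llm_looks_anomalous(metric_name: str, values: list[float]) -> str:
    """Send a recent window of values to an LLM and ask whether the latest point looks anomalous."""
    prompt = (
        f"Here are the most recent values for the metric '{metric_name}', oldest first:\n"
        f"{json.dumps(values)}\n"
        "Does the most recent value look anomalous relative to the rest? "
        "Answer 'yes' or 'no' and briefly explain."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example usage with made-up values where the last point spikes.
print(llm_looks_anomalous("hn_top_score", [310, 295, 320, 305, 900]))
```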
[00:26:09] Unknown:
For people who are interested in testing it out, getting it deployed, as we already discussed, there is a very easy on-ramp. But for people who want to then go from, okay, I've tested it out, it seems interesting, now I wanna run it in production: what does that journey look like, and what are some of the considerations
[00:26:26] Unknown:
and potential sharp edges that people need to be thinking about as they go from proof of concept to this is business critical now? So there's a couple of ways to use it. The repository itself is a GitHub template, so you can, of course, clone the repository, but you can also use the GitHub template to make a copy of it. And then once you have that GitHub template repository, you can use that for your metrics and deploy it however you want, through Dagster Cloud or just using your own kind of CI/CD and Docker Compose. So some of the sharp edges are probably like this: it's still a very immature project. It's still very, very young. I just finished, like, the first set of proper tests today. And this is something that comes with these open source projects as well, especially when they're young like this, you take it with a pinch of salt. You're better off dogfooding it gently on, you know, stuff that's not production, and then once you're comfortable with that, you go from there. So, like, that's what I'm doing at the moment. I'm kinda dogfooding it as we go. And so there's still a small bit of infrastructure in terms of, okay, how are you gonna run these Docker containers, how are you gonna monitor them, how are you gonna have availability, things like that. These are typical enough considerations with tools like this. So it's not completely hands free. It's not completely painless.
Not yet, but the aim is to be as painless as possible, basically. And so there's definitely some typical sharp edges there in terms of, like, we don't necessarily have a standard deployment or standard installation yet, but we've given as many options as possible. So you can use Docker, or you can use a local Python environment yourself, or you can then use the serverless options as well. And so we're kind of waiting to see which approaches people are most comfortable with as
[00:28:20] Unknown:
well. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake, and Hudi, so you always maintain ownership of your data.
Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. There are multiple different flavors of open source projects, where sometimes people just want to produce something out in the open, but they don't really care about getting contributions. There's the corporate open source, where we're gonna release this because it furthers our business, and if you happen to get use out of it, that's great. And then there are the open source projects that are intended to be maintained and grown by a community. And I'm wondering what your thoughts are on how you're approaching this particular project. Are you looking for contributions?
Are you just looking for feedback? I'm wondering what types of engagement and community you're looking to build around it, and ways that folks can contribute and help you out with this. Yeah. No. I'm always looking for contributions. I would love some contributions.
[00:29:59] Unknown:
Kinda I don't necessarily have, like, a software engineering background myself. So that's always been sort of a fear I've had, around the imposter syndrome and stuff like that. So I would love it if somebody came with a contribution that completely showed me, oh, you know, your tests are all wrong, or you could do something better, or, like, here's better abstractions we can use. There's definitely room for improvement across the board. And so, yeah, I'd love contributions. And that's been the aim of keeping it as simple as possible, where all the main concepts are: you have, like, a metric batch, which is just a definition of your metrics. And then you have jobs, which are, like, you know, ingest, train, score, alert. And then under the hood, when you're looking at the code, it really relies heavily on Pandas data frames. Every job basically, you know, produces a Pandas data frame, or it takes in a Pandas data frame and produces a Pandas data frame. So it's quite easy to reason about. And so the idea is that, if you're someone that's comfortable enough as a Python developer, it's a perfect project to do, you know, open source contributions on as well, which should be really fun. And for people who are looking to get engaged with the project, and maybe they don't necessarily want to modify
[00:31:10] Unknown:
the core of what you're building, but they are interested in extending or augmenting its capabilities: what are some of the interfaces that you've built in to make it open for extension and customization, and adapting to a particular customer or operating environment?
[00:31:26] Unknown:
Yeah. So that was a good example of where I haven't tried to be too complicated from the start. So, obviously, we support, you know, BigQuery, Snowflake, DuckDB, a couple of other databases. And originally, I was thinking, like, okay, do I need to build some fancy plugin architecture or plugin system where somebody could bring their own plugin? And I decided not to do that, because probably it's at the edge of my capability, but also it makes it harder to contribute to as well. So the approach at the moment, for example, I'm working on Redshift at the moment and adding Azure Blob Storage: you know, just make a fork, make a PR, and everything's kind of easily testable.
And so that's where we haven't gone. It's not as complicated yet in terms of, say, taking something like the Airflow approach, where you have plugins that you can install separately, dependencies and stuff like that. We haven't taken that approach yet, mainly for that goal to have as low a barrier as possible to contribution. But definitely at some stage, if the project does become more mature and stuff like that, then, yeah, that would be something that I would imagine would be refactored.
[00:32:33] Unknown:
Digging more into the, I'm using this, I'm running it, I want to feed in these different metrics: you mentioned that it has support for pulling from databases, running Python scripts. I'm wondering if you can talk a little bit more about the process of producing the metrics that Anomstack is going to work from, and the overall flow of data in, evaluation, alert out, or, you know, ignore because there's nothing to alert on?
[00:33:01] Unknown:
Yeah. So the main approach there, the inputs, is there's a metrics folder, basically, in the root of the project. And in the metrics folder, you can have a folder for, you know, each subject area or each metric batch. You can organize the metrics however you want, as long as they're in the metrics folder. And then all a metric batch is, is some ingest SQL. So there's a template where you just define an ingest SQL file, which is basically just whatever SQL you wanna use to generate your metrics. And so, basically, this is SQL that generates a table which just has a metric name, a metric value, and a metric timestamp. That's all that's required. And once you have that, that's the basis for the ingestion. Then there's a YAML configuration file. And the YAML configuration file has all the other things like schedules and, you know, parameters for the models. And, again, you don't have to fill any of them in. You can kinda just leave that file pretty much empty and it'll use the defaults. There's also, like, a defaults YAML where you can edit your defaults as well. So the idea is you just bring your ingest logic, basically. And you can use an ingest SQL file, or, if you want, you can also make a custom Python function. So if you're doing something like, say, scraping metrics from a website or from some public source, it could be anywhere; if it's a Python function, you can just bring your own Python function. As long as that Python function generates a Pandas data frame that has those same 3 columns, metric name, metric value, metric timestamp, that works as well. And we have loads of examples in the repository that do that. Like, there's examples that pull metrics from Hacker News and weather metrics and Yahoo Finance and all that sort of stuff. And once you have that, then you can customize anything. Under the hood, there are default templates. So, like, there's a default template for the preprocessing function that the ML uses. You don't ever have to worry about that, but if you want to, you can bring your own for each individual metric batch.
Likewise, for the alert logic, you know, you can also define your own alert SQL template if you want, or you can edit the default 1 that's there. So the idea is, once you bring your ingest logic and your configuration, that will trigger off everything: the ingest jobs, the training jobs, the score jobs. And then all that's happening behind the scenes is it's gonna run that ingest script and save the results onto a metrics table. And as it's doing the scoring, it'll also save the scores onto the metrics table. So this all then just becomes kind of orchestration that's reading from and writing to this metrics table in your warehouse, basically, which could be Snowflake, BigQuery, whatever.
And this is, like, a long format metrics table where each row is basically a new metric. So it's kind of easy to think about as well, because as you add new metric batches, you're just appending on to the end of that table. Or, of course, if you want, you could have different metric batches going to different metrics tables. That's all flexible. But it's easiest to start with 1 single metrics table that Anomstack is reading from and writing to. And that kind of becomes the heart of what's going on here, basically. And you can plug that into your own tools as well. So if you have your own BI tools or your own alerting tools or, you know, anything like that, it's just another table in your data warehouse, so you can kinda use it like anything else, basically.
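(As a sketch of the "bring your own Python function" path described here, an ingest function just needs to return a DataFrame with those three columns. The function name, endpoint, and metric are illustrative assumptions, not Anomstack's actual interface.)

```python
import pandas as pd
import requests

def ingest() -> pd.DataFrame:
    """Scrape a public source and emit rows of (metric_name, metric_timestamp, metric_value)."""
    now = pd.Timestamp.utcnow()
    # Hypothetical example source: the Hacker News API's current top story IDs.
    top = requests.get(
        "https://hacker-news.firebaseio.com/v0/topstories.json", timeout=10
    ).json()
    return pd.DataFrame([
        {
            "metric_name": "hn_top_stories_count",
            "metric_timestamp": now,
            "metric_value": float(len(top)),
        },
    ])
```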
[00:36:19] Unknown:
And recognizing that it's still a very early project, that you are still working on gaining visibility and getting feedback, I'm wondering what are some of the most interesting or innovative or unexpected ways that you've seen Anomstack used
[00:36:32] Unknown:
so far? Well, a couple of weeks back, actually, I was happy enough. So 1 of the out of the box examples we use is, like, a Hacker News one. It scrapes the scores from the Hacker News top stories, you know, regularly. And as soon as all the Sam Altman drama with OpenAI kicked off, I was kinda crossing my fingers thinking, oh my god, this has to now get picked up. If this isn't picked up by the example job, I'll kinda have egg on my face. And funny enough, as soon as all that kicked off, Hacker News exploded and, you know, I had an anomaly straight away for the Hacker News jobs. And I've put them into the gallery. There's a little gallery folder in the repository as well that has examples of, like, real anomalies that I've seen as I've been using it on real data. And there's a "Sam Altman fired, HN explodes" dot PNG in there as well that I was happy to have. But, yeah, it's interesting as well, just recently we're also looking at stock prices and stuff, just trying to get as wide a range of examples as possible to get, like, realistic data. And just the other day, I noticed all of the tech stocks were down a couple of points, based on the Yahoo Finance job. And I actually Googled it, and I was like, yeah, actually, they are all down. I thought there was a problem with the job, I thought something was going wrong somewhere, but actually, you know, it was valid.
[00:37:47] Unknown:
That's an interesting use case as well, where maybe it's not business metrics that you care about. Maybe it's just personal curiosities, and you can build your own sort of Google Trends style alerting of, hey, I wanna know if something changes in this particular ecosystem. As long as there's some sort of API you can hit, then you can build your own personal anomaly dashboard about what are the anomalous things happening in the world today. Yeah. No. And Google Trends is actually another example. We have a Google Trends example as well. So I'm kinda constantly building out this examples folder
[00:38:17] Unknown:
within the metrics folder, so that you can turn them off as well, you know, but they're just useful to kind of be realistic
[00:38:24] Unknown:
types of examples that people can look at as well. Yeah. It's definitely a very cool project in that way where, as you mentioned, there are anomaly detection tools, but a lot of times they're very coupled to the product that they're trying to generate the alerts from. So Datadog has some anomaly detection. I know that the Grafana Cloud product has some ML capabilities for alerting on anomalies. But, again, all of those are very tightly coupled to the ecosystem that they're built for, whereas this is a little bit more open ended of, as long as you can get data somewhere, we can let you know if something is weird. Yeah. And that was almost as well 1 of the kind of design principles here, was to have no UI and have, like, it's all basically
[00:39:04] Unknown:
config based and GitOps based, so that, you know, it's what we're used to working in as, like, data engineers, and, you know, it's lower overhead. We don't have some crazy management UI and admin console where you have to go and click around and configure stuff. It's all kinda your metrics as code, basically, and everything as code. And that kinda helps make it easier: if you wanna add new metrics, it's just a PR and then no problem. You know? Absolutely.
[00:39:27] Unknown:
And in your experience of building this project, publishing it to the community, and looking for feedback, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:39:38] Unknown:
It's been fun, actually, to learn. I had to learn quite a lot about Dagster, because Dagster is really at the heart of it, doing all the orchestration. So I had to go quite deep in getting familiar with even some edge cases around how Dagster works, and all the different configurations, to be able to support running locally in your own Docker versus Dagster Cloud versus a plain Python environment. There are a few different considerations there. So that's been fun, and it's been interesting to start fresh with a new technology. It's always fun, especially with all these modern data stack technologies. There are so many of them that it's almost scary, almost too much sometimes, and you kind of just put the blinkers on. So it's been good to have an excuse to actually just pick one, use it, and go with it. It's been useful.
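To make the Dagster point a bit more concrete, here is a minimal sketch of the kind of ingest-then-score wiring being described, written against Dagster's public op, job, and schedule API. It is not Anomstack's actual code; the op bodies are stubbed and the names are made up.

```python
# A rough sketch of an ingest -> score metric job orchestrated by Dagster.
# The same Definitions object can be loaded locally, in a Docker container,
# or deployed to Dagster Cloud, which is the flexibility discussed above.
from dagster import Definitions, ScheduleDefinition, job, op


@op
def ingest_metric() -> float:
    """Pull the latest value for a metric (stubbed; a real op would query a DB or API)."""
    return 42.0


@op
def score_metric(value: float) -> None:
    """Score the value against a trained model and alert if it looks anomalous (stubbed)."""
    print(f"scoring metric value: {value}")


@job
def metric_batch_job():
    score_metric(ingest_metric())


defs = Definitions(
    jobs=[metric_batch_job],
    schedules=[ScheduleDefinition(job=metric_batch_job, cron_schedule="0 * * * *")],
)
```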
And, yeah, the other part of it is just my own capabilities. I should say, as a preface, projects like this are now actually easy to do because we have all these tools that we can use, and once you know enough to put the ingredients together, you can move fast. I've also been using Copilot and ChatGPT to help a lot with the code. It's crazy how much more productive you can be these days, especially with an open source project like this where you can develop fully in the open. You don't have to be worried about anything confidential or anything like that, so you're completely free to actually use these tools. I'd say probably 30% of the code, in parts, has been at least inspired by Copilot and ChatGPT. That's been really interesting, because it's like when you used to ask for help on Stack Overflow: you had to spend a lot of time putting together reproducible examples, asking the question in the right way, showing your work, and things like that. The same thing applies for the language models, and once you do that, they can actually be ridiculously useful. So it actually hasn't been half as much work as I thought it would be, because all the tools that we're using are quite easy to work with, and this assistive, Copilot type approach means that if I have an idea, I can spec the idea out and actually get it done in probably half the time it would have taken originally. That just means you've got more time to focus on a project like this, and you can just get so much more done. You know? And for people who are interested in Anomstack and want to start to incorporate some measure of anomaly detection, what are the cases where it's not the right fit? The main cases there would be if it's low latency, per second data. That's some of the stuff that we've done with Netdata; it's all infrastructure per second metrics, thousands of metrics a second. That's a completely different domain with just different design challenges, and so Anomstack wouldn't be right for anything like that. It's more typically hourly metrics. I mean, I do have 10 minute metrics and things like that, but anything too near real time wouldn't make sense. In a situation like that, you're more in a data observability situation where things like Prometheus and InfluxDB could be more useful. The other case would be, I guess, if you have scale. If you've got thousands and thousands of metrics, I'm not sure how well, say, a Dagster server running in a container would scale to hundreds and hundreds of metric batches. I reckon that would be a nice problem to have if we ever get that far, but if you do have that problem, I would say that's probably another situation where it's not right for you.
And then also, if you're not comfortable enough with running Docker yourself, basically, then it's a good excuse to learn. It's a good chance to get your hands dirty, and it's not as painful as things used to be. But there is a little bit of consideration there in terms of, are you comfortable enough running this yourself? Or, obviously, you can use Dagster Cloud; if you have a Dagster Cloud account, that works as well. But, yeah, in situations like that, I'd say it's probably not quite the right option. Also, if you're already using Airflow, you should probably look at the Airflow anomaly detection provider, which is a different project that I maintain. It would be really cool to get some momentum there as well, because at the moment I've only really set it up for BigQuery. But all the different types of operators and all of this stuff already exist in Airflow, so it's not that hard to actually use them. It's just, if somebody's motivated to come and use it, they might be better off using the Airflow they already have. You know? And as you continue to build and iterate on the Anomstack project, and as you work to onboard more contributors, what are some of the things you have planned for the near to medium term, or any particular projects you're excited to dig into?
Yeah. So there are a couple of open issues in the repository with ideas, and I'm just throwing issues in all the time. One thing I want to do, I have a feature request open for TimeGPT. We're still kind of shaking out the set of approaches, and TimeGPT is a new sort of foundation model for time series. It's still in a closed beta, so I'm hoping to get access to it and see if we can use it, because that might actually be more useful. And then there are a few things around wanting to let the user run multiple models. At the moment, for each metric you define one model, and the default model is this PCA based model. But really, maybe you want to define three or four different models and just let them run for a week or two, and then you can see, okay, as the metric comes in, how do the anomaly scores behave and which ones work best for this metric. So there's definitely a whole load of stuff where we could make the ML part of this easier as well, I think. If you could run multiple models and then pick between them over time, that would be good. Or if we could do some sort of way to benchmark and simulate your metrics on different models, that could help with the ML parts. I think that could be really useful, because that's always a challenge: it's very hard, there is no one size fits all model, and it can sometimes take a bit of iteration. So if we could take the pain out of some of that, that could be really useful and fun to work on as well. And given the time series nature of the data, it might also be interesting to bring in some sort of time series predictive capability, whether that's using the Prophet library or, I think there's another one, Greykite. There are a number of them out there now, to say: this is the current trend line, and if it continues, then this will maybe trigger an anomaly, so here's some kind of preemptive alerting of something to keep an eye out for. Yeah, and there are also lots of other concepts in ML that we could bring into this. Forecasting is an obvious one, but then there's also change detection, where sometimes what you're interested in is a sudden change even if it's not an anomaly. Maybe sudden changes happen all the time, so they're not going to be flagged as anomalous, because the ML is going to look at those shifts and say, well, steps happen every now and then. But if you have a real focused area where you're interested in, okay, what happened last night, something went wrong, then what you really want a lot of the time is change detection: show me the metrics that had a sudden change. And that's a different use case; it's like a subset of anomaly detection, but not quite, a little bit different. So there are all these other little time series ML use cases that we could for sure build in over time, and that would be interesting.
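Since the default model is described as PCA based, here is a minimal sketch of what PCA based anomaly scoring of a single metric's history can look like, using reconstruction error over lagged windows. It assumes NumPy and scikit-learn, and it is only meant to illustrate the idea, not to mirror Anomstack's actual implementation or feature engineering.

```python
# Minimal sketch: score a single metric's history with PCA reconstruction
# error over lagged windows. Illustrative only; not Anomstack's actual model.
import numpy as np
from sklearn.decomposition import PCA


def pca_anomaly_scores(values: np.ndarray, window: int = 24, n_components: int = 2) -> np.ndarray:
    """Return one anomaly score per window: the mean squared reconstruction error."""
    # Turn the 1-D series into overlapping windows (simple lag features).
    X = np.lib.stride_tricks.sliding_window_view(values, window)
    pca = PCA(n_components=n_components).fit(X)
    reconstructed = pca.inverse_transform(pca.transform(X))
    return np.mean((X - reconstructed) ** 2, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    series = np.sin(np.linspace(0, 30, 500)) + rng.normal(0, 0.1, 500)
    series[400] += 5.0  # inject an obvious spike
    scores = pca_anomaly_scores(series)
    print("most anomalous window starts at index", int(np.argmax(scores)))
```

Running several scorers like this side by side and comparing how their scores behave over a week or two is essentially the multi-model evaluation idea described above.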
[00:46:34] Unknown:
Are there any other aspects of the Anomstack project or this overall space of business metrics and anomaly detection that we didn't discuss yet that you'd like to cover before we close out the show?
[00:46:44] Unknown:
No, no. I just definitely think it's an interesting time. With a lot of the modern data stack there's a lot of stuff going on; it's crazy, it's overwhelming. But I do think the technology is catching up in terms of the metadata and making sense of what's going on in your data. That's the hard part. We have all the plumbing, we have all the flows, we have all the details. It's just, how do you actually make sense of which things matter the most? That's still sort of an open problem, and I think a lot of this kind of AI, and that's the first time I think I've said AI, I hate to say it, I cringe every time I say it, but this is one case where it actually will, I think, really be useful over the next couple of years in making sense of all the crazy business metrics and data that companies have.
[00:47:31] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing or contribute to the project, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:47:48] Unknown:
I think possibly the biggest gap is just the complexity of the space. I'm not sure where I sit on this as well. There are point solutions that focus on one thing and do one thing well, and then there are all these platform options, and I think that's the biggest complication now: just navigating the space in terms of how you compose things together. There's work on standards, like OpenLineage and all these kinds of standards that are trying to become the glue for all these different solutions. But I think that's the biggest challenge, actually: how do you actually just put things together, or do you just go with a big cloud provider and use whatever they have?
That's probably the biggest
[00:48:33] Unknown:
gap I see. Absolutely. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on the Anomstack project, and for building it in the first place. It's definitely a very cool project, and I'm definitely excited to try it out for my own data platform and explore the possibilities that it opens up. So I appreciate all the time and energy you've put into that and for taking the time today, and I hope you enjoy the rest of your day. Thanks. Thanks a lot for having me on. I'm a big fan of the show. And anyone else who's interested, just come check out the repo and make some issues, make some discussions. I will be delighted to have people come along and say hi.
[00:50:37] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to AnomStack with Andrew Maguire
Andrew's Journey into Data and Anomaly Detection
Simplifying Metrics and Anomaly Detection
Core Objectives and Future Directions for AnomStack
Understanding Anomalies in Business Metrics
Defining and Prioritizing Business Metrics
Ease of Use and Adoption of AnomStack
Implementation and Architecture of AnomStack
Deployment and Production Considerations
Extending and Customizing AnomStack
Real-World Use Cases and Examples
Challenges and Lessons Learned
Future Plans and Enhancements
Closing Thoughts and Community Engagement