Summary
Anomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems monitoring. Scaling the volume of events that can be processed in real-time can be challenging, so Paul Brebner from Instaclustr set out to see how far he could push Kafka and Cassandra for this use case. In this interview he explains the system design that he tested, his findings for how these tools were able to work together, and how they behaved at different orders of scale. It was an interesting conversation about how he stress tested the Instaclustr managed service for benchmarking an application that has real-world utility.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Paul Brebner about his experience designing and building a scalable, real-time anomaly detection system using Kafka and Cassandra
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing the problem that you were trying to solve and the requirements that you were aiming for?
- What are some example cases where anomaly detection is useful or necessary?
- Once you had established the requirements in terms of functionality and data volume, what was your approach for determining the target architecture?
- What was your selection criteria for the various components of your system design?
- What tools and technologies did you consider in your initial assessment and which did you ultimately converge on?
- If you were to start over today would you do any of it differently?
- Can you talk through the algorithm that you used for detecting anomalous activity?
- What is the size/duration of the window within which you can effectively characterize trends and how do you collapse it down to a tractable search space?
- What were you using as a data source, and if it was synthetic how did you handle introducing anomalies in a realistic fashion?
- What were the main scalability bottlenecks that you encountered as you began ramping up the volume of data and the number of instances?
- How did those bottlenecks differ as you moved through different levels of scale?
- What were your assumptions going into this project and how accurate were they as you began testing and scaling the system that you built?
- What were some of the most interesting or unexpected lessons that you learned in the process of building this anomaly detection system?
- How have those lessons fed back to your work at Instaclustr?
Contact Info
- @paulbrebner_ on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Instaclustr
- Kafka
- Cassandra
- Canberra, Australia
- Spark
- Anomaly Detection
- Kubernetes
- Prometheus
- OpenTracing
- Jaeger
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including a new one in Toronto and one opening in Mumbai at the end of this year. And for your machine learning workloads and other CPU intensive tasks, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Integrating data across the enterprise has been around for decades, and so have the techniques to do it. But a new way of integrating data and improving streams has evolved. By integrating each silo independently, data is able to be integrated without any direct relation. At CluedIn, they call it eventual connectivity. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method and the technologies that make it possible,
then schedule a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin, and don't forget to thank them for supporting the show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been published, and the super early bird registration for up to $300 off is available until July 26th, with early bird pricing for $200 off available through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register.
Go to dataengineeringpodcast.com/conferences to learn more about this and other events and take advantage of our partner discounts when you register. And don't forget to go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing Paul Brebner about his experience designing and building a scalable, real-time anomaly detection system using Kafka and Cassandra. Paul, can you start by introducing yourself?
[00:03:01] Unknown:
Hi. I'm Paul Brebner. I'm the technology evangelist with Instaclustr. It's an open source company based in Canberra, and we've got offices around the world and in the US.
[00:03:12] Unknown:
And do you remember how you first got involved in the area of data management?
[00:03:16] Unknown:
So, yeah, I'm a computer scientist. I've spent a lot of time in my career doing research and development, mainly in universities, government labs, and a few startups, including this one. My sort of area of specialty is distributed systems, grid computing when that was fashionable, and more recently things like performance engineering. So I had some involvement with the SPEC organization doing standard benchmarks at one point, and I have worked with quite a lot of government and enterprise clients solving some quite real and difficult performance problems of large scale distributed systems.
[00:03:55] Unknown:
And so in some of your work at Instaclustr, you ended up getting involved in this project for building the anomaly detection system. So can you start by describing a bit about what it is that you were trying to solve and the overall requirements that you were aiming for in tackling this problem?
[00:04:14] Unknown:
Sure. So, look, Instaclustr specializes in open source reliability at scale, so we provide a managed service for Cassandra, Kafka, and Spark. So we're interested in big systems, the performance and scalability of big systems, and that's what our customers depend upon. So I was quite interested in building an interesting application that would be able to use some of these technologies, in particular Kafka and Cassandra, and build a demonstration application that I could use to learn myself, but also that I could then blog about. Being the technology evangelist, I need something that I can actually produce some interesting content around. So we wanted a problem that was both, I guess, what you'd call a fast and a big data problem.
Using Kafka, which is a streaming technology, we wanted a problem that would mean that the data was coming in quickly, and we could then do something with it. And then using Cassandra, which is a big data, scalable NoSQL database, we wanted a problem that also had the size aspect in terms of the sheer amount of data that needed to be stored and then read and having something done with it. I guess we were also keen on doing something in the way of sort of a benchmark that would show off these technologies running on our managed service in particular. So we wanted something that would generate some quite big numbers, potentially as big as we could ever get. So the sky was the limit in a sense in terms of how far we wanted it to scale.
So we decided that a problem called anomaly detection was a good fit for this. Anomaly detection involves the real time aspect. You've got a stream of events coming in continuously, and you have to react to them very quickly and make a decision about whether there's something funny going on, whether there's anything wrong with one of the events. It's got a high cardinality in terms of the key that would be used to do the writing and reading into Cassandra. You can't just store the data in memory. It's too big for that. You have to be able to persist it on disk. And, also, it has the interesting problem that, potentially, there's some load spikes that might come along. So you can't just assume there's a constant average load on the system. There might be some quite big spikes caused by some unexpected use of the system that you have to cope with, and make sure that all the data actually does get processed eventually.
[00:06:36] Unknown:
And so in terms of the use cases for anomaly detection, I'm wondering what you were using as the source data for doing the analysis, and what some of the other cases are where anomaly detection is potentially useful in an actual production environment?
[00:06:54] Unknown:
Yes. So, look, that's a good question. I mean, anomaly detection is used just about everywhere that you can imagine, I think. That particular scenario was domain independent, but we had in mind a scenario perhaps in the financial industries area where you've got a lot of transactional data coming in. Perhaps the ID would be people's account ID, and you'd be trying to check to see whether there was something odd about the spending behavior from someone's account that had been hacked or something like that. And particularly, a lot of the finance industry these days is moving away from the big banks to more sort of peer to peer, real time sort of transactional systems. So the number of transactions is going up and up all the time. I mean, computer systems in general are a good application of anomaly detection.
We run, I think, about 3,000 nodes for customers running Cassandra and Kafka, and we have our own metrics that we're computing from that. And in a sense, we're looking for anomalies in those systems all the time, so that would be a typical case. Internet of things is a good example, where the data is actually coming from something in the real world, not just a computer system. It's something that's perhaps embedded in some other system. So machine monitoring. And even things like drones, of course, are becoming more common and important these days. You want to be able to make sure your drones are behaving themselves and nothing unusual is happening, particularly with things like proximity to things that they're not allowed to be near, or to each other. So we've actually come up with an extension to this application over the last few months that involves geospatial anomaly detection as well.
[00:08:37] Unknown:
And so now that you have sort of enumerated the problem space that you were aiming to handle, and you said that going into it, you knew ahead of time that you were planning on using Kafka and Cassandra together, I'm wondering what the overall process looked like beyond that as far as determining what the system architecture would look like and how the processing would be constructed?
[00:09:11] Unknown:
Yeah. So, look, we started out with actually a very simple prototype anomaly detection pipeline just in pure Java, and, essentially, in fact, the entire application was written in Java. So it was more or less just a standalone chunk of code that I initially developed just to check out whether the pipeline actually worked and did something interesting. So, essentially, this takes in new events. It stores them, whether that's in memory or some other database system, which it was eventually. Then, based on the key of that event, it actually gets some historical data related to that key. So the algorithm that we used eventually relies upon essentially having access to a time series for each key. So it goes back and gets the previous 50 events in time-reverse order for that key. And then it runs some sort of anomaly detection algorithm on that and then decides whether it's an anomaly or not. So that was the standalone version. We then moved to a version that used Kafka and Cassandra. So Kafka was used to essentially store the events as they were generated from some outside system. So it was being used as a buffer. We had a Kafka producer, which was generating interesting data, some of which had simulated anomalies, that was going into Kafka.
Then on the other side of Kafka, we had a Kafka consumer, which, for each event arriving in Kafka, would essentially just run the pipeline that I've previously described: get the event, write the event to Cassandra in this case, read the previous 50 events for the key from Cassandra, run the anomaly detection algorithm, and then make a decision. If there was sufficient data and it was able to find an anomaly, it would register an anomaly. Otherwise, it would say no, there's no anomaly. I mean, in a real system, that result would be pushed back to Kafka or some other system for action. And the assumption was that this was all happening in real time, that the result would be needed almost instantly, and that some action would be performed based on that.
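For readers who want a concrete picture of that per-event pipeline (poll Kafka, store the event, fetch the previous 50 values for the key, run the check), here is a minimal Java sketch. It is not the project's actual code: the topic name, the in-memory stand-in for the event store, and the simple three-sigma placeholder check are all assumptions for illustration. In the real system the store was Cassandra and the check was a CUSUM detector, both of which come up later in the conversation.

```java
import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

/** Minimal sketch of the per-event pipeline: store, fetch the last 50, check. */
public class AnomalyPipelineSketch {
    static final int WINDOW = 50;
    // In-memory stand-in for the event store: key -> most recent values, newest first.
    static final Map<String, Deque<Double>> history = new HashMap<>();

    static boolean isAnomaly(List<Double> previous, double current) {
        // Placeholder three-sigma check; the real project used a CUSUM detector.
        double mean = previous.stream().mapToDouble(Double::doubleValue).average().orElse(current);
        return Math.abs(current - mean) > 3 * Math.max(1e-9, stdDev(previous, mean));
    }

    static double stdDev(List<Double> xs, double mean) {
        double ss = 0;
        for (double x : xs) ss += (x - mean) * (x - mean);
        return Math.sqrt(ss / Math.max(1, xs.size()));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "anomaly-checker");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.DoubleDeserializer");

        try (KafkaConsumer<String, Double> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // assumed topic name
            while (true) {
                for (ConsumerRecord<String, Double> rec : consumer.poll(Duration.ofMillis(100))) {
                    Deque<Double> past = history.computeIfAbsent(rec.key(), k -> new ArrayDeque<>());
                    List<Double> lastN = new ArrayList<>(past);   // previous values, newest first
                    past.addFirst(rec.value());                   // 1. store the new event
                    if (past.size() > WINDOW) past.removeLast();
                    // 2-3. read the window and run the check once enough history exists
                    if (lastN.size() == WINDOW && isAnomaly(lastN, rec.value())) {
                        System.out.println("Anomaly for key " + rec.key());
                    }
                }
            }
        }
    }
}
```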
[00:11:23] Unknown:
As far as the overall platform selection, I'm curious whether, if you hadn't already decided upon Kafka and Cassandra going into it, you would have considered any other tooling or technologies that would be suitable for this particular problem space. And also, as far as your overall experience of working with Kafka and Cassandra for it, I'm curious in retrospect what your opinions are as far as their suitability for this problem space, and some of the other ways that you might combine one or the other with other technologies that are available?
[00:11:59] Unknown:
Yes. I mean, that's a good meta question. Look, in reality, for this particular project, we were pretty much tied to Kafka and Cassandra, and that was for a number of reasons, including things like cost and scalability, reliability, automated deployment, having a managed service so I didn't have to worry about spinning up Kafka and Cassandra clusters and keeping them running while I was worrying about trying to build the rest of the application pipeline, and also monitoring to make sure that the systems weren't being overloaded as I was scaling things up. So it was pretty much a given that we were using the Instaclustr managed Kafka and Cassandra services.
So that enabled me just to spin up clusters on demand, because this was an experimental project, but I basically had to spin up clusters of varying sizes pretty much constantly, use them for a while, and then delete them again. So that was a critical requirement, which really no other technology satisfied. And as I scaled up eventually as well, it was critical that I was able to have clusters of arbitrary size. I didn't wanna have any limits on the Kafka or Cassandra cluster side that would affect the results that we were getting. The other critical part of the problem, I guess, that we were interested in was how we would deploy the application in a production-like environment and then how we would scale and run that.
That, I guess, became the area where there were more possibilities and where we made some perhaps somewhat arbitrary choices, but ones I think which eventually were quite good choices.
[00:13:35] Unknown:
And so as far as the actual anomaly detection code itself, I'm wondering if you could just talk through what the algorithm was for evaluating the previous data and then determining what the thresholds were, and then how you decided whether a given event or a sequence of events was anomalous within the context of the previous information that you had available.
[00:13:59] Unknown:
Yeah. Look, in reality, we didn't spend much time on the anomaly detection algorithm itself. It was an open source one that I found. Basically, we treated it like a black box. It was called a CUSUM anomaly detector; that's a cumulative sum. It's basically a very simple statistical approach for detecting anomalies. It actually came from the industrial control sort of world in, I think, the fifties or sixties. So it predates quite a lot of the more sophisticated approaches, including any of the model- and machine-learning-based approaches. But it does have the advantage that you don't need to train it, and it's quite fast. And for this exercise, we were more interested in showing how well Kafka and Cassandra scaled. We had previously built some interesting demo applications, including an IoT one, but they turned out to be quite intensive on the application side. So that was something we didn't want to fall into the trap of this time: having one that was too complicated in terms of the application or needed too much in the way of resources on that side.
So, I mean, the actual algorithm was essentially as I described in terms of the anomaly detection pipeline; there's a box in that pipeline where it runs the statistical method and decides whether a current event is significantly different from the previous 50 events or not. But in practice, the slightly more interesting part of the story, I guess, is that in order to deploy that in a scalable way, we had to make some design choices around how we implemented it. So we ended up essentially running the Kafka consumer part in its own thread pool and the rest of the pipeline in a separate tunable thread pool as well, with the events being passed between them. We did this after a few false starts. Initially, we had no constraints on how many threads were being spun up for the pipeline, but that didn't work particularly well. Parts of the pipeline essentially blew out and used up far too many resources.
So it became more important to be able to tune the pipeline so that it scaled well as we scaled up the whole system.
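The episode doesn't show the detector code itself, but a textbook two-sided CUSUM over a window of recent values looks roughly like the sketch below; the slack and threshold parameters (commonly around 0.5 and 5 standard deviations) are illustrative defaults, not values from the project.

```java
import java.util.List;

/** Two-sided CUSUM change detector, a sketch of the kind of statistical check described. */
public class CusumDetector {
    private final double slack;     // k: allowed drift per step, in standard deviations
    private final double threshold; // h: decision threshold, in standard deviations

    public CusumDetector(double slack, double threshold) {
        this.slack = slack;
        this.threshold = threshold;
    }

    /** Returns true if the series (oldest to newest) drifts beyond the threshold. */
    public boolean isAnomaly(List<Double> series) {
        double mean = series.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double var = series.stream().mapToDouble(x -> (x - mean) * (x - mean)).sum()
                / Math.max(1, series.size() - 1);
        double sd = Math.max(Math.sqrt(var), 1e-9);

        double high = 0, low = 0; // cumulative sums of positive and negative deviations
        for (double x : series) {
            double z = (x - mean) / sd;
            high = Math.max(0, high + z - slack);
            low = Math.max(0, low - z - slack);
            if (high > threshold || low > threshold) {
                return true;
            }
        }
        return false;
    }
}
```

Something like `new CusumDetector(0.5, 5.0).isAnomaly(window)`, run over the 50 stored values plus the current event, would flag a sustained shift away from the window's mean without any training step, which is the property Paul highlights.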
[00:16:13] Unknown:
And one of the interesting aspects of the way that the anomaly detection is run is that in order to be able to determine whether any given event is outside of the acceptable bounds, you need to have some windowing capacity to set the sort of rolling threshold and keep a running average. And so I'm wondering how that impacted the requirements for the Kafka queues and for the Cassandra databases in terms of the rate at which you needed to be able to access the data and the overall historical span that you needed to be able to query across to determine what those baselines are, and whether you experimented at all with changing those window sizes and how that impacted your ability to effectively detect anomalies.
[00:17:02] Unknown:
Yes. Look, that's certainly an interesting part of the problem. The algorithm relied upon having a fixed window size, and in this particular case, we picked 50 because that was sort of a magic number. It worked okay. It wasn't huge amounts of data, but it was enough to make sure that the anomalies that I knew I was injecting actually were being picked up. Okay? But I guess the interesting aspect is those events could be over an arbitrary time period. And because of the high key cardinality, some of the keys may have events that are very, very infrequent and go back almost to the beginning of the universe, potentially. So you have to be able to store all those events in a way in which you can then retrieve them in time-reverse order, which is exactly what Cassandra is really good at. So that was a really good fit for Cassandra. We were using it really in the way that it was designed for.
In terms of the write/read ratio on Cassandra, every time there's an event written to Cassandra, there's also a read event. So it's a sort of one-to-one write/read ratio. The main difference, though, is that we are returning 50 rows for each read, whereas there's only one row for each write. So it's perhaps not a typical load for Cassandra. Cassandra is often used with very high write rates compared to the read rates. This one's a bit more demanding, so it does use more Cassandra resources than some use cases.
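As an illustration of that "one row written, 50 rows read back per key" pattern, a hypothetical table and DataStax Java driver wrapper might look like the sketch below; the keyspace, table, and column names are made up rather than taken from the actual schema. The `CLUSTERING ORDER BY (ts DESC)` clause is what makes the time-reverse read a single cheap partition query, which is the Cassandra strength Paul refers to.

```java
import java.util.ArrayList;
import java.util.List;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

/** Hypothetical Cassandra-backed event store: one write per event, one 50-row read per check. */
public class EventStore implements AutoCloseable {
    private final Cluster cluster;
    private final Session session;
    private final PreparedStatement insert;
    private final PreparedStatement selectLast50;

    public EventStore(String contactPoint) {
        cluster = Cluster.builder().addContactPoint(contactPoint).build();
        session = cluster.connect();
        // Clustering rows by time in descending order makes "latest 50 for a key"
        // a single partition read (illustrative schema, replication factor 1 for a sketch).
        session.execute("CREATE KEYSPACE IF NOT EXISTS anomaly WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
        session.execute("CREATE TABLE IF NOT EXISTS anomaly.events ("
                + " id text, ts timeuuid, value double,"
                + " PRIMARY KEY (id, ts)) WITH CLUSTERING ORDER BY (ts DESC)");
        insert = session.prepare(
                "INSERT INTO anomaly.events (id, ts, value) VALUES (?, now(), ?)");
        selectLast50 = session.prepare(
                "SELECT value FROM anomaly.events WHERE id = ? LIMIT 50");
    }

    public void write(String id, double value) {
        session.execute(insert.bind(id, value));
    }

    /** Returns up to the 50 most recent values for the key, newest first. */
    public List<Double> last50(String id) {
        List<Double> values = new ArrayList<>();
        for (Row row : session.execute(selectLast50.bind(id))) {
            values.add(row.getDouble("value"));
        }
        return values;
    }

    @Override
    public void close() {
        cluster.close();
    }
}
```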
[00:18:26] Unknown:
And in terms of the data source that you're using, I know that you said that it was largely synthetic, but I'm wondering how you approached the introduction of anomalies to make sure that they were in some sort of a realistic pattern, so that you didn't just end up optimizing for the way that the source data itself was constructed versus how it might play out in an actual real world scenario?
[00:18:55] Unknown:
Yes. We weren't too worried about the functional aspects from that perspective. I did want some anomalies to be injected because, when I was instrumenting the application and looking at the metrics, I did want to be able to see that it was actually picking up some anomalies. So it's a good question. I may have overthought this initially. I actually built a reasonably sophisticated sort of biased random walk event generator in the first version that more or less kept the values the same. And then occasionally it would just go off, like a sort of random drunk behavior, and then eventually come back to the normal value.
For the final version, in fact, we just generated events at more or less the same value with some known number of very large-valued events, and that worked fine as well.
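A hedged sketch of that final, simpler generator, mostly constant values per key with occasional large spikes injected at a known rate, could look like this in Java; the topic name, key cardinality, and spike probability here are invented for the example rather than taken from the project.

```java
import java.util.Properties;
import java.util.Random;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Produces mostly flat values per key, with occasional large "anomalous" spikes injected. */
public class EventGenerator {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.DoubleSerializer");

        Random random = new Random();
        int keyCardinality = 1_000_000;    // high key cardinality, as described in the episode
        double anomalyProbability = 0.001; // illustrative rate of injected anomalies

        try (KafkaProducer<String, Double> producer = new KafkaProducer<>(props)) {
            while (true) {
                String key = "account-" + random.nextInt(keyCardinality);
                double value = 10.0 + random.nextGaussian();   // "normal" behaviour around a flat value
                if (random.nextDouble() < anomalyProbability) {
                    value += 100.0;                            // injected large-valued event
                }
                producer.send(new ProducerRecord<>("events", key, value)); // assumed topic name
                Thread.sleep(1);                               // crude rate limiting for the sketch
            }
        }
    }
}
```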
[00:19:45] Unknown:
And then in terms of ramping up the capacity and scalability of the system for being able to process larger volumes of events, I know that you took it in stages, and so I'm wondering if you can just talk through what your approach was for determining what the acceptable performance capabilities were at each level of scale and what you saw in terms of the behavior of the system at those different levels.
[00:20:16] Unknown:
Yes. Perhaps in order to answer this question, I just need to step back slightly, because I realized I haven't really explained what some of the other technologies were that we ended up choosing. Because of the importance of the application itself being scalable, and being understandable as well, as I built it and tested it and deployed it on different size systems and tried to ramp it up, we decided on a few other technologies that would hopefully help us. And I guess this was the more experimental part of the project. So we soon realized that in order to be able to deploy the application repeatedly and easily and scalably, we needed something like Kubernetes.
So Kubernetes became quite critical as it turned out. We ran Kubernetes on AWS, which is where the rest of the system was running. So the Instaclustr managed Kafka and Cassandra are running in a particular AWS region, and we spun up a Kubernetes cluster in that region using the AWS EKS service, which allows you to then create as many worker nodes to run the application as you need. So that worked really well. Well, there was, I should say, some learning curve for that. Having overcome that initially, it was certainly worth the initial effort for the final degree of automation that it gave us. The other problem with a system of this scale, particularly using multiple heterogeneous technologies and different cluster types and different applications, is you need some way of seeing what's going on inside it. So the two aspects of that that we thought were critical were, first, being able to get some metrics out of it. The metrics we were interested in were things like the throughput at various parts, the system response times, and also, more critically, probably the business metric that we were interested in, which is the number of checks per second or hour or day that the whole system was able to actually process.
So in order to do that, we needed some metrics. So we picked Prometheus, which I think came out of SoundCloud. It's a really interesting, scalable, open source monitoring solution, and it fits in really well with Kubernetes. That's the other critical thing. Kubernetes has a lot of dynamic stuff going on. It's got things called pods, which are sort of the unit of concurrency. Pods can come and go all the time, and their IP addresses change. So in order to be able to monitor the applications running on those pods, you need a system that can actually know about all the changes, and Prometheus running on Kubernetes does that as well. The other technology we thought was worth exploring was something called OpenTracing. This is another open source standard, really, I guess you'd call it. It's one which allows you to do end-to-end tracing of an application, sort of from the front end and across multiple components all the way to the end of the system. And we picked a particular tracing tool called Jaeger, which I think comes from Uber, for that. And that also complemented the Prometheus monitoring because it gave us the end-to-end tracing ability as well. So all of those technologies, including Kubernetes, were quite critical for the scalability part of the experiment. It was quite critical for debugging, for example, to have good visibility into the system. When something went wrong, I needed to be able to work out what was going wrong.
For tuning the system, it was critical to be able to work out throughput and response times at various parts as well. And Kubernetes was critical for scaling the actual application. So, coming back to the question you asked about the scalability process, we initially tried something a bit grandiose. We tried spinning up a really, really big Cassandra cluster, and it did work, but it didn't scale particularly well. We knew from some tests we'd run on a smaller cluster that we weren't getting the throughput we'd expect from the bigger cluster. So essentially, we had to change the process at that point and make it a lot more incremental. We started out with the smaller Cassandra cluster sizes and increased the cluster sizes once we knew what was going on and understood the behavior of the system at that particular size.
Essentially, we realized we had to spend quite a bit of time and effort tuning the thread pools. So these are the Kafka consumer and the Cassandra client thread pools that run the actual anomaly detector pipeline. It's quite critical with these sorts of systems to make sure that you don't have too many Cassandra connections and also that you don't have too many Kafka partitions. So you really need to optimize the whole pipeline in order to try and minimize those two things. And finally, yep, we were able to spin it up to quite a large cluster size. So the one that we got the final result of 19 billion anomaly checks a day for had 400 cores in the Cassandra cluster, 120 cores in the Kubernetes worker nodes, and only 70 cores in Kafka. So it was interesting that Kafka was highly efficient, and we didn't need to resource that much at all.
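To make the "checks per second" business metric concrete, a minimal sketch of Prometheus instrumentation with the Java simpleclient might look like the following; the metric names and scrape port are assumptions for illustration, not the project's actual instrumentation.

```java
import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;
import io.prometheus.client.exporter.HTTPServer;

/** Minimal Prometheus instrumentation for the pipeline's business and latency metrics. */
public class PipelineMetrics {
    // rate(anomaly_checks_total[1m]) in Prometheus gives the "checks per second" business metric.
    static final Counter CHECKS = Counter.build()
            .name("anomaly_checks_total").help("Total anomaly checks run").register();
    static final Counter ANOMALIES = Counter.build()
            .name("anomalies_detected_total").help("Total anomalies detected").register();
    static final Histogram CHECK_LATENCY = Histogram.build()
            .name("anomaly_check_seconds").help("Latency of a single anomaly check").register();

    /** Wraps one pass through the detector pipeline with counters and a latency timer. */
    static void recordCheck(java.util.function.BooleanSupplier check) {
        Histogram.Timer timer = CHECK_LATENCY.startTimer();
        try {
            if (check.getAsBoolean()) {
                ANOMALIES.inc();
            }
        } finally {
            CHECKS.inc();
            timer.observeDuration();
        }
    }

    public static void main(String[] args) throws Exception {
        HTTPServer server = new HTTPServer(9091); // endpoint Prometheus scrapes (assumed port)
        recordCheck(() -> false);                 // stand-in for a real detector call
        Thread.sleep(60_000);                     // keep the endpoint up long enough for a scrape
        server.stop();
    }
}
```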
[00:25:17] Unknown:
And as you moved from the smaller sized clusters through to the larger ones, I'm wondering what you found as far as bottlenecks or tuning requirements to ensure that you were getting the appropriate levels of performance that you needed, and how that differed from the small to the large scale.
[00:25:35] Unknown:
Yes. I think it's safe to say that it's just a question of tuning. The Kafka and Cassandra clusters themselves are intrinsically scalable. As you add more nodes to either of those cluster types, you will get more throughput. So it's really on the application side where you can make mistakes or where you can have something that's not optimal. And that's what we found in practice as well: you really have to have the right sort of monitoring, both on the application and the clusters. You need to be able to understand the effect of different tuning parameters, in our case, the different thread pool sizes. And one size doesn't fit all. Even the settings that we had for the clusters that were just slightly smaller than the final ones were, in fact, not optimal for the final cluster size at all.
[00:26:26] Unknown:
Going into the project, I'm curious what your assumptions were as far as how you thought the system would perform and what you thought the system requirements might be as far as being able to handle a particular volume of data and the overall lessons that you got out of it as you progressed through and then brought it to its conclusion?
[00:26:49] Unknown:
Look, so I think the assumptions around Kafka and Cassandra's scalability were all valid. And the idea of using a managed service to automate spinning up those Kafka and Cassandra clusters and managing them and monitoring them, I think, was quite critical as well. There's no way that we could have done this sort of a project using any other approach. That meant that we could focus on developing the actual application, the one that in reality would add business value for a customer. And the application deployment using Kubernetes achieved something similar for the application to what the managed service does for Kafka and Cassandra. It meant that once we had that infrastructure set up on AWS, I could simply click a button and the application would be deployed and scale up to whatever requirements I had.
And having Prometheus and OpenTracing deployed in that Kubernetes cluster with the application was also, I think, something that we had assumed would be important, and it did turn out to be critical. We just couldn't have scaled the system up and got the sort of business level benchmarks out of it, at the scale that we were dealing with, in any other way. In particular, Kafka observability, I think, is quite a hard problem because it's across process boundaries. The OpenTracing approach, with the third-party add-on that we had for Kafka, worked really, really well. It enables tracing right from Kafka producers through to consumers, which normally you wouldn't have visibility into, that sort of dependency and the topology of events as they flow through a system. And oddly enough, that actually helped us debug it as well.
It's perhaps a strange example of a case where the Prometheus monitoring itself actually had a bug in the instrumentation that I'd put in the code. We weren't detecting any anomalies in the final system. However, the OpenTracing was saying that the events were going through all the way to the end, so they should have been being counted at that point. And I realized I'd made a mistake in the Prometheus instrumentation.
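As a rough sketch of what the OpenTracing side can look like (using the plain Jaeger client rather than the specific third-party Kafka add-on mentioned in the episode), wrapping a pipeline stage in a span is roughly this; the service name, span name, and tag are illustrative.

```java
import io.jaegertracing.Configuration;
import io.opentracing.Span;
import io.opentracing.Tracer;

/** Sketch of wrapping a pipeline stage in an OpenTracing span reported to Jaeger. */
public class TracedCheck {
    // Reads JAEGER_* environment variables for sampler and reporter settings.
    static final Tracer TRACER = Configuration.fromEnv("anomaly-detector").getTracer();

    static void runCheck(String key) {
        Span span = TRACER.buildSpan("anomaly-check").start();
        span.setTag("event.key", key);
        try {
            // ... write to Cassandra, read the 50-event window, run the CUSUM detector ...
        } finally {
            span.finish(); // the finished span shows up in Jaeger as part of the end-to-end trace
        }
    }

    public static void main(String[] args) {
        runCheck("account-42");
    }
}
```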
[00:29:13] Unknown:
And what were some of the most interesting observations that you made as far as the capacity or capabilities of the tools that you were working with? Or, if you were to revisit this project, what might you do differently to try and gain some new understanding or maybe experiment with some new combinations of technologies?
[00:29:35] Unknown:
Look, one lesson, I think, is that application tuning is still quite tricky for large scale distributed systems. I think there is still room for improvement there, both in terms of the tools and the methodologies that would help people with doing that. Initially, when we started this project, in fact, we had wondered about using some sort of a Kubernetes microservices framework, as well as sort of the vanilla Kubernetes. We didn't do that. I still think that's an interesting extension at some point. I think having to sort of balance all the different thread pools in the application and everything is a bit old school. I'm sure there is perhaps a better way of doing that in the future. I guess one aspect of this that we were keen to demonstrate was that the Instaclustr managed services worked pretty well, and they did. We've got a lot of customers that depend upon our systems, and this is just another example of an interesting use case that may hopefully be of interest to potential customers using open source systems at scale and with reliability.
Internally, we're actually quite interested in Kubernetes ourselves. We've developed a Kubernetes operator for Cassandra that enables people to actually just deploy Cassandra themselves on their own systems easily using Kubernetes. We're also thinking about making our cluster metrics available as Prometheus metrics. And most recently, I think we've also added another service for our provisioning API, which is Kubernetes based as well, using the Open Service Broker. So we're quite keen to make it easy for customers to integrate their applications with managed services using the sort of technology stacks that people are now using for these sorts of systems.
[00:31:17] Unknown:
Beyond the anomaly detection use case, I'm wondering what are some of the other experiments that you have run recently or other experiments that you have planned for being able to exercise the capabilities of the technologies that you support?
[00:31:32] Unknown:
The most recent other one that I've done was an Internet of Things demonstration application. That was a pure Kafka example. So that was really designed to show some of the quite sophisticated things you can do with Kafka. Kafka is not just a message broker. It really enables you to build loosely coupled microservice applications. So that was quite a fun system to build. And as I mentioned, we are extending this one. We have extended it, in fact, to do geospatial anomaly detection, which was also quite an interesting problem, because the definition of proximity actually is dynamic, and the scale can change dramatically from sort of the square meter that you're in at the moment to the whole planet, or even sort of three-dimensionally out to the geostationary satellite orbits and things like that. So that's quite an interesting challenge, and it put a lot more demand on the Cassandra side to be able to store and retrieve data that has potentially three-dimensional coordinates associated with it. So that's been quite fun. That project's just about finished at the moment.
[00:32:40] Unknown:
Yeah. It's definitely interesting seeing the differences in capacity and performance for the same system doing ostensibly the same thing, but just by changing the format or context of the data, which might have either additional fields or slightly different processing requirements, to be able to retrieve the same end result. And are there any other aspects of the anomaly detection project that you're working on, or any of the work that you're doing at Instaclustr with supporting these cluster technologies, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:33:15] Unknown:
Look, yeah, I think there's a couple of interesting outcomes. One is that elasticity is quite critical still, I think, for systems like this. The type of use case that we had in mind meant that we didn't necessarily know what volume of data would be coming through over what periods of time. So, to some extent, choosing the combination of Kafka and Cassandra meant that we can cope with part of that problem by using Kafka as a buffer. So essentially it can absorb load spikes for potentially arbitrary periods of time, the rest of the system can just go ahead and process them, and you don't lose any data. It does mean, of course, your SLA will get pushed out, and it will actually take a lot longer to process some of those events. So it's only a partial solution. The perfect solution would be one where the application itself, and indeed the Cassandra cluster, can resize dynamically and in time to cope with any load increases. So I think that's still an area that some work could be done in. We have one mechanism to increase the size of Cassandra clusters, which essentially is by increasing the instance sizes, so we can increase the number of cores available on the nodes that you're running as a cluster. And that can happen very quickly. So I did do that as an experiment as well for the anomaly detection use case. The catch is it can only increase the size of the cluster up to a factor of 8 times, so that is an upper bound. So that's one aspect. I think another interesting aspect is that having the metrics and the tracing of the application was quite critical, but we had to have two systems in order to do that. Ideally, you'd have one integrated system that would do both metrics and tracing and give you a comprehensive view into the system.
The other aspect is that we had to use instrumentation to add the Prometheus and the OpenTracing data into the application. That's quite tedious and error prone. There are systems, I know, in some of the commercial APM products which avoid having to do the instrumentation manually. And it would be really nice if the open source community would pick up on some of those techniques, I think, as well.
[00:35:25] Unknown:
Yeah. It's definitely interesting how Kafka was acting as the buffer so that, as you're saying, it can handle any spikes in terms of input data, but then you still have the issue of Cassandra draining it at a fairly constant rate, and the application relying on reading the data being written into Cassandra. You might approach that by having the application and Cassandra both consuming from the Kafka topics, so the application can accept some measure of uncertainty or error in favor of being able to get more real time data, but then have maybe a secondary process that analyzes the longer window within the Cassandra cluster for sort of resetting what the anomaly boundaries are, or changing what the window size is, to be able to handle those different variations in load.
[00:36:32] Unknown:
Yes, they're really good ideas. We had thought about essentially having a dynamic capability, which we did actually have in the code but didn't really use very much. Essentially, it decides whether an event is significant enough to run the check on. So that would be a good trade-off at that point, if it was starting to run out of resources. It could perhaps pull back on which events it's actually bothering to check, and then you check the ones that are perhaps high value or that it has some a priori reason to think might be suspicious. Yep.
[00:37:01] Unknown:
And was there anything else that you wanted to cover before we move on?
[00:37:06] Unknown:
I think that's just about everything I've got on my whiteboard here. So yeah.
[00:37:10] Unknown:
Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I'd like to thank you for taking the time today to join me and discuss your experiences of stressing the capabilities of Kafka and Cassandra for handling these large volumes of input data and being able to use that for building a test case for anomaly detection. It's definitely an interesting problem space, and it's good to see the scaling capabilities that you were able to achieve with that. So I appreciate you sharing that information, and I hope you enjoy the rest of your day. Thanks, Tobias. Thanks for all the interesting questions. And I hope the listeners
[00:37:52] Unknown:
benefit from our discussion.
Introduction and Sponsor Messages
Interview with Paul Brebner Begins
Overview of Anomaly Detection System
Use Cases for Anomaly Detection
System Architecture and Design Choices
Anomaly Detection Algorithm
Data Source and Synthetic Anomalies
Scalability and Performance Tuning
Assumptions and Lessons Learned
Other Experiments and Future Plans
Final Thoughts and Closing Remarks