Summary
Anomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems monitoring. Scaling the volume of events that can be processed in real-time can be challenging, so Paul Brebner from Instaclustr set out to see how far he could push Kafka and Cassandra for this use case. In this interview he explains the system design that he tested, his findings for how these tools were able to work together, and how they behaved at different orders of scale. It was an interesting conversation about how he stress tested the Instaclustr managed service for benchmarking an application that has real-world utility.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Paul Brebner about his experience designing and building a scalable, real-time anomaly detection system using Kafka and Cassandra
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing the problem that you were trying to solve and the requirements that you were aiming for?
- What are some example cases where anomaly detection is useful or necessary?
- Once you had established the requirements in terms of functionality and data volume, what was your approach for determining the target architecture?
- What was your selection criteria for the various components of your system design?
- What tools and technologies did you consider in your initial assessment and which did you ultimately converge on?
- If you were to start over today would you do any of it differently?
- Can you talk through the algorithm that you used for detecting anomalous activity?
- What is the size/duration of the window within which you can effectively characterize trends and how do you collapse it down to a tractable search space?
- What were you using as a data source, and if it was synthetic how did you handle introducing anomalies in a realistic fashion?
- What were the main scalability bottlenecks that you encountered as you began ramping up the volume of data and the number of instances?
- How did those bottlenecks differ as you moved through different levels of scale?
- What were your assumptions going into this project and how accurate were they as you began testing and scaling the system that you built?
- What were some of the most interesting or unexpected lessons that you learned in the process of building this anomaly detection system?
- How have those lessons fed back to your work at Instaclustr?
Contact Info
- @paulbrebner_ on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Instaclustr
- Kafka
- Cassandra
- Canberra, Australia
- Spark
- Anomaly Detection
- Kubernetes
- Prometheus
- OpenTracing
- Jaeger
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including a new one in Toronto and one opening in Mumbai at the end of this year. And for your machine learning workloads and other CPU intensive tasks, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Integrating data across the enterprise has been around for decades, and so have the techniques to do it. But a new way of integrating data and improving streams has evolved. By integrating each silo independently, data is able to be integrated without any direct relation. At CluedIn, they call it eventual connectivity. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method and the technologies that make it possible,
then schedule a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin, and don't forget to thank them for supporting the show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been published, and the super early bird registration for up to $300 off is available until July 26th, with early bird pricing for $200 off available through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register.
Go to dataengineeringpodcast.com/conferences to learn more about this and other events and take advantage of our partner discounts when you register. And don't forget to go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing Paul Brebner about his experience designing and building a scalable, real-time anomaly detection system using Kafka and Cassandra. Paul, can you start by introducing yourself?
[00:03:01] Unknown:
Hi. I'm Paul Brebner. I'm the technology evangelist with Instaclustr. It's an open source company based in Canberra, and we've got offices around the world and in the US.
[00:03:12] Unknown:
And do you remember how you first got involved in the area of data management?
[00:03:16] Unknown:
So, yeah, I'm a computer scientist. I've spent a lot of time in my career doing research and development, mainly in universities, government labs, and a few startups, including this one. My sort of area of specialty is distributed systems, grid computing when that was fashionable, and more recently things like performance engineering. So I had some involvement with the SPEC organization doing standard benchmarks at one point, and I have worked with quite a lot of government and enterprise clients solving some quite real and difficult performance problems of large scale distributed systems.
[00:03:55] Unknown:
And so in some of your work at Instaclustr, you ended up getting involved in this project for building the anomaly detection system. So can you start by describing a bit about what it is that you were trying to solve and the overall requirements that you were aiming for in tackling this problem?
[00:04:14] Unknown:
Sure. So, look, Instaclustr specializes in open source reliability at scale, so we provide a managed service for Cassandra, Kafka, and Spark. So we're interested in big systems, the performance and scalability of big systems, and that's what our customers depend upon. So I was quite interested in building an interesting application that would be able to use some of these technologies, in particular Kafka and Cassandra, and build a demonstration application that I could use to learn myself, but also that I could then blog about. Being the technology evangelist, I need something that I can actually produce some interesting content around. So we wanted a problem that was both, I guess, what you'd call a fast and a big data problem.
Using Kafka, which is a streaming technology, we wanted a problem that would mean that the data was coming in quickly, and we could then do something with it. And then using Cassandra, which is a big data, scalable NoSQL database, we wanted a problem that also had the size aspect in terms of the sheer amount of data that needed to be stored and then read and having something done with it. I guess we were also keen on doing something in the way of sort of a benchmark that would show off these technologies running on our managed service in particular. So we wanted something that would generate some quite big numbers, potentially as big as we could ever get. So the sky was the limit in a sense in terms of how far we wanted it to scale.
So we decided that a problem called anomaly detection was a good fit for this. Anomaly detection involves the real time aspect. You've got a stream of events coming in continuously, and you have to react to them very quickly and make a decision about whether there's something funny going on, whether there's anything wrong with one of the events. It's got a high cardinality in terms of the key that would be used to do the writing and reading into Cassandra. You can't just store the data in memory. It's too big for that. You have to be able to persist it on disk. And, also, it has the interesting problem that, potentially, there's some load spikes that might come along. So you can't just assume there's a constant average load on the system. There might be some quite big spikes caused by some unexpected use of the system that you have to cope with, and make sure that all the data actually does get processed eventually.
[00:06:36] Unknown:
And so in terms of the use cases for anomaly detection, I'm wondering what you were using as the source data for doing the analysis, and what some of the other cases are where anomaly detection is potentially useful in an actual production environment?
[00:06:54] Unknown:
Yes. So, look, that's a good question. I mean, anomaly detection is used just about everywhere that you can imagine, I think. That particular scenario was domain independent, but we had in mind a scenario perhaps in the financial industries area where you've got a lot of transactional data coming in. Perhaps the ID would be people's account ID, and you'd be trying to check to see whether there was something odd about the spending behavior from someone's account that had been hacked or something like that. And particularly, a lot of the finance industry these days is moving away from the big banks to more sort of peer to peer, real time sort of transactional systems. So the number of transactions is going up and up all the time. I mean, computer systems in general are a good application of anomaly detection.
We run, I think, about 3,000 nodes for customers running Cassandra and Kafka, and we have our own metrics that we're computing from that. And in a sense, we're looking for anomalies in those systems all the time, so that would be a typical case. Internet of things is a good example, where the data is actually coming from something in the real world, not just a computer system. It's something that's perhaps embedded in some other system. So machine monitoring. And even things like drones, of course, are becoming more common and important these days. You want to be able to make sure your drones are behaving themselves and nothing unusual is happening, particularly with things like proximity to things that they're not allowed to be near, or to each other. So we've actually come up with an extension to this application over the last few months that involves geospatial anomaly detection as well.
[00:08:37] Unknown:
And so now that you have sort of enumerated the problem space that you were aiming to handle, and you said that going into it, you knew ahead of time that you were planning on using Kafka and Cassandra together, I'm wondering what the overall process looked like beyond that as far as determining what the system architecture would look like and how the processing would be constructed?
[00:09:11] Unknown:
Yeah. So, look, we started out with actually a very simple prototype anomaly detection pipeline just in pure Java, and, essentially, in fact, the entire application was written in Java. So it was more or less just a standalone chunk of code that I initially developed just to check out whether the pipeline actually worked and did something interesting. So, essentially, this takes in new events. It stores them, whether that's in memory or some other database system, which it was eventually. Then, based on the key of that event, it actually gets some historical data related to that key. So the algorithm that we used eventually relies upon essentially having access to a time series for each key. So it goes back and gets the previous 50 events in time-reverse order for that key. And then it runs some sort of anomaly detection algorithm on that and then decides whether it's an anomaly or not. So that was the standalone version. We then moved to a version that used Kafka and Cassandra. So Kafka was used to essentially store the events as they were generated from some outside system. So it was being used as a buffer. We had a Kafka producer, which was generating interesting data, some of which had simulated anomalies, that was going into Kafka.
Then on the other side of Kafka, we had a Kafka consumer, which, for each event arriving in Kafka, would essentially just run the pipeline that I've previously described: get the event, write the event to Cassandra in this case, read the previous 50 events for the key from Cassandra, run the anomaly detection algorithm, and then make a decision. If there was sufficient data and it was able to find an anomaly, it would register an anomaly. Otherwise, it would say no, there's no anomaly. I mean, in a real system, that result would be pushed back to Kafka or some other system for action. And the assumption was that this was all happening in real time, that the result would be needed almost instantly, and that some action would be performed based on that.
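For readers who want a concrete picture of that per-event pipeline (poll Kafka, store the event, fetch the previous 50 values for the key, run the check), here is a minimal Java sketch. It is not the project's actual code: the topic name, the in-memory stand-in for the event store, and the simple three-sigma placeholder check are all assumptions for illustration. In the real system the store was Cassandra and the check was a CUSUM detector, both of which come up later in the conversation.

```java
import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

/** Minimal sketch of the per-event pipeline: store, fetch the last 50, check. */
public class AnomalyPipelineSketch {
    static final int WINDOW = 50;
    // In-memory stand-in for the event store: key -> most recent values, newest first.
    static final Map<String, Deque<Double>> history = new HashMap<>();

    static boolean isAnomaly(List<Double> previous, double current) {
        // Placeholder three-sigma check; the real project used a CUSUM detector.
        double mean = previous.stream().mapToDouble(Double::doubleValue).average().orElse(current);
        return Math.abs(current - mean) > 3 * Math.max(1e-9, stdDev(previous, mean));
    }

    static double stdDev(List<Double> xs, double mean) {
        double ss = 0;
        for (double x : xs) ss += (x - mean) * (x - mean);
        return Math.sqrt(ss / Math.max(1, xs.size()));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "anomaly-checker");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.DoubleDeserializer");

        try (KafkaConsumer<String, Double> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // assumed topic name
            while (true) {
                for (ConsumerRecord<String, Double> rec : consumer.poll(Duration.ofMillis(100))) {
                    Deque<Double> past = history.computeIfAbsent(rec.key(), k -> new ArrayDeque<>());
                    List<Double> lastN = new ArrayList<>(past);   // previous values, newest first
                    past.addFirst(rec.value());                   // 1. store the new event
                    if (past.size() > WINDOW) past.removeLast();
                    // 2-3. read the window and run the check once enough history exists
                    if (lastN.size() == WINDOW && isAnomaly(lastN, rec.value())) {
                        System.out.println("Anomaly for key " + rec.key());
                    }
                }
            }
        }
    }
}
```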
[00:11:23] Unknown:
As far as the overall platform selection, I'm curious whether, if you hadn't already decided upon Kafka and Cassandra going into it, you would have considered any other tooling or technologies that would be suitable for this particular problem space. And also, as far as your overall experience of working with Kafka and Cassandra for it, I'm curious in retrospect what your opinions are as far as their suitability for this problem space, and some of the other ways that you might combine one or the other with other technologies that are available?
[00:11:59] Unknown:
Yes. I mean, that's a good meta question. Look, in reality, for this particular project, we were pretty much tied to Kafka and Cassandra, and that was for a number of reasons, including things like cost and scalability, reliability, automated deployment, having a managed service so I didn't have to worry about spinning up Kafka and Cassandra clusters and keeping them running while I was worrying about trying to build the rest of the application pipeline, and also monitoring to make sure that the systems weren't being overloaded as I was scaling things up. So it was pretty much a given that we were using the Instaclustr managed Kafka and Cassandra services.
So that enabled me just to spin up clusters on demand, because this was an experimental project, but I basically had to spin up clusters of varying sizes pretty much constantly, use them for a while, and then delete them again. So that was a critical requirement, which really no other technology satisfied. And as I scaled up eventually as well, it was critical that I was able to have clusters of arbitrary size. I didn't wanna have any limits on the Kafka or Cassandra cluster side that would affect the results that we were getting. The other critical part of the problem, I guess, that we were interested in was how we would deploy the application in a production-like environment and then how we would scale and run that.
That, I guess, became the area where there were more possibilities and where we made some perhaps somewhat arbitrary choices, but ones I think which eventually were quite good choices.
[00:13:35] Unknown:
And so as far as the actual anomaly detection code itself, I'm wondering if you could just talk through what the algorithm was for evaluating the previous data and then determining what the thresholds were, and then how you decided whether a given event or a sequence of events was anomalous within the context of the previous information that you had available.
[00:13:59] Unknown:
Yeah. Look, in reality, we didn't spend much time on the anomaly detection algorithm itself. It was an open source one that I found. Basically, we treated it like a black box. It was called a CUSUM anomaly detector; that's a cumulative sum. It's basically a very simple statistical approach for detecting anomalies. It actually came from the industrial control sort of world in, I think, the fifties or sixties. So it predates quite a lot of the more sophisticated approaches, including any of the model- and machine-learning-based approaches. But it does have the advantage that you don't need to train it, and it's quite fast. And for this exercise, we were more interested in showing how well Kafka and Cassandra scaled. We had previously built some interesting demo applications, including an IoT one, but they turned out to be quite intensive on the application side. So that was something we didn't want to fall into the trap of this time: having one that was too complicated in terms of the application or needed too much in the way of resources on that side.
So, I mean, the actual algorithm was essentially as I described in terms of the anomaly detection pipeline; there's a box in that pipeline where it runs the statistical method and decides whether a current event is significantly different from the previous 50 events or not. But in practice, the slightly more interesting part of the story, I guess, is that in order to deploy that in a scalable way, we had to make some design choices around how we implemented it. So we ended up essentially running the Kafka consumer part in its own thread pool and the rest of the pipeline in a separate tunable thread pool as well, with the events being passed between them. We did this after a few false starts. Initially, we had no constraints on how many threads were being spun up for the pipeline, but that didn't work particularly well. Parts of the pipeline essentially blew out and used up far too many resources.
So it became more important to be able to tune the pipeline so that it scaled well as we scaled up the whole system.
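The episode doesn't show the detector code itself, but a textbook two-sided CUSUM over a window of recent values looks roughly like the sketch below; the slack and threshold parameters (commonly around 0.5 and 5 standard deviations) are illustrative defaults, not values from the project.

```java
import java.util.List;

/** Two-sided CUSUM change detector, a sketch of the kind of statistical check described. */
public class CusumDetector {
    private final double slack;     // k: allowed drift per step, in standard deviations
    private final double threshold; // h: decision threshold, in standard deviations

    public CusumDetector(double slack, double threshold) {
        this.slack = slack;
        this.threshold = threshold;
    }

    /** Returns true if the series (oldest to newest) drifts beyond the threshold. */
    public boolean isAnomaly(List<Double> series) {
        double mean = series.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double var = series.stream().mapToDouble(x -> (x - mean) * (x - mean)).sum()
                / Math.max(1, series.size() - 1);
        double sd = Math.max(Math.sqrt(var), 1e-9);

        double high = 0, low = 0; // cumulative sums of positive and negative deviations
        for (double x : series) {
            double z = (x - mean) / sd;
            high = Math.max(0, high + z - slack);
            low = Math.max(0, low - z - slack);
            if (high > threshold || low > threshold) {
                return true;
            }
        }
        return false;
    }
}
```

Something like `new CusumDetector(0.5, 5.0).isAnomaly(window)`, run over the 50 stored values plus the current event, would flag a sustained shift away from the window's mean without any training step, which is the property Paul highlights.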
[00:16:13] Unknown:
And one of the interesting aspects of the way that the anomaly detection is run is that in order to be able to determine whether any given event is outside of the acceptable bounds, you need to have some windowing capacity to set the sort of rolling threshold and keep a running average. And so I'm wondering how that impacted the requirements for the Kafka queues and for the Cassandra databases in terms of the rate at which you needed to be able to access the data and the overall historical span that you needed to be able to query across to determine what those baselines are, and whether you experimented at all with changing those window sizes and how that impacted your ability to effectively detect anomalies.
[00:17:02] Unknown:
Yes. Look, that's certainly an interesting part of the problem. The algorithm relied upon having a fixed window size, and in this particular case, we picked 50 because that was sort of a magic number. It worked okay. It wasn't huge amounts of data, but it was enough to make sure that the anomalies that I knew I was injecting actually were being picked up. Okay? But I guess the interesting aspect is those events could be over an arbitrary time period. And because of the high key cardinality, some of the keys may have events that are very, very infrequent and go back almost to the beginning of the universe, potentially. So you have to be able to store all those events in a way in which you can then retrieve them in time-reverse order, which is exactly what Cassandra is really good at. So that was a really good fit for Cassandra. We were using it really in the way that it was designed for.
In terms of the write/read ratio on Cassandra, every time there's an event written to Cassandra, there's also a read event. So it's a sort of one-to-one write/read ratio. The main difference, though, is that we are returning 50 rows for each read, whereas there's only one row for each write. So it's perhaps not a typical load for Cassandra. Cassandra is often used with very high write rates compared to the read rates. This one's a bit more demanding, so it does use more Cassandra resources than some use cases.
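As an illustration of that "one row written, 50 rows read back per key" pattern, a hypothetical table and DataStax Java driver wrapper might look like the sketch below; the keyspace, table, and column names are made up rather than taken from the actual schema. The `CLUSTERING ORDER BY (ts DESC)` clause is what makes the time-reverse read a single cheap partition query, which is the Cassandra strength Paul refers to.

```java
import java.util.ArrayList;
import java.util.List;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

/** Hypothetical Cassandra-backed event store: one write per event, one 50-row read per check. */
public class EventStore implements AutoCloseable {
    private final Cluster cluster;
    private final Session session;
    private final PreparedStatement insert;
    private final PreparedStatement selectLast50;

    public EventStore(String contactPoint) {
        cluster = Cluster.builder().addContactPoint(contactPoint).build();
        session = cluster.connect();
        // Clustering rows by time in descending order makes "latest 50 for a key"
        // a single partition read (illustrative schema, replication factor 1 for a sketch).
        session.execute("CREATE KEYSPACE IF NOT EXISTS anomaly WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
        session.execute("CREATE TABLE IF NOT EXISTS anomaly.events ("
                + " id text, ts timeuuid, value double,"
                + " PRIMARY KEY (id, ts)) WITH CLUSTERING ORDER BY (ts DESC)");
        insert = session.prepare(
                "INSERT INTO anomaly.events (id, ts, value) VALUES (?, now(), ?)");
        selectLast50 = session.prepare(
                "SELECT value FROM anomaly.events WHERE id = ? LIMIT 50");
    }

    public void write(String id, double value) {
        session.execute(insert.bind(id, value));
    }

    /** Returns up to the 50 most recent values for the key, newest first. */
    public List<Double> last50(String id) {
        List<Double> values = new ArrayList<>();
        for (Row row : session.execute(selectLast50.bind(id))) {
            values.add(row.getDouble("value"));
        }
        return values;
    }

    @Override
    public void close() {
        cluster.close();
    }
}
```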
[00:18:26] Unknown:
And in terms of the data source that you're using, I know that you said that it was largely synthetic, but I'm wondering how you approached the introduction of anomalies to make sure that they were in some sort of a realistic pattern, so that you didn't just end up optimizing for the way that the source data itself was constructed versus how it might play out in an actual real world scenario?
[00:18:55] Unknown:
Yes. We weren't too worried about the functional aspects from that perspective. I did want some anomalies to be injected because, when I was instrumenting the application and looking at the metrics, I did want to be able to see that it was actually picking up some anomalies. So it's a good question. I may have overthought this initially. I actually built a reasonably sophisticated sort of biased random walk event generator in the first version that more or less kept the values the same. And then occasionally it would just go off, like a sort of random drunk behavior, and then eventually come back to the normal value.
For the final version, in fact, we just generated events at more or less the same value with some known number of very large-valued events, and that worked fine as well.
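A hedged sketch of that final, simpler generator, mostly constant values per key with occasional large spikes injected at a known rate, could look like this in Java; the topic name, key cardinality, and spike probability here are invented for the example rather than taken from the project.

```java
import java.util.Properties;
import java.util.Random;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Produces mostly flat values per key, with occasional large "anomalous" spikes injected. */
public class EventGenerator {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.DoubleSerializer");

        Random random = new Random();
        int keyCardinality = 1_000_000;    // high key cardinality, as described in the episode
        double anomalyProbability = 0.001; // illustrative rate of injected anomalies

        try (KafkaProducer<String, Double> producer = new KafkaProducer<>(props)) {
            while (true) {
                String key = "account-" + random.nextInt(keyCardinality);
                double value = 10.0 + random.nextGaussian();   // "normal" behaviour around a flat value
                if (random.nextDouble() < anomalyProbability) {
                    value += 100.0;                            // injected large-valued event
                }
                producer.send(new ProducerRecord<>("events", key, value)); // assumed topic name
                Thread.sleep(1);                               // crude rate limiting for the sketch
            }
        }
    }
}
```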
[00:19:45] Unknown:
And then in terms of ramping up the capacity and scalability of the system for being able to process larger volumes of events, I know that you took it in stages, and so I'm wondering if you can just talk through what your approach was for determining what the acceptable performance capabilities were at each level of scale and what you saw in terms of the behavior of the system at those different levels.
[00:20:16] Unknown:
Yes. Perhaps in order to answer this question, I just need to step back slightly, because I realized I haven't really explained what some of the other technologies were that we ended up choosing. Because of the importance of the application itself being scalable, and being understandable as well, as I built it and tested it and deployed it on different size systems and tried to ramp it up, we decided on a few other technologies that would hopefully help us. And I guess this was the more experimental part of the project. So we soon realized that in order to be able to deploy the application repeatedly and easily and scalably, we needed something like Kubernetes.
So Kubernetes became quite critical as it turned out. We ran Kubernetes on AWS, which is where the rest of the system was running. So the Instaclustr managed Kafka and Cassandra are running in a particular AWS region, and we spun up a Kubernetes cluster in that region using the AWS EKS service, which allows you to then create as many worker nodes to run the application as you need. So that worked really well. Well, there was, I should say, some learning curve for that. Having overcome that initially, it was certainly worth the initial effort for the final degree of automation that it gave us. The other problem with a system of this scale, particularly using multiple heterogeneous technologies and different cluster types and different applications, is you need some way of seeing what's going on inside it. So the two aspects of that that we thought were critical were, first, being able to get some metrics out of it. The metrics we were interested in were things like the throughput at various parts, the system response times, and also, more critically, probably the business metric that we were interested in, which is the number of checks per second or hour or day that the whole system was able to actually process.
So in order to do that, we needed some metrics. So we picked Prometheus, which I think came out of SoundCloud. It's a really interesting, scalable, open source monitoring solution, and it fits in really well with Kubernetes. That's the other critical thing. Kubernetes has a lot of dynamic stuff going on. It's got things called pods, which are sort of the unit of concurrency. Pods can come and go all the time, and their IP addresses change. So in order to be able to monitor the applications running on those pods, you need a system that can actually know about all the changes, and Prometheus running on Kubernetes does that as well. The other technology we thought was worth exploring was something called OpenTracing. This is another open source standard, really, I guess you'd call it. It's one which allows you to do end-to-end tracing of an application, sort of from the front end and across multiple components all the way to the end of the system. And we picked a particular tracing tool called Jaeger, which I think comes from Uber, for that. And that also complemented the Prometheus monitoring because it gave us the end-to-end tracing ability as well. So all of those technologies, including Kubernetes, were quite critical for the scalability part of the experiment. It was quite critical for debugging, for example, to have good visibility into the system. When something went wrong, I needed to be able to work out what was going wrong.
For tuning the system, it was critical to be able to work out throughput and response times at various parts as well. And Kubernetes was critical for scaling the actual application. So, coming back to the question you asked about the scalability process, we initially tried something a bit grandiose. We tried spinning up a really, really big Cassandra cluster, and it did work, but it didn't scale particularly well. We knew from some tests we'd run on a smaller cluster that we weren't getting the throughput we'd expect from the bigger cluster. So essentially, we had to change the process at that point and make it a lot more incremental. We started out with the smaller Cassandra cluster sizes and increased the cluster sizes once we knew what was going on and understood the behavior of the system at that particular size.
Essentially, we realized we had to spend quite a bit of time and effort tuning the thread pools. So these are the Kafka consumer and the Cassandra client thread pools that run the actual anomaly detector pipeline. It's quite critical with these sorts of systems to make sure that you don't have too many Cassandra connections and also that you don't have too many Kafka partitions. So you really need to optimize the whole pipeline in order to try and minimize those two things. And finally, yep, we were able to spin it up to quite a large cluster size. So the one that we got the final result of 19 billion anomaly checks a day for had 400 cores in the Cassandra cluster, 120 cores in the Kubernetes worker nodes, and only 70 cores in Kafka. So it was interesting that Kafka was highly efficient, and we didn't need to resource that much at all.
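To make the "checks per second" business metric concrete, a minimal sketch of Prometheus instrumentation with the Java simpleclient might look like the following; the metric names and scrape port are assumptions for illustration, not the project's actual instrumentation.

```java
import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;
import io.prometheus.client.exporter.HTTPServer;

/** Minimal Prometheus instrumentation for the pipeline's business and latency metrics. */
public class PipelineMetrics {
    // rate(anomaly_checks_total[1m]) in Prometheus gives the "checks per second" business metric.
    static final Counter CHECKS = Counter.build()
            .name("anomaly_checks_total").help("Total anomaly checks run").register();
    static final Counter ANOMALIES = Counter.build()
            .name("anomalies_detected_total").help("Total anomalies detected").register();
    static final Histogram CHECK_LATENCY = Histogram.build()
            .name("anomaly_check_seconds").help("Latency of a single anomaly check").register();

    /** Wraps one pass through the detector pipeline with counters and a latency timer. */
    static void recordCheck(java.util.function.BooleanSupplier check) {
        Histogram.Timer timer = CHECK_LATENCY.startTimer();
        try {
            if (check.getAsBoolean()) {
                ANOMALIES.inc();
            }
        } finally {
            CHECKS.inc();
            timer.observeDuration();
        }
    }

    public static void main(String[] args) throws Exception {
        HTTPServer server = new HTTPServer(9091); // endpoint Prometheus scrapes (assumed port)
        recordCheck(() -> false);                 // stand-in for a real detector call
        Thread.sleep(60_000);                     // keep the endpoint up long enough for a scrape
        server.stop();
    }
}
```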
[00:25:17] Unknown:
And as you moved from the smaller sized clusters through to the larger ones, I'm wondering what you found as far as bottlenecks or tuning requirements to ensure that you were getting the appropriate levels of performance that you needed, and how that differed from the small to the large scale.
[00:25:35] Unknown:
Yes. I think it's safe to say that it's just a question of tuning. The Kafka and Cassandra clusters themselves are intrinsically scalable. As you add more nodes to either of those cluster types, you will get more throughput. So it's really on the application side where you can make mistakes or where you can have something that's not optimal. And that's what we found in practice as well: you really have to have the right sort of monitoring, both on the application and the clusters. You need to be able to understand the effect of different tuning parameters, in our case, the different thread pool sizes. And one size doesn't fit all. Even the settings that we had for the clusters that were just slightly smaller than the final ones were, in fact, not optimal for the final cluster size at all.
[00:26:26] Unknown:
Going into the project, I'm curious what your assumptions were as far as how you thought the system would perform and what you thought the system requirements might be as far as being able to handle a particular volume of data and the overall lessons that you got out of it as you progressed through and then brought it to its conclusion?
[00:26:49] Unknown:
Look, so I think the assumptions around Kafka and Cassandra's scalability were all valid. And the idea of using a managed service to automate spinning up those Kafka and Cassandra clusters and managing them and monitoring them, I think, was quite critical as well. There's no way that we could have done this sort of a project using any other approach. That meant that we could focus on developing the actual application, the one that in reality would add business value for a customer. And the application deployment using Kubernetes achieved something similar for the application to what the managed service does for Kafka and Cassandra. It meant that once we had that infrastructure set up on AWS, I could simply click a button and the application would be deployed and scale up to whatever requirements I had.
And having Prometheus and OpenTracing deployed in that Kubernetes cluster with the application was also, I think, something that we had assumed would be important, and it did turn out to be critical. We just couldn't have scaled the system up and got the sort of business level benchmarks out of it, at the scale that we were dealing with, in any other way. In particular, Kafka observability, I think, is quite a hard problem because it's across process boundaries. The OpenTracing approach, with the third-party add-on that we had for Kafka, worked really, really well. It enables tracing right from Kafka producers through to consumers, which normally you wouldn't have visibility into, that sort of dependency and the topology of events as they flow through a system. And oddly enough, that actually helped us debug it as well.
It's perhaps a strange example of a case where the Prometheus monitoring itself actually had a bug in the instrumentation that I'd put in the code. We weren't detecting any anomalies in the final system. However, the OpenTracing was saying that the events were going through all the way to the end, so they should have been being counted at that point. And I realized I'd made a mistake in the Prometheus instrumentation.
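As a rough sketch of what the OpenTracing side can look like (using the plain Jaeger client rather than the specific third-party Kafka add-on mentioned in the episode), wrapping a pipeline stage in a span is roughly this; the service name, span name, and tag are illustrative.

```java
import io.jaegertracing.Configuration;
import io.opentracing.Span;
import io.opentracing.Tracer;

/** Sketch of wrapping a pipeline stage in an OpenTracing span reported to Jaeger. */
public class TracedCheck {
    // Reads JAEGER_* environment variables for sampler and reporter settings.
    static final Tracer TRACER = Configuration.fromEnv("anomaly-detector").getTracer();

    static void runCheck(String key) {
        Span span = TRACER.buildSpan("anomaly-check").start();
        span.setTag("event.key", key);
        try {
            // ... write to Cassandra, read the 50-event window, run the CUSUM detector ...
        } finally {
            span.finish(); // the finished span shows up in Jaeger as part of the end-to-end trace
        }
    }

    public static void main(String[] args) {
        runCheck("account-42");
    }
}
```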
[00:29:13] Unknown:
And what were some of the most interesting observations that you made as far as the capacity or capabilities of the tools that you were working with? Or, if you were to revisit this project, what might you do differently to try and gain some new understanding or maybe experiment with some new combinations of technologies?
[00:29:35] Unknown:
Look, one lesson, I think, is that application tuning is still quite tricky for large scale distributed systems. I think there is still room for improvement there, both in terms of the tools and the methodologies that would help people with doing that. Initially, when we started this project, in fact, we had wondered about using some sort of a Kubernetes microservices framework, as well as sort of the vanilla Kubernetes. We didn't do that. I still think that's an interesting extension at some point. I think having to sort of balance all the different thread pools in the application and everything is a bit old school. I'm sure there is perhaps a better way of doing that in the future. I guess one aspect of this that we were keen to demonstrate was that the Instaclustr managed services worked pretty well, and they did. We've got a lot of customers that depend upon our systems, and this is just another example of an interesting use case that may hopefully be of interest to potential customers using open source systems at scale and with reliability.
Internally, we're actually quite interested in Kubernetes ourselves. We've developed a Kubernetes operator for Cassandra that enables people to actually just deploy Cassandra themselves on their own systems easily using Kubernetes. We're also thinking about making our cluster metrics available as Prometheus metrics. And most recently, I think we've also added another service for our provisioning API, which is Kubernetes based as well, using the Open Service Broker. So we're quite keen to make it easy for customers to integrate their applications with managed services using the sort of technology stacks that people are now using for these sorts of systems.
[00:31:17] Unknown:
Beyond the anomaly detection use case, I'm wondering what are some of the other experiments that you have run recently or other experiments that you have planned for being able to exercise the capabilities of the technologies that you support?
[00:31:32] Unknown:
The most recent other one that I've done was an Internet of Things demonstration application. That was a pure Kafka example. So that was really designed to show some of the quite sophisticated things you can do with Kafka. Kafka is not just a message broker. It really enables you to build loosely coupled microservice applications. So that was quite a fun system to build. And as I mentioned, we are extending this one. We have extended it, in fact, to do geospatial anomaly detection, which was also quite an interesting problem, because the definition of proximity actually is dynamic, and the scale can change dramatically from sort of the square meter that you're in at the moment to the whole planet, or even sort of three-dimensionally out to the geostationary satellite orbits and things like that. So that's quite an interesting challenge, and it put a lot more demand on the Cassandra side to be able to store and retrieve data that has potentially three-dimensional coordinates associated with it. So that's been quite fun. That project's just about finished at the moment.
[00:32:40] Unknown:
Yeah. It's definitely interesting seeing the differences in capacity and performance for the same system doing ostensibly the same thing, but just by changing the format or context of the data, which might have either additional fields or slightly different processing requirements, to be able to retrieve the same end result. And are there any other aspects of the anomaly detection project that you're working on, or any of the work that you're doing at Instaclustr with supporting these cluster technologies, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:33:15] Unknown:
Look, yeah, I think there's a couple of interesting outcomes. One is that elasticity is quite critical still, I think, for systems like this. The type of use case that we had in mind meant that we didn't necessarily know what volume of data would be coming through over what periods of time. So, to some extent, choosing the combination of Kafka and Cassandra meant that we can cope with part of that problem by using Kafka as a buffer. So essentially it can absorb load spikes for potentially arbitrary periods of time, the rest of the system can just go ahead and process them, and you don't lose any data. It does mean, of course, your SLA will get pushed out, and it will actually take a lot longer to process some of those events. So it's only a partial solution. The perfect solution would be one where the application itself, and indeed the Cassandra cluster, can resize dynamically and in time to cope with any load increases. So I think that's still an area that some work could be done in. We have one mechanism to increase the size of Cassandra clusters, which essentially is by increasing the instance sizes, so we can increase the number of cores available on the nodes that you're running as a cluster. And that can happen very quickly. So I did do that as an experiment as well for the anomaly detection use case. The catch is it can only increase the size of the cluster up to a factor of 8 times, so that is an upper bound. So that's one aspect. I think another interesting aspect is that having the metrics and the tracing of the application was quite critical, but we had to have two systems in order to do that. Ideally, you'd have one integrated system that would do both metrics and tracing and give you a comprehensive view into the system.
The other aspect is that we had to use instrumentation to add the Prometheus and the OpenTracing data into the application. That's quite tedious and error prone. There are systems, I know, in some of the commercial APM products which avoid having to do the instrumentation manually. And it would be really nice if the open source community would pick up on some of those techniques, I think, as well.
[00:35:25] Unknown:
Yeah. It's definitely interesting how Kafka was acting as the buffer so that, as you're saying, it can handle any spikes in terms of input data, but then you still have the issue of Cassandra draining it at a fairly constant rate, and the application relying on reading the data being written into Cassandra. You might approach that by having the application and Cassandra both consuming from the Kafka topics, so the application can accept some measure of uncertainty or error in favor of being able to get more real time data, but then have maybe a secondary process that analyzes the longer window within the Cassandra cluster for sort of resetting what the anomaly boundaries are, or changing what the window size is, to be able to handle those different variations in load.
[00:36:32] Unknown:
Yes, they're really good ideas. We had thought about essentially having a dynamic capability, which we did actually have in the code but didn't really use very much. Essentially, it decides whether an event is significant enough to run the check on. So that would be a good trade-off at that point, if it was starting to run out of resources. It could perhaps pull back on which events it's actually bothering to check, and then you check the ones that are perhaps high value or that it has some a priori reason to think might be suspicious. Yep.
[00:37:01] Unknown:
And was there anything else that you wanted to cover before we move on?
[00:37:06] Unknown:
I think that's just about everything I've got on my whiteboard here. So yeah.
[00:37:10] Unknown:
Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I'd like to thank you for taking the time today to join me and discuss your experiences of stressing the capabilities of Kafka and Cassandra for handling these large volumes of input data and being able to use that for building a test case for anomaly detection. It's definitely an interesting problem space, and it's good to see the scaling capabilities that you were able to achieve with that. So I appreciate you sharing that information, and I hope you enjoy the rest of your day. Thanks, Tobias. Thanks for all the interesting questions. And I hope the listeners
[00:37:52] Unknown:
benefit from our discussion.
Introduction and Sponsor Messages
Interview with Paul Brebner Begins
Overview of Anomaly Detection System
Use Cases for Anomaly Detection
System Architecture and Design Choices
Anomaly Detection Algorithm
Data Source and Synthetic Anomalies
Scalability and Performance Tuning
Assumptions and Lessons Learned
Other Experiments and Future Plans
Final Thoughts and Closing Remarks