Summary
Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time-oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your data and make it usable in S3, so that you can have the best of both worlds. In this episode the CTO, Thomas Hazel, and VP of Product, Pete Cheslock, describe how they have built a platform to let you keep all of your history, save money, and reduce your operational overhead. They also explain some of the types of data that you can use with Chaos Search, how to load it into S3, and when you might want to choose it over Amazon Athena for your serverless data analysis.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Pete Cheslock and Thomas Hazel about Chaos Search and their effort to bring historical depth to your Elasticsearch data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what you have built at Chaos Search and the problems that you are trying to solve with it?
- What types of data are you focused on supporting?
- What are the challenges inherent to scaling an Elasticsearch infrastructure to large volumes of log or metric data?
- Is there any need for an Elasticsearch cluster in addition to Chaos Search?
- For someone who is using Chaos Search, what mechanisms/formats would they use for loading their data into S3?
- What are the benefits of implementing the Elasticsearch API on top of your data in S3 as opposed to using systems such as Presto or Drill to interact with the same information via SQL?
- Given that the S3 API has become a de facto standard for many other object storage platforms, what would be involved in running Chaos Search on data stored outside of AWS?
- What mechanisms do you use to allow for such drastic space savings of indexed data in S3 versus in an Elasticsearch cluster?
- What is the system architecture that you have built to allow for querying terabytes of data in S3?
- What are the biggest contributors to query latency and what have you done to mitigate them?
- What are the options for access control when running queries against the data stored in S3?
- What are some of the most interesting or unexpected uses of Chaos Search and access to large amounts of historical log information that you have seen?
- What are your plans for the future of Chaos Search?
Contact Info
- Pete Cheslock
- @petecheslock on Twitter
- Website
- Thomas Hazel
- @thomashazel on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Chaos Search
- AWS S3
- Cassandra
- Elasticsearch
- PostgreSQL
- Distributed Systems
- Information Theory
- Lucene
- Inverted Index
- Kibana
- Logstash
- NVMe
- AWS KMS
- Kinesis
- FluentD
- Parquet
- Athena
- Presto
- Drill
- Backblaze
- OpenStack Swift
- Minio
- EMR
- DataDog
- NewRelic
- Elastic Beats
- Metricbeat
- Graphite
- Snappy
- Scala
- Akka
- Elastalert
- Tensorflow
- X-Pack
- Data Lake
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And you work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning life cycle. Skafos maximizes interoperability with your existing tools and platforms and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously.
Request a demo today at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. And if you're attending the Strata Data conference in New York in September, then come say hi to Metis Machine at booth P16. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at dataengineeringpodcast.com/chat. Your host is Tobias Macey. And today, I'm interviewing Pete Cheslock and Thomas Hazel about ChaosSearch and their effort to bring historical depth to your Elasticsearch data. So, Pete, could you start by introducing yourself? Yeah. I am Pete Cheslock, and I'm currently the VP of product for ChaosSearch. And Thomas, how about yourself? Yeah. Thomas Hazel,
[00:01:49] Unknown:
founder, CTO,
[00:01:51] Unknown:
inventor of the unique technology we use, as well as the idea. And going back to you, Pete, can you share how you first got involved in the area of data management? Yeah. So, you know, just getting involved in kind of managing large scale systems
[00:02:05] Unknown:
on Amazon, you know, kind of back in 2009. I actually worked for an email archiving company. And, you know, we were storing petabytes of data on S3 when probably you shouldn't be storing that amount of data in the early days of S3. But, you know, what's interesting is kind of every company I've been at since then has seen the data just explode. Right? And having to build and run databases to manage that, like Cassandra, Elasticsearch, Postgres, and then the overhead, the mental overhead, that you have to deal with managing that data. You know, so this role is interesting for me at ChaosSearch, being kind of the head of product in that.
My history has been actually just in the running of the systems to deliver, you know, data for customers. Being on the other side of it now, I can help drive, you know, helping customers with their data problems.
[00:02:57] Unknown:
And, Thomas, how about yourself? Do you remember how you first got involved in the area of data management? Well, so I come from a background of,
[00:03:04] Unknown:
distributed systems from my telecom days, and years building databases and really going after some of the information theory problems of, like, text search or column stores or big data analytics. And so what I wanted to do was solve some of the cost complexity, and I had an idea with respect to a new data format that I thought could solve these scale problems that Pete just mentioned.
[00:03:29] Unknown:
And so given that, can you discuss a bit about what it is that you've built at ChaosSearch and the original problems that you were trying to solve when you started building that technology and the company around it? Yeah. Sure. So we call this technology DataEdge.
[00:03:44] Unknown:
Like I mentioned, it's a new file format that supports both text search and relational queries. And if you're familiar with Elasticsearch as a platform, or a variety of others, it uses this indexing technology called Lucene, and it's great for text search. However, it can explode your data size, meaning that the raw source that you store can go up to 5x the size because of the underlying index technology. I wanted to go off and solve that, but I also wanted to solve not just the reducing of the size of the data, but the ability to do it in a distributed way, and do it in object storage. In this case, Amazon S3, because S3 was cost effective. You can store terabytes of data for a very little amount of money. It's elastic. It's simple. And create a data fabric around S3 with this indexing technology and ultimately allow you to provide analytics on your storage directly.
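The index expansion Thomas describes can be sketched with a toy inverted index. This is purely illustrative (it is not ChaosSearch's Data Edge format or Lucene's on-disk layout): mapping every term to the list of documents that contain it means much of the source text is effectively stored a second time, which is why indexed log data can grow well beyond the raw source.

```python
import json

# Toy corpus of structured log events, as might be shipped to Elasticsearch.
logs = [
    {"level": "error", "msg": "connection timeout to db"},
    {"level": "info", "msg": "connection established to db"},
    {"level": "error", "msg": "db timeout retrying connection"},
]

# Build a minimal inverted index: term -> list of document ids.
inverted = {}
for doc_id, record in enumerate(logs):
    for term in record["msg"].split():
        inverted.setdefault(term, []).append(doc_id)

print(inverted["connection"])  # → [0, 1, 2]

# Compare the serialized size of the raw events vs. the index alone;
# a real index also stores the source documents, compounding the growth.
raw_bytes = len(json.dumps(logs))
index_bytes = len(json.dumps(inverted))
print(raw_bytes, index_bytes)
```

The exact ratio depends on the data, but the point stands: every distinct term carries posting-list overhead on top of the raw text.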
[00:04:38] Unknown:
And in the documentation and the website for your product, a lot of the focus is on the idea of log data. But when going through the other documentation, it looks like you also support different, file types or formats. So I'm curious if you can discuss a bit about what types of data you're focused on supporting
[00:05:00] Unknown:
and what your reasoning was for focusing on those areas to begin with? Yeah. So from, you know, from my perspective, the initial kind of use case that we want to support with this technology is kind of the log and event management. And the main reason is that the problem is just so vast in that space. If you rewind back, I think it's 9 years at this point, when Elasticsearch first came out, you know, I was very lucky working for a company that got to start using that at a very early stage. And we were storing our email platform, our email discovery platform, in there. And in that scenario, Lucene made a ton of sense, because the raw data in those emails was actually reduced on the index side. As Elasticsearch evolved into a business and brought on other open source tools like Kibana and Logstash, you saw this huge explosion in the usage of Elasticsearch for log data. And it's almost comical that it was used for, like, highly structured log data, like JSON logging. And, like what Thomas was saying before, you're putting in these JSON logs and the inverted index is causing them to explode dramatically in size. So now you're managing significantly more data on these clusters.
Now, I love Elasticsearch and I've been using it for a long time. The APIs are great and scaling it is relatively easy from a technology perspective. But the challenge really is the cost of it. Right? After a while, you're running hundreds of Elasticsearch servers, potentially, you know, in the Amazon world or any cloud world. Right? You're paying for every bit of storage on there. And so you enter this scenario of saying to yourself, well, maybe I'm going to drop this data instead of keeping it around. And I think when the founders here at ChaosSearch were looking at this market, too many people were saying to themselves, I can't keep the long tail of data. I need to throw it away. I can only save 7 days of data. And what's amazing, I think, in my perspective, is that, you know, what this technology really has done is enable the long tail so that you don't have to make that choice anymore.
[00:07:06] Unknown:
Do I keep my data or do I throw it away? Because it can be saved and queryable, on Elasticsearch or on S3 via this new file format, for so much longer. Yeah. And I'll add to that real quick: the basis of using object storage to store your long tail, and then with our service enabling you to use the same tooling, the same Elasticsearch API, the Kibana interface, even Logstash with the ability to dump into S3, really, that was the value that customers love. They just couldn't do it from a long-tail perspective because it was cost prohibitive. And we're offering a service at a price point where, you know, weeks, months, years of data retention and analytics is where we're going in on the market. And also from the Elasticsearch
[00:07:49] Unknown:
perspective, as you're increasing the amount of storage that you're retaining to be able to have longer views of that log or metric data, it increases the operational overhead of being able to manage these larger clusters, either in terms of number of instances or just the overall volume of data and some of the complexities that go into the different shards for the indices and keeping them active and in memory. So, on that point, I'm wondering if you can talk a bit more about some of the challenges that are inherent to having a larger scale Elasticsearch infrastructure for being able to manage those larger volumes
[00:08:27] Unknown:
or longer histories of data? Yeah. Absolutely. I mean, the great thing about Elasticsearch, like I said before, is that operationally, at least for a database, it's very friendly. It's got fantastic APIs. It's RESTful. It's got tons of metrics, tons of real-time ability to tune it, to update it. And the problem that the company Elastic is trying to solve is incredibly hard, and they're doing a very good job of it. The biggest problem, though, that you really run into is, you know, when you build out these Elasticsearch clusters to store the data. Again, on one side, this explosion, from Lucene, in the size of your data. And the other trick is, because storage and compute are so tightly coupled, that's usually your scale point. So one example is, at a previous company, we used to run, you know, like, I think it was the r3 or the r-type instances, where it's a decent sized memory, but then we have to attach EBS volumes to get the storage for that. So you run a certain number of servers to support the memory and CPU requirements of ingest, and then you have to run a certain amount of disk to support kind of the storage of that data. And at a previous company, we tried a ton of things around tiering our Elasticsearch, where maybe we would take this large ingest of data onto really hot NVMe or SSDs.
And then every day we would move indices over to, like, hard drives or a lower tiered storage. And it's kind of the fundamental problem of Elasticsearch in general, which is: if I'm using Elasticsearch to store 30 or 90 days, or if I'm in, you know, some sort of compliance mode for even longer, I have to build that infrastructure to support both the real-time ingest and millisecond query. But do I really need a millisecond query for something that happened, you know, 6 months ago or 8 months ago? Or can I even afford to keep that event from 8 months ago, because I have to spend so much money on that storage? So Yeah. And we talked about how we want to solve the Lucene problem. We also built out an architecture, as you mentioned, a data fabric
[00:10:26] Unknown:
up in Amazon around S3, and so really decoupled storage and compute. And the storage being only S3, no SSDs or HDDs to back this solution. And that allows us to scale and use any aspect of compute, whether it's indexing, whether it's the query itself, to elastically scale out without having to physically tie that storage, the capacity of that compute, to one instance of a node. And so that allows us to dramatically reduce cost as well as scale elastically.
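The age-based tiering Pete describes from his previous company can be sketched as a simple policy function. The tier names and the seven-day hot window here are hypothetical, not anything either guest specified:

```python
from datetime import date

def storage_tier(index_date: date, today: date, hot_days: int = 7) -> str:
    """Decide where a daily index should live based on its age.

    Recent indices stay on fast local storage for millisecond queries;
    older ones move to cheap object storage for long-tail retention.
    """
    age_days = (today - index_date).days
    return "hot-nvme" if age_days <= hot_days else "cold-s3"

today = date(2018, 9, 1)
print(storage_tier(date(2018, 8, 30), today))  # → hot-nvme
print(storage_tier(date(2018, 6, 1), today))   # → cold-s3
```

In the coupled-storage model this policy has to be run by operators against live cluster disks; the decoupled architecture described above makes the cold tier the default home of the data instead.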
[00:10:59] Unknown:
In terms of the infrastructure necessary for somebody who's using ChaosSearch for being able to access that data in S3, is there actually any requirement for having an Elasticsearch cluster at all if they don't need those sub-second query
[00:11:14] Unknown:
responses? Yeah. So it's a great question. You know, as we talk to customers, you know, at the stage of the company we're at now, finding and refining the product for kind of real user use cases, the concept has always been this hypothesis of, well, we'll enable the long tail of data. And so we've talked to some customers that have said, well, I want to store a week's worth of data hot in my Elasticsearch cluster, because I have lots of queries that I need really short time, you know, within that millisecond range. And then, more recently, we've talked with companies that are dealing with such data challenges and a wealth of data coming in that, when they make queries against the cluster, it actually causes some of the Elasticsearch hosts to reach out-of-memory conditions or affect the ingest of the data. And so in the conversations we've had with those customers, you know, they've said things like, listen.
A query that takes... well, one customer was just like, hey, if the query finishes, we'd be happy. And we're like, well, we're aiming for queries to finish in around maybe 10 seconds. And they were like, oh, well, as long as it finishes, we'll be really happy with that one. Right. And then on the other side, you know, it really comes down to data ingest. Right? So when you're shoving data into Elasticsearch, it can be available for search within, you know, whatever your refresh rate is on Elasticsearch, and that's configurable. Maybe it's 1 second or 30 seconds. But it takes a lot of, you know, processing power to reach that level. Since we're trying to enable kind of that long tail, it's really a question of how quickly could a thing be searchable when it lands in S3. And from our perspective, you know, we're trying to build for, like, maybe 5 or 10 minutes or something like that. Again, trying to aim for the long tail. But what we've found, and kind of to answer your question, is that for some companies, for some people, a 5 minute delay in that log data is actually totally fine for them. Right? It meets their requirements, because they're saying to themselves, it's just so much data, I don't actually need a millisecond query to get it back. And the cool thing about our solution: it's on your S3, meaning that you don't move the data out of S3 into another logging system or into your ELK configuration.
[00:13:23] Unknown:
It's your S3. You provide read-only IAM role access, and you're up and running. We store our indices in your account, so you own all the data. You just provide the right location so we can store that, and all queries, all of that scale, is on these indices in your data account, where we provide the compute and the scale.
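The read-only access model Thomas describes could look something like the IAM policy below, here expressed as a Python dict. This is a generic sketch of a read-only S3 policy, not the exact policy ChaosSearch requires; the bucket name is made up.

```python
import json

# Illustrative read-only grant over a hypothetical log bucket:
# list the bucket and read objects, nothing more. Index output would
# go to a separate prefix or bucket the customer designates.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-log-bucket",
                "arn:aws:s3:::example-log-bucket/*",
            ],
        }
    ],
}

print(json.dumps(read_only_policy, indent=2))
```

Because the grant is read-only and scoped to one bucket, the source data never has to leave the customer's account, which is the "your data, your rules" point made below.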
[00:13:45] Unknown:
Yeah. You know, that's actually a really good point that we didn't touch on before, which is, you know, it's your data. Right? Your data, your rules. We'll read the data from your bucket. We'll give you the compressed indices back in your bucket. And then the beauty is you can keep it for as long as you really want to keep it. You don't have to make that choice of, do I keep it in my Elasticsearch cluster? Do I throw it away? You can just let it live for, you know, whatever time you need. And some of our customers are actually moving their raw data to Glacier for cheaper storage, using our indices as a representation.
[00:14:18] Unknown:
So there's a lot of combinations that come out of a solution like ours. But the idea is that, as Pete mentioned, it's your data, it's your rules. One question I have there
[00:14:28] Unknown:
in terms of running the queries against the data in S3 that is owned by the customer, is that in some cases there might be some concerns in terms of leakage of private data, or anything like that, to any third party. So I'm curious
[00:14:46] Unknown:
how you mitigate that in terms of running the compute against their data. Yeah. Absolutely. I mean, it's always going to be a question. And especially in this day and age, really, any SaaS company, you know, if you're not building kind of a security-first model, you're going to find that your future is going to be very painful. Right? And in a lot of cases, you know, companies go down the laundry list of making sure that they, you know, are in compliance with things like PCI, HIPAA, SOC 2, and now GDPR being a big deal. What was really amazing about the way that this architecture was built is with that kind of security-first model, which is that ability to not only have things like, you know, per-customer encryption keys, or, even talking with some companies in the health care space, where they can bring their own encryption key and ensure that the data, you know, is really fully under their control, much like what you can do with Amazon's KMS.
In addition, for customers, in their compute, they're essentially running their own environment of processes linked together with an encrypted network, again with its own keys. So, from a data handling perspective, you know, coming from a security company that had to live through SOC 2 compliance and stuff, I actually am looking forward, as we grow, to going through those various audits. I think it'll be really refreshing to experience that from this architecture design. And for loading the data into S3, I know that Elasticsearch,
[00:16:14] Unknown:
for instance, has the capacity for backing up indices to s 3 directly from the cluster. And then there are also systems such as Logstash or fluentd that let you push your data directly to s 3 from those systems. So I'm wondering for somebody who's getting set up with ChaosSearch, what's the preferred mechanism for being able to load that data into s 3 and making it accessible to ChaosSearch, and what are some of the considerations
[00:16:40] Unknown:
in terms of format or structure for that data to allow it to be indexable by the product that you're building? Yeah. So it's another really good question. You know, when I was doing kind of my due diligence into the company, you know, to come on board, and really trying to think about the problem they were solving, one of the things I said to Thomas was, hey, you know, Elasticsearch has this ability to push an index up to S3, and that's Lucene, and it's got a defined schema that we should be able to read. Right? Like, could we just, you know, ingest, for lack of a better term, these Lucene indexes as they're aged out of a hot cluster? And his response was, yeah, it's just another data type. Right? And so that's really, I think, the mindset that we're trying to take, which is, you know, there's a lot of companies who already have data in S3, because Amazon is very nice and lets you log, really, for a lot of their services, right into S3.
They have things like Kinesis Firehose where you can just stream data into S3. So obviously they've really optimized for that. And like you said, Fluentd and Logstash and these other tools that can stream data to S3. The nice thing from our perspective is that we don't really care what the data is. Again, it's just another data type. But as kind of an early stage company, the data types we're most interested in working with right now are things like CSVs, JSON, kind of a standard logging format. If you think, like, Apache logs, NGINX logs, things like that. Lucene indexes, stuff like that.
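A common way to land the JSON log data described above in S3 is newline-delimited JSON under a date-partitioned key, which is roughly what Firehose, Fluentd, and Logstash S3 outputs produce. The key layout and field names below are assumptions for illustration, not a convention ChaosSearch requires:

```python
import json
from datetime import datetime, timezone

def to_ndjson(records) -> str:
    """Serialize log records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records) + "\n"

def s3_key(prefix: str, ts: datetime) -> str:
    """Build a date-partitioned object key, e.g. logs/nginx/2018/09/01/..."""
    return f"{prefix}/{ts:%Y/%m/%d}/events-{ts:%H%M%S}.json"

records = [
    {"status": 200, "path": "/health"},
    {"status": 500, "path": "/api"},
]
body = to_ndjson(records)
key = s3_key("logs/nginx", datetime(2018, 9, 1, 12, 0, 0, tzinfo=timezone.utc))
print(key)  # → logs/nginx/2018/09/01/events-120000.json

# An actual upload would then hand body and key to an S3 client,
# e.g. boto3's put_object(Bucket=..., Key=key, Body=body).
```

Date-partitioned keys also make it cheap to scope later indexing or queries to a time range by prefix alone.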
As time goes on, you know, supporting more and more data types is something that is pretty exciting, with some of the
[00:18:14] Unknown:
schema type things that we can do on read. Yeah. Some of the benefits that we provide, obviously, are reading in the Lucene and the schema, and the data in Parquet, which is a common format that is used. We add in a whole set of additional functionality that, say, Athena doesn't even have, particularly full text search and, obviously, the visualization of Kibana. So there's a lot of data sources that we intend to add to our product line, and we're excited to include them. And your mention of Athena brings me to my next question
[00:18:45] Unknown:
of what's the benefit of using something like ChaosSearch with the Elasticsearch interface and the Kibana UI versus some of the other tooling that's been built on top of being able to access data directly from s 3 and other stores such as
[00:19:00] Unknown:
Presto and the Athena service that's been built on top of it, or the Apache Drill project, or other things along those lines? Yeah. I would say that's a good question. We get that often. There's a couple aspects to our product offering that are unique. One, we provide data management around S3. So, data cataloging that's published in the Elasticsearch API, Elasticsearch indices, as well as this concept called virtual folders, where you can create object groupings that will filter out different aspects of your bucket. So that's one thing that Athena does not have. The other thing is all the text search and visualization that people have gotten to know and love, and then the price. You know, to scan 1 terabyte of data, it's $1 to $5.
We have a significantly lower price, both from an elasticity perspective as well as a payment perspective. I'll leave it to, you know, Pete to talk about pricing if you want to get into that. But what we want to do is the ease of raw data to these Elasticsearch indices quickly, as well as a price point that really hasn't been seen in the market. Yeah. That's always been the challenge
[00:20:04] Unknown:
of logging companies, you know: the longer you store it... For a lot of logging companies, they're using Elasticsearch under the hood, which is, you know, kind of one of the first problems. You know, it's hard to scale that cost effectively just on your own. But then, you add on top of it that they have to essentially, you know, charge you upfront. Maybe you would only store that data for 14 days, but they have to charge you upfront for it, because you could potentially store it for longer. And that's why you see companies like Splunk so wildly successful: because, a, it's a great product, don't get me wrong, but it's also incredibly expensive because of the challenge they're trying to solve for. And what we've heard from people that have used Athena: it gets pricey very quickly.
[00:20:46] Unknown:
Even if you have your data formats in Parquet, which does make it a little bit smaller and more performant. However, every query, every result set, increases your storage size. We do late materialization, so there's no increased size for our queries. And we've heard, you know, stories of Athena bills getting pretty high. It's a wonderful tool, but it's missing the text search as well as the Kibana and Elasticsearch capability.
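The per-scan pricing mentioned above compounds with query volume, which is a quick back-of-the-envelope calculation. The $5-per-terabyte figure is Athena's published scan price at the time of this recording; the query sizes and counts below are made-up illustrations:

```python
ATHENA_PRICE_PER_TB = 5.00  # USD per terabyte of data scanned

def scan_cost(tb_per_query: float, queries_per_day: int, days: int) -> float:
    """Estimate total Athena scan charges over a period, in USD."""
    return round(tb_per_query * ATHENA_PRICE_PER_TB * queries_per_day * days, 2)

# e.g. dashboards scanning 0.5 TB per query, 40 queries a day, for a month:
print(scan_cost(0.5, 40, 30))  # → 3000.0
```

This is why columnar formats like Parquet help (they shrink the bytes scanned per query) but do not cap the bill: the cost still scales with every query issued.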
[00:21:11] Unknown:
S3 in particular has become somewhat of a de facto standard in terms of the API that they've built around it, which other platforms have adopted to allow for easier interoperability. And I'm wondering if you either currently support, or have any plans to support, using alternate object storage systems via their S3 compatibility layers, such as what's been done with the OpenStack Swift project, or things like, I think, Backblaze has an S3 API.
[00:21:43] Unknown:
No, you're right on. We made 2 bets. We made a bet on object storage, with S3 as a de facto API standard, as well as on Elasticsearch. We actually export through our data fabric the S3 API. You can either use Amazon's interface or ours. It's a pass-through for a lot of the functionality that they provide already, but we've extended it. Like I mentioned, the data cataloging, data discovery,
[00:22:09] Unknown:
the virtual folders, we've extended the experience. But to your point, there's great products out there that have an S3-compliant API. Minio, an open source product, runs on Azure and Google. So we're actually excited about using object storage, particularly the S3 interface, because we think it has a lot of legs. Yeah. Kind of think of it like Kubernetes has become the de facto API for deploying applications across multiple clouds. Now, kind of look at what we're building as, you know, the underlying object store we're talking to really shouldn't matter. And our goal is to try to put compute as close to those object stores as possible. Because what we're going to see over the next 3, 5 years, with Kubernetes and all of these hosted Kubernetes solutions on these cloud providers, is more companies actually running multiple vendors. Right? Multiple cloud vendors, because it's now so much easier. Right? They're going to be putting data in these object stores because of the cost perspective. So, you know, from our mindset, trying to be as close to that as possible is, I think, always key. And what we've done is we've really made object storage our fundamental
[00:23:13] Unknown:
architectural choice, over a distributed Akka/Scala framework, with our Data Edge indices. We really believe that we're planning for the future, where, you know, object storage is really core. As you, I'm sure, know, you know, SSDs and HDDs have been core to databases for a long time, but they're not as flexible, not as elastic, and not as cheap as object storage has become and will continue to become. Yeah. The whole logging world, I think the joke there is, like, you know, the death of logging is yet to come, you know. At the end of the day, you know, as much as people are moving to things like, you know, tracing
[00:23:46] Unknown:
or advanced time series metrics or, you know, any sort of those types of ways of kind of monitoring their applications, at the end of the day, logs are still the best way to kind of slice and dice on structured data coming from applications, and sometimes unstructured data. And so I've yet to see a company that has basically not used logs as part of their kind of troubleshooting, and it just seems like it's going to keep continuing to grow, as more and more microservices... Microservices, yeah, right, are a great example. You know, it's a great way. And there's a lot of people, I think, trying to solve, like, that problem of how do you monitor these microservices.
But, you know, at the end of the day, there's still going to be a lot of data kind of being generated. And that's the thing I don't think anyone really is solving right now: how do you enable that long tail of data? And, you know, coming from a security company, I got excited, because it's like, if there's a new vulnerability that comes out, I can go back to every web request log I have for the last year and see, like, hey, what endpoint, what vulnerable endpoint did someone hit, and what IPs did they hit it from? I mean, there's a real power to that that, you know, you just really can't get nowadays. Even if you are lucky enough to have that data, it's like, okay, load up an EMR job and process that, or
[00:25:04] Unknown:
you do all the data processing in real time, versus: I just want to ask the question and get an answer. And that's really why I wanted to create ChaosSearch and this technology underneath. I wanted to make data information as small as possible and as cost efficient as possible, as well as the value of the querying of that data. It enables all these things that we hear customers talk about time and time again: the cost-prohibitive nature of storing that data over the long tail, as well as the margins of something like a Lucene technology, which have caused logging companies to really discourage keeping data over weeks, months, let alone years.
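The retroactive-search use case Pete describes (a new vulnerability drops, and you sweep a year of web request logs for hits on the affected endpoint) can be sketched over plain access-log lines. The log format, endpoint, and IPs below are hypothetical examples:

```python
import re

# Rough matcher for common access-log lines: leading client IP,
# then the quoted request line with method and path.
LOG_RE = re.compile(r'^(\S+) .* "(?:GET|POST) (\S+) HTTP')

def hits_for_endpoint(lines, endpoint):
    """Return the set of client IPs that requested the given endpoint."""
    ips = set()
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group(2).startswith(endpoint):
            ips.add(m.group(1))
    return ips

logs = [
    '10.0.0.1 - - [01/Sep/2018] "GET /index.html HTTP/1.1" 200',
    '10.0.0.2 - - [01/Sep/2018] "POST /admin/backup.php HTTP/1.1" 200',
    '10.0.0.3 - - [02/Sep/2018] "GET /admin/backup.php?x=1 HTTP/1.1" 500',
]
print(sorted(hits_for_endpoint(logs, "/admin/backup.php")))
# → ['10.0.0.2', '10.0.0.3']
```

The query itself is trivial; the point of the long-tail argument is being able to run it over a year of retained logs without first spinning up a batch job to reprocess them.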
[00:25:41] Unknown:
And one of the things that you touched on there in terms of metrics puts me in mind of the work that Elastic has been focusing on with their Beats technologies, being able to ship things like system metrics and heartbeat results. And I'm curious what kind of support you have for being able to query those types of events and that type of data, and use the same types of dashboards that are being built for some of the newer releases of Kibana, for being able to use it as somewhat of a replacement
[00:26:13] Unknown:
or supplement to things like Datadog or New Relic? Yeah. I mean, I think the goal of high-cardinality metrics is something that a lot of companies are trying to solve. I think it's what everyone wants, you know, that ability to slice and dice on, like, every customer and every request to every single endpoint. It's a supremely hard challenge that everyone's trying to solve in obviously different ways. One of the things — we were having kind of an internal discussion around time series metrics, and I had spoken of my past of building out Graphite clusters and having a lot of fun doing that kind of stuff. One of the responses was, oh, we'll probably just use Data Edge for that, and that's how we'll monitor it. Basically, in a very meta way, we'll use this technology to monitor this technology, which I thought was pretty awesome. And going back to the topic of space savings, you mentioned that you have your own indexing
[00:27:02] Unknown:
strategy and format. So I'm curious, what are some of the mechanisms that you're using to allow for such drastic space savings of the data indexed into S3, versus what's present when it's hot in an Elasticsearch cluster and using that Lucene format? Well, so, great question.
[00:27:19] Unknown:
Lucene is an inverted index, so it has some power, but it actually has some cost, where, you know, it points to all the symbols in all the documents, and that does increase the size. With our technology — and we have 3 heavy patents on this; we think it's that important — I'll stay at a high level, just because of some of the secret sauce we use to enable this. But the idea was, we didn't use a column-store technology, or row- or text-based formats. We really came up with a kind of insight I had with respect to information and locality.
And with this insight, I had the ability to compress the data. Actually, you can take compression algorithms like Snappy, which is from Google — it's very fast, but doesn't compress very well. I can take Snappy with our representation of the information and actually get a 2 to 3x reduction over what Snappy can traditionally do. So I'll leave it there; I don't wanna get too much into the weeds, let alone the tech. But, you know, this is something that maybe even someday in the future we'll open source. But for now, we're keeping it inside the service. Fair enough.
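ChaosSearch's actual format is proprietary, but the general principle Thomas is gesturing at — that reorganizing data for locality lets a fast, otherwise weak compressor do much better — can be sketched. The example below is purely illustrative: it uses the standard library's zlib in place of Snappy, and made-up log records, to show that grouping similar values together (column-wise) before compressing typically beats compressing row-oriented records as-is.

```python
import json
import zlib

# Hypothetical log records: lots of repeated field values, as is typical of logs.
records = [
    {"ts": 1000 + i,
     "level": "INFO" if i % 10 else "ERROR",
     "path": f"/api/v1/items/{i % 5}",
     "status": 200 if i % 7 else 500}
    for i in range(1000)
]

# Row-oriented layout: each record serialized in turn, so similar values
# (all the "level" strings, all the "path" strings) are scattered far apart.
row_bytes = json.dumps(records).encode()

# Column-oriented layout: all values of each field grouped together,
# so runs of identical or near-identical values sit adjacent in the byte stream.
columns = {key: [r[key] for r in records] for key in records[0]}
col_bytes = json.dumps(columns).encode()

row_compressed = zlib.compress(row_bytes)
col_compressed = zlib.compress(col_bytes)

print(f"row layout: {len(row_bytes)} -> {len(row_compressed)} bytes")
print(f"col layout: {len(col_bytes)} -> {len(col_compressed)} bytes")
```

The same compressor, fed the column-grouped layout, produces a noticeably smaller output — the data didn't change, only its locality did. That's the spirit of "take Snappy with our representation and beat what Snappy traditionally does," though the real Data Edge representation is far more involved than this JSON transpose.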
[00:28:26] Unknown:
And talking about the compute layer, I'm curious if you can discuss a bit about the overall system architecture and the life cycle
[00:28:35] Unknown:
of the data as it gets run through a query that somebody submits into the ChaosSearch platform. Yeah, sure. So as we mentioned, we have an indexing technology. We use S3 as a backing store and EC2 in Amazon for our compute. We have a Scala Akka distributed framework — for those listeners, it's a relatively popular distributed framework. I like to say it makes hard problems easy and easy problems hard. But it really solves the hard problems in distributed architectures, and that's really where we wanted to focus, and Scala is even a fun language, if you ever play with it. So that's kind of our core framework, and Data Edge indices are streamed through it. From an experience perspective, what we do is, when you say through the API or the UI, index these resources in S3, we read that data, and through Akka workers, we spin up work on compute local to your bucket. So wherever your buckets are, the compute gets executed there, to maximize performance and minimize cost. And that way, we actually index data roughly 10x faster than Elastic can, and then write these representations into the S3 location you specified for us. From there, we publish these named indices that you have indexed through the Elastic API. So when Kibana or Elastic queries our system, it queries us over Akka HTTP.
We have identified the indices, and we know the topology of these indices within your S3 account. From there, we query these indices and we materialize the results just like an Elastic API would, but we do it at a scale and price point that is quite unique. We also have a fabric underneath the hood where we spin up compute. We always have reserve compute, so that a query plan always has resources — maybe it needs 1 worker, maybe it needs 10, maybe it needs 100. That way, we can spin up the query execution fast without having to, as Pete mentioned, blow up your cluster because you've overcommitted the requests against the resources you backed them with. And for systems such as,
[00:30:48] Unknown:
one of the things that I'm using at my work is ElastAlert, or other things that are relying directly on the Elasticsearch API,
[00:30:56] Unknown:
can they just run directly against ChaosSearch and run those same sorts of operations? Yeah, absolutely. I mean, that's kind of the ultimate long-term goal. You know, maybe we're never at full feature parity with an Elasticsearch cluster, because, obviously, that's designed for high-throughput, low-latency responses. But there's definitely no reason not to expose those similar APIs, and we really expect them to be used in a similar way by customers. But I think what's most interesting is, there's a lot of tools out there, especially as things like TensorFlow and a lot of these ML-type projects are becoming more mature in the open source world. Really, I think it'll enable tools like that to tie into these APIs as well, which we're pretty excited about, because now you can go to companies who previously could never afford to run ML models across a year's worth of data. Now they can do it for, arguably, pennies on the dollar. Yeah, and actually, I'll add to that. One thing that we haven't addressed is, as I mentioned, this file format supports text and relational
[00:32:02] Unknown:
functionality, but its unique representation works nicely with TensorFlow integration. So that's something that — you know, Pete here, he's our product guy, but our road map has TensorFlow integration. It's something that we've heard our customers are looking for. And do you have any additional capabilities
[00:32:20] Unknown:
beyond what the Elastic API supports that you're exposing through ChaosSearch, for being able to leverage some of those relational aspects, or simplify certain types of operations that might not necessarily be executed directly against an Elasticsearch cluster, but would be useful for some of these more long views of the data?
[00:32:39] Unknown:
Yeah, I'm actually glad that you mentioned that. We actually have a data refinery, where you can take the indices that you've created through the ChaosSearch process and join them to create new indices — something that is missing in the Elasticsearch functionality, the ability to denormalize that dataset. Or, with us, we can actually virtually join the data to create new indices, and they use all the same tooling. That's something that is really core to our product, as well as important to what customers have asked for: I wanna correlate this against that, where today it's hard to do. We also have what we call the 3QL API, which in essence is an S3-like query language.
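The refinery's internals aren't public, but the "virtual join" idea described here — correlating two indexed datasets on a shared key without materializing a denormalized copy up front — can be sketched in a few lines. The dataset names, fields, and `virtual_join` helper below are all hypothetical illustrations, not ChaosSearch APIs.

```python
# Two log-like datasets sharing a key, as might come out of separate indices.
requests = [
    {"req_id": 1, "path": "/login", "status": 200},
    {"req_id": 2, "path": "/admin", "status": 403},
    {"req_id": 3, "path": "/login", "status": 500},
]
auth_events = [
    {"req_id": 1, "user": "alice"},
    {"req_id": 3, "user": "bob"},
]

def virtual_join(left, right, key):
    """Lazily yield records joined on `key`.

    The smaller side is indexed by the join key; the larger side is
    streamed. Nothing is materialized until the caller consumes results,
    so no denormalized copy of the data ever has to be written out.
    """
    index = {row[key]: row for row in right}
    for row in left:
        match = index.get(row[key])
        if match is not None:
            yield {**row, **match}

joined = list(virtual_join(requests, auth_events, "req_id"))
for record in joined:
    print(record)
```

The payoff of treating the join as virtual is that the correlated view behaves like a new index you can query with the same tooling, while the underlying datasets stay stored exactly once.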
So a lot of the relational functionality can be queried through an S3 look and feel. Now, this is something that we love, because we love S3 — we love its simplicity. And for folks that wanna query in an S3-like fashion, we have a language for that. But, obviously, the Elasticsearch API is a wonderful API, and we support that. But the refinery is where we marry all these unique datasets in a relational way that folks,
[00:33:52] Unknown:
you know, today just can't do. Yeah. I mean, in the data science world, I think there was a research article or something that said that data scientists spend, like, 80 to 90% of their time just cleaning up the datasets. And so having this location where you can go through and make adjustments to the schema, like, on the fly — you know, there's this concept of automated schema detection, but a data scientist going into the data might have different beliefs on how things should be represented. But that concept that Thomas was talking about, of the relational aspect of joining and correlating and making that much easier, I think is pretty awesome too. And in terms of the query latency and the overall
[00:34:38] Unknown:
latencies in terms of interfacing with S3, and just trying to reduce the overall time budget for the system, both in terms of indexing and querying — I'm curious what you've found to be some of the biggest contributors
[00:34:53] Unknown:
to that latency, and some of the strategies that you're using to mitigate it? Yeah, no, great question. And this is really where we've spent a lot of our time over the last several years: query optimization and query planning via this Data Edge indexing. And so S3 has its limits, although, ironically, every year it seems like they're increasing those limits and the amount of data that can be queried. The idea is that you always query these indices that are stored in your S3 account, and because of our compression ratio, and because of our unique way to scope and plan against the request that you have, S3's limiting factors — so far, at our scale testing into the terabytes — have not been an issue at all. So we are really in the less-than-10-second range for huge aggregation queries, down to seconds or sub-second for quick finds, you know, text search across gigs, if not terabytes, of data. So right now, for the long-tail use case, we've honestly been happily surprised. We knew what we were doing; we knew we could achieve these goals. But so far we've been really pleased, and we're really betting that object storage is gonna get cheaper and faster, and with our technology, our architecture, and the use cases that we're going after, it seems to be a good marriage. And one of the other issues
[00:36:12] Unknown:
with storing and accessing these large volumes of data is the question of permissions and access control, which has generally been fairly lacking in the Elastic product unless you're using their paid services or their X-Pack products. So I'm curious
[00:36:30] Unknown:
what types of controls you're exposing in your platform to help facilitate some of those security concerns. Yeah. So, I mean, at a high level, there's definitely some ways currently to restrict access down to, you know, who's able to see which things. One of the things that I'm most excited about — and something we're actively doing — is reaching out and trying to find people that have these interesting security challenges, and really getting feedback from them on what are the kind of interesting features they wanna see. Whether it's being able to restrict right down at the index level, in different areas of the product, or at the bucket or the data level. The nice thing is that there's a lot of ways to slice and dice the data within the product. And so that then gives people the ability to potentially control within those kind of virtual buckets that get created when you slice and dice. And one thing that we've seen from our customers is they've been slicing it via the storage access. So the role-based access they grant to a particular user is one level of access, as well as the field-level access that you mentioned, like Elastic
[00:37:37] Unknown:
has with X-Pack. That's something that we are integrating with our product: when you create an object grouping, or this virtual folder I mentioned, you can specify which fields you want to have access to, and we won't index them, or we won't provide that access to that particular user. So this is something that we hear about across all aspects of security, but we have Pete Cheslock here, who knows security and knows products, so I'll look to him to direct us and navigate through those use cases. And so given the fact that this product unlocks
[00:38:11] Unknown:
a fairly large amount of data that most people have generally not really had access to for such long time spans, I'm curious what you have found to be some of the most interesting or unexpected uses of the ChaosSearch platform for being able to
[00:38:29] Unknown:
interact with and analyze all of that information. Well, it's a really good question, because we see that people come to our site, come to talk to us, because of their classic long-tail log and event requirements. So for instance: I got a fault, or I got an alert. I know that it happened, but I don't know why. So they wanna use ChaosSearch to go search why, where, and how it happened. But then once they have this long tail, the product teams over at those companies go, well, now that I have this data — a lot of the time it's for internal use, but more often than not, we're hearing they wanna provide that data back to their customers.
And so these companies are looking at whole new product offerings beyond what they originally thought this data could provide. And so that's not to say it's a surprise, but it's been a happy trend. And then lastly, because we have that long tail — and this is a little bit further out, for the more advanced users — there's the analysis, the machine learning, the predictive analytics.
[00:39:25] Unknown:
Now, that's something that we hear they wanna do, but really, the access of this data to their customers is what we wanna see, and we'll see where it goes. Maybe new products are built around ChaosSearch that haven't been seen before in the market. Yeah. I think what's interesting from my perspective is, there's a lot of data already out there that exists today, you know, in Elasticsearch or other logging-type systems. But there's actually even more data — and the question is how much — that is essentially being thrown away. When I was diving into this company over the last few months, really thinking about whether this was a group and a product that I wanted to get involved with, I started chatting with a few of my friends about this. The answer that I got from a lot of them was, you know, if I could do something with this data, I would keep it. But I can't, so I just throw it away.
Or, you know, I was thinking about getting my data science people to dive into this large dataset, but at this point, I don't even know what database to put it into. I don't know how they would even access it. And so it's just kinda sitting there right now. So I think that's kind of my hypothesis: there is this large amount of data that either exists and is not being used, or is just being thrown away. And I think people are gonna say to themselves, wow, it's so cheap for me to process this — I wonder what's there. And, like Tom was just saying, that ability to say, let's actually monetize this data from a product standpoint — you know, I know there's a lot of people out there that would probably have some ideas, if they could just do it in a way that doesn't cost them a fortune. And I imagine too that it would lend itself well to
[00:41:07] Unknown:
more useful and more accurate anomaly detection algorithms, where a number of the layers that are built on top of Elasticsearch or some of these other metric systems are generally working on much shorter time horizons — detecting whether the data is anomalous within the past few hours or days — but you could even create analyses or models for seasonal trends, where, you know, this time last year we saw a similar spike, so it's not actually
[00:41:35] Unknown:
anomalous from a yearly perspective, despite being anomalous within the span of the last month or so. No, and that's a great point. You know, Elastic has some new features like rollups. And when you roll things up, you lose the texture, and you lose those patterns that may have provided the insights that the full text carries. So rollups are a great technique when keeping everything is too costly. But what if you could store all the data, and really index the full dataset, versus sub-parts where you later realize, oh shoot, I wish I had this attribute within the system? So, you know, we think having the real data accessible at your fingertips, as we mentioned, really opens up a lot of doors. And you've mentioned a little bit some of the goals that you have going forward
[00:42:24] Unknown:
and some of the ideas that you have for the platform. So I'm wondering if there are any specific features or improvements that you're working on that you'd like to share, that exhibit what you're trying to move towards in the future with ChaosSearch?
[00:42:39] Unknown:
Well, I'll let Pete talk to the short term, but the long-term vision is we are building a data platform. It's beyond log and event data. We have a data lake philosophy, with the ability to dump data into your S3 without having to worry about schema or provisioning or scale, at a price point. And then from there, really be able to analyze your data in the way that you like to analyze it — you know, really a design studio, really a database-and-storage convergence — and we want to be a first generation in that. But from a short-term perspective, I'll leave it to Pete
[00:43:17] Unknown:
for features and functions. Yeah. I mean, in the near term, the thing that I'm finding most joy in is sitting down with people and hearing about the pain and sadness of their very expensive and very hard to manage Elasticsearch cluster, and basically going to them and saying, listen, I can give you a lot more data for a lot less money and a lot less Elasticsearch. And, you know, a little gleam of brightness comes into these poor operators' eyes when I can tell them that there is a future where they could actually store more data and run fewer servers. And so I think my biggest goal is really just trying to find more people out there who wanna go on a journey with us, try this out, give us feedback, and really help us refine and make that experience as good as possible. And one other question that that just triggered is,
[00:44:12] Unknown:
with having these longer time spans of data, you increase the potential for having compliance issues, either because of the length of time that the data is being stored, or because at some point in the history of when you started collecting the data, you had a certain field that you've now either obfuscated or eradicated. And I'm wondering if there are any mechanisms built into ChaosSearch for being able to go and retroactively modify that information, or if you would just be relying on your other ETL tools for being able to do that processing and retroactively apply these compliance regimens?
[00:44:51] Unknown:
Actually, yeah, I'm glad that you asked that question. We've gotten that request a few times, and I didn't really plan for this. But Data Edge actually has the ability to wipe clean those particular terms and symbols, without having to re-index or transform the data. So
[00:45:07] Unknown:
It's kind of the GDPR question. Yeah. Which is, like, you know, for a lot of companies: I got a GDPR request, I need to remove some data. And usually the first response is, well, first we have to process it all to figure out what we even have, then we have to delete it. And we started talking about this, and Thomas was like, well, we can just selectively purge that out — it's pretty easy. And I was like, I don't have to — yeah, I don't have to rebuild the indices, which is,
[00:45:31] Unknown:
well, that was not intentional. Sometimes with computers, there's happy accidents. I would say that was a free one that I got out of this representation.
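The selective-purge "happy accident" makes sense if the representation is dictionary-encoded: field values live once in a symbol table, and documents reference them by ID, so redacting one table entry removes the value everywhere without rewriting or re-indexing anything. How Data Edge actually does this is proprietary; the class below is only a minimal sketch of that general technique, with hypothetical names and data.

```python
class DictEncodedColumn:
    """A toy dictionary-encoded column: values are stored once in a
    symbol table, and each row holds only a small symbol ID."""

    def __init__(self):
        self.symbols = []   # symbol id -> value
        self.ids = {}       # value -> symbol id
        self.rows = []      # one symbol id per document

    def append(self, value):
        if value not in self.ids:
            self.ids[value] = len(self.symbols)
            self.symbols.append(value)
        self.rows.append(self.ids[value])

    def purge(self, value):
        """Redact a value everywhere (e.g., for a GDPR erasure request).

        Only the symbol-table entry is touched; the per-document rows
        keep their IDs, so no re-indexing or rewrite is needed.
        """
        sym_id = self.ids.pop(value, None)
        if sym_id is not None:
            self.symbols[sym_id] = None

    def materialize(self):
        return [self.symbols[i] for i in self.rows]


col = DictEncodedColumn()
for user in ["alice", "bob", "alice", "carol", "alice"]:
    col.append(user)

col.purge("alice")
print(col.materialize())   # every occurrence of "alice" is now redacted
```

The purge is a constant number of table updates regardless of how many documents referenced the value — which is exactly why it falls out "for free" from the representation rather than requiring a rebuild.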
[00:45:42] Unknown:
Most of the accidents are tragic, but occasionally, they can bring joy. Exactly. Alright. Are there any other aspects of ChaosSearch
[00:45:50] Unknown:
or the problem domain that you're solving for that we didn't discuss yet, which you think we should cover before we close out the show? No, I don't think so. I mean, I think we're really excited. I think we're onto something. You know, there's a lot of data out there, and it's growing. For any of the listeners that wanna check it out, reach out to me, check out our website. We're gonna be around and out and about. So we'd love to chat more and just hear,
[00:46:12] Unknown:
really, who's got some really interesting, unique data challenges, and how might we be able to help. Alright. So with that, I'll have you add your preferred contact information to the show notes for anybody who does want to follow up and keep track of what you guys are up to. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Wow, the biggest gap right now. I mean, we live in a golden era of open source software, I think. And in some cases it's good; in some cases it's not great. You know, the
[00:46:47] Unknown:
kind of thing that I see is that there's still not a lot of good stuff out there around cleaning data up for processing. Right? Like, data has to get structured in a certain way — whether it's, I've got log data, I need to convert it into JSON or YAML or whatever other format, stuff like that. So, you know, maybe it's just my own experience, but I haven't really found anything good from that perspective, other than kind of old-school UNIX tools — and rub a little jq on top of it. Right?
[00:47:12] Unknown:
And Thomas, how about you? I mean, here we are — it is the golden age of information, and information drives value. The ability to store more and to analyze more is important, and it changes business; it changes, you know, our lives. So ChaosSearch is really, we believe, a fundamental tool, a service, that unlocks new ideas, new businesses. And, you know, I love solving these types of problems, and I hope to solve a whole bunch more as we learn about what people wanna use the service for. Alright. Well, thank the both of you for taking the time out of your day
[00:47:47] Unknown:
for joining me and discussing the work that you're up to at ChaosSearch. It definitely looks like an interesting platform, and one that I may find some use for myself. So thank you for that, and I hope you enjoy the rest of your day. Thank you. Yeah, thanks a lot.
Introduction and Sponsor Messages
Interview with Pete Cheslock and Thomas Hazel
ChaosSearch Technology and DataEdge
Challenges with Elasticsearch and Data Management
Security and Compliance in Data Handling
Comparing ChaosSearch with Other Tools
Advanced Features and Future Plans
Customer Use Cases and Feedback
Closing Thoughts and Contact Information