Summary
Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. The majority of products that are available either require too much effort to structure the logs or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of high-scale log data for security auditing. In this episode he shares the story of how it got started, how it works, and how you can get started with it.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Your host is Tobias Macey and today I'm interviewing Cliff Crosland about Scanner, a security data lake platform for analyzing security logs and identifying issues quickly and cost-effectively
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Scanner is and the story behind it?
- What were the shortcomings of other tools that are available in the ecosystem?
- What is Scanner explicitly not trying to solve for in the security space? (e.g. SIEM)
- A query engine is useless without data to analyze. What are the data acquisition paths/sources that you are designed to work with? (e.g. CloudTrail logs, app logs, etc.)
- What are some of the other sources of signal for security monitoring that would be valuable to incorporate or integrate with through Scanner?
- Log data is notoriously messy, with no strictly defined format. How do you handle introspection and querying across loosely structured records that might span multiple sources and inconsistent labelling strategies?
- Can you describe the architecture of the Scanner platform?
- What were the motivating constraints that led you to your current implementation?
- How have the design and goals of the product changed since you first started working on it?
- Given the security oriented customer base that you are targeting, how do you address trust/network boundaries for compliance with regulatory/organizational policies?
- What are the personas of the end-users for Scanner?
- How has that influenced the way that you think about the query formats, APIs, user experience etc. for the product?
- For teams who are working with Scanner can you describe how it fits into their workflow?
- What are the most interesting, innovative, or unexpected ways that you have seen Scanner used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Scanner?
- When is Scanner the wrong choice?
- What do you have planned for the future of Scanner?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- Scanner
- cURL
- Rust
- Splunk
- S3
- AWS Athena
- Loki
- Snowflake
- Presto
- [Trino](https://trino.io/)
- AWS CloudTrail
- GitHub Audit Logs
- Okta
- Cribl
- Vector.dev
- Tines
- Torq
- Jira
- Linear
- ECS Fargate
- SQS
- Monoid
- Group Theory
- Avro
- Parquet
- OCSF
- VPC Flow Logs
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data.
Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey. And today I'm interviewing Cliff Crosland about Scanner, a security data lake platform for analyzing security logs and identifying issues quickly and cost effectively. So, Cliff, can you start by introducing yourself?
[00:01:21] Unknown:
Yeah. Absolutely. Thanks, Tobias. Yeah. My name is Cliff Crosland. I am a software engineer. I've been focusing on data platforms and data engineering, really high-scale systems, a lot of C++ and Rust through the ages. And, yeah, we've had a lot of fun on our team. My cofounder, Stephen Wu, another software engineer, and I were at a startup together beforehand. We ran into some really big problems with log scale, and so we built a tool to make log searches super fast. And we found that, in particular, security teams really have this problem more than almost anybody. So we're really focused on helping people tackle massive data problems in these security data lakes and keeping search and detection super fast. So that's what we're up to. And do you remember how you first got started working in data?
Yes. So at the startup where my cofounder, Steve, and I worked before, we had a massive crawling system. We built this really fun, intelligent executive assistant. And what we would do is we would crawl the web and look for news articles about companies, and we'd help people by sending them these intelligent briefings before their meetings for the day, give them, like, news alerts about the companies they're about to meet with, about the people if they showed up in the news. And so we were crawling a massive amount of data, and we definitely ran into some interesting bugs with curl, with libcurl, written in C. We went really low level to try to optimize our crawler as much as possible and ended up moving that to Rust. But, yeah, I think the first taste of managing petabytes of data came at that startup. It was super fun, and it was interesting. We definitely went from, like, Ruby all the way down to C very quickly in order to optimize. So, yeah, we've seen the whole stack, the whole spectrum of different data engineering tools that can be applied to massive datasets.
[00:03:19] Unknown:
And with that exploration of cURL and optimizing it and moving it to Rust, obviously, cURL is written in C. It has a massive amount of history and community knowledge that has been baked into it. I'm wondering what are some of the things that you gained by moving to Rust, and what are some of the things you lost by the fact that you weren't integrating directly into curl? Or were you actually still using curl and just using the unsafe keyword?
[00:03:47] Unknown:
Oh, yeah. Well, I mean, that is totally doable now. It was super interesting. It was a fun experience. So our crawler system was written in C++, and we used libcurl, which was excellent. It is kind of hard to write, like, this asynchronous multithreaded curl code, and interacting with networks and so on in C and C++ is kind of annoying. But what we found is we ran into some memory safety problems in curl. We were able to reproduce it, and it got fixed eventually upstream. A bunch of other people ran into the same thing. We thought, like, wow. Basically, our server would crash, and this is a server that's interacting with the public web constantly and is sort of this vulnerable endpoint where, if someone were to take advantage of a bug in cURL, it'd be really scary for us, because we were hitting all kinds of servers as we were crawling the web. And so some evil servers were, like, trying to send us, you know, gzip bombs and so on. But, yeah, we were nervous about the fact that, okay, there's this memory safety problem where it would segfault. It got fixed upstream.
We were able to, like, add some commits to curl, which was a lot of fun. I think curl is one of the coolest and most important C programs ever written. But we felt like, okay, well, if this is one of the best C programs in existence, one of the best C libraries in existence, and it's still having these memory safety problems, what can we do about this? And we built, like, a Rust prototype. It was a lot of fun. We got a lot of memory safety out of that. It was actually back in the day when you couldn't use the async keyword yet in Rust. So one of the things I think we lost was that it was way less ergonomic than using curl.
But, eventually, it got really nice once the async keyword was added and, like, then writing that code wasn't so insane in Rust. But, yeah, we got a lot of confidence that the program was safe, that we wouldn't have the same kinds of memory safety problems and thread safety problems. So that was really nice. It kinda got us hooked on Rust, and it's why we're using Rust in this new data lake product that we're building at Scanner. And so digging more into Scanner, can you describe a bit about what it is and some of the story behind how it came to be and why you decided that you wanted to spend your time and energy on it? Awesome. Yeah. So at our prior startup, here's the origin story of Scanner. We were using Splunk for our logs. We were generating, like, tens of gigabytes of logs per day, which is fine, and our Splunk bill was, like, tens of thousands of dollars. And then we very quickly scaled up our crawling system and our user count and so on, and our log generation volume grew from, like, 10 gigabytes a day to a terabyte per day, and our log bill jumped up to, like, $1,000,000 a year plus. And so we had to just kind of throw away logs by just uploading them to S3. We still wanted to see if we could use them for debugging, but it was just so expensive to use Splunk and many other tools. Like Datadog, for example, pushing logs there at one terabyte a day is also super expensive.
So we put things in S3. It was super cheap, but then it was impossible to find anything again. So you can use Athena. Athena is pretty good there. But then to use Athena, you have to transform all of your logs into sort of a common format. It's not very good at doing ad hoc searching. So we just felt like, okay, it seems to be the case that logs follow this cycle where there's, like, this on-premise era where everyone has log files stored on hard drives across their data center, and that's really painful for investigation. You have to SSH around. Then everyone moved to using SQL-based log search tools built on things like Oracle and MySQL and Postgres. So, like, a lot of the early security search tools like ArcSight were built on top of Oracle. But that's painful because you have to transform all of your logs to fit a SQL table schema. And then you have tools like Elasticsearch and Splunk come along in the on-premise world, and they're like, oh, unstructured data is totally fine, perfect for logs. You don't have to transform your data into a SQL schema beforehand. This is awesome. And so then we're kind of repeating the cycle again in the cloud era, where people upload logs to S3 as flat files, but that's really painful for search and exploration. And then people are reaching for SQL tools like Amazon Athena or Presto or Trino or Snowflake in order to interact with these logs in S3. But the thing is, logs aren't really, you know, they're not tabular all the time. They are often JSON. They have free-form structure. The structure's changing all the time. New fields are added or removed in this log data. And so SQL is not quite the right fit. And we just thought, like, where is the search index that's built on top of S3? Not something old school; Elasticsearch kind of was built in the on-premise era, and it's still good for the cloud, but all the logs are stored on servers. With Scanner, we thought, why aren't all the logs and the indices in S3, so that you can just search through that data extremely quickly? Where's, like, the S3-based indexing system? And so that's what we built, and it makes it just incredibly fast to search through massive amounts of data.
The index files kind of help zoom in on the regions of logs that contain the tokens in your query. So then instead of, like, you know, kind of scanning the whole world, like you do oftentimes with some of these S3 SQL tools, it really zooms in on just the chunks of logs that contain the hits that you need to see. And so, yeah, we just felt like this solved the problem that we had at the prior startup in a major way. And we found, as we were sharing Scanner with people, that some of the teams that suffer from this the most are security teams, who need to store years of data for compliance purposes or because they want to do investigations. We hear horror stories where, on average, it takes, like, a couple hundred days to detect that you've had a breach and then, you know, a month or two after that to really terminate the breach and disconnect attackers.
But if you only have, like, 30 days of logs that you can see in most tools going back, it's really painful. Like, why not just, you know, search all of your S3 data going back forever? And why isn't that fast? Why does that take, like, you know, days to run those queries? So, yeah, we've been super happy to share this tool with security teams and security engineers, in addition to application debugging teams who are using Scanner just for looking at their application logs. But just to make it really easy to look at past historical data in S3 and, yeah, search through S3 data super, super fast.
[00:10:25] Unknown:
You've already mentioned a couple of other products in the space and some of the reasons why you decided that you needed to build this solution, but I'm wondering if you can give a broader view of some of the shortcomings that exist categorically in other tools in the space and some of the ways that you're thinking about the specific problem that you're trying to solve for?
[00:10:48] Unknown:
Yeah. Absolutely. So there are sort of two categories of logging tools that you see used by modern security teams or application teams who are using logs for observability and fixing errors. And what you see is either old school architecture designed for the on-premise era, or a new architecture, but it's all SQL based. The on-premise era architecture is tools like Elasticsearch and Splunk and so on. They're tools where you stand up a bunch of servers, the logs are uploaded to those servers, and they distribute those logs around among all of the servers in the cluster.
But that is really hard to scale once you start to generate a terabyte of logs per day. Like, most of the work that's being done in those clusters is replicating logs around from one server to another, trying to get sufficient redundancy, healing partitions, and so on. And, like, it's surprising how little compute is actually being used for powering querying or doing, like, basic indexing. A lot of it is just trying to maintain these massive log sets where compute and storage are coupled together on these servers. So that's sort of the old school approach, where in your old, you know, on-premise data center, you'd have a bunch of servers running and you would replicate your logs between those servers. But now, in the world of cloud storage, you have S3, you have blob storage in Azure, you have these really cool, really scalable data storage systems that do all of this duplication, redundancy, partition healing, and so on for very, very low cost. And so you get these really cool tools like Snowflake coming along, and they do a really cool thing. They decouple compute and storage from each other. So all of your SQL data in Snowflake is stored on S3, and the compute's totally separate. You can spin up a lot of compute for your query, or you can spin up a little bit. It's really nice. You don't have to have this, like, really high-maintenance cluster that you're running all the time, because storage and compute are totally decoupled. But the problem with tools like Snowflake is it's SQL. It's really great for business analytics and for data that has really strong schemas, but it's not great for logs, which are, like, totally free-form, semi-structured JSON data usually, with tons of nested fields.
And so it's really hard to do all the work to transform your logs to fit into the SQL schema. And we just felt like, okay, we want that kind of really cool decoupling of compute and storage, but why can't my logs just be anything? I want them to be semi-structured, and I want my indexing tool to be good at indexing and searching through semi-structured data. So that's where Scanner makes a huge improvement. It's much, much more scalable and faster than those kind of high-maintenance old tools like Elasticsearch and Splunk, which can handle semi-structured data fine but can't handle the scale that you need in this era of using cloud tools and generating terabytes of logs per day. And then it's also a much better experience in terms of onboarding logs and searching logs than the cloud SQL tools today, like Snowflake or Presto or Trino. Yeah. It's sort of trying to build the search index version of, like, a Snowflake.
And so it's just giving you really, really easy onboarding of logs, no matter what the schema is, and really easy ad hoc free-form search.
[00:14:30] Unknown:
The closest analog that comes to mind, at least in my experience of working in this space, is the Loki project, which is what the Grafana team has built. But that does have some minimal amount of structuring that it requires as far as the, I forget the terminology that they use, but effectively the labeling that you're using for the different log streams that you're populating. Whereas it sounds like what you're doing is bypassing the write path and just being the index and read capability. So it doesn't matter if the logs were generated through Scanner. You just say, there are some logs here, now I'm going to do something useful with it, versus Loki, which requires that it be both in the read and write path in order to be able to work effectively.
[00:15:16] Unknown:
Yes. Definitely. I think Loki is a super cool tool. One of the things that's interesting is, yeah, as you mentioned, it's sort of this metadata labeling, this tagging, that you have to add to the logs as you push them through Loki, through, like, a pipeline that adds the metadata to the logs. And then when you execute a search with Loki, that also executes a search against, like, S3 data, and it can use those tags to determine which files to go scan. But one of the interesting issues that you run into, especially if you're doing security investigations, is you may not know what the tag should be. It might be, like, I have an IP address here, or maybe it's a collection of malicious IP addresses that I'm scared about, and it could be in my Okta logs, my GitHub logs. It might be in CloudTrail. It might be in VPC flow logs. I don't know which log source to go check. I kind of wanna check everything together. And what Scanner does is it actually indexes the content, as opposed to just using the metadata tags as Loki does, to determine which files to go scan. And so when you execute a query, Scanner's Lambdas will go and ask the index files: okay, here are the IP addresses that the user asked for; which files, and which chunks of those files, contain those tokens? And then once it gets that information, it spawns out Lambdas to go and just search that small subset. So it's basically, like, yeah, you can use the content that's indexed to narrow down the search, and not just the metadata tags. But I think Loki is really cool, and it's another log tool in this vein where it is scalable and it focuses on S3 storage. But I think that's an accurate characterization where, yeah, you don't need to change the logs. You just kinda point Scanner at your logs in S3 and say, index this and let me search through it really fast.
[00:17:04] Unknown:
And that brings us to the data acquisition piece of the question where having a powerful tool that can do all kinds of automated discovery or very fast querying is completely useless if you don't have anything to run those queries against. And so I'm wondering what are the different data acquisition paths that you are focused on prioritizing as you're early in your product life cycle and some of the ways that you're thinking about the future trajectory of supporting the data acquisition path and being able to branch out into a broader variety of systems that you can crawl?
[00:17:40] Unknown:
Yeah. Absolutely. So what we're focusing on right now is tools that naturally upload their logs to S3. And so this would be like AWS CloudTrail, or CrowdStrike has, like, Falcon Data Replicator, or GitHub has an option where you can push your GitHub audit logs into an S3 bucket of your choice. Cloudflare, same thing. Like, many security related tools actually already have connections to push logs into S3. And so for those, it's really easy to get started with Scanner. You just enable those different tools to push your logs into S3 buckets, and then you point Scanner at those buckets, and Scanner provides really fast search on that data. So any AWS data in particular, I think, is what most of our users are focused on. Like, CloudTrail is a really killer log source that everyone wants to monitor and run detections on and so on.
But in the future, one of the things that we are beginning to build is data connectors that go and pull in logs from tools that are API based. Like, Okta is a good example. So some of our users use really cool tools like Cribl or Vector to go and fetch logs from different places and push them into S3. But there are others who just want to do it from the same UI in Scanner. And so that's a future direction for this year: building up this larger set of data connectors that can go and pull logs down from different sources, especially API based sources, and then pull them into your S3 bucket for you. And then they're there in S3 for you forever. If you want to query them with Athena or some other tool, that's great. Or if you want, like, super fast search, then Scanner can index those for you, and then you can execute queries from within Scanner.
[00:19:26] Unknown:
And given your security focus as the primary impetus for building this system, what are the things that you are explicitly not solving for or at least not yet?
[00:19:38] Unknown:
Yes. I think this is great. There are some really amazing security tools. So as everyone knows, Splunk has, like, everything and the kitchen sink built into it, and we don't want to try to replicate every last thing that Splunk can do. Maybe eventually, but one of the things that we are really focused on doing is integrating really well with other excellent security tools. So one example is Tines. We have an integration where, with your detection rules, you can push the detection events that are triggered by the detection system to Slack. You can also push them to Tines, and Tines is a really cool SOAR tool to automate security response.
And you can use that. We push to webhooks in Tines, and then you can automate your Tines workflow to do other things. Like, you might hit an API to go and reject a user or block an IP address, or undo, like, a security group change in AWS that your developers accidentally made, you know, by opening up a network to the world or something. So there are some cool things you can do with Tines, and Scanner integrates with that and can send those messages to Tines. And then you can use Tines or Torq or other tools to do an automated response. So we're not gonna build the automated response ourselves. We want to hand that off to really cool modern tools. And then there's also the security event case management, where you assign a ticket to somebody; we don't wanna do that within Scanner either. We push events off to Jira and probably Linear as well. Linear is a really fun ticketing system. But, basically, we're pushing off the issue tracking and the automated response to other tools. We don't wanna do all of that. We just wanna make extremely fast search and then these detections, which are really easy to write. You can run these queries all the time on your data. We're just really focusing on the search and the query experience.
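As a purely illustrative aside (the episode doesn't specify Scanner's actual payload shape or the Tines endpoint format), here is a minimal Rust sketch of what forwarding a detection event to a SOAR webhook might look like; the URL and field names are hypothetical.

```rust
use serde_json::json;

// Hypothetical example: forward a detection event to a SOAR webhook
// (e.g. a Tines webhook URL you have configured). The URL and payload
// fields below are made up for illustration, not Scanner's real schema.
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let webhook_url = "https://example.tines.com/webhook/abc123"; // placeholder
    let event = json!({
        "rule": "aws_security_group_opened_to_world",
        "severity": "high",
        "source": "cloudtrail",
        "indicators": { "ip": "203.0.113.7", "user": "alice" }
    });

    let resp = reqwest::Client::new()
        .post(webhook_url)
        .json(&event) // serialize the event as the JSON request body
        .send()
        .await?;

    println!("webhook responded with status {}", resp.status());
    Ok(())
}
```

From there, the receiving SOAR workflow (Tines, Torq, etc.) owns the automated response, which matches the division of labor described above.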
[00:21:35] Unknown:
Digging into the architecture of the scanner platform, you mentioned that you're very focused on cloud first, modern systems. Wondering if you can just talk through some of the ways that you have built the platform and some of the design philosophies that are core to your product identity?
[00:21:54] Unknown:
Yes. So we really believe in decoupling storage and compute. We think that is an extremely cool new pattern that you see in SQL tools so far, and we really feel like search indexes need to do the same thing. And so the architecture for Scanner is serverless. We use ECS Fargate in AWS for our indexing compute, and that can scale up and down extremely rapidly. So what happens is your S3 buckets might fill up with a lot of log files. We receive SQS notifications. The ECS Fargate system will spin up tasks to go and handle that compute, spin back down again once the volume comes back down, and it creates these index files, which are in S3. We also have, like, a MySQL database for everybody, and that MySQL database is there for metadata: here is where all the index files are, here are the time ranges they span and which indices they refer to, and things like which users have access to which indices in the system. Those are stored in MySQL. And then when a query occurs, our API handler will go and query MySQL and figure out, okay, here are all of the index files that are relevant to the query, because they're looking at this index or these seven indices, they're looking at this time range, and they're executing this query. So for all of the index files that we want to go and hit, we'll spawn up a huge number of Lambda functions, which will all individually go and very rapidly traverse those index files in chunks and jump around in chunks. And the index file sort of guides the Lambda to the small subset of regions that contain the hits. And then we accumulate those results back together in, and this could be a fun topic for math nerds out there, what we call the monoid data structure server. So we compute all of these aggregate values and then merge them together in the monoid server and then report those off to the user in the UI. So that's how we do our aggregation queries. And, actually, all of our queries are technically monoid values, which is a fun idea from group theory that's really useful for doing this kind of thing.
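For listeners curious what the monoid idea buys you in practice, here is a minimal Rust sketch (not Scanner's code, just an illustration of the concept): an aggregate with an identity value and an associative combine operation can be merged from partial results produced by many Lambda workers, in any order or grouping.

```rust
use std::collections::HashMap;

// A monoid: an identity value plus an associative combine operation.
// Associativity is what makes it safe to merge partial aggregates from
// many workers in any order, or pairwise in a tree.
trait Monoid {
    fn empty() -> Self;
    fn combine(self, other: Self) -> Self;
}

// Example aggregate: count of log events per source IP.
#[derive(Debug)]
struct CountByIp(HashMap<String, u64>);

impl Monoid for CountByIp {
    fn empty() -> Self {
        CountByIp(HashMap::new())
    }
    fn combine(mut self, other: Self) -> Self {
        for (ip, n) in other.0 {
            *self.0.entry(ip).or_insert(0) += n;
        }
        self
    }
}

// Merge the partial results returned by each worker into one final value.
fn merge_partials<M: Monoid>(partials: Vec<M>) -> M {
    partials.into_iter().fold(M::empty(), M::combine)
}

fn main() {
    let worker_a = CountByIp(HashMap::from([("10.0.0.1".to_string(), 3)]));
    let worker_b = CountByIp(HashMap::from([
        ("10.0.0.1".to_string(), 2),
        ("10.0.0.9".to_string(), 7),
    ]));
    let total = merge_partials(vec![worker_a, worker_b]);
    println!("{:?}", total.0); // {"10.0.0.1": 5, "10.0.0.9": 7}
}
```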
So that's roughly how the system works and how it hangs together.
[00:24:14] Unknown:
Getting even deeper into the weeds, indexing is a bit of a dark art, and there are multiple different ways that you can construct those indexes for different use cases, whether that is accuracy, retrieval speed, reducing the amount of scanning, where those indexes live, or how you manage the update cycles on them. I'm wondering if you could talk a bit more about how you think about the specifics of indexing this variegated data that is semi-structured and some of the complexities that come about in building consistently useful indexes across this mess of data that people are just throwing at you?
[00:24:50] Unknown:
Yes. I think that's a super great question. This is something that we really played with, because it seems like there's a pretty broad spectrum of what you can do with an index. Like, if you look at Elasticsearch's indices, they're extremely fine grained, where every token basically has a mapping to every single individual document where the token appears, which means that the index is really massive. But then on the other hand, you have, like, no indexing at all, which means that ingestion is super fast. So with Elasticsearch, ingestion is slow. With Loki, ingestion's super fast because there really isn't any indexing. There's just, like, labeling the logs with the different tags, and that helps partition them into different buckets. But then the problem with a tool that doesn't do any indexing is when you execute a query and you don't know exactly what partition you need to look in. You wanna look kind of everywhere: this is a scary IP address, I wanna see all activity for the past year for this particular set of IP addresses or this user or something, no matter what log source it's in. That can be really hard with a tool that doesn't do any indexing. So we feel like we're in the middle. Our index is more coarse grained. And so we have a number of different index file types. It's not just string tokens; there are also numerical indices and indices that keep track of things like the most common fields and the range of values that appear in those fields. But we're trying to balance search speed with really, really fast and low cost ingestion. That's where I think Elasticsearch and tools like that fall over: ingestion is really, really slow, or it just takes so much CPU that it's really expensive. And Loki is, like, no indexing, so it's super fast to ingest, but querying is slow. We wanna make querying really fast. So our trade-off is we want it to be possible to query for a needle in a haystack in a petabyte of logs in a hundred seconds or less, where in various other tools that might actually take you something like hours, or sometimes days. But we also want it to be the case that the compute for each gigabyte indexed is really cheap for us so that we can handle tons and tons of traffic. So we have this coarse-grained index system where you map from the tokens and the content that appears in the logs. Instead of making a really fine-grained mapping to every single individual log event that contains those tokens or that content, you have a much more coarse-grained mapping, so that the index files guide you to a region that you still need to go scan, an entire region of, you know, maybe hundreds of thousands of log events. But it's actually really fast to do that, especially if you're using Rust, especially if your format is really fast to download and parse and decompress. Narrowing down the search to these reasonably large chunks, chunks that are pretty fast and easy to scan, gives us a nice balance in our indexing system between super expensive but really fine grained, and super cheap but very unhelpful indexing.
So we like to be super fast, but also really, really cheap to index.
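To make that trade-off concrete, here is a tiny hypothetical Rust sketch of a coarse-grained index (the real index format isn't described in detail in the conversation): tokens map to chunk ids rather than to individual events, so a query only tells the workers which chunks to download and brute-force scan.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical coarse-grained index: each token maps to the set of chunks
// (regions of log data) where it appears, not to individual log events.
#[derive(Default)]
struct CoarseIndex {
    postings: HashMap<String, HashSet<u64>>, // token -> chunk ids
}

impl CoarseIndex {
    fn add(&mut self, token: &str, chunk_id: u64) {
        self.postings
            .entry(token.to_string())
            .or_default()
            .insert(chunk_id);
    }

    // Chunks that might contain every query token (a superset of the true
    // hits); only these chunks need to be downloaded and scanned.
    fn candidate_chunks(&self, tokens: &[&str]) -> HashSet<u64> {
        let mut sets = tokens
            .iter()
            .map(|t| self.postings.get(*t).cloned().unwrap_or_default());
        let first = sets.next().unwrap_or_default();
        sets.fold(first, |acc, s| &acc & &s)
    }
}

fn main() {
    let mut idx = CoarseIndex::default();
    idx.add("203.0.113.7", 0);
    idx.add("203.0.113.7", 4);
    idx.add("login_failed", 4);
    idx.add("login_failed", 9);

    // Only chunk 4 needs to be scanned for this two-token query.
    let chunks = idx.candidate_chunks(&["203.0.113.7", "login_failed"]);
    println!("{:?}", chunks); // {4}
}
```

The coarser the mapping, the smaller and cheaper the index, at the cost of scanning somewhat larger regions per hit, which is the balance described above.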
[00:27:57] Unknown:
And in terms of the motivating constraints, I'm wondering what is the North Star that you're using to decide: this is something that we want to include in the system, that is something that is nice to have but we can't execute on right now, and that is completely out of scope and is somebody else's problem?
[00:28:17] Unknown:
That's a great question. So we really want to focus on what we think is missing in the space, which is search, and making that sufficiently fast on massive datasets, but at low cost. We think it's silly for someone generating a terabyte of logs per day to pay $1,000,000 a year. It should be 10 times less than that. So our North Star is: how can we make search 10 times cheaper but also remain extremely fast? And we're really focused on that. The search that we're focusing on is both ad hoc searching and also these detection queries that run all the time to find threats, and we're just laser focused on that. We don't want to go and build every feature that you see in the security space. We want to integrate really well with those tools, whether they're sending us data into S3 for us to index or we're sending out events to them to go and handle different responses. So, yeah, we just feel like search is not good enough, and we really want that to be excellent and save people a huge amount of time when they're doing these really big investigations over their data. And then a bit more
[00:29:25] Unknown:
nuanced on the question of indexing is also the fact that log data is notoriously messy. There is no consistent format that you can be guaranteed that anybody's going to have, because most log systems will allow you to customize it to no end. And I'm curious how you are approaching that problem of being able to identify: this is a string that is worth looking at, or this is the structure that's being used, whether it is a standardized structure such as JSON or Logplex or Syslog, or this is just a completely arbitrary string, but there's actually interesting stuff in there and now you need to pull out your regexes.
[00:30:03] Unknown:
Yes. I think that's a super great question. We feel like one of the most painful things about using logs right now is all of the data engineering work that goes into setting that up in the first place. A lot of the data lake style tools that people are reaching for do require a significant amount of effort to do things like transform your Syslog or your JSON or your plain text or your CSVs into SQL tables. I guess with CSV, it's actually not that hard because it's basically a table. But especially with more free-form data types and log types, like JSON or even protobufs and so on, it can be really painful to map that data into really strict schema structures. And so we feel like logs, by their nature, are semi-structured. They're free form. It should be really easy to query them in an ad hoc way. And so we really focus on imposing no structure at all on users. We do support various important log formats like JSON, Parquet, CSV, and plain text. Avro is coming if more people want Avro. But, basically, we want to make it really easy for you to onboard logs. You don't have to transform them into a different schema. And that's the way that we've generated our indexes. A lot of other indices are hierarchical, where the index is organized with the field name on top, and then once you know what field name the user is looking in, you go and search within that scoped amount of data. So it's very columnar. For us, the columns don't matter at all. In some sense, the columns are just annotations on the data. So if I am looking for content, like a user ID or a string or multiple strings that I might be curious about, but I have a column name in front of that in my query, then what we'll do is we'll find that content, and then we'll see if any of those hits also have this column annotation.
So that way your columns can change day to day. They don't have to remain stable. They don't have to remain small. You're not limited to, like, a hundred fields or something like that. There could be millions of distinct fields in your system. It doesn't matter at all. We are really just focused on the content. And so, yeah, we really want it to be possible for people to just dump their logs somewhere and not then have another project where every new log source requires this massive onboarding transformation work. It's just, I'll point Scanner at it, and Scanner can discover the schema, handle it, and query that schema without any problem at all. So, yeah, we're really focused on that. One thing I might mention there, which is super interesting and that we're playing with in beta right now, is auto schema discovery. So for a lot of log sources, it's really nice to transform them into a common schema if you can, something like OCSF, the Open Cybersecurity Schema Framework, which a bunch of tools are beginning to adopt. It's basically trying to come up with a common schema so you don't have to remember what, like, the source IP address field name is in every single log source you have. You can just say, give me the source endpoint's IP address, and look that up across log sources. If you type in sort of an OCSF or common schema style query, we'll transform your query so that, in the log sources that you're looking at, it will match the fields and the structure of those log sources. So, again, we're just trying to remove as much data engineering effort off of everyone's plate as we possibly can so they don't need to do all this transformation ahead of time.
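As a hypothetical illustration of the "columns are just annotations" idea (these types are not Scanner's; the sketch simply uses serde_json to stand in for a parsed log event): match on content first, then check whether the matching value sits under the requested field name.

```rust
use serde_json::{json, Value};

// Content-first matching: find the needle anywhere in the event, and if a
// field name was given, require that the matching value is annotated with
// that field. The event schema never has to be declared up front.
fn matches(event: &Value, field: Option<&str>, needle: &str) -> bool {
    fn walk(v: &Value, key: Option<&str>, field: Option<&str>, needle: &str) -> bool {
        match v {
            Value::String(s) => {
                s.contains(needle) && field.map_or(true, |f| key == Some(f))
            }
            Value::Object(map) => map
                .iter()
                .any(|(k, child)| walk(child, Some(k.as_str()), field, needle)),
            Value::Array(items) => items.iter().any(|child| walk(child, key, field, needle)),
            _ => false,
        }
    }
    walk(event, None, field, needle)
}

fn main() {
    let event = json!({
        "eventName": "ConsoleLogin",
        "sourceIPAddress": "203.0.113.7",
        "userIdentity": { "userName": "alice" }
    });

    // Bare content search: no schema knowledge needed.
    assert!(matches(&event, None, "203.0.113.7"));
    // Field-scoped search: the hit must also carry this column annotation.
    assert!(matches(&event, Some("sourceIPAddress"), "203.0.113.7"));
    assert!(!matches(&event, Some("userName"), "203.0.113.7"));
    println!("all checks passed");
}
```

In the real system the content lookup would go through the indexes described earlier rather than a full walk of every event; this only shows how a field name can act as a filter on content hits rather than as a required schema.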
[00:33:43] Unknown:
And in terms of the overall design and scope of the system and the problem that you're trying to solve, how has that changed from when you first started working on the problem?
[00:33:53] Unknown:
Yes. So when we first started working on Scanner, we really felt like we wanted to solve problems for people building applications and for people who wanted to quickly search through logs and debug them and have a really cheap place to put those logs. And we discovered something really interesting, which is that most people who are debugging applications only really care about having, like, 7 days of logs or something like that, or maybe 14 days maximum, and they don't necessarily need very many logs. Error messages are maybe the most important thing to keep, and maybe sample some subset of normal logs. And we discovered, yeah, application developers who are using observability products don't need that much log retention. They don't have that much log scale, or it's a little bit more rare. Whereas as soon as we started to talk to security people and security engineers and DevSecOps, everyone who's really focused on detection and response and log tools for security teams, they all said the same thing, which is like, wow, can I use this tomorrow? Like, this is so cool, because I have a few years of logs, but they're invisible to me. I have some log tools that keep something like 14 days or 30 days of logs in scope. But when there's a breach, oftentimes I won't know about it. And, you know, if a third party that we're using gets breached, they might say, like, this breach started about six months ago. Here are some indicators of compromise you might wanna look for, like IP addresses, domains, emails, and so on. And it's really hard to go look in the past and find that data. And so we thought, well, if it's in S3, why isn't it super fast to search through S3 data? And so, yeah, instead of focusing on the problems of application developers, we started to focus on the problems of security teams, and then we started to see security teams using us, like, every day of the week, because there's so much data they're curious about. There are so many ways that people can attack you, and threats are extremely creative. And so it's very helpful to be able to look at not just the past handful of days, but, you know, a ton of time into the past: 90 days, 180 days, multiple years.
Those kinds of queries are really important to security teams. And so that's how we've evolved: to really focus on those problems and on solving problems for people who have, like, the biggest log sprawl issues. And that's been extremely rewarding, because security teams are extremely overworked and have so much to worry about in addition to logs. You know, they're setting policies for their companies. They're doing trainings. There's just so much on their plate, and everyone is bothering them for more reports and asking them questions about whether they're vulnerable to this or that. And so the more time we can save for them by making search extremely fast and making their historical investigations really fast, the better their lives are. So, yeah, we've definitely shifted from, in the early days, focusing on the observability use cases to focusing on the security use cases.
[00:37:02] Unknown:
Given the security oriented customer base that you're focused on, I'm wondering if you can talk to some of the ways that the architecture of your product is designed to fit into the types of regulatory and compliance constraints, both organizational and legal, that they need to be able to accommodate, particularly with ensuring that data access for this sensitive information, which is required for them to be able to do their jobs and determine whether or not they've been compromised, doesn't leave a predefined boundary that they're able to maintain full control of, and some of the ways that you work to give them comfort and guarantees around the ways in which you are processing their data?
[00:37:45] Unknown:
Yes. Absolutely. So one of the things that I think has been really fun to build and to work on is trying to make the connection between the compute and the storage as safe and secure, as fast as possible, and as low cost as possible. So one of the things that we do with Scanner is we create a brand new AWS account specifically for your team. It's not multitenant, unless you're on the free tier; if you want to play with Scanner, there's a multitenant environment for you to play with. But for teams who are using the product on really serious problems, we spin up a completely separate AWS account in the same region alongside where your buckets are. So instead of, like with other SaaS tools, pushing logs out to some third party over the Internet, with our tool everything stays within the same region, and we just use IAM permissions to say, cool, this one single-tenant AWS account that's completely isolated, the compute there is allowed to communicate with your buckets. And when we create the index files, we save the index files into your S3 bucket as well. So you don't need to worry about, like, the data being stored by somebody else. Everything remains in your buckets, and there's not this vendor lock-in where your data is kind of owned by some other person's cloud. It's all just within your own S3 buckets. So it's sort of like this compute service that you can leverage, but then all of the data remains in your S3 buckets forever. And so that does a couple of fun things. It means the data transfer cost is lower. That's kind of fun, where instead of pushing logs to, like, a third party vendor over the Internet, everything just remains within the same region. So, thankfully, and I'm grateful to AWS for this, data transfer between compute and S3 storage in the same region is free. But it also means that there is extremely tight control over where this data goes and who has access to this data, because all the data remains in the customer's S3 and their AWS account, and our compute is in this isolated AWS account that's unique to them. There are some users who are curious about whether we could deploy into their AWS account as well, and they could run everything there. And that's actually very easy for us to do. That's something that we'd love to try out, and it's architected to be able to just deploy the compute to any AWS account that you point it at. You just need enough permissions to set up the various infrastructure inside of that account. So that's something I think will probably be fun to play with this year: deploying into someone's environment directly instead of running the compute in our own AWS account. But, anyways, yeah, we really try to keep everything isolated
[00:40:26] Unknown:
and all of the compute is yours and all of the storage remains in your AWS account. For teams who are adopting Scanner, they're incorporating that into their day to day workflow. I'm wondering if you can talk through what that typically looks like and some of the ways that they are applying Scanner to the problems that they're trying to solve, and some of the ways that you think about the collaboration activities of security engineers in these teams who are doing this either ad hoc discovery or maybe they have some scheduled scan that they're running to detect any persistent issues.
[00:41:00] Unknown:
Yes. I think it's really interesting to learn what kinds of queries and what kinds of investigations people can do now, that get unlocked by a tool like Scanner. And so what we see security teams do is they jump in, whether an alert comes from Scanner or from some other source, and start to investigate the recent past, like over the past couple of days, very quickly. But then immediately you start to see really fun activity. Like, people start to explore over the past six months, over the past year, and look for indicators of compromise going back a very long way. But another really interesting thing that Scanner unlocks, which we were surprised by, is extremely high volume log sources that tend to be sort of low value usually but can be extremely important, like VPC flow logs, which are voluminous; every single connection that's going on inside of your network in AWS shows up there, basically. But you can start to run cross correlations between VPC flow logs and your other higher value but lower volume log sources. So you might say, like, oh, this IP address is showing up in my AWS CloudTrail because someone's trying to log in, but I also see that IP address showing up in the VPC flow logs, and I know that the destination address is EC2. And so it's like, wow. VPC flow logs and other kinds of really high volume log sources, which in the past have been things that no one would dare, you know, upload to other tools because it'd be way too expensive, it now becomes possible to run a single query that touches them both. And you can find activity across so many different log sources, including these really high volume log sources, because it's cheap enough to do that in S3 now. And it's fast enough. I guess that's the other thing: with these high volume log sources, you can run queries in other tools, but they're often slow. In Scanner, it's really fun to see security teams start to do these extensive searches across really, really high volume log sources that they typically don't index at all because it's too expensive.
So, yeah, the collaboration that we see people do is often copy pasting, like, these permalinks that we have to different views and to different log events, and hunting down the source of different kinds of traffic: what strange users, what strange policy changes. If MFA gets disabled for a user, why did that happen? Has it happened before? How many times has it happened for this user over the past year? Is there something weird going on nine months ago? Those are all kind of cool things that Scanner allows you to do that other tools don't, because the retention period is low or, like, the number of log sources you're allowed to use is too low. And that correlation aspect is interesting too because there are a number of other security focused products that I've seen that are oriented around graph structures and being able to do automated
[00:43:57] Unknown:
relationship discovery between different events because most of the attacks that are actually going to have an impact are going to be multistage. If it's not just a brute force denial of service or I'm just going to scan your whole website, then it's going to be something where somebody is putting in the effort because you're a high value target. And so they're actually going to be doing it in multiple stages, possibly with a long time delay between them. And I'm curious what are some of the ways that you think about being able to identify and surface some of those types of correlations that do require multiple hops before an exploit is actually realized?
[00:44:32] Unknown:
Yeah. That's an awesome question. So it's really interesting to see, yeah, how these kinds of attacks evolve. You'll have someone who is able to reset a password on a particular user, and you might see some kind of low urgency detection event about that. And what a user will do after that is they'll start to create more, like, AWS IAM roles, and the names of these look pretty convincing, but they're all associated with the original user. And so, yeah, it does start to look like a graph structure. But then you start to see those IP addresses show up in other places, in VPC flow logs. You start to see those IP addresses show up in, like, Okta logs, or maybe they're starting to hit your API. And so, yeah, I think with Scanner it's really fast to build those relationships as you're running queries, because you can really quickly see a given name or a given IAM user, a given IP address, a given AWS access key across many different log sources at the same time. But that sort of thing, in Scanner, you do have to build yourself by executing queries, as opposed to us generating it. I do think that's a really cool future direction to go and do some research in, where you could actually do multi-hop queries as a result of a detection. You could actually build this with Scanner by running a detection event, which then pings Tines, a SOAR tool, an automation tool, which could then execute a query in Scanner, and then, given those results, execute maybe another query in Scanner. So maybe you could actually do some of these things. But I do think, yeah, the ability to search really quickly, to follow these multiple hops from one bit of scary activity to the next, is really important. And so while Scanner doesn't automatically generate those relationships, it's possible that we will in the future, but we definitely make it totally doable with really fast search, by executing search really quickly and making it so you don't have to wait, like, five minutes between queries. It's more like five seconds or something, and you can jump quickly from place to place to find what an attacker is doing. But, yeah, I think that's a really cool direction to go in and could be something that we work on and make a feature at some point. But we'll see.
[00:46:48] Unknown:
And from that iterative discovery process too, just from a user experience perspective, I'm curious what that looks like where I've run a query, I get a result. A lot of times when you're working in a SQL editor, for instance, you run your next query and it completely blows away your previous result. I'm wondering what that looks like in Scanner as far as being able to do that iterative process of building up more complex correlations or being able to say, okay, well, this was the result here. I'm gonna store that to the side because I need to find this other piece of information, and then being able to go back and forth and collectively build up a more complete picture of what you're trying to discover. Yeah. Absolutely. So I think one of the things that's really interesting about Scanner, which I
[00:47:26] Unknown:
find, like, really annoying about using SQL for a lot of these tools, is that when you're doing an investigation with a SQL tool, you have to kinda jump through and say, I'm going to look in this table, and it might be this log source and this AWS region that the table contains, and then I have to run the query again on a totally different table. One of the really cool things in Scanner is you can say, cool, I have this query, and I want to see the results for this log source. And now I have some data that I've discovered from that log source that's interesting, and I can just continue to edit that query to render both of those things together in the same result table. So I can say, I want this kind of thing from this log source, and please show me results from this other log source as well. And you can start to build these, like, stats tables if you want, or you can just kind of render them all as raw search results. But you can very, very easily combine the search results from many different things altogether into the same view. And I really like that idea of, could you do something like pop over and have multiple queries going side by side or something like that? We do have tabs: you can open up tabs as you drill down into something. You can say, execute the search and open up this particular search, and look at the context in this new tab, look at the context in that tab. But another thing you can do is actually just continue to extend your query, and you don't have to do things like select the one specific table that's allowed to render results. You can just gather results from many, many different log sources. So we often see, like, hey, please look for this email address or this file hash wherever it is. I actually have no idea; we have, like, 50 log sources. I don't know where this file hash appears, or this domain or this IP address. Just show me everything, and then you can start to gather everything together into the same view. This sort of search index like experience just feels way more flexible than SQL, in our opinion.
So that's why we really wanted to replicate the experience you get using a search index, but on top of S3, which makes it a little bit easier and faster to gather all of the data that you need.

Given your focus on search as the core product experience, when are you going to jump on the hype train and run everything through vector embeddings and do semantic and similarity search?

We have played with that a little bit, actually. The thing that's interesting is that the scale is brutal when you're generating a terabyte of logs per day and trying to build vector embeddings for all of it. One way to play with this would be in detections. It might be something like: here are some regions of the vector space where evil things are going on, but it's a little hard to build concrete rules to describe them. It could be really fun to say, all right, take all of my traffic, create vector embeddings for this really massive amount of log data, and then see whether anything falls into one of those regions that looks like a threat. It's just extremely slow right now; a terabyte of logs per day is a brutal amount of data to derive embeddings for. One thing that might be a fun future research direction is something lower volume but still really important. When a detection rule goes off, it creates a detection event, and instead of billions or tens or hundreds of billions of events per day, you might have a thousand. That is totally viable for generating vector embeddings, and then you could see whether these detection events, which might all be low urgency individually, combine into something scary. That would be really fun to play with. Vector databases, and searching semantically as opposed to with very concrete rules, would be awesome; it's just tough to do at scale in terms of expense and speed, and keeping up with ingestion would be really painful. But there are some cool approaches we may play with in the future.
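As a toy illustration of that lower-volume idea, here is a sketch of grouping low-urgency detection events by similarity to see whether several of them combine into a bigger incident. TF-IDF plus DBSCAN stands in for a real embedding model, and the detection records are invented for the example.

```python
# Illustrative only: cluster a small number of detection events by text similarity.
# TF-IDF is a simple stand-in for vector embeddings; at thousands of detections
# per day (rather than billions of raw log lines), any embedding model is viable.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

detections = [
    {"id": 1, "summary": "password reset for user alice from new IP 203.0.113.7"},
    {"id": 2, "summary": "IAM role created by alice: admin-backup-sync"},
    {"id": 3, "summary": "okta login for bob from usual location"},
    {"id": 4, "summary": "vpc flow: 203.0.113.7 connecting to internal API"},
]

# Vectorize the event summaries (stand-in for an embedding model).
texts = [d["summary"] for d in detections]
vectors = TfidfVectorizer().fit_transform(texts)

# Cluster with cosine distance; events that land in the same cluster are
# candidates for rolling up into a single, higher-urgency incident.
labels = DBSCAN(eps=0.8, metric="cosine", min_samples=2).fit_predict(vectors)

for detection, label in zip(detections, labels):
    print(label, detection["summary"])
```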
[00:51:52] Unknown:
And it could also be interesting for outlier detection as well, because in the vector space the majority of your logs are probably going to fit within that similarity search; it's when something is anomalous and falls outside that bound by a certain threshold that it becomes interesting and requires further investigation.
[00:52:06] Unknown:
Yes, absolutely. For anomaly detection you might just have to reduce the resolution to something manageable by a vector database, and that's definitely the direction things are going. I feel bad for all the security teams dealing with fifteen terabytes of logs per day and the like; everything breaks and everything is hard at that volume. But that could be a fun problem for us to solve in the future.
[00:52:37] Unknown:
And as you have been building Scanner, working with your customers, and getting deeper into this security space and the detection of these security events, what are some of the most interesting or innovative or unexpected ways that you've seen Scanner applied?
[00:53:00] Unknown:
I think the most interesting thing I've seen people do with Scanner is cross-correlate across many different data sources, and use creative log sources to do it. The most innovative work involves hunting through very different log sources and finding ways to track users, or threats, that span many of them. That means looking at something like GitHub activity, where maybe the policies are changing for an organization, and asking: who is that user? Has that user made AWS API calls in the past, and what kinds? The ability to cross-correlate, very quickly search through many different log sources, and view those results together starts to paint a really clear picture of a threat or a non-threat, or of ways you can make your policies more or less strict based on what you're seeing in your log data and which detections tend to fire. So some of the coolest things we've seen are people jumping from log source to log source and adding really interesting new sources to the tool: in addition to security logs, jumping into application logs to look for user activity, and then into VPC network flow logs. Just a huge diversity of different log types that people are searching.
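As a concrete illustration of that GitHub-to-CloudTrail pivot, here is a toy sketch of correlating the two sources by user. The event shapes are simplified approximations of those log formats, and the records are invented for the example.

```python
# Toy cross-source correlation: find users who changed GitHub org policy,
# then summarize the AWS API calls those same users have been making.
from collections import Counter

github_events = [
    {"action": "org.update_member_repository_invitation_permission", "actor": "alice"},
    {"action": "repo.change_merge_setting", "actor": "bob"},
]

cloudtrail_events = [
    {"eventName": "CreateRole", "userIdentity": {"userName": "alice"}},
    {"eventName": "PutRolePolicy", "userIdentity": {"userName": "alice"}},
    {"eventName": "DescribeInstances", "userIdentity": {"userName": "bob"}},
]

# Which users changed GitHub org-level policy recently?
suspicious_actors = {e["actor"] for e in github_events if e["action"].startswith("org.")}

# What AWS API calls have those same users been making?
for actor in suspicious_actors:
    calls = Counter(
        e["eventName"]
        for e in cloudtrail_events
        if e["userIdentity"]["userName"] == actor
    )
    print(actor, dict(calls))
```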
[00:54:29] Unknown:
And in your experience of building the Scanner product and figuring out how to make it work for the teams that you're selling to, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:54:45] Unknown:
One of the most interesting lessons we've learned is that integration is really, really painful. People may feel comfortable spinning up a project and building their own pipeline to start their own data lake, but when you get down into the details, when someone has fifty or a hundred different log sources, it's extremely nice to take as much of that burden off their shoulders as possible. The search, the indexing, and the detections are the most technical aspects of what we do, but one of the things we've discovered is that reducing the friction in integration makes an enormous difference; it's maybe the fastest way you can improve someone's life. Instead of every log source requiring a new parser, a new transformer, a new pipeline, you just connect them into S3. You don't have to transform them, and you don't have to make them fit a SQL schema; we'll take it from there. That is an extremely powerful thing. So one of the most interesting challenges has been figuring out how to reduce integration friction as much as possible so that people can get back to investigating, querying, and playing with the logs, instead of spending time integrating yet another log source. We basically want to take integration entirely off your shoulders and handle whatever log input you send our way.
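To make that low-friction path concrete, here is a minimal sketch of shipping raw, untransformed logs to S3 as newline-delimited JSON. The bucket name and key layout are hypothetical, and AWS credentials are assumed to come from the usual environment configuration.

```python
# Minimal sketch: write a batch of raw log events to S3 as gzipped NDJSON,
# with no parser, transformer, or SQL schema applied first.
import gzip
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "example-security-logs"  # hypothetical bucket connected to the search tool

def ship_raw_logs(events: list[dict], source: str) -> None:
    """Upload one batch of raw events, partitioned by source and date."""
    now = datetime.now(timezone.utc)
    key = f"{source}/{now:%Y/%m/%d}/{now:%H%M%S}.ndjson.gz"
    body = gzip.compress(
        "\n".join(json.dumps(e, separators=(",", ":")) for e in events).encode()
    )
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)

ship_raw_logs(
    [{"ts": "2024-01-01T00:00:00Z", "msg": "user login", "ip": "198.51.100.4"}],
    source="app-logs",
)
```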
[00:56:12] Unknown:
And for people who are in the security space, trying to figure out which events they need to care about, what are the cases where Scanner is the wrong choice?
[00:56:24] Unknown:
I would say that Scanner is not what you want if you need to run sophisticated, large SQL queries with many joins, the kind of query that runs for a few hours and generates a big report. People often do this when they compute things like the sum of all the bytes transferred across the network from one VPC to another, or from their network out to the Internet. Those queries take a long time and really take advantage of how sophisticated SQL can be; they're the kind of business analytics queries you might run in a tool like Snowflake, and that's where a tool like that shines. Scanner is meant for extremely fast investigations and fast needle-in-a-haystack search with basic kinds of aggregations, so instead of every query taking an hour, Scanner takes a few seconds. But if you want to do something sophisticated, like computing input and output balance for different hosts, or looking at financial transactions, computing a running balance, and checking whether anyone's balance has gone negative over a week-long or 24-hour period with a big aggregation-type query, SQL is definitely what you would reach for, and you wouldn't want to use Scanner for that. Scanner is much better for threat hunting, finding activity, and zooming in on it quickly, and not as good at the big aggregation-type queries you might run in another tool.
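For contrast, here is a toy sketch of the warehouse-style aggregation described here, using pandas on invented flow records purely to show the shape of the workload; in practice this is a job for a SQL engine over the full dataset, not a search tool.

```python
# Illustrative only: total bytes transferred between each pair of networks,
# the kind of full-dataset aggregation better suited to a SQL warehouse.
import pandas as pd

flows = pd.DataFrame(
    {
        "src_vpc": ["vpc-a", "vpc-a", "vpc-b", "vpc-a"],
        "dst_vpc": ["vpc-b", "vpc-b", "vpc-a", "internet"],
        "bytes": [1_200, 3_400, 800, 50_000],
    }
)

totals = flows.groupby(["src_vpc", "dst_vpc"], as_index=False)["bytes"].sum()
print(totals)
```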
[00:57:59] Unknown:
As you continue to build and iterate on the Scanner product and work with your customers, what are some of the things you have planned for the near to medium term, or any particular projects or problem areas you're excited to explore?
[00:58:16] Unknown:
Yes. Everyone really wants an API, and we're really excited about that. A lot of people have many different log tools, or there are a lot of features they want out of a tool, and they want everything to integrate and play well together. So we're excited about building an API that can do two things in particular. One is to let you run ad hoc searches very quickly. The other is to take advantage of the aggregation caching system we've built for detections and make it possible to run API queries that can power dashboards really quickly. That system generates time-series aggregation values for detections, which is really helpful for rendering dashboards. We want to make it very fast, whether you're using Tableau or Grafana, to use the Scanner API to look at your S3 data and build dashboards and run queries incredibly quickly, so you don't have to jump into Scanner to build the dashboard. Your dashboards can live in many different places, and Scanner can fit in with what you already do, or even work directly from Slack. If you get a detection event from Scanner, you might write your own bot, or integrate with another bot, and say: now take the next step, run this query in Scanner, and render the results for me here in Slack. That's the next big frontier for us: building out this API so Scanner can be the core search and aggregation tool over your logs, one that integrates with tons of different tools and lets you build on top of it. Everyone is also asking for dashboards, which is something we want to build, though it's possible people will just build their dashboards on top of our API. And then we're also increasing the number of data connectors we have, so you can easily pull data from many different log sources into your S3 buckets and make onboarding new log sources as easy as possible.
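As an illustration of that dashboard use case, here is a rough sketch of a client pulling cached detection time series to back a Grafana- or Tableau-style panel. The endpoint, parameters, and response shape are guesses for illustration rather than a published API.

```python
# Hypothetical sketch: fetch cached time-series aggregates for one detection
# rule to feed an external dashboard, instead of re-scanning S3 per panel load.
import requests

API = "https://api.scanner.example/v1/aggregations"  # hypothetical endpoint
TOKEN = "..."                                         # hypothetical credential

def detection_timeseries(rule_id: str, window: str = "last_7d") -> list[dict]:
    """Return per-interval counts for a detection rule, e.g. for a Grafana panel."""
    resp = requests.get(
        API,
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"rule_id": rule_id, "window": window, "interval": "1h"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["buckets"]  # assumed shape: [{"ts": "...", "count": 12}, ...]

for bucket in detection_timeseries("suspicious-iam-role-creation"):
    print(bucket["ts"], bucket["count"])
```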
[01:00:19] Unknown:
Are there any other aspects of the work that you're doing at Scanner, or the overall space of security log analysis and threat discovery, that we didn't discuss yet that you'd like to cover before we close out the show?

I think that's probably good, if that's all right.

Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:00:49] Unknown:
Yeah, I think the biggest pain that everyone experiences is in integration: trying to get massive amounts of data into the right shape and the right destinations, and also making it visible. A lot of tools and a lot of cool functionality exist to get data into a place like S3, which is probably where everything should be because it scales well, but that data often becomes invisible and very hard to interact with. So getting really great visibility, search, and queryability on massive amounts of data at low cost is a huge problem, and that's a big reason why we're working on Scanner.
[01:01:22] Unknown:
All right. Well, thank you very much for taking the time today to join me and share the work that you and your team are doing at Scanner. It's definitely a very interesting product, and it's great to see you working on making the discovery and exploration of security threats easier and faster for security teams. I appreciate all the time and energy that you're putting into that, and I hope you enjoy the rest of your day.

Awesome. Tobias, really appreciate it. Thank you so much.

Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and The Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Email hosts@dataengineeringpodcast.com with your story, and to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview
Interview with Cliff Crosland
Origin of Scanner
Challenges with Existing Log Tools
Data Acquisition and Integration
Indexing and Search Optimization
Evolution of Scanner's Focus
Security and Compliance Considerations
User Experience and Workflow
Innovative Uses of Scanner
Lessons Learned and Future Directions
Closing Remarks