Honeycomb Data Infrastructure with Sam Stokes

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure.

When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out linode at www.dataengineeringpodcast.com

/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.

Go to dataengineeringpodcast.com

to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. And to help support the show, you can check out the Patreon page, which is linked from the site.

To help other people find the show, you can leave a review on Itunes or Google Play Music, tell your friends and coworkers, and share it on social media.

I've got a couple of announcements before we start the show.

There's still time to register for the O'Reilly Strata Conference in San Jose, California happening from March 5th to 8th. Use the link data engineering podcast.com/strata

dash sand dash Jose

to register and save 20% off your tickets.

The O'Reilly AI conference is also coming up, happening April 29th to 30th in New York. It will give you a solid understanding of the latest breakthroughs and best Also,

if

Also, if you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference happening in Boston from May 1st through 4th.

It has become 1 of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective.

To save 60% off your tickets, go to data engineering podcast.com/odscdasheastdash2018

and register.

Your host is Tobias Macy. And today, I'm interviewing Sam Stokes about his work at Honeycomb, a modern platform for observability of software systems.

So Sam, could you start by introducing yourself? Hi there. My name is Sam Stokes. I'm an engineering manager at Honeycomb.

Honeycomb is a pretty small,

b to b startup in the Bay Area,

and

I've been at Honeycomb for almost a year now. And do you remember how you first got involved in the area of data management?

I got involved sort of by accident.

So I've been a for most of my career, I've been an engineer

who has occasionally pretended to do operations.

And

I've,

for a while, thought that I really needed

a tool that would let me use my engineering instincts

to help me solve operational problems.

And so when I saw a demo of what Honeycomb was building, I sort of stopped them 30 seconds in and, like, okay, I I get it already.

This is the thing that I've been needing for years.

How can I help? And they were like, great. Well,

we built our own data store, and we

really like someone to help maintain it. So you can do that.

And I was like, oh, okay. Sure. I guess I'm a data engineer now.

So, yeah, I I found myself maintaining a custom data store implementation.

And,

it turns out if you carefully

scope what it is that you're doing and you make it very clear that you're not building a general purpose database,

that it's not as crazy a thing as it sounds to do that.

But, yeah, it's definitely been a learning experience of, okay. Again, this is what it means to actually be in charge of

a data store implementation and a significant amount of customer data. Yeah. The sort of general wisdom is the first rule of building a database is don't build a database.

But the second rule is that if you if you know that you really need a database, then go ahead and build

it. Yeah. And if you have to build a database, we'll call it a data store because and people don't think you build a database.

And so you've mentioned briefly that Honeycomb is a tool for being able to get observability

into your software systems.

And I'm wondering if you can describe a bit more about what Honeycomb is as a company

and what you mean when you say observability.

It's a great question. So what Honeycomb is it's easier to explain what Honeycomb does. What Honeycomb does is we let you query

the events that your own system is producing.

So what's an event? Well, let's say you run a web server or a web app that consists of a bunch of components, 1 of which is a web server.

So an event might be I process the request from a user, and that request is

serve the sign up page,

on this server,

and it took this long, and

the user was on an Android device and, you know, lots of other details that you might have about the processing of that request. So that's an event. So to use Honeycomb, you

have your services

send us events.

Everything

that happens

is an event that you might wanna tell us about, and then we let you run queries on those events.

And

a query you might wanna ask

is

how many events did I process this week,

you know, per day? Show me a time series graph of my traffic this week. And so we can we can give you that. You run a count query that just says count my events over every 15 second interval in the last week, and we can give you a time series graph. But then because you sent us all these details along with the events, we can also do things like show me

the 95th percentile

response time of the events that I served this week for every 15 second interval over the last week. Now you have a response time graph as well. And these are the kinds of things that you get in a traditional operations tool. Now you have traffic and you have latency. You could do errors as well. And then with Honeycomb, you can also ask questions like, okay. Give me given that 95th percent 95th percentile response time graph, break that down by

mobile platform. Show me how fast Android users are versus iPhone users, or break it down by geographical

location or

show me the top 10 slowest users. And so that's, in a nutshell, what Honeycomb is trying to give you is the ability to

ask these sorts of operational questions about your system,

but in a very flexible way, which doesn't

require you to sort of predefine a whole bunch of metrics front.

And that is sort of what we book about that's sort of what we mean by observability

as well. It's the ability to

ask questions about your system and see what it's doing

in in real time and in a fairly fine grained way. My system is doing all these things. I have a web server. I have a database.

I have,

a microservice for doing

authentication,

I have a microservice for,

let's say, verifying that I'm not doing fraudulent activity. And all these things are opting in concert with each other and 1 of them is calling the other, and they they go up and down independently, and they get deployed independently, and 1 of them is an API that I'm calling from some other company.

How do I know what's going on? I need to be able to look at my system

in the same way that an engineer developing

developing software on their laptop would run a debugger against their code or try running it in a test mode and see what happens. We wanna give you the ability to do that

in production, to just ask, what is my system doing right now? What does normal look like? What does my system look like when it's processing normal load?

What changed when I rolled out this new version of software?

What changed yesterday when this latency spike went to, you know, double normal latency or something like that?

So, yeah, there's definitely a lot to dig into in terms of

the ways that you can

structure the event data that you're sending into the system to be able to,

then gain interesting insights out of it. But,

for the time being, I'd like to focus on the path that the events take from

the client system

all the way through your data infrastructure

to the point where it's being rendered out as a data point on a graph for somebody who's trying to debug a systems issue and the different systems that you use to be able to manage that event data?

The life cycle of an event looks like this. First of all, the system that is processing an event

the system that is originating an event needs to tell us that an event happened. So we give you a few ways of doing that. We give you, we we provide SDKs,

which you can use to instrument

your own code if it's if you're running code that you wrote and you have access to add instrumentation,

in which case you have an SDK, which will let you do things like emit an event

that this thing happened with the following properties.

So that's 1 way that you can send us an event.

But we also provide an agent which can take

if you're outdoing things to log files,

we have an agent that can tail those log files and

parse them into structured data, which then gets submitted to us as events.

So 1 way or another, you send us an event.

That looks like

making a post request to a

web service that we provide. So you send us an event. Events look like JSON blobs. So they have a bunch of different properties. You know, property might be the response time something. It should be numeric.

It could be a string, you know, what path did you hit in the web server.

It could be, you know, any any property about, you know, what what server was running incident. So you post that to us.

We have a service that sits on that that receives all the events and, basically, just dumps them straight into a Kafka topic. So we rely pretty heavily at Honeycomb on Apache Kafka, and all our events through 1 big Kafka topic that partition into lots of partitions.

And consuming from that Kafka topic, we have a service that we call retriever.

I mean, retriever is our custom data store. So retriever

consumes events from the Kafka topic

and stores them

onto this. We can go into

some interesting things about the designer retriever, but, basically, retriever stores events on disk. We store every single event that you send us, and the idea of storing every single event is the only way to let you ask

questions about your events in a flexible way other than having to

sort of decide what you're going to pre aggregate and ask questions about upfront. The only way to,

let you ask questions flexibly is to just store everything you send us and then query them all and do any aggregation. So if you're asking, you mean the average of something, we compute that average at the time that you ask the question. So retriever does 2 jobs. Retriever consumes from the Kafka queue, receiving events as they come in and writing the disk, and then retriever also serves queries. So when you run a query, that gets,

sent to our web service. Our web service forwards the query onto retriever. Retriever reads events off disk according to the query parameters that you specified.

And let's say you were asking about something like, give me the average response time, all events

that were served from an Android device. So retrieval read all of your events and apply that filter so it will read event by event, discarding any events that wasn't read from an Android device, and it will keep track of the response time of each event and keep track of the number of events that it read so that once it's read everything, it can sum up all the response times and count up all their events, divide sum by the count and the average, and then serve that back up, for example, in a time series, if that was the query that you were running, and then you get a graph out at the end. And then we have a pretty rich

visualization

system which can take those graphs and render them in lots of different ways and that you do things like compare different breakdown routes against each other.

And so as you mentioned, the data that you're posting into Honeycomb is just JSON. So I'm wondering if you can describe a bit of about the characteristics

of the event data that you're dealing with and the challenges that it poses

when you're dealing with

large volumes of it, particularly given the fact that there isn't necessarily any inherent structure to it. You may be dealing with sparse data because not every event is going to have the same keys in it to be able to look up. So

the nice thing about

column oriented storage is that let's say you're asking query you're running a query, which is something like give me the 95% of our response time of all events

that

had no error. So

you could serve that from any kind of storage engine. Right? You run over all of the events, you run some filter. The advantage of column oriented storage is all of those other fields, the path, the surname,

the,

mobile platform.

I don't need to read any of that data when I'm answering the query

of showing me the 90% of response time filtered by,

there not being an error. And that sounds pretty good for the example that I gave where I have maybe, you know, 6 fields, and I only have to read through them. But

it's quite common for us to get events that have 50 fields or a 100 fields, and so that means we're

disregarding

about 90% of the data that you sent us

for most of the queries that you're likely to run. And in turn, that's a really big speed up because a lot of the cost of running this kind of analytic queries is the IO cost of just reading the data back off disk. And so that's 1 of the ways we deal with this sort of sparse,

schema less sort of data. It's just by storing it in a way where at read time, the schema doesn't matter very much.

We only use the schema to tell us what they need to read according to the equipment in the run. Working particularly in an operations context, but also who are dealing with

unstructured data or loosely structured data

are probably familiar with systems such as MongoDB or Elasticsearch

for being able to handle this schemaless nature, but

the advantage, it sounds like, of having the custom built data store

is that the usage patterns that you're dealing with for being able to answer these analytical queries focused on an operations context as opposed to a more texturally oriented system like Elasticsearch, which people might use for logs

or a more general purpose

queryable document store like MongoDB where people are looking for the entire document to be returned at once

is that because you're not trying to optimize for

doing substring

matches

with free form text fields.

Right. Exactly. So if we're asking the question of give me the 95th percent of our response time, then we know that that's all we need to keep. We can go event by event, just keep those

floats,

run a streaming aggregate

that doesn't need to use lots of memory because what we're doing is adding 2 numbers together

and then do 1 division at the end. So, yeah, we can read massive amounts of data with constant memory usage.

And 1 of the problems that's often faced when you're working to

gain visibility

into

an operational system

is that oftentimes it can be difficult

or time intensive to actually add all the instrumentation

code to the software that you're running or the underlying,

operating systems or cloud platforms. So I'm wondering

what you found to be the most difficult aspects of collecting the data and whether you have any specific tooling that simplifies

that work for the end user who's trying to integrate with your system. This is definitely an area where people have usually invested

gathering

some kind of data from their systems. And so, often they've invested a little bit of effort into

gathering some kind of data from their systems,

and

sometimes that is helpful because they have just enough data coming out. For example, if they're emitting something into logs, if we're lucky, there's enough structure in those logs that we can run our agent against those logs, parse out in the data to get something useful out, and

we can turn that into the structured data that you really want and then let you query against those.

Sometimes people invested a lot of effort into something like a metric system,

and

metrics is something that looks superficially very similar to what we're providing. You end up with time series graphs at the end, and You can compare 1 graph against another.

The problem is Metrics is about pre aggregating things. So Metrics, you might say, I saw 1 request

from

this user in,

Europe,

and

you

increment a counter for that and you

add some value to a gauge about you know, you add an aggregate thing to the response time for that user in that geolocation.

If you then want to split that

metric by some other fields like mobile platform, well, you didn't gather that data in the first place, so you don't have the data to go back to. So sometimes if people have invested a lot of effort into instrumenting the code for metrics,

we find that they don't want to go back and do the effort again in order to emit structured data so that they can send it to us.

So, yeah, instrumentation is something that

it's hard to get away from. You have to put some effort into instrumenting your code.

However, you can often exploit

details of the platforms that you're running on.

For example, if you're running on a

web

like Rails or Django,

they tend to provide a lot of structure around this kind of thing already. If you're running arbitrary software, there's

all kinds of stuff you could be doing. But if you're running a Rails app, then you know that your system is mostly processing the requests and that every web request has a bunch of properties which the framework knows about. It went through this

code component.

It hit this particular endpoint.

It took us to run. It spent this long talking to the database. It spent long talking to

the, template rendering layer.

It rendered this particular template.

It was done on behalf of this user.

So if you have structure in your system that you can make use of, you can often

get a lot of instrumentation with a lot less effort, and we're looking at how we can provide better tooling that can sort of take advantage of that structure that's there.

In the

instance of instrumentation,

there are some things that you can hook into at the platform level.

So 1 example, if you're running on a runtime platform like Heroku,

Again, Heroku gives you a lot of,

logging

and instrumentation

just by virtue of how the Heroku platform is architected so you can subscribe to their,

log Logplex

system, which emits all kinds of

event data about every request that gets rooted in and the metrics using the database product.

If you're using a platform like Kubernetes,

there's lots of telemetry that gets emitted from all of the different components of Kubernetes that we can use to get an idea of what's going on in your application.

But the real power is

being able to tell us

not just the generic details in every web application that's processing web requests, but

what is unique to your business or the problem that you're trying to solve.

User ID is an obvious thing. If someone's logged in

while you're processing a request, then it's really useful to tell us what's the user ID, what's the

characteristics of the person that you're performing the request for. But maybe your business is dealing with shopping carts. Maybe you don't care so much about

the person who's logged in. You do care a lot about how many items they have in their cart

or

what product they were looking at when they clicked on a different product or something like that. And that's something that is

really, really powerful when you have access to that sort of data. If you can say something like, if you're troubleshooting some kind of performance issue in your site, in your ecommerce site,

and you can say, it looks like it was running slow,

Show me the top 10

slowest products that were involved in

in people's shopping carts. Maybe you've got some product that's really popular and lots of people are looking at it, and that's blowing out some cash somewhere. That thing is really powerful, putting a business level piece of information into your operational data.

But that's really impossible to do unless you can write instrumentation. So

it it does come down eventually to showing people the value they can get out of instrumenting a code, so that we encourage them to do it. We've been focusing a lot of the discussion on being able to gain observability into user facing systems, but it seems like this would also be a very useful tool for somebody who's doing more of a data engineering workflow where you could potentially

emit the various events from different stages of airflow jobs to be able to gain visibility into how hit the health of that system or being able to

track some attributes

of your data ingestion pipeline to make sure that that's functioning properly. So I'm wondering if you can talk a bit about

some of the ways that Honeycomb benefits that type of workflow.

Absolutely. I can give 1 example from our own internal usage of Honeycomb.

We were

trying to roll out a new

a new storage method in our own internal data store. The same thing I've been talking about really, but we wanted to change the way that that data store

stored strings.

And I won't go into the details of what the change was, but it's a change to the way that we took events and sold them on disk. And we wanted to roll that out in a fairly cautious because

we were processing a lot of customer data. We wanted to see if there's gonna be any performance impact.

So

we

use feature flagging

as a practice

reasonably

consistently at Honeycomb.

We use a service called LaunchDarkly, which is a feature flag as a service. And

the way we decided to roll out with store exchange was to put the change under a flag

and then flag it to 10%

of customer datasets.

And this was a change that, in theory, was just under the hood that customers wouldn't see the difference. We just wanted to see was there a performance change. So we flagged out a change to 10% of datasets,

and then we just ran it for a week and

let us gather data about the performance of our own system. And then we can start asking questions.

What does this thing look like on both sides of this flag? What does the average time to write a record

look like

for

datasets that have this new storage feature turned on versus datasets that have this feature turned off?

What does the average size of data look like for things on those sides of this flag? So, yeah, that's 1 example where, again, it's useful to be able to break this down by lots of different facets.

1 thing we found was that at 1 point, we decided that this 10% rollout was going well, and we wanted to roll it out further. So we flipped from 10% to Netasend,

And we found

that the change was supposedly neutral.

We didn't think there was gonna be a performance impact, but when we flipped from 10% to 90%,

it seemed like a lot of things got slower. A lot of metrics that we were breaking down this flag got a lot slower. And we're like, wait. This this doesn't make sense. We didn't think this would have all this impact. So we did a lot of head scratching. And then eventually, we're, well, what do we actually change? We change we went from a scenario where 10% of our customers were on this this old system and 90% were on sorry. 10% were on the new system and 90% were on the old system. And then we switched over to most of the customers were on the new system. And what we realized was 1 of our largest customers

who was writing a much larger volume of data had not been in the 10% we'd chosen, and that customer doing all the metrics.

And so because we're in the majority, when the majority had the old system, the old system looked slower. But once we switched the new system to being the majority, the new system looked slower. The difference actually was nothing to do with whether the change was in effect or not. It was just what side of the flag was this particular customer on, and that's the kind of thing where

only because we could break this down by customer could we reveal that difference so we could see this apparent change in performance.

But then when we broke that when we broke the performance graph down by customer ID, we could see, oh, right. Now we can correct for this difference.

We can see who's on what side of the flag. So that's that's 1 example where

having access to more data gives us the ability to see really what's going on. You you were asking very specifically about data pipelines, and I think I didn't answer that part of the question. Actually, I have a I have a little more to say about that part if you think that's an interesting direction.

Sure. So, yeah, that's definitely

very interesting to hear

a specific example of how you're actually dogfooding Honeycomb in order to make the Honeycomb service itself,

run better. So I'm wondering if you have any other, sort of anecdotes

about using Honeycomb to improve Honeycomb.

We had 1 example where we wanted to

really optimize the performance of our query. So core activity of somebody using Honeycomb is to run a query

against the data that sent us.

So they go into our interface. They specify query.

We

take that query, send

send it to retriever, retriever queries or events. We send you back graph, and

that wasn't as fast as we wanted it to be. So, fortunately, we were instrumenting our own system, and so we had lots of event data for every query that somebody was running. And so we could ask questions like, what is the 95th percentile time to run a query? And let's say that came back in 5 seconds. So we could say, we're actually doing a lot better than that these days. But let's say that it came back in 5 seconds, then we can say, well,

19 out of 20 queries are going back in 5 seconds, but we'd like that to be better. So now we can break down and say, well, show me all the queries that are slower than that, and where are they getting stuck. And it turned out that

the details of how our front end are you know, we have a JavaScript

web application that's talking to our back end running queries.

And 1 of the details of how that works is it fires off a query to the back end, and then it polls. It just checks repeatedly to see whether the query result has come back here. And it turned out that a lot of the delay

in running queries was just

that polling was running too slowly. We were only checking,

you know, 1 time every second. And a lot of our queries actually were quicker than that. A lot of our queries were coming back in under a second, but we were still waiting,

you know, a second plus the race condition of, you know, we fired off the query and then we decided when to start that second timer. So we were waiting longer than we actually needed to to see when that query came back, and that came clear when we could look at the details of the, the performance profile. And we just dropped that polling time down to something like 200 milliseconds.

And suddenly all our queries were that much faster. It might sound like an obvious insight, like, you poll as quickly, but it's the kind of thing where it could just be polling every millisecond, but that would obviously be too fast. It's certainly with data that you can understand what's the right polling interval in order to not overload your system, but also to,

not be unnecessarily waiting for stuff. And in and by doing that, we were able to

achieve a pretty, pretty significant speed up. And by running that kind of analysis, we were able to achieve a pretty significant,

improvement in the

customer perceived performance of our system.

And as we've discussed briefly so far,

Honeycomb is a platform that operates in a space that's similar to a number of other systems that fill parts of what it can provide. So I'm wondering if you can call out any specific

offerings that are either available off the shelf or self hosted capabilities or as a platform as a service or a software as a service

that you consider to be the closest analogs to what Honeycomb can do and then maybe also call out when Honeycomb is not the right tool for a specific task?

The systems we're most common compared to are, on the 1 hand, log analysis tools such as Splunk

on the

service,

software as a service

side or, the

Elk Stack,

Elasticsearch,

Elk stash Kibana,

in the category of log log analysis tools. And then on the other side, people often compare us to,

metrics tools such as Datadog. Honeycomb is trying to help you with observability,

and observability is really good for

asking questions about your system.

And what that means is

you're really good when you don't know where the problems are yet. You've

instrumented your code. You have lots of data coming in, but you don't yet know exactly what's going to go wrong.

And it turns out that a lot of systems people are building these days fall into that category.

People doing microservices,

they're running in containers, they're running distributed systems across big fleets of of stuff. They're running

serverless,

which,

you know, in theory, makes your life easier but actually introduces new category categories of problems.

So there are lots of things people are doing that are introducing complexity, and that's a world where it's really useful to be able to ask questions where you didn't know the answer upfront. That said, there are plenty of worlds where you do know what's going to go wrong. If you are a media property and you've already

built out your site and you've already scaled,

you know, if you launch a brand new newspaper, you probably don't have any users to start with. But by the time you've got some traffic okay. You have a 1000000 readers,

and you know that you're gonna get this much traffic except when there's a big news story, and then you're gonna get 4 times that much traffic. And you know exactly what's gonna go slow at what point. And if you're in that world, you can pretty much get away with the next tool. You can just

instrument the things that you already know are gonna go wrong, and make sure you have an alert for when that thing goes wrong. And when that goes wrong, you know what you need to do to fix it.

This is also the case if you have a

you have a software system that you've pretty much stopped developing.

You've finished making all the changes to it. You're not planning to add new features.

You don't need to scale it up anymore. You're just ready to hand it off to an operations team to just keep it running.

Again, in that case, if you're in a known scenario where

the failure mode and known and understood and the, you know, the runbook is easy to write, you know what you need to do in response to an even alert, then you don't really need to do like Honeycomb in that scenario. Another example with would be if your data really is unstructured,

if you are emitting lots of text rings, or if you just

if your logs are

really need human attention. They're the kinds of things where you can't turn it into metrics. You can't turn it into

structured data. You just need a human to look at every single log in order to,

for example, untangle some chain of events that led to a certain thing happening.

And there's

you're at a small enough scale where you can afford enough humans to read through all of the details.

That's also something where you may just want to really keep all those lots of things and have someone go and read the things that happened at 4:30 AM last night.

And so it seems like Honeycomb

is useful as a complement to an existing suite of metrics

tracking

where you can use it for being able to prepare for

the sort of 4th quadrant of the conefin framework where you're trying to be able to manage the unknown unknowns

and possibly some of the known unknowns,

but you can then still rely on your existing metrics and logging platform for giving you the information about business as usual so that you can track that but still be able to get alerted about anomalous information from that, but then use Honeycomb to dig into those specifics?

We we love these existing tools, but the funny thing and I guess part of

our long term philosophy is that the knowns and the known unknowns,

you can usually automate. The easy problems are the ones which you shouldn't be spending human effort on anyway.

The problems where you really need to know about are the unknown under the unpredictable things.

So, obviously, we exist in a world where people have lots of tools that they're invested in.

But at some point, if you have humans being paged

about predictable outages,

then you're just waking up the things that you could have had you you could have had a machine solve. If you have an outage that happens once a week or once a month and you're waking people up for that, that's just sort of a sad use of humans. We would like people to instead make sure that there are only paging people about stuff that really wasn't expected. And what have been some of the most challenging aspects

of building and scaling and marketing the Honeycomb platform?

Definitely getting people to

believe that this is a problem that they have. We come across lots of people who, as soon as they see Honeycomb, they're like, yes. Okay. Great. I'm finally glad I have a solution to this problem.

And those people have been easy to sell to.

But other people have been

we have we have customers who have,

invested large amounts of time and money into

running a gigantic

ELK Stack cluster. And they're throwing hundreds of machines at this thing, and they've thrown engineers at customizing

their ELK Stack to work in the particular demands of their environment.

And the users of the system don't really like it. The engineers who are working on it love it because it's a really interesting system to work on. It's really cool to kind of run your own log analysis system, except that that's not what these companies are. That's not the product these companies sell. Right? This is entirely

internal tools use case that they've decided to invest a lot of money in.

And these people don't necessarily know they have this problem. They just think it's

this is just a difficult thing to do, and the best thing they can do is throw money at it and convincing them that, no, there is a better way, that you just need to look at things in a slightly different way and, invest a little bit in instrumenting your system and in changing the way that your systems talk to the world. Yeah. That's that's definitely the main part of the challenge is just helping people to understand where this fits in alongside the systems that they already have, especially when they've invested a lot of effort into overcoming the shortcomings of those existing systems.

So what is it about

the

high

cardinality

of the data that you're managing that makes it such a difficult problem that it hasn't been solved in a comprehensive manner by anyone else before or at least in a satisfactory manner?

High cardinality

is these terms that makes a lot of sense to certain people and makes no sense to certain others.

So if you are a user of a metrics product like Datadog,

you've probably come across the, quote, unquote, high cardinality problem. And this is the problem where

let's say you are you have a graph of response time,

and then you have a graph of response time for each server

because sometimes you want to see if 1 particular server is running slow.

And maybe you have a graph of response time for each

piece of your web application.

And then someone says, yeah. Okay. But I'm only interested in this piece of the web application, but I still want to know if it's running slowly on a particular server.

So now you need a graph of response time for each piece of the web application for each server.

And if you have a 100 servers and your web application has 10 points, now you have a 1, 000 graphs.

And then someone says, yeah. Well, what about if it's running slowly when the customer is coming from a different

geographic area?

Okay. Now you have a 1, 000 times 7,

how many geos your company is in. And then someone says, well, which is this which user is seeing the slowest performance?

And you're like, well, I don't have a graph for that.

And they're like, well, can't we just track a graph for every user?

And so now you end up with a 1000000 users times a 100 servers times 10 endpoints times, and you end up with millions of graphs. And millions of graphs has a couple of problems, and 1 is just no human can make sense of mix of us. And there's just a a data management angle on the front end. Like, how do I even find the right graph to look at when I have so many graphs? And then on the back end,

metrics products

basically consume storage

and other resources

according to how many graphs you're tracking.

And if you have a 1, 000, 000 graphs,

most of them are not set up to handle linear graphs. Most of them are set up to, you know, track thousands or tens of thousands.

And what we're really talking about here is

combinations.

Right? You have a combinatorial

explosion of how many dimensions

you need to keep track of, And that is just a problem for anything which consumes resources according to the number of graphs you're trying to plot. So that that's what can be difficult about high cardinality.

If the way that you consume storage is

not affected by that combinatorial explosion,

then you have a lot more flexibility. And this is why we store every single event. It seems like really expensive thing to do. You know, can't you just roll these things up in advance into a number, and then you don't need to store all the details? But, actually,

if you're interested in these fine grain combinations,

what if this user was having a problem every time they happen to run-in this availability zone, for example? Then it's actually a lot cheaper to not have to precompute all of those different combinations of things if you can just ask all of that as a post hoc filter arbiter. So that's the high cardinality problem. Why is that interesting? Because

often the high cardinality fields are just to be clear,

when we talk about cardinality, we're talking about

for some given dimension of the data, the cardinality of that dimension is how many different values can that dimension take on. So 1 dimension might be the HTTP status code. That's not very high cardinality. Right? It can be 200. It can be 404. It can be 500. There's a few other options,

but, you know, you've only probably seen maybe 20 status codes in the wild. And then a very high cardinality field is something like user ID, if you're lucky, if you have a lot of users.

And it turns out the things that are most interesting to your business are usually high cardinality.

The

user the unlucky user who is hitting an edge case in your system and can never get site to load, well, that user is permanently unhappy. And if you look at a graph of, like, average user happiness, you don't even see that person. It looks like your site is mostly fine, but that user is gonna go on Twitter and talk about how your site is always down. And, you know, if they're if they have a lot of followers, then you have a problem there. Or if that user was gonna buy something and they don't because they can't load your shopping cart,

again, you have a problem. So

being able to look at your your most effective use is 1 of those cases where you can get a lot of interesting business value out

of asking these questions about things that are naturally high cognitive,

and that's why we think this

is something that people need to

understand is possible,

and

they need to start thinking about what questions can I ask about

about high cardinality fields?

And are there any other aspects

of the Honeycomb

platform or the underlying infrastructure

that we didn't cover yet that you think we should talk about?

We we try not to be too inventive with regards to our infrastructure.

We've already invested some resources into building our own data store, so we try not to be too clever about the rest.

We run, we run, we run for pretty much all of

our infrastructure.

So our web servers run Golang. Our

We use Terraform

and AWS

to manage all of our infrastructure.

So we can do things like spin up new servers if we need to and spin up another copy of our environment,

but we're not being very clever with auto scaling. We pretty much know if our cluster needs to get bigger. We take a pretty pragmatic approach. We have we built some tools to make our lives easier, but we don't rely on them too heavily. A lot of things are just bash scripts and cron jobs,

and it turns out that those things are really effective and you don't wanna mess with them. So for anybody who wants

to follow you in the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes.

And as a final question, I'm wondering if you have any opinions on what the biggest gap is in the available tooling or technology for data management today. I'd say there's a gap in

the affordances

of getting to grips with large amounts of data.

If you have a data scientist, they can do all kinds of

queries with Spark or, you know, some Hadoop job. And there are lots of visualization libraries, but there are a lot of them are pretty much geared towards experts. And it turns out all the questions you want to ask are not expert questions.

They're more like you start with a simple question,

how slow am I,

Then you gradually refine that question.

How slow am I on differences?

How slow am I for different users?

Who are my slowest users?

What are the let's say I found a slow user. What are the properties of that person? And whatever tool you're using,

it's a difficult data presentation problem. How do I understand that? How do I see all of the different combinations?

How do I get a visualization of all of the data that tells me what I want to know, but is easy to manipulate? And there's been a lot of work that's gone into, you know, taking a time series and automatically smoothing it so that you can see the important details, but you don't see the noise or,

query languages that let you express combinations of graphs and that sort of thing,

and that'll

be a really interesting area to watch to see if that can get easier to the point where people who are not expert users of data graph, I think, libraries can use them. Alright. Well, I appreciate you taking the time out of your day to join me and talk about the work that you're doing at Honeycomb.

It's definitely a very interesting product in an interesting problem space and 1 that I'll have to dig a bit more into possibly using in my own infrastructure. So thank you again for your time, and I hope you enjoy the rest of your evening. Awesome. Thank you very much, Tobias. Cheers.

Data Engineering Podcast

Summary

Preamble

Interview

Contact Info

Parting Question

Links