Summary
Customer analytics is a problem domain that has given rise to its own industry. In order to gain a full understanding of what your users are doing and how best to serve them, you may need to send data to multiple services, each with its own tracking code or APIs. To simplify this process and allow your non-engineering employees to access the information they need to do their jobs, Segment provides a single interface for capturing data and routing it to all of the places that you need it. In this interview Segment CTO and co-founder Calvin French-Owen explains how the company got started, how it manages to multiplex data streams from multiple sources to multiple destinations, and how it can simplify your work of gaining visibility into how your customers are engaging with your business.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with O’Reilly Media for the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th. Here in Boston, starting on May 17th, you still have time to grab a ticket to the Enterprise Data World, and from April 30th to May 3rd is the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Your host is Tobias Macey and today I’m interviewing Calvin French-Owen about the data platform that Segment has built to handle multiplexing continuous streams of data from multiple sources to multiple destinations
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Segment is and how the business got started?
- What are some of the primary ways that your customers are using the Segment platform?
- How have the capabilities and use cases of the Segment platform changed since it was first launched?
- Layered on top of the data integration platform you have added the concepts of Protocols and Personas. Can you explain how each of those products fit into the overall structure of Segment and the driving force behind their design and use?
- What are some of the best practices for structuring custom events in a way that they can be easily integrated with downstream platforms?
- How do you manage changes or errors in the events generated by the various sources that you support?
- How is the Segment platform architected and how has that architecture evolved over the past few years?
- What are some of the unique challenges that you face as a result of being a many-to-many event routing platform?
- In addition to the various services that you integrate with for data delivery, you also support populating of data warehouses. What is involved in establishing and maintaining the schema and transformations for a customer?
- What have been some of the most interesting, unexpected, and/or challenging lessons that you have learned while building and growing the technical and business aspects of Segment?
- What are some of the features and improvements, both technical and business, that you have planned for the future?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Segment
- AWS
- ClassMetric
- Y Combinator
- Amplitude web and mobile analytics
- Mixpanel
- Kiss Metrics
- Hacker News
- Segment Connections
- User Analytics
- SalesForce
- Redshift
- BigQuery
- Kinesis
- Google Cloud PubSub
- Segment Protocols data governance product
- Segment Personas
- Heap Analytics
- Hotel Tonight
- Golang
- Kafka
- GDPR
- RocksDB
- Dead Letter Queue
- Segment Centrifuge
- Webhook
- Google Analytics
- Intercom
- Stripe
- GRPC
- DynamoDB
- FoundationDB
- Parquet
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Today, I'm interviewing Calvin French-Owen about the data platform that Segment has built to handle multiplexing continuous streams of data from multiple sources to multiple destinations. Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so you should check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai.
Go to data engineering podcast.com/linode, that's l I n o d e, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of the show. Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you're tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need, then it's time to talk to our friends at StrongDM. They have built an easy to use platform that lets you leverage your company's single sign on for your data platform.
Go to data engineering podcast.com/strongdm today to find out how you can simplify your systems. And go to data engineering podcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
[00:01:44] Unknown:
You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with O'Reilly Media for the Strata Conference in San Francisco starting on March 25th and the Artificial Intelligence Conference in New York City on April 15th. In Boston, starting on March 17th, you still have time to grab a ticket to the Enterprise Data World. And from April 30th to May 3rd is the Open Data Science Conference. Go to data engineering podcast.com/conferences to learn more and take advantage of our partner discounts when you register.
[00:02:29] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Calvin French-Owen about the data platform that Segment has built to handle multiplexing continuous streams of data from multiple sources to multiple destinations. So, Calvin, could you start by introducing yourself? Sure. Thanks, Tobias.
[00:02:43] Unknown:
I'm Calvin. I'm one of the co-founders and CTO here at Segment. So I focus on a lot of different parts of the back end, everything from the AWS infrastructure to parts of the architecture, to parts of the data pipeline and product itself. And do you remember how you first got introduced to the area of data management? Yeah. It's funny, actually. We have a bit of a winding story. When we first started out building Segment, the product actually looked nothing like it does today. Instead, we were building this classroom lecture tool called ClassMetric. And at the time, when we started, we were just four students fresh out of college.
And the main reason that we wanted to build a startup was because we were interested in working together, and we thought it'd be a cool thing to do, not because we really understood any sort of data problems or what it would take to run a company. And so we built out this college lecture tool, which was designed to give you feedback during a lecture, and I'll just sort of keep it brief. Generally, it was targeted at students and professors, to help them get more out of their college lectures, rather than the professor saying something and everyone in the class being confused for a few minutes. And we built that out over the summer. We applied to YC, and we got in with that idea. And when we put it back into classrooms at the end of the summer, we ended up just having this huge train wreck of the product itself, where kids would go to Facebook and Google and YouTube, and, in retrospect, all of the places that you'd expect kids to go when they're not paying attention in college lectures.
And so we kinda went back to the drawing board, and we asked ourselves, like, hey, what are the problems that we have with this product? And one of the big ones was that we found it hard to understand what our users were doing. And so we built out these various analytics products for a while, and were trying to compete with different tools out there on the market, tools like Amplitude, Mixpanel, and Google Analytics, and effectively give our users a supercharged version of those. And what we found, after spending about a year and a half building out those different tools, is that none of them really seemed to stick. And pretty much the major blocker to getting our tool installed was actually convincing people to set up this JavaScript and integrate the API. And actually, going back to the very first days of Segment, we realized that we had this problem as well, where we initially were trying to decide between Google Analytics and Mixpanel and Kissmetrics and all these different tools, and we really couldn't understand the differences between them. So at the time, we took kind of this lazy way out where we said, okay.
As engineers, we can't understand the differences between these tools, so why don't we just write a layer of abstraction to send them the same data? And that's really where the idea for Segment was born. We decided, hey, why don't we actually turn this into a product? We think that other people have this problem of managing their data and collecting it all once and then fanning it out to different places consistently. And we launched it on Hacker News as probably this 100-line JavaScript library to do exactly that, where it sent data to these six different tools. And after that point, over the next six months, the idea really blew up, and we found there was a lot more demand for it than anticipated. So to answer your original question around how we got involved with data management, I'd say a lot of it was honestly kind of by accident.
But as we explored more and more parts of the market and parts of what customers want, we realized that actually
[00:06:10] Unknown:
the problems go incredibly deep. Yeah. It's pretty funny how many times I've asked that question and have the answer essentially be "by accident," where somebody set out to solve one problem, and then, by way of just trying to figure it all out, they end up solving about fifteen different ones along the way, and then discover that what they're actually interested in is all of the data problems and not necessarily the original thing that they set out to do. Yeah. Exactly. And so now you've got the Segment platform. You've been around for a few years. It's gained a lot of popularity, and you've added a lot of additional capabilities. So can you give a bit of an overview of what Segment is now and what the main sort of motivating factor is for the business and, along the way, maybe discuss some of the primary ways that your customers are using the Segment platform?
[00:07:01] Unknown:
Sure. Happy to. So today, Segment's primary product, the one that the vast majority of our users are using, is a product that we call Connections. And, effectively, what Connections does is it gives our users a set of libraries and an API to collect first-party data about their customers. So this is data about what their users are browsing. Perhaps, let's say, they're running a music app; maybe they're tracking things like users listening to various songs, adding songs to playlists, following one another, adding friends, that sort of thing. We help companies collect that sort of data, and then we help fan it out to over 250 different tools that they might be using. And kind of the key idea here is that our API is heavily focused around this idea of user data. This is users performing events on your website or mobile app, as well as understanding who they are, information that you might have collected, like their email, and then taking that same data and putting it in different tools by function. So let's say you're running a company where you have a sales team, we'll help to get that data into Salesforce.
Let's say you have a customer support team, we'll help put that data into a tool like Zendesk or Help Scout. Let's say you have analysts who really just wanna dig into kind of the raw data, we'll help get that data into a tool like Mixpanel or Amplitude for doing analysis. Or if you really wanna build a custom data pipeline, we'll help integrate with data warehouses like Redshift and BigQuery, and streaming tools like Kinesis or Google Cloud Pub/Sub. And so, Segment at its core is kind of this hub or router for all of your data, a single place where you can send it, and then, as you said, multiplex it to many different destinations. On top of that, we've also built out a couple of additional products. One of those is called Protocols, which hooks right into our customers who are using Connections. Protocols effectively gives you rules that you can use to specify where your data should go and what it should look like. So with Protocols, for example, you could create what we call a tracking plan saying, hey, I'd like to send these ten events, and here's which properties should be included with each of those events. And, effectively, it helps replace these kind of QA systems that a bunch of our customers have built, where they typically have a Google spreadsheet which indicates the schema for all their data, which then gets passed over to an engineering team to implement, which then gets passed over to a QA team to check and make sure that all of those events are properly firing. We kind of combine that into one single place with Protocols. The other add-on product that we've added is called Personas, and this product is particularly focused on marketers rather than our core engineering audience.
And it helps marketers create profiles of their users so that they can then basically advertise to those users or create audiences of them elsewhere, or send them all a bulk email if they match certain rules.
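To make the event collection model Calvin describes a bit more concrete, here is a minimal sketch of tracking the kind of music-app activity mentioned above using Segment's Go client library (analytics-go). The write key, user ID, event name, traits, and properties are all illustrative, and the exact client API should be double-checked against the library's documentation.

```go
package main

import (
	analytics "gopkg.in/segmentio/analytics-go.v3"
)

func main() {
	// Placeholder write key for a Segment source.
	client := analytics.New("YOUR_WRITE_KEY")
	defer client.Close()

	// identify: who this user is, with traits such as their email.
	client.Enqueue(analytics.Identify{
		UserId: "user-123",
		Traits: analytics.NewTraits().SetEmail("jane@example.com"),
	})

	// track: what the user did, with event-specific properties.
	client.Enqueue(analytics.Track{
		UserId: "user-123",
		Event:  "Song Played",
		Properties: analytics.NewProperties().
			Set("song", "Midnight City").
			Set("playlist", "Road Trip"),
	})
}
```

From the customer's point of view, the instrumentation only has to be written once; Segment then fans each call out to whichever downstream tools the workspace has enabled.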
[00:09:58] Unknown:
And one of the things that you mentioned in there, particularly as it pertains to the Protocols product, is the need to have some sort of control or standardization around the format of events and the different properties that you particularly care about, and then be able to do some measure of event routing. So I'm wondering what you have found to be some of the most challenging aspects of managing these multiple sources for events, trying to ensure some measure of consistency so that you can have reliable information on the other side, and ensuring that you're capturing all of the meaningful actions that take place without necessarily having to, as you said, go through that entire process of requirements gathering and then engineering and then implementation?
[00:10:44] Unknown:
Yeah. It's an interesting question. Because, at least for myself personally, I hadn't realized how deep this rabbit hole of getting, kinda, quote, unquote, clean data really goes. After we talked to a bunch of our customers who are using the core Connections product, what we heard from a really large critical mass of them was that Segment is great for collecting all of my data, but as I spread it across my organization, and as I spread it to tens or hundreds of different business units who might be spread across ten or fifteen different time zones, it gets really hard to coordinate exactly what data everyone is sending. And so Protocols was really our answer to that question from a product perspective. I think in terms of seeing people use it, it's been interesting, first, to understand the depth and variety of use cases. We have some customers who use Segment, and they have a tracking plan which is incredibly detailed and complex, and they're actually using Protocols in a large number of areas to better be able to filter and basically whitelist certain events and blacklist others. Another one that we've heard from a variety of our users is around the filtering of personally identifiable information, or PII. A lot of our bigger customers especially have this problem where they want to be able to send certain events to their infrastructure, places like Redshift or Kinesis or S3, where they kind of trust and know all of this data, and they want the full event for that. They wanna know everything about the user. But for places where they're sending it out to other tools to do analysis, they wanna be able to limit what data is sent, and they wanna be able to impose rules that say, hey, for anything which is identifiable to the user, strip this out, and just let us do aggregate analysis in these tools, like Google Analytics, for example. So I think that's probably the biggest major trend that we're seeing in the ecosystem: as we explore these use cases much more deeply, and especially as privacy becomes something that's more top of mind for a bunch of businesses out there, we're starting to see that shape some of the rules that customers want to implement, and we're figuring out how best to productize that and make it really easy to control your data.
[00:12:58] Unknown:
Yeah. I could probably spend the whole episode just talking about event structuring and contexts. And I know that in some cases, particularly where people are using something like Google Analytics, there's a default amount of information and a default way that the event is structured. But depending on your particular needs as an engineer or as a business, you might want to collect additional information or structure the events slightly differently, or be able to perform some more complex analysis than what you can get from those sort of default events. So I'm curious what your approach is in terms of what you send as a default when somebody just adds the Segment tracking snippet without necessarily adding any additional customization?
And then what your thoughts are on the other end of the spectrum with something like what Heap is doing, where they just collect every event ever by everybody by default. There are kind of two perspectives on this, and two approaches. Kind of from the start, Segment has
[00:13:55] Unknown:
focused on appealing to customers where they're very deliberate about what they track. So for the most part, the libraries that we give to our customers and that default snippet don't actually track a lot for you. Instead, we found that the most successful customers end up thinking pretty deeply about which metrics matter to them, which events they wanna be tracking, and what their funnel looks like. And so for the most part, collecting data with Segment is all opt-in, and that's a very intentional choice. Because, kind of at the end of the day, what we found is that for most of our customers, if we were to auto-collect everything that was going on on the page, it wouldn't actually make sense to them at the end of the day, and we wouldn't really know their business as well as they do. So for the most part, we've mostly skewed towards customers having to be explicit about what they track, having to create well-defined schemas themselves, and thinking carefully about what those things are. But I think ultimately, it makes customers more successful. I'd say, on the other side, we've also experimented a little bit in certain small areas, where the surface area is well defined, with creating some of those auto-tracking pieces. So a good example would be our WordPress plugin. There are a lot of WordPress sites out there, and each of them kind of looks vaguely the same, where you have these page views and these blogs. And depending on how the WordPress site is configured, we can automatically give you good things there. So in areas where the scope is a little bit more defined, for instance, like WordPress, and we do this as well with our mobile libraries, like iOS and Android, we'll focus actually on pre-generating those events that we know pretty much everyone wants, but we'll kind of leave the things that are more specific to your business up to you.
[00:15:34] Unknown:
Once you generate those events and ship them off, I'm curious what your experience has been in terms of downstream integrations being able to correlate events, either to identify different users as they span different devices or browsers, or to combine different behavioral patterns so that you can perform analyses
[00:15:56] Unknown:
on the types of engagements that are most useful or most valuable. We've seen that be all over the map, depending on what the partner looks like and what the tool looks like and how it's meant to function. In terms of the data that we collect with Segment, we've focused on including two fields with every single call that we collect. One is what we call the anonymous ID, which is just placed on the device and generated either in local storage by our JavaScript or by our mobile libraries. And that is just kind of a random ID that we set that we can tag a specific user's journey with. And so each of those calls from a single device includes that ID. The second is actually the user ID, which a majority of our users set via our identify API call. Effectively, once the user tells us, hey, this customer who's visiting my site has the ID xyz123, something like that, whatever it is pulled from their database, then we'll also persist that, and we'll send it along with each of our API calls. So we effectively provide the plumbing and groundwork to stitch those two things together. It's then another question of how good the downstream tool is at responding to that. And I'd say it primarily depends on what that tool is designed to do. There are tools like Amplitude, where they're really good at actually stitching that behavior together, because Amplitude's whole purpose, or one of them anyway, is to understand how your logged-out users are then converting to be logged-in users. So for them, it's paramount that they do the stitching together well. There are other tools like Salesforce, for example, where it's pretty much useless if you don't have a logged-in user or some sort of email or piece of identifying information like that. So they will explicitly reject any of those calls and not really worry about that data. And then I'd say there are some tools which are kind of in the middle. I group a lot of the raw data category into this, like Kinesis and S3, where we'll send all that data along, and then it's up to the customer whether they want to do that stitching themselves.
A good example of a customer who's really sophisticated here is HotelTonight, who effectively helps you book a hotel tonight via their mobile app. And for them, they've actually built out an entirely custom machine learning pipeline based upon that S3 data. And I can't tell you for certain whether they use the anonymous ID or not, but I'd wager that a bunch of our more advanced customers are using this ID themselves to create even more complex and sophisticated user journeys than the ones that are offered by out-of-the-box tooling.
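To illustrate the anonymous ID / user ID plumbing described above, here is a small, self-contained Go sketch of the kind of stitching a downstream consumer might do. The types and the two-pass approach are hypothetical, not Segment's implementation; they just show how a device-scoped anonymous ID plus a later identify call lets pre-login activity be attributed to the right user.

```go
package main

import "fmt"

// Event carries the two identity fields described above: every call has a
// device-scoped anonymous ID, and a user ID once the customer has been identified.
type Event struct {
	AnonymousID string
	UserID      string // empty until identify is called for this device
	Name        string
}

// StitchJourneys groups events by user. Logged-in events key on UserID, and any
// anonymous events whose AnonymousID was later tied to a UserID are folded in,
// so pre-login activity ends up on the same journey.
func StitchJourneys(events []Event) map[string][]Event {
	// First pass: learn which anonymous IDs map to which user IDs.
	anonToUser := map[string]string{}
	for _, e := range events {
		if e.UserID != "" && e.AnonymousID != "" {
			anonToUser[e.AnonymousID] = e.UserID
		}
	}
	// Second pass: attribute each event to the best-known identity.
	journeys := map[string][]Event{}
	for _, e := range events {
		key := e.UserID
		if key == "" {
			if uid, ok := anonToUser[e.AnonymousID]; ok {
				key = uid // anonymous activity from before the user logged in
			} else {
				key = "anon:" + e.AnonymousID
			}
		}
		journeys[key] = append(journeys[key], e)
	}
	return journeys
}

func main() {
	events := []Event{
		{AnonymousID: "device-a1", Name: "Page Viewed"},
		{AnonymousID: "device-a1", UserID: "u42", Name: "Signed Up"},
		{AnonymousID: "device-a1", UserID: "u42", Name: "Song Played"},
	}
	for user, journey := range StitchJourneys(events) {
		fmt.Println(user, "->", len(journey), "events")
	}
}
```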
[00:18:28] Unknown:
And in terms of being able to
[00:18:31] Unknown:
handle receiving and processing and routing these different events, can you give an overview of what the Segment platform looks like and the overall architecture and some of the evolution that's happened over the past few years? Today, most of our data pipeline is written in Go. We found Go to be actually a really nice language for building these sorts of distributed systems, not only because it's really fast and simple and compiled, so you just ship binaries around, but moreover because the concurrency primitives in Go are incredibly good. And we found that as we built out and expanded more and more of our pipeline, starting from a handful of these different services which are running to now over 100 different microservices.
We've actually found that Go continues to work extremely well for us, both in terms of the tooling around it and just the simplicity of the language. So in terms of how data enters Segment's pipeline, that first step is what we call the tracking API. And you can think of this as the front door for all Segment data. Again, this is a Golang service, and it's essentially set up just to receive data that is being written into it. It doesn't do a ton of validation. And the reason for that is that no matter what, we always want to be able to receive data, effectively at the edge. So these instances will basically spin up an API which accepts data, and the first thing that they'll do is actually write that data to disk, so that we know, even if there's a network partition or a central database has gone down or there's some other issue with our infrastructure, no matter what, Segment will always be able to receive data. Because these edge nodes are just a stateless API that receives data and writes it to disk, if we need more of them, we just scale up and add more. From there, there are actually a couple of different places where data goes. And the first place where we send that data from these edge nodes is into our main Kafka cluster. We use Kafka extensively at Segment as the primary log for all of our data. And you can think of each of these services effectively working more like a worker, where they read data off of Kafka, do some sort of processing, and then publish that data to another Kafka queue. Using Kafka everywhere has been really nice because it allows us, obviously, to partition the data, where we can send certain subsets to certain consumers and be sure that they get good locality. But it's also nice in that it allows us to actually rewind that data in case we want to deploy a new version of the code. From there, it kind of goes through a couple of different places in our pipeline. We have a step which actually validates the messages, that we call our internal API. We have a step which is running the Protocols infrastructure, which is filtering out that data. And then, actually, a new piece that we added in May of this year is our GDPR suppression step. I'm sure, as you're probably familiar, GDPR was this regulation that rolled out in Europe in May of this year. And, basically, what it says is that any website which is collecting user data for users in the EU has to be able to give those users the right to suppress that data and delete it from all of its collection systems. Now, as a data processor, we figured, hey, this makes sense for us to add as part of the product. So we actually added infrastructure inside our systems which, if an event matches a suppressed user ID, will suppress it at this point, so it won't be stored long term or collected in any of our systems. From there, the last place that this data goes before it actually gets fanned out is into our deduplication system, which I've actually written about a little bit on our blog. The deduplication system ensures that when we see a message, we process it exactly once. And this is especially an issue with some of our mobile libraries.
As you might imagine, for mobile libraries, oftentimes they'll lose connectivity for a variety of reasons. Maybe the user's phone will go through a tunnel, maybe they'll lose battery life, maybe the app will crash. Whatever it is, typically they'll have to send events more than once to ensure that they reliably make it into our system and that we're not dropping data anywhere. And when they do that, we found on average that about 0.6% of events that get sent actually are sent through multiple times. So the dedupe system is responsible for actually checking that data against an embedded RocksDB, and then making sure that we don't send it along a second time if it's already been sent once. So I'd say that's kind of the core ingest side of the pipeline: those four steps, the tracking API, our schema worker to filter out data with Protocols, our GDPR pipeline, and then our dedupe. And after data has made it there, we know that, one, the data is totally clean and valid, and two, that it's fully replicated on disk to Kafka, and that we know we won't be losing it as we fan it out.
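As a rough illustration of that dedupe step, here is a minimal Go sketch of an exactly-once filter keyed on a client-generated message ID. The in-memory store stands in for the embedded RocksDB instance Calvin mentions, and the channel-based worker is a simplification of the real Kafka consumer; the point is just that a retried mobile upload with the same ID gets dropped rather than fanned out twice.

```go
package main

import (
	"fmt"
	"sync"
)

// KeyValueStore stands in for the embedded RocksDB instance; in production this
// would be a persistent, time-windowed store keyed by message ID.
type KeyValueStore interface {
	// SetIfAbsent records the key and reports whether it was newly inserted.
	SetIfAbsent(key string) bool
}

type memStore struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newMemStore() *memStore { return &memStore{seen: map[string]struct{}{}} }

func (s *memStore) SetIfAbsent(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[key]; ok {
		return false
	}
	s.seen[key] = struct{}{}
	return true
}

// Message is one event flowing through the pipeline; the ID is generated by the
// client and stays stable across retries.
type Message struct {
	ID   string
	Body []byte
}

// Dedupe forwards each message at most once: a second delivery of the same ID
// (for example, a mobile client retrying after losing connectivity) is dropped.
func Dedupe(in <-chan Message, out chan<- Message, store KeyValueStore) {
	for msg := range in {
		if store.SetIfAbsent(msg.ID) {
			out <- msg // first time we have seen this ID
		}
		// else: duplicate from a client retry; drop it
	}
	close(out)
}

func main() {
	in := make(chan Message, 4)
	out := make(chan Message, 4)
	in <- Message{ID: "m1"}
	in <- Message{ID: "m1"} // retried upload with the same message ID
	in <- Message{ID: "m2"}
	close(in)
	Dedupe(in, out, newMemStore())
	for m := range out {
		fmt.Println("delivered", m.ID)
	}
}
```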
[00:23:14] Unknown:
And in terms of the Protocols worker, one of the things that I'm curious about is, given that you define the schema ahead of time for the events, if you were to add additional attributes as part of the event generation, will the Protocols layer prevent that event from propagating further along because it has information that you haven't designated ahead of time, so that you don't accidentally start tracking personally identifiable information that isn't under direct purview or governance? Or do you allow it to pass through, or put it into some sort of, like, a dead letter queue for later analysis? Right now, we don't do any sort of dead lettering for it. Instead, what's most popular right now among customers is that they will specify
[00:23:58] Unknown:
a whitelist of fields. And so, to your point about personally identifiable information, Protocols will look and filter out anything which isn't in that whitelist, or, if they have a blacklist, filter out anything which is in the blacklist, and then send on the event with the remaining fields. So it doesn't quite block the data entirely, but it does give you control around what's sent.
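A minimal sketch of that whitelist/blacklist behavior, assuming illustrative property names rather than Segment's actual Protocols implementation: the event continues downstream, but only with the fields the rules allow.

```go
package main

import "fmt"

// FilterFields keeps only whitelisted properties (when a whitelist is given),
// drops blacklisted ones, and returns what remains so the event can still be
// forwarded, just without the stripped fields.
func FilterFields(props map[string]interface{}, whitelist, blacklist map[string]bool) map[string]interface{} {
	out := map[string]interface{}{}
	for k, v := range props {
		if len(whitelist) > 0 && !whitelist[k] {
			continue // not on the whitelist: strip it (e.g. PII headed to analytics tools)
		}
		if blacklist[k] {
			continue // explicitly blacklisted
		}
		out[k] = v
	}
	return out
}

func main() {
	event := map[string]interface{}{
		"song":  "Midnight City",
		"email": "jane@example.com", // PII we do not want forwarded to every tool
	}
	whitelist := map[string]bool{"song": true}
	fmt.Println(FilterFields(event, whitelist, nil)) // map[song:Midnight City]
}
```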
[00:24:21] Unknown:
And I remember recently reading some of the posts that you've written about the rearchitecting of your delivery platform for being able to handle scalability and working with back pressure from downstream systems, where they either aren't able to process events at the volume that you're trying to send them, or they are unavailable for different reasons. So I'm wondering how you were dealing with situations like that beforehand, or what some of the pain points were in your infrastructure that led you to go down that path of dedicating engineering effort to build this brand new system? This is one of my favorite parts of our infrastructure. So far, I've just been
[00:25:08] Unknown:
describing all of the ingest and ingress parts of the system. Centrifuge sits explicitly on the outbound, or egress, side, where we're making requests to all these APIs. And Centrifuge itself was born out of this pretty interesting problem that we were running into at Segment. As I mentioned before, we send data to 250 of these downstream APIs, and actually another few thousand different webhook endpoints. And managing deliverability in those scenarios was honestly getting to be really tough. And the reason for it is that, let's say that each of these APIs is relatively well behaved, and maybe they only go down once per year, which I think is a reasonable assumption. Like, even Facebook, Google, YouTube, those sorts of companies have downtime maybe once a year. For us, in terms of trying to deliver data to each of those places, if you have 250 endpoints and 365 days in a year, that means that we are going to be seeing downtime on our side, assuming they're all just uniformly, randomly distributed, about once every one and a half days or so, which means that our customers are going to be running into issues in terms of seeing their data get to where it needs to go. And so this is actually the part of our product which I think has evolved the most over time, and I can walk through the ways that it has evolved. So when we first started, we had all this data mixed into a single queue. We had a worker which would read each message off of the queue, and it would figure out where to send that data. So maybe for this given event, it sends it to Mixpanel, Google Analytics, and Intercom. And so it makes requests to each of those three services.
And then once it's done, it acks the message. But then we started running into this problem where, let's say, Intercom's API goes down, or maybe it just gets really slow. Well, suddenly, a single API failing has backed up the entire queue for all messages. And, obviously, that's not good for anyone. Right? Especially if we're seeing these downtimes all the time. So we decided, okay, there must be a better way here. What if we actually split up these queues and partition them instead based upon where that data is going? So let's say we partition this data based upon the destination, and we have one queue for Intercom and one queue for Google Analytics and one queue for Mixpanel, with separate workers reading off of each of those. Well, that worked okay for a while, and it solved this problem where one API which goes down only affects data within that queue. But we started running into a second problem related to inter-customer fairness, where, let's say, one customer is sending a bunch of data all at once, and other customers are sending a lot less data kind of intermixed with it. That one customer might be taking up much more of the queue. And if we're keeping kind of a working set of data where we're trying to make those requests and then retrying them, if that customer is getting rate limited and they're seeing a bunch of these 429s indicating that we should retry, they're effectively exhausting the throughput for every other customer. And when we saw this, we said, oh, man, what we actually want here is a queue which is per customer, per destination, where we want, let's say, one queue for Instacart going to Google Analytics, and one queue for HotelTonight going to Google Analytics, and one queue going from Instacart to Intercom, that kind of thing. But looking around at all the different types of queues that were out there, we didn't really find any that would help solve this use case for us. And in particular, Kafka, which is our Swiss army knife of queues, doesn't really scale well to more than a few thousand topics and partitions.
So instead, we decided to build our own piece of infrastructure that we call Centrifuge, which takes a bit of a different approach. And the way that Centrifuge works is it acts kind of like this traffic absorption layer for all of this data that also allows it to reorder that data on the fly. So if we do have a customer who's sending too much data, we can actually adjust where that data is stored, not by shuffling around a bunch of bits on disk, but by changing how we're actually delivering that data. And the way that we end up doing that is we keep this pool of what we call directors, who are there to accept new data and then make those outbound requests to all these third-party APIs, but then we also create a pool of RDS databases.
And so each director is paired with a single database, and it's using the database as effectively its write-ahead log and its ledger for what should be delivered next. And, suddenly, what this gives us is this nice ability where, if we wanna change around delivery order for data that's currently sitting in our databases or our queues, we don't have to shuffle a bunch of bits on disk in Kafka, which would require you to specify, hey, I wanna deliver event A after event B after event C, and rewrite potentially terabytes of data on disk across the network. Instead, we can just change the way that we're querying those databases. And suddenly, without having to move any data at all, we can, on the fly, do some reordering so that we're actually able to get better customer fairness. Now, I'm gonna stop there, because I think there are a bunch of problems related to that that I could also dig into. But if there are other questions, I'm happy to discuss those as well.
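As a rough sketch of the per-customer, per-destination fairness idea behind Centrifuge, here is a toy Go scheduler over an in-memory ledger. In the real system each director's ledger is a paired RDS database and "changing the ordering" means changing the SQL query; the names and the simple per-customer cap below are illustrative only.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Job is one pending delivery in a director's ledger. In the real system this
// row lives in the RDS database paired with the director; here it is in memory.
type Job struct {
	Customer    string
	Destination string
	EnqueuedAt  time.Time
}

// NextBatch picks the next deliveries without moving any data around: changing
// the ordering rule here is the equivalent of changing the query against the
// ledger. The rule below caps how many jobs one customer gets per batch, so a
// noisy, rate-limited customer cannot exhaust throughput for everyone else.
func NextBatch(ledger []Job, perCustomer int) []Job {
	// Oldest first, then interleave customers per destination.
	sort.Slice(ledger, func(i, j int) bool { return ledger[i].EnqueuedAt.Before(ledger[j].EnqueuedAt) })
	taken := map[[2]string]int{}
	var batch []Job
	for _, j := range ledger {
		key := [2]string{j.Customer, j.Destination}
		if taken[key] >= perCustomer {
			continue // this customer already has its fair share in the batch
		}
		taken[key]++
		batch = append(batch, j)
	}
	return batch
}

func main() {
	now := time.Now()
	ledger := []Job{
		{"instacart", "google-analytics", now.Add(-3 * time.Minute)},
		{"instacart", "google-analytics", now.Add(-2 * time.Minute)},
		{"instacart", "google-analytics", now.Add(-1 * time.Minute)},
		{"hoteltonight", "google-analytics", now},
	}
	for _, j := range NextBatch(ledger, 1) {
		fmt.Println(j.Customer, "->", j.Destination)
	}
}
```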
[00:30:20] Unknown:
Yeah. I remember when I was reading through the blog, and I'll add a link to the show notes here too, because I was very pleased with the level of detail that the author went into. And it's definitely a very unique problem, needing to have this proliferation of queues with different priority ordering based on volume and delivery issues and round trip times. And as you said, there's not really anything that covers that use case specifically, and I'm not sure that there are really that many other businesses that are facing that particular issue. So it's always interesting to see some of the custom solutions that companies build to suit their own needs. So in your case, Centrifuge, or in the case of Honeycomb, the database that they built for being able to support sparse matrices of different events for system observability.
So I don't have any further questions specifically about Centrifuge, because I think that you did a good job of covering the details of it, and there are other details that people can look at in the blog post, but it's definitely an impressive feat of engineering, and it's always interesting to see the ways that an engineering team will approach a given problem and reach a particular solution based on the constraints that they're dealing with. And so, aside from the work that you did on Centrifuge, and the challenges that you face both on the receiving side, being able to source data from events that are delivered to your tracking APIs or from being able to pull from source systems, what are some of the other problems that you deal with in terms of deliverability and being able to
[00:32:01] Unknown:
populate events to these downstream systems? The problems on the source side are interesting, because there are sort of multiple types of data which flow through Segment. One is the library sources that I referred to earlier, where for each of those we have to be thinking carefully about what happens if the network goes down or it fails for some reason, or whatever it is that we run into kind of on the ingest side. We wanna make sure that each of those libraries is properly replicating that data if they're on mobile or web, that they're queuing it in a background thread if we're running on a server, because maybe it's generating tens of thousands of analytics events per second, and that, no matter what, it doesn't disrupt normal program flow, because the worst thing that could happen is your app going down because of some analytics library that you're using to record data. And so we keep pretty careful tenets there in terms of how we build our libraries. I think maybe another interesting facet to walk through would actually be the other type of data which Segment ingests, which we call our cloud sources pipeline. And, actually, this pipeline was built out of requests from our customers. A number of them said, hey, it's amazing that Segment is this place where I can collect all my data about what my users are doing on my website and in mobile apps, and I can take that data and really cheaply load it into a data warehouse like Redshift or BigQuery to do this custom analysis. But I have this problem, especially for tools where my code isn't running, or data which I might wanna get at that my customers are generating, that I don't have a good interface for. And in particular, the top tools here were places like Stripe, where maybe users pay you and they interact directly with Stripe because you don't wanna deal with credit card payments, but you don't really know or have a good view into how many were denied payment, or how many have a subscription that's coming up for expiration.
Another one that we heard a lot was Zendesk. People wanted to know, hey, how well is my customer support team doing, and do I have areas of the product which are actually generating substantially more tickets, where, like, a user goes down some sort of path, maybe they skip the quick start of the app, and then suddenly that leads to a 10x increase in ticket volume. So we wanted to take that data from places where it lives, where it's in another app, your code doesn't run there, it's not easy to collect data, and help pull that into Segment so you could actually join it against all of your behavioral data in your warehouse. And so we built that out about two and a half years ago, I wanna say. It's actually gotten great adoption, but it's been interesting to deal with the scaling challenges there. Because, effectively, what we're doing is we are, in some ways, replicating the database from each of these third-party places, bringing that data into Segment, and basically helping you do analysis on that replica of all this data. And so it's actually led us to build a large number of pieces of interesting infrastructure.
Typically, when we pull this data, we'll do some sort of checkpointing, where if we can kind of incrementally sync from the API, we'll run that, and so that might be the most obvious. The second place where we've invested here is in terms of the runner and framework for all this data. Not all of these APIs are nice, squeaky clean JSON APIs. In particular, AdWords has this SOAP API, so if you wanna get any data out of there, you basically have to use their Python SOAP client to pull data. So for that one, we run a Python client. In other cases, we run Node. In other cases, we run Go. And, actually, it's all packaged together in what we call the source runner, which runs as a sidecar to the program, and then it exposes these gRPC hooks so that the program can publish its data, which we then publish locally to a queue and upload to S3. And then from there, we do some out-of-band processing where it actually gets loaded into Dynamo as a deduplication step. Because you might imagine that we load a ticket once, and then maybe we load that Zendesk ticket a second time, but only a few fields have changed. We wanna be able to detect what's new and what's not. So Dynamo acts as that for us before we fully load it into the warehouse.
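To make the checkpointed, incremental pull a bit more concrete, here is a hedged Go sketch of one sync pass against a hypothetical third-party source. The Source interface, record shape, and loader callback are assumptions for illustration; the real cloud sources runner wraps per-API clients (Python, Node, Go) behind gRPC and ships batches to S3 and Dynamo as described above.

```go
package main

import (
	"fmt"
	"time"
)

// Record is one object pulled from a third-party API (a Stripe charge, a
// Zendesk ticket, and so on). Names here are illustrative.
type Record struct {
	ID        string
	UpdatedAt time.Time
}

// Source is the minimal surface an incremental sync needs: "give me everything
// that changed since this checkpoint".
type Source interface {
	FetchSince(checkpoint time.Time) ([]Record, error)
}

// SyncOnce pulls one incremental batch, hands each record to the loader
// (queue -> S3 -> Dynamo dedup -> warehouse in the pipeline described above),
// and returns the new checkpoint to persist for the next run.
func SyncOnce(src Source, checkpoint time.Time, load func(Record) error) (time.Time, error) {
	records, err := src.FetchSince(checkpoint)
	if err != nil {
		return checkpoint, err // keep the old checkpoint so nothing is skipped
	}
	next := checkpoint
	for _, r := range records {
		if err := load(r); err != nil {
			return next, err
		}
		if r.UpdatedAt.After(next) {
			next = r.UpdatedAt
		}
	}
	return next, nil
}

// fakeSource is a stand-in for a real API client, used only for the example run.
type fakeSource struct{ records []Record }

func (f fakeSource) FetchSince(cp time.Time) ([]Record, error) {
	var out []Record
	for _, r := range f.records {
		if r.UpdatedAt.After(cp) {
			out = append(out, r)
		}
	}
	return out, nil
}

func main() {
	src := fakeSource{records: []Record{{ID: "ch_1", UpdatedAt: time.Now()}}}
	cp, err := SyncOnce(src, time.Time{}, func(r Record) error {
		fmt.Println("loading", r.ID)
		return nil
	})
	fmt.Println("next checkpoint:", cp, "err:", err)
}
```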
[00:36:06] Unknown:
As you mentioned, there's the desire for being able to populate some of these events into a data lake or a data warehouse. And in particular, the data warehouses require some upfront investment as far as determining what the schema is so that you can ensure that the events that you are bringing in are able to be written out to that schema. So I'm curious what is involved in establishing and maintaining that format and any transformations or migrations that need to occur as those events evolve and what the overall onboarding process
[00:36:42] Unknown:
is. It's actually really helped us that users are explicit about what they want to track and kind of opt in and figure out the schema upfront, because it generally means that the data in their warehouse looks much better. On our side, we do some level of schema inference, which right now, honestly, is fairly basic, where we'll look at the types that are coming in, and we'll kinda make a guess, based upon that first chunk of data, as to which types we should then create in their warehouse. On the customer side, you don't really have to do anything, assuming that you're tracking your data reasonably well. We will do all that schema inference for you. We'll create the new tables when there are new tables to be created. And we've really set up our dataset in a way that makes it friendly for the analyst. For example, rather than stuffing everything into one single table, where there might be a really wide set of rows where you have to figure out which columns correspond to which events, we'll actually just create new tables for each one of those events. So it makes it a little easier to introspect and understand what's going on, and then you can join them all together. But this is definitely an area where we've been investing quite a bit, especially as we've seen newfound scale from some of our customers. There was actually a day last quarter where we loaded 23 billion rows in a single day on behalf of our users. And so we're working pretty closely with the Redshift teams and the BigQuery teams to make sure that everything we do is, for the most part, append only, and that we're not spending time deduplicating on a bunch of their warehouses.
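Here is a small Go sketch of the kind of type inference described above: look at the first chunk of values for each property of an event and guess a warehouse column type, falling back to text when the samples disagree. The type names and rules are illustrative, not Segment's actual Redshift or BigQuery mapping.

```go
package main

import "fmt"

// inferColumnType guesses a warehouse column type from sampled values for one
// property. JSON numbers decode to float64, so that is the numeric case here.
func inferColumnType(samples []interface{}) string {
	colType := ""
	for _, v := range samples {
		var t string
		switch v.(type) {
		case bool:
			t = "BOOLEAN"
		case float64:
			t = "DOUBLE PRECISION"
		default:
			t = "VARCHAR"
		}
		if colType == "" {
			colType = t
		} else if colType != t {
			return "VARCHAR" // mixed types across the sample: fall back to text
		}
	}
	if colType == "" {
		return "VARCHAR"
	}
	return colType
}

// inferSchema builds the column set for one event's table (one table per event,
// as described above) from a sample of that event's properties.
func inferSchema(samples []map[string]interface{}) map[string]string {
	byColumn := map[string][]interface{}{}
	for _, props := range samples {
		for k, v := range props {
			byColumn[k] = append(byColumn[k], v)
		}
	}
	schema := map[string]string{}
	for col, vals := range byColumn {
		schema[col] = inferColumnType(vals)
	}
	return schema
}

func main() {
	samples := []map[string]interface{}{
		{"song": "Midnight City", "duration_seconds": 312.0, "shuffle": true},
	}
	fmt.Println("song_played:", inferSchema(samples))
}
```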
[00:38:09] Unknown:
Yeah. We could probably spend a whole other podcast just talking about that one piece of your business. But in the interest of time, I'm interested to get your perspective on what you see as some of the most interesting or unique or challenging lessons that you've learned in the process of building and growing both the technical and business aspects of Segment? On the technical side, there's one particular project that I'm really excited about that we're building internally,
[00:38:34] Unknown:
and it's this piece of infrastructure called Control Store. And, effectively, what Control Store helps solve is this problem where we have a bunch of different databases telling different services where different events should be routed. As I said before, we've got one in Protocols, which is telling us whether we should filter certain events or different fields. We've got one for GDPR, where it's telling us which users we should suppress. We've got one publishing data to Centrifuge, which is telling us, hey, this event should go to Google Analytics, this one should go to Salesforce.
This one should go to ten other destinations, that kind of thing. And as the product has evolved as part of this microservices architecture over time, what we found is that the number of these databases which are required to route events has also increased. And so maybe we started with just the one telling us, hey, send this event to Google Analytics, this event to Mixpanel. Now we're up to nine of these different databases. And you might guess that as we get to nine, and then maybe next year it goes up to twenty of these different databases indicating how data should be routed, the likelihood of each one failing independently just goes up and up and up. And it means that on our side, if all of them are required to be in operation to successfully pass data through, we're going to be seeing a lot more downtime on a regular basis. So we decided that we wanted to get ahead of this by building this tool called Control Store. And, effectively, what it does is it mirrors data indicating where data should be routed, which we call control data: little bits of metadata about API keys, which queues different events are destined for, what should be filtered out. It takes that data from the system of record, which is usually a big Aurora database living on Amazon, and it actually replicates that data out reliably to each individual instance and stores it as a SQLite database on those instances. And we think this is actually a really clever approach, because instead of each individual program having to do some sort of in-memory caching, all of them can actually share access to this data by just querying on disk. And for one, it's much faster. They don't have to do network round trips.
For two, it's actually much more space efficient, because now each one doesn't have to keep its own in-process memory cache, but all of them can share access to this data. So, potentially, we can store gigabytes of data rather than cache a smaller working set in memory. And then three, it allows us to continue processing data even if that original database is offline, because each one of these instances is just accessing a local database. And so this piece of infrastructure is something that we're migrating basically the whole of Segment's data plane to, in order to increase our reliability and decrease our footprint.
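A minimal sketch of the read path that Control Store enables, assuming the mattn/go-sqlite3 driver and a hypothetical routing table, columns, and file path: the service answers "where should this event go?" from the local SQLite replica instead of making a network round trip to the central Aurora database.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // SQLite driver; the replica is just a local file
)

// destinationsFor reads routing metadata ("control data") from the local SQLite
// replica on this instance, so delivery keeps working even if the system of
// record is unreachable. Table and column names here are hypothetical.
func destinationsFor(db *sql.DB, sourceID string) ([]string, error) {
	rows, err := db.Query(
		`SELECT destination FROM routing WHERE source_id = ? AND enabled = 1`, sourceID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var dests []string
	for rows.Next() {
		var d string
		if err := rows.Scan(&d); err != nil {
			return nil, err
		}
		dests = append(dests, d)
	}
	return dests, rows.Err()
}

func main() {
	// The replication process (Aurora -> per-instance SQLite) is out of scope here;
	// we just open the local file it maintains (path is illustrative).
	db, err := sql.Open("sqlite3", "/var/lib/controlstore/routing.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	dests, err := destinationsFor(db, "source_abc")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(dests)
}
```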
[00:41:20] Unknown:
And are there any other sort of interesting or challenging or unexpected lessons that you've come across in the process of building and growing Segment since you first started? I think probably the biggest one, in terms of the company itself,
[00:41:34] Unknown:
is realizing how important creating kind of a culture of writing and communication is. And this is something that I think we've actually started to do much better in, you know, the past two years or so. But in the early days, we kinda made up a bunch of these decisions around how we should build the product and which areas we should invest in. And, honestly, the reasons for those have now disappeared, because we didn't clearly write them down. And so instead, especially these past few years, we've put a push on sharing those ideas as an engineering, product, and design team, and writing things like runbooks and product requirements documents and design documents and those sorts of things, which I feel like now has actually allowed us to really hit our stride in terms of developing products, because we can understand what's important and what's not, and, if something didn't work before, why it didn't work before, whether it was a problem with parts of the infrastructure that we had then, or whether it was actually a bigger issue in the market. So I have to say, as rudimentary as it sounds, just understanding that writing everything down has actually helped us move faster in ways that I wouldn't really expect.
And are there any other aspects of the business or technical nature of Segment, or any of the projects that you either have built or are building, that you think we should discuss further before we get to that? Not that I can think of. I think Control Store is one of our areas of active development. Another one that we've been investigating a little bit more closely is actually working with FoundationDB. FoundationDB is this really fast key-value store that we actually worked with maybe four or five years ago when it first came out. And during that time, we actually found it to be an incredible database to work with. It had kind of the speed of Redis, but it also had replication, and it also stored your data on disk.
And we just thought it was downright amazing. And then they kinda went dark for a little bit, and Apple bought them and immediately pulled all of the packages that we were using to run FoundationDB in production. So we replaced our use cases with Redis and kinda thought that was that, it's never coming back. But recently, I wanna say it was about six months ago or so, FoundationDB open sourced itself again. And they said, hey, we're committed to building this database in a way where we wanna keep it around for the long term. And so we've been experimenting with it internally to replace some of the cases where we're using key-value stores elsewhere. And so far, from what we've seen, it's actually still been a remarkable piece of technology. They did some really cool things, where I think they built their own programming language to handle asynchronous C++ programming with extensions, and it's still lightning fast and does a great job replicating.
And I think for the use cases we're considering, it should reduce our cost footprint pretty significantly.
[00:44:23] Unknown:
I'll definitely have to take a closer look at that. It sounds like a pretty interesting piece of technology, and maybe I'll have it on a future episode. Highly recommended. And for anybody who wants to get in touch with you or follow along with what you and Segment are doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:44:49] Unknown:
The gap in terms of tooling or technology for data management. So I think one of the most surprising things since we started Segment, that I wouldn't have guessed coming in, is that among the people who are doing data management and data engineering, everyone is pretty okay with specifying what the data should look like and the schema for it and that sort of thing, but there are two things that no one really wants to do. One of those is running data infrastructure and data pipelines, which I think is where Segment has really succeeded to date. People love to use these different tools and to collect data and to wield it in powerful ways, but literally no one seems to want to run that data pipeline, which is why I think we haven't seen as much uptake of some of the open source tooling that's out there. It allows you to run a pipeline yourself, but it doesn't really solve this core problem of actually running that for you and being on the hook when things go down and replacing parts, that kind of thing. So I think that's one area where I'm really interested to see more tooling: some of these serverless options for letting people just say, hey, here's the code, here's what my data pipeline should do, please run it for me, let me forget about it, and I'll pay you for it. So that's one area that we're also actively exploring. I think the second one is around data cleanliness, especially when we talk to folks who are trying to build machine learning pipelines. For most of them, they end up spending the majority of their time cleaning data and extracting features and getting it all to be in the right format.
Usually, it's Parquet on S3, that kind of thing, and then they spend a very small amount of time actually running and training those models, sort of the, like, glamorous part of it. And so I think that's another missing piece that we're exploring a little bit more over the next few months.
[00:46:29] Unknown:
Yeah. And I think that, particularly in the big data and data pipeline space, one of the problems that is fairly endemic to a lot of the tools is that a majority of them, or at least a significant number, come from academia and research projects. And so there wasn't a lot of effort put in up front to make them operationally easy to manage. And so I think that is part of what contributes to that either unwillingness or reticence to actually build and run your own data pipelines, because it's not as easy or conducive to being run in an automated fashion, where some of these systems might require some manual steps or manual configuration to get everything talking to each other, versus just dropping something on disk and having it automatically cluster or configure itself. Totally agree.
Alright. Well, thank you very much for taking the time today to join me and describe the work that you've been doing at Segment. It's definitely a very interesting company and product, and one that I've been keeping an eye on for a while. So I appreciate your efforts on that front, and everyone on your team. So I hope you enjoy the rest of your day. Thank you so much for having me.
Introduction and Overview of Segment
Calvin French-Owen's Background and Segment's Origin
Segment's Current Capabilities and Products
Challenges in Managing Data Consistency and Privacy
Segment's Data Pipeline Architecture
Centrifuge: Handling Scalability and Back Pressure
Cloud Sources Pipeline and Data Integration
Schema Management and Data Warehousing
Control Store: Improving Reliability and Efficiency
Lessons Learned and Future Directions