Summary
Web and mobile analytics are an important part of any business, and they are difficult to get right. The most frustrating part is realizing that you haven’t been tracking a key interaction, writing custom logic to add that event, and then waiting for new data to accumulate. Heap is a platform that automatically tracks every event so that you can retroactively decide which actions are important to your business and easily build reports with or without SQL. In this episode Dan Robinson, CTO of Heap, describes how they have architected their data infrastructure, how they build their tracking agents, and the data virtualization layer that enables users to define their own labels.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Your host is Tobias Macey and today I’m interviewing Dan Robinson about Heap and their approach to collecting, storing, and analyzing large volumes of data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving a brief overview of Heap?
- One of your differentiating features is the fact that you capture every interaction on web and mobile platforms for your customers. How do you prevent the user experience from suffering as a result of network congestion, while ensuring the reliable delivery of that data?
- Can you walk through the lifecycle of a single event from source to destination and the infrastructure components that it traverses to get there?
- Data collected in a user’s browser can often be messy due to various browser plugins, variations in runtime capabilities, etc. How do you ensure the integrity and accuracy of that information?
- What are some of the difficulties that you have faced in establishing a representation of events that allows for uniform processing and storage?
- What is your approach for merging and enriching event data with the information that you retrieve from your supported integrations?
- What challenges does that pose in your processing architecture?
- What are some of the problems that you have had to deal with to allow for processing and storing such large volumes of data?
- How has that architecture changed or evolved over the life of the company?
- What are some changes that you are anticipating in the near future?
- Can you describe your approach for synchronizing customer data with their individual Redshift instances and the difficulties that entails?
- What are some of the most interesting challenges that you have faced while building the technical and business aspects of Heap?
- What changes have been necessary as a result of GDPR?
- What are your plans for the future of Heap?
Contact Info
- @danlovesproofs on twitter
- dan@drob.us
- @drob on github
- heapanalytics.com / @heap on twitter
- https://heapanalytics.com/blog/category/engineering?utm_source=rss&utm_medium=rss
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Heap
- Palantir
- User Analytics
- Google Analytics
- Piwik
- Mixpanel
- Hubspot
- Jepsen
- Chaos Engineering
- Node.js
- Kafka
- Scala
- Citus
- React
- MobX
- Redshift
- Heap SQL
- BigQuery
- Webhooks
- Drip
- Data Virtualization
- DNS
- PII
- SOC2
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so you should check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new t shirt. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey. And today, I'm interviewing Dan Robinson about Heap and their approach to collecting, storing, and analyzing large volumes of data. So, Dan, could you start by introducing yourself? Hey. It's a pleasure to be here. Thank you so much for having me. And I'm wondering if you can describe a bit about how you first got involved in the area of data management.
[00:01:10] Unknown:
Yeah. It's really interesting. I've never really thought of what I do as data management, but that's actually a perfect description. When I was in college, I studied a lot of math and machine learning because I was really excited about all of the things that you could do with all these emerging datasets and all of the magic that people could build in the world, all of that promise. But when I came to actually do any machine learning (I had an internship at Google when I was in college, and I did some research projects), all the work was assembling datasets, munging datasets, and some amount of feature engineering. And all the classes I took were just math. They were proving bounds on convergences of different algorithms or proving bounds on error rates or things like that. And I guess I sort of came to think that the things that were bottlenecking all this new technology were not actually stuff I was learning or working on. And in all of the things that people were building that actually worked in the real world, I noticed this pattern: they had some sort of clever way of assembling a dataset or some clever way of building the training set, but the resulting learning algorithms they used were pretty unsophisticated. They were the stuff that you learn in your 101 level machine learning class. So I got into this problem of how you make these datasets actually useful, and the problems were really different. The problems are around actually building a dataset that's correct, or complete, or that you can understand, or that humans can interact with and iterate on, all that good stuff. So my first job out of college was at a company called Palantir, which is solving a lot of these similar sorts of problems. You have these huge institutions that have hundreds of databases or hundreds of internal datasets, and actually building a unified picture on top of that, either for human analysis or machine learning, is incredibly complicated, and it really requires some sort of abstraction layer. And then a couple years later, I joined two friends at a startup that turned into Heap, which is solving a similar problem in terms of how you understand your customers.
And can you give a bit of a description about what it is that Heap does and how the project got started? Yeah. So we make a tool for understanding your users. You might have a website or an app or anything like that, and we make a tool that makes it easy for companies to understand what people are doing in their products and around their products, to get a complete view of your users and come to understand them better. There are about a dozen tools that do something like this. They all work basically the same way, which is to say they give you an API and you can log events against that API, so you might have some instrumentation in your product where you log checkouts and sign ups and views of a certain page or something like that. And all these products are really limiting for the same basic reason, which is that anytime you want to iterate on your analyses or track something new or understand something new, you need to go get someone to write new logging code, wait for whatever Jira ticket you filed to actually get processed, wait for the code to go live, and wait for new data to accumulate so you can actually do analysis. Your iteration speed is horrible. So the Heap approach to this problem is to capture everything that your users do and let you analyze it retroactively. You include our snippet on your page or our library in your mobile app, and we capture every click and page view and form submission and text field change and tap and swipe and all of that good stuff. And then whenever you have a question, we already have the data, so you can answer it retroactively. That means you can iterate on your analysis in a span of about a minute as opposed to, you know, 2 weeks or, honestly, at a large company, 3 months, 6 months, something like that. You asked about where this came from. Our CEO, Matin, was previously a PM at Facebook. And that's a company that had, you know, the most sophisticated tooling in the world for doing analysis of your users and all that stuff; Cloudera came out of there. And he ran into this problem that he couldn't get answers to really basic questions. He was a PM on the mobile messenger product, and there were basic things that he wanted to know about how users were using this thing, or did they use this feature at all, or all of that stuff. And getting answers to those questions required filing a bug to get someone to log something, and then the data would show up in some schema that he really had no understanding of, in tooling that was way too complicated for him to use. And the result was that they just made a whole bunch of decisions on basically gut, which is a real shame. You have this incredible Ferrari of a data processing ecosystem, and it wasn't of value to a PM there who was trying to use it. So he eventually cooked up this idea of building an analytics product that captured everything. Yeah. The ability to just automatically capture every interaction and every event,
[00:05:38] Unknown:
out of the box is definitely very valuable because having worked with some of the other tools, most notably Google Analytics, it can often be difficult to get things set up properly or understand ahead of time what it is that you want to be able to ask questions about. And so not having to worry about that when you first launch a project is definitely useful and very powerful in the end. And so given the fact that you do automatically collect every event, how do you prevent the user experience from suffering as a result of things like network congestion or delayed processing because of capturing all of those events and then ensuring reliable delivery of that data?
[00:06:20] Unknown:
So the capture portion of this product, actually doing this right, is one of the software engineering challenges of what we're building. A lot of work has to go into doing this properly. There's the first order stuff, like you don't want to overload the network on someone's mobile device or keep the network constantly active because that'll drain their battery, and there's all sorts of stuff like that. So you have to do your basics: batching stuff up and waiting for connectivity before you can upload it later and all that good stuff. On web, iOS, and Android, there are different, I don't wanna call them tricks, but really different solutions to this problem. But the commonality is that, yeah, it's really important in all cases that it doesn't affect the user experience at all, and that's something we were able to achieve. And then another portion of this is obviously on our server side: making a collection and ingestion layer that is rock solid, which just took years of iteration and work. A lot of the interesting work in there came late last year when we started doing Jepsen style, Chaos Monkey style testing of that layer, and we rooted out all sorts of little ways that data could get dropped, even tiny percentages of data. And the result is that we haven't lost a minute of data in the last 6 months. Yeah. Chaos engineering
[00:07:24] Unknown:
to be able to ensure reliability at the data layer, I'm sure, was an interesting challenge because of the inherent nature of distributed systems: making sure that there's consensus, that you don't have dropouts, ensuring that there's appropriate back pressure, etcetera. Yeah. All that good stuff. Yeah. Given the inherent complexity of a system like that, can you walk us through the life cycle of an event going from a user's browser all the way through to the end, where you're able to process that data and visualize it? Yeah. So
[00:07:58] Unknown:
an event is born somewhere in a tracker, either in JS that's on your web page or somewhere in your app or something like that, or it can come from a third party source, and I guess we can get into that later. An event will be born, and it will be sent to one of our web facing data collection services, which is a really lightweight Node.js app that receives this data and sends it to Kafka. And, you know, we can't drop anything at this stage because the data's not persisted yet. So there's all sorts of logic in there to the effect of, you know, if Kafka is not available, we'll spool it locally and re-ingest it later, and all of that good stuff. But the data will sit in a Kafka queue until it is processed. We have a Scala ingestion layer that consumes from that Kafka queue and ingests the data into the distributed system that we've built to make these analyses work. So we have a distributed system built around sharded Postgres and Kafka and a whole bunch of Scala to, basically, make the analyses that we're trying to offer performant on this sort of product. So that's a distributed system that we've built around the problem of doing analyses that have this fundamental indirection between the raw data and the semantically meaningful data. In a traditional analytics product, you log something, and all of the schema around that event is baked into the event. You log a checkout, and the thing that hits the analytics tool's servers is a checkout event with a bunch of properties. In Heap, what we're capturing are raw events. They're clicks or page views or form submissions or text field changes or whatever. They don't have any semantic value yet. And then in Heap, you create what are called definitions, which are basically predicates. A definition is something like, for example, you would click a button on your site and say this is the checkout button, and we'll extract from that various properties like the CSS of the button you clicked. And then we now have this label of checkout that we can apply to all the historical clicks that match that definition.
So then this is how that whole retroactivity idea works. You can define a checkout and then analyze all the checkouts that have ever happened, going back to when you first installed this. But from a data engineering point of view, that means we're supporting analyses that are in terms of these definitions, which you don't know when the events come in, and they can always change. They're dynamic. We needed to build a new kind of distributed system to make it possible to do these analyses performantly. Oh, and you wanted to know the full lifecycle, yeah.
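To make the idea of definitions-as-predicates concrete, here is a minimal TypeScript sketch of applying a label retroactively to autocaptured raw events. The event shape, field names, and matching logic are illustrative assumptions, not Heap's actual schema or API.

```typescript
// Minimal sketch of retroactive event definitions over a simplified
// raw-event shape. Field names here are illustrative, not Heap's schema.
interface RawEvent {
  kind: "click" | "pageview" | "submit" | "change";
  cssSelector?: string;   // e.g. "button.checkout"
  pagePath?: string;      // e.g. "/cart"
  userId: string;
  timestamp: number;
}

// A definition is just a predicate plus a human-readable label.
interface Definition {
  label: string;
  matches: (e: RawEvent) => boolean;
}

// "Checkout" defined after the fact, by pointing at the button.
const checkout: Definition = {
  label: "Checkout",
  matches: (e) =>
    e.kind === "click" &&
    e.cssSelector === "button.checkout" &&
    e.pagePath === "/cart",
};

// Because every raw event was captured, the label applies retroactively:
// filtering the historical data is all it takes to "start tracking" checkouts.
function labelEvents(raw: RawEvent[], def: Definition): RawEvent[] {
  return raw.filter(def.matches);
}
```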
And then the read path is fairly straightforward. We have an app server that is built in Node as well, and we have a front end that's built in React and MobX. You'll create some sort of analysis in our product and you'll run a query, and our app server will compile that into a query that our distributed system can support, and we'll run the analysis and give you results back. And then there are two other important destinations for this data. One is warehouses. We have a product that we call Heap SQL, which you can think of as your Heap dataset in a warehouse. You might have a Redshift instance or a BigQuery cluster or something like that, and we offer a product where you can analyze your Heap data there. So you can do something like create this checkout event, and we will populate a table of all the checkouts that have ever happened, and you can run raw SQL on it. So that's a separate destination that the data needs to go to. And finally, we are building out tech to do webhooks. So, for example, you can set up a flow like: when someone views this white paper, fire a webhook to this third party tool and add them to this drip campaign, or ping me in Slack so that I know that they're doing this and I can take some sort of action on it. So the webhook activity is just an additional Kafka consumer that's evaluating these definitions on all the incoming streaming data. And I'm obviously eliding some of the details. One of the things that makes this very complicated is that Heap has a really powerful notion of identity. I think an important pillar in any sort of dataset like this is having a coherent notion of identity, like what an actual end user is. It's not usually a cookie. An end user probably has components that are in different cookies because they show up in different browsers, and they might have also interacted with your mobile app. And they've also interacted with a bunch of third party tools, like maybe they did a payment on Stripe or maybe you sent them an email in Mailchimp, and all that data needs to go to one coherent user record so that you can do user level analyses that make sense. So there's a layer in here that I've elided, which is how these users actually get combined. Yeah. That's definitely a complicated challenge in and of itself that we'll dig into in a minute.
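Dan mentions that the web facing collectors cannot drop anything before it is persisted, falling back to local spooling when Kafka is unavailable. A rough sketch of that pattern is below; the broker address, topic name, spool path, and the choice of the kafkajs library are assumptions for illustration, not a description of Heap's actual service.

```typescript
import { Kafka, Producer } from "kafkajs";
import { appendFile } from "fs/promises";

// Sketch of the "can't drop anything before it's persisted" pattern: hand
// each event to Kafka, and if the broker is unreachable, spool it to local
// disk so a separate process can re-ingest it later.
const kafka = new Kafka({ clientId: "collector", brokers: ["kafka:9092"] });
const producer: Producer = kafka.producer();
const SPOOL_PATH = "/var/spool/collector/events.jsonl";

export async function start(): Promise<void> {
  await producer.connect();
}

export async function ingest(rawEvent: object): Promise<void> {
  const payload = JSON.stringify(rawEvent);
  try {
    await producer.send({ topic: "raw-events", messages: [{ value: payload }] });
  } catch {
    // Kafka is unavailable: persist locally rather than dropping the event.
    await appendFile(SPOOL_PATH, payload + "\n");
  }
}
```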
[00:12:43] Unknown:
And going back to the beginning of the life cycle for that data, one of the most complicated aspects, at least in my experience, is ensuring reliable delivery from the browser to the consumer, you know, in this case the Node.js application, particularly where you have things like the user closing their browser tab before the data gets flushed through, particularly given that you're batching it, or the Node.js application being temporarily unreachable. How do you mitigate some of the issues there?
[00:13:33] Unknown:
So our strategy thus far has been to keep the server level really thin and really simple and basically get it to the point where it never goes down. We have that set up to be multi AZ, and we've done a whole bunch of Jepsen style chaos engineering around basically making ourselves confident that it'll never go down. On the client side, there's the problem of reliable outbound capture, which is what you alluded to. For example, if someone clicks a link on your site that takes them to another website, how do I reliably capture that as they're leaving, or as they close a tab, or all that stuff? That does turn out to be a very hairy, very tricky problem. It's actually a problem that we have noticed a lot of people get wrong when they instrument their own websites in a naive way. Every other tool requires you to write basically click handlers that track the things that you care about. And if you just naively throw a click handler on this, you'll lose, like, half the data, because the browser will have closed and navigated away before your thing is received. So we do some trickery in the tracker to make sure that we handle all of those cases and reliably capture stuff as you're leaving a page or closing a tab or something like that. And I think a big part of this is just doing the software engineering work and being diligent about it. We have a pretty extensive suite of tests that we run against our tracker to cover all kinds of different OSes and browsers and combinations of those things, and it's a core part of what we do. There are two main things. This thing cannot lose data. It needs to reliably capture even if there's some kind of network hiccup or some sort of weird user behavior on the site or some kind of bizarre crime against the DOM that somebody does on their product.
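The outbound capture problem described here (flushing batched events before the page goes away) is commonly handled with a last-resort beacon. The sketch below is a generic illustration of that technique, not Heap's tracker; the endpoint, batch sizes, and flush intervals are made up.

```typescript
// Illustrative client-side event batcher. Events queue up and flush
// periodically; when the page is being hidden or unloaded,
// navigator.sendBeacon is used as a last-resort flush because the browser
// allows it to outlive the page.
const ENDPOINT = "https://collector.example.com/events";
const queue: object[] = [];

function flush(useBeacon = false): void {
  if (queue.length === 0) return;
  const body = JSON.stringify(queue.splice(0, queue.length));
  if (useBeacon && navigator.sendBeacon) {
    // sendBeacon hands the request to the browser, so it survives
    // navigation away from the page or the tab being closed.
    navigator.sendBeacon(ENDPOINT, body);
  } else {
    // keepalive lets an in-flight request finish even during unload.
    fetch(ENDPOINT, { method: "POST", body, keepalive: true }).catch(() => {
      // On failure, put the batch back so a later flush can retry it.
      queue.unshift(...JSON.parse(body));
    });
  }
}

export function track(event: object): void {
  queue.push(event);
  if (queue.length >= 20) flush();            // size-based flush
}

setInterval(() => flush(), 5_000);             // time-based flush
addEventListener("pagehide", () => flush(true));
document.addEventListener("visibilitychange", () => {
  if (document.visibilityState === "hidden") flush(true);
});
```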
[00:15:10] Unknown:
And, also, it cannot break people's websites. If you get too tricky with what you're doing in these things, it is possible to break people's websites. That's not something we've really done in years, but in the very early days of this product, we learned about all kinds of different subtle ways you can affect people's products in a bad way. That's why it needs to be an incredibly well tested piece of code in this product. Yeah. Particularly in the context of browsers, it can be a very messy and chaotic environment because of things like browser plugins that inject various bits of JavaScript and some of the weird hacks that people will do on their websites to modify the DOM. And so I'm sure that it creates a fair bit of noise in the events as you're tracking them. So what are some of the ways that you ensure that the data that you do track has proper integrity and accuracy? And I'm assuming that there's a unified container or representation of that data to allow for ease of processing on the back end, to ensure that you could merge events appropriately and have
[00:16:13] Unknown:
some sort of commonality between the data points to be able to do an appropriate analysis on them. Yeah. We don't generally do anything particularly fancy with the browser. We're not building something like FullStory that does session replay, so we're not trying to do that. Within the layer that we operate, there's surprisingly little variation between the browsers, and it's usually in an area that third party extensions and so on don't affect too much. And at this point, we only support IE 8 and above, so it's at least somewhat sane. But, of course, I think this is mostly just a question of diligence. You have to be very careful and write the tests and handle all of the possible things that you've ever seen come up. We've seen things from particular JS libraries that someone has on their page, you know, breaking something, or a particularly weird use of SVG, or all kinds of weird stuff like that. And we have a pretty large free tier, so just an enormous breadth of different websites. So we've seen a good chunk of what the Internet will throw at this kind of tracker, and I think it's just a question of doing the work: encountering a thing like this, treating it as a top priority to fix, writing a test so that you never have that issue again, and, you know, rinse and repeat. And in terms of the data representation,
[00:17:25] Unknown:
do you have some standard unit of data that you ship to make it easy to integrate the information at a later point between the various browsers or websites that somebody might have and mobile applications?
[00:17:40] Unknown:
So in Heap, there are users who have sessions, and sessions have many events. I think the first layer of this is that we have a standard event schema for all of the events that we capture, including third party events. And then the identity piece of this is, I think, where it starts to get really interesting. The way that we handle this is we give users an API that lets them basically specify tags or handles that go on a user, or identities, and we will combine different users that see the same identity. And what's cool about this is that you can represent a really complicated user flow doing this. So, for example, you might have an ecommerce product where users do guest checkouts and they enter their email when they do so, or they sign up and they have some identity that's part of your system. And Heap lets you effectively create a graph of different components of users, where edges reflect having a common identity in some way, like these users have the same email, or these two users have the same Stripe ID, or these two users have the same identity, or something like that, and we will combine all of those users, basically each connected component in that graph, into something that looks like one user to you. So this is maybe an overcomplicated way of saying: we give you APIs where you can tag users with various properties and specify that if two users have the same value of this property, they should be combined.
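The connected-components model described here maps naturally onto a union-find structure. The toy sketch below illustrates the idea under that assumption; the identifier formats and API are hypothetical, not Heap's.

```typescript
// Toy union-find over user records: any two records that share an identity
// value (email, Stripe ID, etc.) end up in the same component, which is
// what analyses see as a single user. Identifier formats are hypothetical.
class UserMerger {
  private parent = new Map<string, string>();

  private find(id: string): string {
    if (!this.parent.has(id)) this.parent.set(id, id);
    const p = this.parent.get(id)!;
    if (p === id) return id;
    const root = this.find(p);
    this.parent.set(id, root); // path compression
    return root;
  }

  // Called whenever two user records are observed sharing an identity value.
  union(a: string, b: string): void {
    this.parent.set(this.find(a), this.find(b));
  }

  // The canonical user that analyses should attribute both records to.
  canonicalUser(id: string): string {
    return this.find(id);
  }
}

// Example: a cookie-based web user and a mobile user both identify with the
// same email, so they collapse into one logical user.
const merger = new UserMerger();
merger.union("cookie:abc123", "email:jane@example.com");
merger.union("ios:device789", "email:jane@example.com");
// merger.canonicalUser("cookie:abc123") === merger.canonicalUser("ios:device789")
```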
And under the hood, we're doing a whole bunch of manipulations of this data so that when you run analyses, it looks like those are just one user. So to a person running analyses in this product, when you run a conversion funnel that has data from 4 different sources (like, I wanna see a conversion funnel from sending someone an email from this campaign, to usage of this feature, to the percentage of those users who later added something to cart, to the percentage of those users who later paid via Stripe), to an analyst using Heap or to anyone asking a question in Heap, it looks like it's just one user. I think that's really important. All the really interesting analyses are at the user level. So that's a big part of what we're trying to represent. And what are some examples of some of the third party integrations that you support and some of the complexities
[00:19:45] Unknown:
that that introduces to your infrastructure and your capabilities for being able to ingest data and merge it together with the data that you're collecting from the event trackers?
[00:19:56] Unknown:
That turns out to be a really interesting question. Basically, doing a good job handling third party data is really a full stack problem. There are product portions of this. You're now building an analysis product where events can come from N different sources instead of just your main product, or instead of just the website itself or the iOS app, and you need some way to communicate that to users so they understand what they're looking at. Then there's also the problem of doing the data capture. All these tools have their own APIs. I think the thing that makes it particularly interesting is that they all have their own semantics: you know, tool number 1 fires a webhook at you when an event happens and you need to receive it. Tool number 2 has an API where you can crawl it and get a CSV of the last hour of events. Tool number 3, Salesforce, is a particularly interesting one, because one way to handle it is to deploy an Apex plug-in that your user installs into their Salesforce instance that spits data at the Heap servers. Or another way to handle this is to crawl things against someone's Salesforce instance using the Salesforce API, but then you run into problems around the cost of doing so, because customers pay by API usage in Salesforce. So all these tools have their own weirdness, their own semantics, and managing all of that is a pretty considerable challenge. And then the infrastructural challenge under the hood is really one of data integration. Basically, you have users in N different products, and there's generally no global identifier that ties all these users together. So our solution to this problem has been: for each of these sources, you select some sort of identifying field that comes from the source. Stripe events, for example, have an email address or a Stripe ID or a couple other fields. And you'll pick one of those fields, and you'll match it up against a user property, like the email, or you might have been sending in the Stripe ID or something like that, and we will do that join under the hood. Doing that join tends to be really complicated because we do it at write time. We persist those changes in our data stores so that at read time the analyses are really fast. But that turns out to add a considerable amount of complexity. Yeah. And one of the challenges there too is that it ties into the conversation that's been happening a lot in the
[00:22:03] Unknown:
data engineering space lately of ETL versus ELT, in terms of when you do that transformation and how that impacts your ability to do future analyses or retroactively change the way that you apply some sort of semantic meaning to the raw data. So it's difficult to decide when you want to make those changes, and particularly when you're dealing with large volumes, you don't necessarily have the luxury of having it both ways: loading the data into some raw archive and then also transforming it in one of the storage systems, so that if you then do need to go back and change the way that you're processing it, you can just reapply it from the raw storage. Yeah. I mean, in some sense, this is the problem of Heap. We're building a product here that to a user feels like schema on read, but from a performance point of view needs to, under the hood, do a whole lot of schema on write type stuff. We are giving you the ability to define a checkout as something
[00:22:59] Unknown:
subtler than that, it's this or that or this other thing, and then we need to instantaneously update, because all of the raw events we've captured that were checkouts are not necessarily checkouts anymore. You might, at any point in time, add a new definition, or add a new event that you're interested in analyzing, or change one of those definitions, or merge some of them, or something like that. And to you, that's a schema on read experience. You're deciding, when you're doing the analysis, how you want that data to be shaped. But from a performance point of view, these datasets (there's a petabyte of data in this product) are way too large for us to be able to actually do that lazily, so we need to do a whole bunch of hoop jumping. But I think that's, in a fundamental sense, what we're building. I think what's needed here is an abstraction layer between what is captured and what is being analyzed. That's really what we're building. We're building data virtualization. When you build an application for a desktop, you generally don't code in terms of the underlying drivers. You have an operating system that abstracts that away for you. Or when you use a browser, you don't punch in an IP address. You type in a URL, and we have DNS that actually handles that indirection.
And datasets should really be the same way. I should be able to change the configuration of my dataset, or change basically my schema, and have the dataset retroactively update. I should be able to say: actually, I want to change my event schema, I want to additionally include this event, or I want to additionally include this property, or I want to remove this property because that's a PII concern, or I want to change how data is joined between my Stripe dataset and my Salesforce dataset, or I want to change how users are combined between my different subproducts, or all that stuff. There should be a capture layer that captures a raw dataset that is complete and totally automated, and there should be a configuration layer that you can modify at any time that determines, basically, your schema: your event schema, how your users are combined, your user identity schema, and all that stuff. And that should produce for you a virtual dataset or a synthetic dataset that you then run analyses in terms of. So you should be running analyses on top of this abstraction, and under the hood it should just handle all of that schema complexity. Basically, that's the product that we're building. This is data virtualization software. This is software that gives you the power of schema on read, but the performance of schema on write. As you've grown the business
[00:25:16] Unknown:
and your customer base and, at the same time, the volume of data that you're working with, what are some of the problems that you've had to deal with, and some of the architectural changes or evolutions that have been introduced by necessity because of this increase in scale and complexity?
[00:25:34] Unknown:
First and foremost, we've had to do a lot of work on that analysis layer that I described before. We have an in house distributed system that we've built on top of sharded Postgres, and scaling that out, you know, multiple orders of magnitude over the past couple of years has been an enormous amount of work. So a lot of the work there has been really within the Postgres layer: experimenting with different hardware and different configurations and different file systems and different schemas and different ways of expressing these analyses and all of that stuff. So first order, there's a databases at scale problem here. Another general sort of trend that we've had here is that we are giving our customers more and more complex ways to model their data, or the ability to model more and more complex datasets, like more complex notions of identity, or richer datasets, or adding third party sources. These are all additional complexifying factors. So the number of distinct areas of our infrastructure has grown considerably, even in the last 18 months, the last year and a half, really. And what are some of the changes
[00:26:36] Unknown:
that you're anticipating needing to make as you continue to bring on new users and increase the volumes of data that you're dealing with? A lot of the work that we're doing in 2018 is around
[00:26:46] Unknown:
scaling out the product that we offer to enormous customers, customers that have websites that are, you know, in the hundreds of millions of sessions a month, some enormous product like that, and allowing the underlying components of our infrastructure to specialize. What I mean by that is we currently have this distributed system that we've built on top of sharded Postgres, and it is really good at supporting the analysis in the Heap product: running conversion funnels, or graphing instances of an event per day, or unique users who did an event, or things like that, or cohorts and all that good stuff. But there are a lot of other uses of this data in this product, where we populate datasets in people's warehouses. There are real time use cases that it doesn't serve well. There are more advanced use cases, like more advanced statistical learning type stuff that people wanna do on top of this dataset and features we'll wanna support there. And the system that we've built is really, really good at one thing and not really very good at these other things. So we're building a lot of new stuff right now to handle various different portions of this. So, for example, we're building a separate S3 store for all of this data that will power future warehouse use cases. This will allow us to support Heap SQL customers that are a hundred times as large as customers today for, you know, a thirtieth of the cost. There's some incredibly improved performance like that. So that'd be an example of pulling out the persistence and the warehousing use cases from this data system that we built and letting it focus on the thing that it's good at. So I think when people at startups generally talk about things being monolithic, they're talking about their app layer, which does too many things that they split into microservices.
I think this doesn't necessarily apply to Heap, because our app is still monolithic and has been totally fine, because it's not fundamentally that complicated. But the underlying data system is fairly monolithic, and we're splitting that out into a bunch of different data services that are not necessarily all that micro, but at least split out by function. That's a lot of the work that we're doing right now. And another large area of work for us is building out more of a platform and allowing people to use this data in much more flexible ways. Right now, we have an end to end analysis solution that is very powerful, but this dataset that we are allowing users to create, this virtual dataset, is really powerful and useful for all kinds of other stuff. There are a lot of other warehouses that people wanna use this in and join it with other information downstream and run advanced things there. There are machine learning models people wanna train. There are real time use cases, like a webhooks type use case. So building out the underlying data platform so that we can support a much more flexible set of use cases is another big area of focus for us right now. And digging a bit deeper into the
[00:29:22] Unknown:
difficulties and complexities of offering the synchronization with users' data warehouses, I'm wondering if you can talk a bit about the architecture that you needed to build out to ensure that that data synchronization doesn't cripple the end user experience of people who are doing interactive queries or visualizing the data on your end user portal? Yeah. So, I mean, that's largely a question of capacity engineering. Building a
[00:29:49] Unknown:
Redshift destination for this data has been a really interesting experience. I think the underlying difficulty here comes from the fact that the underlying data model in Redshift and the underlying data model in our analysis layer are just very different. So supporting things like a really complicated notion of user identity, which unfortunately makes the data mutable, for example, is something that's very tricky in Redshift. So we've had to iterate quite a bit on how we wanna represent that. And then another interesting set of problem areas around supporting something like this is that Redshift is halfway between infrastructure as a service and platform as a service; I think you can think of Redshift as, like, 20% of the way to on prem. So there are just a lot of environmental factors that are not under our control that make it easy for a customer, for example, to accidentally break their own Heap SQL experience. For example, if a customer changes some sort of security configuration or a networking configuration, we can no longer access their cluster. Or they toggle on a whole bunch of new data that they wanna send to their Heap SQL cluster and they don't actually have enough space for it on the other end. There are all sorts of problems that you can run into here that are just factors that are not under your control. This is in contrast to something like BigQuery, where we haven't seen nearly as much of that. It's, I don't wanna say idiot proof; I think the real thing is that it's just more of a platform and less of a piece of infrastructure.
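For readers unfamiliar with why mutable identity is awkward in a warehouse: applying a retroactive user merge generally means rewriting historical rows. The sketch below is a generic illustration of that kind of merge pass, not how Heap SQL actually implements it, and the table and column names are invented.

```typescript
// Generic illustration of applying a retroactive user merge to a warehouse
// table. NOT Heap SQL's implementation; table and column names are made up.
// The point: identity merges make historical rows mutable, so the sync has
// to run UPDATE/DELETE passes that a purely append-only pipeline never needs,
// which is part of why this is tricky in Redshift.
export function userMergeStatements(fromUserId: string, toUserId: string): string[] {
  // Real code would use bound parameters instead of string interpolation.
  return [
    "BEGIN;",
    // Re-attribute historical events to the surviving user.
    `UPDATE events SET user_id = '${toUserId}' WHERE user_id = '${fromUserId}';`,
    // Remove the merged-away user row; its properties were folded in upstream.
    `DELETE FROM users WHERE user_id = '${fromUserId}';`,
    "COMMIT;",
  ];
}
```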
[00:31:05] Unknown:
And what have been some of the most interesting challenges that you faced or lessons learned while building the technical and business aspects of Heap? That's a really interesting question. I think
[00:31:16] Unknown:
from a business point of view, one of the interesting challenges (and you may have already heard me struggle with this a half hour ago) is really explaining what we are. In a kind of fundamental sense, we are building data virtualization. That's what this is. We're building the ability to capture and control and organize a dynamic, synthetic, you know, virtual dataset. That's not a category that exists. That's not something people are looking to buy, because it's not out there except for Heap. So it's a new idea. So I think something that we've really wrestled with over the last couple of years is how to explain this thing.
You can describe it to people as an analysis tool, and it is a much better way to do user analytics than other tools on the market, but that sort of undersells what this is for. That puts it in a kind of narrow box; analysis is one of many possible applications of this dataset that is possible with Heap. But on the other hand, if you don't map it to something that people already understand, then you're sort of inventing a category, which is really hard to do as a 100 person startup. So really figuring out exactly how to position this and how to explain it has been a challenge. I think the flip side of that is that when people understand what this is and they buy it, they love it. Unfortunately, I'm not gonna get into numbers right now, but our churn and our expansion, our upsell, are extremely strong. For a SaaS company, we're off the charts in that area.
And part of that is we have a really talented, hardworking account management team, and they do an excellent job. But another part of that is that the fundamental product here is very powerful and useful, and people will use it and they want to buy more of it when it comes time to renew, or they start using this and they really don't want to go back to tracking plans. If you're not deeply in this space, a tracking plan is like a big spreadsheet of all the things that you're gonna track, and it's supposed to list all the things you'll ever wanna analyze. And, of course, you miss some things, and it's an insane, arcane way of doing this. No one wants to go back to that when they start using Heap. So I think we have this interesting challenge, and it is hard for us to explain to people what we do. And it's a hard conversation to start, and the person who would buy this at different companies is often very different. But once people do start using this, it's incredibly sticky and powerful, and they wanna keep using it. So, obviously, I feel really optimistic about the company, and I think this is one of the reasons: this is a problem of messaging and educating the market, and I think that's a lot more solvable than if our product did not have some sort of fundamental fit. I would rather solve the problem of educating the market about why data virtualization is so powerful and why they need it than have something that people already know how to think about, but that's not actually that valuable, so that, you know, they end up churning off of it. From a technical point of view, in some sense, the whole product is technical risk. What attracted me to joining the company is that it's enormous technical risk and, I thought, relatively low product risk. It's a clearly better way to do analysis. It's kind of silly that the way we do this in other tools is we whitelist the things that we want up front, which is crazy. Of course, you don't know the things you want. All the interesting stuff is in the unknown unknowns.
But it's a really high technical risk. It's the kind of product where making this, scaling this to enormous datasets, scaling the data capture, scaling the analysis so that large customers can use this in an interactive way, making this cost effective: all of that stuff is really challenging. Another area of challenge for us is sort of unique as analytics products go, in that at a lot of our customers, there are a lot of different producers of data. In analytics, there's this idea of producers and consumers. In most analytics tools, there's a small number of people who create reports, and then there's a large number of people who view dashboards that consume reports, basically. In Heap, in some sense, the fundamental power of this product is that it unblocks people. They don't need to talk to IT or engineering or whoever to get data that they need or to get access to something, because it's already all there. It's already all captured. But that means we're building an analytics tool that has tons of producers and tons of consumers, which is different. And that exposes all kinds of new problems that we've had to wrestle with. But yeah, anything where you're dealing with this kind of data scale, when you're trying to run 2 second analyses on a petabyte scale dataset, everything's hard, I think.
[00:35:46] Unknown:
And I'm sure that it wasn't made any easier with the beginning of enforcement, actually just today, of the General Data Protection Regulation from the European Union. So have there been any significant changes necessary in your product or your processing or in any of your messaging to account for that new regulation? That's right. Happy GDPR day.
[00:36:11] Unknown:
It's everyone's favorite holiday. It's been surprisingly little from a technical point of view. We were already SOC 2 compliant, which meant that we had a lot of the bones in place to comply with most of the portions of this law. First of all, a lot of this law is stuff that you should be doing anyway. You should be reporting incidents to authorities. You should be making a reasonable effort to do least privilege type access internally in your systems, all that kind of stuff. And a lot of it is stuff we'd already been doing for SOC 2. This adds a lot of work for controllers, which is really for our customers, because they have to acquire and manage a whole bunch of different, you know, consents and all of this stuff. But from our point of view, it's less than you would think from a technical point of view. Obviously, from a processes and business point of view, it's an enormous amount of paperwork, like, you know, signing and managing all these data protection agreements, all these DPAs, and all that stuff. It creates a lot of work. I think the main technical change that this has provoked is that we need some notion of deletion; we need some way to delete users. Our customers' users might revoke consent for processing of data, and they will need to notify us via some sort of API, and we need to delete that data. And this turns out to be really difficult. This can turn out to be more difficult than it seems, because I think in any distributed system, deletion is one of the harder problems. In something like the Heap back end, data about users exists in, like, 50 different places. It's in various different Kafka topics. It's in various different buffers and caches and downstream warehouses that our customers have. It's in our analytical store, all of that stuff. So deletion can turn out to be very tricky.
In Heap, one of the things that has made it a lot easier is that we already have such a robust notion of identity under the hood, and that actually makes it a lot easier to manage deletion. What I mean by that is, if you need to delete or anonymize all the information you have about a particular user, the fact that we have a whole bunch of tech already to manage the complexity of this web of user relationships meant that this was actually surprisingly easy for us to implement, but it's definitely a pretty major change. I mean, previously, there wasn't really any kind of first class way to do this. We have tools for deleting PII if you accidentally send us some, but that's sort of a different problem. That usually means deleting a certain property from all of your events, which is different than deleting an individual user or something like that. But, yeah, obviously it's been an adventure for everybody, and I think the next couple of months will be really interesting as we find out how this is actually gonna be enforced. And it's an interesting law because it leaves a lot up to interpretation, which I think is really smart. I think if you write this kind of law and you're very specific about what the rules are, then the law is gonna be really outdated and silly in two years, because the technology changes so fast.
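A hedged sketch of how an identity layer can drive a deletion request like the one described here: resolve every record linked to the requesting person, then fan the delete out to each place the data lives. The interfaces and store names are assumptions for illustration, not Heap's internals.

```typescript
// Illustrative sketch of identity-driven deletion, not Heap's internals.
// The identity graph answers "which records belong to this person?", and
// the request then fans out to every place the data lives.
interface IdentityGraph {
  // All internal user IDs in the same connected component as this identity.
  linkedUserIds(identity: string): Promise<string[]>;
}

interface DataStore {
  name: string; // e.g. "analytical-store", "warehouse-sync", "caches"
  deleteUsers(userIds: string[]): Promise<void>;
}

export async function handleDeletionRequest(
  identity: string,            // e.g. an email supplied via a deletion API
  graph: IdentityGraph,
  stores: DataStore[],
): Promise<void> {
  const userIds = await graph.linkedUserIds(identity);
  // Fan out to every store; each store is responsible for its own
  // delete-or-anonymize semantics (buffers, caches, downstream warehouses).
  await Promise.all(stores.map((s) => s.deleteUsers(userIds)));
}
```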
But the way that the law is written obviously has this trade off that it's very vague. So I think it'll be really interesting in the next couple of months to find out how this is actually gonna be interpreted and enforced. But thus far, for us, from a technical point of view, it hasn't been as bad as you'd think. And what are some of the plans that you have for the future of Heap, either in terms of new features or capabilities
[00:39:36] Unknown:
or new business directions or, just general, you know, infrastructure improvements?
[00:39:42] Unknown:
There's a couple different directions here. One is continuing to double down and triple down on that capture layer, handling more different kinds of ways that data gets created. We still don't support React Native, for example, and there are all kinds of third party sources that our customers have data in that would enable valuable analysis in Heap. There are also the obvious infrastructure problems of scaling to even bigger and bigger datasets, scaling to customers who are doing, you know, a hundred million sessions a month or more than that. That brings its own challenges, and that's something we're working on right now. And as I mentioned before, I think building and selling more of a platform that lets you do a whole bunch of different things with this data, and less of a monolithic analytics solution, is gonna require a whole bunch of changes for us from a technical point of view and obviously from a business point of view. Those portions are pretty exciting to me, and then there's a lot more to be done in making it possible to interact with this dataset and iterate on this dataset. I sort of described this idea of data virtualization: that there should be some virtual or synthetic dataset that is constructed based on some raw datasets that we capture and some configuration that you specify. Right now, we virtualize the event schema. You can define your events at any time or change your event definitions or anything like that, and Heap will automagically update your warehouses as if you had been tracking it that way from day one, and update your analyses; you can run a query 10 seconds later, and your analysis will be updated. So the event portion of your schema is already somewhat virtualized in Heap, but there's a whole lot more that we can do here. We have the most powerful tools on the market for representing a complex user flow, for representing users in a product that has a different notion of identity at different stages. Like I mentioned, you might have customers who supply an email when they make an anonymous purchase, and then they actually log in later and make a logged in purchase, and you wanna correlate this. This is the only product that lets you represent that properly. But even then, we don't let you do it in a retroactively modifiable way. We don't let you do it in a way where you can change the definition of a user and have it be as if you had had it that way since day one. Right now in Heap, it's still something that you need to write code for. So making it possible to virtualize all of this stuff and make a truly virtual dataset is, over the next few years, gonna require a substantive amount of work. I think one of the things that makes this space so interesting to me is that it is so clearly nascent. It's so clear that the data tech that so many companies use is just garbage. It's horrible. It's so hard for smart people who are spending a lot of money to answer basic questions or to build basic things. Nothing talks to anything. No one can understand these datasets. The problem of getting to the point where you actually trust your dataset and it's correct is incredibly difficult. That's usually most of the work.
People spend, you know, 80 plus percent of their time just iterating on the dataset, trying to get to the point where they trust it, and it makes sense, and it's correct, and they understand it. And you have these data teams who are really, in a lot of places, knowers of the schema. They're glorified query compilers who can actually run an analysis or something like that. They understand which columns are correct and which aren't correct. All of this is going to look ridiculous in 5 years. It's so absurd that this is the technology that we use to make decisions. And I think the thing that will make this look so crazy is, I mean, it doesn't look crazy to go to work in a car until, you know, jetpacks get invented or something like that. I think when data virtualization is a widely used, widely applied technology, that'll solve a lot of these problems. You'll have a virtualization layer where you can understand what your dataset actually means. It can explain it to you. It can proactively tell you things about its correctness or anything like that. So, I mean, this is what we're building. I think this is the data infrastructure that companies need. This is the decision infrastructure that's gonna make it possible for companies to do more complicated things at scale and serve their users better. But obviously, I think what we do right now is, you know, 2% of what we'd like to do. So for anybody who wants to follow the work that you're up to or get in touch, I'll have you add your preferred contact information to the show notes.
[00:43:59] Unknown:
And you already talked a little bit about this, but from your perspective, what do you see as being the biggest gap in the tooling or technology that's available for data management today? Yeah. I mean, I think it's that lack of a virtualization layer. I think
[00:44:12] Unknown:
people are building really powerful tools that solve the easy parts of this problem. People are building tools that let you run Hadoop jobs over bigger and bigger datasets, or shinier and shinier visualizations, or analysis front ends with more bells and whistles, but the underlying problem is the data provenance, the data integration. People don't have a complete dataset. They don't have the event they needed tracked, or they don't understand what checkout step 2 means, or they don't know if they can trust all these different things. They don't know if they should be looking at the price column or the amount column. Those, I think, are the interesting and hard problems here. That's where it'll actually unlock all the value of this stuff for businesses. That's what strikes me as
[00:44:56] Unknown:
just egregiously missing and, you know, that's obviously the problem we're trying to solve. Alright. Well, thank you very much for taking the time today to join me and discuss the work that you're doing with Heap. It's definitely a very interesting product and one that I just started using while I was doing research for the show. So I'm looking forward to interacting with it as you continue to grow the business. So thank you for that, and I hope you enjoy the rest of your day. Thank you so much for having me. It's a
[00:45:23] Unknown:
pleasure.
Introduction to Dan Robinson and Heap
Dan's Journey into Data Management
What Heap Does and Its Unique Approach
Ensuring Reliable Data Capture and User Experience
Lifecycle of an Event in Heap
Third-Party Integrations and Data Complexity
Scaling Challenges and Architectural Changes
Technical and Business Challenges
Impact of GDPR on Heap
Future Plans for Heap
Biggest Gaps in Data Management Tooling