Summary
A/B testing and experimentation are the most reliable way to determine whether a change to your product will have the desired effect on your business. Unfortunately, being able to design, deploy, and validate experiments is a complex process that requires a mix of technical capacity and organizational involvement which is hard to come by. Chetan Sharma founded Eppo to provide a system that organizations of every scale can use to reduce the burden of managing experiments so that you can focus on improving your business. In this episode he digs into the technical, statistical, and design requirements for running effective experiments and how he has architected the Eppo platform to make the process more accessible to business and data professionals.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
- Your host is Tobias Macey and today I’m interviewing Chetan Sharma about Eppo, a platform for building A/B experiments that are easier to manage
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Eppo is and the story behind it?
- What are some examples of the kinds of experiments that teams and organizations might want to conduct?
- What are the points of friction that teams encounter when trying to build and run experiments?
- What are the steps involved in designing, deploying, and analyzing the outcomes of an A/B experiment?
- What are some of the statistical errors that are common when conducting an experiment?
- What are the design and UX principles that you have focused on in Eppo to improve the workflow of building and analyzing experiments?
- Can you describe the system design of the Eppo platform?
- What are the services or capabilities external to Eppo that are required for it to be effective?
- What are the integration points for adding Eppo to an organization’s existing platform?
- Beyond the technical capabilities for running experiments there are a number of design requirements involved. Can you talk through some of the decisions that need to be made when deciding what to change and how to measure its impact?
- Another difficult element of managing experiments is understanding how they all interact with each other when running a large number of simultaneous tests. How does Eppo help with tracking the various experiments and the cohorts that are bucketed into each?
- What are some of the ideas or assumptions that you had about the technical and design aspects of running experiments that have been challenged or changed while building Eppo?
- What are the most interesting, innovative, or unexpected ways that you have seen Eppo used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Eppo?
- When is Eppo the wrong choice?
- What do you have planned for the future of Eppo?
Contact Info
- @chesharma87 on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Chetan Sharma about Eppo, a platform for building A/B experiments that are easier to manage. So, Chetan, can you start by introducing yourself? Yeah. Absolutely. My name is Che. I've been in data science now for, like, 12 or 13 years. Most of that time was at Airbnb.
[00:02:13] Unknown:
I was the 4th data scientist there. Stayed there for about 5 years. After that, I was the 1st or 2nd data scientist at many other places. And this has kind of guided a lot of my perspective on what ultimately led me to build Eppo. And do you remember how you first got involved in the area of data? Yeah. So I was trained as an electrical engineer and then a statistician, which ended up being a pretty fortunate combination when I hit the job market. You know, I started out kind of more on the health care policy side, which was very cool to work on, you know, such important problems, but ultimately a little too slow for me. So then I joined Airbnb, which ended up being a great experience in a lot of ways. And I think 1 of the things that's unique about how I think around data teams and how they should operate is that by being at Airbnb so early, kind of before the company really had embraced data as a core process, you just appreciate the problem a little bit differently. I mean, Airbnb, I think most people probably would see it as, like, a pretty sophisticated data shop today, but, you know, it was not destined to be that way. You know? It was founded by designers. You know? Brian Chesky idolizes Walt Disney. It's not the sort of place where they're naturally reaching for data at every step.
But, you know, we ultimately won that kind of culture war, and now I would say data is a hugely instrumental process to everything there. And a core part of that was experimentation, you know, in terms of getting this process to be ingrained everywhere. So, you know, there's a lot more to my data management background. You know, I started out in production machine learning, worked on data tools. I open sourced a thing called the Knowledge Repo, which was a collaborative notebook knowledge management platform and, you know, ran a lot of experiments. But, you know, at the core of it all, I kept thinking that if we really peel back what made Airbnb turn the corner, it was experimentation.
So that brings us to where you are now with the Eppo project. I'm wondering if you could just talk a bit about what it is that you're building there and some of the story behind how you decided that this was the problem that you wanted to spend your time and energy on. You know, the reasons I wanted to work on experimentation was that and you've seen it now with a lot of these other tools. Like, there's a lot of problems the data team has to solve, and many of them have gotten a lot better. You know, back when I was at Airbnb, we had to roll our own data infrastructure, like, you know, host our own Hive clusters and literally had to build Airflow. But nowadays, so much of that stuff is turnkey, and you can get up and spin up reporting, like, pretty easily. Within a week, you can just get Snowflake, get, you know, Fivetran or something, get data in there, dbt to orchestrate it around.
But the problem I kept seeing everywhere I worked after Airbnb is that, you know, you don't build a data team to have a data warehouse. You know? That's not the goal here. You build a data team to make better decisions. And better decisions just weren't really happening at a lot of orgs even when you have this modern data stack. And to me, a real crux of that is that, you know, reporting, while it can be very informative, doesn't really have much leverage in product decision making. You know, you'll spin up some dashboard. It has, like, top line revenue or something like that. And, you know, the CEO and the board are happy because they can report on stuff, but product teams are staring at that dashboard and thinking, like, you know, I hope it goes up. You know? I don't really know what it means for me. But once you build these experimentation cultures, once you make it really fluid and easy, suddenly, every product team has an intimate connection with metrics. They have alignment on, like, here's what I am trying to boost, and here's what has and what has not succeeded at moving that metric. And so it has this amazing second order effect of just getting people to engage with data in a serious way. And then the other thing that really made me excited around building Eppo was that when you nail experimentation, it leads to this entrepreneurial culture where suddenly people can take risks. They can try stuff out. You know? And you can have, you know, amazing product developments come from some engineer in the building. It's very empowering that way.
But, you know, as, you know, most companies see with the commercial landscape, there's really not great options to have an experimentation culture like Airbnb's or Netflix's. Like, there's not any tool you can get. And so you see every team have to build the same thing in house over and over, and they all look very, very similar. And so at Eppo, what we're basically saying is that paradigm doesn't really make sense. Experimentation is not a nice to have. It's a core part of the data team's operations, and everyone should have the type of tooling you see at Airbnb. Yeah. And experimentation and A/B testing
[00:06:45] Unknown:
build analytics by trying to actually conduct experiments. There are just a lot of challenges and points of friction. And I'm wondering if you can just talk through some of the kinds of experiments and A/B tests that teams might want to conduct and some of the sort of technical points of friction that exist in sort of the general state of the data ecosystem and in organizations that lead to this challenge of actually being able to effectively build and run these experiments and create that fast feedback cycle?
[00:07:17] Unknown:
Yeah. Absolutely. So kinda starting off with what types of experiments would people run, I think it's 1 of those interesting things. If you go to an Uber or Airbnb, you know, every product launch basically has an experiment. It becomes ubiquitous. Now that's not necessarily a realistic expectation for every company, but, you know, that's kind of where you end up asymptotically. But where people start out, we usually see 2 teams kinda be the tip of the spear for experimentation. It's either some sort of growth team that is, you know, working across a lot of product surface areas trying to boost concepts like activation, engagement, retention, or it's a machine learning team who is constantly iterating on their models, and they need to kinda validate that 1 model is better than another. I think it's pretty logical, you know, this is the exact same trend that happened at Airbnb. You know, experimentation started on the search ranking team before spreading to more of the org. And I think it makes sense because growth and ML teams tend to be the first ones who have OKRs that are about metrics, that are around, like, you need to boost this metric as opposed to shipping these features.
The type of experiments they would run, you know, ML teams iterating on models is just the classic 1. For growth teams, it can be various ways of lowering friction. You know, maybe we don't need this additional screen. They can just go right to the checkout page or creating more urgency. You know, at Airbnb, we have messaging saying, like, you know, 10 people are looking at this listing. It ends up driving quite a lot of bookings. It can even be really small things. You know, 1 of the stories I was telling is that, you know, I was at Airbnb for 5 years from 2012 to 2017. By far and away, the most impactful experiment we ever ran in those 5 years was this engineer made it where if you click on an Airbnb listing, it opens it in a new tab. That 1 little thing boosted bookings by, like, 2 or 3%. It was crazy and, like, as much as many multi-quarter projects. So, like, once you establish experimentation, like, all sorts of things can become tests. And, you know, as anyone who's been part of these cultures knows, it can be very humbling to see what works and what doesn't. And as far as the actual process of designing and building and deploying and analyzing the effects of an experiment, I'm wondering if you can just talk through that overall process and some of the
[00:09:28] Unknown:
sort of technical considerations that go into it, some of the design elements, the stakeholders that are involved in each of those stages of the process that goes into saying, okay. This is something that we want to test. This is something that we have found out from this experiment that we want to actually permanently change.
[00:09:45] Unknown:
Yeah. Absolutely. And I'm glad you phrased it this way because 1 of the things that I think a lot of tooling doesn't get right is just understanding the end to end of an experiment process. So, you know, when you're running an experiment, there's basically 6 steps that happen. So first, you have to have some alignment around, like, what are you gonna run? You know, we want to improve, you know, the search funnel. We want to decrease customer support tickets. You know, just some idea of, like, what are we even trying to solve? Then you come up with these hypotheses, and you design an experiment, which is, like, if we, you know, put 6 items on the screen instead of 3, you know, we think people will, you know, find more inventory before getting distracted or before leaving the platform.
And so there's an experiment design stage just saying, like, we want to test this thing now. And kind of alongside that kind of more conversation of, like, what is the best product to improve things is also a statistical question of, you know, if we want to test this thing, like, how long does it take to run? So there's this thing called a power analysis, as probably most of your audience has come across. So you have to, you know, run a power analysis, define who is going to see this experiment, what metrics you're going to boost, and just to make sure it's a tractable thing. Once you have an idea of what experiment you're gonna run, then you have to actually run the experiments. And this is something where, again, there's a lot of challenges, where, when you go from the world of these kinda Netflix-level tools to trying to patch it together yourself, you come across all the sorts of failure modes you can enter into.
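Before getting into those failure modes, here is a rough sketch of what the power analysis step mentioned above typically involves; the metric, baseline rate, and traffic numbers are illustrative assumptions, not Eppo's implementation:

```python
# A minimal power analysis sketch for a conversion-rate metric.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10              # assumed current conversion rate
minimum_detectable_lift = 0.01    # smallest absolute lift worth detecting

# Translate the two proportions into a standardized effect size (Cohen's h).
effect_size = proportion_effectsize(baseline_rate + minimum_detectable_lift,
                                    baseline_rate)

# Required sample size per variant at 80% power and a 5% significance level.
n_per_variant = NormalIndPower().solve_power(effect_size=effect_size,
                                             alpha=0.05,
                                             power=0.8,
                                             ratio=1.0)
print(f"~{int(n_per_variant):,} users per variant")

# Divide by expected traffic per variant to estimate how long the test must run.
daily_users_per_variant = 2_000   # hypothetical traffic
print(f"~{n_per_variant / daily_users_per_variant:.0f} days to reach 80% power")
```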
There are things like imbalanced randomization, where, you know, you launch some experiment. You think it's 50/50, and it's actually 51/49, which might not seem like a big deal, but that 1% off could be a very biased cohort and can ruin your whole experiment. Another thing that can happen is some sort of bug can get introduced either by your own team or by another team. You know? These organizations are pretty dynamic in terms of the amount of product changes that are happening. So it's kind of a bummer if you're running some experiments, and halfway through, some bug gets introduced. Another thing is when you're running these experiments, you know, people are always looking at the results. You know? Like, I think this is 1 of the funnier paradoxes of experimentation where, you know, if you read the stats textbooks, like, especially if you're using the frequentist thing, the t-test thing, like, you're not supposed to look at results until the end. But, obviously, everyone looks at these results all the time. And so, you know, how do you make sure you can hold people back from, you know, bad statistical practices?
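As a concrete version of the randomization-balance check described above (often called a sample ratio mismatch test), here is a minimal sketch with made-up counts; the alert threshold is an assumption rather than Eppo's default:

```python
# Check whether observed assignment counts are consistent with a 50/50 split.
from scipy.stats import chisquare

observed = [50_940, 49_060]            # hypothetical users bucketed into A and B
expected = [sum(observed) / 2] * 2     # what a true 50/50 split would produce

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:
    print(f"Possible sample ratio mismatch (p = {p_value:.2e}); investigate before trusting results.")
else:
    print("Assignment counts look consistent with the configured split.")
```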
So all this stuff is happening while it's running. Once it's done, you need to kind of aggregate the results, report them out. Typically, this is where data scientists and analysts kind of get brought in more seriously and end up spending a lot of bandwidth. So this is calculating the overall results plus any other informative data points. And then the team has to make a decision. And making a decision, you know, can be hard if you haven't created good alignment ahead of time. You know? Some of these metrics can be in opposition. You know? For example, if you're at DoorDash, like, what happens if orders go up and revenue goes down or something like that? You know? You have to have some idea of what you're gonna do there. And then kind of once you've made your decision, then, you know, you want to disseminate the knowledge because, like I mentioned before, you know, I worked on the Knowledge Repo, which did a lot of work around kinda creating a common knowledge base for people to draw on. 1 of my big takeaways from that is that institutional knowledge, the sort of stuff that, you know, people talk about to others around what made a company succeed, a lot of that comes out of successful experiments because experiments really clearly show that this thing succeeded and this thing didn't. It, you know, really tells the story of the company there. So 6 steps, you know, quite a lot of depth to each of them that involve cultural barriers and technical barriers, statistical barriers.
We're trying to make all of that go much more fluidly so that you don't have to take, like, a Reforge class or take extra training to actually run this process.
[00:13:48] Unknown:
It's interesting that you mentioned the sort of potential negative impacts on various metrics where you're conducting an experiment to say, I wanna increase conversions, but maybe those conversions are actually of a lower total lifetime value. And so your long term revenue actually dips despite the fact that you're adding more people at the top of the funnel. And I'm wondering if you can talk through some of the types of sort of unintended consequences and some of the statistical errors or misinterpretations that commonly crop up in the process of actually building and analyzing these experiments and some of the ways that you can actually identify and guard against some of these negative outcomes from the experiment because you're saying, I'm looking at this number over here, but the number that's actually changing is something that I'm not paying any attention to. And so I might actually see that there is an impact from this experiment, but not the 1 that I'm looking for.
[00:14:39] Unknown:
When we talk about these kind of failure modes, you know, there's quite a lot of them. I think this is 1 of the things that doesn't quite get appreciated from people hand rolling it. So, you know, at the basic level, we talked a little bit around some of the diagnostics you need. Like, when you're running experiments, you think you're testing 1 thing, which is just, you know, your product idea, but you're actually testing not only your product idea, but your execution of that product idea. Did you build it without bugs? Did you build it with good design? You know, was it a viable product? And then you're also testing your infrastructure. You know, are you randomizing properly? Are you using the right way to calculate a metric and not the wrong way to calculate a metric?
And are you kind of noticing if experiments are interacting? So there's a bunch of stuff around the general health of an experiment that you need to handle. But even outside of the general health, there are ways in which kind of organizations will, you know, run into trouble. So 1 of the big ones is early stopping. So this is something that gets a lot of attention in stat circles where most companies are using the frequentist t-test, z-test paradigm for running experiments. If you're running those, then you're not supposed to look at the results until the end. But most organizations do. They see it go green.
There's not a data scientist who's kind of holding everyone back, so they stop it early. And if you do that policy where you say, like, this experiment is supposed to run for a month, but a week in, it looks green, and I just wanna move forward, you're gonna end up with a lot of false positives. It's gonna basically take away a lot of the benefits of running experiments. And there are ways to mitigate that, you know, both from statistical methods or from processes, but, you know, that's a pretty common 1. As insidious as that is, something that I think doesn't get appreciated enough is actually running the experiment for long enough to detect a signal. So 1 of the kinda more sad stories for me was my most recent gig was at Webflow.
And when I joined, they had just finished this experiment. They finished it maybe, like, a month or 2 before I joined. And it was on this product that I thought was a really good product. You know? They had a pretty good set of research. They had incrementally built it. You know? They had done the product work in a good way. They ran the experiments, and it didn't show an effect. And everyone was very disappointed. You know? The team was kind of steadily, you know, disbanded, the resources reallocated. But the experiment, they only ran it for a week. And if you did the math on it, it was supposed to run for 2 months or 3 months. You know? It was just not nearly enough time to detect an effect for this dataset.
And that's too bad because, you know, the organization internalizes it as saying this product didn't work. But in actuality, it just didn't follow the practices you're supposed to do for experiments. So, you know, that's another 1. Another really common pitfall is not looking at the right metrics. So, you know, you alluded to it in terms of short term, long term effects. A lot of the tools that people purchase off the shelf only tell you kind of widget clicks as metrics. You know, you can only use these click events and talk about conversions that way. But if you talk about the metrics that a business cares about, they can be a little bit longer term. They can be things like net subscriptions, you know, retention, activation, these sort of things.
And you need a system that can actually use those and encourages you to do them. The other thing is guardrail metrics. So maybe your experiment boosted revenue, but it also increased customer support tickets by making people very confused. So having some sort of policy of saying, like, every time you run this experiment, you need to also check these other metrics just to make sure our system stays healthy. So, you know, quite a lot of things you want to get right with your experimentation program. We try to make it very easy so that, like, the product teams who are running experiments don't have to know all of those things. They just kinda get this health report. They automatically get nudged to the good practices that lead to signal at the end.
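To illustrate the early-stopping pitfall described above, here is a small simulation (an editorial sketch, not anything Eppo ships) showing how peeking daily at a fixed-horizon t-test on an A/A experiment with no true effect inflates the false positive rate well past the nominal 5%:

```python
# Simulate "ship it as soon as it looks green" on experiments with no real effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_days, users_per_day = 1_000, 28, 500
false_positives = 0

for _ in range(n_experiments):
    a = rng.normal(size=(n_days, users_per_day))
    b = rng.normal(size=(n_days, users_per_day))   # same distribution: no true effect
    for day in range(1, n_days + 1):
        _, p = stats.ttest_ind(a[:day].ravel(), b[:day].ravel())
        if p < 0.05:                               # peek: it "looks green", stop early
            false_positives += 1
            break

print(f"False positive rate with daily peeking: {false_positives / n_experiments:.0%}")
```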
[00:18:32] Unknown:
In terms of the design aspects of experimentation frameworks, that's definitely something that can help with guiding people into the kind of best practice of how to actually build and think about these experiments. And I'm wondering, what are some of the design and user experience principles that you focused on in the Eppo platform to improve that overall workflow of building and analyzing the experiments and helping to make people aware of some of the pitfalls in their kind of logical approach to the experiment, the ways that they're conducting it, preventing people from stopping it early by raising a warning flag saying, don't do this. You're not actually getting the numbers you think you are, and just helping to kind of educate people along the way.
[00:19:15] Unknown:
So there's 2 high level principles we operate with. I can go into some of the more tactical things. So the 2 high level principles we really believe in are, 1, is that people need to be given the decisions they're equipped to make. So every company has, you know, a finite set of what I would call experiment specialists. These are people with statistics training. Maybe they've run experiment programs before. They can be senior product leaders, senior data leaders, engineers, whatever it might be. But it's a finite set. You know? It's not enough to cover the appetite for experiments everywhere. We want those people to have their decision making scale across the org. You know? They set the rules of engagement.
Like, here are the metrics that people are gonna use. Here's the statistical regime we're gonna use by default. Every experiment will have to run for at least 2 weeks. You know? And, ideally, you know, you finish it to completion. You know, a bunch of these sort of principles. And then all the rest of these product teams, when they run experiments, they don't have to think around, like, you know, how do I figure out how long to run it? Like, am I reading these results correctly? And so, in general, we say that, you know, experiment specialists set the rules of engagement.
All the product teams kind of can turn the crank within that kind of guided sandbox with guardrails. Now, tactically, what that means is there are some things which we just automate. So, for example, the power analysis thing is, you know, very core to getting signal out of an experiment practice, so we're just automating that. We create, like, a little progress bar so you can just see, like, you know, to get a signal according to the defaults set by the specialists, typically, that's, like, you know, 80% statistical power, you need to run this long. And we show you that you're 70% complete or whatever. The other things we do is allow people to structure metrics with guardrails versus, you know, other metrics you might use in a more bespoke fashion.
Another thing we do is, you know, I alluded to it before that experimentation has this central paradox, which is that, you know, from a statistical hygiene standpoint, you're not supposed to look at that many metrics. There's a whole problem called the multiple hypothesis comparison problem, where, like, the more things you look at, the more likely you're gonna get some false result. And so that's the statistical point of view is that you're not supposed to look at many things. But as anyone who has ever, you know, run experiments in an organization knows, the organization wants to look at a lot of things at a lot of points in time. And so how can you acknowledge that both forces exist but still do better decision making? And the way we do that is we have a much more opinionated reporting interface for what we call decision metrics, which are the things you're gonna make your decision off of, the primary metric plus guardrails.
And then for all these other kinda more bespoke metrics, widget clicks, you know, funnels, whatever, slice and dice, we create a separate BI space where, you know, you can go and explore those things, but it's kind of understood that that's where, you know, the stuff that is closer to explanatory, setting up the next experiment to succeed versus making a decision off of, all that stuff exists in this explore tab. So in general, we try to say we use a mixture of automations of things that should happen, design nudges for things that we think should happen, and statistical practices that kind of just take certain problems off the table, such as using sequential analysis and Bayesian methods.
There's more to it. But, yeah, those are the general principles is, you know, make it so that people don't have to know all the complexity of experiments to run experiments and, you know, design these interfaces to, you know, be able to scale to an organization.
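For a concrete sense of the multiple hypothesis comparison problem mentioned above, here is a minimal sketch, assuming a plain frequentist setup and hypothetical p-values, of correcting a batch of exploratory metrics so the false discovery rate stays controlled:

```python
# Benjamini-Hochberg correction across many metric p-values from one experiment.
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values: one decision metric plus several exploratory metrics.
p_values = [0.003, 0.04, 0.06, 0.11, 0.20, 0.34, 0.48, 0.72, 0.81, 0.95]

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for raw, adj, significant in zip(p_values, p_adjusted, reject):
    print(f"raw p={raw:.3f}  adjusted p={adj:.3f}  significant={significant}")
```

With the correction, only the strongest result survives, which is the same pressure toward a small set of decision metrics that the opinionated reporting interface described above encodes.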
[00:23:02] Unknown:
1 of the interesting things about the idea of experimentation so my kind of initial introduction to this was through Optimizely, which is a very narrow scope of experimentation that just has to do with the UI elements within a given product. But because you're dealing with experimentation at the product level and it's potentially much more expansive in terms of the types of things that you can test, I'm curious how that plays out in terms of a multiservice or a multiproduct experiment where maybe you want to say, we want to see what the impact is on our overall revenue if we make this change across all of our product line. You know, maybe you're updating your branding and you're doing it selectively. And some of the ways that you have to think about collecting that information and being able to store and aggregate it and be able to build an effective analysis on top of that more kind of broad based set of data versus something that's coming from within the confines of a single application.
[00:24:00] Unknown:
You know, the way we think about this stuff is that 1 of the overriding trends that, you know, underlies probably even this podcast is that we have this emergence of the modern data stack where there's a lot of tools to get everything into your Snowflake, BigQuery, or Redshift. You know? And that can come from your Optimizely. It can come from your own in house feature flagging service. Maybe you just offline created a 50/50 split of users and you're using that. But whatever it is, it's very easy to get that into the data warehouse. Similarly, you know, we have seen an overwhelming number of tools that now say, when you are pulling metrics, whether it's for the CFO or for your product manager, that's also gonna live in Snowflake, BigQuery, Redshift.
And so what we've done is the entire back end of Eppo is your cloud data warehouse. So we do all of our calculations there. You know? We pull both the experiment exposure data as well as the metric data from there. We have a pretty rich kinda semantic mapping of how we can utilize that data that was vetted from our time at Airbnb. So what that means is that if you have multiple points of sale, so suppose you have a retail operation and an online operation and revenue is pulling from multiple of these things, or suppose, for experiments, you have a client for your server side and for your mobile apps, all that stuff, the data can still end up in your data lake, and we can utilize it. It's kind of built modularly to allow for a kind of more complex experiment setup.
[00:25:29] Unknown:
And can you talk through a bit more about how the Eppo platform itself is implemented and some of the additional systems or services that are necessary to be able to actually feed into Eppo to be able to build and manage these experiments, particularly thinking in terms of things like feature flagging or the specific business intelligence layers that you're able to hook into and things like that?
[00:25:49] Unknown:
Today, you know, we wanted to build an analysis centric experimentation platform, and that naturally led us to really focus on the diagnostics, the investigations, the results reporting. That means that our product today is built with a bring your own feature flagging, bring your own warehouse model. So we actually sit on top of Optimizely, LaunchDarkly, or your own feature flagging service. You know, whatever you're using, you know, we're happy to sit alongside that. All we do is we pick up that data from the warehouse. So that naturally means, you know, the product today, you probably need something like that, or we're happy to set you up with an open source solution to get going on your feature flagging.
The other thing with the way our system is built is, you know, we primarily sell to data teams, and that's because we've kind of built so much towards this world of Snowflake as the center of the universe or BigQuery as the center of the universe. That also means that you should probably have, you know, your dbt or Airflow layer to get a lot of value out of, you know, our tool. It's not a strict necessity, but, you know, it does mean that there is someone in the org who is thinking around what the right representation of metrics should look like. So that's usually the point at which organizations start thinking more seriously about experimentation anyway once they start building a data team. And so we kind of naturally slot into that maturity curve.
[00:27:12] Unknown:
In terms of the integration points that are available for hooking into Eppo or for Eppo to be able to feed into other systems, I'm wondering what your initial conception was as far as the overall scope of the Eppo platform and where you wanted to rely on some of these external systems and some of the ways that people can drop Eppo into their existing environments and hook in some of these different signals and metrics to be able to feed into the overall experimentation process.
[00:27:37] Unknown:
In terms of, you know, the hookups and integrations, you know, today, like, you know, it's very simple. It's that all you do is just link us to your data warehouse. From there, what you have to do is just configure a bunch of the metrics and dimensions into our system. It basically says, like, here's how we calculate revenue. Here's how we calculate purchases. Here's how we calculate activations. Our underlying abstraction is SQL. So it's a similar paradigm you might see in, you know, some of these metric layers or kind of LookML. What's great is that, you know, SQL is, you know, a widely accessible language, and it's pretty expressive.
So, you know, even if your warehouse is in some varying state of cleanliness or having canonical tables, we can work alongside any of those sort of systems. You just give us a SQL snippet. You know? All it has to do is give us fact tables. So, you know, this amount of revenue was accrued at this time. These entities were involved. You know? There was a user. The user purchased an item. And from there, what we do is we can build a pretty rich experiment centric BI layer where you can look at results not only overall, but you can slice them by various concepts. You know, you can slice them by, you know, the typical things like country or device, but also more warehouse specific things like user persona or first touch marketing channel, stuff like that.
And then the other crucial part about this is that experimentation is a system that continuously involves configuring new metrics. Like, you know, a new team starts adopting it. They want to drive some other kind of internal metric. They wanna get that configured into the system. So we invested a lot in making it very easy to add more and more metrics, add more and more dimensions so that each team can get the sort of analysis capabilities they need.
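As a purely hypothetical sketch of the "metric as a SQL fact table" idea described above (the Metric class and field names are illustrative, not Eppo's actual API), a configuration might look something like this:

```python
# Illustrative metric configuration: a SQL snippet that yields fact rows,
# plus metadata describing how those rows roll up per randomization unit.
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    entity: str          # the unit the experiment randomizes on
    aggregation: str     # how fact rows aggregate per entity (sum, count, ...)
    fact_sql: str        # SQL returning (entity_id, timestamp, value) rows

revenue = Metric(
    name="revenue",
    entity="user",
    aggregation="sum",
    fact_sql="""
        SELECT user_id, purchased_at AS ts, amount_usd AS value
        FROM analytics.purchases
    """,
)

# An analysis job would join facts like these to the experiment exposure table
# in the warehouse, keeping only events after each user's first assignment.
```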
[00:29:27] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values.
DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold. Beyond the technical requirements for being able to run these experiments, as we've noted, there are a lot of organizational elements to it and a number of design requirements involved. Wondering if you can just talk through some of the decisions that need to be made when figuring out what kinds of things you want to change and how to measure the impact and what you actually want to change to be able to get the desired outcome, because there can be several orders of removal from, you know, the metric that you actually want to impact versus the change that you actually want to make.
[00:30:54] Unknown:
You know, the cultural side of experimentation is part of the thing that gets me excited around this. It's like the intersection of tools and culture. If you're trying to set up an experiment, especially your first few, you hit on exactly some of the initial questions you should ask, which is what metrics are you trying to improve? And then mathematically, how long would it take to measure those metrics? You know, for some metrics, maybe your sample size is not quite that large. You know, maybe these metrics are very lagging. And so you have to have a kind of candid conversation around, like, even if we ran this experiment, if it took 6 months, is that, you know, a good practice for us? Now if it is 6 months, there's a few levers to pull. You know? And I think this is something that you see in a lot of orgs, you know, even outside of using Eppo, where there's a whole world of calculating leading indicators, which are kind of more upstream actions that the data team has shown are, you know, highly correlated to long term outcomes.
And so when you see these things like LTV models, like early life cycle LTV models, it's sort of related there. You know? At Webflow, for instance, we saw that even though it's a SaaS business and, you know, seeing if someone's gonna subscribe and retain takes months, the things that were really indicative of getting value were, you know, publishing a site and getting traffic onto it. You know, Webflow is this website builder, so once someone got the site out there, once people were looking at it, then, you know, good things tended to happen. And the steps on the way to that were also just repeated publishing. So, you know, these are some metrics that, you know, you might use for your early experimentation efforts.
But the point I wanna emphasize with that is that I think most people get the data science, statistical standpoint on it, which is find metrics that are, you know, correlated and then can be proven causal to long term effects. But there's an organizational, cultural side, which is you're trying to kind of build organizational alignment that this product is good, and this product did not succeed. So you need to find the metrics that kind of have a good audience among product leaders, among, you know, the c suite. You know, basically, as an experimentation effort, your ability to drive organizational change is gonna be through alignment building, which happens through picking the right metrics.
[00:33:12] Unknown:
In terms of actually building and running these experiments, obviously, there's a bit of experience that's required to be able to actually do it effectively. And I'm wondering if you can talk to some of the ways that organizations who are early in the journey of actually building and running these experiments can kind of flex that muscle and build up the capacity to be able to more effectively run these experiments and more intuitively understand the knobs that they want to twist to be able to actually have the desired outcome.
[00:33:42] Unknown:
I think 1 of the things we have seen in most companies, and I've personally seen at every company, is there's an experiment maturity curve that you have to go through before you start kind of building alignment, showing clear ROI. You know, Reforge actually emphasizes this a lot, so if anyone has taken that class. But, basically, to have an experiment practice, there's a crawl, walk, run progression you have to go through. And the way Reforge teaches it is learn to test, learn to learn, and then learn to win. So, you know, digging into that, you know, the first step is to, like, just learn how to do an A/B test at all. Learn to do 1 that's healthy, that doesn't have a bunch of bugs in it.
I have been on multiple teams that have had OKRs of run 1 experiment this quarter. And it sounds sort of funny and facetious, but then you start seeing, like, yeah, getting proper randomization, building a thing that didn't have a bug in it, and actually completing and understanding the results. Like, there's a lot of pieces that have to fall into place there. So, you know, the first step is learn to test, and this is something where I think Eppo, you know, and tools like that can really, really help you out. The second piece is to learn to learn. So 1 of the things around experimentation is the second you start measuring things, you realize, like, how few of the things you're launching are actually improving metrics.
There's been a lot of papers coming out of Google, Microsoft, Netflix, Airbnb, whatever, that basically show that, you know, of all the things you launch, usually around a 3rd end up successfully moving metrics positively, and another 3rd end up moving them negatively. So, you know, the 2 things you can think from that are, 1, you know, you need to be a little bit humble around what is actually a successful product. And 2, like, you need to figure out, like, how do you quickly get to a successful product. So what ends up happening is you need to learn to learn. You launch these experiments.
They often end up neutral or negative. You need to figure out what happened. Like, is there some, like, reason? Did it work for some users and not other users? Did it work for, you know, certain regions and not in other regions, for different viewports, for different user personas and their goals? And so what ends up happening is these organizations will send their analysts into, you know, a bunch of Jupyter Notebooks to try to, like, unpack this experiment in 1,000 different ways. At Airbnb, we built a bunch of BI type of workflows to be able to do these investigations yourself. And that's a similar thing with what we've done with Eppo is make it very easy to understand what happened even if it was not a positive experiment. So first, you learn to test, make sure that the experiment runs successfully, and then you learn to learn, where even if it ran, if it ended up neutral, you understand what worked and what didn't, what would be the reason why it didn't work. Once you can do that, that's when you start reaching these places where you can consistently show successful results. That's when you learn to win.
You've understood how to run good experiments. You've understood how to do good product development, and now you can, you know, start to show some of these metric lifts. And so the problem is that, you know, this is a maturity curve that basically everyone has to go through. When you're fighting your tools along the way, you're kind of burning the clock a little bit. If a team wants to take an experiment centric approach and 2 years in, they haven't learned to win yet, like, it's gonna be hard to justify continuing it. So I think the teams that want to run experiments would do well to take every measure they can to quickly get through this learning curve.
[00:37:20] Unknown:
And another challenge as organizations do become sophisticated and very experiment driven is that at some point, you're going to have a multitude of experiments that are running concurrently, and understanding what the actual outcome is where, you know, 1 user might be bucketed into 3 different experiments and how is that confluence of experimentation going to actually impact the numbers that you're trying to measure for each 1 of those individually, and just some of the strategies that you've seen for being able to effectively manage that, you know, large number of experiments and prevent them from conflicting and colliding with each other and causing mixed signals where, you know, maybe numbers dropped not because of the experiment that you were running, but because of a different experiment that happened to be on the same page.
[00:38:06] Unknown:
So, you know, this is a tough, deep problem. You know? We call it experiment interaction effects. So if you have a really large sample size, you know, if you're like Facebook or something like that, the way this can typically happen is that your experiment assignment tool, your feature flagging tool will have a concept called layers where it will say, like, a user can only have 1 experiment that is in the same layer. So it's just an organizational principle that says, like, you know, here's a set of experiments that are all happening in the same place. We're gonna make sure that users are only experiencing 1 of those experiments and not multiple of them.
So that's a nice clean way to do it, but it kind of relies on having, like, a very large amount of sample size. You know, for a lot of these companies, even just getting 1 experiment delivered to the whole population to finish reasonably is a hard thing. And so what you'll see, and this is what we did at Airbnb for a long time, is that a lot of companies will just kind of put it all out there, let people experience multiple experiments, and then after the fact, try to look back and see, like, you know, did something funky happen? You know? And so the typical way you do that is you do these kind of 2 experiment comparisons. You essentially make a 2 by 2 grid of saying, like, for a given metric.
You know, for people who saw neither experiment, what was the effect? For people who saw 1 of either experiment? And then for people with both, you know, what were the effects? And, hopefully, it's something that's kind of closer to additive than the opposite effect. So it's a kinda crude process. What I've seen, well, honestly, the thing I've seen happen the most is just people ignore the problem, and they just keep going. But for the people who care about the problem, that's the tactic I've seen people use the most, and we're, you know, building tools to make that 2 experiment comparison very easy.
The strategy that I'm actually excited about for Eppo, we're gonna be building this in the future, is to try to do some sort of automation around it, which is, you know, a strategy I heard from an experiment builder at Stitch Fix. But to run a daily regression where you regress against a certain, like, metric, let's say, revenue or, you know, something that's broadly applicable, and then you include covariates for each experiment and then every pair of experiments. And what's interesting there is that, like, if a covariate for a pair of experiments pops, then, you know, there's probably something weird there. I think in general, 1 thing to keep in mind is that, like, for these experiment interactions, you're typically looking for a bug. You're looking for something where, like, 1 experiment has ruined the other. And so, you know, most solutions end up looking for, like, large effect sizes in this problem.
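As a rough sketch of that daily regression idea, and an assumption about how it might look rather than Eppo's planned implementation, the check could regress a broadly applicable metric on per-experiment exposure flags plus their pairwise interaction and flag the pair when the interaction term pops:

```python
# Simulated interaction check between two concurrently running experiments.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 20_000
df = pd.DataFrame({
    "exp_a": rng.integers(0, 2, n),   # 1 if the user saw experiment A's treatment
    "exp_b": rng.integers(0, 2, n),   # 1 if the user saw experiment B's treatment
})
# Simulated revenue with small main effects and no true interaction.
df["revenue"] = 5 + 0.2 * df.exp_a + 0.1 * df.exp_b + rng.normal(scale=2.0, size=n)

# "revenue ~ exp_a * exp_b" expands to both main effects plus the interaction term.
model = smf.ols("revenue ~ exp_a * exp_b", data=df).fit()
interaction_p = model.pvalues["exp_a:exp_b"]

if interaction_p < 0.01:
    print("Experiments A and B may be interacting; investigate for a bug.")
else:
    print(f"No large interaction detected (p = {interaction_p:.2f}).")
```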
[00:40:41] Unknown:
And as far as teams who are trying to actually get used to this experimentation flow and understand how best to deploy their feature flagging and bucketing for the users who are the subject of these experiments and figure out how to actually introduce useful events and metrics as a result of the interactions that those experiments are having, I'm wondering if you can talk through some of the kind of learning curve and points of understanding that are necessary to be able to actually make this effective and maybe talk through some of the overall workflow of onboarding Eppo and making sure that you have all of the requisite pieces in place to actually be able to use it to its fullest effect.
[00:41:24] Unknown:
You know, in terms of the pieces that you need to run experiments correctly, like, here are all the pieces, and what I'll say is that most companies have maybe 1 purpose built piece and everything else is very ad hoc. So, you know, you need a feature flagging tool, some way to randomize traffic. And that can be, you know, an Optimizely or LaunchDarkly. It can be something you do in house. You know, I've seen people do very basic things like just using the last digit of a user ID, although that has its own potential pitfalls. But you just need some way to randomize traffic and deliver a different experience to both sides.
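For illustration, here is a minimal sketch of a deterministic in-house assignment scheme, assuming a salted hash per experiment rather than the last-digit-of-the-user-ID shortcut mentioned above (which tends to put the same users in the same bucket across every test):

```python
# Deterministic, experiment-salted traffic splitting.
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Map a user to a stable variant for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000          # stable bucket in [0, 10000)
    return variants[0] if bucket < 5_000 else variants[1]

# The same user always lands in the same bucket for this experiment,
# but gets an independent bucket for a differently named experiment.
print(assign_variant("user-42", "new-checkout-flow"))
```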
The other input you need is some way to kind of utilize and govern a very diverse set of metrics. So, you know, these metrics should include the core business things like, you know, revenue, activations, you know, engagement. It should also include a bunch of product analytics because the second you start digging deeper, you're gonna wanna know, was there a drop off in the funnel? You know, did people even see the widget I made? You know, did they even interact with it at all? So you need some way to utilize and govern a whole wide variety of metrics. Then these 2 tools get combined with a whole bunch of data pipelines. So first, you need to do the basic thing, which is, you know, take your primary metrics and see if it increased and run a stat test on it. And then from there, you also need to automate all these diagnostics.
So, you know, is the randomization proper? You know? Is there an inordinate number of people who are seeing both sides of the experiment and that are getting thrown out? You know? Are we handling people who are cross device, who aren't logged in yet, being able to pair them with metrics appropriately? And then the last set of data pipelines are around investigations. So, you know, I alluded to it before. Most experiments aren't gonna work. And in that world, you need to understand kind of what happened. Did it work for some people? Did it not work for some people? So there's a bunch of different data calculations you have to do. And then once all of those things are calculated, you need to put together some way of reporting, some way of conveying the results to the organization.
And this is something where there's a lot of companies out there whose reporting interface is not consumable to anyone who's not a data scientist. You know, it's some kind of private internal dashboard that people will look at and then copy paste the results into Slack or something like that. And that sort of thing, it's a pretty low trust process when you can't, like, engage with it yourself, and it means that an analyst has to be involved in every single experiment. You basically have an experiment to headcount ratio you have to maintain. So those are all the pieces. Now companies today end up either putting together an experiment team, you know, cross functional, 2 back end engineers, front end engineer, data scientist, PM sort of thing to make it all work together. Or they'll use an Optimizely type thing, pay, like, 200 or 300 k for it, and then also roll their own data infrastructure to build out the kind of metrics and data pipelines to calculate results and, you know, try to reuse, like, Looker or a Jupyter Notebook or something to do the analysis.
So for Eppo, you know, the way we think around it is what you need today is some sort of feature flagging system, like an Optimizely or in house thing, and you need a cloud data warehouse. So Snowflake, BigQuery, Redshift. That cloud data warehouse should have a transformation layer, a dbt or Airflow. So those are the 3 ingredients. Kind of from there, we can operate pretty well. You just configure your metrics into our system. You connect us to a data warehouse, and then we do all those data pipelines. We give you reporting that is kinda widely consumable, shareable, that has, like, a viral effect.
And we create a system where, you know, again, specialists can establish the ground rules, and everyone else can turn the crank and run reliably informative experiments.
[00:45:04] Unknown:
In your experience of building and growing the Eppo product and working with your customers, I'm wondering what are some of the things that you have learned about experimentation that you didn't encounter in your own work and some of the sort of ideas and assumptions that you had about the overall space that have been changed or challenged in the process?
[00:45:23] Unknown:
You know, in terms of ways in which people are using it that are sort of surprising, I think, like, there's been a lot of things which I had thought were just my personal experience, but it's just been confirmed everywhere, which is that, you know, typically, a growth team or an ML team will be the starting place for experiments. The other thing that's interesting is, you know, people have a wide variety of assignment regimes. You know? Some of them are using Optimizely, whatever. Some of them have, you know, what are called switchback assignments where, instead of sampling users, you sample periods of time. That's something that you see at Uber and Lyft, etcetera. And then some people are just doing offline assignments. And what's nice about this data lake model is that it all works with our system. So, you know, we see people doing those. The thing that has been, I would say, the most surprising, and this has been very pleasantly surprising, is that, you know, when you look at all these commercial tools for experimentation, they usually advertise the high velocity of experiments. It's all about speed, speed, speed.
When we talk to most experiment leaders, you know, the people who are the kind of purchasers of these products, they didn't necessarily want a large volume of experiments. That itself was not desirable. What they wanted were better experiments, you know, stuff that is, you know, validating and invalidating hypotheses, stuff that kind of teaches you about, you know, your customers and what they want, things that follow good practice, that follow rigor. And, you know, it's cool because our system, you know, because of the way we architected it, is well situated to deliver a lot of that. That was sort of a pleasant surprise. You know? I think growth culture in a lot of corners gets this rap of essentially throwing spaghetti at the wall. It's like just coming up with random things, putting it through a test, and, you know, trying to claim victory.
But among the people, like, you know, heads of data, you know, directors of product growth, these sorts of leaders, they want much more of an intentional, knowledge generating process, and for people to essentially follow good hygiene reliably. So it's cool to see. So, you know, we've actually changed, you know, a lot of the way we talk about it to kind of emphasize that, like, we are a tool that is well situated to, you know, enforce good hygiene, to put a hypothesis first in terms of what you're gonna test.
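As an aside on the switchback regime mentioned above, a rough sketch of time-bucketed assignment might look like the following. The bucket length, salt, and variant names are made-up parameters for illustration, not a description of any particular vendor's implementation.

```python
# Sketch of switchback assignment: instead of hashing users into variants,
# hash fixed windows of time (e.g. hourly buckets) so the whole marketplace
# sees one treatment at a time.
import hashlib
from datetime import datetime, timezone

BUCKET_MINUTES = 60
SALT = "pricing-experiment-42"  # hypothetical experiment key


def switchback_variant(ts: datetime, variants=("control", "treatment")) -> str:
    # Index of the time bucket this timestamp falls into.
    bucket = int(ts.replace(tzinfo=timezone.utc).timestamp() // (BUCKET_MINUTES * 60))
    # Deterministic hash of (experiment, bucket) picks the variant for the whole window.
    digest = hashlib.sha256(f"{SALT}:{bucket}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]


print(switchback_variant(datetime(2022, 5, 1, 13, 30)))
```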
[00:47:33] Unknown:
And as far as people who are using it, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:47:40] Unknown:
Most of the innovative, surprising things have been around, you know, the ways in which people assign to different treatment groups. For example, we have 1 customer who really wants to have a handle on preexperiment assignment bias, and they've taken it to a point where they actually just create 2 control groups. So every experiment has 2 control groups, which means they basically are running an A/A test alongside every A/B test. And so that's kinda cool. I think it means you have to devote a bunch of sample size to, you know, another control, but, you know, it does give you a, you know, good view on whether there's some bias there. Another thing that we see are people who are doing these offline batch assignments, where it's like you have your big user base, you've done some offline, like, process, maybe, like, hierarchical clustering or some way of dealing with stratified sampling, and then you want to, you know, do analysis based on that. And what's nice is that, again, our underlying principle is that we just live on Snowflake, BigQuery, Redshift. You can just upload that. You know?
We can just use that sort of offline assignment and just do all the analytics. So those have been kind of 2 things that you might call nonstandard, but have worked pretty smoothly.
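To illustrate the dual control idea described above, a minimal check might compare the two control groups to each other before trusting the treatment comparison; a "significant" difference between the controls hints at assignment or logging bias. The column names and the 0.05 threshold below are illustrative assumptions, not Eppo functionality.

```python
# Sketch of an A/A sanity check between two control groups.
import pandas as pd
from scipy import stats


def aa_check(df: pd.DataFrame, alpha: float = 0.05) -> bool:
    """df has columns: variant in {'control_a', 'control_b', 'treatment'}, value."""
    a = df.loc[df["variant"] == "control_a", "value"]
    b = df.loc[df["variant"] == "control_b", "value"]
    _, p_value = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
    if p_value < alpha:
        print(f"A/A check failed (p={p_value:.3f}); investigate assignment bias.")
        return False
    return True
```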
[00:48:46] Unknown:
In your experience of building the business and technology stack and company around Eppo, I'm wondering what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:48:56] Unknown:
I think 1 of the things that has, again, been gratifying to see is I think so much of startup success comes from saying, like, who is your customer? You know, getting a really good definition around that. Like, what type of person, what type of job title, you know, and what type of business are you trying to sell to? I think that, especially in the technology world, with technical founders, you can see a little bit more of this, like, if I build it, they will come attitude. And, generally, that's a hard, you know, startup to build. But if you instead start from the opposite thing of saying, like, who do I want to build for? And then you talk to them and say, like, what do they need the most? Then you kind of naturally end up in a place that, you know, fits the market well. So for us, you know, I was pretty passionate around the kind of mid stage growth company and selling to data teams. So that's kinda series B, maybe late A, series B to series E sort of stage.
And then beyond selling to data teams, you know, we actually wanted to sell to consumer and prosumer companies. And so, you know, that kinda forms a nice logical customer profile, and then, you know, we're selling to the head of data, head of product sort of thing. And so when you have a lot of definition around that, a lot of great things come out of it. You know, your product decision making, your messaging, you know, your resource allocation, a lot comes from there. So that's been 1 of the big lessons that has continued to yield dividends for us.
[00:50:25] Unknown:
In terms of people who are interested in being able to more effectively build and run and manage their experiments, what are the cases where Eppo is the wrong choice?
[00:50:34] Unknown:
I think there's probably 2 ways in which Eppo might not be the right choice. So 1 is, you know, if you're pretty early on, like, if you don't really have enough sample size to run experiments, you know, repeatably, you might be better off with some sort of staged rollout feature flagging tool that basically just makes sure you haven't crashed the site, and that's it. We are trying to let you do a lot of different investigations, analysis, diagnostics around experiments, but it might be, you know, not as applicable if you just don't have the sample size to run a lot of experiments. So that's why, you know, I said before with our customers, like, series B and onwards is kind of a nice spot for us.
The other is if you don't have the modern data stack yet, like, if you don't have Snowflake, Redshift, BigQuery, if you don't have dbt or Airflow. You know, we have built our system to be really native to that stack. So if your metrics are all sitting in, you know, a Postgres somewhere, or maybe you don't have a database at all and you're just kind of reading, you know, a clickstream off of RudderStack or something into various tools, you know, that's also not a place we have built for. So, you know, those are probably the 2 instances. Either you're too early and you don't have enough sample size to run experiments, or you are using a data stack that is not, you know, your typical cloud data warehouse plus transformation layer.
[00:51:58] Unknown:
And as you continue to build out the Eppo platform and business, what are some of the things you have planned for the near to medium term, or any capabilities that you're particularly excited to work on?
[00:52:08] Unknown:
So, you know, there's quite a lot of automations to keep building in. You know, if you just think of all the reasons why an experiment might fail or come with caveats, like, you know, we want to continue building these diagnostics and health checks. We also wanna automate all the things that, again, a lot of this tool is built out of me trying to scale the Airbnb experimentation culture at companies where there's only, like, 2 or 3 data scientists. So, basically, situations where experiment specialists are very bandwidth limited. And so I keep wanting to automate the things that, you know, I had to keep getting looped into. So already, we've made it very easy to, like, introspect data quality, to see, like, you know, is the data coming in appropriately? Has the table been updated in the past few days? Like, just basic sanity checking on data quality. But in the future, we wanna make it so things like power analysis can be done, like, before the experiment starts.
You know? Like, kind of in that planning stage. Like, we have all the, you know, metrics. We have all the data. Like, we can make a very good experience around that. And so that's gonna come in, like, the very near future, which we're very excited about. And then the other thing we're gonna be investing more and more into is, you know, I mentioned before, a lot of my former background was building the knowledge repo. I believe experimentation is naturally a knowledge generating process, you know, when done intentionally. And so if you think of the whole end to end journey of saying, like, we need to figure out the right experiments to run, you know, the stuff that, you know, validates or invalidates the most important questions of the business, you know, how do you organize in that way? And so it can look a little bit like, you know, experiment epics, like Jira or something like that. And then on the other end, how do you disseminate the results in a way that kind of spreads across the org and across time into the future? You know? How can you make it easy so people can know what experiments have been run? These are all places I'm very excited to invest in.
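For readers unfamiliar with the pre-experiment power analysis mentioned above, a minimal sketch of the calculation looks something like the following, using statsmodels. The baseline rate, minimum detectable effect, and thresholds are illustrative assumptions, not a description of Eppo's feature.

```python
# Sketch of a pre-experiment power analysis: given a baseline conversion rate
# and the smallest lift worth detecting, estimate the per-variant sample size.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10          # current conversion rate (assumed)
mde = 0.01               # minimum detectable effect: 10% -> 11% absolute

effect_size = proportion_effectsize(baseline + mde, baseline)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,          # false-positive rate
    power=0.8,           # 1 - false-negative rate
    alternative="two-sided",
)
print(f"~{int(n_per_variant):,} users per variant")
```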
[00:53:51] Unknown:
Are there any other aspects of the work that you're doing at Eppo or the overall space of building and running experiments that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:05] Unknown:
I just want to reaffirm the kind of broader point, which is, like, why do you even run experiments? Like, why even have a data team? Right? Because I think a big part of why I started Eppo was to say, like, you know, the whole point of having a data team is to drive better decisions. You know, the whole point of running experiments is to make metrics a central, you know, component of how you run the business. And experimentation, the reason we're all so excited about it, is that it's a natural point of leverage for doing all of those things, for making a metrics oriented point of view of product development, of product prioritization, you know, become the DNA of the company.
So as people are thinking around the different ways a data team could invest in different capabilities, and when they're looking at the huge variety of data tools available on the market from all up and down the stack, you know, it's worth remembering, like, you know, data teams succeed when they show clear impact and when they help establish this worldview of, like, data as a central way of making decisions. And once you kind of get back to that central core of why you have a data team, you know, experimentation becomes a very natural, like, place where you have to invest.
[00:55:12] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:55:27] Unknown:
I think, you know, so much of this comes back to kind of similar themes to what I said before, which is that most of the data management kind of discourse can really center around technical problems and not organizational problems. You know? Like, no one cares about, you know, the bar charts or line charts or, you know, tables themselves. What people wanna do is, you know, make good decisions and, you know, get ROI from their data team. And so what I think data management needs is a theory of change. You know? Just like, what do data teams do in an organization? Like, why do they exist? Like, if you snap your fingers and make it so, like, you know, you just plug a, like, Looker block on Snowflake or the Fivetran thing, like, you know, if so much of this sort of reporting work is going to continue getting, you know, very easy to do, like, you know, I would guess you still need a data team to drive certain change, and, you know, that's what's needed there. So, you know, obviously, this is very tightly connected to why I started Eppo and why I focus on experimentation. But I think data management discourse, and discourse in general on data teams, should, you know, really get back to, like, you know, how is the data team gonna have impact in an org? You know, what change is it going to induce that will just help the company succeed?
[00:56:44] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Eppo and your experiences of being able to build and run and manage these experimentation systems for businesses of various sizes, and just a lot of the challenges that come along with that. It's definitely a very interesting and useful problem area, so I appreciate all the time and energy you're putting into it, and I hope you enjoy the rest of your day. Absolutely. It was great talking with you, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Chetan Sharma: Introduction and Background
The Importance of Experimentation in Data Teams
Types of Experiments and Technical Challenges
Designing and Running Effective Experiments
Unintended Consequences and Statistical Errors
Design and User Experience Principles in Eppo
Handling Multi-Service and Multi-Product Experiments
Implementation and Integration of Eppo
Organizational and Cultural Aspects of Experimentation
Building Experimentation Capacity in Organizations
Managing Multiple Experiments and Interaction Effects
Onboarding and Workflow with Eppo
Lessons Learned and Customer Insights
Future Plans and Exciting Capabilities for Eppo
Final Thoughts and Closing Remarks