Summary
Every business has customers, and a critical element of success is understanding who they are and how they are using the company's products or services. The challenge is that most companies have a multitude of systems that contain fragments of the customer's interactions, and stitching that together is complex and time consuming. Segment created the Unify product to reduce the burden of building a comprehensive view of customers and synchronizing it to all of the systems that need it. In this episode Kevin Niparko and Hanhan Wang share the details of how it is implemented and how you can use it to build and maintain rich customer profiles.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack
- Your host is Tobias Macey and today I'm interviewing Kevin Niparko and Hanhan Wang about Segment's new Unify product for building and syncing comprehensive customer profiles across your data systems
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Segment Unify is and the story behind it?
- What are the net-new capabilities that it brings to the Segment product suite?
- What are some of the categories of attributes that need to be managed in a prototypical customer profile?
- What are the different use cases that are enabled/simplified by the availability of a comprehensive customer profile?
- What is the potential impact of more detailed customer profiles on LTV?
- How do you manage permissions/auditability of updating or amending profile data?
- Can you describe how the Unify product is implemented?
- What are the technical challenges that you had to address while developing/launching this product?
- What is the workflow for a team who is adopting the Unify product?
- What are the other Segment products that need to be in use to take advantage of Unify?
- What are some of the most complex edge cases to address in identity resolution?
- How does reverse ETL factor into the enrichment process for profile data?
- What are some of the issues that you have to account for in synchronizing profiles across platforms/products?
- How do you mitigate the impact of "regression to the mean" for systems that don't support all of the attributes that you want to maintain in a profile record?
- What are some of the data modeling considerations that you have had to account for to support e.g. historical changes (e.g. slowly changing dimensions)?
- What are the most interesting, innovative, or unexpected ways that you have seen Segment Unify used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Segment Unify?
- When is Segment Unify the wrong choice?
- What do you have planned for the future of Segment Unify?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines. RudderStack's warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team. RudderStack also supports real-time use cases. You can implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you'll never have to worry about API changes again. Visit [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack) to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.
Hello, and welcome to the Data Engineering podcast, the show about modern data management.
[00:00:16] Unknown:
Legacy CDPs charge you a premium to keep your data in a black box. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost effective solution. Plus, it gives you more technical controls so you can fully unlock the power of your customer data. Visit dataengineeringpodcast.com/rudderstack today to take control of your customer data.
[00:00:37] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Kevin Niparko and Hanhan Wang about Segment's new Unify product for building and syncing comprehensive customer profiles across your data systems. So, Kevin, can you start by introducing yourself?
[00:00:50] Unknown:
Yeah. Absolutely. Tobias, thank you so much for having us on. I actually started as Segment's first data analyst back in 2015, so really helping our early team figure out our business model, inform our product direction, and go to market strategy through our own data. But one of the things that I learned really quickly is that being the lone data analyst at an analytics company means that I often was the first internal customer, essentially the dog eating the dog food every day: constantly querying our data, our data warehouse, consuming it in various tools. And so I think that vantage point gave me a good perspective: to really be able to answer some of the toughest questions that our business was facing, it required really high quality customer data to be able to understand our customer base, what they needed, and how we could better help them.
We'll hand it over to Hanhan to share a little bit more about herself.
[00:01:50] Unknown:
Hey. I'm Hanhan. So excited to be here today. So before Segment, I spent 4 years at Amazon as a PM on machine learning and AI products. I ran the Alexa smart lighting business, helping people turn on and off their lights from the comfort of their couch. And I also worked on a new-to-world, kinda controversial product called Tone, which analyzes how you're coming off to others based on the vocal biomarkers in your voice stream. And at Amazon, I really saw the power that good data can bring to a business. It's such a huge competitive advantage.
And as a PM in Amazon, I could just focus on building these cool new ML experiences. I just always assumed the underlying data was in the right place, in the right format. So from there, I came to Segment, and what's super motivating to me is that we're democratizing customer data access and these insights so that companies of any size, not just these big tech behemoths, can make these truly data driven decisions that they need to compete.
[00:02:59] Unknown:
And so going back to you, Kevin, do you remember how you first got started working in data?
[00:03:03] Unknown:
Yeah, absolutely. So I had done a startup earlier, and that was sort of the first foray. It was a mobile social network, helping bring together friends around common shared interests. And so there was a lot of customer data that we could have had at our fingertips. We ultimately failed to find product market fit and, you know, I reflect a lot on that experience as really not following the data that we had at our fingertips, which was giving us signal as to whether we were on the right path. And so it was an incredible learning from that perspective of how do you really listen to your customers, both qualitatively and quantitatively, to be able to drive product decisions that are going to lead to a better product at the end of the day. And, Hanhan, do you remember how you got started in data?
[00:03:53] Unknown:
I think data is just very much in the fabric of being a PM in Amazon. Every week. Right? You have to keep a pulse of what's going on. When we launch products, like, everything is measured. Everything is metricked. The dev teams focus a lot on it. It's part of the operational checklist. You know, it's part of the decision making. And so I don't know if there's a start. It's just kind of the environment of being at a place that is very data driven and metrics oriented.
[00:04:34] Unknown:
And so in terms of the Segment Unify project, which is what we're talking about today, before we get too far into that, can you just give a bit of an overview for folks who aren't familiar with what Segment is, and maybe a little bit about its role in the overall data ecosystem of a given organization?
[00:04:53] Unknown:
So, yeah, there's this really long and exciting history around customer data platforms that I'm sure we'll get into later in this conversation. But last month, we launched Segment Unify, which is providing consumer grade identity resolution. And we think this is the next big breakthrough around CDPs. An easy way to think about what a CDP is and where it fits into an organization is if you were running a business before the Internet, you'd get to know all of your customers personally, right? They'd come into your store, you'd develop this relationship with them. They'd tell you about their lives, the things that they were into, their favorite style, whether something seemed too cheap or too expensive, whether they like the pink pants or the green overalls. As businesses have moved online, that same interaction continues to happen, but now it's happening in the digital space. It's happening thousands of times per second across mobile apps and websites and CRMs and marketing automation and contact center systems, on and on. Right? There are so many different tools which are informing that customer's journey. And on the other side, there's this growing ecosystem of ways in which you can put that data to work, from advertising and marketing, to texting, on-site personalization, even now fine tuning AI large language models. And so the uses around customer data are large and continuing to grow. And one of the things that we've realized is that collecting this data for our customers is a hard engineering challenge, and it requires a lot of advanced infrastructure to get it right. It's not like this really exciting infrastructure that engineers show up excited to build. It's a lot of boring stuff. A lot of the plumbing behind the scenes to get everything to work well together. And so that's really where a CDP sits within an organization: a set of APIs and infrastructure to get data from wherever it's generated to wherever it needs to go.
[00:06:49] Unknown:
And so in terms of the Unify product, can you describe what that is and some of the story behind how it came to be and the role that it plays in the overall Segment product suite?
[00:07:00] Unknown:
Absolutely. So one of the things that we've learned along the way from our customers is it's not just enough to get raw data from one place to another. There are a lot of hard problems to make this raw data usable and actionable. So on top of that raw data, with Unify, we're providing identity resolution capabilities to turn the raw data into golden profiles. Golden profiles are essentially the most up to date trusted digital record of who your users are and where they are in their journey with your business. It's not necessarily a unique concept in the data world, but there are a few things that we're doing differently with Segment Unify.
The first is that these golden profiles are complete, which means it's the collective understanding around who your users are across different touch points. We've extended this to include data from the data warehouse with reverse ETL. So the insights that data science teams and data engineering teams are generating can now easily hydrate that profile. The second piece that's unique here is that these profiles are portable, meaning that they can sync across hundreds of different tools that can be powered by the centralized understanding. And then the third is that they're real time and always up to date, meaning that they represent the latest state of a user and where they are in their journey. The stat continues to blow my mind. We're resolving 250,000 data points to profiles every second and executing over 50,000 incredibly complex profile attribute computations within seconds of receiving new digital signals. So that gives you a flavor of the size and scale and real time performance of the system that we were building.
[00:08:45] Unknown:
And as far as the golden profile aspect, as you mentioned, Unify is serving the purpose of bringing together these different data sources and combining them into this one cohesive view of the customer through their profile. And in terms of that profile object, what are some of the categories of attributes that need to be managed, and maybe some of the different sources that those attributes and/or categories of attributes might come from in a given organization?
[00:09:14] Unknown:
Yeah. A simple way to think about this is who is the user and how are they interacting with your business? So who is your user? Think about these as traits and identifiers. Traits can break down into a few different categories. You can have raw traits like the user's name or what billing plan they're on. There's computed traits, so things like total lifetime orders or average order size. There's also predictive traits. These are inferences about what a user is likely to do next. Things like likelihood to purchase or churn. Then there are identifiers: how a user ID and an email and a device ID are all linked together in this representation of who a user is across touch points. And then there are events. So this is how is this user interacting with your business.
So think about a common e-commerce funnel: somebody is viewing a product, they've added it to the cart, and then they've ultimately checked out. All of these are events and digital interactions that are happening with your customers. There's also this need to append additional context to the profile. So things like what audiences and promotions does a user qualify for, or where they fall in various marketing journeys that are being executed by marketing teams. I think there's this big moment that many data teams have as they mature with their customer data infrastructure, which is that golden profiles aren't the static users table. It's really this dynamic and strategic asset that requires significant investment to get right.
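To make the categories above concrete, here is a minimal Python sketch of a profile object that holds identifiers, raw traits, and events, and derives the two computed traits mentioned in the conversation. The class names, field names, and the "Order Completed" event name are illustrative assumptions, not Segment's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Event:
    """One digital interaction, e.g. a product view or a checkout."""
    name: str
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Profile:
    """Hypothetical golden profile: identifiers, raw traits, and event history."""
    identifiers: dict[str, set[str]] = field(default_factory=dict)  # e.g. {"email": {...}}
    raw_traits: dict[str, Any] = field(default_factory=dict)        # e.g. {"plan": "pro"}
    events: list[Event] = field(default_factory=list)

    def computed_traits(self) -> dict[str, float]:
        """Derive traits from events rather than storing them directly."""
        totals = [
            e.properties.get("total", 0.0)
            for e in self.events
            if e.name == "Order Completed"  # assumed event name
        ]
        count = len(totals)
        return {
            "total_lifetime_orders": count,
            "average_order_size": sum(totals) / count if count else 0.0,
        }


profile = Profile(
    identifiers={"email": {"ada@example.com"}},
    raw_traits={"plan": "pro"},
    events=[
        Event("Product Viewed"),
        Event("Order Completed", {"total": 40.0}),
        Event("Order Completed", {"total": 60.0}),
    ],
)
print(profile.computed_traits())
```

Predictive traits and audience membership are left out of the sketch because they depend on models and campaign definitions that the conversation does not describe.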
[00:11:02] Unknown:
And in terms of that investment and the evolution of that profile object, what are some of the technical and organizational challenges that come about in terms of understanding what are the actual semantic definitions of some of these attributes to determine how they're computed or aggregated, what are some of the ways that mutation of that profile object can have downstream ramifications on the overall data suite of the organization, and some of those complexities that come about in the overall life cycle of a given customer profile and the way that it is defined in the bounds of a business?
[00:11:43] Unknown:
It's a great question. And I think there are sort of two hard problems that we sought to solve with Segment Unify. So the first was around this concept of identity resolution, which is how do you know who your users are across touch points? We can walk through an example here to give you a flavor of why this is such a hard problem. Let's say a customer purchases a product in store and they provide their email address at checkout. Later they call or text in to support to get help installing this product. So you have their phone number. And then once the product is installed, they sign up for the service and ultimately get a user ID. Now imagine the same interaction is happening in different sequences with different identifiers, happening potentially hundreds or thousands of times per second.
This is the hard problem of resolving identity at scale in real time. And there are particularly hard parts that we've uncovered in this journey. So things like anonymous to known, how do you take somebody who's top of funnel exploring on the website down to downstream purchase behavior, as well as shared identifier detection. So identifiers can have varying levels of uniqueness. Phone numbers can be shared among a family, devices can be shared among many individuals, right? Each identifier has a different level of uniqueness that needs to be baked into your identity resolution logic. So that's sort of big bucket number one around some of the challenges that businesses have in implementing identity resolution strategies.
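The store, support, and sign-up example can be sketched with a small union-find structure in Python. This is only an illustration of the general technique, not a description of how Segment resolves identity: the `IdentityGraph` class, the `blocked` set used to model shared identifiers, and the assumption that each interaction carries at least one identifier seen before are all hypothetical.

```python
class IdentityGraph:
    """Toy identity graph: identifiers seen together are merged into one profile."""

    def __init__(self, blocked=()):
        self.parent = {}
        # Identifier values known to be shared (family phone, kiosk device):
        # they are ignored when deciding whether to merge.
        self.blocked = set(blocked)

    def _find(self, node):
        self.parent.setdefault(node, node)
        while self.parent[node] != node:
            self.parent[node] = self.parent[self.parent[node]]  # path halving
            node = self.parent[node]
        return node

    def observe(self, identifiers):
        """Record one interaction, given as {identifier_type: value}."""
        nodes = [(k, v) for k, v in identifiers.items() if (k, v) not in self.blocked]
        for node in nodes:
            self._find(node)  # register the node
        for other in nodes[1:]:
            self.parent[self._find(other)] = self._find(nodes[0])

    def same_profile(self, a, b):
        return self._find(a) == self._find(b)


graph = IdentityGraph()
graph.observe({"email": "ada@example.com"})                         # in-store checkout
graph.observe({"phone": "555-0100", "email": "ada@example.com"})    # support contact
graph.observe({"user_id": "u_1", "phone": "555-0100"})              # service sign-up
print(graph.same_profile(("email", "ada@example.com"), ("user_id", "u_1")))

# With the phone number flagged as shared, two family members stay separate.
family = IdentityGraph(blocked={("phone", "555-0100")})
family.observe({"email": "ada@example.com", "phone": "555-0100"})
family.observe({"email": "bob@example.com", "phone": "555-0100"})
print(family.same_profile(("email", "ada@example.com"), ("email", "bob@example.com")))
```

A real system would also need per-identifier uniqueness limits and the anonymous-to-known handoff mentioned above; the blocklist here is the simplest stand-in for that logic.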
The second hard problem, which is more thematic across the industry and very top of mind for data engineers and data architects, is this concept of real time streaming architecture and how it relates to data at rest sitting in a data warehouse. You hear a lot about Customer 360 and Single Source of Truth or this one data store that's going to rule them all. And I think for most businesses, what we've realized is that this is ultimately a myth. The reality of customer data infrastructure is it's this tapestry of databases and Kafka streams and data lakes and SaaS tools and data silos, each of which holds a subset of who your users are and how you can better serve them. And so there's this common challenge that businesses face as they mature on their data infrastructure, which is bringing real time data streams alongside data at rest, sitting in a data warehouse. And so that's where our new reverse ETL capabilities come in that we've introduced in conjunction with Segment Unify, which is the ability to query data that's sitting in the data warehouse and bring it onto the golden profile without having to spin up a ton of advanced ETL and data infrastructure.
So our systems are operating in real time at around 2,500,000 events per second, but we're also providing data engineers and data scientists the ability to tap into the rich data that sits in their data warehouse and join it onto their golden profile that can be used across the stack.
[00:14:57] Unknown:
Yeah. I think when we started pitching this, we said each destination was a data island. So every destination that you send your profile to in order to activate your customer information, like Amplitude, Braze, Iterable, asks you to send kind of this full fire hose of raw data and then infers a profile. And each of these downstream engagement apps also needs a specific set of fields and IDs in order to drive the right ROI from their personalization features. So I think to the statuary, like, customers wanted await us. They're looking for ways to centralize that. They're looking for ways for their profiles to really be the full view and then to pick out parts of the profile and easily send exactly the fields that they need in order to drive the highest ROI in their engagement applications downstream.
[00:16:00] Unknown:
And in terms of the return on investment, what are some of the ways that this more comprehensive and cohesive profile object can impact the lifetime value of a given customer and some of the ways that you're able to think about measuring the investment or the impact of this more rich profile on the bottom line of the business?
[00:16:24] Unknown:
Yeah. Absolutely. So I think one of the biggest challenges that businesses face is there's data everywhere, but it is often unusable, not sitting in a usable format, and isn't really tapping into its full potential. And so Unify is really about getting all of this data from where it resides and bringing it together in one place. So one of the customers that's been relying heavily on Unify is CrossFit, a very active community of fitness folks. I think we've all had this experience of somebody who's gotten really into CrossFit and can't stop raving about it. They had this huge amount of data and insights from their engaged fitness community, but it was siloed and disconnected. And so they described Segment and Unify as giving them these data superpowers. So it gives their team this complete user profile and better personalized experiences. They're using it for virtual contests, providing local gym recommendations. So converting folks from top of funnel into joining their community and then generally helping their customers get more fit through data, which I think is a really exciting prospect.
MongoDB is another one. So, a modern B2B platform. They're using profile sync to get this better understanding of a really complex B2B journey, from multiple stakeholders exploring the product down to a complex implementation cycle. And so they use Segment Unify and profile sync as the foundation for their data strategy. They're joining in 181 additional tables in their data warehouse to really complete the profile and provide all of the dimensions of their accounts and user profiles. So it really gives you a sense for the type of scale and data coverage that's required to really understand these complex sales cycles and be able to deliver the right message at the right time for an account.
[00:18:27] Unknown:
And as far as the actual day to day business aspects of how different members of the team, whether in data or business operations or marketing or sales, etcetera, what is the impact on their lives of having this customer profile and some of the ways that they are able to interact with it, both in terms of pulling data into their systems and being able to simplify the availability of information for their own needs, but also being able to provide feedback or requests on ways to evolve the profile, additional attributes to include, corrections on, you know, attributes that are incorrect, things like that?
[00:19:07] Unknown:
I think, you know, one of the huge wins is just being able to bring in a lot of different sources and build a complete view. So one of our retail customers is using reverse ETL to send enriched data from their warehouse to marketing destinations. But this retailer, like, they've instrumented Segment for their website. You know, they can track ecommerce traffic pretty easily with Segment. They have been for a while. But now, with reverse ETL and profile sync, they wanna add offline retail traffic. So things like you go into the store and you buy a product and you check out, and then really marry it to this complete view of the customers. This was a super hard problem before because often these things come in different data stores. Offline processing is usually stored in a completely different upstream system. And there's no common link between the customer records you have there and the ones that you might be tracking through Segment. So with reverse ETL, our customers are able to, you know, bring in that offline traffic, use profile sync tables, and join it against the identifiers we're using in Segment, and then actually send in that offline traffic into Segment, where now they have this, like, complete view of the customer. They can customize email marketing campaigns and send emails to folks based on products they purchase both online and in the store.
So I think this has been a huge net add to our data teams, where it makes it really easy to bring these two very disparate sets together, and then also to marketers, where now their targeting is better because they know what folks are doing on their website and in the store. This really speaks to me. I think I do a lot of online shopping and would love to have, you know, a more personalized view across the board.
[00:21:15] Unknown:
And digging now into the implementation of the Unify product, I'm wondering if you can just start by giving an overview about the architecture and some of the adjustments that you had to make to the existing Segment technical platform to be able to integrate and provide this Unify feature?
[00:21:35] Unknown:
I think there are a couple of really interesting technical challenges the team went through. So for profile sync, really, it was around, you know, driving these real time profiles and then managing profiles at scale for customers. So our managed customers can easily have over 100 million profiles. Each of these profiles, in addition to having all of the traits, also has the full history of events of every single customer going through, and we're tracking all that. So the identity resolution system needs to be able to keep all these profile records up to date in real time. Whenever a new customer comes in, it needs to be able to match it, do the merge on the identity graph, and then, you know, do this within seconds of a person hitting a button on a website.
And then for profile sync, really, one of the big technical challenges was, like, how do we make this work out of the box? You know, we had a very alpha product where we were syncing profile tables just kind of as is to customers. And a lot of the feedback was like, hey, it's really hard to query these tables. It's very slow to work with my profiles if I have, like, 100 million profiles. So a lot of the innovation around profile sync is like, okay. Great. Well, how do we make these queries work really easily for folks? So it's really performant across your entire swath.
So we paired up with our best data engineer at Segment to design products for our customers' data engineers and thought very deeply on, like, what is the ideal table structure? Should we materialize these tables in Segment? Should we have customers do the materialization? How do we optimize both our materialization strategy as well as the tables themselves for fast performing queries across these giant amounts of profile data? And then how do we make that really easy for your data engineers? So we provide scripts for materialization with dbt, and we'll provide some scripts for materialization with other tools too.
[00:24:05] Unknown:
And as far as the implementation, what are some of the technical issues that you had to address while developing and launching this product?
[00:24:13] Unknown:
I think another maybe interesting story here is on the reverse ETL side. It was really important for us to protect our customers' data privacy. And so we thought a lot about, okay. Great. What is the best design for our customers that preserves privacy and is really performant and efficient too? So, you know, one of the technical considerations was like, hey, should we copy query results into an S3 bucket and then do the diffing kind of within that bucket, which we thought was maybe slow or inefficient or maybe expensive? Or is there a way for us to innovate here and do some in-warehouse incremental diffing?
Basically, it takes a checksum operation on the customer's data model in their warehouse and figures out, hey, these are the changes that we then need to sync downstream to Segment. So on the technical side, we really wanted to optimize on a less compute intensive, more space efficient approach here. And I think the other win is that we don't ingest data unnecessarily, which is a big one, I think, for our customers' data privacy side too.
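The checksum-based diffing described here can be illustrated with a short Python sketch. The conversation only says that a checksum over the data model is used to find changes, so everything below is an assumption for illustration: the per-row SHA-256 hash, the `id` key column, and running the comparison in Python rather than inside the warehouse.

```python
import hashlib
import json


def row_checksum(row: dict) -> str:
    """Stable hash of a row; keys are sorted so column order does not matter."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_rows(previous: dict, rows: list, key: str = "id"):
    """Compare this run's rows with the checksums stored from the last run.

    `previous` maps primary key -> checksum. Returns the new checksum state
    and the keys that were added, changed, or removed since the last sync.
    """
    current = {str(r[key]): row_checksum(r) for r in rows}
    added = sorted(k for k in current if k not in previous)
    changed = sorted(k for k in current if k in previous and previous[k] != current[k])
    removed = sorted(k for k in previous if k not in current)
    return current, {"added": added, "changed": changed, "removed": removed}


# First sync: everything is new. Second sync: only the differences are reported.
state, first = diff_rows({}, [{"id": 1, "plan": "free"}, {"id": 2, "plan": "pro"}])
state, second = diff_rows(state, [{"id": 1, "plan": "pro"}, {"id": 3, "plan": "free"}])
print(first)
print(second)
```

Only the checksums need to be retained between runs, which matches the stated goal of not ingesting or copying row data unnecessarily.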
[00:25:29] Unknown:
And as far as the adoption path for somebody, what are the steps involved in being able to onboard onto the Segment Unify product? Do they already need to be a Segment customer to take advantage of it? And within the overall Segment product suite, what are some of the hard dependencies to be able to implement Segment Unify within an organization?
[00:25:49] Unknown:
Yeah. If you are already a Segment Connections or Profiles customer, this is super easy. Right? So if you're already using Segment Connections, reverse ETL is embedded into the Connections part of the Segment app. So you can just find it, set up a source and destination, try it out. You can probably get started in about 15 minutes to send reverse ETL data out to your first destination. If you are using Segment for identity resolution already, then setting up profile sync is also fairly simple. Go in the Segment app (we also have an API to let you do this programmatically), put in your warehouse credentials, and we'll start syncing profile sync data hourly to your warehouse.
And then from there, once you have the tables, we do a backfill of your historical records. We make sure you have your complete history, all the events and traits over time. Usually, that takes, like, days, maybe a couple weeks, I think, is our official SLA. And I don't know if I was supposed to share that. It's okay. And then from there, you'll have all these tables now landed in your warehouse. We also offer a dbt materialization script. So if you have dbt, you can just run the script. It'll start materializing profile traits tables and complete, you know, customer records. So we'll create a new table or a materialized view of your customer in your warehouse after running that. And then you have the data, and you can start, you know, enriching it, joining it, playing around with it directly with your warehouse tools.
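As a rough picture of what "materializing" a traits table means, the Python sketch below folds an append-only log of trait updates into one row per profile, keeping the latest value of each trait. The record layout (`profile_id`, `timestamp`, `traits`) is a hypothetical stand-in; the actual profile sync tables and the dbt script are not described in the conversation, and in practice this step would run as SQL in the warehouse.

```python
def materialize_traits(updates):
    """Fold an append-only log of trait updates into one row per profile.

    Each update is assumed to look like:
        {"profile_id": "p1", "timestamp": "2023-05-01T00:00:00Z", "traits": {...}}
    Timestamps are assumed to be ISO-8601 strings in one consistent format,
    so sorting them as strings gives chronological order.
    """
    table = {}
    for update in sorted(updates, key=lambda u: u["timestamp"]):
        # Later updates overwrite earlier values of the same trait.
        table.setdefault(update["profile_id"], {}).update(update["traits"])
    return table


updates = [
    {"profile_id": "p1", "timestamp": "2023-05-02T00:00:00Z", "traits": {"plan": "pro"}},
    {"profile_id": "p1", "timestamp": "2023-05-01T00:00:00Z", "traits": {"plan": "free", "name": "Ada"}},
    {"profile_id": "p2", "timestamp": "2023-05-01T12:00:00Z", "traits": {"plan": "free"}},
]
print(materialize_traits(updates))
```

The materialized result is what makes queries over very large profile sets fast, since readers no longer have to replay the full update history.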
[00:27:40] Unknown:
And talking through the reverse ETL aspects of it, what is the workflow for being able to move from implementing Unify, building out these profile objects, to then propagating those objects into some of the, I guess, I don't know whether to call it downstream or upstream tools since it's a bit cyclical, but some of the other systems that the business uses to be able to track these different customers through their life cycle?
[00:28:07] Unknown:
I think that's a really good way of describing it, as cyclical. Right? It's not one direction. Data is now flowing bidirectionally from data warehouses into CDP, CDP into data warehouses, from data sources and destinations back into CDP and vice versa. And so it really does create this virtuous cycle of data building on itself, these golden profiles continuing to get richer and richer. I think your question is a really good one, which is how do you think about profile sync across the stack? So how do you think about one digital representation that needs to now be ported into potentially tens, hundreds of different APIs, different tools with different data models? Each one has been designed independent of one another.
And I think this is really the foundation of what we've built with CDP: the ability to really deeply understand what is the essence of how this tool defines a customer, and then how do you appropriately map data from this golden record into that tool. And just to give you a sense for the ways in which we can do this now, we can sync a golden profile via an audience in batch. We can sync a profile via a patch change stream, so constant updates to the profile. We can sync a profile as an event. We can also run a little bit of arbitrary JavaScript that our customers can input with functions. And so there's really this growing set of ways in which golden profiles can be synced across the stack. It is a hard problem, and that's very much why we've taken this approach of smart defaults, but with extensibility baked in, as our core philosophy.
We wanna be able to provide the best understanding of how golden profiles should map across your stack, but to the extent that you wanna customize or extend that, we give you fine-grained controls to do so.
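The "smart defaults, but extensibility baked in" idea can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not Segment's implementation: the destination names (`crm_tool`, `email_tool`), the field names, and the `map_profile` helper are all hypothetical.

```python
from typing import Any, Callable, Optional

Profile = dict[str, Any]

# Hypothetical golden profile; the field names are illustrative only.
GOLDEN_PROFILE: Profile = {
    "user_id": "u_123",
    "email": "ada@example.com",
    "first_name": "Ada",
    "lifetime_value": 420.0,
}

# "Smart defaults": golden-profile field -> destination field, per made-up tool.
DEFAULT_MAPPINGS: dict[str, dict[str, str]] = {
    "crm_tool": {"user_id": "ExternalId", "email": "Email", "first_name": "FirstName"},
    "email_tool": {"email": "email_address", "first_name": "FNAME"},
}


def map_profile(
    profile: Profile,
    destination: str,
    overrides: Optional[dict[str, str]] = None,
    transform: Optional[Callable[[Profile], Profile]] = None,
) -> Profile:
    """Build a destination payload from the default mapping plus user overrides."""
    # Overrides win over defaults, so a customer can remap or add fields.
    mapping = {**DEFAULT_MAPPINGS[destination], **(overrides or {})}
    payload = {dest: profile[src] for src, dest in mapping.items() if src in profile}
    # Optional arbitrary code hook, standing in for a custom function step.
    return transform(payload) if transform else payload
```

Calling `map_profile(GOLDEN_PROFILE, "email_tool")` uses only the defaults, while passing `overrides={"lifetime_value": "LTV"}` adds a field that the default mapping does not carry.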
[00:30:07] Unknown:
And then for the identity resolution element of this problem, where you do have information coming from multiple sources and you're trying to aggregate it into this golden profile, what are some of the challenges that you face in being able to accurately merge together different data sources into this one entity, as well as some of the toggles that are available for businesses that want to manage the level of confidence that is required before performing that merge operation, or being able to include a human in the loop in terms of reviewing: we want to merge these two things together, does this look right? And being able to feed that back into the ongoing operation of the Unify platform.
[00:30:55] Unknown:
Yeah. Absolutely. And I think one of the fundamental insights that we've had is that every customer, every business really has their own unique identity graph that is relevant for their business. It's dependent on how they've implemented tens of different systems, from their CRM to their tracking code on their website or in their mobile apps. And so there really is the need for, one, configurability and flexibility, but also the ability to understand how these different identifiers are coming together to develop these golden profiles. And so we provide a set of configurations where customers can define their identity resolution logic.
This has been trained on billions and billions of events. And so we are able to detect when issues may be arising, surface those up to customers for review, and provide observability as part of our platform to really give our customers confidence and understanding into how their identity graph is working, how it's evolving, and ways in which they may want to adapt or adjust their identity resolution logic given the signals that we're seeing.
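As a rough illustration of configurable identity resolution, here is a minimal Python sketch that links identifiers seen together on the same event using a union-find structure. It is an assumption-laden toy, not how Segment resolves identities: the `IdentityGraph` class, the `linkable_types` and `blocked_values` settings, and the identifier names are all hypothetical.

```python
from collections import defaultdict

Identifier = tuple[str, str]  # (kind, value), e.g. ("email", "ada@example.com")


class IdentityGraph:
    """Toy identity graph: identifiers observed together merge into one profile."""

    def __init__(self, linkable_types, blocked_values=frozenset()):
        self.linkable_types = set(linkable_types)   # identifier kinds allowed to link
        self.blocked_values = set(blocked_values)   # shared or junk values that never link
        self.parent: dict[Identifier, Identifier] = {}

    def _find(self, node: Identifier) -> Identifier:
        self.parent.setdefault(node, node)
        while self.parent[node] != node:
            self.parent[node] = self.parent[self.parent[node]]  # path halving
            node = self.parent[node]
        return node

    def _union(self, a: Identifier, b: Identifier) -> None:
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def observe(self, identifiers: dict[str, str]) -> None:
        """Record the identifiers carried by a single event."""
        usable = [
            (kind, value)
            for kind, value in identifiers.items()
            if kind in self.linkable_types and value not in self.blocked_values
        ]
        for node in usable:
            self._find(node)  # register the identifier even if it links to nothing
        for node in usable[1:]:
            self._union(usable[0], node)

    def profiles(self) -> list[set[Identifier]]:
        """Return each resolved profile as a set of identifiers."""
        groups: dict[Identifier, set[Identifier]] = defaultdict(set)
        for node in list(self.parent):
            groups[self._find(node)].add(node)
        return list(groups.values())
```

Blocking a value such as a shared test email keeps unrelated anonymous visitors from collapsing into one profile, which is the kind of issue an observability view over the graph would help surface.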
[00:32:08] Unknown:
I think the other interesting use case we've seen is for profile observability into the way that the Segment identity resolution process works, which we love because it's boosting customer trust in our system. It's helping them with the understanding of how to join the data together. And so we're no longer just this, like, black box where identity resolution is happening and you get some outputs out. It's really bringing customers a lot more observability into what's happening in that box and giving them opportunities now to even change things, change the identity resolution, change the rules that they have in Segment, and do adjustments even downstream too.
[00:33:08] Unknown:
And for the entity resolution aspect of it, it's one of the perennial problems in computer science, but in data in particular it's understanding which combinations are valid. But another interesting angle to this, particularly in analytical contexts, is the semantic elements of what to merge and how, and whether there are additional computations or derivations or enrichments that need to happen in that merge path. And I'm curious how you've addressed that in the Unify product to be able to say, okay, these are the same entities and these are the attributes, but in the actual representation, we want this attribute to be rendered differently, whether it's, you know, formatting the address or understanding which address is more up to date and accurate, or particularly for things like purchases.
You know, what does it mean for a purchase to actually be completed? Like, do you have to have it in a holding, you know, a holding stage for a little while to determine whether or not they issue a return, etcetera, and some of those business rules around the entity resolution and attribute merging process?
[00:34:18] Unknown:
Yeah. Absolutely. It's a great question, and I think one of the hard challenges that many businesses face as they try to build their own identity resolution logic and profile systems. I think one of the things that we benefit from here, just in terms of the overall approach that we've taken, is, one, a very clearly defined spec around data inputs. And so we really do have a well-defined scope for who a user is and what they are doing in relation to your business. This includes those specific raw trait fields, things like address and phone number, which can be structured appropriately.
The other thing, which I think really hits at the heart of your question, is that defining a semantic layer in the abstract is very hard, but doing it in relation to a specific use case and a specific tool provides clarity. And so by connecting golden profiles with a particular tool, we are then able to infer and define the profile as required for that specific use case. And so it really is about both the inputs, the resolution logic itself, as well as the end, terminal use case for the profile that allows us to understand what is the right representation of this profile, how should that be mapped into that end tool, and how is it ultimately going to unlock business value that a marketer or a support agent needs to really be able to deliver on that customer experience that they're looking to deliver on.
[00:36:03] Unknown:
And then as far as the semantic attributes and the business rules to compute and derive them, as you are pushing these profiles back out into the other systems, so things like HubSpot, Mailchimp, what have you, Salesforce, what are some of the challenges that you have to address in terms of understanding which attributes to overwrite versus which attributes to append to, etcetera? And, also, because not every platform is going to have all of the same fields, how do you address some of the challenges of regression to the mean, where everything has to just have a baseline set of attributes? And if you want to get more sophisticated, then you start to get diminishing returns because, oh, well, I created this new attribute on this user, but I can't push that into HubSpot. I can only push it into Salesforce, and things like that.
[00:36:53] Unknown:
Yeah. Absolutely. And this is a really hard challenge that businesses face. So there are over 10,000 tools in the most recent martech landscape. Right? That's 10,000 different APIs, data models, tools which have been defined and designed independent of one another. I think there are sort of two benefits and tailwinds here that we get to benefit from. The first is that given our size and scale, many of these tools and APIs and data models have looked to us and our spec for inspiration. So we actually provide our product as infrastructure for many of these martech tools, and they leverage us and look to us to take some of the load off their customers for event routing and now even reverse ETL capability. So we get to benefit from this because there's this growing catalog of tools that are essentially leveraging our data and our spec. And they get to benefit from it because they don't have to recreate the wheel, define all of this data infrastructure, and have customers go through the implementation cycle specifically for their tool. So that's number one, around some of the ecosystem dynamics that I think really help reduce the complexity of this problem.
And then the second is really being focused on extensibility and portability of profiles. So I think this is unique relative to some of the other data tools or suites, which take this sort of data hoarding approach, the walled garden approach, trying to keep things largely within their ecosystem and playing nicely there, but not really well when you try to use a tool outside of that ecosystem. So we provide a ton of flexibility into how these mapping layers occur between this golden profile and these tools. We have this set of capabilities called destination actions, which is essentially a layer of configuration which allows a non-technical person, low code, no code, to actually get in there, deeply understand how data is being mapped from the golden profile into a particular tool, and to be able to configure that exactly as they expect.
And so that level of transparency and observability and configuration is very much at the heart of our philosophy, which is that it's not going to be a one-size-fits-all solution.
Oftentimes these tools are being implemented specifically for a team and a use case. And so we wanna provide the ability to adapt this golden profile, the centralized understanding of who a user is across an organization and apply it to a specific tool and a specific use case to really be able to unlock that business value.
[00:39:37] Unknown:
Another aspect of the challenge around customer profiles, and particularly managing it across different tenants of the data ecosystem, is the modeling aspect of it where, you know, particularly when you're talking about things like master data management and golden records, there are things like dimensional modeling to be considered. You know, how do I break this down into the different sets of tables, or do I just have one wide table with all of the attributes? What are the different data types that I should use? Do I want an array field, a JSON field? Should it all just be, you know, basic data types? And I'm curious how you thought about those data modeling challenges, particularly as you want to support some of the historical attributes of a customer as they, you know, engage with the business over time, where maybe their address changes, but you still wanna be able to see what their address was because you shipped a product to them at their old address, and things like that?
[00:40:42] Unknown:
I think on reverse ETL, like, we've gotten a lot of customer feedback that, you know, folks are doing this data modeling, this life cycle development, in their warehouse. So they really want better support for the kind of data modeling their warehouse offers, so things like object and array support, mapping JSON fields, arbitrary JSON fields that have, you know, maybe a full set of locations listed for a customer, or addresses where they've, like, previously lived. So that side. And then ways to better hook into that data modeling development life cycle. So, supporting test and prod environments for mapping data downstream, version control, you know, testing out the different models and what gets synced.
So, you know, I think the reality is, like, folks are not going to be doing a lot of this data modeling within Segment. They wanna manage this in their warehouse. A lot of them are using Airflow for orchestration and dbt for transformation already, and our tools need to flexibly handle all the data models that the warehouse has and then be able to, like, very easily integrate with the ways that customers are using those tools already. That's, like, the best, smoothest customer experience for our data engineers.
[00:42:01] Unknown:
And now that the Unify product is out, you have made it generally available, people are using it, what are some of the most interesting or innovative or unexpected ways that you're seeing it applied?
[00:42:12] Unknown:
I think one of our closed beta customers was a midsize ecommerce retailer, and they got their hands on profile sync. And within, like, a day or two, they had learned dbt, downloaded it, gotten it running on their site. They built an attribution analysis, and they were starting to play with a user recommendation engine. So they built this use case over probably days: if a customer hasn't purchased items in a year, they can now identify who those customers are. They never had that visibility prior to profile sync. And then they can recommend maybe, like, the five top items for those customers to come back into their store or ecommerce store to buy.
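The win-back use case described here can be approximated in a few lines. The sketch below is purely illustrative and assumes a made-up list of purchase rows; it is not the retailer's actual analysis, and `PURCHASES`, `lapsed_customers`, and `top_items` are hypothetical names.

```python
from collections import Counter
from datetime import date, timedelta

# Made-up purchase rows: (customer_id, item, purchase_date).
PURCHASES = [
    ("c1", "mug", date(2022, 1, 10)),
    ("c1", "tee", date(2022, 3, 5)),
    ("c2", "mug", date(2023, 5, 1)),
    ("c2", "tee", date(2023, 4, 1)),
    ("c3", "hat", date(2021, 12, 1)),
    ("c3", "mug", date(2021, 11, 1)),
]


def lapsed_customers(purchases, today, window_days=365):
    """Customers whose most recent purchase is older than the window."""
    last_purchase = {}
    for customer_id, _item, purchased_on in purchases:
        if customer_id not in last_purchase or purchased_on > last_purchase[customer_id]:
            last_purchase[customer_id] = purchased_on
    cutoff = today - timedelta(days=window_days)
    return sorted(c for c, last in last_purchase.items() if last < cutoff)


def top_items(purchases, n=5):
    """The n most frequently purchased items across all customers."""
    counts = Counter(item for _customer, item, _date in purchases)
    return [item for item, _count in counts.most_common(n)]
```

In a warehouse-first setup the same logic would more likely live in a SQL or dbt model over the synced profile tables, with the resulting audience sent downstream through reverse ETL.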
And this was a small data team of a single data engineer, and they were able to unlock this use case within days. And I just really love this story because Unify is all about empowering small data teams, giving them the tools that they need to go toe to toe with these big tech employers who have hundreds of data engineers, and making their lives and their jobs a lot easier.
[00:43:26] Unknown:
And in the process of building this Unify feature and product, what are some of the most interesting or unexpected or challenging lessons that you've each learned in the process?
[00:43:38] Unknown:
I have a couple insights. These are maybe a little bit more hot takes from talking to our customers lately. So as tech budgets are tightening and our customers are focusing a lot more on efficiency, I'm predicting we're gonna see a consolidation of modern data stack tools. I mean, I think that might be inevitable. There's literally thousands of tools out there to manage your modern data stack, and VCs pumped in a lot of funding during the hot tech days to really get these tools to a really, like, exciting and large ecosystem of data tools. But I think as folks, you know, focus on efficiency, they're gonna start really asking, like, do we really need, you know, two tools that are kind of overlapping, doing the same thing?
We're already kinda seeing that. You know, we saw that recently. In one of our Unify deals, a customer didn't have a marketing team. They love Segment identity resolution, but they're like, hey, you know, I wanna use Unify and just do identity resolution in Segment. I don't have a marketing team, so I don't really wanna pay for the marketing products downstream. I just wanna pay for Unify, and I'm gonna take that extra money and, you know, boost up the amount of our connections event volume and, really, you know, allow us to expand to a new geo because we can bring in customers now with that extra budget.
And so I think we're seeing a bit of that consolidation, and customers really inspecting both the number of tools and how they're being used currently in our customer base. And then as part of that consolidation, you know, we believe that reverse ETL is part of the CDP just fundamentally, because reverse ETL is about activating our customer profiles and bringing that data downstream. But it's also a huge win from the customer side because they can avoid the hassle of adding another vendor just for reverse ETL. If they're using Segment connections and profiles, it's really easy to integrate and just add reverse ETL and try it out within Segment, and it completes your accessibility activation story for the CDP.
And then I think the other key piece of this is that the customer experience is gonna start mattering a lot more. So as we focus on efficiency, you know, the customers don't wanna spend a ton of money, like, buying data tools and then standing up their own governance or observability tools to string these tools together and make them ready for prime time. They want the tools to work. They wanna get started easily, and they, you know, are just gonna want this, like, smooth CX in between the tools so that they work great together.
[00:46:46] Unknown:
Really well said. At the end of the day, what matters is the experiences that customers have with your product and your business. And I think we can often get caught up in, you know, data architecture and data engineering problems, but at the end of the day, this is really servicing the business and the customer journey. And so the fastest path to provide that great experience is often the way in which you're gonna learn the most and you're going to be able to deliver on customer expectations. So I think that's been one of the things that we've really heard from our customers: time to value is really important right now in this moment as everybody's facing challenges to their business. And so finding that fastest path is key.
[00:47:28] Unknown:
And for people who are exploring the challenge and the available options for managing their customer profiles, what are the cases where Segment Unify is the wrong choice?
[00:47:39] Unknown:
Well, I think if you don't have a data warehouse, you know, you won't be able to take advantage of these tools. It might be time to consider getting a data warehouse and adding it to your data stack. And then I think the second is, you know, we see reverse ETL as a fundamental part of the CDP. Our reverse ETL solutions work great with event streaming, with our real time customer profiles. If you're just looking for a point solution for reverse ETL to get data in and out of different warehouse tables, then Segment Unify probably isn't the right choice for you here. It'll get the job done to, like, move data from point a to point b, but I think you won't get the full value without the CDP side.
[00:48:28] Unknown:
And as you continue to iterate on and improve the Unify product, what are some of the things you have planned for the near to medium term or any particular projects or problem areas you're excited to dig into?
[00:48:39] Unknown:
Really excited about our entities project this year. So entities is all around expanding the world of profiles, going beyond, so bringing in all the business objects surrounding the customer profile. So things like accounts, your households, your subscriptions, even your pets, and bringing that together into a full view, and then really allowing customers to easily marry that rich stateful representation with the real time profiles that we already have in Segment, and bringing it all together so that we could really drive these amazing, dreamy personalization campaigns that folks are really chasing. What are you most excited about, Kevin?
[00:49:31] Unknown:
That's one of them for sure. And then I think there's also obviously a lot going on in the world of AI and large language models. And so really thinking about how do you bring context and fine-tune those with relevant data within your business? I think that's going to be absolutely paramount. And so, you know, I think that's something that's top of mind for many businesses and customers today and something that we're investigating.
[00:49:59] Unknown:
Are there any other aspects of the Segment Unify product and the overall space of golden profiles and entity resolution and reverse ETL that it enables that we didn't discuss yet that you'd like to cover before we close out the show?
[00:50:19] Unknown:
I think one of the things that is just continuing to blow my mind is the pace of adoption here that we're seeing among our customer base, the excitement around these products and features. And so, you know, I really do think that there is power at the intersection of real time data and a lot of the investments that businesses have made over the years in their data warehouse strategy. And so bringing those two things together feels like we've really hit a chord and is really unlocking a ton of potential for businesses that has otherwise been latent.
[00:50:44] Unknown:
Alright. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:51:04] Unknown:
Yeah. Happy to lead off. So, as I mentioned, AI and large language models are obviously very top of mind for folks. I think if you've played around with them, one of the things that you realize is that the experiences can be really powerful, but they are also largely generic. The context is lacking. And so I think one of the things that's top of mind is how do you fine-tune these with customer context to get the most out of these large language models and copilots. And so that feels like a gap. I know a lot of different folks are exploring that today, but it's an unsolved problem that is really emerging as these AI and large language models grow in popularity.
[00:51:48] Unknown:
I'm excited to see where we're gonna go with the semantic layer in the modern data stack. So the semantic layer is about defining metrics, and it's kind of that missing link between the raw data and business meaning. Think about it as, like, kind of the Rosetta Stone for your business metrics. Metrics definitions to date are locked up in analytics tools, spread out across all of your engagement applications. They're not shareable. So it's very hard to build these, like, dreamy cross-org engagement apps and experiences easily.
Like, you know, if you are a support agent and you're getting a lot of returns from a customer, entering them automatically into a special, you know, more hand-holding support experience or marketing experience so that you can kinda reduce the number of returns they're sending. I might be someone that falls into that bucket. And so I think building, you know, one metrics layer that is both very flexible for however, you know, you wanna use them downstream, but has some shared artifacts so that your data engineers don't need to, like, recreate the same definition of a customer over and over again is really exciting. I think there's still a lot of opportunity in this space, and we're starting to see some early innovations with, like, dbt and Transform, but I think we're still searching for the right approach here within the modern data stack.
[00:53:19] Unknown:
Alright. Well, thank you both for taking the time today to join me and share the work that you've done on the Segment Unify product. It's definitely great to see that released and available for folks who are able to simplify the process of managing their customer profiles and enriching them and bringing them everywhere that they need them. So I appreciate all of the time and energy that you've put into that, and I hope you enjoy the rest of your day.
[00:53:52] Unknown:
Thanks so much for having us on. Thank you. This was so fun.
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Kevin Niparko's Background
Hanhan Wang's Background
Kevin's Journey into Data
Hanhan's Journey into Data
Overview of Segment and CDPs
Introduction to Segment Unify
Attributes and Data Sources in Segment Unify
Technical and Organizational Challenges
Customer Use Cases and ROI
Impact on Business Operations
Technical Implementation of Segment Unify
Technical Issues and Solutions
Adoption Path and Dependencies
Workflow and Data Integration
Identity Resolution Challenges
Entity Resolution and Business Rules
Challenges in Attribute Management
Data Modeling Challenges
Interesting Use Cases
Lessons Learned
When Segment Unify is Not the Right Choice
Future Plans and Projects
Final Thoughts and Closing Remarks