Summary
Collecting, integrating, and activating data are all challenging activities. When that data pertains to your customers, it can become even more complex. To simplify the work of managing the full flow of your customer data and keep you in full control, the team at Rudderstack created their eponymous open source platform that allows you to work with first and third party data, as well as build and manage reverse ETL workflows. In this episode CEO and founder Soumyadeb Mitra explains how Rudderstack compares to the various other tools and platforms that share some overlap, how to set it up for your own data needs, and how it is architected to scale to meet demand.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
- The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
- Your host is Tobias Macey and today I’m interviewing Soumyadeb Mitra about his experience as the founder of Rudderstack and its role in your data platform
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Rudderstack is and the story behind it?
- What are the main use cases that Rudderstack is designed to support?
- Who are the target users of Rudderstack?
- How does the availability of the managed cloud service change the user profiles that you can target?
- How do these user profiles influence your focus and prioritization of features and user experience?
- How would you characterize the position of Rudderstack in the current data ecosystem?
- What other tools/systems might you replace with Rudderstack?
- How do you think about the application of Rudderstack compared to tools for data integration (e.g. Singer, Stitch, Fivetran) and reverse ETL (e.g. Grouparoo, Hightouch, Census)?
- Can you describe how the Rudderstack platform is designed and implemented?
- How have the goals/design/use cases of Rudderstack changed or evolved since you first started working on it?
- What are the different extension points available for engineers to extend and customize Rudderstack?
- Working with customer data is a core capability in Rudderstack. How do you manage the identity resolution of users as they transition back and forth between anonymous and identified?
- What are some of the data privacy primitives that you include to assist with data security/regulatory concerns?
- What is the process of getting started with Rudderstack as a software or data platform engineer?
- What are some of the operational challenges related to running your own deployment of Rudderstack?
- What are some of the overlooked/underemphasized capabilities of Rudderstack?
- How have you approached the governance model/boundaries between OSS and commercial for Rudderstack?
- What are the most interesting, innovative, or unexpected ways that you have seen Rudderstack used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Rudderstack?
- When is Rudderstack the wrong choice?
- What do you have planned for the future of Rudderstack?
Contact Info
- @soumyadeb_mitra on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy analytics in the cloud. Their comprehensive data level security, auditing, and de identification features eliminate the need for time consuming manual processes. And their focus on data and compliance team collaboration empowers you to deliver quick and valuable analytics on the most sensitive data to unlock the full potential of your cloud data platforms.
Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta. That's I-M-M-U-T-A. The only thing worse than having bad data is not knowing that you have it. With Bigeye's data observability platform, if there is an issue with your data or data pipelines, you'll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user friendly interface, and automated yet flexible alerting, you've got everything you need to establish and maintain trust in your data.
Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses. Your host is Tobias Macey, and today I'm interviewing Soumyadeb Mitra about his experience as the founder of RudderStack and its role in your data platform. So, Soumyadeb, can you start by introducing yourself? Hey. I'm Soumyadeb. I am the founder and CEO of RudderStack.
[00:02:34] Unknown:
So thanks for having me here. I'm really excited to be talking to you. I've been doing RudderStack for the last 2 and a half years. I started in 2019. And before that, I spent a year in a company called 8x8. Prior to that, I started a company in the marketing automation space, building the next gen marketing automation tool using ML and AI. And all through that, I think the pain point that I faced was, like, marketing teams talk about doing a lot of cool stuff, like, around personalization and so on, but they do not have the data to do that. And that's what we are trying to solve with RudderStack. Yeah. That's kind of my background. And do you remember how you got involved in the area of data? So, yeah, it goes back all the way to my grad school when I was doing a PhD. My adviser used to do a lot of high performance computing. So, like, the research group worked on, like, these rocket simulations. And my adviser worked on, like, all that IO. I mean, how do you handle, like, this huge amount of, like, simulation data, saving that, loading that? So I kind of joke, I worked on big data before big data was cool. Did a lot of ML at that point too. Again, this was, like, pre-deep-learning, like, traditional ML on high volumes of data. So that's what was kind of my start with data. So you mentioned that RudderStack started with sort of frustration
[00:03:47] Unknown:
with letting marketers be able to collect and take advantage of the data having to do with the interactions that users were having with the products. I'm wondering if you can just give a bit more of an overview about what it is that you've built there and some more of the story behind how it came to be and why you decided that this was the particular problem domain that you wanted to spend your time and energy on right now?
[00:04:10] Unknown:
So I'll start with the problem that we were trying to solve at my previous company MarianaIQ. That will give you the context of, like, what are the problems people are facing, that we faced as a company, but also that our customers faced. Right? So, like, let's say you're any consumer brand, pick anything, Crate and Barrel, let's say. And you want to know what your customers are doing on the website, what products they're clicking, what purchases they are making. And why do you want to collect all that data? The number 1 use case is, like, analytics. You just want to understand how people are using the product, where they are dropping off in the funnel, what products are being searched more, and so on. So analytics is generally where you all start, and you need to collect this activity data to do that. Right? Then beyond that comes, like, marketing. Right? I mean, you want to, like, do personalized marketing on top of that data. Right? So you want to, like, send coupons based on what they did. Like, let's say, they dropped off at the checkout page, you wanna send them some kind of a coupon. Right? Beyond that come, like, more ML and AI use cases. Right? You want to, let's say, predict which customers will convert or not. And based on that, you want to give them some coupons. Or you want to predict which subscribers are going to churn, and you want to give something to them to prevent that churn. And so there are these, like, ML use cases beyond that. And finally, there are, like, more of the real time personalization use cases. Like, you want to, like, build a model for users and then, like, personalize their web experience based on what they have searched in the past and, like, what they like, what they don't like, and so on. So those are the follow-up use cases that a team, whether it's a marketing team or the data team, wants to build. And, like, it all starts with a platform to collect this data, process this data, and so on. Right?
Now this was really hard for various reasons. You have to build the infrastructure. You have to, like, set up something like a Hadoop cluster, let's say, in the pre data warehouse era. Right? To collect this data, you have to set up Spark to process that data. Like, it is not, like, trivial. Now some of those things have become easy. Warehouses have really made it easy to, like, collect and store and process that data. Some parts are still hard, right, around data collection, activation, and so on. And that's what we're trying to solve with RudderStack. I mean, that's kind of a very, very high level overview.
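To make the "collect this activity data" step concrete: customer data pipelines in this space typically move Segment-style JSON messages carrying an `anonymousId` or `userId`, an event name, and free-form properties. The following is a minimal sketch of such a payload builder, not RudderStack's actual SDK; the field set follows the widely used Segment event spec, and the function name is my own.

```python
import uuid
from datetime import datetime, timezone

def make_track_event(event, properties, anonymous_id=None, user_id=None):
    """Build a Segment-style `track` payload (hypothetical helper).

    Pre-login traffic carries only anonymous_id (the cookie ID); after
    login the same events carry user_id, which is what later lets the
    warehouse stitch the two together.
    """
    if anonymous_id is None and user_id is None:
        raise ValueError("need anonymousId or userId")
    return {
        "type": "track",
        "messageId": str(uuid.uuid4()),  # unique key for de-duplication
        "anonymousId": anonymous_id,
        "userId": user_id,
        "event": event,
        "properties": properties,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# An anonymous visitor adds a product to their cart:
evt = make_track_event(
    "Product Added",
    {"sku": "SKU-123", "price": 49.99},
    anonymous_id="anon-abc",
)
```

The same shape is reused for `identify` and `page` calls in Segment-compatible systems, which is why a single collection pipeline can fan events out to many destinations.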
[00:06:08] Unknown:
Particularly for the time frame of, you know, 5, 6, 7 years ago, when people were thinking about being able to collect all of this information and route it to all the different destinations where they wanted to take advantage of that information, the kind of big name in the space was Segment. And I know that at least in some of the marketing, RudderStack has kind of positioned itself as an alternative to Segment for different use cases. And I'm just wondering if you can give your thoughts now, a couple of years into your work at RudderStack, on how you think about the position of your tool and your platform in the rapidly expanding ecosystem of data products and data tools, both open source and commercial, where, you know, when you first launched, it was as a kind of direct reflection of what Segment was doing, and you were adding some additional capabilities there and being the open source alternative. And now we've added in things like Stitch and Fivetran for data collection. We've added things like Census and Hightouch and Grouparoo for the reverse ETL or data activation part. And just curious sort of how you think about your position in that space.
[00:07:21] Unknown:
That's a great question and goes back to, like, what I was saying. We are trying to solve the customer data use case. Right? As a company, like, a consumer company or a B2B company, what are you trying to do with customer data? Like, you start with, like, the very simple use case where you want to just collect the data and send it to multiple destinations, like, when you are a small company. And then, like, at some point, you mature, you hire your 1st data analyst, and you want to, like, bring in other sources of data, maybe your transaction data, maybe your ticketing data, to build out more advanced dashboards. Right? And the next step after that is, like, you make the data in your warehouse more active. Right? It's not just about getting the data and building reports, but also, like, taking the data and sending it to different destinations, whether it's, like, a lead scoring model or a churn model or a recommendation model.
And finally, it's, like, the real time personalization use case where you want to, like, sync that data into some kind of an online store and consume that in the app for, like, personalizing the app experience. Right? So this is the journey that a customer has to take, and we want to be the platform which enables the customers to take that journey. Right? So that's how we kind of think about this. Now, like, of course, there are overlaps with other vendors. And, I mean, you can technically build this entire stack out with everything DIY. You can write the whole code yourself. Or you can buy 3 different vendors to do parts of it. I mean, you can buy, like, Segment to do event stream. You can buy, like, an ETL vendor and a reverse ETL vendor to do some other part of it. Or you can buy RudderStack. Right? So I think where we will shine is the customer data use case. Like, the ETL vendors have use cases beyond just customer data, like, the BI use cases and, like, operations use cases, and same for reverse ETL. Like, customer data is, like, a narrow part of all their use cases. Like, even the customers, from what I understand, are often different. Sometimes it's more B2B versus, like, B2C. We are, like, again, laser focused on the customer data use cases, like, enabling this entire journey for our customers.
[00:09:03] Unknown:
The conception of what customer data is can definitely vary based on the industry or the organization. And in the default case, when people think of customer data, they're probably talking about maybe e commerce where somebody is coming to a website. They wanna be able to track the kind of user funnel to see, okay. They've landed on the site. They're browsing around. They may be searching for a specific product. They've added something to the cart. I wanna be able to do abandoned cart notifications. I wanna be able to track sales and then be able to feed that back into my sales and marketing to say, okay. People who followed this path and came in from this route are more likely to purchase, so I wanna put my marketing dollars into this particular advertising channel.
And beyond that, I'm wondering what are some of the
[00:09:47] Unknown:
ways that people can think about what actually constitutes customer data, or the sort of formulation of what makes somebody a customer, and some of the ways that that might be expanded beyond this very narrow ecommerce scope? So I think we can kind of, like, go back to the use cases you talked about. Right? I mean, what is it you are trying to solve with customer data? Right? So let's take that first example of, like, marketing attribution. You want to understand which channels are working to get, like, high value customers and, like, which channels are not working. Right? So that requires you to bring in your web data. Right? What people are doing and whether they are transacting and so on. And that's kind of 1 source. And the other source of data is, like, just the ad spend data. Like, how much you are spending on, like, Google and Facebook and on what campaigns. And you have to kind of join these sources, right, to build out the marketing attribution model. And then people do very simple attribution models like first touch, last touch. You can do, like, complex ones. Like, we have kind of done, like, the multi touch attribution models. There are, like, game theoretic models, and we've kind of written some things around that. Right? So that's, again, 1 use case which requires not just your web data, but also your ad spend data from the ad network. So that's kind of 1 example. If you look at that cart abandonment use case, right, where you want to, like, let's say, give out coupons to some high value customers who you think are, like, important. Right? Again, that requires your web data, like, to understand who abandoned and then dropped off. But to understand, like, how valuable the customer is, you typically also want to bring in historic transaction data. Right?
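The simple attribution models mentioned above can be sketched in a few lines. This toy example assumes touchpoints have already been joined from web activity and ad spend data into `(customer_id, channel, timestamp)` rows; it credits each conversion to one channel under a first touch or last touch rule. The real multi touch and game theoretic models are far more involved.

```python
from collections import defaultdict

def attribute(touches, conversions, model="first"):
    """Credit each conversion to a single channel.

    touches: iterable of (customer_id, channel, timestamp) rows.
    conversions: set of customer_ids that converted.
    model: "first" credits the earliest touch, "last" the latest.
    Returns {channel: conversion_count}.
    """
    by_customer = defaultdict(list)
    for cust, channel, ts in touches:
        by_customer[cust].append((ts, channel))
    credit = defaultdict(int)
    for cust in conversions:
        path = sorted(by_customer.get(cust, []))  # order touches by time
        if not path:
            continue  # converted customer with no tracked touches
        _, channel = path[0] if model == "first" else path[-1]
        credit[channel] += 1
    return dict(credit)

touches = [
    ("u1", "google_ads", 1), ("u1", "email", 5),
    ("u2", "facebook", 2), ("u2", "google_ads", 9),
]
first_touch = attribute(touches, {"u1", "u2"}, model="first")
# u1's earliest touch is google_ads, u2's is facebook
```

Swapping `model="last"` flips the credit to each customer's final touch, which is why the two models can tell very different stories about the same spend.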
And that transaction data is often not captured in your web activity. Sometimes it is, sometimes it is not. It probably comes from your production database, which means, like, you will have to run some kind of an ETL into your data warehouse and combine those 2 data sources, right, to create that. So that's another use case where you have to combine multiple data sources to, like, come up with that, like, lifetime value score of a customer. So I can keep going on and on about the use cases, but, like, yeah, this shows that you have to combine, like, not just your web or product activity data, which is extremely important, but also, like, these other data sources. And this is where I think the warehouse first thinking really comes in. Right? And it's very hard to do these joins, if you will, across data sources completely in a third party SaaS, like, something like Segment and so on. As far as the
[00:11:57] Unknown:
target users that you're focusing on for RudderStack, you've enumerated the target use cases where it's definitely this end to end integration and flow of data, but for this scope of customer information. But in terms of the users of this platform, who do you kind of think about as far as the personas, and how does that influence the focus and prioritization of the features and the user experience that you build into RudderStack?
[00:12:25] Unknown:
Yeah. So that is the other key thing compared to, like, all the other CDPs, that is our differentiator. Like, we are laser focused on the data persona. Sometimes, like, it is the engineering team. Sometimes it's a separate data team. But, like, we strongly believe that that is the persona who should be owning the customer data stack and who should be building out those use cases. So if you think about, like, companies like Amazon, it is not the marketing team which is building out the customer data stack. It is very much the engineering team. The same thing is happening across the board now, like, even at, like, traditional, like, ecommerce companies. We have enough customers where, like, they're putting together a separate data team under, like, a head of data or chief data officer who is, like, thinking about this holistically. Marketing is definitely 1 of the top use cases, but the use cases go beyond marketing. Like, 1 of our customers, for example, has, like, built a personalization engine for support calls. People call in and you want to see the entire history of, like, what happened to the customer, what is the lifetime value, and so on. And that is very much a support use case, not a marketing use case. Again, built on top of, like, a customer data stack stored in a data warehouse. So, yeah, that's kind of the reason we're, like, laser focused. And to answer your second question, yeah, once we know that that is the persona, like, everything from the product experience to, like, open source and so on is trying to appeal to that persona. Right? I mean, how do we make the best product for data people and engineers?
[00:13:38] Unknown:
And recently, you have started offering a managed service for RudderStack so that folks don't have to deal with the setup and maintenance of the actual infrastructure to run the platform. They're able to just take advantage of a running instance. They can just add the integrations that they want, manage their data flows. And I'm curious how that shifts the kind of range of user profiles that are a good fit for RudderStack, where maybe you don't have to have all of the DevOps and infrastructure knowledge to be able to get it up and running. You just need to be able to understand what are these data flows that I care about and the types of transformations or integrations that I wanna build. Interestingly,
[00:14:17] Unknown:
like, that was 1 of the key learnings when we started RudderStack. Like, we did not have an open source project to start with. We started the company and then we launched the open source product and, at almost the same time, launched a cloud offering. Right? And if you had asked me then, my hypothesis would have been that, like, data engineers love open source. They would all, like, download the open source product and not go with the cloud tier. But what we have learned is, like, our growth on the cloud tier is, like, way, way faster than, like, on the open source. Like, even in large companies, it's not just, like, 2 developers who are trying out the open source. So I think, like, people have, firstly, like, realized that data integration is a means to an end. Nobody gets excited about, like, building another integration or, like, setting up, like, an integration platform. Definitely not about getting paged if something is down. Right? So, yeah, maybe that is the reason we have a lot more adoption on the cloud. So I think, like, to answer your specific question, I don't think the persona has shifted at all, like, with the cloud offering. In fact, like, it's the same persona, and we are seeing more adoption on the cloud than on the open source. Of course, like, there are companies who are excited about open source. They, like, culturally believe they'll never buy, like, a cloud hosted product, and we have the open source for them. And we are, like, excited to support them. As far as the
[00:15:24] Unknown:
application of RudderStack compared to some of those tools that we were mentioning for, you know, data integration and reverse ETL, you said that you're scoped more specifically to the customer data use case. And I'm curious about some of the ways that that manifests as far as the integrations that you target, as far as data sources and destinations, and some of the types of transformations that you support, or some of the ways that you think about the design of those integrations that make them well suited to these customer use cases, and some of the types of records or data formats that are typical for these customer use cases that make it worth being specialized in that way?
[00:16:09] Unknown:
That's a great question. And I think there are 2 ways to answer this. 1 is, like, the integrations themselves. If you take, like, an ETL vendor like Fivetran or Stitch, for example, they will integrate with, like, every source, NetSuite and, like, things which are traditionally not, like, customer data sources. Right? So that's kind of when it comes to integrations per se. When it comes to customer data, we know that there are, like, 4 or 5 sources which are very important for pulling the data on the ETL side. Event stream is absolutely important, particularly if you're, like, a consumer company. That's kind of on the sources side. Again, on the destination side, I think the reverse ETL vendors are also, like, early in their journey. We probably have a lot more integrations because of our event stream product. Like, we have to be at parity with Segment. But, for example, like, because we are selling more to consumer companies, Salesforce is less important, although we have that integration, as opposed to, like, something like Braze or, like, the marketing tools, which are more important. So those kinds of tradeoffs have, like, more to do with, like, what kind of companies we are selling to. And transformations are, again, important. The way we think about this is, like, we are landing the data ourselves. Right? So, like, we know what the data format is. Our customers are doing the transformations in the warehouse. So they land all the data in the warehouse, then they'll write some dbt or Spark jobs to, like, combine all that data and create this, like, user profile. Think of it as, like, a big table, 1 row per user with a bunch of features computed for those users. And this is then fed into, like, an analytics dashboard or, like, an ML pipeline or even an activation flow. Right? So we are kind of, like, building this feature where you don't have to, like, write dbt jobs to do that.
You define what features you care about and it will instantiate this table in the warehouse, so that you can feed these use cases, and eventually we'll build some of the use cases as well. Right? So that's how we kind of think about this. But, like, taking a step back, I think the way we should think about this is, like, some customers have already built an infrastructure. Right? I mean, I have my event streaming infrastructure. I have an ETL vendor in place. I have data in the warehouse. And I just need, like, the reverse piece of getting the data out. I mean, I've built all the infrastructure and I need the reverse part. So reverse ETL is a perfect product for that. I think those companies will do well. Like, they're great teams building great products. At the same time, some customers are starting their journey from day 1. I have to, like, start collecting data. And, to our surprise, this is not, like, just, like, mom and pop startups or, like, 5 person startups. Like, even large public companies are, like, deploying data teams and deciding, look, you know, we need to think holistically about data. If you are starting the journey there, where you're starting with, like, data collection and eventually want to take the whole journey, that's where we want to come in. Like, that is our sweet spot. So, again, like, depending on the maturity of the company and so on, I think, like, you'd pick RudderStack versus 1 of those vendors.
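The "big table, 1 row per user" idea described above is easy to picture in code. This is a hypothetical in-memory sketch of the aggregation a dbt or Spark job would materialize in the warehouse; the feature names (`event_count`, `total_spend`, `last_seen`) are made up for illustration.

```python
from collections import defaultdict

def build_user_features(events):
    """Collapse raw warehouse events into one row per user.

    events: iterable of dicts with at least "user_id" and "ts";
    "revenue" is optional. Returns {user_id: feature_row}.
    """
    rows = defaultdict(lambda: {"event_count": 0, "total_spend": 0.0, "last_seen": None})
    for e in events:
        row = rows[e["user_id"]]
        row["event_count"] += 1                     # activity volume
        row["total_spend"] += e.get("revenue", 0.0)  # lifetime value input
        if row["last_seen"] is None or e["ts"] > row["last_seen"]:
            row["last_seen"] = e["ts"]               # recency
    return dict(rows)

events = [
    {"user_id": "u1", "ts": 1, "revenue": 10.0},
    {"user_id": "u1", "ts": 3},
    {"user_id": "u2", "ts": 2, "revenue": 5.0},
]
features = build_user_features(events)
```

Each row of the resulting table can then feed a dashboard, a churn model, or an activation sync without re-deriving the features per use case.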
[00:18:38] Unknown:
Have you found that when somebody adopts RudderStack that they might be replacing 1 or a set of tools that they've already started tying together? Or is it usually homegrown systems that they're starting to replace in the event that they're not just completely greenfield?
[00:18:53] Unknown:
Like, I'd say, like, if I could take a guess, 1 third, 1 third, 1 third. 1 third is, like, they're replacing, like, an existing commercial system from one of the main vendors. But, like, either it's, like, too expensive or it doesn't cover their, like, future needs of, like, what they're trying to do. So that's kind of 1 third. The second 1 third is, like, they have some DIY solution. It is breaking. The person who wrote the code is gone, and nobody wants to maintain it. So they want to, like, think holistically. And then 1 third is, like, almost, like, their first time thinking about collecting data at all. I mean, like, the data team is trying to figure it out. Like, maybe they're collecting some data into, like, some marketing tools, but, like, now they're thinking about the customer data architecture with the warehouse first approach, and that's kind of the other 1. In terms of the
[00:19:34] Unknown:
kind of identity resolution aspect of working with customer data, there's definitely a very core need for people who are trying to understand the interactions that they have with their customers or with their end users and particularly when you're dealing with EventStream or clickstream data from web or even from mobile apps where somebody starts off as an anonymous user, and then they start to go through that customer journey where, eventually, they convert to creating an account on the platform or maybe they have multiple different interactions where they are an anonymous user on multiple different devices before they convert or they bounce back and forth. And I'm curious how you approach that identity resolution and entity resolution problem in RudderStack and some of the kind of systems that you've built in to be able to support that and some of the specific domain expertise that's needed by the company to be able to feed into that engine to be able to manage that mapping.
[00:20:33] Unknown:
Yeah. I mean, this is 1 of the biggest pain points, particularly in the consumer world. Right? So there are, like, 3 kinds of identity resolution. 1 is, like, event stream identity resolution, which is deterministic. Just like the example you gave. Right? I'm anonymous on the web browser. I log in with my email. Right? So at that point, I know that this logged in user ID or email is the same as that anonymous ID, right, the cookie ID. Because I logged in. Right? So that's kind of, like, deterministic identity. And if I came on a mobile device and I logged in with the same email, then you know that this anonymous browser ID and that email ID and the mobile device ID, they are the same person. Right? So we know that because, like, there's a direct edge association. And that's what we handle as a part of the core RudderStack offering. Right? So when you land the data into the warehouse, we'll create this identity table. We'll say that all these anonymous IDs and all these device IDs and all these emails and user IDs, they are of the same user, and give it a canonical RudderStack ID. So that's kind of the 1 part of identity resolution.
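The deterministic stitching just described, where a login creates an edge between an anonymous cookie ID and an email, and a shared email merges identities across devices, is essentially computing connected components over an identity graph. Here is a minimal union-find sketch of that idea; it is my illustration, not RudderStack's implementation.

```python
class IdentityGraph:
    """Union-find over identifiers. Each observed login adds an edge
    (e.g. anonymous cookie ID <-> email); resolve() returns one
    canonical representative per connected component."""

    def __init__(self):
        self.parent = {}

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # path halving keeps lookups near-constant amortized
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def link(self, a, b):
        """Record that identifiers a and b belong to the same person."""
        self.parent[self._find(a)] = self._find(b)

    def resolve(self, x):
        """Canonical ID for whichever component x belongs to."""
        return self._find(x)

g = IdentityGraph()
g.link("anon-web-1", "alice@example.com")    # web login
g.link("device-ios-9", "alice@example.com")  # mobile login, same email
canonical = g.resolve("anon-web-1")          # same as resolve("device-ios-9")
```

A warehouse identity table is the materialized form of exactly these components: every anonymous ID, device ID, and email in a component mapped to one canonical user ID.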
Now if you look at the broader identity space in the online world, there is also nondeterministic identity resolution, whether you call it nondeterministic or probabilistic, which basically means going through some third party vendors. Right? LiveRamp, for example, is a good vendor, and there were a couple of other vendors also; they eventually all got acquired. And they do a good job of associating users even when we do not have that edge. So even if the user never logged in with an email, they can say that this anonymous ID on one browser is the same as that anonymous ID on some other browser and the same as this device ID on this phone. And they collect this data through, sometimes neat ways, sometimes shady ways, of associating users. They get data from publishers and games and so on. I have worked in this space at my previous company, so I can't get into the details, but they apply really sophisticated machine learning algorithms to tie those users together. So that's the second level of identity matching. RudderStack does not do that, but our customers can. It is as simple as loading these third party vendors' JavaScript. You load, say, a JavaScript snippet, they give you a LiveRamp ID, and you pass that along on the RudderStack event. And if the same user on the mobile SDK passes a LiveRamp ID on the mobile side and they have the same LiveRamp ID, you know they're the same user, because LiveRamp has figured out the mapping between device IDs and LiveRamp IDs and other identifiers. So that is nondeterministic identity resolution, which you can layer on top of RudderStack if you want to. Some of our customers are doing that. The third part of identity resolution is what happens in the warehouse.
So you've got all these events tagged with a RudderStack ID or a LiveRamp ID or something, so you know that these events belong to the same user. But then you're also bringing in other data. Right? You're bringing in data from your ticketing system, bringing in, let's say, your CRM data. How do you know that this CRM record is the same as this user ID? Sometimes it is the same email, and you just join on emails. Sometimes you have to go multiple levels of joins. Let's say your app uses user IDs, but you have an email stored somewhere, and that email is the same as a lead record in your CRM, and so on. So you have to do all these kinds of joins, and that is very, very customer specific. Every customer has their own way of joining identities across these systems. So that is where, today, we don't do much; our customers are taking care of it themselves as part of the transformation I was describing. You get this raw data, and you want to create this 1 user table with 1 record per user, and a big part of that is the join that you need to do across different systems. Our customers are doing that in SQL or Spark or something. But we are working on a feature, as I said, for programmatically creating that user table. That will take the identity resolution function as an input, and then we'll go and execute it. But it will be specified by the customer.
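The multi-level join described here can be sketched in miniature with Python dicts standing in for warehouse tables; the table and field names (`users`, `crm_leads`, `lead_status`) are invented for illustration:

```python
# Two-hop identity join: app events carry a user_id, a users table maps
# user_id -> email, and the CRM keys leads by email. All names invented.

events = [
    {"user_id": "u1", "event": "page_view"},
    {"user_id": "u2", "event": "signup"},
]
users = {"u1": "alice@example.com", "u2": "bob@example.com"}
crm_leads = {"alice@example.com": {"lead_status": "customer"}}

def enrich(events, users, crm_leads):
    out = []
    for ev in events:
        email = users.get(ev["user_id"])     # first hop: user_id -> email
        lead = crm_leads.get(email, {})      # second hop: email -> CRM lead
        out.append({**ev, "email": email, **lead})
    return out

enriched = enrich(events, users, crm_leads)
assert enriched[0]["lead_status"] == "customer"
```

In the warehouse, the same shape would be a SQL join from events to users on the user ID, and from users to the CRM on the email.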
[00:24:25] Unknown:
Digging into the architecture of RudderStack itself, I'm wondering if you can describe a bit of how it's implemented, some of the design of the platform, and the ways that you have architected it to allow for this extensibility: being able to support all of these different sources and destinations and some of the mapping between them?
[00:24:44] Unknown:
So at the core of it, we are a streaming platform. Right? 1 way to think about it is: you have a stream of events coming in, and you have to fan it out to all these different destinations. And what are the guarantees you need? Number 1 is that if an event comes into RudderStack and you ack it, meaning the event is persisted and the client can delete it, then you cannot lose the event once it comes in. Right? So that's 1 thing. And then you have to go and deliver to all these different destinations, so you have to handle things like destination failures. Your destination can be down for extended periods of time, so you have to persist the event and keep retrying for some time, and so on. So you have to build a streaming engine. You could build on top of Kafka, but there are a whole lot of reasons we did not go with Kafka, and I can get into the details, but we have built our own streaming engine to handle this. So that's, at a conceptual level, how this works. Now, what are the other things we need to worry about? Number 1 is event ordering. Sometimes, for a given end user, you don't want events to be delivered out of order. You want the login event to come before, say, a transaction event, because you'll build reports on top. The system is designed to handle that. Scale is an important factor: how do you scale to billions and billions of events? But scale is easy because this is a trivially parallelizable system. For different end users, there's no overlap between the events, so you can just route them to different nodes. You have to handle event ordering as you scale up and scale out, but that's again built into the system. So that's the core of RudderStack. You can scale up, scale down.
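The guarantees just described (ack only after the event is persisted, retries on destination failure, per-user ordering) can be illustrated with a toy delivery loop. This is a sketch under those assumptions, not RudderStack's actual streaming engine, and all names are invented:

```python
# Toy delivery engine: an event is acked only after it is persisted,
# failed destination sends are retried, and per-user ordering is kept
# by draining one user's queue strictly in order.

from collections import defaultdict, deque

class DeliveryEngine:
    def __init__(self, destination):
        self.persisted = defaultdict(deque)  # per-user durable queue
        self.destination = destination       # callable returning True on success

    def ingest(self, user_id, event):
        self.persisted[user_id].append(event)  # persist first...
        return "ack"                           # ...then ack to the client

    def drain(self, user_id, max_attempts=5):
        delivered = []
        queue = self.persisted[user_id]
        while queue:
            event = queue[0]
            for _ in range(max_attempts):
                if self.destination(event):
                    delivered.append(event)
                    queue.popleft()  # remove only after successful delivery
                    break
            else:
                break  # destination still down; keep the event for later
        return delivered

# A flaky destination that fails twice before accepting events.
failures = {"count": 0}
def flaky(event):
    if failures["count"] < 2:
        failures["count"] += 1
        return False
    return True

engine = DeliveryEngine(flaky)
assert engine.ingest("u1", "login") == "ack"
engine.ingest("u1", "purchase")
assert engine.drain("u1") == ["login", "purchase"]  # order preserved despite retries
```

The key property is that nothing is dropped on failure: the event stays in the persisted queue until the destination accepts it, and later events for the same user wait behind it.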
And everything is built on Kubernetes. We bet on that early. The open source version, our commercial offering, everything is built on Kubernetes, and that's what we leverage for all the scale up, high availability, and so on.
[00:26:24] Unknown:
In terms of the extension and integration points that are available in the platform, what are the different hooks or kinds of plugins that developers and end users are able to add, or ways that end users might add additional customizations that maybe wrap the RudderStack experience?
[00:26:44] Unknown:
There are different levels of integrations. Right? 1 is, let's say, a destination that you want to add to RudderStack. There are 2 ways to do that. Our entire destination catalog code base is open source, so we have people who have contributed new destinations to RudderStack, and they're using those. That's 1 way. The other is that you can use our webhook destination: we have a generic way to add any destination using the webhook framework. We have customers who have done that, and eventually we have taken that code and wrapped it into a native destination. So that's the destination side. When it comes to the source side, event stream is straightforward. We have the RudderStack protocol; there's not much to add. And in fact, there are SDKs, and we have customers contributing SDKs. Segment has been very generous with open sourcing a bunch of their SDKs, and we are API compatible, so even those work more or less out of the box. That's 1 part. The other part is the ETL sources. We have a bunch of our own custom sources, but we also support the Singer protocol, and now Airbyte has a protocol as well. So if you have any of those open source integrations, you can use 1 of them, or you can custom build conforming to those protocols, and those can work out of the box on RudderStack. In fact, there's still a minor lift to integrate that into the platform, particularly if you're using the cloud offering: how do you run your own native integration? But it can be done.
[00:28:18] Unknown:
And as far as the evolution of the product, you mentioned that it's only been about 2 years since you first started working on it. But I'm curious, over that time, how the goals and design and use cases of the platform have changed or evolved, and in particular, given the amount of change and the velocity of the ecosystem over that period, how that has influenced your thinking about the role of RudderStack in the data space?
[00:28:35] Unknown:
I think the vision was always the same: we wanted to build a warehouse-first CDP. And this was exactly what I was doing in my previous job at 8x8, trying to build this exact set of use cases on something like Redshift, and the challenges I faced there were the reason for starting RudderStack. So I don't think the vision has changed. What was really surprising, though, is the momentum we have seen. And, again, I'd say we got lucky in some sense. Segment got acquired, and they are distracted; Twilio has their own charter for Segment. That's 1 thing. Data warehouses themselves have exploded. In 2018, 2019, Snowflake was still early in that sense, and now everybody knows about Snowflake. So there's that explosion of the data ecosystem. And the moment people buy data warehouses, they think, okay, what can I do with this? So data collection comes as a natural first step. I think these things have helped us accelerate and also helped us focus. For example, 1 thing we are realizing is that reverse ETL is exciting and important, but a lot of the market is still at data collection. Warehouse adoption is at, say, 5%.
There will be 20x more companies just deploying cloud data warehouses and getting data into the data warehouse. So ingestion is probably a bigger problem and a faster growing market, and so on. So, yeah, it's more about product prioritization. But the vision has not really changed.
[00:29:57] Unknown:
In terms of the security and regulatory considerations that come along with dealing with customer data and potentially private information, how are you designing some of those considerations into RudderStack so that users of the platform can add in the specific controls that they need to enforce, or layer in additional security applications? So maybe they're using something like Immuta for data governance and access control. What are some of the ways that RudderStack is able to coexist with these systems?
[00:30:34] Unknown:
Yeah. I think that is where the warehouse first story really shines. Our story is pretty simple: we are not creating another silo of customer data. We don't want to store your PII. We are the integration layer for getting the data into the warehouse, and then we'll do some relevant transformation on that data, but the data still stays with you in your warehouse. You have complete control. So I think that resonates very well compared to the traditional CDP vendors. That's 1 part of it. The other part is features. Even if the data is in the warehouse, you still need to support deletion requests, and it's painful for the customer to go and figure out how to delete data, particularly if you have to delete data across, like, 20 different destinations and your data warehouse. So there are features like that, and how to integrate with consent tools. We have some of them, and we are working on more. The third part is more product marketing. There's a lot of concern around cookies going away, and Apple's IDFA going away. If you look at the broader marketing ecosystem, people are realizing that the traditional DMPs and DSPs and all of those are kind of going away. So first party data is extremely important: own your customer data. You cannot rely on that data being stored in Google and rely completely on Google to give you the tools to work on it. You have to take ownership of your customer data. So I think this broader ecosystem shift has definitely helped us, the first party data vendors. It's a 3-part story, I would say.
1st, cookies going away and so on is helping pretty much every first party data vendor, whether it's us or the other CDP vendors. 2nd, we don't store any data, which helps in telling the story that we are not creating another data silo; everything is with you, and you have complete transparency. And 3rd, there are the features to enable that privacy and compliance for our customers.
[00:32:27] Unknown:
Today's episode is sponsored by Prophecy.io, the low code data engineering platform for the cloud. Prophecy provides an easy to use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all data users can use software engineering best practices: Git, tests, and continuous deployment with a simple to use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests and stores it in version control. Then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage.
Finally, if you have existing workflows in Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Learn more at dataengineeringpodcast.com/prophecy. In terms of the workflow of getting started with RudderStack and integrating it into an existing data platform, what is the process of going from deciding that RudderStack is a fit for the problem that you have, to working through implementing it and getting it deployed and running in your production environment?
[00:33:46] Unknown:
Yeah. I think that depends on what use case you're trying to deploy. If you, for example, start with the marketing attribution use case, where you just want to look at which marketing spends are driving more leads, then it's literally deploying the RudderStack JavaScript SDK on your website, coming to the dashboard, setting up the source, setting up the Snowflake warehouse, and collecting your Google Ads data, and everything will start flowing into the warehouse. And then we have both dbt models as well as Spark and Jupyter Notebook ML models to build out those attribution dashboards. So you can literally get up and running in maybe half a day. That's 1 extreme. But then it's not just collecting web activity data; it's also tracking events and collecting data from your mobile apps and so on. Then you have to think about it: if you already have a tracking plan, that's easy. If you don't have 1, you have to think about which events are important, and then there's a journey where you start with some events and add more and more as you go. Those things will take some more time. We generally work with our customers and share our best practices around event collection and the things they should care about. And it's not just us; the broader ecosystem is also teaching the world that event collection is important. But if a customer is already on a competitive vendor like Segment and is switching over, then it's literally, like, 1 switch, and you can be up and running in an hour. It depends on where the customer is and what the package is.
[00:35:11] Unknown:
As far as the operational considerations for people who are running it themselves, what are some of the challenges involved in managing it: understanding the scaling patterns, the limitations to those scaling abilities, the different components that need to be orchestrated in relation to each other, and the additional knowledge that's required beyond just understanding the mapping of the data?
[00:35:41] Unknown:
Yeah. That's a great question, and I did not cover that in my previous answer. So if you're trying to run RudderStack's open source version, not the cloud offering, then running a single node is pretty easy. If you have access to a Kubernetes cluster, we have Helm charts; you can just deploy the Helm charts and you should have a single node version of RudderStack running. If you want a multi node version, it's generally trivially parallelizable: you can just run multiple nodes and put some kind of load balancer or proxy to route events across them. Even that is easy. The only thing that is not available there is event ordering. You have to decide whether you care about event ordering, and you may not; for some destinations, event ordering does not matter. For example, if you're sending to Amplitude, they will look at the timestamp of the event and order things themselves anyway; the order of arrival doesn't matter. A lot of other destinations do that. So in that case, you just scale out: set up multiple nodes, put a load balancer in front, send data to all of them, and RudderStack will send data to all the destinations. You can get up and running pretty fast. I think we have done a good job of supporting our open source customers, and I know this is kind of self praise, but we have an open source channel, and if you run into any issues, we have been supportive, and we absolutely want them to be successful. We have some large customers, like other open source companies such as Grafana and Gatsby and so on, deploying the open source version of the product. But if you want event ordering, that's not available in the open source version.
I mean, we also finally want to make money and run this as a viable business.
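One common way to get per-user ordering behind a load balancer, hypothetically, is sticky routing: hash the user ID so that all of a given user's events land on the same node. This is a generic sketch, not RudderStack's actual mechanism, and the function name is invented:

```python
# Sticky routing sketch: a deterministic hash of the user ID picks the
# node, so one user's events are never split across nodes and per-user
# ordering is preserved even in a multi node deployment.

import hashlib

def route(user_id, num_nodes):
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# The same user always lands on the same node...
assert route("alice", 4) == route("alice", 4)
# ...while many different users spread across the nodes.
buckets = {route(f"user-{i}", 4) for i in range(100)}
assert len(buckets) >= 2
```

A real deployment would typically configure this as consistent hashing on a session key at the load balancer rather than in application code.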
[00:37:06] Unknown:
In terms of that boundary of the open source versus the commercial, I'm curious how you think about that and some of the governance that goes along with running an open source project that is so closely tied to the business model that you're operating on?
[00:37:22] Unknown:
We own the rights to the open source project. We raised money and formed the company even before the first line of code was written, and then we open sourced the project and started the cloud offering almost together. So it's not like an open source project that somebody else founded, where there's a conflict between the company and whoever owns the rights to the open source and so on. Things are pretty clear there. But it's still a valid question, and we have a point of view on it: what goes into open source and what doesn't? The way we think about it is that, as I said, it's really about the platform. Event stream is 1 part of it, but there are multiple other pieces to take you through that entire journey. So event stream is open source. There are certain features which are not available in open source: multi node scaling, transformations, and so on, and the fully featured UI. That's not open source. All the integrations are open source, and we'll keep making everything open source on that side. The rest of the platform is not open source; if you care about that, you have to go with the cloud offering. We also have an in-VPC offering where, if you don't want to send data outside, we can run RudderStack inside your VPC. That's the other thing I did not touch on in the privacy question: we can run RudderStack inside a customer's Kubernetes environment, and we have some customers who do that, in the fintech vertical and so on.
[00:38:42] Unknown:
And as far as the use cases or capabilities or features of RudderStack, what are some of the things that you think are either often overlooked or underemphasized, or use cases that are worth highlighting that people might not realize it's able to support?
[00:38:58] Unknown:
I think a lot of our customers don't know that we have ETL and reverse ETL and so on. We have customers who have bought us for event streaming and don't know about the other parts of the product. Part of that is a product problem; particularly on the ETL side, we are definitely way behind other vendors in the space, but on reverse ETL, I think we are pretty competitive. So that's 1 thing I would say, and it's our fault that our product marketing needs to catch up. The other part, in terms of features I'd want to highlight, is data governance and data quality. We have support for tracking plans and things like that. The product also needs improvement there, but we can definitely do a better job of highlighting that feature, because there is a common pain point people have around guaranteeing that events conform to a tracking plan and that spurious events are not being introduced, and so on.
[00:39:53] Unknown:
In terms of the ways that you've seen RudderStack applied, what are some of the most interesting or innovative or unexpected ways that you've seen it used or integrated with people's data flows?
[00:39:56] Unknown:
This is a request that keeps coming up. People, interestingly, want to send log data via RudderStack, because to some extent, log data is also customer events. You have requests coming to your web server, and instead of generating an event, you're generating logs, and then you want to ship those logs to RudderStack. And we have a transformation framework where you can parse the event. So they ship the whole log line as an event, and in a transformation, they parse out specific parts of the logs, say, this is the user ID and this is the event name, and then send that to Google Analytics or Amplitude so they can do reporting. I thought that was pretty cool, because we did not have a native SDK for that, so they built their own thing to ship logs into RudderStack. It's not a use case we recommend, though, partly because of the scale: the volume is typically an order of magnitude more, and 90% of the log lines are useless. They're sending everything to RudderStack and then throwing away everything except those few lines which have the customer events. It's much better to generate an event for that directly, but that's 1 thing people have used RudderStack for.
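The log-shipping pattern just described can be sketched as a transformation function that either parses a raw log line into a structured event or drops it. The log format and field names below are invented, and while RudderStack's user transformations are typically written in JavaScript, this Python version just illustrates the shape of the logic:

```python
# Toy "log line to event" transformation: keep only lines that carry a
# customer event, extracting the user ID and event name; drop the rest.

import re

LOG_PATTERN = re.compile(r"user=(?P<user_id>\S+)\s+action=(?P<event>\S+)")

def transform(raw_event):
    """Return a structured event, or None to drop the line."""
    match = LOG_PATTERN.search(raw_event["log_line"])
    if not match:
        return None  # most log lines carry no customer event
    return {"userId": match.group("user_id"), "event": match.group("event")}

lines = [
    {"log_line": "GET /health 200"},                      # dropped
    {"log_line": "user=u42 action=checkout status=200"},  # kept
]
events = [e for e in map(transform, lines) if e]
assert events == [{"userId": "u42", "event": "checkout"}]
```

This also makes the scale objection concrete: every line pays the pipeline cost, but only the few matching lines produce events, which is why emitting events directly is the better design.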
[00:41:01] Unknown:
In your own experience of building the RudderStack product and the company, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:41:10] Unknown:
On the technical side, I would say, unlike my previous companies, we have been in the fortunate situation of scaling rapidly. Maybe I could go back and revisit some of the engineering decisions: initially, you're trying to launch as soon as possible, but then you have to go back later and fix some of those things to handle scale and reliability and so on. If we had known we'd grow this fast, we would probably have spent more time upfront, but that's kind of a moot point. That's the technical side, and I am pretty happy with what we have achieved from the engineering perspective. On the organization side, we should have invested a lot more in design and product marketing. We were literally just an engineering run company for the first year and a half, almost 2 years, and I think that does show up in our product experience and our UI and so on. We could definitely do a much better job. I'm an engineer, very much a data engineer, so user experience is not something that comes naturally to me. So we should have invested in that, is 1 thing I would say. Other than that, hiring is a problem. I don't know if we could have done anything different; you just have to solve it.
[00:42:14] Unknown:
And for people who are interested in managing these overall workflows of customer data, integrating with their various sources and destinations, or using their data warehouse for transforming or enriching, what are the cases where RudderStack is the wrong choice and they're better off going with a different vendor or building their own in house tools? Or maybe it's not the wrong choice, but not a sufficient or complete solution?
[00:42:42] Unknown:
You have to think through that. If you have really custom requirements, where you cannot conform to the RudderStack schema for whatever reason, because you have already instrumented your code to do certain things and you have legacy events and so on, and it's a big lift to go and replace that, then maybe that's a reason to build something from scratch. But other than that, for the pure use case of event collection and routing and so on, I think you should run RudderStack or some of the competition. There are other players in this space, but I don't think you should build something yourself unless there are strong legacy reasons. The other thing is the broader CDP use case: it's not just event collection, you're thinking about solving a customer data use case. Let's say you're in marketing with 1 engineer or 1 data analyst, and you don't have the time and resources to deal with this infrastructure of loading into the warehouse and so on. Then you should go with 1 of the existing off the shelf CDPs, and that's perfectly fine. We don't even sell to the marketing persona.
[00:43:42] Unknown:
As you continue to evolve and grow the RudderStack product and the open source and the business, what are some of the things you have planned for the near to medium term or any particular areas that you're excited to dig into?
[00:43:55] Unknown:
Yeah. On the product and engineering side, what we have built is the plumbing layer for data: getting all this event stream data and cloud data in, and sending it back out to the different destinations through reverse ETL. So we've built the plumbing layer. The opportunity now is to build the transformations on top of the warehouse, which our customers are doing today. How do you do that programmatically, without having to write complex dbt logic and so on, whether in dbt or Spark? That's the first part. And the second is: how do you truly make it a platform, so that not just us, but other people can go and build interesting applications on top of that unified data model? I don't think we have completely figured that out, but we have some initial ideas. That's what we're really excited about.
[00:44:31] Unknown:
And are there any other aspects of the overall space of building a customer data platform, the work that you're doing at RudderStack, or the open source or organizational aspects, that we didn't dig into that you'd like to cover before we close out the show?
[00:44:41] Unknown:
We are hiring. So there are a lot of interesting problems to be solved in this space. We're excited about this, but we need a team to go and execute.
[00:44:51] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:45:07] Unknown:
Well, I think for our customers, data integration is kind of solved, or hopefully we'll solve some of it; the data plumbing side of things is solved. The thing is, data quality and data management are still a problem. You have all these different events coming in; how do you make sure that you're not introducing spurious events and that things are not breaking down? And there are multiple layers of failures. I don't think we have figured out the right solution. Data quality can be enforced at multiple layers: at the source where you're generating the events, at the pipeline layer, where people like RudderStack are shipping the data, at the destination where you're writing it to the warehouse, with things like Great Expectations and so on, or after it lands in your warehouse, where somebody is scanning and monitoring things. Some data quality vendors are doing it at the warehouse, but I think sometimes that's probably too late. So what is the right way to enforce data quality? And data catalogs also kind of fit in here. You have all these things coming in; how do you create a good catalog of everything? And it's not just raw data, but also your transformed data, and you're creating all these features; it's a mess. Somebody needs to come up with a clean solution to this.
[00:46:12] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at RudderStack. I'm sure there are lots of other areas that we could dig into at RudderStack and some of the surrounding ecosystem, but I appreciate the time that you've been able to take today and all of the effort that you and your team have put into building such a great product. It's definitely a useful entry into the ecosystem. I appreciate all the time and energy you've put into that, and I hope you enjoy the rest of your day. Thanks a lot. I'm truly humbled to be on the show, and it was really great chatting with you. Likewise. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Episode Overview
Guest Introduction: Soumyadeb Mitra
The Problem RudderStack Solves
Positioning RudderStack in the Data Ecosystem
Target Users and Personas
Identity Resolution Challenges
RudderStack Architecture and Extensibility
Security and Regulatory Considerations
Getting Started with RudderStack
Use Cases and Features
Future Plans and Closing Remarks