Summary
The team at DoorDash has a complex set of optimization challenges to deal with using data that they collect from a multi-sided marketplace. In order to handle the volume and variety of information that they use to run and improve the business, the data team has to build a platform that analysts and data scientists can use in a self-service manner. In this episode the head of data platform for DoorDash, Sudhir Tonse, discusses the technologies that they are using, the approach that they take to adding new systems, and how they think about priorities for what to support for the whole company vs. what to leave as a specialized concern for a single team. This is a valuable look at how to manage a large and growing data platform that supports a variety of teams with varied and evolving needs.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Your host is Tobias Macey and today I’m interviewing Sudhir Tonse about how the team at DoorDash designed their data platform
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving a quick overview of what you do at DoorDash?
- What are some of the ways that data is used to power the business?
- How has the pandemic affected the scale and volatility of the data that you are working with?
- Can you describe the type(s) of data that you are working with?
- What are the primary sources of data that you collect?
- What secondary or third party sources of information do you rely on?
- Can you give an overview of the collection process for that data?
- In selecting the technologies for the various components in your data stack, what are the primary factors that you consider when evaluating the build vs. buy decision?
- In your recent post about how you are scaling the capabilities and capacity of your data platform you mentioned the concept of maintaining a "paved path" of supported technologies to simplify integration across teams. What are the technologies that you use and rely on for the "paved path"?
- How are you managing quality and consistency of your data across its lifecycle?
- What are some of the specific data quality solutions that you have integrated into the platform and "paved path"?
- What are some of the technologies that were used early on at DoorDash that failed to keep up as the business scaled?
- How do you manage the migration path for adopting new technologies or techniques?
- In the same post you mentioned the tendency to allow for building point solutions before deciding whether to generalize a given use case into a platform capability. Can you give some examples of cases where a point solution remains a one-off versus when it needs to be expanded into a widely used component?
- How do you identify and track cost factors in the data platform?
- What do you do with that information?
- What is your approach for identifying and measuring useful OKRs (Objectives and Key Results)?
- How do you quantify potentially subjective metrics such as reliability and quality?
- How have you designed the organizational structure for your data teams?
- What are the responsibilities and organizational interfaces for data engineers within the company?
- How have the organizational structures/patterns shifted or changed at different levels of scale/maturity for the business?
- What are some of the most interesting, useful, unexpected, or challenging lessons that you have learned during your time as a data professional at DoorDash?
- What are some of the upcoming projects or changes that you anticipate in the near to medium future?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- How DoorDash is Scaling its Data Platform to Delight Customers and Meet our Growing Demand
- DoorDash
- Uber
- Netscape
- Netflix
- Change Data Capture
- Debezium
- SnowflakeDB
- Airflow
- Kafka
- Flink
- Pinot
- GDPR
- CCPA
- Data Governance
- AWS
- LightGBM
- XGBoost
- Big Data Landscape
- Kinesis
- Kafka Connect
- Cassandra
- PostgreSQL
- Amundsen
- SQS
- Feature Toggles
- BigEye
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Your host is Tobias Macey. And today, I'm interviewing Sudhir Tonse about how the team at DoorDash designed their data platform. So, Sudhir, can you start by introducing yourself? Thank you for having me, Tobias. I'm a big fan of your Data Engineering Podcast, as I mentioned earlier.
[00:01:12] Unknown:
I definitely find it a good source to increase my knowledge. There's a lot of good interviews and a lot of good information that can be had from there. I do lead the engineering organization that we call data platform at DoorDash. It consists of a few disciplines such as the real time streaming platform, machine learning platform, experimentation platform, data warehouse, etcetera. And I've been at DoorDash for about a year and a half. Before that I was at Uber for about 4 years doing something similar, managing data for the marketplace organization. So glad to be here. Looking forward to this chat.
Definitely. Do you remember how you first got involved in the area of data management? Data has always fascinated me. When I was a kid with my first access to a computer, the first thing I did at an MS-DOS prompt was type in 2 + 3 and hit enter. And I was hoping that it would come up with the right answer. But gosh, it said bad command or file name. And I was like, what? Anyway, that joke is sad. To me, computers were always, like, number crunching machines and data crunching machines. And, of course, with the advent of the Internet, now it's also a messaging machine. And together, the data crunching capabilities and the messaging capabilities, I think, are what have made the modern technical world possible.
As far as my professional engagement goes, for the most part of my career I was a generalist, a back end engineer. I started my journey at Netscape back in the day. My first introduction specifically to data and data management was at Netflix. We used to joke around at Netflix that it really is a log event processing company, which just so happens to be in the streaming movie business to make money. That's where my journey in the data crunching area began, and I took it forward at Uber where the passion continued. And here I am at DoorDash
[00:03:10] Unknown:
continuing. So it's been a passion all along. It's definitely an interesting career arc going from Netscape to DoorDash with many stops in between. And so you mentioned a little bit about what you do at DoorDash, where you head up the data platform team. Can you give a bit more of a flavor about some of the responsibilities that fall on your plate and some of the ways that data is used to power the business at DoorDash?
[00:03:34] Unknown:
I lead the data platform engineering organization, and for us the internal customers are data analysts, data scientists, machine learning engineers, and operations folks, the folks that manage the business on the ground. So those are my customers. The goal for the team is obviously to station the best possible big data stack, if you wanna use a buzzword, that enables all of the computing power that we need to gain insights and to run the marketplace. As far as where data is used at DoorDash, a little bit about DoorDash. I'm pretty sure people have heard of the company, but a little bit of a primer on DoorDash explains why we use data. DoorDash is really a multi-sided marketplace. There are the merchants, which could be restaurants or store owners. There are the dashers, who essentially are the folks that drive or ride and deliver the items. And then there are consumers like you and me, Tobias, that order food or any other items through the app. So that's the multi-sided marketplace. And a few other areas include, you know, convenience and groceries, where there are other actors involved as well. So anyway, this huge marketplace generates a lot of data. And the goal of the organization, the data platform organization, is to figure out how to harness the power of this large, large dataset to optimize the marketplace and to optimize the business.
A few examples could include something like ETA. ETA is estimated time of arrival. When you order some food via DoorDash, we obviously want it to come in as soon as possible. And it's critical for us to be as accurate as possible when we come up with the ETA. If we underpredict what the ETA is gonna be, then it will lead to a bad customer experience, a bad user experience. If we overquote the ETA, the likelihood is the customer is gonna churn and go to a competing app, for example. So these are the kinds of work that really are my responsibility, or my team's responsibility. We work with a lot of data scientists, data analysts, and machine learning engineers
[00:05:42] Unknown:
together to make this all possible. Because you have so many different actors within the platform whose behavior you have to try and understand, I imagine the current state of the world with the pandemic has thrown your overall capabilities of working with that data, and some of your existing models, kind of out the window. And I'm curious what the overall effect has been in terms of the scale and volatility of the data that you're working with and your ability to use it effectively.
[00:06:12] Unknown:
It certainly has. We've had steady growth, I should say, for many years now if you look at the DoorDash chart, but the pandemic definitely accelerated the shift in our consumers' behavior. For the most part, they have embraced the delivery option wholeheartedly, and there's a lot more growth to be had there as well. But you're right. In terms of volatility, especially talking about the machine learning models that we had earlier built: machine learning models are typically built on historical features, which is how the customer has behaved, or what the ETA or the prediction was, based on historical data. Of course, that all changed when the pandemic hit, and we had to retrain our models to suit the new world. That was an interesting exercise. In addition to that, it's about the volume. Yeah. The volume has definitely increased many, many fold. The volume of data, I mean.
And so that brings in challenges in terms of scaling the services that we have and choosing the right toolset to address the increasing volume and the complexity of the use cases that we have now. Thinking a bit more about the actual information that you're working with, can you give a bit of an overview of the types of data that you use and some of the primary sources that you collect it from? I would say the primary source of data for us is our own marketplace. Right? So as I mentioned earlier, there are dashers, consumers, merchants, etcetera, in the marketplace.
That's the primary source of data. We have a microservices architecture in our AWS cloud, and they publish a lot of different events. And they also store the transactional data in our transactional data stores. So that would be the primary data source for us. In addition to that, an interesting, big volume of data comes from the GPS locations of our dashers as they drive and deliver the items and the food. That's interesting because it helps us to understand how to better optimize the delivery time, or to update our customers in real time on where the dasher is. Another big area obviously is observability, which is used for really understanding the user behavior and experience and optimizing it.
DoorDash has got multiple applications: one for the dasher, one for the consumer, one for the merchant, etcetera. So we've got observability data based on those to ensure a good user experience.
[00:08:27] Unknown:
Because of the fact that you have so many different players, and such a geographic range across the different participants in the marketplace, I'm curious what other sorts of secondary or third party sources of information you're able to pull in to enrich the primary data sources that you're collecting from the applications that you manage.
[00:08:48] Unknown:
That's a very good question. To be honest, as far as machine learning models go, for example, the more data that you have, the better it can get in terms of its accuracy and its personalization and its optimization. But having said that, it also adds complexity. So it's always a balance on how much external data do we want. For example, a simple thing that comes to mind is if you were to get hold of the calendar in some sense in local markets. In other words, if there are events going on there, that's obviously gonna affect something. Weather is an interesting idea. Weather definitely affects how the marketplace behaves. So these are some of the data that we do get, but we haven't ventured beyond that as far as my knowledge goes. So those are some of the ones that I'm aware of. So you have some of these secondary sources such as the weather and localized calendars of events, and then
[00:09:38] Unknown:
you also have your applications that you're pulling from. And I'm wondering if you can maybe give a bit of an overview about the overall collection process for being able to bring all of that data into your systems and make it operational?
[00:09:50] Unknown:
So, the name of the game here is data transportation. Right? So for us, as I mentioned earlier, we are a microservices shop. There are multiple services behind our marketplace. They're all deployed on AWS, which is our primary cloud. And most of the data from those microservices gets stored in transactional databases. Our main database currently is AWS's Aurora, which is Postgres, essentially. We have ventured out into a few other databases such as Cassandra and CockroachDB. So as I mentioned, it's all about data transportation to the analytics land, and the best way to do that would be CDC, or change data capture.
Debezium is an interesting open source solution that definitely is a good start. And we do use Debezium for CDC for some of these use cases. For some of the other use cases, we've grown our own homegrown scripting and solutions. But ultimately, they all make their way into our data warehouse and our data lake. For our data warehouse, we've chosen Snowflake, which we've been working with for a couple of years now. For the other part of the transportation stack, I think the main component there would be Apache Airflow. So those are the main ones as far as batch analytics or batch pipelines go. For real time data, we've settled down on a few technologies as well. For example, we do use Apache Kafka and Apache Flink to transport real time data to our data lake.
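As a rough, hypothetical sketch of the CDC setup he describes, the snippet below registers a Debezium Postgres connector with a Kafka Connect cluster over its REST API; the endpoint, credentials, table list, and topic naming are placeholder assumptions rather than DoorDash's actual configuration.

```python
# Hypothetical sketch: register a Debezium Postgres connector via the Kafka Connect REST API.
# Endpoints, credentials, and table names are placeholders, not real configuration.
import json
import requests

connector = {
    "name": "orders-cdc",  # illustrative connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "orders-db.example.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "orders",
        "database.server.name": "orders",             # logical name used to prefix Kafka topics
        "table.include.list": "public.orders,public.deliveries",
        "plugin.name": "pgoutput",                     # Postgres' built-in logical decoding plugin
    },
}

# Kafka Connect exposes a REST API for creating connectors.
resp = requests.post(
    "http://kafka-connect.example.internal:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
    timeout=30,
)
resp.raise_for_status()
print("Registered connector:", resp.json()["name"])
```

Each captured change then lands on a topic such as orders.public.orders, where a downstream consumer or sink job can move it into the warehouse or data lake.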
And we are also venturing into a few OLAP like solutions such as Apache Pinot and Apache Presto. But these are still very much work in progress and early days. Another aspect of the overall collection and management of this data is I imagine you have a lot of
[00:11:36] Unknown:
responsibilities for being able to do privacy management because you have things like location information of the dashers and where they're delivering to and things like the information about the dashers, some of their demographic information. And I'm just wondering how you manage some of those security and privacy considerations.
[00:11:56] Unknown:
Data governance and compliance is definitely of primary importance, especially since we announced that we're gonna go public, about mid last year or thereabouts. We definitely doubled down on the privacy angle. PII is personally identifiable information, and it was a major project for us to ensure that PII data is encrypted and stored in all of our internal systems, and that access controls are in place throughout the data pipeline for whatever data has been labeled PII. In addition to that, we worked on CCPA, which, for your listeners if they're not aware, is the California Consumer Privacy Act.
We worked on that in order to be compliant in that area. GDPR is an interesting area, but since we are not yet in Europe, it's a project that we will undertake starting sometime soon. But, yes, governance and compliance are top priority. And we've stationed different solutions in different parts of the pipeline to address it.
[00:12:57] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting. It often takes hours or days. DataFold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. DataFold integrates with all major data warehouses as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of DataFold. Once you sign up and create an alert in DataFold for your company data, they'll send you a cool water flask. Because of the fact that you have such complex data that you're working with, and you're trying to build some nuanced models and analytics on top of it, I imagine that there are some constraints in terms of the time available, which we all suffer from, but also in terms of the engineering capacity of your specific team. I'm wondering what you consider as part of the build versus buy decision making for what to pull off the shelf and what to build in house, and how to stitch it all together so that you're able to maintain the overall platform. Because as a data platform organization, that's always the sort of guiding light and one of the most complex aspects of the entire data delivery and data utility picture.
[00:14:38] Unknown:
Yes. Right. Again, we are spoiled for choice, aren't we? There's a ton of choices out there. But, yes, one of the guiding principles for us is we wanna leverage as much as possible. For example, I mentioned DoorDash runs on top of the AWS cloud. So, traditionally, we've been trying to leverage any existing technologies that fit our needs and, of course, our budget. On a few occasions, we've had to evaluate existing technologies and had to choose to build certain components in house. But these are typically done on top of existing open source technologies, and we just mostly weave them together or address some of the gaps that are in there. One such example, I guess, I could quote is our ML platform, or machine learning platform, which we built in house to address our needs.
However, it is built on top of other existing OSS components such as scikit-learn, PyTorch, LightGBM, XGBoost, etcetera. Yeah. Build versus buy is an interesting decision to make, and it all depends on various factors such as how scalable the solutions are, how flexible they are to accommodate current and possibly future needs, how well they integrate with our internal services. Are they open source? Are they closed source? And cost, of course, is a big, big factor. With the growth of DoorDash's business as well as the growth in data volume, we definitely want solutions that scale, and hopefully don't scale linearly in terms of cost as well. Yeah. There's definitely
[00:16:09] Unknown:
ongoing decision making between, do you choose point solutions that are best of breed and that you're able to integrate into your broader stack, or do you go for the monolithic, vertically integrated approach? The cost there being the cost of developers and time spent versus the cost of purchasing the vendor solution, and the lock-in and the sort of availability for extension of that system.
[00:16:32] Unknown:
Absolutely. You're right. The way I look at the big data space: if you were to look at the big data landscape infographic, it's become very hard to really take it all in. It's like a jigsaw puzzle, isn't it? The only difference is that you could assemble it any number of different ways, and out comes a picture unique to your company. So it's a very interesting area when it comes to choosing which ones to support, which ones to double down on in terms of, you know, working on in house, versus which ones to simply go with the vendor or existing technology for. In my blog post that I published a few weeks ago, I referred to it as the paved path. And that's an interesting concept that's used at multiple similar companies. And we do have a paved path at DoorDash where we've assembled the jigsaw puzzle of the big data stack in a way that makes sense for us at DoorDash.
[00:17:22] Unknown:
Absolutely. And you mentioned the paved path aspect as well in another post that I read about how you're scaling the data platform at DoorDash. And I wanted to dig into that a bit more as far as how you select the technologies that are part of that paved path and maybe just describe a bit of what you mean by that nomenclature.
[00:17:43] Unknown:
So let's go with an example. Let's take stream processing, which is when there are real time events: one needs to publish them, consume them, compute on them, store them, query them, etcetera. Right? So that's just one part of the puzzle, if you will. There are so many offerings in stream processing alone. For example, there are Apache Samza, Flink, Spark Streaming. Not to discount AWS SQS, Kinesis, Kafka Connect, etcetera. Right? So what does the paved path mean in this instance? For any given challenge area, in this particular case stream processing, we're unlikely to need all of those. So we need to pick and choose which particular set makes sense for us. And that really is the paved path. So in the case of stream processing for us, we chose to settle down on a combination of Kafka and Apache Flink.
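For a concrete flavor of what landing on that combination can look like, here is a minimal, hypothetical sketch that publishes a marketplace event to a Kafka topic using the kafka-python client, the kind of event a downstream Flink job would consume; the broker address, topic name, and event fields are assumptions made for illustration.

```python
# Minimal sketch: publish a delivery-location event to Kafka for downstream stream processing.
# Broker address, topic name, and event fields are hypothetical.
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker-1.example.internal:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # serialize events as JSON
)

event = {
    "event_type": "dasher_location_update",
    "dasher_id": 12345,
    "latitude": 37.7749,
    "longitude": -122.4194,
    "timestamp_ms": int(time.time() * 1000),
}

# Keying by dasher_id keeps each dasher's updates ordered within a single partition.
producer.send(
    "dasher.location.events",
    key=str(event["dasher_id"]).encode("utf-8"),
    value=event,
)
producer.flush()
```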
And this is well integrated with our compute infrastructure at DoorDash, which is mostly Kubernetes based. And so if you were to ask me what the paved path looks like overall at DoorDash, I would paint a picture that looks somewhat like Apache Kafka, Flink, Airflow, Cassandra, Postgres, Snowflake, Spark, etcetera,
[00:19:01] Unknown:
with a few new entrants, such as Pinot and Presto, added to the mix. One of the things that I found interesting in your post where you were talking about this paved path concept is that, because DoorDash has been scaling and there are so many different projects going on, different teams are given the freedom to select the tooling and technologies that fit their particular use case. But because you have limited capacity in the support organization of the data platform team, you can't give full support to the whole list of different tools, because that's a situation that has potentially infinite complexity. And so you have this paved path of: these are the supported technologies.
If you want to use these, then you have a happy path going from local into production. If you want to use something else, then we can give you some guidance, but you're largely on your own for figuring out how to actually maintain that in the long run. And I'm wondering if you can just talk through some of the ways that that manifests from an organizational perspective, of how that guides the tooling selection for different teams. And for teams that do choose to use something that isn't part of this paved path, how does that factor into the overall evolution of their systems? Do they start with something that isn't part of the paved path to prove something out and then iteratively migrate into this sort of well maintained lane of going from concept to delivery?
[00:20:19] Unknown:
Tobias, you've described the paved path really well, better than I could, so thank you for that. But, yes, you are right about the challenges. Let's focus a little bit on the other teams, or the product teams, that want to explore or go to the cutting edge or bleeding edge of some of the newer choices that we have. Our intention is definitely not to discourage it. We wanna encourage it. Innovation is good. Trying out and exploring new ideas is good. In fact, the way my organization, or data platform, would like to do it is to make sure that we do it collaboratively. Rather than every team doing it, you know, in a fragmented approach, it would be nice to collaborate together and say, look, hey, we've got an idea. I wanna use this brand new technology in the space of stream processing, and it sounds like it's gaining some buzz and gaining some traction. Why don't we try it? Alright. Good. Let's do it. But let's do it within the umbrella of the data platform that we have together so that we could learn. And if it were to work, we could then figure out how to migrate the rest of the use cases. And if it weren't, that's fine.
The idea of a paved path is that, yeah, I mean, a paved path needs maintenance, needs upgrades, and needs, you know, different new things added to it. So the idea is not to restrict, but, yes, to work together collaboratively to define one beautiful path that is optimized and that is well supported within the company. An example of that would be: we do have what we call an ML council, or machine learning council. The idea is not to be a committee of folks that can say what you can use or cannot use, but the idea is to make sure that voices are heard throughout the company, and what it is that's of interest to them, so that we can be ahead of the game in terms of evaluating and understanding what are the right technologies for us. And so another aspect of the paved path approach
[00:22:12] Unknown:
and the plethora of tooling that somebody might use is how the concepts and enforcement of data quality differ between the different technology choices, and how you maintain that at production scale. And I'm wondering if you can talk a bit about some of those challenges and how you're approaching the overall concept of data quality and consistency of data, given that you have so many different technologies in play.
[00:22:42] Unknown:
To be honest, it's still an area of early investment for us. I do believe that data quality and data observability are super important and paramount to the success of a data platform organization. But we started small. What I mean by that is our initial focus was simply defining a small set of SLAs that we wanted to achieve. For example, one of the main reasons for a data platform to exist, from an analytical perspective, is to compute and to report on top business KPIs on a daily basis. Let's start there. So what are the golden datasets, or the important datasets, that are required to be inspected from a quality perspective? We reduce it to that one small space and figure out how we can define the quality there.
When it comes to batch ETL pipelines, which is the whole pipeline that gathers data from the various services and transactional databases that I mentioned, yes, we've evaluated a few existing solutions. We've now settled down on essentially using Airflow check operators, pre-check and post-check. We allow our teams to build rules based on a definition of what they should check or validate in the pre and post operators, along the lines of the sketch below. So that's just one example of how we utilize quality checking in our ETL pipelines.
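A minimal sketch of that pre-check/load/post-check pattern is below, assuming a recent Airflow with the common SQL provider (import paths differ across Airflow versions) and a hypothetical Snowflake connection and table names; the actual rules are whatever each team defines.

```python
# Hypothetical sketch of a pre-check -> load -> post-check Airflow DAG.
# Connection IDs, table names, and SQL rules are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.common.sql.operators.sql import SQLCheckOperator

def load_orders():
    # Placeholder for the actual transformation/load logic.
    pass

with DAG(
    dag_id="orders_daily_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Pre-check: the staging table should have received rows for today before we load.
    pre_check = SQLCheckOperator(
        task_id="pre_check_source_rows",
        conn_id="snowflake_default",
        sql="SELECT COUNT(*) > 0 FROM staging.orders WHERE load_date = CURRENT_DATE",
    )

    load = PythonOperator(task_id="load_orders", python_callable=load_orders)

    # Post-check: no NULL order IDs should make it into the warehouse table.
    post_check = SQLCheckOperator(
        task_id="post_check_no_null_ids",
        conn_id="snowflake_default",
        sql="SELECT COUNT(*) = 0 FROM analytics.orders WHERE order_id IS NULL",
    )

    pre_check >> load >> post_check
```

The check operators fail the task, and therefore alert, whenever the query returns a false or zero value, which is what turns these rules into enforced quality gates.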
Another example would be in the real time pipeline, the real time data processing area. We insist on having a schema for the payload. And both on the publishing side as well as on the consuming side, we do schema validation, and alerting and monitoring based on whether or not it passed the schema validation. So, yeah, those are some of the areas, but it's still very early times, and there's a lot more to be done. For the entire pipeline, I would say that data discovery and data lineage, concepts such as those, become very interesting and important. We're definitely evaluating a few choices that we have in those areas. I would say we've evaluated Amundsen. It looks pretty promising in the area of metadata and discovery and lineage. That's one way to go. But to be honest, there are many solutions out there. We just have to figure out the right solution for us. As somebody who works in the operation space, one of the kind of perennial challenges of managing the variety of tools that are out there
[00:25:08] Unknown:
is gaining an appropriate level of expertise to be able to understand how to run these things in production, and then also the complexity of managing a heterogeneous infrastructure of being able to deploy things and manage the life cycle and the upgrade process and configuration management. And I'm just curious how you tackle some of those problems at DoorDash.
[00:25:30] Unknown:
As a DevOps engineer, or a DataOps engineer if you will, you understand the challenges really well. And that's the whole idea of a paved path. Right? Which is, yes, given the complexity of the big data ecosystem, we don't really wanna support every technology that's out there, of course. We wanna settle down on the right few that make sense for us so that we could get better at them, understand them better, and over a period of time build all the right observability, you know, and monitoring, automation, etcetera, that we need for them, and gain knowledge as we do that. So, yeah, the real answer to your question is we wanna keep the infrastructure, or the data stack, simple, but at the same time allow for innovation and exploration. And it's a fine balance to have. In some cases, we've got, you know, various professional services that help us with the operations. For example, we use Snowflake, and we engage with Snowflake's professional services to ensure that we have all the right knowledge and support needed to run it. In many other cases, these are open source components, and we have engineers sufficiently knowledgeable on how to maintain and operate them. In terms
[00:26:39] Unknown:
of the evolution of technologies, I know you mentioned that you've only been at DoorDash for the past year or so. I'm wondering how much context you have about some of the technologies that were used earlier in the life cycle of the company, and how the requirements of the volume of data and the scale and types of analytics that you're trying to do have driven some of the newer selections of technology that you're maintaining currently,
[00:27:07] Unknown:
and what the migration path looks like from some of those earlier choices to what you're using now. I mentioned Snowflake already. So that was a major migration a few years ago. As with any startup, as with any technology company in this space, you start off with something and then you evaluate as you go to see if it will address your scale, your needs, your budget. So Snowflake is one such thing that we did adopt a couple of years ago. But that was before I came, so I'm not fully aware of all the minor details that went into that decision. I could quote another one that is more recent, that was done since I joined. One of them is we used to use Celery and RabbitMQ for some of the real time asynchronous messaging needs that we had.
And we published a post, I believe, a couple of months ago, on how we found that solution to be a little bit challenging in terms of scalability and automation. And we have since then moved to a combination of Apache Kafka and Apache Flink, which we've, you know, settled down on. Similarly, in a few other use cases, we used to use SQS and Kinesis as a combination. Again, we chose to settle down on Apache Kafka and Flink. This wasn't about scalability, to be honest. It was more about consolidating our stack so that we don't have to use every flavor of messaging, more from an operational experience perspective.
[00:28:39] Unknown:
RudderStack's smart customer data pipeline is warehouse first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plug ins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
Another interesting subject in the blog post where you were discussing some of the scaling challenges of the data operations at DoorDash is the tendency for building point solutions for a given use case before deciding whether or not to generalize that into a paved path or a platform capability. And I'm wondering if you can just talk through some examples of cases where the point solution remained as a one off and didn't get integrated into that paved path but might still be operating, and others where the point solution needed to be expanded into a widely used component, and just some of the process for navigating and understanding what the tipping points are for when a point solution needs to remain as an ad hoc instance versus when it needs to be more broadly generalized.
[00:30:02] Unknown:
That's an interesting area as well. And we also talked about how we should start small and iterate. Some of these solutions make it through into generalized offerings. Others don't. What are the examples that I could quote? In the area of the ones that did graduate into a generalized solution, I would like to use the earlier mentioned ETA, or estimated time of arrival, prediction service. Way back when we started it, it was just a one off solution specifically focusing on just that ETA prediction. It was a bespoke point solution, of course. But then, of course, we knew that general inference and prediction services in the area of machine learning are a general topic.
So what we did in that particular case is work on a general inference service, or prediction service, which we call Sybil. We've posted a blog post on this as well, on DoorDash's engineering blog. So all the details in terms of how this converted from a bespoke solution into more of a generalized prediction service, and how it has scaled to our needs, are an interesting topic. It's out there in the blog. In terms of point solutions that did not make their way through into generalized solutions, I could use the example of our experimentation framework, or the A/B testing framework.
We definitely used a few different variations of feature flags or experimentation configuration. But in the interest of consolidating, we decided to deprecate a few of them, and we are now focusing on what we call our dynamic value framework, or ecosystem, that helps us configure the whole world of feature flags, feature toggles, rollouts, experimentation configuration, and workflow management. That's an example of some point solutions that essentially converged into one single solution.
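To make the graduation from a bespoke predictor to a shared service concrete, here is a deliberately simplified, hypothetical sketch of the pattern; it is illustrative only, does not represent the actual Sybil API, and the model logic is a stand-in for a trained LightGBM or XGBoost estimator.

```python
# Illustrative-only sketch of generalizing one-off predictors behind one interface.
# Names, features, and the toy model are hypothetical.
from typing import Any, Callable, Dict

class PredictionService:
    """Routes prediction requests to registered, named models."""

    def __init__(self) -> None:
        self._models: Dict[str, Callable[[Dict[str, Any]], float]] = {}

    def register(self, name: str, predict_fn: Callable[[Dict[str, Any]], float]) -> None:
        # Each use case (ETA, search ranking, recommendations, ...) registers its model once.
        self._models[name] = predict_fn

    def predict(self, name: str, features: Dict[str, Any]) -> float:
        return self._models[name](features)

def eta_model(features: Dict[str, Any]) -> float:
    # Stand-in for a real trained estimator.
    return 20.0 + 1.5 * features["distance_km"]

service = PredictionService()
service.register("eta_minutes", eta_model)
print(service.predict("eta_minutes", {"distance_km": 4.2}))  # -> 26.3
```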
[00:31:57] Unknown:
Another aspect of the overall challenge of running a data platform is managing the infrastructure cost and the operational cost, and understanding when a particular solution is pulling its weight and returning the value of how much you invest in it, versus when something is operationally too expensive and isn't worth the overall effort. And so I'm wondering if you can discuss some of the ways that you track the overall cost factors that go into the data platform and what you do with some of that information. Like, do you retire certain pieces of infrastructure? Do you change the life cycle policies for how long you maintain data? Do you change some of the human investment that you put into some of these challenges?
[00:32:40] Unknown:
Cost is definitely a factor. In many organizations, a data platform is essentially looked upon as a cost center. I would say we've got a good handle on our overall budget. From a metadata perspective, we know, like, the various spends that we have, and we track them pretty religiously. Where we can improve, I believe, is in terms of those micro optimizations, or understanding the ROI, or return on investment, of some of the point solutions, etcetera. We're definitely not there yet. As you mentioned, we've got thousands of ETL jobs that store data into a warehouse. Not all of them are equally important, of course. How does one know which ones are worthy of supporting and maintaining and which ones aren't? It's an interesting question.
I'd confess we don't have a really good solution as of now in terms of micro optimization. But on a macro optimization, or budgeting, level, we do look at it on a weekly cadence and also review it on a quarterly cadence. As far as what we can do with that data, as you mentioned, some of the typical techniques used are to understand which data can be archived and stored in a less expensive, less compute intensive area. We use Amazon S3 for parking, quote unquote, some of our data. It's a good, efficient solution. That's certainly one of the ways we do it. We have a new project which we call tagging. In combination with data discovery and lineage, when we invest in it, I hope it will give us more insights into different parts of the ecosystem, so we can try to understand what the return on investment on those would be. But it's still early days there. More to do in that area.
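As one hedged example of that kind of "parking", the snippet below uses boto3 to attach an S3 lifecycle rule that transitions aging objects to cheaper storage classes; the bucket, prefix, and day thresholds are placeholders, not DoorDash's actual policy.

```python
# Hypothetical sketch: transition aged analytics output to cheaper S3 storage classes.
# Bucket name, prefix, and thresholds are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-etl-output",
                "Filter": {"Prefix": "etl-output/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},  # infrequent access after 90 days
                    {"Days": 365, "StorageClass": "GLACIER"},     # archive after a year
                ],
            }
        ]
    },
)
```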
[00:34:19] Unknown:
Another thing that I've heard mentioned all over the place, particularly in the data and operation space or in engineering more broadly, is the idea of OKRs or objectives and key results. And for things like a sales organization, it's fairly clear what kinds of things you would track, you know, how many people you've contacted, the conversion rates, the lifetime value of the customers. In a data organization, it can be a little less clear to understand what are the pieces of information that you track, how do you translate that into the types of value and the overall objectives of the organization. I'm wondering if you can talk a bit about your overall process for identifying what are useful OKRs and some of the ways that you quantify things potentially for metrics that can be subjective such as reliability or quality.
[00:35:09] Unknown:
Road map, vision, quarterly planning is an interesting area. I believe different companies have different approaches. Our approach at DoorDash is, yes, we've settled down on the OKR framework objectives and key results. And, Tobias, you're also right that if you are a product organization or a sales organization, it's easier to define these OKRs, isn't it? For example, you could go, my OKR is to launch feature x and move a business metric y in a certain direction by a certain percentage. Awesome. Good. Easy to track. It's much harder for infrastructure organizations or platform organizations such as mine. For example, an OKR I could quote is deploy and operate x number of Kafka clusters with y number of topics that can process z number of messages per day.
Okay. So you did that. What's the impact? How do you know that it's successful and it's useful? That's where it gets a little bit harder. And also adoption by various product teams is very challenging as well. Everyone's got different priorities. Things don't line up sometimes, etcetera. So, yes, it is challenging. And OKRs are something that needs to evolve over time, and we are getting better at it. How do we understand impact? As I mentioned, if you were to do the tagging exercise, the data discovery, the lineage, etcetera, it's easier to figure out, hopefully, as we go, what the impact is. Why is it important to understand impact? Apart from the cost part of it, the major reason to understand the impact would be prioritization, which is: we've got a limited amount of bandwidth.
And if you wanna know which ideas to prioritize ahead of others, it's better done if you know what the impact estimate would be. So it's an interesting idea. A simple example would be, well, I mentioned Sybil, which is our prediction service. It now handles multiple use cases including ETA prediction, search, recommendation inferences, etcetera. How does one know the value of that service? And if you wanted to add additional features to that service, for example, if we wanted to support ensemble models, what is the incremental impact of that? Well, it's a tough question to answer.
[00:37:25] Unknown:
There are a number of other interesting organizational aspects that we could dig into, and I'm sure that we could probably talk, you know, ad infinitum about all of the technical and organizational and strategic issues that you're tackling. But as with everything, time is finite, so I'm interested in learning a bit more about some of the particularly interesting or useful or unexpected or challenging lessons that you've learned during your time as a data professional broadly, but also in your time at DoorDash leading the data operations team. If I were to pick and choose one interesting or challenging lesson that I've learned, the one that still keeps me up at night is
[00:38:02] Unknown:
trust. It's not necessarily unique to DoorDash, but for any given data organization, trust is the hardest to gain and the easiest to lose. In any given big data platform, for that matter, there are multiple areas where things can go wrong. Fast moving companies similar to ours obviously wanna optimize for speed of innovation, for, you know, speed of adding more features, etcetera. That comes at a cost. It comes at a cost of reliability. It comes at a cost of quality. So losing trust when things go wrong would be the easiest way down, from a spiraling point of view. And recovering from that is the hardest, I would say. Thankfully at DoorDash, at least in my tenure here, that hasn't happened. We've retained the trust of the organization thus far, but that is the one thing that does keep me up at night. I wanna make sure that whatever we do, we focus on the quality and reliability aspects of our work.
There are some solutions out there that we plan to evaluate and plan to incorporate into our pipeline, but that's the area that keeps me up at night the most.
[00:39:08] Unknown:
In terms of the overall scope of projects that you're dealing with and the range of technologies that you're responsible for, what are some of the things that you're keeping a particular eye on as far as industry trends or upcoming technologies? And what are some of the upcoming projects or platform changes that you are particularly interested in or excited for? We did talk a bit about discovery, lineage, metadata management,
[00:39:35] Unknown:
cost optimization, etcetera. So those are interesting areas. And as you know, in your podcast you've interviewed a few of them as well. I wanna give a shout out to some of my ex colleagues from Uber. You might know Egor. They started this company called Bigeye, which was earlier known as Toro, I believe. It's an interesting concept. It's an interesting way of looking at data quality. And not just that company, there are many other companies and also open source solutions that have been stationed in the area of data quality and data lineage and discovery. I mentioned Amundsen. That's an interesting one as well. So, yeah, those are some of the very interesting ones I wanna keep an eye on. The other ones would be in terms of our in house development. What I'm excited about is, I mentioned, we've got a real time streaming platform.
But, honestly, not every person, like a data analyst or a data scientist, etcetera, is well versed with Java and Scala in order to build Flink jobs. If you wanted to democratize it, you know, in terms of how we can build KPIs and insights and features for our ML models, we need to make it a lot simpler. So a new blog post will be posted soon, for those interested, possibly in about a week's time, that describes how we've gone about this. We've used our own domain specific language, which is an extension on top of a SQL dialect, and it democratizes how real time jobs, or real time insights and real time features, can be obtained.
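Purely to illustrate the idea (the syntax below is invented for this example and is not DoorDash's actual DSL, which is described in the blog post he mentions), a SQL-flavored real time feature definition might be declared and handed to the platform like this:

```python
# Invented, illustrative example of a SQL-flavored streaming feature definition.
# The dialect and the registration function are hypothetical.
FEATURE_SQL = """
CREATE FEATURE store_avg_delivery_time_30m AS
SELECT
    store_id,
    AVG(actual_delivery_seconds) AS avg_delivery_seconds
FROM delivery_events  -- a Kafka topic exposed as a streaming table
GROUP BY
    store_id,
    HOP(event_time, INTERVAL '5' MINUTE, INTERVAL '30' MINUTE)  -- sliding 30-minute window
"""

def register_streaming_feature(sql: str) -> None:
    # In a real platform this would compile the definition into a Flink job and deploy it;
    # here it simply stands in for that submission step.
    print("Submitting streaming feature job:\n" + sql)

register_streaming_feature(FEATURE_SQL)
```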
So other areas include experimentation and data lake frameworks. So there's a whole bunch. So, yeah, those are the interesting ones that I'm keeping an eye on. Are there any other aspects of the work that you're doing at DoorDash, or the overall space of managing a data platform and the technologies involved and the organizational aspects, that we didn't discuss yet that you'd like to cover before we close out the show? Well, it's been a good set of topics, and, as you mentioned earlier, if you were to dig in and deep dive into any of them, we could discuss them for hours. So to close out, I guess, the best way I can picturize this is, again, I go back to the jigsaw puzzle idea. Big data is an ever changing, ever evolving platform.
For example, Hadoop was, right, the most buzzworthy thing. But that was just 7 or 8 years ago. No one talks about it anymore. I wouldn't say no one, but a majority of folks have moved away from it. Right? It's always interesting to figure out where we're all headed. When it comes to the analytics space, people are talking about the data lake, but the data lake is now old. Data mesh is the new thing, or the lakehouse is the new thing. So it's a super interesting area just to keep track of, in terms of understanding how it all evolves. But all said and done, at the end of the day, really, rather than getting starry eyed about all the wonderful technologies out there, the most important thing to remember is that we as data professionals are tasked to do one thing: give the right, accurate data to our stakeholders, such as our business partners and our product partners,
[00:42:40] Unknown:
in a timely manner. If we can figure that out and keep doing that without losing their trust, I think we will be a success. Yeah. That's definitely a great guiding principle to leave off on. And so for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think we referred to change data capture, and I also mentioned that Debezium is an interesting open source solution. And, Tobias, I know you interviewed
[00:43:15] Unknown:
the Debezium folks a few weeks ago, I believe. I'm interested in that, but it's still a gap. The reason why it's still a gap is there are so many variations and different levels of support for different databases. Right? In our case at DoorDash, we are using Debezium. It works fantastically fine, at least for now, for our Cassandra use case. But as I mentioned, there are multiple different data solutions. How do we standardize on that? How do we build it end to end? That is an interesting area. For example, with Debezium, it's got good solutions up until the reading of the event log, up until making it reach a Kafka topic.
But what happens after that? How does it sink into a proper data lake or a warehouse? Those are interesting areas. So I think there are gaps there. Another one that comes to mind would be the area of governance and compliance. There are so many governance and compliance needs these days. CCPA is something I mentioned. GDPR is another one I mentioned. Most companies that I'm aware of seem to be learning this all the hard way by themselves. I know there are solutions out there, but the data landscape in many companies the size and scale of DoorDash is so complex that there isn't an easy one stop solution that fits them all. If somebody can figure that out, I think that would be a great service to the community.
[00:44:34] Unknown:
There are many more, but I'll leave you with those two for now. Alright. Well, thank you very much for taking the time today to join me and share your experience leading the data platform team at DoorDash. It's definitely a very interesting business domain and an interesting set of data problems that you're working through. So I appreciate all of the time and energy you've taken to document them for other people to learn from, and for joining me today to share your experiences. So I appreciate all of that, and I hope you have a good rest of your day. Absolutely. I had a blast talking to you, Tobias, and I look forward to listening to more podcasts from you so I can enrich my knowledge in a very selfish way. But I'm also happy to give back to the community, and this was my chance, and I hope it's useful. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Sudhir Tonse's Background and Career Journey
Data Platform Responsibilities at DoorDash
Impact of the Pandemic on Data Management
Data Collection and Transportation
Privacy and Security in Data Management
Build vs. Buy Decisions
Paved Path Concept and Tooling Selection
Data Quality and Consistency
Managing Heterogeneous Infrastructure
Evolution of Technologies at DoorDash
Point Solutions vs. Generalized Solutions
Managing Infrastructure Costs
Setting OKRs for Data Platforms
Lessons Learned in Data Management
Upcoming Projects and Industry Trends
Closing Remarks and Final Thoughts