Summary
Designing a data platform is a complex and iterative undertaking that requires balancing many conflicting needs. Designing a platform that relies on a data lake as its central architectural component adds further layers of difficulty. Srivatsan Sridharan has had the opportunity to design, build, and run data lake platforms for both Yelp and Robinhood, with many valuable lessons learned from each experience. In this episode he shares his insights and advice on how to approach such an undertaking in your own organization.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
- Your host is Tobias Macey and today I’m interviewing Srivatsan Sridharan about the technological, staffing, and design considerations for building a data platform
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what your experience has been with designing and implementing data platforms?
- What are the elements that you have found to be common requirements across organizations and data characteristics?
- What are the architectural elements that require the most detailed consideration based on organizational needs and data requirements?
- How has the ecosystem for building maintainable and usable data lakes matured over the past few years?
- What are the elements that are still cumbersome or intractable?
- The streaming ecosystem has also gone through substantial changes over the past few years. What is your synopsis of the meaningful differences between today's options and where we were ~6 years ago?
- How did your experiences at Yelp inform your current architectural approach at Robinhood?
- Can you describe your current platform architecture?
- What are the primary capabilities that you are optimizing for?
- What is your evaluation process for determining what components to use in your platform?
- How do you approach the build vs. buy problem and quantify the tradeoffs?
- What are the most interesting, innovative, or unexpected ways that you have seen your data systems used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on designing and implementing data platforms across your career?
- When is a data lake architecture the wrong choice?
- What do you have planned for the future of the data platform at Robinhood?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Robinhood
- Yelp
- Kafka
- Spark
- Flink
- Pulsar
- Parquet
- Change Data Capture
- Delta Lake
- Hudi
- Redshift
- BigQuery
- Informatica
- Data Mesh
- PrestoDB
- Trino
- Airbyte
- Meltano
- Fivetran
- Stitch
- Pinot
- Clickhouse
- Druid
- Iceberg
- Looker
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy to consume SaaS product, which has been adopted by several companies. Sign up for the SaaS today at dataengineeringpodcast.com/acryl. That's A-C-R-Y-L. Your host is Tobias Macey. And today, I'm interviewing Srivatsan Sridharan about the technological, staffing, and design considerations for building a data platform. So, Sri, can you start by introducing yourself?
[00:01:38] Unknown:
Hey, Tobias. Thank you for having me on the show. I'm super excited to be here. As I said, my name is Sri. I'm currently leading the data infrastructure team at Robinhood. But before I was a manager, I was an engineer, kind of like a first data engineer back when data engineering was still kind of an evolving profession. And prior to that, I was in grad school. So I've been in the industry for, you know, a little over 10 years, and in the data space for 8 to 9 years, the majority of my time. I really enjoy the space because I think it's at a good intersection of technology, product, data, and people,
[00:02:16] Unknown:
all of which I think is always evolving and always exciting. And do you remember how you first got started working in the data ecosystem?
[00:02:23] Unknown:
By chance, actually. I remember, you know, one of my very first projects when I came in as a fledgling new grad at Yelp, you know, many, many years ago, was to build out our very first version of our data warehouse. And at the time, I don't think, you know, data infrastructure or data engineering was a very evolved profession. I was right there with my team members trying to build out ETL pipelines, standing up a data warehousing solution, working on facts and dimensions and all of that good stuff. So I really got my start in the data profession there. And, you know, over the years, as I grew in my career, you know, a management position opened up and my director tapped me on the shoulder and said, hey, do you wanna take over management of this team? Back then, I had no clue what engineering management or leadership would look like. I took the leap of faith, made a couple of, not a couple, a lot of mistakes along the way, and learned throughout the process. And, you know, that's what brought me to where I am today.
[00:03:20] Unknown:
And so, as you mentioned, you are the head of the data platform team at Robinhood, and you had been doing that at Yelp for a number of years before. I'm wondering if you can just start by giving a bit of an overview of your experience working in that space of designing and implementing data platforms.
[00:03:37] Unknown:
I think while the space has been similar across both companies, the challenges have definitely been very different and very unique. In terms of the space, my experience has been around batch processing systems, real-time streaming systems, stream processing systems, data lakes, data warehousing, and querying engines on top of data lakes. So essentially the entire ecosystem around the data platform, the ecosystem that basically powers data engineers, data scientists, back end engineers, and machine learning engineers to make sense of data and build data-driven applications. The fundamental space across both the companies has been the same. But as we both know, the technology has really, really evolved in the last, like, 6 to 7 years. Right? Like, really skyrocketing. And it'll be interesting to see how the next 5 years pan out in the data space. In terms of the actual design process of thinking through the requirements and the different components of a data platform, it's a very
[00:04:30] Unknown:
complex and multifaceted undertaking. I'm actually going through that process myself right now for my day job, and I'm wondering if you could just talk through some of the core elements that you have found to be the same across different contexts and organizations, and some of the ways that the architectural requirements and the foundational components needed to support them differ across organizations, data volumes, contexts, and use cases?
[00:05:00] Unknown:
Yeah. Great question. And I think it all kinda starts with where the business is at and where the business needs to be. I see the data ecosystem as an evolution. Right? Like, in a truly evolved state, as you mentioned, it's fairly complex, lots of moving pieces. But if somebody is thinking about setting up a data platform from the ground up, I think the first question is to figure out what the organization needs and where the organization is in terms of its maturity. So for instance, you know, whether it was when we started off the data platform team at Yelp or when I joined the data platform team at Robinhood, what I've seen come up first, and quickly, is the primary business requirement: hey, we wanna, you know, slice and dice this data and build metrics and build analytics on top of it so that we can make data-driven decisions as a company, as a business, as a product.
And so with that kind of business use case in mind, the very first primitive or pattern that comes up, in my opinion, is the concept of a data warehouse or data lake: instead of, you know, querying against OLTP production databases, which is obviously a bad pattern and can bring down your production data store or cause performance regressions, the idea is to take all of the data and put it in an offline store so that people can make sense of that data for metrics and decision making. I've seen that to be the very first primitive or need that comes up from a business standpoint. And to build that primitive, further sub-primitives come up around, okay, if you wanna get this data into an offline store, we need a way to transport this data, we need a storage mechanism, and then we need a way to access that storage mechanism.
So the storage mechanism can be a vendor solution, you know, think of your Snowflakes of the world or Redshifts of the world. It could be S3 as blob storage. Whatever it is, you know, the offline storage becomes a sub-primitive. And then, you know, moving data across becomes kind of the next challenge, and depending on the scale of the data, there can be different techniques to be used there. And then once the data lands in your offline data store, you need some kind of querying capability to access the data. So I think that's the fundamental. In very simplistic terms, it is quintessentially an ETL pipeline. Right? Taking data from a production database, putting that in an offline store, and giving people access to that offline data. I think that's a pretty common pattern. And I think as organizations evolve and mature, both from a scale and complexity standpoint, new things emerge. So for example, as the scale increases, you know, your primitive ETL pipeline may not work and you might have to consider distributed systems like Kafka, some stream processing engines to get real-time data.
Your data warehouse might not scale, so you need to think of a more scalable storage solution. So scaling introduces a lot of complexities. The other domain that can introduce a lot of complexities is product. So if your business starts evolving and you have multiple product lines and you want to do experimentation, you want to do machine learning, then more capabilities are needed and they kind of emerge. So the platform evolves over a period of time.
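To make that first primitive concrete, here is a minimal sketch of the pattern Sri describes: extract from a production replica, land the data in offline blob storage, and point queries at the offline copy. It's an illustration only; the hostnames, table, and bucket are hypothetical, and the Parquet write assumes pyarrow and s3fs are installed.

```python
# Hypothetical end-to-end ETL primitive: extract, load offline, query offline.
import pandas as pd
from sqlalchemy import create_engine

# 1. Extract: always read from a replica, never the production primary.
engine = create_engine("postgresql://reader@db-replica.internal/app")
orders = pd.read_sql_table("orders", engine)

# 2. Load: land the data in the offline store (S3 here) as Parquet.
#    The bucket and date partitioning are made-up placeholders.
orders.to_parquet("s3://example-data-lake/raw/orders/dt=2022-06-01/orders.parquet")

# 3. Access: analysts now query the offline copy (Presto, Athena, DuckDB, etc.),
#    so exploratory workloads can't degrade the OLTP database.
```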
[00:08:01] Unknown:
There are a couple of different kind of fundamental paradigms and approaches to data platforms that don't necessarily relate to each other specifically. You know, one is the difference of data warehouse versus data lake and the, you know, newfound marketing term of data lakehouse that's trying to create a hybrid of them. And the other foundational paradigm is batch versus streaming, at which different layers of the stack those exist, and sort of what the predominant pattern is, because they can coexist just as a data lake and a data warehouse can coexist. But there are sort of four different overarching axes to think about as you're designing the different components, and which direction you go with one of them can influence other downstream and upstream decisions that you make about different tooling and patterns. And I'm wondering what you see as the motivating factors for one or many of those different kind of foundational core structures for the platform.
[00:09:00] Unknown:
I think there are two factors that influence it, in my opinion. The first factor is what does the business need in order for it to be successful? And the second factor is, you know, what is your engineering talent or kind of organizational setup? And those two factors really influence, you know, which of these core fundamentals makes sense to dive into. So I'll give you an example. Let's say it's a new startup, there are not a lot of engineers, and you just need to get a data stack from zero to one. It might make sense to just focus on key outcomes. Like, okay, do you need a way to query data? It might just be better to have a data warehouse because it's a managed solution and you don't have to store a lot of historical data. You can just get it up and running and make it performant from a querying standpoint.
However, let's say you're a larger organization and you have terabytes and terabytes of data. A data lake architecture becomes more helpful because, you know, you're decoupling storage and compute, and it makes it cheaper from a cost perspective. But then if you're an even larger organization with lots of different product lines and your data is really messy and complicated, then you can't just have, you know, all the data be in one place in a dump. That data needs to be curated in order for it to make sense. And that's where kind of the whole lakehouse stuff comes in. So I think it really depends on where the business is and where the business needs to go. The second factor that I was talking about, engineering talent, a classic example there is the batching versus streaming debate. Right? I think stream processing is still a fairly complex primitive. You know, you have to think about things like windows and joins and streaming aggregations, and it's not a natural concept that, you know, is taught in schools, for example. So let's say you have a team full of, you know, new college grads who have never worked with streaming paradigms. Batch is a much simpler concept to start with. So if your business can afford, you know, batch-like SLAs and you just don't have the engineering operational maturity yet, it might just make sense to invest in batch now and take on streaming further down the road. So I think those are the different aspects that kind of influence which direction to go. I objectively feel like in an evolved ecosystem, you probably need all of them in some capacity, but the portfolio kind of varies based on the situation.
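To illustrate why stream processing carries the steeper learning curve described here, below is a hedged PySpark sketch of a windowed aggregation: the batch equivalent is a one-line groupBy, but the streaming version forces you to reason about event time, watermarks, and late-arriving data. The topic name and schema are illustrative, and the Kafka source assumes the spark-sql-kafka package is on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("windowed-counts").getOrCreate()

event_schema = (StructType()
                .add("event_type", StringType())
                .add("event_time", TimestampType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "app-events")          # hypothetical topic
          .load()
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# The batch version of this is a plain groupBy/count. Streaming adds two
# concepts with no batch analogue: a watermark (how long to wait for late
# events) and an event-time window.
counts = (events
          .withWatermark("event_time", "15 minutes")
          .groupBy(window(col("event_time"), "10 minutes"), "event_type")
          .count())

counts.writeStream.outputMode("append").format("console").start()
```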
[00:11:17] Unknown:
The other interesting aspect of streaming is that there are a couple of different points and styles of streaming that you might be talking about. One is streaming analytics using something like Spark or Flink, where you're actually doing those aggregations on the data as it's ingested into the system. Or there's the streaming data integration pattern, where you might just be pushing everything into a Kafka or a Pulsar queue and then pulling it out the other side to land into your data lake or your data warehouse, which maybe doesn't require as much sophistication from a sort of analytical and statistical perspective, but still requires some operational maturity: to ensure that that system is up and running, that you're able to process the data at, you know, close to the same rate that it's going in, that you're able to absorb any spikes in volume, and that you're able to understand, as it's happening, whether or not you have a, you know, catastrophic drop in events that signals an upstream problem. That may be a little easier to catch in batch, because as you're pulling the data out, you can do some statistical analysis in terms of, like, the volume of row counts, etcetera, to understand whether or not there was a problem. But you're gonna catch it at a later point than you would if you were doing it in real time. I'm just wondering if you can talk to some of those different axes of streaming as well and how you think about that in designing the platforms, previously at Yelp, but also what you're doing now at Robinhood. So to borrow some examples from what we did at Yelp, we actually started off with a simplistic kind of batch solution where we built out our, you know, first generation of our data warehouse.
[00:12:49] Unknown:
We had these batch jobs that would do full data scans: read the production table once every day from a production DB replica, scan the entire table, and, you know, push it into a queue and then load it into the data warehouse, day after day. Obviously, over time, that doesn't scale, and we eventually moved to streaming solutions. I think what I've seen is once your data becomes large enough, then, you know, the batch oriented way of pushing data into a data warehouse doesn't really scale. And we saw that again at Robinhood too, where you have to have some kind of a streaming solution to funnel data. And so the streaming solution that we built at Yelp, and something similar is what we're doing here at Robinhood, is using a CDC, change data capture, based mechanism to capture real-time updates that are happening on production tables, schematizing them, publishing those events into Kafka topics, and then, downstream, having an aggregator like a Spark app or a Flink app that takes all of this data and lands it in the data lake in a specific format. Think of Parquet, Delta, Hudi, any of those formats.
And I think the scaling constraints really determine the choice of technology there from the perspective of building out data lakes. And I don't know, I'm a little bit opinionated here. I think traditional table scans and batch oriented ways of bringing data into a modern data warehouse or data lake probably don't scale, and I think streaming is the right way to go. However, for production applications which are trying to access that data, I think both of those are equally applicable. You know, imagine a use case, let's say an ML use case, where you wanna read over last month's data. You really don't need a streaming solution. You can just batch that up. So I think both batching and streaming have their place, but in building out your data lake, I think streaming is probably the way to go when the dataset size becomes large enough.
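A hedged sketch of that ingestion path: Debezium-style change events arriving on a Kafka topic, and a Spark streaming job upserting them into a Hudi table. It assumes the Hudi Spark bundle is available; the topic, schema, and paths are placeholders, and a production job would decode via the schema registry rather than raw JSON.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, LongType

spark = SparkSession.builder.appName("cdc-to-hudi").getOrCreate()

# Simplified CDC record shape; real Debezium envelopes carry more metadata.
cdc_schema = (StructType()
              .add("id", StringType())
              .add("balance", StringType())
              .add("updated_at", LongType()))

cdc = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "postgres.public.accounts")   # hypothetical topic
       .load()
       .select(from_json(col("value").cast("string"), cdc_schema).alias("r"))
       .select("r.*"))

def upsert_batch(df, batch_id):
    # Hudi deduplicates on the record key and keeps the row with the largest
    # precombine value, which is exactly what a CDC upsert needs.
    (df.write.format("hudi")
       .option("hoodie.table.name", "accounts")
       .option("hoodie.datasource.write.recordkey.field", "id")
       .option("hoodie.datasource.write.precombine.field", "updated_at")
       .option("hoodie.datasource.write.operation", "upsert")
       .mode("append")
       .save("s3://example-lake/bronze/accounts"))

cdc.writeStream.foreachBatch(upsert_batch).start()
```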
[00:14:33] Unknown:
And so the other aspect that I'm interested in digging into is the relative level of sophistication and user experience for data lake environments as contrasted with data warehouses that are more vertically integrated. And I'm curious what you have seen as the evolution of the space from when you first started building out the data lake at Yelp to where you are now with Robinhood, and just the available tooling and the level of support across the different layers that are required to actually build out a fully functional data lake ecosystem?
[00:15:07] Unknown:
Yeah. That's a great question. Right? The space has been super exciting, you know, more recently with the likes of Snowflake and Databricks becoming really big and popular. Because I remember, you know, 8, 9 years ago, the choice was very simple. If you were on AWS, you would use Redshift. If you were on GCP, you'd use, you know, BigQuery, and that's it. And if you had your own kind of bare metal data center, then you'd probably implement something on your own. And even before that, you had your Informaticas of the world and the like. However, in the last, like, 6 to 7 years, I think what I've seen is new players started to emerge. So it was not just Redshift and BigQuery. Parquet really emerged as an open data format that people really got behind, and then storage formats started evolving after that. So I think that's when the Icebergs, the Deltas, the Hudis of the world came into the picture.
Over a period of time, you know, these open data formats existed, and then there was a new wave of fully managed solutions like Databricks and Snowflake coming into the picture. So I think the space has been really evolving. Previously, you needed a very talented, well staffed engineering team to build out a scalable data warehousing or data lake solution. I don't think you need that anymore, because with all of these, you know, new solutions coming up, it's really democratized the data platform space, so to speak. That doesn't mean I think engineers are out of a job. I think what it has enabled is it has opened up opportunities to companies that, you know, weren't able to attract the right engineers or weren't able to have the right engineering teams to really have the sophisticated modern data infrastructure that only the likes of, you know, the Googles and the Facebooks of the world previously were able to. So I think that's the transformational change that I see with these, you know, companies emerging into the market and really enabling organizations that don't have the sophisticated engineering talent to have sophisticated data platform solutions. In terms of technology choices itself, I think we're in a really good place in terms of scalable data analytics.
Plenty of solutions, both open source and vendor solutions, provide the ability to query large datasets very, very quickly. I think the next set of challenges that we are seeing is more on the data governance and quality side, which is, I think, still a somewhat unsolved problem, in my opinion.
[00:17:19] Unknown:
In terms of your experience at Robinhood and building out the data platform, what are some of the lessons that you learned in the process of building out the Yelp data infrastructure that informed the ways that you think about architecting and designing and using the data platform at Robinhood, and maybe some of the ways that the two are disjoint because of the different organizational requirements?
[00:17:46] Unknown:
I think I'll talk about some of the mistakes we made and the learnings from there, which I'm trying to bring into my new role here. They kinda sound obvious in retrospect, but interesting challenges nonetheless. I think the first one is: when you look at it from a different lens, data is essentially a product. It's a product that we're shipping to our customers, where our customers here are, you know, data scientists or data engineers or back end engineers. Right? When we built out our very first, you know, version of the data lake back at Yelp, we realized that, you know, if there are bugs in the system that lead to, you know, bad data, corrupt data, or erroneous data, the cost of fixing that is incredibly large, because, you know, once data lands in a democratized data store like a data lake, people use that data to build derived datasets, and then those datasets are further used to build more derived datasets. So, ultimately, if there's a problem with the upstream, it has a huge cascading effect downstream.
So one of the things, you know, I personally learned out of the process is that when it comes to building data systems, it's definitely better to, you know, measure twice, cut once rather than the other way around, because the cost of getting data wrong is incredibly high because of the downstream cascading effects. And an example of that was from a data quality perspective. I know you were talking about things like row counts and things like that, right, as a way to measure that. I think baking data quality in as a primitive of your data lake ecosystem is super duper important. Otherwise, you run the risk of a big fallout later.
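As an illustration of baking data quality in as a primitive, here is a minimal, hypothetical gate that runs invariant checks (row counts, required columns) before a partition is published downstream. In practice teams often reach for a framework like Great Expectations or dbt tests, but the shape is the same; the table path and thresholds are placeholders.

```python
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("dq-gate").getOrCreate()

def quality_gate(df: DataFrame, min_rows: int, not_null: list) -> None:
    """Fail the pipeline loudly instead of publishing suspect data."""
    n = df.count()
    if n < min_rows:
        raise ValueError(f"row count {n} is below the floor of {min_rows}")
    for column in not_null:
        bad = df.filter(df[column].isNull()).count()
        if bad:
            raise ValueError(f"{bad} null values in required column {column}")

# Hypothetical table; reading Hudi requires the Hudi Spark bundle.
orders = spark.read.format("hudi").load("s3://example-lake/bronze/orders")
quality_gate(orders, min_rows=1_000, not_null=["order_id", "created_at"])
# Only after the gate passes does the job promote or publish the partition,
# so bad data never fans out into derived datasets.
```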
The other learning I would say is that a technology might be interesting, powerful, or cool, but that doesn't necessarily mean it's gonna add a lot of value. So an example here is streaming, or stream processing to be specific. I think it was 5 or 6 years ago that at Yelp we started to introduce Flink. We really saw the power of Flink to do large scale stream processing. Really powerful engine, solved a lot of critical use cases. We brought that infrastructure in, got it up and running in our ecosystem. But then we realized that people didn't wanna use it, and people didn't wanna use it because, at the time, the only way to create Flink jobs was with Scala. And there was a really high barrier to entry for engineers to adopt that. So I think this was a good lesson that data technologies are often, you know, really power tools. But if they don't come with a good instruction manual or a clean interface to use them, people are not gonna use them no matter how amazing or powerful they are.
So trying to take those learnings and really focusing in on things like user experience, which are critical for any kind of data platform success.
[00:20:22] Unknown:
To that point of user experience, one of the common trends that I've seen is that for data lakes, that overall experience is maybe suboptimal, although that has been changing with the evolution in the space. And I'm wondering how you think about that design question: do I go with a data warehouse to optimize for end user experience and make it easier for data analysts or maybe nontechnical users to be able to interact with the data, maybe in the same manner as they would with a Snowflake or a BigQuery, versus the scalability and flexibility benefits that you get from a data lake, and then the additional efforts that are required to be able to add in some of those user experience elements and things like the security controls and governance that are maybe baked into the more vertically integrated warehouse platforms?
[00:21:15] Unknown:
It's a really challenging question, and, you know, I don't think I have great answers here. But the way I think through those, my general rule of thumb when it comes to these kinds of build versus buy evaluations, is to bias towards build if it is a core competency or something that the organization really depends on, and bias towards buy if, you know, it is something that can be more plug and play. So to give you an example, I think this is kind of the thought process we were using both at Yelp and at Robinhood. Some of the core infrastructural primitives, like, you know, bringing this data into the data lake, serving the data lake as our data store, are kind of bread and butter for the business. Now, we could offload that to a, you know, vendor solution or fully managed solution, but there are constraints, like, you know, data locality, where we might wanna keep this data within our own ecosystem, or we don't want to get into a vendor lock-in situation, or, you know, the vendor is too costly for us at the scale at which we operate. And so those are kind of some of the factors that come in, which tend to lead us towards a more build solution.
And then the buy aspects start to make a lot of sense if there is a, you know, tooling layer that is costly to implement, let's say because we don't have enough front end engineers on the team or user experience engineers on the team, and the cost of getting the tooling layer wrong is low, and the vendor lock-in is low. You know, in those cases, it might make a lot of sense to take a vendor solution and integrate it on top of your built stack. So to summarize, I know I kinda rambled here a little bit, but I think the factors that I consider there are, you know, what is the core competency for the business, what the business truly relies on. It might be worth building that in house. You know, it might still be worth considering what other vendor options are there, but, like, you know, to prevent vendor lock-in, to prevent the cost from ballooning, it might make sense to keep those in house. And then for things that are, you know, a layer on top, which are a little bit fungible and replaceable, it might make sense to consider a vendor solution. So that's been my personal philosophy, but obviously this changes from company to company.
[00:23:29] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. In terms of the actual architecture that you're running right now at Robinhood, can you give a bit of an overview about where you are today, maybe some of the recent path that has gotten you to that point, and some of the future projects that you are considering to augment or enhance or scale out the capacity of what you're building?
[00:24:25] Unknown:
So I'll talk about the way we've set it up, and a little bit about how I see the platform evolving. The way we've set it up is, you know, we decided to go with our own custom data lake solution built on top of the open source stack. Most of Robinhood's infrastructure is built on top of the open source stack. We moved from a batch based solution to a CDC based solution to populate our data lake. And we also decided to go with Hudi as our storage format and ingestion layer to build out the data lake. And in terms of the ecosystem of data processing, we offer both Spark for batch processing and Flink for stream processing.
And Presto is the query engine of choice. That's the query engine that sits on top of our data lake. So we're pretty heavily invested in the open source stack, and it has served us really well over the last, you know, couple of years as this team has matured and evolved. So now, looking ahead, there are more challenges obviously on our path, and that's gonna influence how we design things. One of the biggest challenges right now is around data governance and data quality. Like, making sure that data in the data lake is organized and structured and can be guaranteed from a security, compliance, and quality standpoint. And there's a bunch of projects underway there, which might really influence, you know, the direction that we pick. Another big area is, you know, these data platforms are fundamentally being used today for, you know, your classic analytics stack, experimentation, machine learning, and so on. Can we take the same primitives and apply them to the core product?
I also believe that every company is a data company at the end of the day. Maybe not every company, but most companies are data companies. They're dealing with large amounts of data these days. And I don't think it makes sense to completely decouple your data platforms for analytics from your data platforms for your product. Yes, you'd want to build isolation, and your SLAs might vary, but you can certainly leverage the same batch processing and stream processing platforms for powering product use cases. So that's another direction we are heading in. And that direction basically means that we have to up the game in terms of our SLAs and operational maturity. And that involves thinking through things like, does it make sense to run all of our platforms on top of Kubernetes?
Does it make sense to have a portfolio of, you know, build versus buy solutions? But at the end of the day, I think it's a game of, like, how do we continue to support sustained increases in scale with tighter and tighter SLAs and more nines of availability, at cheaper cost. So it's kind of a multi pronged, multivariate optimization problem.
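For a sense of what the read path looks like on a stack like this, here is a hedged example using the presto-python-client; the coordinator host, catalog, schema, and table are hypothetical, and it assumes the Hudi tables are registered in a Hive metastore.

```python
import prestodb  # pip install presto-python-client

conn = prestodb.dbapi.connect(
    host="presto-coordinator.internal",  # hypothetical coordinator
    port=8080,
    user="analyst",
    catalog="hive",    # lake tables registered in the Hive metastore
    schema="bronze",
)
cur = conn.cursor()
cur.execute("""
    SELECT symbol, count(*) AS num_orders
    FROM orders
    WHERE dt = '2022-06-01'
    GROUP BY symbol
    ORDER BY num_orders DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```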
[00:26:56] Unknown:
The machine learning question is another thing that always adds complexity to the data platform requirements because of the unpredictability about what data assets you're actually going to need to be able to incorporate into the models, and the rate of evolution of that, depending on, you know, whether you start experiencing schema drift or if you want to start expanding the use cases for a single machine learning model or multiple machine learning models. And I'm curious what you see as the trade-offs of data lakes versus data warehouses as the core element of the data platform when machine learning models are one of your ultimate consumers of all of that information?
[00:27:36] Unknown:
I've seen this happen twice now, and I've seen this happen at a lot of other companies too, where you solve the data availability problem, but then you introduce other problems. You make the data available, and, you know, things like schema drift happen, or ownership changes, or the meaning of the data changes, and that leads to a lot of negative consequences down the road. So, very real problem. Right? I think the way we are thinking about it, which is, I think, where the industry is headed, is thinking through, you know, concepts of the data mesh. Right? Which is essentially getting data to be owned by domain owners and making sure that there is a strong ownership model established for datasets.
And so there are different ways to do it. Like, one way is, of course, you know, taking the raw data and then cleaning up the raw data downstream, kind of doing an ELT, so to speak, which is: get all the raw data and then build gold datasets which have a very strong ownership model with the underlying product teams that actually understand the data better. The other way to do this is to actually create curated datasets at the source and then push them into the data lake. That way the data lake has kind of a two-pronged strategy, where one side of it is just raw data that exists for anybody to explore and visualize, and then there's this other aspect where, let's say, product teams or business teams curate specific datasets and essentially use the same stack. So think of this as, you know, creating a dataset with a schema definition using, let's say, a schema registry, encoding the data in, let's say, protobuf, pushing that data into Kafka, and then that entire kind of, quote unquote, data frame gets into the data lake as a single Iceberg table or a Hudi table or a Delta table. So we're still trying to think about how we want to implement this domain ownership or data ownership concept. Do we wanna do it, you know, post raw data landing in the data lake, or do we wanna do it at the source? I think there are both pros and cons there, and that's something that we're evaluating.
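A hedged sketch of that curated-at-the-source path: the owning team registers a schema and publishes records to Kafka, and the schema ID travels with each payload so downstream lake ingestion can decode and evolve the dataset safely. Sri mentions protobuf; Avro is shown here only because the wiring is shorter, and every name below is illustrative.

```python
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

# The owning team defines the curated dataset's schema up front.
schema_str = """
{
  "type": "record",
  "name": "CuratedOrder",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "symbol",   "type": "string"},
    {"name": "notional", "type": "double"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://schema-registry.internal:8081"})
producer = SerializingProducer({
    "bootstrap.servers": "kafka:9092",
    "value.serializer": AvroSerializer(registry, schema_str),
})

# The schema ID travels with each payload, so the lake ingestion job can
# decode the record and land it as a governed Iceberg/Hudi/Delta table.
producer.produce(topic="curated.orders",
                 value={"order_id": "o-123", "symbol": "ABC", "notional": 101.5})
producer.flush()
```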
[00:29:26] Unknown:
In terms of the platform that you're running at Robinhood, what are the core capabilities that you're optimizing for, and the primary metrics that you're tracking to determine whether or not you're successful in your overarching objective: providing useful and meaningful data to the organization to further the organizational goals and the analytical capabilities that people are looking for?
[00:29:52] Unknown:
The way I think about this is four key pillars. The first pillar is reliability. Our data capabilities are only as good as our reliability. At the end of the day, we are, you know, a foundational component, and if the infrastructure is weak or unreliable, everything that's built on top of it breaks. And so reliability is kind of the number one thing. And there are different ways to measure reliability as a metric. For an ingestion system, it could be latency or availability of data, so to speak. For a standalone batch processing system, it could be how many nines of availability we can provide in terms of the system being up. So there are many different metrics of reliability, and we have different ways to calculate that for different platforms, but essentially we keep reliability as a primary metric. The second really important metric, and by the way, these are not in order, I'm just kinda calling them out, the second metric there is security.
And I think security and safety matter a lot more in a financial institution. Not saying that it's not important at other companies, but the stakes are just significantly higher when working for a financial institution, because we are dealing with people's hard earned money. And so making sure that the data is secure, making sure that only the right people who need to access data get access to it, is just another key KPI that we drive towards. The third one, I would say, is user experience. Again, the same argument holds, I think, which is the interface layer of your platforms really determines how useful or not useful your platforms are, especially as a company or an organization scales. Not everybody will be a power user. Not everybody should be expected to know, you know, how to write a Spark job or the internals of Flink or the internals of Presto.
The interface has to be really, really easy to use. And, you know, we have measures like customer satisfaction scores or NPS scores, where we send out internal surveys to get a sense from our customers. And then the fourth thing I would say is cost efficiency. We're also a public company; we have a responsibility to our shareholders. Right? And it's really important for us to make sure that the infrastructure we run is as efficient as possible so that the company can drive more value. That's an easy metric to measure, it's a dollar amount, and we really focus on making it efficient and driving that down. So I think those are the four key pillars. And obviously, there's a broader fifth one, which is capabilities. Right? Which is: do our technologies solve the problems that our customers are asking for?
So for example, if the customer says, hey, I wanna be able to look at 6 months of historical data and I wanna be able to do that in a very quick fashion, can we support that capability? Or if a customer says, I wanna write this series of batch processing jobs orchestrated over a schedule, and I want to have, you know, less than 30 minutes of downtime, can we provide that capability? So that's kinda how we think about it: these four primitive pillars, so to speak, and then overlay that with the customer asks and user stories, and then try to see if we can fulfill those capabilities while making sure these core primitives of reliability, security, user experience, and cost efficiency continue to go in the right direction.
[00:32:55] Unknown:
To that question of cost, I'm curious how you factor that into the evaluation of different infrastructure components and that build versus buy decision, both in terms of the cost of a potential vendor contract or the cost of running the infrastructure and the engineering hours required to be able to support it. And then also, on the other aspect of cost, how you think about the value of storing a certain set of data and whether it is worth the cost of actually keeping that accessible from a storage and a query perspective.
[00:33:28] Unknown:
I think in my experience of running a lot of these build versus buy evaluations over the years, I've come to the conclusion that you can make a data informed decision, but it's really hard to make a fully data driven decision. Maybe people have done it perfectly, but I haven't been able to, to be quite honest. I mean, we can look at these different factors, which we should. So the different factors, as you said: the vendor cost, like how much does it actually cost to sign this contract with the vendor. But it's complicated, because during the negotiation phase you can get various discounts, and it's really hard to factor that in initially when you're evaluating. But, of course, you can start with that number. The other aspect that you correctly pointed out is the engineering hours to support it. Now, engineering hours is a very interesting metric, because you can kind of objectively evaluate how much time it would take to keep a platform up and running.
But, you know, engineers being humans like you and I, our productivity is not something that can be, you know, measured very objectively. Like, if engineers are really passionate about a piece of technology, things will run faster. If engineers don't like a particular piece of technology, things might run slower, or you might find it hard to hire somebody who wants to work on that project or that initiative. So while we can start with that estimate, if you have a really motivated and skilled group of engineers, that estimate might get cut in half, or, you know, on the flip side, that estimate can become 2x or 3x. So you can start with that number, but there's a lot of variability there. And then the third aspect would be, what are the consequences of the decisions that you take today over, let's say, the next few years? And I think those consequences are really important to zero in on, because, let's say, for example, you sign a 3 year contract with a vendor.
And let's say it's a multimillion dollar contract. You are locked in, and you have to be comfortable with that decision for the next, say, 3 years, if that's the contract agreement. Similarly, you know, if you go down the build path and you say, okay, looks like I need, like, 4 engineers to build out the system and I need a team of 4 engineers to maintain it over the next few years, know that the consequence of that is, if the business goes through a tough time and, let's say, you know, you don't have enough bandwidth to hire people, you're kind of screwed. So it's really complex and really challenging. So the way I've approached it is, like, try to get as much data as possible by understanding the integration cost, the vendor cost, and the consequences of the lock-in, and on the build side, the cost of building it. Evaluate it with a prototype, see what the operational cost would look like, put all of that data together in a table, and then try to see, you know, which way we wanna lean. So I'll give you an example of a buy and a build that we did at Yelp. We did this whole analysis for data warehousing, and we realized that the cost of operating Redshift, this was 4 or 5 years ago, was very, very high, and the business didn't wanna pay for it. And we came to the conclusion that building a data lake with Parquet data might just be better from a cost perspective.
And we had a bunch of engineers who were really excited about it, so we leaned in on it. And the decision was totally fine, I think, for the next 2 to 3 years. I don't know where the company ended up there, but, like, the 2 to 3 years after that were totally fine. Another place where we swung the other way was for data discovery and data governance. We were evaluating build solutions with open source frameworks, but then we decided to go with a buy solution, I believe Collibra was the vendor there, and it worked out okay. And it saved us a lot of engineering hours in trying to build complicated UI and workflows, which we got out of the box. And then there were decisions that I think did not go well and kind of backfired. So the way I think about it is, like, take all the data, figure out the organizational context, in which engineering interest and the ability to hire and vendor lock-in are really important factors to consider.
You take the decision, and then you learn from that, and you keep iterating and keep improving. So that's kind of been my mantra.
[00:37:18] Unknown:
One of the other interesting challenges of building data platforms, particularly if you're working in the open source space, is being able to evaluate the current and projected future capabilities of a set of technologies, particularly where you have two or more that are, at surface value, quite similar. A couple of examples might be, in particular, Presto versus Trino, where they started off as exactly the same project, but they have since forked and diverged, and, you know, the core set of capabilities is largely the same, but they have different priorities and different trajectories. Or, in the case of the extract and load systems, where you have Fivetran and Stitch on the commercial side and Airbyte and Meltano on the open source side. And, you know, at face value, they all do the same thing.
The core set of integrations has a huge amount of overlap, but there are potentially widely divergent future projected capabilities. And then also, maybe in the Kafka versus Pulsar space, they're not as closely matched in terms of architecture and feature set, but at the base, initial use case, they provide the same utility. Wondering how you think about evaluating those types of very close comparisons with potentially vastly different future outlooks.
[00:38:42] Unknown:
Again, I don't think I have good answers to this. It's a really good question. The way I think about this is, the goal is to try to predict where that platform will head, which, obviously, if people were able to predict everything, then everybody would be a millionaire. But you kind of have to take a bet in those scenarios. When things are very, very similar, maybe there are one or two features that one system has that the other doesn't have, and vice versa, but they're mostly apples to apples, then it really boils down to trying to make an informed bet on where something would go. So to give you an example, if we take Kafka, for instance, I feel reasonably confident that, you know, Kafka is gonna stay for many, many, many years to come, because it's become such an ingrained piece of technology, and Confluent, as the company behind it, is very large. But obviously, you could go with open source Kafka or you could go with Confluent, and there are nuances there. But, like, Kafka is a technology I think is here to stay for many, many years. Whereas if you look at something that's a little bit more recent, Pinot versus ClickHouse versus Druid or any of those solutions, to be quite honest, I don't know. I don't know which one will emerge as a winner. Or even, like, you know, Databricks and Snowflake and open data formats.
And even in the open data formats, I forget the names of the companies, but the Iceberg folks started a new company, the Hudi folks started a new company, and it's a really competitive space. And to be quite honest, I don't know where each of them is gonna head. So I think in those situations, where it's not clear where the technology is gonna go and where either of those solutions might work out, I think really leaning in on the current experience that you have on the team is probably the best way to go, because at the end of the day, you know, it's the engineers who are building these systems. And so if you already have engineers who have a lot of experience in a particular area or a particular product, it might just be better to lean into that for the next 2 to 3 years, and then keep evaluating, let's say, every year, to see how the space is evolving and make course corrections.
Because you can't really make a data driven decision at that point. And even if you were to take a decision that was slightly better, let's say, from a technology standpoint, if you don't have the people to implement it, or if you have to go hire people to implement it, and it's gonna take you 3 to 6 months to hire and onboard those people, and you've already lost, you know, half the year, it might just be better to lean in on your existing team and the expertise of your existing team.
[00:41:03] Unknown:
The other aspect of that question, regardless of some of those close comparisons, is hedging against future evolution of the ecosystem by being able to architect the platform to have an option of backing out of a decision. So in the case of the query engine, for instance: as long as all of my data storage is in a format that is compatible broadly across the ecosystem, so if everything is in Parquet, then I can start with Trino today, I can replace it with something else tomorrow, or I can just say, actually, I'm just gonna ingest it all into Snowflake a year from now. And so I'm curious, what are some of the abstractions or interfaces that you hinge on in some of those architectural decisions to be able to say, okay, this is the right decision for today, but I'm not 100% convinced that it's going to stay that way in 2 to 3 years' time, so I'm going to hedge against that eventuality and implement the system so that I can swap out this component without having to reengineer the entire stack?
[00:42:03] Unknown:
You hit the nail on the head right there, Tobias. I think that's really the way to go, in my opinion too, which is, like, build out abstractions that hide away the underlying compute engines or processing engines so that you can swap them out later. So an example I can give is, let's say, batch processing with Spark. Today, Spark works really well, but, you know, 6, 7 years ago, MapReduce was the thing that everybody was using, and then the distributed computing framework world kinda moved to Spark. And maybe 4, 5 years from now, it'll be a different thing. Right? And so I think it's really important to build abstractions that stay constant so that you can swap out the engine underneath. And an abstraction in this case would be, you know, some kind of a job management service. So, you know, provide an interface for customers to write their business logic and chuck their business logic, let's say, over the wall to this compute engine. So you can have an API layer with a service, and jobs are submitted to the service via the API layer. And a simple CRUD microservice takes care of taking the job, dispatching it to the compute engine, taking logs back, funneling those back to the user, and essentially acting as a simplistic control plane. So if we do something like that, and let's say tomorrow we have to swap out Spark with EMR or with any other technology, Flink, anything that you have, at least the migration cost wouldn't be borne by the users, because the API or the interaction that they have with the system still remains the same. In theory, obviously, there'll be practical considerations there, but that's the idea. Same with, like, querying. Right? If we can get the data, as you said, in the same format, then, yes, you know, Presto SQL might be slightly different, Spark SQL might be slightly different than any other SQL engine.
But for the most part, it's still SQL. And so if there is an opportunity for people to work with the same data, even if the query engines get swapped, the cost of migration is gonna be low. So I think identifying those invariants, making sure that we take the right decisions on the invariants, and then investing a lot of time in building abstractions, will help us hedge against, you know, changing technologies. Even with Kafka, right, I said Kafka is the way to go, but it is probably more prudent to build out, you know, a consumer abstraction or a producer abstraction, so that when users are interacting with Kafka, they don't know what a topic is, or they don't even need to care about what a topic is or what a partition is. They take a schema ID with that payload, they chuck it to the system through an API, and under the hood it goes through Kafka. It could be something else tomorrow. So, absolutely true. I think focusing on abstractions is the way to go there.
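A minimal sketch of that job-management abstraction: users program against a stable interface while a thin control plane dispatches to whichever engine is current, so swapping Spark for a successor changes only the backend. All class and field names here are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class JobSpec:
    name: str
    entrypoint: str   # where the user's business logic lives
    resources: dict   # requested cpu/memory, opaque to the user

class ComputeBackend(ABC):
    """The swappable half: Spark today, something else tomorrow."""
    @abstractmethod
    def submit(self, spec: JobSpec) -> str: ...
    @abstractmethod
    def logs(self, job_id: str) -> str: ...

class SparkBackend(ComputeBackend):
    def submit(self, spec: JobSpec) -> str:
        # In practice this would call spark-submit, Livy, or the K8s API.
        return f"spark-{spec.name}"
    def logs(self, job_id: str) -> str:
        return f"(logs for {job_id})"

class JobService:
    """The invariant half: the only API users ever see."""
    def __init__(self, backend: ComputeBackend):
        self._backend = backend
    def run(self, spec: JobSpec) -> str:
        return self._backend.submit(spec)

# Today: JobService(SparkBackend()). Swapping engines later means writing a
# new ComputeBackend; user-facing code does not change.
service = JobService(SparkBackend())
job_id = service.run(JobSpec(name="daily-agg", entrypoint="jobs/agg.py",
                             resources={"cpu": 4, "memory": "8g"}))
```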
[00:44:36] Unknown:
Are you struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end to end data observability platform. Trusted by the teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes.
Monte Carlo also gives you a holistic picture of data health with automatic end to end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more. In terms of your experience of building and managing these data systems, what are some of the most interesting or innovative or unexpected ways that you have seen them used and approaches that you have seen to actually implementing them?
[00:45:41] Unknown:
I think some of the times I've been surprised is when we've rolled out a platform with a specific use case in mind and it gets used for an entirely different use case. An example is Looker, for instance. Looker is something that we use at Robinhood for dashboarding. It's a very common dashboarding and, you know, BI layer for data. We started seeing that people were actually, you know, using Looker to do a lot of intermediate analysis and running custom SQL, and then people really started using Looker as a notebook. We didn't realize people would do that, and it really opened up this gap, or this capability, that we weren't aware of: hey, people want this, and they're trying to use whatever they have at their disposal to make it happen, even if they have to, you know, jump through hoops to make it happen. You know, another example is back at my previous company, at Yelp, when we had initially rolled out Flink, the goal was to use Flink to build out this kind of comprehensive connector ecosystem.
Essentially, take data in Kafka and write it to different data stores like Elasticsearch, back to MySQL, Cassandra. And we wanted to use Flink as the engine that persists this data into those data stores. That was the primary use case that we were looking at. But we saw people were starting to write some heavy duty, you know, stream processing jobs, either for ad hoc analysis or production applications. And we thought, oh, wait a second, it's not just for the connector ecosystem. People wanna use it in different ways as well. So I think that's one of the exciting things about data platforms and data systems: you have some use case in mind and you roll it out, and when you see people, you know, shoehorning different use cases into it, you know that people want something else as well, and you kinda learn from that.
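The connector-ecosystem pattern described here is easy to sketch: one stream in, many pluggable sinks out. The sketch below is plain Python for illustration only; the production system discussed was built on Flink, and the class names are hypothetical.

```python
"""Illustrative connector fan-out: one stream of records, pluggable sinks."""
from abc import ABC, abstractmethod


class Sink(ABC):
    """Each target data store implements the same write contract."""

    @abstractmethod
    def write(self, record: dict) -> None: ...


class ElasticsearchSink(Sink):
    def write(self, record: dict) -> None:
        print(f"index into Elasticsearch: {record}")  # real impl: ES client call


class CassandraSink(Sink):
    def write(self, record: dict) -> None:
        print(f"upsert into Cassandra: {record}")  # real impl: Cassandra driver


def run_connector(records, sinks: list[Sink]) -> None:
    """Persist a stream (e.g. a Kafka topic) into every configured store."""
    for record in records:
        for sink in sinks:
            sink.write(record)


run_connector(
    records=[{"id": 1, "review": "great"}],
    sinks=[ElasticsearchSink(), CassandraSink()],
)
```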
[00:47:24] Unknown:
And in your experience of building these platforms and working with the data producers and consumers and the engineering team that is supporting these platforms, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:47:34] Unknown:
Yeah. Alignment is really hard. I think everybody is really opinionated about some aspects of the data platform. Data serialization format is a big one. You know, some people are super opinionated about Avro, some people are super opinionated about protobuf. And same with technology choices. I think the biggest challenge I have faced as a people leader or an organizational leader here is to get people to align and get strong, opinionated voices on the same page. Especially with something like the data platform ecosystem, it's easy enough that any team could build their own data platform.
But it's also hard enough that it probably doesn't make sense for individual teams to build out their own data platform; it makes more sense to have a common central platform and get economies of scale. And yet, because it's easy enough to, you know, write data processing jobs or to transform data or to ETL data, oftentimes I've seen that business engineering teams or product engineering teams, if they're not aligned, will go ahead and build their own duplicate versions of the same infrastructure. It's a very unique challenge, I think, to data platforms that, you know, doesn't exist, let's say, a layer down the stack. Nobody's gonna build out their own cloud infrastructure; if a company has cloud infrastructure, everybody's just gonna use that. Nobody's gonna build out their own microservices infrastructure; if the company has a microservices infrastructure, people are just gonna use that. But if a company has a data platform, that doesn't mean that every team will start using it, because it's easy enough for people to stand up their own stacks. So I think the biggest challenge that I've faced is organizational alignment and getting people to buy in on using a common platform, especially when people are opinionated about formats and technology choices and so on.
[00:49:14] Unknown:
Definitely always interesting trying to find common ground among opinionated engineers.
[00:49:23] Unknown:
Yes. For sure. For sure.
[00:49:25] Unknown:
And so given your experience of building and managing data lake architectures at Robinhood and at Yelp, what do you see as the cases where that paradigm is the wrong choice and you're better suited with a data warehouse, or maybe just some of the potential pitfalls that you would like to warn people against if they do decide to go down the data lake route?
[00:49:50] Unknown:
Not every time does it make sense to, you know, make all the data available to the people who need it. It sounds like a really noble goal. Like, it sounds really amazing in theory to, you know, get data in a format that's easy to query, that's lightning fast, that's easy to use. And the theory is that it will really unlock a lot of business value and potential. However, the downside is, if the data is not curated, if the semantics or the meaning of the data isn't well understood, then you can end up making really bad decisions with incorrect data. So in practice, I do think organizations need to be very thoughtful about what goals they wanna solve and what the primary objective is. Is the primary objective making the data available?
Or is the primary objective to make the data right? And it might not be an either/or; typically, organizations have to do both. But it's a question of what you wanna solve first. So let's say all of your data is trapped in a production database and your data scientists are running these massive OLAP queries against your production database; you probably wanna solve the availability problem first, in the cheapest and fastest way. But, you know, once you have all of your data available and the organization's product complexity grows, it probably makes sense to bring some controls into the picture to make sure data is used in the right way. So I don't think it's one or the other. You want people to have data available so they can explore and play around with it, but at the same time, you also need to make sure that there's curated data available on top of which decisions are being taken for the business and for the company.
So I think it needs to be a mix of both. The sequence of when and how you do it really depends on where your organization is in its life cycle and maturity.
[00:51:41] Unknown:
As you continue to iterate on the data platform at Robinhood and continue to investigate and explore the ever evolving ecosystem, what are some of the things you have planned for the near to medium term, or any particular areas that you're paying close attention to to see where they end up, or what some of the potential future developments might be in the space?
[00:52:03] Unknown:
There are a few things. The data lake ingestion and storage format is definitely an interesting one. Like I mentioned earlier, we doubled down on Hudi mostly because Hudi, Iceberg, and the open source Delta Lake were similar when we did our evaluation, and we leaned toward Hudi because the team had experience with it. However, the space is evolving fast, and I'm keeping a close eye to see, you know, what emerges as the winner in this battle of the open table formats. Another area where I don't know if there's a clear winner yet is what I was talking about earlier when it comes to aggregations and streaming analytics: there are multiple solutions out there, no clear winner, and, I think, lots of new entrants to the market as well. Job orchestration is another interesting one. You know, Airflow has been really dominant in the space, but there are new entrants coming in which offer richer capabilities there too. So we're on the lookout in these areas to see what is gonna make more sense for us longer term. But in terms of the road map and the challenges, like I was alluding to earlier, a lot of our work is in terms of, you know, continuing to add more nines of availability, continuing to make our infrastructure more cost efficient, and continuing to see where we lack capabilities.
And those capabilities are a little bit more advanced: how do we get stream processing to be generally usable across the entire company, how do we get, you know, querying to work in a lightning fast way as our dataset size increases over time, how do we build in governance and controls that are even better and more solid than what we have today? So a lot of the story is continuous improvement as we scale things out, and making pragmatic choices through the lens of that continuous improvement based on what people need. I know I'm being a little abstract, but I'm trying to walk the tightrope on what I'm allowed to share. But I think at the end of the day, it boils down to those 4 key pillars that I talked about earlier: reliability, security, user experience, and cost efficiency.
[00:53:57] Unknown:
Are there any other aspects of your experience of building and running data platforms or the specific architectures implemented or the surrounding space of building data lake architectures that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:18] Unknown:
The advice I would have for people considering and investing in the data stack is: think about how your data is going to be used over the next 2 years and use that in your decision making about your platform. One of the mistakes I think I've made at both companies is underestimating the need for things like lineage and data quality and getting schemas right and ownership right. You know, it comes as an afterthought. We're focused on building highly available, you know, Lambda architectures or Kappa architectures, getting them off the ground, making them really highly available and all of that cool stuff. But I think it's easy to forget the quote unquote boring stuff. It's easy to forget that, you know, data systems are extremely intertwined with humans and organizations. And so my advice would be to think about ownership of data from the get go, before implementing sophisticated data platform solutions.
It's a mistake I've made twice, and I'm gonna take this lesson forward into my next take, whenever that happens.
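One lightweight way to act on that advice, offered as an assumption rather than anything described in the episode, is to make owner, schema, and freshness expectations part of the dataset's definition itself, so nothing ships without them.

```python
"""Hypothetical dataset descriptor: ownership and schema declared up front."""
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    owner_team: str            # who gets paged when quality checks fail
    schema_id: int             # a registered schema, not an ad hoc blob
    freshness_sla_hours: int   # how stale the data may get before alerting


orders = Dataset(
    name="orders_daily",
    owner_team="payments-data",
    schema_id=42,
    freshness_sla_hours=24,
)
```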
[00:55:16] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:55:31] Unknown:
I think there's a cottage industry of companies and startups solving, I think, all parts of the data stack. I don't think there's anything that's left untouched. But there are places where I don't think we have a clear winner. One of them is the whole tooling around data quality, discovery, and lineage. It's a massive problem, and it's not solved well. There are a lot of new entrants, but no big player there. Another big area that is underinvested is, you know, data operations, in terms of anomaly detection, alerting, monitoring; that whole ecosystem, I think, is ripe for disruption.
I think there are a lot of solutions coming up, but ETL-ing at scale is something where, you know, there's no clear commodity winner yet. I know there are tools like Fivetran and Stitch that do a great job, but, you know, at petabyte scale in near real time, I don't think people have solved that problem. There are a few startups that are emerging. So I think those are the 3 things I would say: low latency, high scale ingestion systems, and tooling around data quality, observability, ownership, and lineage.
And, I guess I didn't mention this earlier, but the third thing is governance, security, privacy, and all of that stuff. So those are probably the areas that I think are a little bit underinvested in the ecosystem today.
[00:56:50] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your experiences of building out these data platforms and the lessons that you've learned in the process. It's definitely a perennial problem, and it's always great to hear the solutions that people have settled on for their particular context. So thank you again for taking the time, and I hope you enjoy the rest of your day. Thank you, Tobias. I enjoyed this conversation. Lots of great questions that made me reflect and think, and it was great to be on the show. So thank you for having me. Thanks for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Srivatsan Sridharan's Background
Experience in Designing Data Platforms
Core Elements of Data Platforms
Data Warehouse vs Data Lake vs Data Lakehouse
Batch vs Streaming Data Processing
User Experience in Data Lakes vs Data Warehouses
Lessons from Building Data Platforms
Current Architecture at Robinhood
Core Capabilities and Metrics
Cost Considerations in Build vs Buy
Evaluating Similar Technologies
Architectural Abstractions for Flexibility
Interesting Use Cases and Lessons Learned
When Data Lakes Might Be the Wrong Choice
Future Plans and Developments
Final Advice and Reflections