Summary
Building a data platform is an iterative and evolutionary process that requires collaboration with internal stakeholders to ensure that their needs are being met. Yotpo has been on a journey to evolve and scale their data platform to continue serving the needs of their organization as it increases the scale and sophistication of data usage. In this episode Doron Porat and Liran Yogev explain how they arrived at their current architecture, the capabilities that they are optimizing for, and the complex process of identifying and evaluating new components to integrate into their systems. This is an excellent exploration of the decisions and tradeoffs that need to be made while building such a complex system.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
- Your host is Tobias Macey and today I’m interviewing Doron Porat and Liran Yogev about their experiences designing and implementing a self-serve data platform at Yotpo
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Yotpo is and the role that data plays in the organization?
- What are the core data types and sources that you are working with?
- What kinds of data assets are being produced and how do those get consumed and re-integrated into the business?
- What are the user personas that you are supporting and what are the interfaces that they are comfortable interacting with?
- What is the size of your team and how is it structured?
- You recently posted about the current architecture of your data platform. What was the starting point on your platform journey?
- What did the early stages of feature and platform evolution look like?
- What was the catalyst for making a concerted effort to integrate your systems into a cohesive platform?
- What was the scope and directive of the project for building a platform?
- What are the metrics and capabilities that you are optimizing for in the structure of your data platform?
- What are the organizational or regulatory constraints that you needed to account for?
- What are some of the early decisions that affected your available choices in later stages of the project?
- What does the current state of your architecture look like?
- How long did it take to get to where you are today?
- What were the factors that you considered in the various build vs. buy decisions?
- How did you manage cost modeling to understand the true savings on either side of that decision?
- If you were to start from scratch on a new data platform today what might you do differently?
- What are the decisions that proved helpful in the later stages of your platform development?
- What are the most interesting, innovative, or unexpected ways that you have seen your platform used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on designing and implementing your platform?
- What do you have planned for the future of your platform infrastructure?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Yotpo
- Greenplum
- Databricks
- Metorikku
- Apache Hive
- CDC == Change Data Capture
- Debezium
- Apache Hudi
- Upsolver
- Spark
- PrestoDB
- Snowflake
- Druid
- Rockset
- dbt
- Acryl
- Atlan
- OpenLineage
- Okera
- Shopify Data Warehouse Episode
- Redshift
- Delta Lake
- Iceberg
- Outbox Pattern
- Backstage
- Roadie
- Nomad
- Kubernetes
- Deequ
- Great Expectations
- LakeFS
- 2021 Recap Episode
- Monte Carlo
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer friendly data catalog for the modern data stack. Open source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga, and others. Acryl Data provides DataHub as an easy to consume SaaS product, which has been adopted by several companies. Sign up for the SaaS today at dataengineeringpodcast.com/acryl. That's A-C-R-Y-L.
[00:01:27] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Doron Porat and Liran Yogev about their experiences designing and implementing a self-serve data platform at Yotpo. So, Doron, can you start by introducing yourself?
[00:01:38] Unknown:
Sure. Hi. So, yeah, my name is Doron. I've worked at Yotpo for about 5, 6 years now, and I manage the Data Infra Group. Liran is my manager. And what we do in the Infra Group is basically build data infrastructure to self-serve all the rest of the data consumers and producers at Yotpo around the platform.
[00:02:01] Unknown:
And Liran, how about yourself? I lead the domain of platform engineering at Yotpo. So part of it is data infrastructure, another part is backend infrastructure and front end infrastructure. So, like, my goal in life is increasing the velocity of everyone that's working at Yotpo or, like, everything that's internal. Just making the company move faster. So I do it in these 3 domains.
[00:02:21] Unknown:
Going back to you, Doron, do you remember how you first got involved working in the area of data? Yes. I was very young.
[00:02:27] Unknown:
I think data is basically everything I did in my adulthood. I started off as a BI developer at a services company. And quite fast, I became a project manager, and I started building data warehouses for different organizations, mostly startup companies. So I got to see, like, different MPP databases and all that was hot back then. I won't say how far back that was, but... You can name-drop things from the past. It's okay. Yeah. Like XSL and Greenplum and things that no one uses anymore. And then actually, Yotpo was the last, like, customer I arrived at. So I came to Yotpo to build the first data warehouse. As I said, that was probably 6 and something years ago.
At Yotpo, I had this transformation, around something that always bothered me working in this field where we were, like, building data warehouses. I felt like what we were building is something that can last for a year or 2 or 3 years. It really depends on the people that build it and how much knowledge they have, and it's super coupled to certain technologies, and these big black boxes and a lot of managed services and not enough, like, know-how inside the teams and a lot of locked up business logic and everything entangled together. And what we basically did at Yotpo is create this platform that's built to scale, with, like, a very robust architecture that today I'm very proud of. And, like, one of the main principles that, like, led us in building this platform is having this architecture where everything is very... how do we call it? Interchangeable. Right? You can, like, take stuff out and replace certain services and capabilities within the platform without having to break everything and start all over again,
[00:04:06] Unknown:
which is something that I really hated to see from the side. Yeah. Like, I think we're so into not being vendor locked that we just... we do everything in our power to reduce that. Too scared of vendor lock, maybe. Scared. Yeah. I think it would probably help if we'd been a bit more... Yeah. Yep. And, Liran, do you remember how you first got started working in data? Part of my story is Doron's story. We've been working together for the last 6 years. But, like, I actually started from, like... like, I was a full stack developer. Like, I built full stack products at a network security company. And then I just got offered a job running a big data team, and I was so intrigued by doing something that I had no idea what it is. I just came here, and I was, like, fascinated by big data and all the different infrastructure, working with Spark and, like, doing, like, distributed calculations. It was so great and so interesting, and I think, like, I got sucked into it and, like, I'm hooked right now. So in the last couple of years, I dabbled a bit in ML engineering and in data infrastructure, and that's what I've been doing. And then Doron's story comes into play. So, yeah. Yeah. We have the same. That's how I started. And so that brings us to the topic at hand for today, which is the work that you've been doing recently to
[00:05:15] Unknown:
either evolve or rebuild the architecture of your data platform. And before we get too far into that, I'm wondering if you can just start by describing a bit about what it is that Yotpo does and some of the roles that data plays across the organization.
[00:05:29] Unknown:
Yotpo is basically a marketing platform for e-commerce businesses. So as part of the platform, we offer various products that give, like, all the solutions that the marketer needs. It starts with content generation. It can be reviews or videos or pictures or whatever, communication channels, loyalty programs, referrals, and, like, basically, every solution you can think of to drive conversion and keep customers coming in. And so the data that we have at Yotpo, it varies from, whether it's, like, B2C data coming from the websites where Yotpo is implemented and serves, to B2B data where we have the product information coming from the users of Yotpo, like the admins.
[00:06:11] Unknown:
And then we have all the, like, applicative data. Yeah. We have a lot of data, I think, coming from actual services. Like, I think for me, like, in general, the biggest challenge here was, like, this type of data. Because, like, I think data that's, like, more event streams is a bit easier to work with in data platforms; in general, there's, like, a lot of technologies around events. But when you're talking about data that's being constantly updated, so, like, think about, like, relational databases... We have hundreds of microservices with their own databases and everything is, like, interconnected somehow. Yeah. It's a big part of the challenge. We also have all those 3rd party... Yeah. We have, like, business data, like data from Salesforce. So, yeah.
[00:06:48] Unknown:
Once you have all of this data available, what are some of the ways that it's being used in the organization as far as driving value from that? Is it being used for machine learning use cases? Is it largely for business intelligence and informing some of the sort of road mapping for the organization? Is it doing a lot of analytics engineering or embedding analytics into customer experiences? Just curious to talk through some of those use cases. Everything. Yeah. Everything. Yeah. Everything. We work at Yotpo. Yeah.
[00:07:18] Unknown:
Yeah. I think, like, our data platform is being utilized by a lot of people. Hundreds of people. Yeah. I think, like, we just talked about it before. Like, we have a lot... I think around 500 users of our data platform, which, like, they're working directly... Yeah. Yeah. Directly on top of the data lake, mostly using Databricks or some other tools. And they're doing a lot of different things with it. So we have, like, support engineers that are using it to understand, like, their customers' data or, like, help solve bugs or whatever is coming from the customers. We have analysts there that are, like, trying to understand how the organization can work better, just, like, give their managers a better view of the business. We have engineers building tools to help customers... Yeah. Help with the data platform. Yeah. We actually have R&D using the same data platform to create, like, features which they find hard to do using, like, their own platforms.
So they can use it to offload... Yeah. To offload stuff that's mostly pretty hard to do with, like, microservices or, like, event driven architectures. Who else? Lots of BI. Yeah. And then, of course, data science, which is quite a heavy user of our data platform. So they use it as well. And B2B... B2B also. Yeah. Yeah. We're doing B2B analytics. Yes. Yeah. You mentioned that as well. Yeah. I think that's a very interesting use case for us. This was a really interesting project because, in the past, like, it was part of the product. So the product had to be in charge of, like, B2B analytics.
So they had to generate the data, create the API on top of it, create the dashboards. And what happened at some point was, like, it became such an expensive task to do that we just stopped adding dashboards for our customers. I think that happens to a lot of organizations. Right? And time to market was also crazy. It took, like, 6, 8 months to deliver a dashboard. Yeah. Because they wanted, like, the perfect UI. And data engineering is... yeah. And data engineering is difficult when you have to do it, like, on the application side. But when you have, like, big data systems, it becomes so easy because data is so accessible, and it's really easy to join and do the different aggregations on huge datasets. So then we got, like, this task of making this faster, and we created a platform that uses the data lake, Snowflake, and Looker. And basically, they just embed Looker inside our product, and now they can create dashboards within days instead of months. So that's, like, a really cool use case, I guess, for the data platform today. And what is the size and structure of the team that you have who are building and supporting this platform, particularly in relation to the size of the end consumers? You said you have on the order of 500 people who are actually taking advantage of that platform.
[00:09:38] Unknown:
Basically, up until last September, I think the biggest the team got was 5 people. It's not "I think." I know. It was 5 people at most. Yeah. And then we started a new team, a very innovative team, back in September. So that's 3 more people.
[00:09:54] Unknown:
That's us. But, like, in general, like, the data group that Doron is in charge of is doing infrastructure only. Like, they're not actually generating data assets. Like, I would say, like, that's how we want it to be. Like, they're still in charge of a few of them, but just, like, legacy ones. So they are only about, like, the users or the consumers of data or the producers of data. They deal with everything around how do you ingest data into the system? How do you extract it? How do you create, like, transformations? How do you govern the data? And then outside, there's, like, a BI dev team that's not part of the group. They're also part of the ecosystem. And we are all about, like, the self-service part of the platform. Like, we want to engage with other users and let them create data assets and let them use them. And so that's what we are investing a lot in. It's making it very easy to use. And you mentioned that one of the
[00:10:40] Unknown:
catalysts for embarking on this project of reimagining what your data platform architecture was going to look like was the fact that you had these requests for new dashboards that would take weeks or months to deliver.
[00:10:53] Unknown:
And I'm wondering if you can talk through what the state of the architecture looked like at that point before you went and reimagined it and rebuilt it. The biggest, largest catalyst, I think, for the data platform was around, I think, a couple of years ago when we found out that, you know, every company starts with a monolith. Right? So all the data is there. It's very easily accessible. And at some point, you know, the company grows to a size where it needs to have, like, microservices or serverless functions, and data started to become, like, not centralized anymore. And that's where we came in, and people started to wonder how do we work, like, with the data? Like, the data is decentralized. It's everywhere. How do we join between it? How do we, like, support engineers? How do analysts work on top of it? So, you know, classically, there were, like, these data warehouses that they, you know, push data into, but then we kind of, like, thought about, wait. Why don't we keep all that data somewhere centralized, which is a data lake, and utilize it and give all of our users access to this data to easily transform it or just do whatever they want with it. In real time? Yeah. In real time. Yeah. We can talk about it later, I guess, like, how this part works. But, like, that was, I think, like, a game changer for the company. I think that, like, one thing that we didn't get to mention is, I think,
[00:12:01] Unknown:
before this happened, we started building this platform. We were very proud of it, but we didn't have, like, many use cases running on top of it. And we were basically building our own data pipelines. But we have this open source project called Metorikku, and that's basically, like, an ETL framework that we wrote inside Yotpo to enable non-data developers to build their own data pipelines based on Spark. And then you can deploy it on whatever cluster orchestration that you have. So we had this in place, and we were playing with building our own stuff and doing our thing. And once we had a solution for what Liran described, it was really about being, like, in the right place at the right time.
Everything exploded. Like, everyone got to a point where they needed us, and they wanted us everywhere. And, yeah, it was actually, like, a real turning point. I think things drastically changed from there. Oh, and the reason that I talked about it is that we had this in place. So we had, like, the infrastructure to enable people to build pipelines on top of this data. And then we suddenly had all this data from all the services in real time, in a data lake. And then, woah, it's just like a playground. That's cool. We can build tons of stuff and everything is there. So
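For a concrete picture of the pattern Metorikku implements, the sketch below shows a configuration-driven Spark runner: a pipeline is declared as inputs, SQL steps, and an output, and a generic job executes it. This is only an illustration of the idea; the field names, paths, and table names are invented and are not Metorikku's actual configuration schema.

```python
# A sketch of the config-driven pattern: the pipeline is data, the runner is generic.
# Everything here (field names, paths, SQL) is illustrative, not Metorikku's real schema.
from pyspark.sql import SparkSession

pipeline = {
    "inputs": {"orders": "s3://lake/raw/orders", "reviews": "s3://lake/raw/reviews"},
    "steps": [
        {"name": "store_reviews",
         "sql": """SELECT o.store_id, count(r.review_id) AS review_count
                   FROM orders o JOIN reviews r ON o.order_id = r.order_id
                   GROUP BY o.store_id"""},
    ],
    "output": {"table": "store_reviews", "path": "s3://lake/models/store_reviews"},
}

def run(config):
    spark = SparkSession.builder.appName("sql-pipeline-runner").getOrCreate()
    # Register each input as a view so the SQL steps can refer to it by name.
    for name, path in config["inputs"].items():
        spark.read.parquet(path).createOrReplaceTempView(name)
    # Each step's result becomes a view that later steps (and the output) can use.
    for step in config["steps"]:
        spark.sql(step["sql"]).createOrReplaceTempView(step["name"])
    out = config["output"]
    spark.table(out["table"]).write.mode("overwrite").parquet(out["path"])

if __name__ == "__main__":
    run(pipeline)
```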
[00:13:10] Unknown:
In terms of the overall effort to build this new data platform and empower all of your end users to be self-service and unlock the potential value of the data and allow them to explore it, I'm wondering what the sort of early stages of that evolution looked like. What were some of the initial steps that you took down that path? Was it a matter of let's just throw a bunch of stuff at the wall and see what sticks and experiment, or was it a very sort of disciplined, let's plan everything out end to end and figure out what are all the interconnection points, or is it somewhere in between?
[00:13:41] Unknown:
Definitely not the second one. No. Yeah. It's not us. No. It was complete chaos. Like, and I think we went back and forth so many times with the architectures, and we still do. Like, Doron talked about Metorikku. Metorikku is being used at Yotpo. We have hundreds of data pipelines using it. And we're about to kill it right now. Like, people don't really know it yet, but we have a new project called Yoda that's coming up, and that's gonna replace it. It's gonna be really, really great. It's based on top of dbt. We see it, like, as the future of data modeling.
It's much better than Metorikku, and it has a better approach to data modeling. But we can talk about it later. At the very beginning, we had, like, the event data, which, basically from the day it was born, was streamed to the data lake. Then, you know... I don't think it was a data lake. It was just S3. Oh, S3... S3 with files no one touched. Yeah. With parquet files and just, like, CSVs. Yeah. And CSVs? Okay. Later on, we had to add Hive at some point because we needed, like, a catalog. I think most companies... yeah. No. I think Hive is a really nice example because I remember when we started working, we didn't have a Metastore,
[00:14:41] Unknown:
and then we would load parquet files. And if you wanted to use this parquet file, you would need to, like, load it. So you load the path, and you need to know where the data is. And only we knew
[00:14:54] Unknown:
what the paths are. You had to ask someone from data engineering, where is this thing? But, you know, at that point, when we talked to people, like, from the industry, the Hive metastore was still, like, not being utilized by a lot of, like, different companies, like, very successful companies. Like, woah, you have a catalog? Wow. That's it. Once we started using it, it was, like, how did we work before? That's so stupid, like, what we did before it. So I think, like, we started off with, like, loading data from our applicative databases by using, like, Spark loaders, which were basically pulling all the data on a daily basis or an hourly basis. Select star. Yeah. They're doing select star from our operational databases, pushing it to the data lake, and then people could join and create their data pipelines.
So that was not really fresh. Right? So we had to... And costly also. Yeah. Costly, yeah, on the operational databases and, of course, on the loading process. And, of course, Airflow there in the middle doing a lot of... its Airflow stuff, which, you know, not really great.
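As a rough sketch of those early loaders, assuming a MySQL source and an S3 target (both hypothetical): a scheduled Spark job does the "select star" over JDBC and lands a full snapshot in the lake, registered in the Hive metastore so people can find it by name rather than by path.

```python
# Hypothetical "select star" loader: full snapshot over JDBC, landed as Parquet and
# registered in the Hive metastore. Connection details and names are made up.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("daily-orders-loader")
         .enableHiveSupport()          # so saveAsTable registers in the metastore
         .getOrCreate())

snapshot = (spark.read.format("jdbc")
            .option("url", "jdbc:mysql://orders-db:3306/orders")
            .option("dbtable", "orders")        # effectively SELECT * FROM orders
            .option("user", "loader")
            .option("password", "...")
            .option("fetchsize", "10000")
            .load())

# Overwrite the previous snapshot. Freshness is bounded by the schedule, and every
# run re-reads the operational database, which is the cost problem mentioned above.
(snapshot.write
 .mode("overwrite")
 .format("parquet")
 .option("path", "s3://lake/raw/orders")
 .saveAsTable("raw.orders"))
```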
[00:15:47] Unknown:
He doesn't like... No. I'm not a fan of Airflow. I personally have not had to go down the path of Airflow. I was fortunate enough to come in at the second generation stage, and so I've actually started my platform fresh with Dagster. So for what that's worth... Oh, nice. Okay. Okay.
[00:16:03] Unknown:
Yes. Much, much better. I wanted to start with Prefect, by the way, but... You still do. Yeah. I still do. By the way, I'm... Investing heavily in Airflow. Keep investing in Airflow. Yeah. But now actually, we moved from, you know, managing your Airflow and doing, like, a lot of stuff and people working on it and not enjoying it, to having, like, an auto generated Airflow. Like, that's where we are right now. So our Airflow is becoming, like, just like a scheduler. Like, all the DAGs are auto generated, so no one is actually aware of the Airflow. Yep. So as long as it's working, I don't really care anymore. It's okay. Right? Yeah. So it started out with loaders and without a Metastore, then we added a Metastore, and then we switched the loaders to CDC. We started working with CDC, which was, like, a huge revolution for us. Like, of course, an operational nightmare, by the way. Still is. Like, right now, we added, like, a lot of self-service around it. So, like, Debezium works well, and we have, like, processes to restart things and, like, the things that are not really built into Debezium, which is, like... there's, like, a lot of problems with that tool. So we worked around it for a very long time. Okay. So Debezium just gets us to Kafka. Right? So it just gets us from the applicative databases or the MySQLs and stuff into Kafka, but then we need to get to the data lake. So we started our own... there's also a blog post about it. We used Metorikku to write with Hudi format to the data lake, doing upserts.
Horrible times. Yes. Yes. These were horrible times. Yes. It was a very, very tough time for us. Very intensive. It was very costly. Mhmm. Did not work well. Did not scale well. Hudi is not a great format. It was not highly supported by, like, engines around it. We had a lot of trouble with it. And then at some point, we're like, we're just way too deep into this. We need to change, and we started using Upsolver to do our upserts. I think a year later, it became, like, a very successful project. And right now, like, we have thousands of tables that are being streamed at less than one minute freshness to the lake with upserts and being queried by Spark. Or before that, we had also Presto. Like, everything works really, really well, and people are getting, like, really, really great query performance and ingestion speed.
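The general shape of that Hudi upsert step, sketched below under a few assumptions: the Debezium envelope is flattened to a simple record, the topic and fields are invented, and the option values are the commonly documented Hudi write options rather than Yotpo's actual Metorikku job.

```python
# Simplified CDC-to-lake upsert with Hudi: read a batch of change records from Kafka
# and merge them into a Hudi table keyed on the record id, latest change winning.
# Topic, schema, and paths are invented; real Debezium messages carry a before/after
# envelope that would need unwrapping first.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, LongType, StringType, TimestampType

spark = SparkSession.builder.appName("cdc-upsert").getOrCreate()

change_schema = StructType([
    StructField("id", LongType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

changes = (spark.read.format("kafka")
           .option("kafka.bootstrap.servers", "kafka:9092")
           .option("subscribe", "dbserver.orders.orders")
           .load()
           .select(from_json(col("value").cast("string"), change_schema).alias("row"))
           .select("row.*"))

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "id",           # dedup key
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest change wins
    "hoodie.datasource.write.operation": "upsert",
}

(changes.write.format("hudi")
 .options(**hudi_options)
 .mode("append")
 .save("s3://lake/cdc/orders"))
```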
[00:18:00] Unknown:
I think that, like, basically, the way we work is we have these cycles where we, like, have this objective that we wanna push. And then we do some research, and then we decide what we want to go for. We implement the technology, then we build the self-service around it. We've had these cycles ever since we started the data platform. And like I said in the beginning, because it is so interchangeable... I don't know if it's interchangeable. It's... it works. Right? It works. Because it's built the way it's built, we just attack a different part of the platform every time. And, like, the thing is that we have to make sure that everything plays well together. And because we live in a data lake, and because we heavily rely on Spark, and we use the Hive metastore, we have a few, like, anchors, and we need to, like, build every solution that we bring in around these anchors.
[00:18:51] Unknown:
But all of those are pretty, like, standard. But, also, like, adding to that, like, I think one of the really good decisions that we made was, like, concentrating on a single query engine for the data lake at some point in time. Yeah. Like, we had at some point, like, both Presto and Spark, and it just became so complicated handling both of those at the same time. Like, because, like, they do have nuances, and Presto for us was not really that great. Like, it was not as fault tolerant as Spark. It was not really dealing great with, like, memory stuff. So, like, at some point, like, okay. We are only using Spark. That's the only way to get to the data lake. If you wanna push it somewhere else, you have to use Spark to move it from the data lake to somewhere else. So we do use Snowflake, like, a lot for, like, these dashboards.
We plan to use it even further. We plan to use, like, Rockset or Druid for, like, these real time analytics parts. But it's all based on, like, a source of truth, which is the data lake in S3 and then Spark on top of it. So that's how we roll.
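A minimal sketch of that "Spark is the only way out of the lake" rule, pushing a modeled lake table into Snowflake with the Spark-Snowflake connector. The option names follow the connector's documented settings, but the account, credentials, and table names are made up.

```python
# Moving a modeled lake table into Snowflake with Spark, per the "only Spark touches
# the lake" rule. All values and table names here are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lake-to-snowflake")
         .enableHiveSupport()
         .getOrCreate())

sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "...",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "LOAD_WH",
}

(spark.table("models.store_reviews")        # source of truth stays in the lake
 .write.format("net.snowflake.spark.snowflake")
 .options(**sf_options)
 .option("dbtable", "STORE_REVIEWS")
 .mode("overwrite")
 .save())
```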
[00:19:49] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t shirt for being a listener of the data engineering podcast at dataengineeringpodcast.com/rudder.
[00:20:23] Unknown:
As you have been going through these different iterations of development and discovery and optimization, what have been the sort of guiding forces or the key metrics that you're trying to optimize for in terms of the overall capabilities of the platform?
[00:20:38] Unknown:
I think those also changed over time. But I think that, yeah, firstly, we just wanted it to work, right, to get something working. Yeah. Today, I don't know if it's anything special, but today we are super focused on self-service, like, how self-service this solution can be. Because we really don't want to silo stuff around our teams. It has to be something that's applicable for other teams to work with. Another thing would be cost and performance.
[00:21:01] Unknown:
So for self-service, I think we are always, like, you know... we have, like, this motto of an enterprise product, and we take it pretty seriously. Like, we look at our customers, which are, like, developers or BI devs or, like, anyone that's using the platform, and when we think about them, it's like we're trying to sell them a product. Right? We need to make sure that they are successful with using the platform. And with that in mind, we always think, like, okay, so maybe we didn't really document it well. Maybe it's not really easy to use. Maybe there's not enough tooling around it. Maybe it's too complicated, like, they had to copy a lot of stuff. Like, it should be really easy. Like, we need to make big data or, like, data analytics simple for them. So that's, like, a really important part. And the other part is, like, I think, as Doron mentioned, cost and performance. And those, by the way, work together. Right? Performance and cost. Like, they do have, like, a lot of correlation between them. But we always look at, like, query performance. We really monitor it quite closely to see that people are getting, like, normal times for their... whatever it is, like, data pipelines, or when they're just querying the data ad hoc, like, with their, like, Databricks clusters. So we're really into, like, making sure that things are, like, you know, at proper performance. And cost is something we're really heavily monitoring, but also we're working a lot on moving the ownership to, like, the teams, so they can monitor it by themselves. It's all about the self-service. Like, even the cost and the performance, like, we wanna move it to them, but we need to digest it in a way that they can understand and they can actually do something about. Like, tell them, oh, you have too much memory in your Spark cluster... like, what will they do with it? Can you use auto scaling to help them? Can we do, like, auto optimizations for them? Can we maybe send them some automated message, like, saying, oh, your pipeline is very, you know, time consuming, or it's wasting a lot of resources that you don't really need? Like, how can we automate, like, our communication with them? And also trying to reduce... like, we have, like, a contact-data channel where they can ask, like, questions. We wanna reduce, like, the number of questions that they're asking, because it's very time consuming for us, and then we don't get to do, like, these crazy infrastructures. Yeah. Now that I think about it, there's, like, another thing. There's this notion of, like, a cognitive load, right, for people interacting with the platform. That's one side, where we want to make everything self-service and very, like, low hanging for them to use and to understand.
[00:23:05] Unknown:
And on the other hand, we have, like, the cognitive load for data infrastructure engineers too. Because we're working in an environment where we have so many services, so many responsibilities, so many context switches, unlike any other team, I think, aside from, like, DevOps teams, probably. That's, like, the most similar. Another thing would be: is this technology mature enough for us to handle at this point in time, where we have this fire and this fire going on and we have, like, so much stuff going on? So it's another thing, like, to keep the developers happy on both sides of the platform. Right?
[00:23:43] Unknown:
You mentioned cost a few times as being one of the motivating factors in the design and the implementation and the tool selection, and that can often be something that's very difficult to model thoroughly because, you know, on the one hand, you have a vendor where you can say very clearly, okay, this is the cost to acquire their platform and to use it for these use cases. Or I'm just gonna go with the open source route because it's free as in puppy, and so now it's gonna cost me, you know, several months' worth of my engineers' time and, you know, some untold thousands of dollars because of the fact that we have to reverse engineer it, you know, integrate it with these other systems. And so I'm curious how you think about that cost modeling approach as you're debating this build versus buy process in terms of the tool selection, and then also weighing that against the maturity model of that tool or platform.
[00:24:31] Unknown:
So I think, like, we also matured. Like, we were at some point, like, on the... We were younger. Yeah. We were younger, and we were just using open source. Like, what? No. Don't use a managed solution at all. We had this talk a few years ago. I think it was when Liran wanted to... we were deep into Hudi, and everything was so, so bad. I think it was when COVID started, and we were working at night. We were zooming together. We were so miserable. And then Liran said, forget about it. Let's just get something managed to do it right. Let's just do something else. And I was like, no. What are we, like, technicians? I don't wanna move to managed services. It's boring. What's left for us? Where's the
[00:25:07] Unknown:
rush in data? No. Managers. This is wonderful. No. We I think we we get our rush today from other things. I think, like, working with this non mature open source project or the open source projects that are not really fitting to what we need can be a real headache. And if it starts like this really cool way, you do everything by yourself, it just ends up to being something very time consuming and very causing, like, a lot of fatigue to our team. So we're concentrating now on bringing, like, impact and value if that means that we need to build our own tool because I actually I I give you, like, an example right now. So we started rethinking about, like, data modeling. We talked about before, which is heavily used for data pipelines. And we wanted to move away from data pipelines into the more the data modeling world because people were not, like, building data assets that last. Like, they're building only their own ETLs. Right? And not building, like, things that, like, people can reuse or they can build on top of them. They were not sharing knowledge, which I think is, like, what DBT is for. So we started looking at DBT, which is great. It's been heavily used by a lot of corporations around the world. Right? But then, like, we thought, like, woah. This is, like, there's like a lot of manual stuff to do here. A lot of stuff like that I don't really see people, like, enjoying. It has, like, a lot of things that you have to generate by yourself, like creating those metadata files or generating, like, sources or, like, everything was, like, seems, like, very complicated. So it's, like, okay. Maybe you can just, you know, utilize Middle Rico and just, you know, move it from the ETL world to data modeling world, which is, like, our own project. We have full control over it. But DBT comes, like, with a lot of different integrations, and, like, we have, like, like, the community. We can enjoy this. So this is, like, choosing, like, to build our own or using open source. And then we looked also with DBT management. We thought, like, well, it doesn't provide us with the right, like you know, it doesn't give what was what we want. So we we need to build our own. So we're actually now at this, like, stage. We're we're building, like, a lot. We have a team that's actually doing, like, a lot of stuff on top of dbt, but also, like, using dbt behind the scenes because it's, like, a standard way, and it's been used in the industry. So this is, like, where we chose to, like, to build our stuff on top of open source because there's not a really good managed solution. But, like, looking at, like, AppSolver, for example, which is doing such an incredible job streaming data to the data lakes of the world or, like, to data warehouses, like, where I don't think there's any other solution. And we tried, like, everything. We did it with, like, Snowflake. We did it with Databricks. We did it with, like, just our own spark. Like, they do it really, really well. And even, like, at the end, like, it was cost optimized. Like, they're, like, in a premium solution. Right?
So we are, like, always, like, on the lookout of the best solution we can find. If we can, you know, we have we can budget it, we'll probably buy it. If not, then we'll think about, like, creative ways or just not do it right now. Just wait for the right moment or the right tool to come to our rescue. Like, for example, like, data catalog is 1 of those things where we're still, like, not sure what to buy or to build because, like, there's a lot of coming up solution right now that are interesting, I guess.
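To picture the kind of tooling being described around dbt, here is an illustrative sketch (not Yotpo's actual Yoda project) of auto-generating a dbt sources file from the Hive metastore, so that nobody hand-writes the metadata dbt expects. The database name and output path are assumptions.

```python
# Illustrative only: derive a dbt sources file from the Hive metastore instead of
# hand-writing it. Database name and output path are made up for the example.
import yaml
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dbt-sources-generator")
         .enableHiveSupport()
         .getOrCreate())

def generate_sources(database: str) -> str:
    tables = []
    for table in spark.catalog.listTables(database):
        columns = [{"name": c.name, "description": c.description or ""}
                   for c in spark.catalog.listColumns(table.name, database)]
        tables.append({"name": table.name, "columns": columns})
    doc = {"version": 2,
           "sources": [{"name": database, "schema": database, "tables": tables}]}
    return yaml.safe_dump(doc, sort_keys=False)

# Write the generated sources definition where a dbt project would expect it.
with open("models/sources/raw.yml", "w") as fh:
    fh.write(generate_sources("raw"))
```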
[00:27:47] Unknown:
We wanted to get to know Acryl Data. We wanted to work with... we talked to Atlan. I don't know. Yeah. Atlan as well, and Acryl Data, and... We talked to every data catalog out there, I think. But also dbt, like, gives you some... In the world of dbt, it gives you a basic, cool, nice generated catalog out of the box with lineage, and we really like it. And also, like I said, in terms of, like, catalog and lineage, we talked about integrating OpenLineage. And I think we're not really confined to a certain approach of saying, okay, we're going all managed if possible, and otherwise... So we have this mix. And, also, I think, like, we always invest time in, like, trying out, like, new technologies and stuff out there and getting to know, like, open source projects so we can, like, stay up to date and know, like, what's out there. But, eventually, we have this amount of bandwidth to handle these open sources and stuff that we build ourselves. And around every managed tool that we have, like Liran said, we automate stuff and we build this infrastructure around it. So there's always, like,
[00:28:49] Unknown:
our code wrapping whatever service we bring in. It's not like a managed solution will be, like, a free ride for us. Like, we still have a lot of work to do around managing it. They don't fit our needs, so 100% of the time the POC fails, but we get it working with them. We actually have no other choice. Yeah. No. But, also, like, I think with the ones we got working, we have, like, a good partnership with the people working there. So we had, like, a really... We struggled with the POC with Okera, which is doing, like, our data governance. But then we worked really closely with them, and we were able to make it work. And in the end, like, we bought the solution, and it's a great solution for us. I don't think there's anything out there that resembles it, like, because of the work we did put in together with them on our data platform. Yeah. And also, I think it turns out that a lot of the vendors that we chose to work with, we ended up being, like, true design partners, building stuff together in a lot of the cases. Yeah. I think they like us. I hope.
No. Like, I think, like, because our platform is not, like... people that have, like, data platforms that are just, like, Snowflake... like, it's great. Right? It works really well. But ours is very complicated, and we know our stuff. Like, we know how to deal with it. We have, like, a lot of data engineering knowledge inside the team. So, like, we know what we want. We know what we need, and when we have, like, a really good vendor to work with, then it works very well. But, again, it's not zero work. We have a lot of work on those POCs and afterwards, like, implementation and the self-service part. And we make mistakes and we go back and we try something else. Yeah. So it's not, like, a clear cut, like, where we'll invest more. So I think, like, if I have to, like, summarize it, we're just looking for the best solution out there. If it's an open source one, open source; if it's a managed one, we'll use managed; and maybe it will just be a "later on," which is also something that we do from time to time. So... Definitely interesting, sort of the specific flavor that you build up in an engineering team, that everything has to kind of fit into that, or you have to make it fit. And your talk about dbt puts me in mind of what the folks at Shopify did with their adoption of dbt for building their cloud data warehouses, where they built this wrapper around it to enable CI/CD workflows and make it fit their specific engineering patterns. I think everyone is doing it. Like, I think something is missing inside dbt. Like, not everyone is able to use it just out of the box. I think, like, that's a problem, because, like, the amount of work that we're doing around it is, like... I really wish it was part of the solution. By the way, we might open source it as well, because, like, we do, like, a lot of work, and so I hope that we can contribute. Yoda.
[00:31:01] Unknown:
Another thing that you mentioned is that you work to stay up to date with the different tools and systems that are out there. And I'm wondering how you manage to simplify the process of experimentation so that you can get up and running with these systems, test it out, figure out what are the sharp edges, does it make sense, what are the gaps, and then make an easy and sort of quick decision of, yes, this makes sense to invest further in, or, no, I'm not gonna waste any further time, without getting to the point of I'm going to spend 2 months and become operationally excellent at this one tool just so I can see if it actually works. I think usually by the time that I get to hear about something, it's after Liran
[00:31:37] Unknown:
tried it already, like, locally.
[00:31:39] Unknown:
So that's how I get my stream. That's not true. I think, like, we try to look for... we read a lot before we start investigating, because as I said, it's very costly to start, like, these POCs. So we try to better understand, like, if this solution works. We have a lot of talks with the vendor themselves, understanding, like, what they do, and talk to other companies that use them. Like, we try to get as much information as we can before even starting to try it. But in the end,
[00:32:02] Unknown:
we try to create, like, a POC around a very specific use case, which is, like, solvable. And, hopefully, like, based on documentation and what we heard, it's something that can work. It's also something I think is very, very important. Like, a lot of these solutions are, like, "we are a data lake solution, we do data lake," and we don't like this. We don't like these buzzwords. We usually have a very specific need that we want to solve, and I think that makes us very focused around the POC and what we're trying to get. And on the other hand, it doesn't make things more complicated or vague, because,
[00:32:34] Unknown:
yeah, I want this for whatever this platform has to offer. Show me. I wanna test out all the features. No. I'm approaching this with a very clear goal. And I think a lot of the tools that we're using today, we're using in a very specific part of the platform, where their offering goes way further, and we choose to do this. I'm thinking right now about a POC that we're not doing yet. Like, the one he's actually trying to keep me from doing, which is Rockset. No. It's an option. Yeah. We really... so Rockset is, like, this magical creature that exists. He falls in love. No. Yeah. I really... they are selling... It was Prefect before and now it's Rockset. No. Yeah. I'm sold on Prefect. But, like, their premise is super interesting. Like, they're really promising to do something that no one else can, like, based on their documentation and videos and the talks that I had with them. So I'm super intrigued to, like, see that in action. And so we actually, like, created, like, this use case for our POC, which, like, Joel just needs to set some time for. And, hopefully, like, within a day or two, we can, like, figure out if it actually does what it promised. So, like... but that's, I think, what we're trying to achieve, like, find these, like, very specific use cases. And we start small, like, just what we did with Snowflake. We start with something small. We try to assess if that works for us. We start it in some kind of a very small production use case, and then we just grow with it. So the problem we want to solve is real time analytics, and how to enable real time analytics, because we have analytics and we have real time. But,
[00:33:55] Unknown:
like, because the data that we're dealing with is data that is normalized, and basically, like, mostly, we have CDC streams coming into the data lake. So we either have the CDC itself straight from Kafka, or we have the materialized table in the data lake. And then we have to do these tons of joins to get something coherent that people can use for B2B analytics, for example, or other use cases. But mostly, it's some sort of B2B use case that needs this real time. So this is what we want to get. Right? So one approach can be, let's use Rockset. And then, like, let's put all the raw data in some database, and then we can query it live. So we can run our queries. It can also be Druid. It can be Snowflake. It can be any database that can handle, like, a large amount of data and do aggregations in real time. Right? But they promise to do something better. Right? Yeah. Because we want, like, impressive performance.
[00:34:46] Unknown:
That's one way to go about this. We like, like, this approach, and we're actually just right now contemplating whether we actually wanna go that way. But we really like that you don't have to decide anything in advance. So you just push it somewhere, and you can later on just, you know, decide what you query on it, just like a normal database, and then you can change your mind, which is great. Like, I love being able to change my mind about the schema, about the query structure, about the joins, about, like, the amount of data that I'm querying. Like, I love this type of solution. And then there's the other approach. Right? And this is because
[00:35:14] Unknown:
we know by now how difficult it is when you missed something and you want to add this data point, or the business logic is wrong, and then you have to build this whole stream again from the beginning. And so we know how hard and expensive these things tend to be. So that's why this is, like, a valid option. And then we have to make sure that it connects with all the rest of the pipeline and how it's supposed to reach, eventually, whatever... if it's a dashboard or something, some other database that needs this data. So it's not perfect. Right? So this is one option. Then another way of going around it is going the ETL way. Right? And having these joins happening and modeling this data before it reaches the destination, and streaming this as a joined view into some storage layer.
And this would go into the world of something Kafka based, just like KTables, KStreams, or maybe Flink or managed Flink or something from that space. Yeah. We've got also, like, Materialize there as well. Right? They also do the same kind of... Yeah. And Upsolver, by the way, as well. Like, they do, like, streaming ETLs. Yeah. So this is the problem. This is the one we wanna solve. This is, like, the technological spectrum that we can go around, different approaches, different vendors. Like, this is where we are now, and this happens every... like, this big of a thing, this happens, like, every, I don't know, 6 months or something like that, that we have this process. And, yeah, this is how it happens. We love it. It's fun. Like, this is what the work is about. Right?
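To illustrate the "pre-join before it reaches the destination" option, here is a small sketch written with Spark Structured Streaming rather than Kafka Streams or Flink, since Spark is the engine discussed throughout the episode. Topics, schemas, and paths are invented, and a real CDC stream would need the Debezium envelope and update semantics handled explicitly.

```python
# The "pre-joined view" option sketched with Spark Structured Streaming: join an
# order change stream with a slower-changing dimension before writing the wide view
# that dashboards read. Topics, schemas, and paths are invented.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.appName("b2b-joined-view").enableHiveSupport().getOrCreate()

order_schema = StructType([
    StructField("order_id", LongType()),
    StructField("store_id", LongType()),
    StructField("total", StringType()),
])

orders = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "cdc.orders")
          .load()
          .select(from_json(col("value").cast("string"), order_schema).alias("r"))
          .select("r.*"))

stores = spark.table("raw.stores")          # dimension table read from the lake

joined = orders.join(stores, "store_id")    # the denormalized view consumers want

(joined.writeStream
 .format("delta")                           # or whichever sink the dashboards query
 .option("checkpointLocation", "s3://lake/checkpoints/b2b_orders")
 .outputMode("append")
 .start("s3://lake/serving/b2b_orders"))
```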
[00:36:42] Unknown:
Yeah. In terms of your current architecture, I'm wondering if you can outline the key components and any of the early decisions that you made or some of the legacy systems that you had in place or that you needed to deal with that constrained your downstream options that led to the current state of the world?
[00:37:00] Unknown:
First of all, I think we started out with Redshift. That was the first component, and we still have it. And we can't seem to get rid of it. I don't know why. I know why. We... Just shut it down. We just can't get rid of it. But I think that, like, the biggest decision that we made early on is to be data lake centric. And I remember when we started working with Snowflake for the B2B solution that we talked about, and then, like, we asked ourselves, should we go, like, all in on Snowflake? It'd probably be much easier for everyone around us, including us.
No, we're gonna keep the data lake approach. We're gonna keep working with Spark, running our thing there. So I think that probably working data lake and Spark based was, like, the biggest influence we have in the architecture. But, by the way, we manage deprecations and migrations all the time. I think, like, 30, 40% of our engineers' time goes on how are we going to migrate out of this into this? And it's not always, like... it can be a new technology. It can be a new methodology of doing things. It can be just because, like, new features are out, and we can do stuff better and different, or we need to upgrade stuff, or
[00:38:11] Unknown:
deprecate things that are no longer relevant. So it's also something to consider. But, yeah, our Redshift is still here, 6 years later. I think, like, the way that we, like, do the decision making all the time is we really wanna move forward all the time. And I think, like, you asked about, like, our points in time, like, how do you make the decisions, or, like, the key features that got us to today. Like, and I think, like, it's always about, like, what's not working well for us, what could we improve, and how easy or difficult will it be to deprecate, like, the old solution. And we deal with this all the time. So I think, like, the key points were the data lake and Spark. I think Databricks is, like, a huge choice. Love. We love... Yeah. We love Databricks. Before, we had, like, tons of different solutions. We had EMR. We had, like, the Jupyter Notebooks. We had, like, everything. We did everything. We've been with Databricks for many years now, and it's just working well. And we actually run our own Spark clusters, like, from very, very, very early on. Like, we're running them on HashiCorp's Nomad, running them on Kubernetes, running them, like, just on our own, like, in EMR. But right now, we're actually thinking about stopping all that and just using Databricks. Like, they did such a good job with Photon that we just, like... why not use it? Like, they really did, like, this amazing work making Spark faster than it is. Why not, like, use what they're offering? Managed. Don't manage it ourselves. Yeah. So right now, we're starting to think how do we deprecate all those different Spark clusters, where we built a different, like, architecture around Kubernetes and Nomad, and just do this without driving our users crazy. Yeah. Yeah. We keep saying that we can do it maybe behind the scenes.
Yeah. So, like, Databricks was one. CDC, a huge thing that happened in the data platform. Yeah. And also, Upsolver was, like, something that was very, like... a game changer for the entire data platform. And I think also, like, if you go a bit back, it's Hive as well. Like, the Hive Metastore. I think a lot of people are using right now, like, the AWS solution, what's it called, Glue Catalog, or, like, the Databricks solution, but we have our own Hive metastore. It's great. I think it's stupid and we love it. It's really stupid. Yeah. It's stupid and we love it. Yes. I think that's stupid. That should be its motto. But so right now, I think the next one is gonna be dbt. I think that's gonna, like, really change the way we do data modeling. And also the Delta Lake format, which is, like... after our, you know, experience with Hudi, we're, like... we don't wanna touch formats for a while now. Let's just stop doing that. Like, it didn't go really well. Also, by the way, we tested Delta at that time. It was not really mature, the open source version. And the only one we didn't test is Iceberg, which I heard good things about. But right now, we're getting back to Delta, which is really much more mature, and I think this is also gonna be, like, something that we're gonna utilize more and more. I love their Z-ordering feature, which is also, again, working only in Databricks. But once we move to Databricks, we don't really care anymore. Like, it's working well for us right now within our new data pipeline. So I think those are gonna be a few things that are, like, important. And, also, I think, like, the governance features, the Okera ones. So they came from, like, necessary evil, right, where we were, like, forced to create governance features because of, like, compliance. But I like that we're safe. People don't have access to data that they don't need to see. And I think in the world of data lakes, it's really complicated.
So, yeah, I think that's also one important thing.
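For reference, the Z-ordering mentioned above comes down to a maintenance command on a Delta table: rewrite the files so rows with similar values of a chosen column are co-located, which lets selective queries skip most files. The table and column names below are examples, and the command assumes a Databricks (or recent Delta Lake) runtime.

```python
# Z-ordering a Delta table so selective queries skip files. Names are examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-maintenance").enableHiveSupport().getOrCreate()

# Compact the table and co-locate rows with nearby store_id values in the same files.
spark.sql("OPTIMIZE models.store_reviews ZORDER BY (store_id)")

# Typical beneficiary: a selective filter that can now prune most files via statistics.
spark.sql("SELECT * FROM models.store_reviews WHERE store_id = 12345").show()
```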
[00:41:13] Unknown:
And if you were to throw everything away and start from scratch today, you have a greenfield, completely new system. What are some of the things that you would do differently, or what would your ideal architecture look like? Well, that's interesting. I would want,
[00:41:24] Unknown:
like, a fully stream architecture. Like, I would remove batch altogether. But I have to say that I still 2022, I still don't feel comfortable doing this. Even if I were to create, like, a new data platform from scratch for a new company, like, it still doesn't work well as it should. But I would really work hard to make it work and, like, make my users work harder and keep the platform happy, I guess. Because, like, I think batch is, like, it's a problem. Like, it's something that we need to remove. Like, it's just, like, a very old school way of of thinking, and it's driving, like, a lot of It causes a lot of problems Yes. Eventually.
[00:42:06] Unknown:
If we were to start everything all over again, we might go through the same path, excluding the stuff that we threw away, and end up in the same place. But just knowing what we know today, we would have done things differently. For example, we would have invested this much in self-service much earlier on. You know what? I'm gonna say something that I think I mean, okay? In terms of governance. You talked about access management, and we have this whole area of data cataloging and so on. That's the kind of thing that always gets pushed away; we never say, oh, it's important enough, let's stop doing this production thing, because, yeah, we need access management, we need a catalog, and we want lineage. So it always gets pushed aside.
But I think it can create a very good relationship with data consumers and producers, working in an environment where they have observability into the data assets and they understand things much better: where the data comes from and where it goes to. So I wanna be brave and say that if I started all over again, I would do data-catalog-driven development.
[00:43:14] Unknown:
Yeah. I think that's what dbt is all about, right? It's about maintaining documentation on your data. And it's implicit; that's the thing we like the most. It's implicit, so you don't have to work that hard for a lot of what it offers. As I said, things are not perfect yet, and there's a lot left to do. I would want to see, by the way, how dbt works in a streaming world, which would be amazing. Because I think SQL is great, and I wanna keep it. By the way, you asked about important choices: we went with SQL all the way. We were not the first, but at some point we stopped writing Scala and Python, reducing it mostly to data science, because they love their Python, so go ahead. But we tried to keep everyone working with SQL, even if it was a bit hard. At some point Spark added a lot of capabilities to work better with SQL, and right now SQL is great, and it's winning. We interviewed someone from Meta on our podcast, and in Facebook it also won; they're using SQL quite a lot, and their data pipelines are written in SQL. So I would want to see the world of SQL get merged into the world of real-time analytics and real-time streaming, and I think that's where we're gonna take our platform, but it's still not there yet. So I won't say I have a perfect solution right now. Yeah, you're talking about the future and not the past. He liked the past as it was. It was a journey. Yeah, I like that. We needed to have this journey. But also, things are so much easier today in the data platform engineering world. When we started, in 2015, 2016, a long time ago, a lot of things were missing that people take for granted today. And I think we had to pass through a lot of different architectures and solutions to get to where we are right now. But that was not your question. Okay, I'm reminiscing. So, yeah, I started this podcast in the beginning of 2017.
[00:45:01] Unknown:
And in that time span, everything has changed drastically. I barely even recognize the things that I talked about at that point. I mean, some of them are still around, but, you know, there was a lot of stuff where it was like, oh, let's dig into the Hadoop ecosystem and interview somebody from there. That's just not even a thing anymore.
[00:45:15] Unknown:
It is. No, but not really. Yeah. You'll find someone in a data center somewhere, below ground, that you can talk about Hadoop with. So, yeah.
[00:45:24] Unknown:
Now that you have a functional data platform, one that is obviously constantly evolving, what are some of the most interesting or innovative or unexpected ways that you've seen it used by your downstream consumers?
[00:45:35] Unknown:
I think one use case we can mention is actually a trend that started with the data platform and grew way beyond it: our use of Debezium. We use Debezium for CDC over the databases, like we described before, and it caught people's eyes in Yotpo, and they started using it for outbox patterns. So now it's a big thing for inter-service communication in Yotpo. And they built amazing applications
[00:46:04] Unknown:
based on the outbox pattern, whether it's scheduling solutions or all kinds of very complex problems, using this component. So that's something we're really happy we integrated. Terraform for Kafka, and the use of Terraform in general, is something we adopted quite early on in our infrastructure. Everything is infrastructure as code, and we work really hard at that. By the way, that's one of the things we check when we select a tool: can we manage it as code? And if not, we build a solution around it to make sure it's possible. Because doing drag and drop in the UI is great for a small startup, but for a company with, I don't know, a thousand people, it just gets way too complicated, and things go missing.
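As a rough illustration of the outbox pattern mentioned above: Debezium ships an outbox event router transform that turns rows written to an outbox table into per-aggregate Kafka events. The sketch below registers such a connector through the Kafka Connect REST API from Python; the hostnames, credentials, table names, and exact property set (which varies across Debezium versions) are assumptions for illustration only, not Yotpo's actual configuration.

    import json
    import requests

    connector = {
        "name": "orders-outbox-connector",
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": "mysql.internal",
            "database.port": "3306",
            "database.user": "debezium",
            "database.password": "********",
            "database.server.id": "184054",
            "topic.prefix": "orders",               # "database.server.name" on older Debezium versions
            "table.include.list": "orders_db.outbox",
            # The outbox event router turns rows in the outbox table into
            # events routed by aggregate type.
            "transforms": "outbox",
            "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
        },
    }

    # Kafka Connect exposes a REST API for creating connectors.
    resp = requests.post(
        "http://kafka-connect.internal:8083/connectors",
        headers={"Content-Type": "application/json"},
        data=json.dumps(connector),
        timeout=30,
    )
    resp.raise_for_status()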
But I would say it's unique that our support engineers use the data lake. They can debug features and debug issues with our customers directly on top of the data lake, which I don't think a lot of companies are doing. They're investigating the data there, joining data, and they're able to create their own reports and views and everything. So I think that's an interesting use case. I think R&D in general is using the platform a lot to do things that they find complicated to do in their own application infrastructure. Not always the best decision, but at some point they're like, okay, we want to create this feature where we need data from a lot of different places, and then we need to push it into Elasticsearch.
Alright, so do it event based, you know, do it like you normally do software, right? But they're like, well, we don't have the time, we don't have the resources, and there are a lot of synchronization issues when you're dealing with event-driven architecture. So they're like, let me just write these SQL joins, I'll run them every hour, or every ten minutes, or whatever, and then push the results in. And they have been using it for many years now, and they're building a lot of these different features, I think more quickly than they would if they had to do it on the application side. Again, it's just an example. I'm not saying I'm a huge fan of this, because I think it's kind of abusing the platform.
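For readers unfamiliar with the pattern being described, here is a sketch of what such a scheduled job might look like: a Spark SQL join over lake tables whose result is upserted into Elasticsearch through the elasticsearch-hadoop connector. The table names, index, hosts, and the SQL itself are invented for illustration; this is not Yotpo's actual pipeline.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hourly-es-sync").getOrCreate()

    # Join a couple of lake tables with plain SQL, the way an R&D team might,
    # instead of building an event-driven integration.
    result = spark.sql("""
        SELECT o.order_id,
               o.store_id,
               c.plan,
               SUM(o.amount) AS total_amount
        FROM lake.orders o
        JOIN lake.customers c ON o.customer_id = c.customer_id
        WHERE o.updated_at >= date_sub(current_date(), 1)
        GROUP BY o.order_id, o.store_id, c.plan
    """)

    # Requires the elasticsearch-hadoop (elasticsearch-spark) jar on the classpath.
    (result.write
        .format("org.elasticsearch.spark.sql")
        .option("es.nodes", "es.internal:9200")
        .option("es.mapping.id", "order_id")      # document id, so reruns upsert
        .option("es.write.operation", "upsert")
        .mode("append")
        .save("orders-summary"))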
But it's a use case, and it's unique, I guess. I also think that,
[00:48:07] Unknown:
because we made big data accessible, which was not accessible at all before, and we have Metorikku, which is really just... The enabler. Yeah, it's a great enabler. It's a plug-in architecture where you have all sorts of supported inputs and all sorts of supported outputs, which you can expand and add more to. So people started using this big data to build small, cute features. Galleries, for example: you listen to the click stream coming in, you see whenever some consumer, an end user, arrives at some gallery on the product page or gallery page in some store, and then you drive that data all the way through to the admin side, so the store owner can see that someone is visiting the gallery at that moment. So there are all kinds of nice features that people could just innovate and build for themselves.
[00:48:55] Unknown:
We made Spark structured streaming something that anyone can use to create their own data pipelines within minutes. I think today they are not doing it anymore; they stopped building these kinds of features. But, like, a new one did. Oh, really? Yeah. Okay. We had an innovation week, so we created a lot of these things. At that point structured streaming was the thing. Since then, we've discovered that Spark structured streaming is not the greatest streaming architecture in general, but it was very easy to integrate.
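A minimal sketch of the kind of small structured streaming feature described above (the live gallery views example): read click-stream events from Kafka, keep the gallery page views, and forward them to a topic an admin-facing service could consume. The topic names, schema, and sink are assumptions for illustration, not the actual Yotpo implementation.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("gallery-live-views").getOrCreate()

    event_schema = StructType([
        StructField("store_id", StringType()),
        StructField("page_type", StringType()),
        StructField("gallery_id", StringType()),
        StructField("event_time", TimestampType()),
    ])

    # Parse the click-stream topic into typed columns.
    clicks = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka.internal:9092")
        .option("subscribe", "clickstream")
        .load()
        .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
        .select("e.*")
    )

    # Keep only gallery page views and forward them to a topic that the
    # admin side can consume to show "someone is viewing this gallery now".
    gallery_views = (
        clicks.filter(F.col("page_type") == "gallery")
        .select(F.to_json(F.struct("store_id", "gallery_id", "event_time")).alias("value"))
    )

    (gallery_views.writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka.internal:9092")
        .option("topic", "gallery-live-views")
        .option("checkpointLocation", "s3://bucket/checkpoints/gallery-live-views")
        .start()
        .awaitTermination())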
[00:49:23] Unknown:
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite, including product analysis, user funnels, feature flags, experimentation, and it's open source so you can host it yourself or let them do it for you. You have full control over your data, and their plug-in system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog.
[00:49:58] Unknown:
And so in terms of your own experiences of going through this journey of iteratively and exploratively building this new platform and working with your end consumers to shape the direction and the feature set, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?
[00:50:15] Unknown:
We talked about this before, about self-service, but I think one of the discoveries that we've made, and a journey that we've made internally, is treating our customers with respect, I would say. At the beginning, we were this really technical team where people were like, oh yeah, big data, I'm doing all this really cool stuff, just use it, it's so easy and so cool; oh, you can't use it? You're stupid. I think that was the approach, and I think it's the approach of a lot of infrastructure teams around the world. Did you read the logs? Why didn't you read the logs? Logs. No one can read them. Yeah. But we were like, well, it's so easy: oh, you lost an executor because of memory, how did you miss it? You're so lame because you're just this normal developer, not amazing like us. I think infrastructure teams in general have this disease of thinking they're above everybody else. But the thing that changed, and I think what we learned, is that there are so many different types of developers and so many different types of people in an organization, and they can be really good at something but not really good at this specific thing. The fact that we really like infrastructure doesn't mean everyone likes it, and it doesn't mean everyone can orient themselves in it. They're frightened by it, because it's something they don't know. And we've been living with this for the last six years, going on ten.
So for us it's so obvious how to use the system, how to utilize it, and how to understand it. And now we're switching our approach and thinking about them first. Why is it not so easy? Why are there all those logs? Why do they even need to read the logs? Why can't we translate it into something that they can understand? Not because they're stupid, just because they do something else. They have a different
[00:51:55] Unknown:
role than us. And they don't have the capacity to learn this. Yes. And they don't want to learn this. They really don't want to, and it's fine. I think that's a really big lesson for us. You know, I think what you're saying is basically, as you said before, that we're building products and we have users. And I think that one of the things that is most challenging for us is that we're also, I don't know, the tech leads building our own roadmap, and we're our own product managers designing our product, and we are also the developers building this infrastructure.
And I think that having all of this under one roof, in small teams, is in itself very, very challenging.
[00:52:36] Unknown:
And that does bring up the question of how you approach customer education. As you're selecting these different tools, integrating them together, and figuring out what the workflow and the user experience should be, how do you actually approach the education and onboarding of the different users in the organization, to make sure that they are able to effectively take advantage of the tools and capabilities that you're providing?
[00:52:58] Unknown:
This is a work in progress. For the people from Yotpo listening, we're
[00:53:01] Unknown:
do you wanna talk about Backstage, if that's what you wanna talk about? Also, yeah, Backstage is a cool example. But in general, first of all, it's about culture, and it's about partnership. We try to get them on board as quickly as possible, before we build these spaceships, right? We bring them in as design partners, get them into the conversations and discussions, and ask them what they need, not what we want to build or what's cool. We stopped asking ourselves that question. And I think that's an important part, because then, once we get to the education stage you just talked about, it's not that hard. We already have an advocate on their side, someone who was part of the project from the beginning, and they can explain the project and why it's really good. And so, in terms of education, we do these sessions, we have recorded video sessions, we have a lot of documentation.
And also self-service: Jenkins pipelines, or bots, or whatever. Also tooling that helps them automate stuff. Whatever we can automate, we're trying to do right now. Not by education, but just by, you know, automating it. Simplifying. Yeah, simplifying it rather than explaining it. As was just said, I think Backstage is where we're going to see the next generation of internal education, because Backstage is this really cool enabler. Backstage, you know, is a project by Spotify. It's a really cool project, and it's very extensible; it has a lot of different plugins. We started implementing it at Yotpo a couple of months ago. It's been really successful, and part of it is the centralized documentation feature.
And, of course, with that knowledge just there, you don't have to ask people questions about the different services, the different parts, the APIs. Everything that a service can expose is there. It's making portals cool again, right? Yeah. No, by the way, it looks really bad; I don't know if you've seen it lately. Again, we're using Roadie, a managed Backstage solution, just because we don't want to manage it ourselves, and it's good enough. Again, we chose a managed solution just because it's easier for us to adopt.
[00:54:57] Unknown:
But I think that's going to be a really great way to push education and make the onboarding process easier, and I hope it's going to be great. As you continue to build and iterate on your platform, what are some of the things you have planned for the near to medium term, or any particular projects or POCs that you're excited to dig into?
[00:55:16] Unknown:
So I think we discussed a few already. We talked about real time; that's a big issue for us. We talked about governance. And the biggest project we're working on currently is Yoda, a dbt-based project, which in our vision would basically mean rewriting all of the data lake. So this is a big thing for us. And for the past several months, we've been working on the migration from Nomad to Kubernetes. These are good examples of the diverse tasks that we have in our infra teams; it varies a lot. I wanna add something about Yoda, which is that it's the output data application. Output data applications. Yes.
[00:55:55] Unknown:
So we are taking a really cool approach there, and it kind of encapsulates a lot of different things that we probably would have done outside of it, but right now we're doing them as part of this project. So we looked at data modeling at the beginning, and we started talking about how you create a model, how you write it, and how you document it, but also how you test it. And I think that's one of the things that is really missing right now, in a lot of different places in the data world: testing. So we're gonna do a lot of work around testing here. Everybody has started doing tests, right? But we were probably first to do it; we have had unit tests on data pipelines from the beginning.
We had DQ tests as part of the data pipeline tests for a very long time, so we're doing the same there. But we're trying to rethink how you test a pipeline properly without affecting production. So we're using things like Great Expectations, and lakeFS, so you can run data pipelines without actually affecting production and make sure that everything works, or collectively update a bunch of data pipelines together with lakeFS, which is also really cool: you can test a lot of different pipelines together as part of a single process. You know lakeFS? Yes, I've actually had them on the podcast a little while ago. Oh, cool. Okay. Yeah, we love them. So these are the things that we're running. But also, how do you test locally? I don't think that's been answered. dbt took the approach of, oh, you have production data, you can use it, but sometimes the data is so big. How can you run it in dev? Are you going to rebuild the entire dataset in your dev environment? That's not possible. So we're doing these mocks, and different auto-generated sources, and creating something small, a smaller world, so you can do unit testing or integration testing.
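As one concrete (and entirely hypothetical) example of the "smaller world" idea: a local unit test for a SQL transformation, where a tiny mocked input replaces production data and the pipeline's SQL is asserted against an expected result. The SQL, table, and names are invented for illustration; this is not Yotpo's actual test framework.

    import pytest
    from pyspark.sql import SparkSession

    ORDERS_PER_STORE_SQL = """
        SELECT store_id, COUNT(*) AS orders
        FROM orders
        WHERE status = 'completed'
        GROUP BY store_id
    """

    @pytest.fixture(scope="session")
    def spark():
        # Small local session; no cluster or production data involved.
        return (
            SparkSession.builder
            .master("local[1]")
            .appName("pipeline-unit-tests")
            .getOrCreate()
        )

    def test_orders_per_store_counts_only_completed(spark):
        # Mocked input instead of reading from the lake.
        mocked = spark.createDataFrame(
            [("s1", "completed"), ("s1", "completed"), ("s1", "cancelled"), ("s2", "completed")],
            ["store_id", "status"],
        )
        mocked.createOrReplaceTempView("orders")

        result = {row["store_id"]: row["orders"] for row in spark.sql(ORDERS_PER_STORE_SQL).collect()}

        assert result == {"s1": 2, "s2": 1}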
So we're really into that, and that's also part of the dev process. And data CI/CD in general is a real topic for us. I think we're doing a lot there, but we're also taking an end-to-end approach with data applications. It's not just about creating a pipeline; that's great, and now the data is in the data lake and people can use it, and that's a really huge step. But how do we make that accessible to people that don't know SQL, that don't really know how to access data? We want them to have access to data as well. If someone modeled something and invested a lot of time in modeling, why not have that data already exposed to people around the organization, like people from finance or upper management? So we're also working with dbt to expose it in Looker or in some other dashboards, so they can drag and drop data. And because that data was heavily tested, it's documented, and it was built by someone who understands the domain,
that person can trust it, and they can drag and drop whatever they want. They don't even need an analyst with them to make sure they understand it, because it's all self-explanatory; the data is self-explanatory. So I think that project encapsulates a lot of the future of how we see our data platform evolving.
[00:58:42] Unknown:
Yeah. So it's, like, true self-service analytics. It's also about, like, shipping it in the shortest cycles possible with, like, smoothing out all the friction that you have when you're developing, testing, and deploying.
[00:58:58] Unknown:
And consuming.
[00:58:59] Unknown:
Yeah. Because most of what we've talked about until now is around people who know SQL, and whether they can self-serve: yeah, you have Databricks, so it will work. But we talked about 500 people; there are at least 500 more who would love to work with self-service analytics. And think of a world where there's a new feature, and as they're building this feature, there's a developer building the dbt model alongside the feature. They can release it right after data starts getting generated. And in the CD,
[00:59:30] Unknown:
we automatically create this explorer that the product manager can access and see the metrics flowing. That's wild, I think. Yeah. It's gonna be wild. Well, for anybody who wants to get in touch with you and follow along with the work that you're each doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:59:52] Unknown:
I think we kinda mentioned it before, but our problems today are around good real-time analytics solutions. And this whole thing around governance, we don't see it solved yet. We have many different solutions, all covering just a part of what data governance is about, and then you have all this overlap between the tools. And they all cost so much; you don't want to buy several of those $200k solutions and have it all mashed together. So, yeah, that's what bothers me.
[01:00:25] Unknown:
So I agree, definitely. I think for me, it's that we're this team of, right now, around nine people doing this work all the time. And every company is doing the same with fewer people, or more people, or sometimes ten times more people. It just sounds crazy that everybody's doing the same thing, and they're all keeping people who are not really directly contributing to the business aspect of their companies, just doing infrastructure. We might as well be working at any other company right now; we're not doing anything that is Yotpo specific. But we're doing it for Yotpo because we know what's good for Yotpo, we know what it needs. So I think it just became so complicated, this era. Before, there was nothing, or there were data warehouses that we were locked into, and you couldn't do anything outside of them. And right now it's just this wild west of tools, and you need a lot of people to manage it.
So I would like something to unify them all, or make them work better together, so you don't have to keep so many people on payroll doing the same job for different companies. You want us all fired. Yes, in general. What about our job security? No, no. But I think that's what's missing. It's just too chaotic right now. Yeah. And everything else that was already said. I really liked the episode you did summarizing last year; you had this panel, and I think we really related to the stuff you were talking about, like what's going on in the domain
[01:01:58] Unknown:
out there and what the solutions look like. Yeah, I think it's a really interesting topic to see how the ecosystem evolves over time, what solutions get born, and which ones stick and stay. And, yeah, it's really interesting.
[01:02:10] Unknown:
Also, by the way, I'm adding to my answer that I think data trust is still not solved. There's still a long way to go there, to have better automation around understanding errors in our data, something that's more of a hands-off approach. And I think Monte Carlo is on the right track, but there's still a long way to go. But those companies are mostly focused on the masses.
[01:02:39] Unknown:
Again, the masses don't do what we do. Yeah. So we always get pushed aside, and it takes time until they get to what we're doing, and by then we've already built something crappy ourselves. Yeah.
[01:02:51] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've both been doing at Yotpo, building out this data platform architecture, and for sharing your experiences around that. It's always a valuable view into all of the processes that go into actually building, managing, and running these systems, and it's a process that I'm going through myself as well. So I appreciate all of the insight you've been able to share, and I hope you each enjoy the rest of your day. Thank you, Tobias. Thank you, Tobias.
[01:03:24] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Interview with Doron Porat and Liran Yogev
Yotpo's Data Platform Architecture
Team Structure and Self-Service Data Platform
Evolution and Challenges in Data Platform Development
Key Metrics and Optimization Strategies
Experimentation and Tool Selection
Current Architecture and Legacy Systems
Interesting Use Cases and Lessons Learned
Future Plans and Projects
Biggest Gaps in Data Management Tooling
Closing Remarks