Summary
Building a data platform is a journey, not a destination. Beyond the work of assembling a set of technologies and building integrations across them, there is also the work of growing and organizing a team that can support and benefit from that platform. In this episode Inbar Yogev and Lior Winner share the journey that they and their teams at Riskified have been on for their data platform. They also discuss how they have established a guild system for training and supporting data professionals in the organization.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you’re ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they’re good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today!
- Your host is Tobias Macey and today I’m interviewing Inbar Yogev and Lior Winner about the data platform that the team at Riskified are building to power their fraud management service
Interview
- Introduction
- How did you get involved in the area of data management?
- What does Riskified do?
- Can you describe the role of data at Riskified?
- What are some of the core types and sources of information that you are dealing with?
- Who/what are the primary consumers of the data that you are responsible for?
- What are the team structures that you have tested for your data professionals?
- What is the composition of your data roles? (e.g. ML engineers, data engineers, data scientists, data product managers, etc.)
- What are the organizational constraints that have the biggest impact on the design and usage of your data systems?
- Can you describe the current architecture of your data platform?
- What are some of the most notable evolutions/redesigns that you have gone through?
- What is your process for establishing and evaluating selection criteria for any new technologies that you adopt?
- How do you facilitate knowledge sharing between data professionals?
- What have you found to be the most challenging technological and organizational complexities that you have had to address on the path to your current state?
- What are the methods that you use for staying up to date with the data ecosystem? (opportunity to discuss hayaData conference)
- In your role as organizers of the hayaData conference, what are some of the insights that you have gained into the present state and future trajectory of the data community?
- What are the most interesting, innovative, or unexpected ways that you have seen the Riskified data platform used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data platform for Riskified?
- What do you have planned for the future of your data platform?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Riskified
- ADABAS
- Aerospike
- Neo4J
- Kafka
- Delta Lake
- Databricks
- Snowflake
- Tableau
- Looker
- Redshift
- Event Sourcing
- Avro
- hayaData Conference
- Data Mesh
- Data Catalog
- Data Governance
- MLOps
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a t l a n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Inbar Yogev and Lior Winner about the data platform that they and their team at Riskified are building to power their fraud management service. So, Inbar, can you start by introducing yourself?
[00:01:40] Unknown:
Hi there. I'm Inbar. I'm a data architect and 1 of the,
[00:01:45] Unknown:
I'd say, founding members of the data platform in Riskified. And how about yourself, Lior? So I'm Lior Winner. I lead 1 of the data platform teams here in Riskified. I've been working here for the past 5 and a half years.
[00:01:58] Unknown:
Joined Inbar as the 3rd member of the data team. Going back to you, Inbar, do you remember how you first got started working in data?
[00:02:05] Unknown:
Yeah. Well, actually, over 20 years in the data business, as you may call it. I started as a DBA. We had Adabas and Oracle, like very young Oracle, like Oracle 6, which later grew to 9, 10, 11, 12, whatever. And I've been through several roles, seeing multiple customers as a database consultant and being involved in developing data applications. I found myself in Riskified as, like, 1 of the developers, a full stack developer with data expertise. It was obvious that someone needed to focus on building the data platform, training the organization, and doing data enablement in the organization. And I've held this role
[00:02:53] Unknown:
ever since. And I've been doing it for 8 years now. And, Lior, do you remember how you first got started working in data? I started working with data around 12 years ago. Back then, the role was called BI developer. I was building legacy data warehouses and analytical tools using frameworks such as Cognos, OLAP cubes, and Microsoft SSIS ETLs. And later, I studied computer science and kind of, like, mixed both areas together and started working as a data engineer at Riskified. As I said, it was 5 and a half years ago, and that's it. Ever since, I've worked at Riskified inside the data platform group. I was a data developer at the beginning, later started leading 1 of the data platform teams, and now I'm leading the data guild at Riskified.
[00:03:40] Unknown:
And so before we get too far into the specifics of what you're building at Riskified, I'm wondering if you can just give an overview about what it is that the business does.
[00:03:49] Unknown:
So Riskified is currently 1 of the leading solutions for protecting merchants in the e-commerce realm. We have an AI-based platform, which is like the core of our products, that protects them. We analyze their transactions in near real time and approve or decline them, which helps them generate growth.
[00:04:09] Unknown:
And so because of the fact that 1 of the core elements of the product that you're building is this AI engine, obviously, data is a very important capability in the business, but beyond the obvious use case of supporting that AI engine for the fraud analysis, what are some of the other ways that data plays a role at Riskified?
[00:04:31] Unknown:
Probably everybody says this, but Riskified is a data driven organization. But it's a really data driven organization, mainly for the fact that data science is, like, the core of the product. Right? So by that, you will see a lot of data professionals hanging around, and very many people accessing the data on a regular basis. And you see it both at the upper management and at the technical level. Decisions are being made based on data. And what we do with the data platform team is facilitate all the data storage, access, and anything that relates to data processing.
[00:05:10] Unknown:
So in terms of the types of data that you're working with, I'm curious if you can enumerate the different sources that you're working with, the form that that data takes, the, you know, relative volumes and variety that's involved, and just some of the inherent complexity that comes about because of the types and sources of data that you're working with?
[00:05:31] Unknown:
We are doing fraud detection for orders. Right? So mainly, our entity is an order. And if we look at our data architecture, it's comprised of multiple data stores of various types. You'll see Elasticsearch for doing near real time BI and for search capabilities over near real time data. You have an RDBMS, PostgreSQL, for storing our order data for long term purposes. We also have Aerospike, Neo4j, and we keep expanding that as the need arises. For streaming, meaning data publishing and asynchronous microservice communication, we use Kafka. And this is what Lior's team is in charge of. For anything that relates to big data analytics, we have a data lake based on S3 and Delta Lake. We have Databricks. We have Snowflake as the data warehouse.
And our analytic tools include SQL and Tableau as well. We have Looker on the horizon.
[00:06:33] Unknown:
I find it interesting that you're using both Snowflake and Databricks because of the fact that so many people have kind of posed them as direct competitors as they start to edge into each other's markets. And I'm wondering if you can talk to some of your experiences there. This takes us to the data history in Riskified. We came out of Redshift, Amazon Redshift,
[00:06:51] Unknown:
which used to be everything we needed for big data. You could query it all. All the raw data is there. All the fine data is there. All the aggregations are there. You can join anything to anything. It was easy enough to do anything you want in data. But then it became too expensive, and not very cost performant, you could say. So we said we need to store long term, barely used data on cloud storage. And then you need to think, how do you access this data? Right? You need an access layer. And since we initiated working with Spark for the entire organization, it only makes sense that we have a platform that allows non engineers to use Spark for data access. This is why Databricks is there. And we encourage everyone who needs data from the data lake to use Databricks for any kind of data exploration, building data pipelines, or whatever.
So it's not always an obvious answer. Whenever someone asks us, where should I go? How do I access this data? It's not obvious whether to go to Snowflake or go to Databricks. It depends on the data that you need, because we keep the fine data in Snowflake and the raw data in the data lake. It depends on your use case. So normally, if someone comes to us with this question, we'll have to see what they actually need and point them in the right direction.
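As a rough illustration of the routing described above, the following minimal Python sketch shows the two access paths side by side: querying refined tables directly in Snowflake versus exploring raw Delta Lake data with Spark on Databricks. The account, table, and path names are hypothetical placeholders rather than Riskified's actual configuration.

```python
# Minimal sketch of the two access paths described above; account, table,
# and path names are hypothetical placeholders, not Riskified's actual setup.

# Path 1: refined data in Snowflake, queried directly with SQL.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",       # hypothetical account identifier
    user="analyst",
    password="********",
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="ORDERS",
)
cur = conn.cursor()
cur.execute("SELECT order_id, decision, decided_at FROM refined_orders LIMIT 10")
for row in cur.fetchall():
    print(row)

# Path 2: raw data in the S3/Delta Lake data lake, explored with Spark
# (for example from a Databricks notebook, where a SparkSession already exists).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
raw_orders = spark.read.format("delta").load("s3://data-lake/raw/orders")  # hypothetical path
raw_orders.filter("created_at >= '2022-01-01'").groupBy("merchant_id").count().show()
```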
[00:08:07] Unknown:
And as far as the organization of the data professionals and the ways that you're working with data and who the downstream consumers are, I'm curious if you can talk to some of the ways that that manifests in the organization and how you think about the structures as far as how to lay out those different teams and how the use of data informs the overall structure of how you define the different contracts and interfaces between those teams.
[00:08:35] Unknown:
If we're talking about the greater data organization at Riskified, we are probably talking about more than 300 people. On the analytics side, we have the data science department, which is doing everything from data exploration, performance research, and feature engineering to model training and model training automation. We also have BI, which is a significant and important consumer that is transforming the raw data that we are producing to the data lake, or the raw data at Snowflake. They are transforming this data, digesting it, and producing new layers of data inside the data warehouse, inside Snowflake, which are the single source of truth for traditional consumers.
They are building KPIs. They're building dashboards. They're providing a wide range of services and bringing a lot of value to the product themselves. So we have data analysts in the operations, marketing, and sales departments. They are working closely with the data platform product in order to bring value to their product and the business. And last but not least, we have the dev organization, which, on 1 hand, are the data producers. They are creating the applications that are constantly streaming data to our data platform and later to the data lake and Snowflake. But they also have their own use case where they are consumers of their own data. When they are building data pipelines, in a reverse ETL method, they consume their own data on the offline side, transform it, and later push it back to their own online systems for online use cases.
[00:10:12] Unknown:
Because of the fact that you do have so many people working with data, I'm curious what the evolution has been going from when you first started there, Inbar, as 1 of the first people working on data infrastructure to where you are now as far as how that sort of team topology has progressed, where I'm assuming it started as, you know, 1 or a small handful of people working with data to now having multiple different dedicated, very kind of specialized teams working with different areas of the data infrastructure and data analytics, etcetera.
[00:10:44] Unknown:
If we take a brief look at the history of the greater data organization at Riskified: first, we had the first data team that Inbar started, like, 8 years ago. And this small team was responsible for everything, end to end, in the data platform world. Like, all of the data products were in our ownership. Even the domain owners' data processes were under our ownership. And later, when new use cases arose, and together with the growth of the company, we had to build additional expertise across a really wide range of technologies. And then we realized that 1 team cannot serve everything anymore. So we had to split this small team into 2 teams at the beginning. And later, we created an additional team. So today, we have, like, 3 teams in the data platform group. But I think if we are looking only at the data platform, we are looking at just 1 side of the data organization at Riskified. On the other end, we have all of the data consumers that we talked about and how they are accessing the data platform products.
1 thing that we realized 2 years ago: we started to feel some kind of pain between the data users and how they were using the data platform. So there was some kind of a gap between the users and the platform. And about 1 year ago, we started thinking about a solution for this gap. And an idea we had in mind was forming a data guild. We wanted to create a community of our users.
[00:12:21] Unknown:
If I'd sum this up, you need a team that can build the platform. And for it to scale well, you need to be as self-service as you can on anything related to platform tools. And you need users who can actually use the platform well. So they know how to model data. They know how to do efficient data access. And they can work with their own teams and train them and guide them, and approach us, the data professionals,
[00:12:47] Unknown:
for help whenever they need. So this is like the kind of organization we're trying to build now. Yeah. I like the idea of the sort of guild system in this data capacity, especially since it's very difficult to be able to hire people in who are experts in all of the different technologies that any given organization might use, or just even experts in the principles of kind of distributed systems and data management and all of the myriad things that you need to have a, you know, solid grounding in to be able to work effectively in data. Exactly. And so the fact that you are making it kind of a core organizational capability to be able to train people internally into those roles and facilitate that kind of learning and continuing education to be able to gain that capacity and remain effective is a very interesting and, I think, a very insightful way to approach the problem.
[00:13:45] Unknown:
And at certain points in an organization's life cycle, it's like the only way to go. I think we've gone too big, as we said; 300 people using SQL on a daily basis is challenging.
[00:13:56] Unknown:
Because of the fact that data is so core to all of the work that you're doing and you have so many different users consuming the data that you're working with and manipulating it, I'm curious what are the organizational constraints that have had the biggest impact on how you think about the design and usage of your overall data platform?
[00:14:14] Unknown:
So I think that 1 of the most interesting and most challenging things we have when we create a data platform is thinking about the users that are going to use it: who the users are, what kind of expertise they have, and how they are going to use the platform. So as we see it, a data person should create value for the company, whether his vision is enabling growth or improving the product. For this to happen, we need a couple of things inside our platform. The first thing is we need to provide data that is easy to access, reliable, and high quality.
The user needs to know where he can find the raw data and the more refined data in the data warehouse. And the next thing that our platform needs to provide in order for this to happen is the data catalog product, which is a tool for data discovery. We need to make our users' lives easier with a data catalog solution that will help them find the relevant data, tag his own data, or even use digested data that someone else has already solved for him. And then he can just reuse the datasets. And the next thing we need to make sure of is that the users have the right expertise. The more data we collect and the more tools we have, the more complex the interfaces to our data platform become.
So we need to make sure that the users are using the data platform in the correct way in order to get a good user experience. We need to make sure that we provide tools that give them good performance with cost effectiveness in mind. We need to protect our data platform from high loads and stuff like that. And 1 of the most powerful things that we can do and provide is self-service tools, right? So we can think of common tasks that our users are doing and create automation for that, and allow non tech savvy people to access data easily with all of the best practices in mind around reducing load on the platform, reducing the cost of their queries, or anything else. Efficient data storage and data modeling? Yeah.
[00:16:22] Unknown:
Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end to end with a mix of your code and their open source low code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you're ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they're good at, solving problems with data. Go to dataengineeringpodcast.com/shipyard today to get started automating with their free developer plan.
As far as the ways that you have approached the solution to those constraints and being able to power all of the different data use cases at Riskified, I'm wondering if you can now talk through some of the architectural elements and design considerations that you've put together into your data platform and some of the journey to where you are now.
[00:17:32] Unknown:
We've kinda looked at it as, like, a tiered architecture, where the leftmost part is the producer side, where data is being created. You're talking about the online systems, stateful applications which have their own database. Therefore, you need a CDC solution, where we're currently implementing Debezium for that purpose. You also have streaming applications producing data amongst themselves and also into the offline data realm for data analytics. Event sourcing has become a thing. We used to think very stateful, where everything has a state, like a final state. It took a lot of adoption around the organization to get used to that notion. But now you can see how the entity changed over time. It required a lot of education, and we're still working on that. We use Avro for data publishing, which helps us avoid breaking changes in the schema and puts the actual ownership and responsibility on the producer itself. We talked about self-service tools. This is, like, the first element in the self-service tools. The producer builds his own schema.
He knows if it's broken or not. We don't have to do anything about it. And we can trust that it's not going to break over time. For the 2nd tier, we're talking about a streaming platform to allow both asynchronous communication between microservices as well as data publishing. So we have, as we said, Kafka in the center of our architecture. We have Kafka Streams. We have Kafka Connect. And for data publishing, we use our own built Spark stream. Again, self serve. Once you've built your schema, you can generate a new stream that will publish everything into the data lake, into a data format. Now, I'm jumping into the next tier, which is, like, the analytics storage tier. We have S3 for raw data and Snowflake for fine data, as we mentioned before.
And for accessing this data, you need some kind of compute layer. So we use Snowflake for accessing Snowflake data, and we use Databricks and Spark jobs for accessing the data lake data. Then you have the usage, or analytics, layer. So we have Tableau, and we just started migrating some of the Tableau workload into Looker. I think these are the main parts of our data platform.
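To make the publishing tier a little more concrete, here is a minimal sketch, not Riskified's actual code, of the kind of self-serve stream described above: a Spark Structured Streaming job that reads Avro-encoded events from a Kafka topic, decodes them with the producer-owned schema, and appends them to a Delta Lake table on S3. The topic name, paths, and schema file are hypothetical, and the sketch assumes the spark-avro and Delta Lake packages are available on the cluster (as they are on Databricks). A production version would also need to handle schema registry framing, partitioning, and bad records.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.avro.functions import from_avro

spark = SparkSession.builder.appName("order-events-publisher").getOrCreate()

# Producer-owned Avro schema (JSON), e.g. exported from the schema store.
with open("order_events.avsc") as f:
    order_schema = f.read()

# Read the raw Kafka stream; the topic name is a hypothetical placeholder.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "order-events")
    .option("startingOffsets", "latest")
    .load()
)

# Decode the Avro payload into columns using the producer's schema.
orders = events.select(from_avro(col("value"), order_schema).alias("e")).select("e.*")

# Append the decoded events to a Delta table in the S3 data lake.
(
    orders.writeStream.format("delta")
    .option("checkpointLocation", "s3://data-lake/_checkpoints/order_events")
    .outputMode("append")
    .start("s3://data-lake/raw/order_events")
)
```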
[00:19:49] Unknown:
Because of the fact that you have so many people relying on this platform that you're building, obviously, you need to ensure that it is stable and that you're very deliberate in the technologies that you choose and the ways that you approach introducing them. So I'm curious if you can talk to some of the evaluation criteria and some of the ways that you think about selecting those different technologies and how to integrate them into the platform and try to create this kind of unified interface for people to be able to understand what are the responsibilities of each of these different components and how do they all work together.
[00:20:24] Unknown:
So I think that the first step, when we are talking about a new technology, is connecting it to a business goal or a need inside the organization. We learned from experience that we first need to establish the business need in order to later get prioritization from the domain owners who are eventually going to migrate to this new technology. We can look, for example, at Aerospike, which is a new in-memory key-value database that we adopted at Riskified. There we looked at business goals like cost optimization, performance improvement, and being cloud agnostic. And Neo4j, which is an additional new joiner to our stack, also came from a business requirement from our users.
So I think we have some kind of checklist that we go through in the process of evaluating a new technology. The first thing that we will check is the cost. We'll try, like, to create some kind of cost estimation: whether it's a fully managed solution, where we are going to start working with a new vendor, understand its business model, and try to understand how much we are eventually going to pay for this solution, or whether it's an open source solution that we are going to adopt and develop on ourselves, with some kind of learning curve to implement this technology in our technology stack. So cost is a really important factor in our process.
The next thing is the performance or scale of the technology that we are evaluating. We need to make sure that it answers the business requirements we have today. And because we are a constantly growing company, we have to make sure that this product is scalable and can serve our requirements in the near and longer term future. Production readiness is the next thing we check. We need to make sure that the product is mature and answers all of the security and compliance requirements that we have. And the last thing is checking out the community around the technology. We want to make sure that a lot of people use this product. We want to learn from other companies' experience, what kind of companies are using it, and for what purposes.
And after we have gone through all of the checklist, we start a POC phase. The POC phase will probably involve several technologies that are comparable. We will go through all of the parameters that we talked about and compare how every technology answers each parameter. We will do some kind of validation process. And the last thing is, like, getting to the decision making step, commercial terms, and stuff like that, before we start planning the real implementation phase inside the company, how it is going to affect our users, and what kind of migrations we have to do.
[00:23:19] Unknown:
1 more thing is that you're not married to the technology at the end. Right? So you need to keep in mind that you might have chosen based on some criteria, and what you normally see when embedding a new system, a new technology, in your stack is that it doesn't always work as it did in the POC. And you sometimes need to think again. You can always work hard and bang your head against the wall until stuff works, and you can always look around and ask other companies what they are doing. And maybe you've taken the wrong path. This is something you have to reckon with and live with.
[00:23:58] Unknown:
And to that point of being able to tap into the community to understand what are the solutions that have worked well and what are the things that we should be keeping an eye out for, I'm curious what are some of the resources that you've been able to lean on as you have grown this data platform, and how you have worked, both internally at your company and within your regional community, to foster the connections that have helped you validate and grow those capabilities for your technology platform and your organizational capacities, understanding both the technical aspects of how do I make this work and the business aspects of how do I make this scale logically and semantically?
[00:24:44] Unknown:
First of all, we try to use our own past experience, both from Riskified and from our people who have experience at other companies. But then we want to expand our knowledge and try to introduce ourselves to new technologies. So of course, we can read everything online. We can listen to podcasts, go to conferences, meetups, etcetera. Then we try to mingle with other companies and hear their thoughts about similar challenges that we are facing. Their POC is our POC, right?
[00:25:20] Unknown:
So we usually learn from their learning curve and try to embed it into our own knowledge.
[00:25:26] Unknown:
Yeah. So we have similar challenges, which creates opportunities to do knowledge sharing. That leads us to how we thought about hayaData, which is a Hebrew pun on the English phrase, did you know? So hayaData was born after we recognized a vacuum in the Israeli data community. We wanted to create a new data conference for the data engineering and data science community in Israel. Riskified is the main promoter, but from the beginning, we had in mind how we were going to add additional companies to the committee. We wanted to create additional connections in the industry. It's a great opportunity for companies like us to mingle, to discuss their challenges, to hear about the successes and the mistakes of other companies that face the same challenges that they have.
And we created this conference for the community. It's a non profit conference. Everything that we do is for the community. We brought together a wide range of people on the conference committee in order to bring an up to date and diverse agenda to our crowd. And of course, we are data people, so we can take a look at some of the numbers from the 2022 conference. We hosted more than 700 people. Together with 18 sponsors, we had 23 talks across 2 different tracks. Most of the content was around data engineering and data science. It was a great conference. We had, like, great feedback from the community.
We're really proud of the end result.
[00:27:11] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Between your work at Riskified and on the hayaData conference, I'm wondering what you have seen as some of the major themes that people are dealing with, whether it's still at the technology level of how do we manage this data, how do we scale things, or is it moving more to how do we improve the end user experience? I'm curious what are kind of some of the main topics that you're dealing with, both internally at Riskified and that folks were addressing in the most recent conference.
[00:28:36] Unknown:
I'll start from the end. Everybody is in the same boat here. And when you talk to other companies, you always see that your challenges are very similar to theirs. And if I'm recognizing, like, hot topics in today's era, it's data mesh and data lake implementations. I'm not sure I know a single company that has, like, a full data mesh, but everybody's talking about it. Everybody's aiming for it. Data cataloging is, like, a super hot topic here. And how do you govern the data at all? Data governance is very hot too, including data quality tools. And maybe MLOps would be the last 1. Everybody's dealing with feature stores, model training, model serving.
And the challenges for companies on our side are more or less the same. This is what Lior said before. The conference is a great place to meet and discuss these topics. And we had a lot of engagement. We screened a lot of abstracts in the screening process. You see young companies; they have the opportunity to dive into new technologies. So you see old companies learning from younger companies, but you also see the vice versa discussion, where young companies that are about to scale approach bigger companies and want to learn about their practices, growing pains, etcetera. So a lot of collaboration here. Yeah.
[00:30:04] Unknown:
Given the kind of slate of issues that so many organizations are dealing with and that they're all going through it together, I'm wondering what you see as some of the potential future trajectory for the data community and some of the upcoming challenges that folks are likely to run into as they start to tackle the ones that you and they are currently addressing?
[00:30:26] Unknown:
1st and foremost for us is cost, cost management, especially in the Kubernetes era, where everything runs in Kubernetes and is apparently cheap because everything is open source. But you need people who know the job to be able to tune Spark jobs on Kubernetes, for example. And I think we're going to see a lot of development in Kubernetes automation in the near and mid term future. And the big question: do I go to a fully managed cloud data warehouse? Do I manage my own data lake? How do I choose exactly which technologies are gonna be in my data stack? This is, like, a question that's gonna
[00:31:06] Unknown:
be here for a while, I think, until someone wins. Right? Yeah. Absolutely. It's definitely a challenging question to answer because of the fact that it's such a moving target where it says, oh, well, I need something that's easy to get started with, but I also need to be able to work with all of this raw data that I have. Well, Snowflake, you know, started in that easy to use path, but they're starting to expand into that data lake use case of, oh, we can offload querying to your raw data that doesn't necessarily live inside Snowflake, and then you've got, you know, Databricks going in the opposite direction. And then technologies like Trino and Presto that are trying to kind of play the middle ground between them.
[00:31:42] Unknown:
And if you utilize Spark, you can't use any of them. You need data in your cloud storage. Right? So, yeah, a question that's going to remain as long as the market is like this. Absolutely.
[00:31:53] Unknown:
And in terms of the data platform that you have built at Riskified and that you're continuing to support and maintain and evolve. What are some of the most interesting or innovative or unexpected ways that you've seen that platform used or some of the types of products or capabilities that have come out of it that you didn't anticipate?
[00:32:12] Unknown:
There's no magic here. But, I mean, normally when we roll out a new technology, we start looking at what's going on, and I can give some examples of what we had in the past. When we rolled out Airflow, for example, for broad use by the entire company, it was not long after that we saw very huge DAGs being developed by the BI team, and contributions to the Airflow platform, which was very nice to see. And we didn't expect it, like, at first. Maybe another example would be our own built anomaly detection system. As we said, we're analyzing orders and making decisions, and anomaly detection is, like, a different way to look. You're not looking at the order level. You're looking at groups of orders, trying to find correlations between them. Once we rolled out Spark for the data science organization, it was, like, I think, 1 of the first projects.
And it opened eyes for everyone when we suddenly saw a very limited anomaly detection capability being scaled, with a nightly job finishing in 30 minutes, which couldn't finish before in 8 hours. And I know it's not, like, an amazing thing to hear, but for us, being facilitators and promoters of technology, it was very nice to see that it is utilized correctly and that there's actually value in it. In your own experiences
[00:33:33] Unknown:
of building and growing the Riskified data platform and working with your internal stakeholders and end users, as well as the work that you're doing at the hayaData conference, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:33:47] Unknown:
I hope nobody hears this, but data science is kind of a traditional organization, technology wise. I'm kidding, of course, because this is where the most advanced technology that we have lives. But we found out that rolling out new technologies, let's say, moving them from running computational stuff on their own laptops using RStudio into running stuff on Databricks, a distributed, cloud based platform, was a hard thing to do. And it required a lot of education, and maybe taking them by the hand, working with them together, and showing them, hey, this could work. I can go with you, like, take the first few steps with you and show you that it works. So you need to be patient when pushing new technologies. It doesn't happen in 1 day. When we replaced Redshift with Snowflake, it was seemingly moving from 1 cloud data warehouse to another, but the adoption was such a long process. We found out we have so much SQL in code that needs to be migrated eventually.
So you need to be patient and you need to keep pushing forward. That's my main take. As you continue
[00:34:59] Unknown:
to support and evolve the data platform and be able to adapt to new requirements and new products and capabilities at Riskified, what are some of the things you have planned for the near to medium term as far as the technological and organizational aspects of what you're building?
[00:35:16] Unknown:
So we said that we are creating a more robust organization using our data guild. It's, like, bridging the gap between the data platform and the data owner teams. And we expect to take this even 1 step further. We want to put full ownership of the data on the domain owners, like in a data mesh kind of idea. We want to give domain owners the full ownership of their own data and their own data processes. We want them to wake up at night when something breaks.
They will have the ownership end to end. They will have the right tools to see how their data flows. They will have the correct data quality tools, the lineage. And then they will be able to run without us. And the data platform can go even to the next step, which is creating independence inside the data platform. We want to keep creating the building blocks that our domain owners can later use to create their own products. We will have the data guild as the supporter for this process. Of course, it's a long term process. We'll have to create the right trainings and the right processes for shifting this paradigm from, like, a data platform owner that owns the data processes to domain owners that are going to own their own data processes and maintain them by themselves.
We have a long journey here. But the end result is going to be a lot more robust and a lot more scalable in terms of our data platform. And of course, a key part of this transition is data governance practices that are going to help us stay in control. Like, we are going to give a lot of freedom and independence to the domain owners. But as the data platform owners, we have to stay in control. So we have to control the data access layers and have the right auditing tools. And I think that's it.
[00:37:17] Unknown:
Well, for anybody who wants to get in touch with each of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think what
[00:37:35] Unknown:
is missing for me is being able to control data access and audit data access in a data lake implementation from any kind of access path, whether it's Spark, Presto, Athena, or whatever: being able to control the access without having to implement it in a separate platform each time, and being able to audit and understand who accesses the data and which data is being accessed. That is kind of a challenge at this point in time. Maybe the second item would be that we have data lakes, and data is lying there in cheap storage. Sometimes you need just random access into this data and to be able to pinpoint a single row. And from what we know at this point, it's not really possible to get this capability within the currently existing tool sets. So being able to have a data store that allows you both to do mass data processing as well as drill down to the single row level would be a very nice addition to the current capabilities we have.
[00:38:37] Unknown:
Alright. Well, thank you both for taking the time today to join me and share the work that you've been doing on Riskified's data platform, helping evolve the technological and organizational capacity to take advantage of data, and on the hayaData conference to help build and foster that community. I appreciate all of the time and energy that you're both putting into that, and for taking the time to share your work here. I hope you have a good rest of your day. Thank you. All the best. Thanks a lot. Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Guests and Their Roles at Riskified
Overview of Riskified's Business and Data Use Cases
Data Sources and Architecture at Riskified
Organizational Structure and Data Team Topology
Challenges and Solutions in Data Platform Development
Architectural Elements and Design Considerations
Evaluation Criteria for New Technologies
Community Engagement and Knowledge Sharing
Current Trends and Future Trajectories in Data Management
Innovative Uses and Lessons Learned from Riskified's Data Platform
Future Plans for Riskified's Data Platform
Biggest Gaps in Current Data Management Tooling