Summary
Building a data platform is a journey, not a destination. Beyond the work of assembling a set of technologies and building integrations across them, there is also the work of growing and organizing a team that can support and benefit from that platform. In this episode Inbar Yogev and Lior Winner share the journey that they and their teams at Riskified have been on for their data platform. They also discuss how they have established a guild system for training and supporting data professionals in the organization.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you’re ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they’re good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today!
- Your host is Tobias Macey and today I’m interviewing Inbar Yogev and Lior Winner about the data platform that the team at Riskified are building to power their fraud management service
Interview
- Introduction
- How did you get involved in the area of data management?
- What does Riskified do?
- Can you describe the role of data at Riskified?
- What are some of the core types and sources of information that you are dealing with?
- Who/what are the primary consumers of the data that you are responsible for?
- What are the team structures that you have tested for your data professionals?
- What is the composition of your data roles? (e.g. ML engineers, data engineers, data scientists, data product managers, etc.)
- What are the organizational constraints that have the biggest impact on the design and usage of your data systems?
- Can you describe the current architecture of your data platform?
- What are some of the most notable evolutions/redesigns that you have gone through?
- What is your process for establishing and evaluating selection criteria for any new technologies that you adopt?
- How do you facilitate knowledge sharing between data professionals?
- What have you found to be the most challenging technological and organizational complexities that you have had to address on the path to your current state?
- What are the methods that you use for staying up to date with the data ecosystem? (opportunity to discuss hayaData conference)
- In your role as organizers of the hayaData conference, what are some of the insights that you have gained into the present state and future trajectory of the data community?
- What are the most interesting, innovative, or unexpected ways that you have seen the Riskified data platform used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data platform for Riskified?
- What do you have planned for the future of your data platform?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Riskified
- ADABAS
- Aerospike
- Neo4J
- Kafka
- Delta Lake
- Databricks
- Snowflake
- Tableau
- Looker
- Redshift
- Event Sourcing
- Avro
- hayaData Conference
- Data Mesh
- Data Catalog
- Data Governance
- MLOps
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a t l a n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Inbar Yogev and Lior Winner about the data platform that they and their team at Riskified are building to power their fraud management service. So, Inbar, can you start by introducing yourself?
[00:01:40] Unknown:
Hi there. I'm Inbar. I'm a data architect and 1 of the,
[00:01:45] Unknown:
I'd say, founding members of the data platform in Riskified. And how about yourself, Lior? So I'm Lior Winner. I lead 1 of the data platform teams here in Riskified. I've been working here for the past 5 and a half years.
[00:01:58] Unknown:
Joined Inbar as the 3rd member of the data team. Going back to you, Inbar, do you remember how you first got started working in data?
[00:02:05] Unknown:
Yeah. Well, actually, over 20 years in the data business, as you may call it. I started as a DBA. We had Adabas and Oracle, like very young Oracle, like Oracle 6, which later grew to 9, 10, 11, 12, whatever. And I've been through several roles, seeing multiple customers as a database consultant and being involved in developing data applications. I found myself in Riskified as, like, 1 of the developers, a full stack developer with data expertise. It was obvious that someone needed to focus on building the data platform, training the organization, and doing data enablement in the organization. And I've held this role
[00:02:53] Unknown:
ever since. And I've been doing it for 8 years now. And, Lior, do you remember how you first got started working in data? I started working with data around 12 years ago. Back then, the role was called BI developer. I was building legacy data warehouses and analytical tools using frameworks such as Cognos, OLAP cubes, and Microsoft SSIS ETLs. And later, I studied computer science and kind of, like, mixed both areas together and started working as a data engineer at Riskified. As I said, it was 5 and a half years ago, and that's it. Ever since, I've worked at Riskified inside the data platform group. I was a data developer at the beginning, later started leading 1 of the data platform teams, and now I'm leading the data guild at Riskified.
[00:03:40] Unknown:
And so before we get too far into the specifics of what you're building at Riskified, I'm wondering if you can just give an overview about what it is that the business does.
[00:03:49] Unknown:
So Riskified is currently 1 of the leading solutions for protecting merchants in the e-commerce realm. We have an AI-based platform, which is like the core of our products, that protects them. We analyze their transactions in near real time and approve or decline them, which helps them generate growth.
[00:04:09] Unknown:
And so because of the fact that 1 of the core elements of the product that you're building is this AI engine, obviously, data is a very important capability in the business, but beyond the obvious use case of supporting that AI engine for the fraud analysis, what are some of the other ways that data plays a role at Riskified?
[00:04:31] Unknown:
Probably everybody says this, but Riskified is a data driven organization. But it's a really data driven organization, mainly for the fact that data science is, like, the core of the product. Right? So by that, you will see a lot of data professionals hanging around, and very many people accessing the data on a regular basis. And you see it both at the upper management and at the technical level. Decisions are being made based on data. And what we do with the data platform team is facilitate all the data storage, access, and anything that relates to data processing.
[00:05:10] Unknown:
So in terms of the types of data that you're working with, I'm curious if you can enumerate the different sources that you're working with, the form that that data takes, the, you know, relative volumes and variety that's involved, and just some of the inherent complexity that comes about because of the types and sources of data that you're working with?
[00:05:31] Unknown:
We are doing fraud detection for orders. Right? So mainly, our entity is an order. And if we look at our data architecture, it's comprised of multiple data stores of various types. You'll see Elasticsearch for doing near real time BI and for search capabilities over near real time data. You have an RDBMS, PostgreSQL, for storing our order data for long term purposes. We also have Aerospike, Neo4j, and we keep expanding that as the need arises. For streaming, meaning data publishing and asynchronous microservice communication, we use Kafka. And this is what Lior's team is in charge of. For anything that relates to big data analytics, we have a data lake based on S3 and Delta Lake. We have Databricks. We have Snowflake as the data warehouse.
And our analytic tools include SQL and Tableau as well. We have Looker on the horizon.
[00:06:33] Unknown:
I find it interesting that you're using both Snowflake and Databricks because of the fact that so many people have kind of posed them as direct competitors as they start to edge into each other's markets. And I'm wondering if you can talk to some of your experiences there. This takes us to the data history in Riskified. We came out of Redshift, Amazon Redshift,
[00:06:51] Unknown:
which used to be everything we needed for big data. You could query it all. All the raw data is there. All the fine data is there. All the aggregations are there. You can join anything to anything. It was easy enough to do anything you want in data. But then it became too expensive, and not very cost performant, you could say. So we said we need to store long term, barely used data on cloud storage. And then you need to think, how do you access this data? Right? You need an access layer. And since we initiated working with Spark for the entire organization, it only makes sense that we have a platform that allows non engineers to use Spark for data access. This is why Databricks is there. And we encourage everyone who needs data from the data lake to use Databricks for any kind of data exploration, building data pipelines, or whatever.
So it's not always an obvious answer. Whenever someone asks us, where should I go? How do I access this data? It's not obvious whether to go to Snowflake or go to Databricks. It depends on the data that you need, because we keep the fine data in Snowflake and the raw data in the data lake. It depends on your use case. So normally, if someone comes to us with this question, we'll have to see what they actually need and point them in the right direction.
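As a rough illustration of the routing described above, the following minimal Python sketch shows the two access paths side by side: querying refined tables directly in Snowflake versus exploring raw Delta Lake data with Spark on Databricks. The account, table, and path names are hypothetical placeholders rather than Riskified's actual configuration.

```python
# Minimal sketch of the two access paths described above; account, table,
# and path names are hypothetical placeholders, not Riskified's actual setup.

# Path 1: refined data in Snowflake, queried directly with SQL.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",       # hypothetical account identifier
    user="analyst",
    password="********",
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="ORDERS",
)
cur = conn.cursor()
cur.execute("SELECT order_id, decision, decided_at FROM refined_orders LIMIT 10")
for row in cur.fetchall():
    print(row)

# Path 2: raw data in the S3/Delta Lake data lake, explored with Spark
# (for example from a Databricks notebook, where a SparkSession already exists).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
raw_orders = spark.read.format("delta").load("s3://data-lake/raw/orders")  # hypothetical path
raw_orders.filter("created_at >= '2022-01-01'").groupBy("merchant_id").count().show()
```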
[00:08:07] Unknown:
And as far as the organization of the data professionals and the ways that you're working with data and who the downstream consumers are, I'm curious if you can talk to some of the ways that that manifests in the organization and how you think about the structures as far as how to lay out those different teams and how the use of data informs the overall structure of how you define the different contracts and interfaces between those teams.
[00:08:35] Unknown:
If we're talking about the greater data organization at Riskified, we are probably talking about more than 300 people. On the analytics side, we have the data science department, which is doing everything from data exploration, performance research, and feature engineering to model training and model training automation. We also have BI, which is a significant and important consumer that is transforming the raw data that we are producing to the data lake, or the raw data at Snowflake. They are transforming this data, digesting it, and producing new layers of data inside the data warehouse, inside Snowflake, which are the single source of truth for traditional consumers.
They are building KPIs. They're building dashboards. They're providing a wide range of services and bringing a lot of value to the product themselves. So we have data analysts in the operations, marketing, and sales departments. They are working closely with the data platform product in order to bring value to their product and the business. And last but not least, we have the dev organization, which, on 1 hand, are the data producers. They are creating the applications that are constantly streaming data to our data platform and later to the data lake and Snowflake. But they also have their own use case where they are consumers of their own data. When they are building data pipelines, in a reverse ETL method, they consume their own data on the offline side, transform it, and later push it back to their own online systems for online use cases.
[00:10:12] Unknown:
Because of the fact that you do have so many people working with data, I'm curious what the evolution has been going from when you first started there, Inbar, as 1 of the first people working on data infrastructure to where you are now as far as how that sort of team topology has progressed, where I'm assuming it started as, you know, 1 or a small handful of people working with data to now having multiple different dedicated, very kind of specialized teams working with different areas of the data infrastructure and data analytics, etcetera.
[00:10:44] Unknown:
If we take a brief look at the history of the greater data organization at Riskified: first, we had the first data team that Inbar started, like, 8 years ago. And this small team was responsible for everything, end to end, in the data platform world. Like, all of the data products were in our ownership. Even the domain owners' data processes were under our ownership. And later, when new use cases arose, and together with the growth of the company, we had to build additional expertise across a really wide range of technologies. And then we realized that 1 team cannot serve everything anymore. So we had to split this small team into 2 teams at the beginning. And later, we created an additional team. So today, we have, like, 3 teams in the data platform group. But I think if we are looking only at the data platform, we are looking at just 1 side of the data organization at Riskified. On the other end, we have all of the data consumers that we talked about and how they are accessing the data platform products.
1 thing that we realized 2 years ago: we started to feel some kind of pain between the data users and how they were using the data platform. So there was some kind of a gap between the users and the platform. And about 1 year ago, we started thinking about a solution for this gap. And an idea we had in mind was forming a data guild. We wanted to create a community of our users.
[00:12:21] Unknown:
If I'd sum this up, you need a team that can build the platform. And for it to scale well, you need to be as self-service as you can on anything related to platform tools. And you need users who can actually use the platform well. So they know how to model data. They know how to do efficient data access. And they can work with their own teams and train them and guide them, and approach us, the data professionals,
[00:12:47] Unknown:
for help whenever they need. So this is like the kind of organization we're trying to build now. Yeah. I like the idea of the sort of guild system in this data capacity, especially since it's very difficult to be able to hire people in who are experts in all of the different technologies that any given organization might use, or just even experts in the principles of kind of distributed systems and data management and all of the myriad things that you need to have a, you know, solid grounding in to be able to work effectively in data. Exactly. And so the fact that you are making it kind of a core organizational capability to be able to train people internally into those roles and facilitate that kind of learning and continuing education to be able to gain that capacity and remain effective is a very interesting and, I think, a very insightful way to approach the problem.
[00:13:45] Unknown:
And at certain points in an organization's life cycle, it's like the only way to go. I think we've gone too big, as we said; 300 people using SQL on a daily basis is challenging.
[00:13:56] Unknown:
Because of the fact that data is so core to all of the work that you're doing and you have so many different users consuming the data that you're working with and manipulating it, I'm curious what are the organizational constraints that have had the biggest impact on how you think about the design and usage of your overall data platform?
[00:14:14] Unknown:
So I think that 1 of the most interesting and most challenging things we have when we create a data platform is thinking about the users that are going to use it: who the users are, what kind of expertise they have, and how they are going to use the platform. So as we see it, a data person should create value for the company, whether his vision is enabling growth or improving the product. For this to happen, we need a couple of things inside our platform. The first thing is we need to provide data that is easy to access, reliable, and high quality.
The user needs to know where he can find the raw data and the more refined data in the data warehouse. And the next thing that our platform needs to provide in order for this to happen is the data catalog product, which is a tool for data discovery. We need to make our users' lives easier with a data catalog solution that will help them find the relevant data, tag his own data, or even use digested data that someone else has already solved for him. And then he can just reuse the datasets. And the next thing we need to make sure of is that the users have the right expertise. The more data we collect and the more tools we have, the more complex the interfaces to our data platform become.
So we need to make sure that the users are using the data platform in the correct way in order to get a good user experience. We need to make sure that we provide tools that give them good performance with cost effectiveness in mind. We need to protect our data platform from high loads and stuff like that. And 1 of the most powerful things that we can do and provide is self-service tools, right? So we can think of common tasks that our users are doing and create automation for that, and allow non tech savvy people to access data easily with all of the best practices in mind around reducing load on the platform, reducing the cost of their queries, or anything else. Efficient data storage and data modeling? Yeah.
[00:16:22] Unknown:
Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end to end with a mix of your code and their open source low code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you're ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they're good at, solving problems with data. Go to dataengineeringpodcast.com/shipyard today to get started automating with their free developer plan.
As far as the ways that you have approached the solution to those constraints and being able to power all of the different data use cases at Riskified, I'm wondering if you can now talk through some of the architectural elements and design considerations that you've put together into your data platform and some of the journey to where you are now.
[00:17:32] Unknown:
We've kinda looked at it as, like, a tiered architecture, where the leftmost part is the producer side, where data is being created. You're talking about the online systems, stateful applications which have their own database. Therefore, you need a CDC solution, where we're currently implementing Debezium for that purpose. You also have streaming applications producing data amongst themselves and also into the offline data realm for data analytics. Event sourcing has become a thing. We used to think very stateful, where everything has a state, like a final state. It took a lot of adoption around the organization to get used to that notion. But now you can see how the entity changed over time. It required a lot of education, and we're still working on that. We use Avro for data publishing, which helps us avoid breaking changes in the schema and puts the actual ownership and responsibility on the producer itself. We talked about self-service tools. This is, like, the first element in the self-service tools. The producer builds his own schema.
He knows if it's broken or not. We don't have to do anything about it. And we can trust that it's not going to break over time. For the 2nd tier, we're talking about a streaming platform to allow both asynchronous communication between microservices as well as data publishing. So we have, as we said, Kafka in the center of our architecture. We have Kafka Streams. We have Kafka Connect. And for data publishing, we use our own built Spark stream. Again, self serve. Once you've built your schema, you can generate a new stream that will publish everything into the data lake, into a data format. Now, I'm jumping into the next tier, which is, like, the analytics storage tier. We have S3 for raw data and Snowflake for fine data, as we mentioned before.
And for accessing this data, you need some kind of compute layer. So we use Snowflake for accessing Snowflake data, and we use Databricks and Spark jobs for accessing the data lake data. Then you have the usage, or analytics, layer. So we have Tableau, and we just started migrating some of the Tableau workload into Looker. I think these are the main parts of our data platform.
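To make the publishing tier a little more concrete, here is a minimal sketch, not Riskified's actual code, of the kind of self-serve stream described above: a Spark Structured Streaming job that reads Avro-encoded events from a Kafka topic, decodes them with the producer-owned schema, and appends them to a Delta Lake table on S3. The topic name, paths, and schema file are hypothetical, and the sketch assumes the spark-avro and Delta Lake packages are available on the cluster (as they are on Databricks). A production version would also need to handle schema registry framing, partitioning, and bad records.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.avro.functions import from_avro

spark = SparkSession.builder.appName("order-events-publisher").getOrCreate()

# Producer-owned Avro schema (JSON), e.g. exported from the schema store.
with open("order_events.avsc") as f:
    order_schema = f.read()

# Read the raw Kafka stream; the topic name is a hypothetical placeholder.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "order-events")
    .option("startingOffsets", "latest")
    .load()
)

# Decode the Avro payload into columns using the producer's schema.
orders = events.select(from_avro(col("value"), order_schema).alias("e")).select("e.*")

# Append the decoded events to a Delta table in the S3 data lake.
(
    orders.writeStream.format("delta")
    .option("checkpointLocation", "s3://data-lake/_checkpoints/order_events")
    .outputMode("append")
    .start("s3://data-lake/raw/order_events")
)
```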
[00:19:49] Unknown:
Because of the fact that you have so many people relying on this platform that you're building, obviously, you need to ensure that it is stable and that you're very deliberate in the technologies that you choose and the ways that you approach introducing them. So I'm curious if you can talk to some of the evaluation criteria and some of the ways that you think about selecting those different technologies and how to integrate them into the platform and try to create this kind of unified interface for people to be able to understand what are the responsibilities of each of these different components and how do they all work together.
[00:20:24] Unknown:
So I think that the first step, when we are talking about a new technology, is connecting it to a business goal or a need inside the organization. We learned from experience that we first need to establish the business need in order to later get prioritization from the domain owners who are eventually going to migrate to this new technology. We can look, for example, at Aerospike, which is a new in-memory key-value database that we adopted at Riskified. There we looked at business goals like cost optimization, performance improvement, and being cloud agnostic. And Neo4j, which is an additional new joiner to our stack, also came from a business requirement from our users.
So I think we have some kind of checklist that we go through in the process of evaluating a new technology. The first thing that we will check is the cost. We'll try, like, to create some kind of cost estimation: whether it's a fully managed solution, where we are going to start working with a new vendor, understand its business model, and try to understand how much we are eventually going to pay for this solution, or whether it's an open source solution that we are going to adopt and develop on ourselves, with some kind of learning curve to implement this technology in our technology stack. So cost is a really important factor in our process.
The next thing is the performance or scale of the technology that we are evaluating. We need to make sure that it answers the business requirements we have today. And because we are a constantly growing company, we have to make sure that this product is scalable and can serve our requirements in the near and longer term future. Production readiness is the next thing we check. We need to make sure that the product is mature and answers all of the security and compliance requirements that we have. And the last thing is checking out the community around the technology. We want to make sure that a lot of people use this product. We want to learn from other companies' experience, what kind of companies are using it, and for what purposes.
And after we have gone through all of the checklist, we start a POC phase. The POC phase will probably involve several technologies that are comparable. We will go through all of the parameters that we talked about and compare how every technology answers each parameter. We will do some kind of validation process. And the last thing is, like, getting to the decision making step, commercial terms, and stuff like that, before we start planning the real implementation phase inside the company, how it is going to affect our users, and what kind of migrations we have to do.
[00:23:19] Unknown:
1 more thing is that you're not married to the technology at the end. Right? So you need to keep in mind that you might have chosen based on some criteria, and what you normally see when embedding a new system, a new technology, in your stack is that it doesn't always work as it did in the POC. And you sometimes need to think again. You can always work hard and bang your head against the wall until stuff works, and you can always look around and ask other companies what they are doing. And maybe you've taken the wrong path. This is something you have to reckon with and live with.
[00:23:58] Unknown:
And to that point of being able to tap into the community to understand what are the solutions that have worked well and what are the things that we should be keeping an eye out for, I'm curious what are some of the resources that you've been able to lean on as you have grown this data platform, and how you have worked, both internally at your company and within your regional community, to foster the connections that have helped you validate and grow those capabilities for your technology platform and your organizational capacities, understanding both the technical aspects of how do I make this work and the business aspects of how do I make this scale logically and semantically?
[00:24:44] Unknown:
First of all, we try to use our own past experience, both from Riskified and from our people who have experience at other companies. But then we want to expand our knowledge and try to introduce ourselves to new technologies. So of course, we can read everything online. We can listen to podcasts, go to conferences, meetups, etcetera. Then we try to mingle with other companies and hear their thoughts about similar challenges that we are facing. Their POC is our POC, right?
[00:25:20] Unknown:
So we usually learn from their learning curve and try to embed it into our own knowledge.
[00:25:26] Unknown:
Yeah. So we have similar challenges, which creates opportunities to do knowledge sharing. That leads us to how we thought about hayaData, which is a Hebrew pun on the English phrase, did you know? So hayaData was born after we recognized a vacuum in the Israeli data community. We wanted to create a new data conference for the data engineering and data science community in Israel. Riskified is the main promoter, but from the beginning, we had in mind how we were going to add additional companies to the committee. We wanted to create additional connections in the industry. It's a great opportunity for companies like us to mingle, to discuss their challenges, to hear about the successes and the mistakes of other companies that face the same challenges that they have.
And we created this conference for the community. It's a non profit conference. Everything that we do is for the community. We brought together a wide range of people on the conference committee in order to bring an up to date and diverse agenda to our crowd. And of course, we are data people, so we can take a look at some of the numbers from the 2022 conference. We hosted more than 700 people. Together with 18 sponsors, we had 23 talks across 2 different tracks. Most of the content was around data engineering and data science. It was a great conference. We had, like, great feedback from the community.
We're really proud of the end result.
[00:27:11] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Between your work at Riskified and on the hayaData conference, I'm wondering what you have seen as some of the major themes that people are dealing with, whether it's still at the technology level of how do we manage this data, how do we scale things, or is it moving more to how do we improve the end user experience? I'm curious what are kind of some of the main topics that you're dealing with, both internally at Riskified and that folks were addressing in the most recent conference.
[00:28:36] Unknown:
I'll start from the end. Everybody is in the same boat here. And when you talk to other companies, you always see that your challenges are very similar to theirs. And if I'm recognizing, like, hot topics in today's era, it's data mesh and data lake implementations. I'm not sure I know a single company that has, like, a full data mesh, but everybody's talking about it. Everybody's aiming for it. Data cataloging is, like, a super hot topic here. And how do you govern the data at all? Data governance is very hot too, including data quality tools. And maybe MLOps would be the last 1. Everybody's dealing with feature stores, model training, model serving.
And the challenges for companies on our side are more or less the same. This is what Lior said before. The conference is a great place to meet and discuss these topics. And we had a lot of engagement. We screened a lot of abstracts in the screening process. You see young companies; they have the opportunity to dive into new technologies. So you see old companies learning from younger companies, but you also see the vice versa discussion, where young companies that are about to scale approach bigger companies and want to learn about their practices, growing pains, etcetera. So a lot of collaboration here. Yeah.
[00:30:04] Unknown:
Given the kind of slate of issues that so many organizations are dealing with and that they're all going through it together, I'm wondering what you see as some of the potential future trajectory for the data community and some of the upcoming challenges that folks are likely to run into as they start to tackle the ones that you and they are currently addressing?
[00:30:26] Unknown:
1st and foremost for us is cost, cost management, especially in the Kubernetes era, where everything runs in Kubernetes and is apparently cheap because everything is open source. But you need people who know the job to be able to tune Spark jobs on Kubernetes, for example. And I think we're going to see a lot of development in Kubernetes automation in the near and mid term future. And the big question: do I go to a fully managed cloud data warehouse? Do I manage my own data lake? How do I choose exactly which technologies are gonna be in my data stack? This is, like, a question that's gonna
[00:31:06] Unknown:
be here for a while, I think, until someone wins. Right? Yeah. Absolutely. It's definitely a challenging question to answer because of the fact that it's such a moving target where it says, oh, well, I need something that's easy to get started with, but I also need to be able to work with all of this raw data that I have. Well, Snowflake, you know, started in that easy to use path, but they're starting to expand into that data lake use case of, oh, we can offload querying to your raw data that doesn't necessarily live inside Snowflake, and then you've got, you know, Databricks going in the opposite direction. And then technologies like Trino and Presto that are trying to kind of play the middle ground between them.
[00:31:42] Unknown:
And if you utilize Spark, you can't use any of them. You need data in your cloud storage. Right? So, yeah, a question that's going to remain as long as the market is like this. Absolutely.
[00:31:53] Unknown:
And in terms of the data platform that you have built at Riskified and that you're continuing to support and maintain and evolve. What are some of the most interesting or innovative or unexpected ways that you've seen that platform used or some of the types of products or capabilities that have come out of it that you didn't anticipate?
[00:32:12] Unknown:
There's no magic here. But, I mean, normally when we roll out a new technology, we start looking at what's going on, and I can give some examples of what we had in the past. When we rolled out Airflow, for example, for broad use by the entire company, it was not long after that we saw very huge DAGs being developed by the BI team, and contributions to the Airflow platform, which was very nice to see. And we didn't expect it, like, at first. Maybe another example would be our own built anomaly detection system. As we said, we're analyzing orders and making decisions, and anomaly detection is, like, a different way to look. You're not looking at the order level. You're looking at groups of orders, trying to find correlations between them. Once we rolled out Spark for the data science organization, it was, like, I think, 1 of the first projects.
And it opened eyes for everyone when we suddenly saw a very limited anomaly detection capability being scaled, with a nightly job finishing in 30 minutes, which couldn't finish before in 8 hours. And I know it's not, like, an amazing thing to hear, but for us, being facilitators and promoters of technology, it was very nice to see that it is utilized correctly and that there's actually value in it. In your own experiences
[00:33:33] Unknown:
of building and growing the Riskified data platform and working with your internal stakeholders and end users, as well as the work that you're doing at the hayaData conference, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:33:47] Unknown:
I hope nobody hears this, but data science is kind of a traditional organization, technology wise. I'm kidding, of course, because this is where the most advanced technology that we have lives. But we found out that rolling out new technologies, let's say, moving them from running computational stuff on their own laptops using RStudio into running stuff on Databricks, a distributed, cloud based platform, was a hard thing to do. And it required a lot of education, and maybe taking them by the hand, working with them together, and showing them, hey, this could work. I can go with you, like, take the first few steps with you and show you that it works. So you need to be patient when pushing new technologies. It doesn't happen in 1 day. When we replaced Redshift with Snowflake, it was seemingly moving from 1 cloud data warehouse to another, but the adoption was such a long process. We found out we have so much SQL in code that needs to be migrated eventually.
So you need to be patient and you need to keep pushing forward. That's my main take. As you continue
[00:34:59] Unknown:
to support and evolve the data platform and be able to adapt to new requirements and new products and capabilities at Riskified, what are some of the things you have planned for the near to medium term as far as the technological and organizational aspects of what you're building?
[00:35:16] Unknown:
So we said that we are creating a more robust organization using our data guild. It's, like, bridging the gap between the data platform and the data owner teams. And we expect to take this even 1 step further. We want to put full ownership of the data on the domain owners, like in a data mesh kind of idea. We want to give domain owners the full ownership of their own data and their own data processes. We want them to wake up at night when something breaks.
They will have the ownership end to end. They will have the right tools to see how their data flows. They will have the correct data quality tools, the lineage. And then they will be able to run without us. And the data platform can go even to the next step, which is creating independence inside the data platform. We want to keep creating the building blocks that our domain owners can later use to create their own products. We will have the data guild as the supporter for this process. Of course, it's a long term process. We'll have to create the right trainings and the right processes for shifting this paradigm from, like, a data platform owner that owns the data processes to domain owners that are going to own their own data processes and maintain them by themselves.
We have a long journey here. But the end result is going to be a lot more robust and a lot more scalable in terms of our data platform. And of course, a key part of this transition is data governance practices that are going to help us stay in control. Like, we are going to give a lot of freedom and independence to the domain owners. But as the data platform owners, we have to stay in control. So we have to control the data access layers and have the right auditing tools. And I think that's it.
[00:37:17] Unknown:
Well, for anybody who wants to get in touch with each of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think what
[00:37:35] Unknown:
is missing for me is being able to control data access and audit data access in a data lake implementation from any kind of access path, whether it's Spark, Presto, Athena, or whatever: being able to control the access without having to implement it in a separate platform each time, and being able to audit and understand who accesses the data and which data is being accessed. That is kind of a challenge at this point in time. Maybe the second item would be that we have data lakes, and data is lying there in cheap storage. Sometimes you need just random access into this data and to be able to pinpoint a single row. And from what we know at this point, it's not really possible to get this capability within the currently existing tool sets. So being able to have a data store that allows you both to do mass data processing as well as drill down to the single row level would be a very nice addition to the current capabilities we have.
[00:38:37] Unknown:
Alright. Well, thank you both for taking the time today to join me and share the work that you've been doing on Riskified's data platform, helping evolve the technological and organizational capacity to take advantage of data, and on the hayaData conference to help build and foster that community. I appreciate all of the time and energy that you're both putting into that, and for taking the time to share your work here. I hope you have a good rest of your day. Thank you. All the best. Thanks a lot. Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Guests and Their Roles at Riskified
Overview of Riskified's Business and Data Use Cases
Data Sources and Architecture at Riskified
Organizational Structure and Data Team Topology
Challenges and Solutions in Data Platform Development
Architectural Elements and Design Considerations
Evaluation Criteria for New Technologies
Community Engagement and Knowledge Sharing
Current Trends and Future Trajectories in Data Management
Innovative Uses and Lessons Learned from Riskified's Data Platform
Future Plans for Riskified's Data Platform
Biggest Gaps in Current Data Management Tooling