Clean Up Your Data Using Scalable Entity Resolution And Data Mastering With Zingg

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL,

Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs.

Go to data engineering podcast.com/

linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code.

Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks.

By the time errors have made their way into production, it's often too late and the damage is done. DataFold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests.

Data fold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production.

No more shipping and praying.

You can now know exactly what will change in your database.

DataFold integrates with all major data warehouses

as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows.

Visitdataengineeringpodcast.com/datafold

today to book a demo with DataFold.

Your host is Tobias Macy. And today, I'm interviewing Sonal Goyal about Xyng, an open source entity resolution framework for data engineers. So, Sonal, can you start by introducing yourself?

Thanks for having me on the show, Tobias. Very excited to be talking to you today.

I'm from India, and I'm working on an open source product called Xyng.

I've spent a lot of time on software engineering, so roughly 24 years of working in various kinds of domains.

I was running a data consultancy before founding Xyng. So here I am.

And do you remember how you first got started working in data? So it's been a while, somewhere around 2, 006, 2007

while I was getting bored corporate life, and I wanted to delve into, you know, more and more challenging stuff.

So I started freelancing,

and 1 thing led to the other. I discovered Hadoop, learned about distributed systems,

and started,

working on a lot of open source stuff. So I was kind of early on the, I think, the Hadoop big data wave as they used to call it then,

and, saw a lot of problems in Stack, but it was fun time. A lot of learning.

And so in terms of the Xyng project, can you describe a bit about what it is that you're building and some of the story behind how it got started and why you wanted to spend your time and energy on this?

Yeah. So as I was working as a data consultant

and started also doing some open source work,

Started getting a lot more bigger projects,

so incorporated

consulting company and hired a few people. So we were having fun building, you know, data lakes and warehouses.

And some of our projects actually started feeling the need for entity resolution. At that part of time, actually, I did not understand what this problem is. We were building a data lake where we had to get customer data from multiple,

Oracle store systems

and build analytics out of it.

At that time, we kind of so for us, you know, back in 2010,

2011, setting up Hadoop clusters on EC 2, Those were the prime challenges.

Not really,

the problem of really, you know,

saying that these 3 records

from these sources are actually referring to the same customer. So when it hit us, it really hit us hard, and I saw firsthand

the problem and how important it was for analytics reporting, personalization,

and other things. So that kind of get got me started. I think it was failure first, getting hit by the problem multiple times in multiple projects,

which kind of led me where I am today.

So in terms of the Xyng project, who is the target audience and some of the ways that you thought about

what problems you're solving for who and how that influences

the design and implementation

of housing is incorporated into the overall workflow?

Seeing as an open source tool is, actually primarily for the data community. It's for the data engineers, the data scientists,

and the data analysts

to ensure

that

the required

relationships

across

their customers,

across their suppliers, across their products

from multiple sources are established.

Now when I talk about relationships, what that means is that you have the same customer

represented coming in through maybe an offline store and then coming in through an online channel

having slight variations in their data. And but you need to tie them all tie tie them down to a single physical real world entity

so that you can do your reporting, so that you can do your personalization.

In terms of what we have at Xyng is, I think, an ML based framework that you can

apply

as part of your pipeline

and use that. So the target audience that we have, which is the data people, are

these are smart people. This is an audience which knows how to use sophisticated tools. This is an audience which knows

how to transform and angle their data. So we have a variety of, like, data engineers, data scientists, and these are specialized skills that people have. So in terms of while we are, like, valuable building Xyng, we are very, very conscious about the fact that

the product has to be innovative enough. I mean, people cannot this

this audience can build anything that they choose to.

But is this the right use of the time? We're giving them enough functions, enough value so that they come to Xyng and build and, you know, use it use Xyng instead of building it themselves.

Does it easily tie into their workflows,

into the systems where the data is saved? So those are some of the key things in terms of the design that we've

really been very conscious about.

And as far as the

cases where entity resolution and record deduplication

is necessary. I'm wondering

what are some of the, I guess, challenges or potential errors that those duplicate records or, you know, multiple

entities that all converge to the same identity,

some some of the issues that those might cause and some of the ways that it might manifest in

analytical or data engineering workflows

and some of the ways that the entity resolution

approach can solve those problems.

So I think I would like to break this down into 2 things. 1 is duplication, which is, you know, you have something which you should not have had in the first place. So let's say, you know, erroneously, in some cases, you have 1 customer record

typed in multiple times. So let's say you are an event company and the same person has registered multiple times at your booth.

I would say that would be a duplication error, and that would cause a problem in terms of data quality, in terms of counting your

customers. But when we talk about entity and identity resolution for us, it's more about,

you know, those customer 360 60 use cases, the supplier 360 use cases, which are fundamental to the reason why you are building your warehouse or the data lake in the first place. So, you know, like, lifetime value, if I'm counting

5 records

as 5 different customers, though, actually, they represent 1 single customer who probably visited, you know, my store multiple times

or through came through multiple channels,

that is

a fundamental problem, you know, in my analytics.

So right from very simple

counts of new customers added per quarter

to lifetime value to higher order use cases like anti money laundering, g t p r? Where is my data saved? Can we delete,

have ready access to all the data of this customer

and purge it when they request it? Anti money laundering. So the cases actually vary. But to start with,

even the simplest

reporting use cases actually get hampered without entity resolution.

As far as the actual

entity resolution

workflow, for people who aren't familiar with it or who haven't either used it or dug deep into actually building it themselves.

What are some of the

requirements for being able to

implement it and some of the ways that it is incorporated into the data workflow. So the, I guess, the locations in the data life cycle where it actually gets applied and just some of the

technical and infrastructure

and, I guess, organizational capabilities that are necessary

for being able to implement it in the absence of something like Xyng?

So the problem remains the same, right, whether you use a tool like Xyng or whether you don't. It is generally 1 of the first transformation steps in any data life cycle

because that is where your core entities, your core nouns,

the customers, the suppliers, the products are established

over which the dimensional data is added the transactional data is added so that you can do your further analysis.

So or apply your machine learning. So it's generally the first step. In terms of particular skills, I would say

it's a combination

of

a lot of programming. I would say if you're building an ML based system, obviously, it requires

knowledge of machine learning.

1 particular challenge with the entity resolution is really how do you determine what to compare?

So when we talk about, like, you know, joins, I think that's fundamentally all data people understand, and it's something that drives us crazy optimizing joins. We we kind of, you know, always are working around the joins.

Joins are with exact keys, but entity resolution is joins without keys.

Now that absolutely turns the tables. It's like if you have 10, 000 records, you actually are comparing 10, 000 records against 10, 000 records. And then moment you, you know, you go to 1, 000, 000 records, the scale, the complexity,

a million cross million join. It's an echo card is a join. So that completely blows up. So somebody who understands these nuances, who understands what to compare, how to compare,

and is able to put them all together

is actually the skill. I I would say it's a mix of it's a mix of art and science.

As far as the Xyng project,

I'm wondering what was the motivation for creating it as open source and making it available for

any practitioner to be able to take it and use it as part of their tool belt as an off the shelf component versus having to

go through the work of building their own framework and building their own implementation to be able to

apply these deduplication

and entity resolution

techniques?

There are actually multiple reasons for Xyng being open source. 1 is that I have been a consumer of open source, so it's my way to kind of give back to the community. I've been using so many open source solutions. I built my consulting around that. So that is 1. Secondly, I feel open source Xyng is far more powerful than closed sourcing because

people who are interested in the topic kind of contribute back to a growing framework or a library.

And we have people who are actively, you know, helping us, supporting us. Databricks has come out with their own notebooks

with Xyng workflows,

and we know a few others who are actually working around this.

Another reason is that open sourcing

has power in the sense so entity resolution is a problem. I've not even talked about Xyng. Entity resolution is a problem that gets applied in, like, multiple scenarios.

We talk about, you know, anti money laundering and fraud use cases. We talk about GDPR.

We talk about product matching,

catalog matching.

We talk about item recommendations. We talk about review aggregation. We talk about customer 360, 60, supplier risk management.

We have varied, like and also as, like, the last year has taught me, there are so many more use cases than I could have personally

learned about and thought about or reached out to those people. Open source is a great way to, you know, to be able to service or to help a lot more use cases than a closed source solution could. So those are practically

my motivations for doing an open source product.

I've discussed the overall concept of entity resolution

a few different times with multiple different people.

And

from

the brief, definitely nonexhaustive survey that I've done, it seems like the majority of applications for entity resolution are either as

a feature of a broader product or as a kind of commercial capability

or more frequently, something that is implemented

kind of in house by the team that needs to apply that capability.

And I'm wondering why you think there has not yet been a compelling or widely adopted solution for entity resolution and record deduplication

up till now.

This is a, you know, a very interesting question and kind of takes us to the evolution of, I think, the data industry as a whole or the data tooling as we know it, the modern data stack as we know it now as a whole. So I think in the beginning, people kind of struggle

with, you know, getting their base layers ready, which is

collecting data, then transferring it to a central location

over which they can actually, you know, start

analyzing that data.

Entity resolution happens the moment you have, you know, this data coming in already saved.

And you start analyzing it, and you start realizing that this is, you know, a problem with with your data. So it's not, you know, the first thing that you do as part of your while building out your data stack,

but it's probably the first thing that you do as part of your data transformation. So the base tag the base layers for entity resolution have to be ready on top of where it can kind of be applied on. Because if the data is not in 1 place, you will not

have the need or the urge to resolve your entities.

They are in separate systems. You are happy with your separate

departmental silo, and you're working through your system.

So only when you are, you know, holistically looking at your data, that's the time when entity resolution

kind of strikes you.

2nd,

the reason why it has so 1 is that, which is, like, you know, just the evolution of data maturity and data collection. We now have very strong tools around all those capabilities.

I think second fundamental thing is that

as a problem, it's a fairly tough problem to solve.

Especially the way I think we are solving it in Xyng, which is very domain agnostic and let people apply it to a domain and entity of their choice,

is

a fairly

intricate

technical implementation.

It is

not something that a lot of people immediately would, you know, jump into solving.

People have custom solutions. They've built it for their own set of data, and a lot of smart people have already done it. There are some toolkits also available.

So I think it's more about, you know,

being able to solve it in a way that

commercially would appeal

to the kind of use cases that you would see.

Thirdly, I fundamentally believe that it is, you know, 1 of the core

Data Engineering Podcast

Summary

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links