Summary
Despite the best efforts of data engineers, data is as messy as the real world. Entity resolution and fuzzy matching are powerful utilities for cleaning up data from disconnected sources, but they have typically required custom development and training of machine learning models. Sonal Goyal created and open-sourced Zingg as a generalized tool for data mastering and entity resolution to reduce the effort involved in adopting those practices. In this episode she shares the story behind the project, the details of how it is implemented, and how you can use it for your own data projects.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Sonal Goyal about Zingg, an open source entity resolution framework for data engineers
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Zingg is and the story behind it?
- Who is the target audience for Zingg?
- How has that informed your efforts in the development and release of the project?
- What are the use cases where entity resolution is helpful or necessary in a data engineering context?
- What are the range of options that are available for teams to implement entity/identity resolution in their data?
- What was your motivation for creating an open source solution for this use case?
- Why do you think there has not been a compelling open source and generalized solution previously?
- Can you describe how Zingg is implemented?
- How have the design and goals shifted since you started working on the project?
- What does the installation and integration process look like for Zingg?
- Once you have Zingg configured, what is the workflow for a data engineer or analyst?
- What are the extension/customization options for someone using Zingg in their environment?
- What are the most interesting, innovative, or unexpected ways that you have seen Zingg used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Zingg?
- When is Zingg the wrong choice?
- What do you have planned for the future of Zingg?
Contact Info
- @sonalgoyal on Twitter
- sonalgoyal on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Zingg
- Entity Resolution
- MDM == Master Data Management
- Snowflake
- Snowpark
- Spark
- Milvus
- Pinecone
- DuckDB
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying.
You can now know exactly what will change in your database. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Your host is Tobias Macey, and today I'm interviewing Sonal Goyal about Zingg, an open source entity resolution framework for data engineers. So, Sonal, can you start by introducing yourself?
[00:01:53] Unknown:
Thanks for having me on the show, Tobias. Very excited to be talking to you today. I'm from India, and I'm working on an open source product called Zingg. I've spent a lot of time in software engineering, roughly 24 years of working in various kinds of domains. I was running a data consultancy before founding Zingg. So here I am.
[00:02:13] Unknown:
And do you remember how you first got started working in data? So it's been a while, somewhere around 2006, 2007,
[00:02:19] Unknown:
while I was getting bored with corporate life, and I wanted to delve into, you know, more and more challenging stuff. So I started freelancing, and one thing led to another. I discovered Hadoop, learned about distributed systems, and started working on a lot of open source stuff. So I was kind of early on the, I think, the Hadoop big data wave, as they used to call it then, and saw a lot of problems in the stack, but it was a fun time. A lot of learning.
[00:02:47] Unknown:
And so in terms of the Zingg project, can you describe a bit about what it is that you're building and some of the story behind how it got started and why you wanted to spend your time and energy on this?
[00:02:57] Unknown:
Yeah. So as I was working as a data consultant and also started doing some open source work, I started getting a lot of bigger projects, so I incorporated a consulting company and hired a few people. So we were having fun building, you know, data lakes and warehouses. And some of our projects actually started feeling the need for entity resolution. At that point in time, actually, I did not understand what this problem is. We were building a data lake where we had to get customer data from multiple Oracle store systems and build analytics out of it.
At that time, you know, back in 2010, 2011, setting up Hadoop clusters on EC2, those were the prime challenges. Not really the problem of, you know, saying that these three records from these sources are actually referring to the same customer. So when it hit us, it really hit us hard, and I saw firsthand the problem and how important it was for analytics, reporting, personalization, and other things. So that kind of got me started. I think it was failure first, getting hit by the problem multiple times in multiple projects, which kind of led me to where I am today.
[00:04:09] Unknown:
So in terms of the Zingg project, who is the target audience, and what are some of the ways that you thought about what problems you're solving and for whom, and how that influences the design and implementation and how Zingg is incorporated into the overall workflow?
[00:04:26] Unknown:
Zingg as an open source tool is actually primarily for the data community. It's for the data engineers, the data scientists, and the data analysts to ensure that the required relationships across their customers, across their suppliers, across their products from multiple sources are established. Now when I talk about relationships, what that means is that you have the same customer coming in through maybe an offline store and then coming in through an online channel, with slight variations in their data. But you need to tie them all down to a single physical real world entity so that you can do your reporting, so that you can do your personalization.
In terms of what we have at Zingg, it is, I think, an ML based framework that you can apply as part of your pipeline and use that. So the target audience that we have, which is the data people, these are smart people. This is an audience which knows how to use sophisticated tools. This is an audience which knows how to transform and wrangle their data. So we have a variety of, like, data engineers, data scientists, and these are specialized skills that people have. So while we are building Zingg, we are very, very conscious about the fact that the product has to be innovative enough. I mean, this audience can build anything that they choose to.
But is this the right use of their time? Are we giving them enough functionality, enough value, so that they come to Zingg and use it instead of building it themselves? Does it easily tie into their workflows, into the systems where the data is saved? So those are some of the key things in terms of the design that we've really been very conscious about.
[00:06:12] Unknown:
And as far as the cases where entity resolution and record deduplication are necessary, I'm wondering what are some of the, I guess, challenges or potential errors that those duplicate records, or multiple entities that all converge to the same identity, might cause, some of the ways that might manifest in analytical or data engineering workflows, and some of the ways that the entity resolution approach can solve those problems.
[00:06:46] Unknown:
So I think I would like to break this down into two things. One is duplication, which is, you know, you have something which you should not have had in the first place. So let's say, erroneously, in some cases, you have one customer record typed in multiple times. Let's say you are an event company and the same person has registered multiple times at your booth. I would say that would be a duplication error, and that would cause a problem in terms of data quality, in terms of counting your customers. But when we talk about entity and identity resolution, for us it's more about, you know, those customer 360 use cases, the supplier 360 use cases, which are fundamental to the reason why you are building your warehouse or the data lake in the first place. So, you know, like, lifetime value: if I'm counting 5 records as 5 different customers, though actually they represent 1 single customer who probably visited my store multiple times or came through multiple channels, that is a fundamental problem, you know, in my analytics.
So right from very simple counts of new customers added per quarter, to lifetime value, to higher order use cases like anti-money laundering and GDPR. Where is my data saved? Do we have ready access to all the data of this customer, and can we purge it when they request it? So the cases actually vary. But to start with, even the simplest reporting use cases actually get hampered without entity resolution.
[00:08:20] Unknown:
As far as the actual entity resolution workflow, for people who aren't familiar with it or who haven't either used it or dug deep into actually building it themselves, what are some of the requirements for being able to implement it, and some of the ways that it is incorporated into the data workflow? So, I guess, the locations in the data life cycle where it actually gets applied, and just some of the technical, infrastructure, and, I guess, organizational capabilities that are necessary for being able to implement it in the absence of something like Zingg?
[00:08:58] Unknown:
So the problem remains the same, right, whether you use a tool like Zingg or whether you don't. It is generally one of the first transformation steps in any data life cycle, because that is where your core entities, your core nouns, the customers, the suppliers, the products, are established, over which the dimensional data and the transactional data are added so that you can do your further analysis or apply your machine learning. So it's generally the first step. In terms of particular skills, I would say it's a combination of a lot of programming. And if you're building an ML based system, obviously, it requires knowledge of machine learning.
One particular challenge with entity resolution is really how do you determine what to compare. So when we talk about, like, you know, joins, I think that's fundamentally something all data people understand, and it's something that drives us crazy optimizing joins. We kind of, you know, are always working around the joins. Joins are with exact keys, but entity resolution is joins without keys. Now that absolutely turns the tables. It's like, if you have 10,000 records, you actually are comparing 10,000 records against 10,000 records. And the moment you, you know, go to a million records, the scale, the complexity, a million cross million join. It's a Cartesian join. So that completely blows up. So somebody who understands these nuances, who understands what to compare, how to compare, and is able to put them all together is actually the skill. I would say it's a mix of art and science.
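To make the scale point concrete, here is a small illustrative Python sketch, not Zingg's actual implementation, of why the keyless all-pairs comparison blows up and how "blocking" on a cheap candidate key shrinks the comparison space (the records here are made up for illustration):

```python
from itertools import combinations

records = [
    {"id": 1, "name": "Thomas Smith", "city": "Austin"},
    {"id": 2, "name": "Tom Smith", "city": "Austin"},
    {"id": 3, "name": "Maria Garcia", "city": "Boston"},
    {"id": 4, "name": "M. Garcia", "city": "Boston"},
]

# Naive keyless join: every record against every other record.
# For n records that is n*(n-1)/2 pairs; at a million records,
# roughly 500 billion comparisons, the Cartesian blowup described above.
naive_pairs = list(combinations(records, 2))
print(len(naive_pairs))  # 6 pairs even for just 4 records

# Blocking: only compare records that share a cheap-to-compute key.
# Here the key is the city; real systems learn or craft better keys.
blocks = {}
for r in records:
    blocks.setdefault(r["city"], []).append(r)

blocked_pairs = [
    pair for block in blocks.values() for pair in combinations(block, 2)
]
print(len(blocked_pairs))  # 2 pairs: candidates only within each block
```

The "what to compare" skill described here is essentially choosing, or learning, blocking keys that keep true matches in the same block while pruning almost everything else.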
[00:10:37] Unknown:
As far as the Zingg project, I'm wondering what was the motivation for creating it as open source and making it available for any practitioner to be able to take it and use it as part of their tool belt as an off the shelf component, versus having to go through the work of building their own framework and their own implementation to be able to apply these deduplication and entity resolution techniques?
[00:11:03] Unknown:
There are actually multiple reasons for Zingg being open source. One is that I have been a consumer of open source, so it's my way to kind of give back to the community. I've been using so many open source solutions. I built my consulting around that. So that is one. Secondly, I feel open sourcing Zingg is far more powerful than closed sourcing it, because people who are interested in the topic kind of contribute back to a growing framework or library. And we have people who are actively, you know, helping us, supporting us. Databricks has come out with their own notebooks with Zingg workflows, and we know a few others who are actually working around this.
Another reason is that open sourcing has power in the sense that entity resolution, and I've not even talked about Zingg here, is a problem that gets applied in, like, multiple scenarios. We talk about, you know, anti-money laundering and fraud use cases. We talk about GDPR. We talk about product matching, catalog matching. We talk about item recommendations. We talk about review aggregation. We talk about customer 360, supplier risk management. And as, like, the last year has taught me, there are so many more use cases than I could have personally learned about and thought about or reached out to those people for. Open source is a great way to, you know, be able to service or to help a lot more use cases than a closed source solution could. So those are practically my motivations for doing an open source product.
[00:12:36] Unknown:
I've discussed the overall concept of entity resolution a few different times with multiple different people. And from the brief, definitely nonexhaustive survey that I've done, it seems like the majority of applications for entity resolution are either as a feature of a broader product or as a kind of commercial capability or more frequently, something that is implemented kind of in house by the team that needs to apply that capability. And I'm wondering why you think there has not yet been a compelling or widely adopted solution for entity resolution and record deduplication up till now.
[00:13:16] Unknown:
This is a, you know, a very interesting question, and it kind of takes us to the evolution of, I think, the data industry as a whole, or the data tooling, the modern data stack as we know it now. So I think in the beginning, people kind of struggle with, you know, getting their base layers ready, which is collecting data, then transferring it to a central location over which they can actually, you know, start analyzing that data. Entity resolution happens the moment you have, you know, this data coming in and already saved. You start analyzing it, and you start realizing that this is, you know, a problem with your data. So it's not the first thing that you do while building out your data stack, but it's probably the first thing that you do as part of your data transformation. So the base layers for entity resolution have to be ready, on top of which it can kind of be applied. Because if the data is not in one place, you will not have the need or the urge to resolve your entities.
They are in separate systems. You are happy with your separate departmental silo, and you're working through your system. So only when you are, you know, holistically looking at your data, that's the time when entity resolution kind of strikes you. So one reason is just, you know, the evolution of data maturity and data collection; we now have very strong tools around all those capabilities. I think the second fundamental thing is that, as a problem, it's a fairly tough problem to solve. Especially the way I think we are solving it in Zingg, which is very domain agnostic and lets people apply it to a domain and entity of their choice, is a fairly intricate technical implementation.
It is not something that a lot of people immediately would, you know, jump into solving. People have custom solutions. They've built them for their own set of data, and a lot of smart people have already done it. There are some toolkits also available. So I think it's more about, you know, being able to solve it in a way that commercially would appeal to the kind of use cases that you would see. Thirdly, I fundamentally believe that it is, you know, one of the core categories for anybody putting together their data stack. The time for entity resolution is now. So, yeah, as you were mentioning, entity resolution is sometimes part of a tool, like maybe a CDP or an MDM or even a data quality tool. But CDPs handle part of the problem, which is more on the marketing side and, you know, digital channels.
And MDM, with all the history and the baggage of the technology, has not really tied into the data stack, the kind of stacks that we are now building. So there is definitely a strong need for a solution to fit into that space, where your warehouse or your lake is the central repository of your data, and entities are resolved directly and natively wherever your data resides.
[00:16:37] Unknown:
Digging into Zingg itself, can you talk through some of the implementation and design of the project and some of the ways that you have had to engineer around the requirement for it being generalizable and broadly applicable regardless of the underlying dataset?
[00:16:53] Unknown:
Yeah. I'd love to really have a long discussion, but let me keep it brief. So at the heart of it, Zingg is an ML based system. It learns from your data. We show the users some pairs through which they can create training data, and we learn from that. We learn various kinds of models, and some techniques that we use are AutoML. We use graph processing. We have our own internal clustering algorithms to break down the massive join problem that I described to you, to get it to scale. And we use Spark as the distribution engine, or we also leverage the warehouse for running those complex loads.
Internally, at the heart of it, I think the premise behind Zingg is that we learn from your data. We don't come in with any big truths or any big, you know, notions of similarity. The user says these are the attributes on which I want to match, and if this is a variation, I'm okay to mark them as matches. Zingg kind of, you know, picks that up and learns those patterns from the training data itself. So that's how it is actually able to generalize to different kinds of entities. You can throw addresses at it. You can throw events at it. You can throw people or suppliers or other kinds of entities.
And the core building block here is that the learning happens directly on the user data, running within your system, without any data getting transferred out. We are very, very frugal about learning, because I think one of the big fundamental problems while applying machine learning is that training dataset creation can become a tedious exercise. So we are very, very frugal about that, and 40 to 50 pairs is all you need to have a nice model which can predict similarity over billions of records.
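As a purely illustrative aside, not Zingg's internal code, the idea of learning similarity from a handful of labeled pairs can be sketched in a few lines of Python: each pair becomes per-attribute similarity features, and a simple classifier (scikit-learn here, with made-up records) learns how much variation still counts as a match:

```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

# A handful of user-labeled pairs: (record_a, record_b, is_match).
labeled_pairs = [
    (("Thomas Smith", "Austin"), ("Tom Smith", "Austin"), 1),
    (("Maria Garcia", "Boston"), ("M. Garcia", "Boston"), 1),
    (("Thomas Smith", "Austin"), ("Maria Garcia", "Boston"), 0),
    (("Tom Smith", "Austin"), ("M. Garcia", "Boston"), 0),
]

def features(a, b):
    # One similarity score per attribute (here: name and city).
    return [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]

X = [features(a, b) for a, b, _ in labeled_pairs]
y = [label for _, _, label in labeled_pairs]

# The classifier learns from the user's own labels how much variation
# per attribute still means "the same entity"; no global rules imposed.
model = LogisticRegression().fit(X, y)

candidate = (("Tomas Smith", "Austin"), ("Thomas Smith", "Austin"))
print(model.predict_proba([features(*candidate)])[0][1])  # match probability
```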
[00:18:55] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability.
Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. As far as that machine learning element and some of the ways that it fits into the data engineering life cycle, this is another topic that has come up a few times. Data engineering, as a practice, is very focused on being able to make things repeatable, reusable, and predictable, while machine learning is inherently probabilistic and not necessarily deterministic. And I'm wondering how you think about that balance in a tool like Zingg, where it is targeting this use case of, I want to be able to do something that is repeatable and is going to increase the reliability and reusability of my data, while also relying on this probabilistic approach to matching. And also some of the ways that you need to work on user education around what they can actually reasonably expect from it, as well as being able to balance in the tool, maybe having some sort of a dial where you can say, I want to bias towards extreme predictability, versus I really care more about having as much of a catchall for entity resolution as I can get. That's a wonderful question, actually, and it's something that we've really, really thought about very deeply.
[00:21:13] Unknown:
So when I talk about AutoML in Zingg and learning from the data, what we also learn is really where do we cut off, and where do we establish the balance between precision, which is saying that, you know, these 2 records are the same, and recall, ensuring that we catch every single match. And that's always a tricky exercise, but we actually learn that on the fly from the training data that the user provides to us. I think it's just entity resolution as a problem, right? As we said, it is a probabilistic approach, and I think all of us are used to zeros and ones. Obviously, this fuzzy matching is somewhere in between, which takes some amount of thinking through. What we do in Zingg is that we provide confidence scores with our end results, which say that this is the probability that these records are a match.
And this is the best this record has matched to any other record in this cluster. And what we see is that people are comfortable building on what Zingg gives. What they generally do is they pick another threshold, one that they are comfortable with, saying that, you know, maybe above an 80% probability, I'm happy trusting the algorithm. And for some of the lower thresholds, probably, we want to get a data steward to have a look. So I would say it's the nature of the problem, because if data is missing or if it's messy, you can't say 100% with conviction.
But the probability has to be in tandem with the likelihood that those records would seem to be a match to a human. And I think that's where we kind of leave it.
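A minimal sketch of that thresholding workflow is shown below using pandas; the z_cluster and score columns follow the match output layout described in Zingg's documentation, but the exact column names and the data here should be treated as illustrative assumptions:

```python
import pandas as pd

# Hypothetical Zingg match output: each input row gets a cluster id plus
# confidence scores (column names approximate Zingg's documented output).
matches = pd.DataFrame({
    "z_cluster":  [0, 0, 1, 1, 1],
    "z_minScore": [0.95, 0.95, 0.62, 0.58, 0.91],
    "z_maxScore": [0.99, 0.99, 0.88, 0.74, 0.97],
    "name": ["Tom Smith", "Thomas Smith", "M. Garcia",
             "Maria Garza", "Maria Garcia"],
})

THRESHOLD = 0.80  # the comfort level the team picks, as discussed above

# Auto-accept rows that matched their cluster confidently ...
confident = matches[matches["z_minScore"] >= THRESHOLD]
# ... and route low-confidence rows to a data steward for review.
needs_review = matches[matches["z_minScore"] < THRESHOLD]
print(len(confident), "auto-accepted;", len(needs_review), "for review")
```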
[00:23:11] Unknown:
As you have gone down the path of implementing this project and going from the initial idea and prototyping to where you are today, what are some of the ways that the design and goals have shifted over that time?
[00:23:27] Unknown:
So I'll tell you a very interesting story around this. If you go to the GitHub repo, you would see, you know, a dynamic GIF which shows the labeler, which shows some records to the user to label. And that was an afterthought. When I talk about Zingg right now, everywhere I use that GIF, which shows, you know, the labeler where the user can actually put in yes or no, this is a match, this is not a match, and train Zingg. But we implemented the algorithms first, and we did a training set by hand and fed it to the algorithms and said, yes, these algorithms work. But the moment we started working with some companies, we realized that they don't have a training set. So we had to build that ability to create training data within Zingg.
And I think that lesson has been very valuable in terms of Zingg being one single piece of software that lets you train, that lets you manage your model, that lets you get your results. And I feel that it's one of the most powerful aspects of what we've built, and we learned that just a few months before open sourcing it.
[00:24:40] Unknown:
So for people who want to incorporate Zingg into their data platform and their data workflows, can you start by describing what the installation and integration process looks like?
[00:24:51] Unknown:
So the installation is fairly straightforward in that we are a Java based application. So there is a JAR, which is a Spark job. It can be run on Elastic MapReduce or Databricks, any hosted Spark environment, or you can run it on your own Spark cluster. If it's a few million records, Zingg is powerful enough to be able to easily run that on a single machine, so you probably don't even need to bother about the Spark cluster. So you need a running JVM, the Zingg JAR, and a local Spark installation, which is just a tar you unzip. So, basically, two tar unzips, Zingg and Spark, plus a JVM, and you're pretty much set in terms of creating a JSON to describe your data and running the Zingg CLI to start labeling and building your training set.
The Zingg models are persisted to a file location. And once trained, you can use them again and again as part of an Airflow DAG or any workflow of your choice. So that is how it kind of works. You train it once using the labeler that Zingg comes with, and you run it multiple times. If you want to run it on the Spark cluster, you will use the same model and configure your Spark job on Databricks or some other environment. We also have a Python API, which came out recently. So instead of creating a configuration in JSON, you can write your Python programs, and Zingg will run them for you.
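For a concrete picture, here is a condensed sketch of driving Zingg through that Python API, loosely patterned on the example scripts in the Zingg repository; exact class and method names can vary by version, so treat it as an approximation rather than a verified program:

```python
# Condensed sketch, loosely following Zingg's published Python examples;
# exact class/method names may differ across versions.
from zingg.client import Arguments, ClientOptions, FieldDefinition, MatchType, Zingg
from zingg.pipes import CsvPipe

args = Arguments()

# Which attributes to compare, and how fuzzily each should be matched.
fname = FieldDefinition("fname", "string", MatchType.FUZZY)
city = FieldDefinition("city", "string", MatchType.FUZZY)
args.setFieldDefinition([fname, city])

# Where the trained model and labeled pairs are persisted between runs.
args.setModelId("100")
args.setZinggDir("models")
args.setNumPartitions(4)
args.setLabelDataSampleSize(0.5)

# Input and output locations (illustrative paths).
args.setData(CsvPipe("customers", "data/customers.csv"))
args.setOutput(CsvPipe("resolved", "/tmp/zinggOutput"))

# One phase per invocation: findTrainingData -> label -> train -> match.
options = ClientOptions([ClientOptions.PHASE, "match"])
Zingg(args, options).initAndExecute()
```

A JSON configuration with the same fields (fieldDefinition, data, output, modelId, and so on) drives the CLI equivalently.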
[00:26:24] Unknown:
As far as the workflow, once somebody has Zingg installed and it's part of their data platform, what are some of the ways that a data engineer or a data analyst might interact with Zingg, and some of the stages of the workflow where it's likely to be applied? Where I'm thinking in terms, maybe, of, like, the dbt workflow, where you have your raw data and then you stage it and then go through incremental and marts. In terms of the raw data, I mean, you train Zingg on your raw data once,
[00:26:54] Unknown:
and then you use Zingg to make the predictions. So that is your clustered data. Next time you have your incremental data, you actually run Zingg in an incremental mode, which links your incremental new data against the existing clusters and assigns them to those clusters. So that is how the Zingg workflow is. Train it once, deploy the model, run it in a match mode, which means finding out all the clusters within your data. And then incrementally, whenever new data comes in, run Zingg jobs at the frequency at which you need the clustering to happen, and make predictions against existing data.
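Since an Airflow DAG was mentioned above as the natural home for the recurring runs, here is a hedged sketch of that train-once, run-repeatedly schedule; the install paths, the schedule, and the use of the link phase for the incremental runs are illustrative assumptions rather than settings taken from Zingg's documentation:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="zingg_incremental_resolution",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The model was trained once via the label/train phases; each scheduled
    # run only resolves newly arrived records against existing clusters.
    link_new_records = BashOperator(
        task_id="zingg_link",
        bash_command=(
            "/opt/zingg/scripts/zingg.sh "
            "--phase link --conf /opt/zingg/config/customers.json"
        ),
    )
```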
[00:27:38] Unknown:
And so for teams who are using Zingg, what are some of the interfaces or extension points that are available for being able to customize its functionality, or customize the models that are being used for managing that entity resolution piece?
[00:27:55] Unknown:
Zingg is actually completely customizable in the sense that, you know, it's training on your data. As I mentioned, we don't come with a pretrained model. You train it on your data, so it is completely learning the rules on your own data and in your own system. So the customization happens as part of the training process of Zingg itself. In terms of interfaces, we have a Python interface through which you can program Zingg, and you can configure where your data resides, where you want the output to go, where you want the model, etcetera, to be saved, and some other performance related parameters.
And, similarly, we also have a JSON through which you can actually define some of these parameters. One of the extensions that people may like to do, and one of the more advanced user configurations that we have, is the ability to add stop words. Let's say, you know, you were looking at company names, and there are commonly occurring things like LLP, LLC, Pvt Ltd, which people just want to ignore. So Zingg has the notion that you can actually customize and add your own stop words. In fact, it can even suggest which stop words you should probably consider while looking at a column. So those are some ways in which you can customize your own matching. Beyond that, I think that's pretty much what it is.
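As a small illustration of that stop words feature, Zingg's field definitions accept a stop words list; the Python setter shown here is an assumption, so check your version's API or use the equivalent JSON configuration:

```python
# Hedged sketch: attach a stop-word list to one field so that common
# company suffixes are ignored during matching. The setter name is an
# assumption; Zingg documents a "stopWords" option on fieldDefinition.
from zingg.client import FieldDefinition, MatchType

company = FieldDefinition("company", "string", MatchType.FUZZY)
# CSV of terms to ignore, e.g. one per line: llp, llc, pvt, ltd, inc
company.setStopWords("models/100/stopWords/company.csv")
```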
[00:29:16] Unknown:
Looking at the documentation, it seems like the predominant workflow is reliant on Spark as the execution context. And I'm wondering if there has been any thought or effort put into being able to make it run as a standalone process, or being able to integrate it with other runtimes or execution frameworks.
[00:29:38] Unknown:
So I think you caught us there, because that's something we are very, very actively working on. It's still under wraps because it's a lot of discovery, I think, at this point in time, but we are very strongly getting there. The reason why we've chosen Spark as the execution framework is that Spark gives the ability to distribute the load. I mean, the problem of entity resolution inherently is, like, twofold. One is what to compare. Second is how to compare. Now what to compare, the join thing, that has to be broken down. Just throwing Spark at it alone doesn't solve it, because a Cartesian join is a Cartesian join. But we learn how to distribute it through the training data, and that's where the beauty of Zingg actually comes in. And users are able to scale their workloads to multimillion records easily.
Spark has been a fundamental part of our stack, but what we realize also is that many of our users are, you know, using, like, Snowflake. We are trying to see if we can leverage the Snowpark API in a similar fashion and, you know, kind of run it natively within their environment. So that is one fundamental thing that we're actually very strongly looking at.
[00:30:54] Unknown:
In terms of the engineering effort that has gone into this project, what are some of the places that you've been able to lean on prior art and existing best practices, and what are some of the complex engineering challenges that you've had to explore and solve on your own to be able to provide this as a generalizable framework?
[00:31:15] Unknown:
I would say Zingg, in that sense, has been a mixed bag. I think just due to the nature of the problem, there has been a lot of exploration and discovery. In terms of leaning on existing work, we use standard libraries for string comparison. We use something called SecondString, which helps us compute string similarity and differences. We heavily use Spark. We use part of their ML package. We use the graph package. We learned a lot from their Python packaging, because Python was an afterthought for Zingg, which came in when we saw users working with the config files and some of them wanting to integrate Zingg into their Python workflows. And when we added Python, ours being a Java native app, we had to really, you know, struggle to not overdo the Python stuff but still be able to give a good API.
And we actually learned a lot from the way PySpark is written, which is using Py4J internally and transferring all the work eventually to the Scala Spark layer. So all the packaging, all the API stuff, we actually learned a lot from that. There have been a lot of research papers we've read and implemented bits and pieces of; like, the active learning part that we have in Zingg is learned through multiple publications and multiple research papers. We've definitely had to tweak them to apply them to the problem at hand. But many of these are techniques that are used across various kinds of products.
Yeah. That's pretty much it, I think: a lot of learning from various places, but then applying it to the problem at hand, I think, has been the tricky part.
[00:33:03] Unknown:
And as far as the project itself, what are some of the ways that you're thinking about the governance and sustainability and any opportunities for commercial applications of it?
[00:33:15] Unknown:
In terms of, I think, where we are right now, we have a lot of users. We have, like, 230 people on Slack. We have a lot of downloads. We have various kinds of use cases coming out of people using Zingg. We've not had a lot of direct active developer contribution coming to it, except, I think, the biggest one has been Databricks, who've done their notebooks. I think those problems of, you know, governing the flow, etcetera, have not really come to us that much. Zingg is backed by a corporate entity, Zingg.ai, which is a US based startup.
So in that sense, there is definitely a commercial angle to it. We are working on some enterprise features which we think would be valuable to the end user beyond what the open source is. The open source is definitely fairly powerful. There are some things in terms of the entire workflow, in terms of the ability to do more of the data stewardship after Zingg has given the results. So those are some areas in which we are actually working right now.
[00:34:27] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up for free today at dataengineeringpodcast.com/rudder. In your experience of building Zingg and working with the community members and growing awareness around it, what are some of the most interesting or innovative ways that you've seen it applied?
[00:35:10] Unknown:
It's been fascinating, honestly. When I was working on Zingg, my prior experience was working with commercial entities, and, you know, I was always thinking about insurance and life sciences and patients and policyholders and customer 360. One thing that has made me really proud is Zingg being applied for the North Carolina campaign open data project, which is an open initiative where all the donor data and the recipient data for campaigns in North Carolina is reported. There were a lot of mismatches in the donors as well as in the recipients, and Zingg has been applied there to identify exactly how much funding is flowing between a donor and a recipient.
And it's really a fascinating read. I think it's something I'm very, very proud of, and very unexpected compared to what I thought Zingg would ever be applied to. We talk about spend optimization, but this is definitely something which has really made me very proud.
[00:36:20] Unknown:
As far as your experience of building this project and founding the business around it and interacting with the community and the broader ecosystem, what are some of the most interesting or unexpected or challenging lessons you've learned in the process?
[00:36:34] Unknown:
My experience over the last year has been very, very positive. I would count myself fortunate to have gotten a lot of spotlight. I'm not sure if it's just good luck, or if it's the problem that I'm solving, or if it's a combination of everything together. But Zingg has gotten a lot of good coverage from data leaders and influencers. Even academics have written to me and told me that they've liked the project. So it's been a very good journey. I was honestly not expecting this level of support and this level of notice, but I feel that this has really helped me on the open source journey, to be able to go ahead with conviction and to kind of, you know, continue on this path to solve this extremely tough and important problem. I would say I'm really thankful to all the people who've really supported me in many ways, even in your case, like, you know, spreading the good word about Zingg. Your listeners will actually be hearing this.
We've also done some demos with the DataTalks.Club. So it's been a very, very good journey so far.
[00:37:45] Unknown:
For people who are running into this challenge of duplicate records, or trying to do entity resolution or master data management, what are the cases where Zingg is the wrong choice?
[00:37:57] Unknown:
Zingg is the wrong choice if your data is simple, if you have clearly defined identifiers, like email IDs that you can trust, or customer identifiers in your product database that you are absolutely sure you are mapping all over your systems. So if you are starting out on your data journey, if it's a small data team and you don't have multiple sources of data, or your data is really small, like, maybe, you know, 10,000 records or 50,000 records, I think something like Zingg would be overkill. Beyond that, yeah, do try Zingg.
[00:38:36] Unknown:
As you continue to build and iterate on the Zingg project and work with the community and expand the capabilities, what are some of the things you have planned for the near to medium term, or any features or projects that you're excited to dig into?
[00:38:49] Unknown:
So I'm very excited to think about Zingg as the tool for data matching and data transformation in some senses. I kind of think that, you know, we have matching in various forms. We probably don't realize it, like with images. So it's not just text. It's also unstructured data. It's also a lot of product definitions. It's also reviews. It's also images. They all need matching in some way or the other. So the notion of similarity to me is very, very appealing. And broadly, that's where I think, in the long term, my vision for Zingg lies, which is really resolving, you know, any kind of entity. And that entity may not really be a physical entity. It may also be, like, a digital entity.
In the near term, we are working on making Zingg very easy to use. So, based on user feedback, we shipped a Docker image. We introduced the Python library. We are working on the Snowflake integration; we already have a Snowflake integration, but we are doing even more native work with Snowflake, and deeper integration with Databricks. The ability to support the workflow where the workflow is actually happening is the core behind Zingg. And then we want to really add a lot more things on the usability side, like, you know, suggesting that maybe if you split this column into 2 columns, your matching results will improve.
Or maybe if you apply this transformation to your raw data, you can have far better results. And these could have, you know, implications even otherwise. So those are things I would just love to build.
[00:40:30] Unknown:
Yeah. And keying off of your mention of the similarity search aspect of it, I'm curious if you have done any work looking into things like Milvus or Pinecone, as far as these vector databases, or working in the vector embedding space for being able to do that similarity search in the vector space as well.
[00:40:49] Unknown:
Yeah. I'm definitely learning a lot about the vector space, and I feel some mix of Zingg with some of these technologies would be awesome, because then you can do mixed media, right? It's the beauty of mixed media. It would be wonderful. Yes. Yeah.
[00:41:05] Unknown:
And talking about this, I recognize this is totally out in left field and highly speculative, but from the vector database perspective, I think it would be interesting to see if there is viability of a kind of DuckDB style approach for vector databases, where you can have a lightweight, in-process approach to generating these vector embeddings and doing the similarity search in process, without having to have another piece of infrastructure to manage.
[00:41:35] Unknown:
That's a great point, and that is to an extent my thought as well. So the way I look at Zingg is, you know, as the common framework over which you can bring in your hash functions, your neighborhood search, and then you can bring in your notion of similarity. And then the workflow is what Zingg manages, which is essentially how Zingg is built right now. The vector database is definitely a different piece of infrastructure, as you mentioned. So maybe that indexing doesn't happen in, like, a separate system. We're already indexing, right? We're already breaking the problem of matching everything with everything with smart indexing internally. So, yeah, maybe that's done in memory, and the user doesn't even know or doesn't even care about it. It's just transparent, and they just consume the results.
You know? Focus on better things that they need to focus on. Absolutely.
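To make that speculative idea tangible, here is a toy Python sketch of doing embedding similarity entirely in process with nothing but NumPy, with no separate vector database to operate; the embed() function is a placeholder standing in for a real embedding model:

```python
import numpy as np

def embed(texts):
    # Placeholder standing in for a real sentence-embedding model.
    rng = np.random.default_rng(42)
    return rng.normal(size=(len(texts), 8))

texts = [
    "Tom Smith, Austin TX",
    "Thomas Smith, Austin",
    "Maria Garcia, Boston",
    "T. Smith, Austin",      # treat the last entry as the query
]
vecs = embed(texts)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize

query, corpus = vecs[-1], vecs[:-1]
scores = corpus @ query  # cosine similarity, entirely in memory

best = int(np.argmax(scores))
print(texts[best], round(float(scores[best]), 3))
```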
[00:42:30] Unknown:
Are there any other aspects of the Xyng project or the problem space of entity resolution and record matching and master data management that we didn't discuss yet that you'd like to cover before we close out the show?
[00:42:42] Unknown:
I think it would be very interesting to even build out, you know, some ways around that. We talk about golden records in the MDM space a lot, which is the final view of all your records amalgamated into one single record which you can absolutely trust. And that, again, is a very rule based approach so far. I would love to see, you know, how we could tackle that in Zingg in a very user defined, user configured way, where the user says this is how it should look, and Zingg just figures out what are the internal rules that need to be applied. I think that would be, again, interesting to build.
[00:43:23] Unknown:
Are there any particular areas of contribution or feedback that you're looking for from the broader community?
[00:43:30] Unknown:
I would love for people to kind of, you know, try it on different kinds of datasets. I know people have scaled it. So when we released it, we called it scalable. Most enterprise data generally is, you know, 4 or 5 million records maximum. So we thought it was going to go well, and it would scale, because we didn't see stress at 15 million records. But we've had users run it at 75 million without any stress. So that's very encouraging. If people can, you know, let us know what are the problems you are facing, how are you thinking about incorporating it into your workflow, is there anything which is stopping you from using Zingg?
I mean, good or bad, I think every feedback is welcome right now, and we would love to build it together and make it more usable.
[00:44:16] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:44:31] Unknown:
So in terms of the tooling, we definitely have a lot of tooling. I think those pieces are strong. The ETL is strong. But I would again say that it's still a lot of effort. There's still a lot more I think we can do, especially coming from an ML perspective. My thought and my vision of the world is that while we are building ML systems for, you know, other people, we don't have any ML systems which are helping us in our day to day jobs. Zingg is definitely an attempt to do that, but I would love to see, you know, like, index suggestions or transformation suggestions, recommendations for data people on how to wrangle their data, you know, what kind of views to build by looking at some raw data.
Even, like, you know, possible SQL query generation. Those are things I feel would be the next frontier, which should happen sometime in the near
[00:45:32] Unknown:
future. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on Zingg. It's definitely a very interesting project, and great to see an open source option for entity resolution and data matching, which is definitely a very real and ubiquitous problem. So I appreciate all of the time and energy that you and your collaborators are putting into making that available. So thank you again for taking the time today, and I hope you enjoy the rest of your day. Thank you, Tobias. It was great talking to you today, and thank you for having me today.
[00:46:06] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.
Challenges in Modern Data Pipelines
Introduction to Sonal Goyal and Zingg
The Genesis of Zingg
Target Audience and Use Cases for Zingg
Challenges and Errors in Entity Resolution
Entity Resolution Workflow and Requirements
Open Source Motivation and Community Contributions
Why Entity Resolution is a Tough Problem
Technical Implementation of Zingg
Balancing Probabilistic and Deterministic Approaches
Evolution of Zingg's Design and Goals
Installation and Integration of Zingg
Zingg's Workflow in Data Platforms
Customization and Extension of Zingg
Exploring Other Execution Frameworks
Engineering Challenges and Solutions
Governance and Sustainability of Zingg
Interesting Applications of Zingg
Lessons Learned from Building Zingg
When Zingg is the Wrong Choice
Future Plans and Exciting Features
Similarity Search and Vector Databases
Golden Records and Future Directions
Community Contributions and Feedback
Biggest Gaps in Data Management Tooling