Summary
Most of the time when you think about a data pipeline or ETL job, what comes to mind is a purely mechanistic progression of functions that move data from point A to point B. Sometimes, however, one of those transformations is actually a full-fledged machine learning project in its own right. In this episode Tal Galfsky explains how he and the team at Cherre tackled the problem of messy address data by building a natural language processing and entity resolution system that is served as an API to the rest of their pipelines. He discusses the myriad ways that addresses are incomplete, poorly formed, and just plain wrong, why it was a big enough pain point to invest in building an industrial strength solution for it, and how it actually works under the hood. After listening to this you'll look at your data pipelines in a new light and start to wonder how you can bring more advanced strategies into the cleaning and transformation process.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Your host is Tobias Macey and today I’m interviewing Tal Galfsky about how Cherre is bringing order to the messy problem of physical addresses and entity resolution in their data pipelines.
Interview
- Introduction
- How did you get involved in the area of data management? Started as physicist and evolved into Data Science
- Can you start by giving a brief recap of what Cherre is and the types of data that you deal with? Cherre is a company that connects data. We're not a data vendor, in that we don't sell data, primarily. We help companies connect and make sense of their data. The real estate market is historically closed, gut led, and behind on tech.
- What are the biggest challenges that you deal with in your role when working with real estate data? Lack of a standard domain model in real estate. ONTOLOGY: what is a property? Each data source thinks about properties in a very different way, yielding similar but completely different data. QUALITY: even if the datasets are talking about the same thing, there are different levels of accuracy and freshness. HIERARCHY: when is one source better than another?
- What are the teams and systems that rely on address information? Any company that needs to clean or organize (make sense of) their data needs to identify people, companies, and properties. Our clients use address resolution in multiple ways, via the UI or via an API. Our service is both external and internal, so what I build has to be good enough for the demanding needs of our data science team, robust enough for our engineers, and simple enough that non-expert clients can use it.
- Can you give an example of the problems involved in entity resolution?
Known entity example: the Empire State Building.
To resolve addresses in a way that makes sense for the client you need to capture the real world entities: lots, buildings, units.
- Identify the type of the object (lot, building, unit)
- Tag the object with all the relevant addresses
- Relations to other objects (lot, building, unit)
- What are some examples of the kinds of edge cases or messiness that you encounter in addresses? The first class is string problems. The second class is component problems. The third class is geocoding.
- I understand that you have developed a service for normalizing addresses and performing entity resolution to provide canonical references for downstream analyses. Can you give an overview of what is involved?
What is the need for the service? The main requirement here is connecting an address to a lot, building, or unit with latitude and longitude coordinates.
- How were you satisfying this requirement previously? Before we built our model and dedicated service we had a basic, pipeline-only prototype that handled NYC addresses.
- What were the motivations for designing and implementing this as a service? Need to expand nationwide and to deal with client queries in real time.
- What are some of the other data sources that you rely on to be able to perform this normalization and resolution? Lot data, building data, unit data, Footprints and address points datasets.
- What challenges do you face in managing these other sources of information? Accuracy, hierarchy, standardization, a unified solution, persistent IDs and primary keys.
- Digging into the specifics of your solution, can you talk through the full lifecycle of a request to resolve an address and the various manipulations that are performed on it? String cleaning, parse and tokenize, standardize, match.
- What are some of the other pieces of information in your system that you would like to see addressed in a similar fashion? Our named entity solution with connection to knowledge graph and owner unmasking.
- What are some of the most interesting, unexpected, or challenging lessons that you learned while building this address resolution system? Scaling the NYC geocode example: the NYC model was exploding a subset of the options for messing up an address. Flexibility. Dependencies. Client exposure.
- Now that you have this system running in production, if you were to start over today what would you do differently? A lot, but at this point the module boundaries and client interface are defined in such a way that we are able to make changes or completely replace any given part without breaking anything client facing.
- What are some of the other projects that you are excited to work on going forward? Named entity resolution and Knowledge Graph
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today? BigQuery is a huge asset, in particular UDFs, but they don't support API calls or Python scripts.
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Cherre
- Photonics
- Knowledge Graph
- Entity Resolution
- BigQuery
- NLP == Natural Language Processing
- dbt
- Airflow
- Datadog
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
RudderStack is the smart customer data pipeline. Easily build pipelines connecting your whole customer data stack, then make them smarter by ingesting and activating enriched data from your warehouse, enabling identity stitching and advanced use cases like lead scoring and in-app personalization. Start building a smarter customer data pipeline today. Sign up for free at dataengineeringpodcast.com/rudder. Your host is Tobias Macey. And today, I'm interviewing Tal Galfsky about how Cherre is bringing order to the messy problem of physical addresses and entity resolution in their data pipelines. So, Tal, can you start by introducing yourself? Hi, Tobias.
[00:01:34] Unknown:
I'm Tal. I'm a data scientist at Cherre. Do you remember how you first got involved in the area of data management? So I took a bit of an interesting route in my career. 1st and foremost, I'm a physicist, and my academic track originally was aimed for working in optics and photonics. So when I finished my PhD, I started working as a photonics designer for an optical communications company. And then 1 day, 1 of our former postdocs invited me to check out Cherre's office. He started working there. He says, cool people. Come check it out. And it was really kind of an immediate hit. So Cherre really met all the criteria that I have when it comes to looking at prospective projects.
Specifically, they were working on a really challenging problem, redefining the domain model for real estate. They were working on a high impact issue, and it was work with smart people. Really smart people. Right? So the type of problems Cherre needs to deal with is very challenging. And then what we're doing, really, if you think about the real estate industry state as it is right now, the parallel would be kind of what speed trading or AI did for the stock markets. Like, really changing the way the real estate market is looking at tech. So all that looked good. I joined Cherre, and I learned about the domain of real estate from Ben Hizak, who's 1 of the founders.
And I learned about functional and object oriented programming and test driven design and microservices from Madison Sterling, who's a senior engineer. I learned about NLP and knowledge graphs from the awesome John Maiden, our head of machine learning engineering, who you've interviewed before, as well as from Ron Bekkerman, who's our CTO and a data science professor. So it's been really great working with all these people. It's been an awesome learning experience.
[00:03:36] Unknown:
It's been a fun transition. As you mentioned, you're working at Cherre, and, you know, I had John on before to talk a bit about some of the specifics of building the knowledge graph that you use. But can you give a bit of a recap of what it is that you're doing at Cherre and some of the types of data that you're dealing with and how you're kind of addressing the problem of messy data within the real estate market? Cherre itself is a proptech company.
[00:04:01] Unknown:
And what we do is we give our clients the tools to take a data driven approach so they can make better investment, management, and underwriting decisions. As I said, the real estate market currently is not very tech savvy. It's a little bit behind the technology. There's still a very strong human component when it comes to decision making. You can kind of think about it as a human pipeline. You wanna make a decision about a property, you need to aggregate data from all these different sources within your organization. Primarily, Cherre doesn't sell data.
What we do sell is connectivity and insights. First of all, we allow automation and connectivity within your organization. So you can get the data you need quickly in an organized manner and, like, bring it up when you need it. And then our clients, they do bring us their data, but they also bring the questions that they're trying to solve. And we basically tell them, okay, so let's take your questions. Let's see if we can design a domain model to help answer these questions. So we help our clients make sense of their data and better connect it so they can make better business decisions.
[00:05:09] Unknown:
In terms of the specific topic that we're discussing here, I know that 1 of the sort of core attributes of any piece of property is the physical location, which is generally denoted by the address, at least within the United States. There might be different systems in different countries. And I'm wondering what are some of the biggest challenges that you're dealing with in your role when working with real estate data and how that leads into some of the work that we're gonna be discussing today about address normalization and entity resolution? Well, I guess the biggest challenge in my role is very tied up with the biggest challenge that we have as a company.
[00:05:46] Unknown:
And that is, basically, there is no standard model for data in real estate. And I'll give you an example of what I mean by that. So let me ask you a question. What's your favorite programming language?
[00:05:58] Unknown:
I primarily use Python.
[00:05:59] Unknown:
Great. So how do you define a variable?
[00:06:03] Unknown:
In Python, you just set an arbitrary name that doesn't start with a number, add an equal sign, and then assign a value to that variable to act as a placeholder. It can be a none. It can be the concrete value that you want. It can be of any variety of types.
[00:06:18] Unknown:
Yep. That's right. So we can say it's a container to store data values. Right. Right. We assign x equals 5 or x equals a string. Yep. And you have variables of different types. So this is actually a very well defined data model. We know the entities that are involved when we work in Python. Right? We have methods. We have classes. We have variables. Now let's ask you a different question. How would you define a real estate property?
[00:06:45] Unknown:
Yeah. That's definitely a much broader question. You know, it could be a, you know, building that sits at this address. It could be there's an empty lot that sits at this sort of geographical boundary that has some bounding box using a combination of lat and long coordinates. It can be, you know, multiple locations that are all owned by a single entity that are being purchased as a unit.
[00:07:08] Unknown:
Exactly. Right. So it turns out that you will get a different answer to this question depending on who you ask. So if you ask a tax assessor, they would say, yes, I have a tax lot. I don't care what's on it. It's a property. If you ask someone who does vacation rentals like Airbnb, they can say, oh, anything can be a property. It can be an igloo in Alaska. That's a property that I'm renting, or a tent in someone's backyard. So really, anyone who's dealing with real estate has a different definition of the concept. On top of that, real estate is something that is, by default, tied to geography. So it's geographically spread.
So every single place you would go to is now gonna have different rules and regulations about real estate. So, yeah, this is the type of question that we wanna answer. And the kind of companies we work with, like a typical real estate company, can have $5,000,000,000 in assets under management. Right? And if you ask them the question of how many properties they own today, they might not be able to answer this question with confidence. Right? And it's not just because it's hard for them to define what is a property. It turns out that this is a fairly complex question to answer without a very good entity resolution model.
I guess the biggest challenge is the ontology. We had to come up with an entity resolution model for not exactly what is an address, but what is a property. So we usually talk about lots. We talk about buildings. So you have a lot. The building sits on a lot. Within a building, you can talk about different units. So we try to tie everything into lots, buildings, and units. I guess the second challenge that we have now is quality. Right? Because we have all these different datasets and different data sources. And when I get a dataset that I need to handle, I start asking the questions: alright, how is this data collected? When was this data collected? How much can I trust this source to be accurate? And then, how does it connect to all the other sources that I have? Which brings me to the 3rd challenge, which is making decisions that are related to hierarchy of sources.
Right? Not all data is created equal. Some datasets are better than others when it comes to certain aspects. And then when we take into account the quality of the source and the exact type of entities they refer to, we start making decisions about hierarchy. Yep. It's a bit of a long winded way of answering the question.
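To make the hierarchy idea concrete, here is a minimal sketch of the kind of field-by-field survivorship decision being described. The source names, rankings, and fields are hypothetical, not Cherre's actual hierarchy.

```python
# Hypothetical ranked source hierarchy: lower number wins when sources disagree.
SOURCE_RANK = {"tax_assessor": 0, "county_deeds": 1, "vendor_feed": 2}

def coalesce_property(records: list[dict]) -> dict:
    """Merge records describing the same property, preferring
    higher-ranked sources field by field."""
    ordered = sorted(records, key=lambda r: SOURCE_RANK.get(r["source"], 99))
    merged: dict = {}
    for rec in ordered:
        for field, value in rec.items():
            if field == "source" or value in (None, ""):
                continue
            merged.setdefault(field, value)  # keep the first (best-ranked) non-empty value
    return merged

records = [
    {"source": "vendor_feed", "address": "20 W 34 ST", "year_built": 1931},
    {"source": "tax_assessor", "address": "20 WEST 34TH STREET", "year_built": None},
]
print(coalesce_property(records))
# {'address': '20 WEST 34TH STREET', 'year_built': 1931}
```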
[00:09:35] Unknown:
That's good. It's definitely helpful too, because every domain has its own aspects of how you look at the data, you know, its own way that the messiness kind of manifests. And real estate, I think, as you so well demonstrated, has more than its fair share of messiness. Exactly. And so because of the fact that there are so many different ways of thinking about what is a property and, you know, how do you represent it, what were some of the most common and biggest pain points that you were running up against, and what led you down the path of deciding that address resolution was worth investing, you know, a full sort of engineering effort in solving, I'm not gonna say once and for all, but at least to some level of satisfaction?
[00:10:16] Unknown:
I guess the 1 common thread that goes through all of these datasets at the end of the day is addresses. People talk about the properties in terms of the address. It's a very common thing. You hope to have lat long in some of these, like, the coordinates, but eventually, you gotta solve the problem of addresses. If you're going to connect all these different datasets that are supposed to be talking about the same thing, you gotta be able to take all these different addresses. You need to standardize them to a single form, and you need to connect them to the individual entities or to the real world entities that they're talking about.
[00:10:53] Unknown:
And so within Cherre's systems and the pipelines that you're building, can you give some examples of the types of teams that are relying on the address information and how they're hoping to be able to consume it, and some of the systems that are responsible for being able to source that information and deliver it to the teams, the other sort of programmatic systems that are relying on that information for being able to produce derived data products or be able to create decisions for your customers?
[00:11:23] Unknown:
Yeah. We can start, I guess, internally and work towards clients. So as I said, the nature of the data is such that addresses exist in almost every dataset that we take in. So the address resolution, the address service needs to be integrated directly into the pipeline. Every time we take in a dataset, the engineering team needs to be able to send the address information through the service and get a standardized address at the end. On top of that, we also have the data science team that needs to use the address resolution engine to power the knowledge graph so we can connect all the different entities within the knowledge graph.
So it needs to be good enough that they can do that. And then, of course, we also need to serve addresses to clients, standardized addresses to clients. And we need to be able to take in addresses from clients in real time, so the front end team also relies on this service. And we also wanted to give you the same result whether you have this address from a dataset and you processed it in bulk and sent it through the pipeline, or it came in through the API or the UI.
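What follows is a small, hypothetical sketch of that "one shared library, two entry points" idea: a single standardization function (a stand-in, not Cherre's actual logic) called from both the bulk pipeline path and the real-time API path, so the same input yields the same output either way.

```python
def standardize_address(raw: str) -> str:
    # Stand-in for the real standardization logic.
    return " ".join(raw.upper().replace(".", "").split())

def bulk_entry_point(rows: list[dict]) -> list[dict]:
    """Batch path: called from the pipeline over a whole dataset."""
    return [{**row, "std_address": standardize_address(row["address"])} for row in rows]

def api_entry_point(payload: dict) -> dict:
    """Real-time path: called per request from the API or UI."""
    return {"std_address": standardize_address(payload["address"])}

# Both paths route through the same function, so a given input is
# standardized identically in bulk and in real time.
assert (
    bulk_entry_point([{"address": "20 w. 34th st"}])[0]["std_address"]
    == api_entry_point({"address": "20 w. 34th st"})["std_address"]
)
```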
[00:12:29] Unknown:
In terms of being able to actually use the addresses programmatically, what were some of the ways that you were trying to parse the information and be able to translate that within the projects that you were working on? And what were some of the biggest difficulties that were posed by the fact that there weren't consistent ways of representing the data, or the same address might show up in, you know, different structures or different forms based on the source that you were pulling it from? That's a good question. I'm probably gonna answer it in several parts.
[00:13:03] Unknown:
We can start with, like, a related example. When you talked to John about knowledge graphs, I think you probably talked about entity resolution a little bit for property owners' names. So let's say well, let's take a known entity as an example. Right? Let's say Michael Jordan. And let's say that I have 3 unrelated datasets from different sources. 2 of them talk about commercial property owners, and the third 1 has people's contact info. The 1st dataset has owner name Michael Jordan. The 2nd dataset, the owner name is Michael J Jordan. And the 3rd dataset, the 1 for contact info, has the contact person as MJ.
Just MJ. Right? So how do we know that all of these 3 data points refer to the same individual? Right? How would you approach this problem?
[00:13:56] Unknown:
1 way is just kind of making a best guess of splitting on the first and last name and then trying to match up 2 initials, but that can obviously be, you know, a very lossy resolution because multiple people can have the same initials and very different names. And then also, if you add in the middle name, then that also confounds the basic logic. And I know that there's a whole area of research about how to actually properly do entity resolution based on different variables. So
[00:14:25] Unknown:
Exactly. There's a lot of ambiguity. Specifically, for Michael Jordan, I looked it up in the tax assessor data, and Michael Jordan shows up 760 times. And it's not just because he owns a lot of stuff, I'm sure he does, but I'm also sure there's a lot of people called Michael Jordan. Right? So if you just see the name, you can't solve this problem from just the name itself. You need to be able to solve it from context. So for example, if the first 2 datasets have property information, you can say with confidence, okay, these 2 properties are actually the same property. Then it's very likely that Michael Jordan and Michael J Jordan are the same person, since he's listed as the owner of that property. And the 3rd dataset, for example, if you have a mailing address and the mailing address is the same as the 1 you see for the owner of the properties, you can say, oh, okay. So in that case, Michael Jordan, Michael J Jordan, and MJ are all names for the same entity. They're alternative names.
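As a toy illustration of resolving names through shared context, the sketch below groups owner names by a matched property identifier (the field name is hypothetical), which is the kind of signal that lets Michael Jordan, Michael J Jordan, and MJ collapse into one entity.

```python
from collections import defaultdict

# Three records from unrelated sources; property_id is a hypothetical key that
# was obtained by matching the property (or mailing address) across sources.
records = [
    {"dataset": 1, "owner": "Michael Jordan",   "property_id": "LOT-123"},
    {"dataset": 2, "owner": "Michael J Jordan", "property_id": "LOT-123"},
    {"dataset": 3, "owner": "MJ",               "property_id": "LOT-123"},
]

aliases = defaultdict(set)
for rec in records:
    # The shared context, not the name string itself, is what links the records.
    aliases[rec["property_id"]].add(rec["owner"])

print(aliases["LOT-123"])
# {'Michael Jordan', 'Michael J Jordan', 'MJ'} -> alternative names for one entity
```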
Pretty much the same exact problem applies for addresses, except the names for addresses, or the alternative addresses, are not even going to be that similar to each other. So you can have 2 datasets. 1 of them, for example, would have 20 West 34th Street, New York City, and the other 1 is going to have Empire State Building as an address. And then you need to ask yourself, okay, so can I connect these 2 together? Should I connect these 2 together? And I think as I mentioned before, in the Cherre domain model, we talk about 3 types of entities that have addresses. So we talk about tax lots, 1st and foremost.
So first of all, there's the land that the building is on. If you take the Empire State Building, there's the land that the Empire State Building is on, which is the tax lot. And then when it comes to buying and selling properties, this is really what is being sold. You can't buy a building, but you can buy land with a building on it. The second entity that we talk about is the building itself, which you can think about as a property or a feature of the tax lot. So the tax lot is a container for the building. Then the 3rd entity is units within the building.
16th floor, unit 14, for example. Right? In this case, the building is a container for the unit, just like a variable is a container for data values in Python. So going back to the question of the 2 datasets and should I connect them? First of all, I need to look at the type of objects that these datasets are talking about. Are both of them talking about lots or buildings, just so I know how to connect them? Is it going to be a lot to lot connection, or is it going to be a lot to building connection, kinda like parent to child sort of thing? So once we did that, we wanna tag the object in the dataset with all the relevant addresses, so we know that the Empire State Building resides on 20 West 34th Street.
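A minimal sketch of that lot, building, unit containment, using the Empire State Building as the example; the IDs and field names are made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Lot:
    lot_id: str
    addresses: list[str] = field(default_factory=list)

@dataclass
class Building:
    building_id: str
    lot_id: str                      # the lot is the container for the building
    addresses: list[str] = field(default_factory=list)

@dataclass
class Unit:
    unit_id: str
    building_id: str                 # the building is the container for the unit

lot = Lot("LOT-0001", ["20 West 34th Street, New York, NY"])
esb = Building("BLDG-0001", lot.lot_id,
               ["Empire State Building", "20 West 34th Street, New York, NY"])
unit = Unit("UNIT-16-14", esb.building_id)   # e.g. 16th floor, unit 14
```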
And then we wanna find the relations to all the other objects. So if the 2 datasets talk about buildings, this is the type of connection that we're gonna have. If 1 of them talks about a lot and the other 1 talks about a building, we're gonna make it that kind of connection. So parent to child kind of thing. In terms
[00:17:37] Unknown:
of how you're working with the address data, I'm imagining that you probably do this kind of at the point where you're trying to resolve the entity within some given project and that maybe you have some shared libraries for being able to do this in a fairly standardized way. So what was the biggest pain point that still existed that put you down the road of deciding this needs to be part of the data pipeline that I can then just consume as a call to an API to be able to understand what is the canonical representation of this address that I was just fed based on all of the other information that I've been able to build and to sort of turning a full kind of data science project into a component of your data engineering pipeline?
[00:18:24] Unknown:
Several big points. I think the first one's to actually internally be able to come up with an entity resolution data set or a canonical source of truth that makes all these connections, and then we can start talking about these objects in terms of object IDs instead of you know, we have 3 addresses and then go find out if they're related to each other without being able to connect them to some sort of object. Like, just connecting the addresses to coordinates is a good start that would resolve at least the problem of alternative addresses in most cases. But since you're talking about different entities, you also wanna be able to answer that question.
So first off, we needed to build out our canonical source of truth. And then the next thing we needed to deal with is, okay, so now we have this very good dataset for address points, like addresses that connect to specific geographical locations and the type of entities that they connect to. And now we need to be able to take any address string and be able to standardize it and connect it to this source of truth. So that was the second big challenge. This is where we had to start using NLP tools. We actually have a hierarchy of strategies
[00:19:42] Unknown:
that we use as we take in an address string, parse it, standardize it, and then eventually match to our source of truth. So can you talk a bit more about the overall project that you did end up building to integrate into your data pipeline and just some of the particular challenges that were posed because of the fact that you're trying to do it in sort of an automated fashion, where you can consume this as part of the data in flight without having to have a human intervene to be able to sort of answer all the different little edge cases? Yeah. I guess to answer that,
[00:20:15] Unknown:
we kind of want to understand what are the types of problems that we see when we need to handle addresses. If you think about what an address is, it's a tag. And it's a text string tag. Right? So it's a text string. It's usually longer than 20 characters. So as you can imagine, it's gonna have all the classic problems that you can find with text. Right? So you could have typos. You can have missing spaces. You can have random bits of text that actually don't belong to the address. For example, I could have a person's name tacked onto the front of 20 West 34th Street, New York City. So this is the first class of problems, string problems. The second class is actually component problems. So an address is something that's actually very well defined. If you take a mailing address, you can say, okay, this is the house number, this is the street name, street suffix, prefix, city, state, ZIP. You can take the address and then take the individual components.
And the thing is there's no standard way that people write these components, so people tend to abbreviate. So you need to deal with abbreviations. You also need to deal with alternative place names. So for example, if you take New York City, it often gets abbreviated to NYC. It's also often an alternative place name for Manhattan. The same thing happens with abbreviations, like Street that becomes St. So that's something that is, you know, relatively easy to deal with using standard NLP techniques. But then you also need to think about errors such as missing components or incorrect components. Like, for example, someone can type 23 Park Circle instead of 23 Park Court.
So this is the sort of problem that you end up solving in the matching phase. You give it a confidence score saying, okay, we could not find 23 Park Circle, but we can find 23 Park Court in the city that you specified, in the ZIP code that was associated with it. So this is the type of problem we have to deal with and that the service deals with. And then we have to build it in such a way that it does the same thing whether you use it in bulk or whether you plug into it on the fly. So now we had to take that into account. Okay, we wanna build common libraries to do this, so we can use those either within BigQuery, for example, which is what we use to store our data in the back end.
And we also want to be able to use it as a sort of a quick function for the UI. So, yes, we had to do shared libraries, shared libraries that could be used within completely different systems. So that was definitely challenging and interesting. We didn't start with that; it evolved over time.
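Below is a deliberately simplified sketch of the string-cleaning and standardization steps just described; the abbreviation table and rules are illustrative, not the production model.

```python
import re

# Tiny illustrative abbreviation table; the real service handles far more cases.
ABBREVIATIONS = {"ST": "STREET", "AVE": "AVENUE", "W": "WEST", "E": "EAST", "NYC": "NEW YORK"}

def clean(raw: str) -> str:
    """First class of problems: strip stray characters, collapse whitespace."""
    return re.sub(r"[^A-Z0-9 ]", " ", raw.upper()).strip()

def standardize(raw: str) -> str:
    """Second class: tokenize and expand abbreviated components."""
    tokens = clean(raw).split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)

print(standardize("20 w. 34th st, nyc"))
# 20 WEST 34TH STREET NEW YORK
```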
[00:23:05] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours or days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they'll send you a cool water flask. As you decided to go down this path of saying, I'm going to build an automated entity resolution and address normalization component to our data pipeline, what were some of the design considerations? Did you have a specific SLA of it needs to be accurate this percentage of the time? It needs to have this, you know, amount of availability? Or was it just a matter of let's just build something out, see what we're able to do, and kind of take it from there?
[00:24:30] Unknown:
So you definitely want a certain level of accuracy. Usually, I think most people are very used to Google as sort of like an address processing solution, and Google is very good. Right? They have everything built in. They have a very strong NLP model that analyzes the address and removes the extra components of it. And at the end of the day, or at the end of the query, you get mapped to a specific lat long with a pinpoint on a map that says this is your address. This is your location. You need to be at least as good as Google.
But we also wanted to give you the extra perspective, because Google deals with address points. It's not going to tell you that these 2 different addresses are in the same building, or it's not going to tell you that this is the tax lot that the building is sitting on. Right? So we needed to also take this component into account when we built the service. So I guess our SLA was be at least as good as Google, and be able to solve for a typical dataset. We wanna be upwards of 80%. It really depends on how messy the dataset is. Right? We've seen datasets where some of the addresses you just can't use within the service.
You have an address like lot 45, no location provided. Or, yeah, or things like NA NA, New Jersey. Right? So, obviously, you're not gonna be able to do anything about that unless there's some sort of lat long that you could use. But in most cases, we actually take pride in the level of accuracy that we can reach. With the decision to build this in house, it wasn't something that we started with off the bat. We actually evaluated a bunch of external service providers that give you the same sort of solution. And we ended up realizing that, okay, so this service provider is better on this side of things, but they're really not that good in New York. And then this service provider has really good parsing, but they're not very good at entity resolution. So you can't really find a single source that would give us everything that we're looking for. So that's why we decided, okay, we're just gonna bite the bullet, build it ourselves, and see how far we can take it. For being able to
[00:26:50] Unknown:
standardize these address records, what are sort of the broad pieces of the system that you're working with and sort of the technologies that you've chosen to be able to build this system and just a sort of high level view of how an address traverses these different stages of pipeline and transformation and then being able to be used in other downstream pipelines?
[00:27:15] Unknown:
We use Airflow as a scheduler for our pipelines. And then within Airflow, every task that we do, for example, if you do an extract task, load task, transform task, or address service task, it all runs within its own Kubernetes container. So when it comes to the address service, like, it doesn't have to be something that happens in BigQuery. We can build it as a completely different thing. We can build it as a Python script that runs within a pod. Or we can build it as JavaScript, or we can build it as anything else, like Spark for NLP. And we can actually do a cascade of these things as long as we have the boundaries very well defined.
Boundaries as in we know, okay, this is the expected input, whether it comes in from a BigQuery table or whether it comes in from something else. We know that we can take this in. We activate the entity resolution engine on it. And at the end, we expect a BigQuery table to be written out. So this is what happens within the pipeline. The API can actually connect directly to a pod that's running the standardization service constantly. And then it just sends, like, a request to that pod, and the response is a standardized address string together with lat long and object ID.
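To make the orchestration concrete, here is a hedged sketch of an Airflow DAG where each step runs in its own Kubernetes pod. The image names, task IDs, and the operator import path (which varies by provider version) are assumptions, not Cherre's actual configuration.

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG("ingest_with_address_service", start_date=datetime(2021, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:

    extract = KubernetesPodOperator(
        task_id="extract", name="extract", namespace="pipelines",
        image="example.io/extract:latest")            # hypothetical image

    address_service = KubernetesPodOperator(
        task_id="address_service", name="address-service", namespace="pipelines",
        image="example.io/address-service:latest")    # runs the resolution engine on the batch

    load = KubernetesPodOperator(
        task_id="load", name="load", namespace="pipelines",
        image="example.io/load:latest")               # writes the output BigQuery table

    extract >> address_service >> load
```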
[00:28:33] Unknown:
Because of the relative complexity of this sort of system and the stage of your pipelines, I'm curious how that has impacted your overall design and structure of the broader system of DAGs that you're building, and sort of the dependencies for this stage of the pipeline, sort of how you manage triggering or alerting on any failures in being able to source the data, and then downstream being able to monitor for and manage the expectations of the downstream consumers of that stage of the pipeline for being able to then use that cleaned address data in other systems, and then how you manage records that aren't able to be properly normalized because they're just missing too much information?
[00:29:24] Unknown:
I think the key point here, I would say, is about having a clear understanding of the data flow and having very, very well defined service boundaries and module boundaries. Defining exactly what sort of input goes in and what sort of output comes out at every single stage was crucial to being able to set it up in such a way that we can use it everywhere. Basically, the system is built in such a way that it's a microservices approach. Every single part can be taken out, replaced, changed without any change to the overall behavior of the system. So, obviously, you know, if you have a successful match, what you will end up with is a standardized address plus the latitude and the longitude with an accuracy score and an object ID that goes with your address.
Now if you have an address that is not well formatted, or that for some reason we're unable to match to our canonical source of truth or unable to geocode to specific lat longs, what we will give you is basically the input back. Like, we will do our best to standardize it if we can actually parse it into the individual components. For example, if you input an address like 34 w 34 st, we might standardize it to 34 West 34th Street. But if we can't geocode it, it will come back with a geo accuracy code that says not available, basically, or the best guess that it could be. You know, okay, so we're not able to find the exact building, but here's the street level coordinates. So if someone is using it in our UI, which has a map on it, at least the map is gonna zoom in to the street level of where you would expect to find the property.
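The dictionaries below sketch the kinds of response shapes being described, full match, street-level fallback, and echo-the-input; the field names, accuracy labels, and values are assumptions, not the actual Cherre API contract.

```python
full_match = {
    "standardized_address": "34 WEST 34TH STREET, NEW YORK, NY",
    "latitude": 40.7494, "longitude": -73.9859,
    "object_id": "BLDG-12345",
    "geo_accuracy": "rooftop",
}

street_level_fallback = {
    "standardized_address": "34 WEST 34TH STREET, NEW YORK, NY",
    "latitude": 40.7496, "longitude": -73.9850,   # street centroid, not the building
    "object_id": None,
    "geo_accuracy": "street",
}

unresolvable = {
    "standardized_address": "NA NA, NEW JERSEY",  # best effort: echo the input back
    "latitude": None, "longitude": None,
    "object_id": None,
    "geo_accuracy": "not_available",
}
```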
Of course, if the address is something like, you know, NA NA, New Jersey, there's not much we can do about that. We'll send you to New Jersey, and good luck. But, yeah, it's a part of it. The idea is we try not to give you a completely empty response. Like, at least you will get what you put into the system. If we really can't find anything for you, at least you would get your input string. And if there's anything else in the system that we can connect on that,
[00:31:48] Unknown:
it will connect to it. Now that you have rolled out the system as part of your data pipelines, what are some of the other sources of information and other types of problematic data that you work with that you would like to see given a similar treatment and turned into a service to be able to automatically clean it up for you? Yeah. I think the natural candidate for that is going to be
[00:32:13] Unknown:
named entity resolution, specifically the Michael Jordans that we have in our system. So we wanna take an entity approach there as well. For example, we want to be able to resolve companies and the subsidiaries of the company and the subsidiaries of the subsidiaries and so on. So we wanna be able to standardize company and people names, which is a completely different ballgame, really, but you can still take the same service oriented approach. And this is, again, something that we wanna be able to do in bulk in the pipeline, and we also want to be able to do for clients on the fly. Because clients do wanna go into the UI, be able to type in, for example, a company name and see if we can resolve the portfolio for this company, just as an example.
[00:33:00] Unknown:
Going back to 1 of the things that you said very early on when you introduced yourself as a data scientist, I think it's interesting that you're working on this particular class of problem, because in some ways, it's a data engineering problem, but it also has a very data science heavy element to being able to solve this component of the data engineering pipeline. And I'm wondering if you can just talk to sort of the team structures and team dynamics that you have at Cherre and how you view the dividing line between data science and data engineering, and sort of how that plays out within your team, and how you've seen it play out in other organizations.
[00:33:38] Unknown:
There's this kind of a wall between data engineering and data science. And often, you know, the ball gets tossed back and forth over this wall. Data science comes with a solution for something, which, you know, will be done in a Jupyter notebook. And then data engineering, you know, gets that notebook and says, okay. Now how do I put this into a pipeline? Right? Do I wanna do this? Is it going to scale? Probably not gonna scale like that. We need to find a different solution. So that's really a common thread that I've seen in many places.
At Cherre, we try to take a cross functional approach to the teams. So for example, my team, at this point, we have 2 data scientists, 2 data engineers. We have a product manager, and we have a senior data engineer. It's been a very powerful approach to the solution. It really, really helps streamline all these combined projects where data science and data engineering need to be integrated, especially when it comes to services. So my team, I guess you can call it, focuses on the services for the company. We have the address service that we're working on, which is combined with a geocoder service, which is what we use to serve the lat long. We have the name service, which is the next project. And then we have the data scientists that are working on the knowledge graph that is going to use these 2 services in order to get built.
[00:35:04] Unknown:
Because of the fact that you have these cross functional teams, that has allowed you to produce some interesting types of systems that I haven't seen a lot of other teams build towards, where because of the fact that you have data scientists and data engineers working on the same project, you're able to bring this kind of machine learning style component to the pipeline and build it as a service into the overall data flow, whereas most of the time, the pipeline is generally treated as kind of a, not really a sort of dumb system, but, you know, very much reliant on just being very mechanical, and then the machine learning is something that happens at the end. And I think it's interesting to see a bit more of an evolution of machine learning being brought earlier into the life cycle of working with the data sources and being able to use that as a means of providing more high value data assets to, you know, analysts and end users and downstream data science teams.
[00:36:04] Unknown:
Exactly. Yeah. I mean, when Cherre started, it was focused on just New York City data. And the initial solution for addresses was exactly that, something that existed only within the pipeline. It solved a very, very specific problem and did not take into account a lot of the current capabilities that we have. And then we took a more machine learning engineering approach to it. Right? This sort of thing tends to happen at the end of the pipeline, but you can't do that, because addresses are coming in at the source, so you can't just standardize them at the end. We have tried standardizing them later in the process, and it just doesn't work. It just leads to complications.
You really wanna be able to handle them as a part of your standard transform before you bring in the business logic.
[00:36:53] Unknown:
I think too it's interesting that you implemented at least parts of this as user defined functions within BigQuery. And I know that there's been a bit of a trend of moving more machine learning into the database itself rather than having it be something that has to sit on top of it and pull data out and push data back in. And I'm wondering what your thoughts are on some of that or any of the other sort of interesting trends that you're seeing in how teams and technologies are able to facilitate more advanced data workflows?
[00:37:22] Unknown:
Yeah. The cool thing about being able to run machine learning models within BigQuery is scalability. When it comes to dealing with big data, BigQuery is really the natural choice for that, at least for us. Being able to define more powerful user defined functions is key for that. You know, think about trying to write a machine learning model with SQL. It's horrible. Nobody wants to do that.
[00:37:51] Unknown:
Right. Right? So
[00:37:53] Unknown:
being able to use user defined functions to use more natural tools to do this is really key. 2 things that I really wish BigQuery supported within their user defined functions are, a, the ability to make API calls to external services. That could be very powerful, you know, if you think about it. And then the second thing is, currently, the user defined functions don't actually support Python scripts,
[00:38:20] Unknown:
which is what most data scientists are gonna use, you know, right off the bat. Digging a bit more into the fact that you are using user defined functions as a portion of this service, I'm curious how you manage things like testing and versioning and updating the code that lives in the database and just sort of what your deployment pipeline looks like for being able to iterate on this service and be able to grow and evolve it? Man, there's a battery of tests.
[00:38:48] Unknown:
We have unit tests in place for the service. We also have dbt tests. We use dbt when we write the SQL models. So we have dbt tests on the data. Basically, every task that we have in the pipeline, there's either a test task afterwards or the tests are built in. So we're talking about unit tests, integration tests, data tests, end to end tests. We also use Datadog to do monitoring on GraphQL, just 1 of the interfaces that the clients are using. So we monitor GraphQL and the website.
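As an illustration of the unit test layer, here is a small pytest sketch; standardize_address and the address_service module are hypothetical stand-ins for the service's actual code, and the expected strings are invented for the example.

```python
import pytest
from address_service import standardize_address  # hypothetical module

@pytest.mark.parametrize("raw, expected", [
    ("20 w 34 st, nyc",        "20 WEST 34TH STREET, NEW YORK, NY"),
    ("20 West 34th Street NY", "20 WEST 34TH STREET, NEW YORK, NY"),
])
def test_alternative_spellings_standardize_identically(raw, expected):
    assert standardize_address(raw) == expected

def test_unparseable_input_is_echoed_back():
    # The service should degrade gracefully rather than return an empty response.
    assert standardize_address("NA NA, NEW JERSEY") == "NA NA, NEW JERSEY"
```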
[00:39:27] Unknown:
As you have built out the address resolution capabilities and built out the services for being able to manage it, what are some of the most interesting or unexpected or challenging lessons that you learned in that process, and what are some of the most interesting or appalling edge cases that you've run into as far as how the data is formatted?
[00:39:47] Unknown:
Oh, boy. Yeah. I guess you can't build a system without learning how to better build it the next time you approach the problem. There's been a lot of that. Of course, there's a lot of edge cases with this sort of thing. We need to deal with a lot of geospatial data. So we deal with footprints datasets. We deal with address points datasets. I think 1 example was when we decided to bring in open data for the top 20 markets in the US. So a lot of cities have really good open data. For example, New York City has the best open data that I've seen from building permits to transportation, to courts.
Anything you want for New York City is really available. And Nashville was 1 of these cities that we wanted to bring in. So we took in the address points data for Nashville, or the building dataset for Nashville, where we have the addresses and we have the coordinates for a building, and we took it into our address point dataset. We extracted it, loaded it, transformed it. And then when we connected it to our other datasets, we found out that the entire city of Nashville was resolved to a single tax lot in Tennessee, which seems very counterintuitive.
So we plotted the data on a map, and we found out that the entire city was actually mapped to an area which was about 1 square foot in size. So, yeah, 1 foot by 1 foot. It's like a nano Nashville kind of thing. Right. And the reason is because they were using this, like, unique coordinate system, and there was some weird numerical error when we transitioned it to our coordinate system. It mapped to the right place, but just several orders of magnitude smaller.
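As an illustration of the kind of coordinate system handling involved, the snippet below reprojects a point with pyproj. The EPSG codes and coordinates are assumptions (Tennessee State Plane in US survey feet to WGS84); declaring the wrong source CRS, or treating feet as if they were another unit, is the kind of mistake that can collapse a whole city into a tiny footprint.

```python
from pyproj import Transformer

# Declare the coordinate system the source data was actually published in,
# then transform into plain latitude/longitude.
to_wgs84 = Transformer.from_crs("EPSG:2274", "EPSG:4326", always_xy=True)
lon, lat = to_wgs84.transform(1_740_000, 660_000)   # made-up Nashville-area easting/northing
print(lat, lon)
```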
[00:41:50] Unknown:
That's funny.
[00:41:51] Unknown:
Yeah. So this was 1 of the edge cases.
[00:41:54] Unknown:
And now that you have put the system into production and you've sort of learned the lessons of going through the trial and error and making it production ready, as you look back, what are some of the changes that you would make if you were to start over today, or sort of improvements that you would make over the existing system if you were given the time and resources?
[00:42:15] Unknown:
Yeah. So like I said, you can't build a system without learning how to do it better next time. There's a lot of improvements that could be done. And I think the 1 aspect that I'm really happy about is at least the way it is built right now. So the clearly defined service and microservice boundaries, the module integration, and the client interface. These clear definitions actually enable us to make significant changes without really breaking anything or, hopefully, without breaking anything which is in production or client facing. So we can do all of our testing separate from what is exposed to the clients.
We can completely swap out components. We can change the language that we use for the service. We can replace the entire service. But as far as the client is concerned, they would only notice incremental improvements as we work on the system. So I think that's the most important thing when you start working on a service. Try to have a very clear definition of how you interact with it and how the client is interacting with it. So if you have these clear input and output expectations, anything in the middle can be a black box, which means you can do anything you want to it, and you would still get the same output out of it. So are there any other aspects of the work that you're doing at Cherre,
[00:43:44] Unknown:
either in terms of the address resolution and the system that you've built to turn that into a service for your data pipelines, or the overall challenges of messiness in real estate data, or some of the overall trends in being able to move machine learning earlier in the life cycle of the data systems, that we didn't discuss yet that you'd like to cover before we close out the show? Having the ability to do,
[00:44:10] Unknown:
machine learning processing on addresses in this case, but also building the same thing for names, earlier during the pipeline proved to be crucial, at least for us. And then, I think we talked a little bit about the knowledge graphs. These are some very, very cool processing tools, but the real insight comes once you can actually run AI and deep learning algorithms on the knowledge graph itself.
[00:44:39] Unknown:
This is where you start gaining insights and seeing structures that you haven't seen in the data before. Yeah. Knowledge graphs in general are sort of an interesting construct and 1 that I've touched on in a few different episodes in a few different areas. And I definitely look forward to seeing them be used in more contexts because I think that that's going to be able to provide a lot more power to people who are just doing very basic analytics right now. And as knowledge graphs become more approachable and more manageable, I'm interested to see sort of what kinds of insights that helps us surface.
[00:45:12] Unknown:
Exactly. And in our case, the knowledge graph, I think John probably mentioned it, has several billion nodes in it. Yep. So whatever algorithms we want to use, we need to make sure that they scale on the order of the number of nodes. Right. At worst.
[00:45:35] Unknown:
Well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:45:50] Unknown:
Yep. So I think I mentioned it before. We use BigQuery in the back end, and we found it to be an extremely powerful tool. And then we also use a lot of user defined functions. And there are 2 things that I would really love to see support for coming from the Google Cloud Platform for BigQuery. The user defined functions currently don't really support sending external calls to APIs, which I think could be extremely powerful if we could do that. And then the other thing is they also don't support scripting in Python, which means, you know, Python is really the natural language for a lot of data scientists.
So just being able to take in, like, import a Python module and just run with it in there could be very, very, very powerful and very natural.
[00:46:42] Unknown:
Yeah. Definitely, it would be an interesting evolution to the database market. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing at Cherre for being able to push machine learning earlier in the pipeline life cycle and some of the capabilities that that's allowed you to unlock. It's definitely a very interesting project, and I appreciate you taking the time to share it with us and sort of explain the benefits that it's been able to provide. So I appreciate your time and effort, and I hope you enjoy the rest of your day. Yeah. Alright. Thanks for hosting. This was great. Thank you for listening.
Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Tal Galfsky: Introduction and Background
Cherre's Approach to Real Estate Data Management
Challenges in Real Estate Data: Address Normalization and Entity Resolution
Importance of Address Resolution in Real Estate Data
Building an Address Resolution Service
Technologies and Pipeline Integration
Future Projects: Named Entity Resolution
Cross-functional Team Dynamics at Cherre
Machine Learning in Data Pipelines
Lessons Learned and Edge Cases
Improvements and Future Directions
Final Thoughts and Closing Remarks