Summary
Most of the time when you think about a data pipeline or ETL job, what comes to mind is a purely mechanistic progression of functions that move data from point A to point B. Sometimes, however, one of those transformations is actually a full-fledged machine learning project in its own right. In this episode Tal Galfsky explains how he and the team at Cherre tackled the problem of messy address data by building a natural language processing and entity resolution system that is served as an API to the rest of their pipelines. He discusses the myriad ways that addresses are incomplete, poorly formed, and just plain wrong, why it was a big enough pain point to invest in building an industrial strength solution for it, and how it actually works under the hood. After listening to this you'll look at your data pipelines in a new light and start to wonder how you can bring more advanced strategies into the cleaning and transformation process.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Your host is Tobias Macey and today I’m interviewing Tal Galfsky about how Cherre is bringing order to the messy problem of physical addresses and entity resolution in their data pipelines.
Interview
- Introduction
- How did you get involved in the area of data management? Started as physicist and evolved into Data Science
- Can you start by giving a brief recap of what Cherre is and the types of data that you deal with? Cherre is a company that connects data. We're not a data vendor, in that we don't sell data, primarily. We help companies connect and make sense of their data. The real estate market is historically closed, gut led, and behind on tech.
- What are the biggest challenges that you deal with in your role when working with real estate data? Lack of a standard domain model in real estate. ONTOLOGY: what is a property? Each data source thinks about properties in a very different way, yielding similar but completely different data. QUALITY: even if the datasets are talking about the same thing, there are different levels of accuracy and freshness. HIERARCHY: when is one source better than another?
- What are the teams and systems that rely on address information? Any company that needs to clean or organize (make sense of) their data needs to identify people, companies, and properties. Our clients use address resolution in multiple ways, via the UI or via an API. Our service is both external and internal, so what I build has to be good enough for the demanding needs of our data science team, robust enough for our engineers, and simple enough that non-expert clients can use it.
- Can you give an example of the problems involved in entity resolution?
Known entity example: the Empire State Building.
To resolve addresses in a way that makes sense for the client you need to capture the real world entities: lots, buildings, units.
- Identify the type of the object (lot, building, unit)
- Tag the object with all the relevant addresses
- Relations to other objects (lot, building, unit)
- What are some examples of the kinds of edge cases or messiness that you encounter in addresses? The first class is string problems. The second class is component problems. The third class is geocoding.
- I understand that you have developed a service for normalizing addresses and performing entity resolution to provide canonical references for downstream analyses. Can you give an overview of what is involved?
What is the need for the service? The main requirement here is connecting an address to a lot, building, or unit with latitude and longitude coordinates.
- How were you satisfying this requirement previously? Before we built our model and dedicated service we had a basic, pipeline-only prototype that handled NYC addresses.
- What were the motivations for designing and implementing this as a service? Need to expand nationwide and to deal with client queries in real time.
- What are some of the other data sources that you rely on to be able to perform this normalization and resolution? Lot data, building data, unit data, Footprints and address points datasets.
- What challenges do you face in managing these other sources of information? Accuracy, hierarchy, standardization, a unified solution, persistent IDs and primary keys.
- Digging into the specifics of your solution, can you talk through the full lifecycle of a request to resolve an address and the various manipulations that are performed on it? String cleaning, parse and tokenize, standardize, match.
- What are some of the other pieces of information in your system that you would like to see addressed in a similar fashion? Our named entity solution with connection to knowledge graph and owner unmasking.
- What are some of the most interesting, unexpected, or challenging lessons that you learned while building this address resolution system? Scaling the NYC geocode example: the NYC model was exploding a subset of the options for messing up an address. Flexibility. Dependencies. Client exposure.
- Now that you have this system running in production, if you were to start over today what would you do differently? A lot, but at this point the module boundaries and client interface are defined in such a way that we are able to make changes or completely replace any given part without breaking anything client facing.
- What are some of the other projects that you are excited to work on going forward? Named entity resolution and Knowledge Graph
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today? BigQuery is a huge asset, in particular UDFs, but they don't support API calls or Python scripts.
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Cherre
- Photonics
- Knowledge Graph
- Entity Resolution
- BigQuery
- NLP == Natural Language Processing
- dbt
- Airflow
- Datadog
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
RudderStack is the smart customer data pipeline. Easily build pipelines connecting your whole customer data stack, then make them smarter by ingesting and activating enriched data from your warehouse, enabling identity stitching and advanced use cases like lead scoring and in-app personalization. Start building a smarter customer data pipeline today. Sign up for free at dataengineeringpodcast.com/rudder. Your host is Tobias Macey. And today, I'm interviewing Tal Galfsky about how Cherre is bringing order to the messy problem of physical addresses and entity resolution in their data pipelines. So, Tal, can you start by introducing yourself? Hi, Tobias.
[00:01:34] Unknown:
I'm Tal. I'm a data scientist at Cherre. Do you remember how you first got involved in the area of data management? So I took a bit of an interesting route in my career. 1st and foremost, I'm a physicist, and my academic track originally was aimed for working in optics and photonics. So when I finished my PhD, I started working as a photonics designer for an optical communications company. And then 1 day, 1 of our former postdocs invited me to check out Cherre's office. He started working there. He says, cool people. Come check it out. And it was really kind of an immediate hit. So Cherre really met all the criteria that I have when it comes to looking at prospective projects.
Specifically, they were working on a really challenging problem, redefining the domain model for real estate. They were working on a high impact issue, and it was work with smart people. Really smart people. Right? So the type of problems Cherre needs to deal with is very challenging. And then what we're doing, really, if you think about the real estate industry state as it is right now, the parallel would be kind of what speed trading or AI did for the stock markets. Like, really changing the way the real estate market is looking at tech. So all that looked good. I joined Cherre, and I learned about the domain of real estate from Ben Hizak, who's 1 of the founders.
And I learned about functional and object oriented programming and test driven design and microservices from Madison Sterling, who's a senior engineer. I learned about NLP and knowledge graphs from the awesome John Maiden, our head of machine learning engineering, who you've interviewed before, as well as from Ron Bekkerman, who's our CTO and a data science professor. So it's been really great working with all these people. It's been an awesome learning experience.
[00:03:36] Unknown:
It's been a fun transition. As you mentioned, you're working at Cherre, and, you know, I had John on before to talk a bit about some of the specifics of building the knowledge graph that you use. But can you give a bit of a recap of what it is that you're doing at Cherre and some of the types of data that you're dealing with and how you're kind of addressing the problem of messy data within the real estate market? Cherre itself is a proptech company.
[00:04:01] Unknown:
And what we do is we give our clients the tools to take a data driven approach so they can make better investment, management, and underwriting decisions. As I said, the real estate market currently is not very tech savvy. It's a little bit behind the technology. There's still a very strong human component when it comes to decision making. You can kind of think about it as a human pipeline. You wanna make a decision about a property, you need to aggregate data from all these different sources within your organization. Primarily, Cherre doesn't sell data.
What we do sell is connectivity and insights. First of all, we allow automation and connectivity within your organization. So you can get the data you need quickly in an organized manner and, like, bring it up when you need it. And then our clients, they do bring us their data, but they also bring the questions that they're trying to solve. And we basically tell them, okay, so let's take your questions. Let's see if we can design a domain model to help answer these questions. So we help our clients make sense of their data and better connect it so they can make better business decisions.
[00:05:09] Unknown:
In terms of the specific topic that we're discussing here, I know that 1 of the sort of core attributes of any piece of property is the physical location, which is generally denoted by the address, at least within the United States. There might be different systems in different countries. And I'm wondering what are some of the biggest challenges that you're dealing with in your role when working with real estate data and how that leads into some of the work that we're gonna be discussing today about address normalization and entity resolution? Well, I guess the biggest challenge in my role is very tied up with the biggest challenge that we have as a company.
[00:05:46] Unknown:
And that is, basically, there is no standard model for data in real estate. And I'll give you an example of what I mean by that. So let me ask you a question. What's your favorite programming language?
[00:05:58] Unknown:
I primarily use Python.
[00:05:59] Unknown:
Great. So how do you define a variable?
[00:06:03] Unknown:
In Python, you just set an arbitrary name that doesn't start with a number, add an equal sign, and then assign a value to that variable to act as a placeholder. It can be a none. It can be the concrete value that you want. It can be of any variety of types.
[00:06:18] Unknown:
Yep. That's right. So we can say it's a container to store data values. Right. Right. We assign x equals 5 or x equals a string. Yep. And you have variables of different types. So this is actually a very well defined data model. We know the entities that are involved when we work in Python. Right? We have methods. We have classes. We have variables. Now let's ask you a different question. How would you define a real estate property?
[00:06:45] Unknown:
Yeah. That's definitely a much broader question. You know, it could be a, you know, building that sits at this address. It could be there's an empty lot that sits at this sort of geographical boundary that has some bounding box using a combination of lat and long coordinates. It can be, you know, multiple locations that are all owned by a single entity that are being purchased as a unit.
[00:07:08] Unknown:
Exactly. Right. So it turns out that you will get a different answer to this question depending on who you ask. So if you ask a tax assessor, they would say, yes, I have a tax lot. I don't care what's on it. It's a property. If you ask someone who does vacation rentals like Airbnb, they can say, oh, anything can be a property. It can be an igloo in Alaska. That's a property that I'm renting, or a tent in someone's backyard. So really, anyone who's dealing with real estate has a different definition of the concept. On top of that, real estate is something that is, by default, tied to geography. So it's geographically spread.
So every single place you would go to is now gonna have different rules and regulations about real estate. So, yeah, this is the type of question that we wanna answer. And the kind of companies we work with, like a typical real estate company, can have $5,000,000,000 in assets under management. Right? And if you ask them the question of how many properties they own today, they might not be able to answer this question with confidence. Right? And it's not just because it's hard for them to define what is a property. It turns out that this is a fairly complex question to answer without a very good entity resolution model.
I guess the biggest challenge is the ontology. We had to come up with an entity resolution model for not exactly what is an address, but what is a property. So we usually talk about lots. We talk about buildings. So you have a lot. The building sits on a lot. Within a building, you can talk about different units. So we try to tie everything into lots, buildings, and units. I guess the second challenge that we have now is quality. Right? Because we have all these different datasets and different data sources. And when I get a dataset that I need to handle, I start asking the questions: alright, how is this data collected? When was this data collected? How much can I trust this source to be accurate? And then, how does it connect to all the other sources that I have? Which brings me to the 3rd challenge, which is making decisions that are related to hierarchy of sources.
Right? Not all data is created equal. Some datasets are better than others when it comes to certain aspects. And then when we take into account the quality of the source and the exact type of entities they refer to, we start making decisions about hierarchy. Yep. It's a bit of a long winded way of answering the question.
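To make the hierarchy idea concrete, here is a minimal sketch of the kind of field-by-field survivorship decision being described. The source names, rankings, and fields are hypothetical, not Cherre's actual hierarchy.

```python
# Hypothetical ranked source hierarchy: lower number wins when sources disagree.
SOURCE_RANK = {"tax_assessor": 0, "county_deeds": 1, "vendor_feed": 2}

def coalesce_property(records: list[dict]) -> dict:
    """Merge records describing the same property, preferring
    higher-ranked sources field by field."""
    ordered = sorted(records, key=lambda r: SOURCE_RANK.get(r["source"], 99))
    merged: dict = {}
    for rec in ordered:
        for field, value in rec.items():
            if field == "source" or value in (None, ""):
                continue
            merged.setdefault(field, value)  # keep the first (best-ranked) non-empty value
    return merged

records = [
    {"source": "vendor_feed", "address": "20 W 34 ST", "year_built": 1931},
    {"source": "tax_assessor", "address": "20 WEST 34TH STREET", "year_built": None},
]
print(coalesce_property(records))
# {'address': '20 WEST 34TH STREET', 'year_built': 1931}
```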
[00:09:35] Unknown:
That's good. It's definitely helpful too, because every domain has its own aspects of how you look at the data, you know, its own way that the messiness kind of manifests. And real estate, I think, as you so well demonstrated, has more than its fair share of messiness. Exactly. And so because of the fact that there are so many different ways of thinking about what is a property and, you know, how do you represent it, what were some of the most common and biggest pain points that you were running up against, and what led you down the path of deciding that address resolution was worth investing, you know, a full sort of engineering effort in solving, I'm not gonna say once and for all, but at least to some level of satisfaction?
[00:10:16] Unknown:
I guess the 1 common thread that goes through all of these datasets at the end of the day is addresses. People talk about the properties in terms of the address. It's a very common thing. You hope to have lat long in some of these, like, the coordinates, but eventually, you gotta solve the problem of addresses. If you're going to connect all these different datasets that are supposed to be talking about the same thing, you gotta be able to take all these different addresses. You need to standardize them to a single form, and you need to connect them to the individual entities or to the real world entities that they're talking about.
[00:10:53] Unknown:
And so within Cherre's systems and the pipelines that you're building, can you give some examples of the types of teams that are relying on the address information and how they're hoping to be able to consume it, and some of the systems that are responsible for being able to source that information and deliver it to the teams, the other sort of programmatic systems that are relying on that information for being able to produce derived data products or be able to create decisions for your customers?
[00:11:23] Unknown:
Yeah. We can start, I guess, internally and work towards clients. So as I said, the nature of the data is such that addresses exist in almost every dataset that we take in. So the address resolution, the address service needs to be integrated directly into the pipeline. Every time we take in a dataset, the engineering team needs to be able to send the address information through the service and get a standardized address at the end. On top of that, we also have the data science team that needs to use the address resolution engine to power the knowledge graph so we can connect all the different entities within the knowledge graph.
So it needs to be good enough that they can do that. And then, of course, we also need to serve addresses to clients, standardized addresses to clients. And we need to be able to take in addresses from clients in real time, so the front end team also relies on this service. And we also wanted to give you the same result whether you have this address from a dataset and you processed it in bulk and sent it through the pipeline, or it came in through the API or the UI.
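What follows is a small, hypothetical sketch of that "one shared library, two entry points" idea: a single standardization function (a stand-in, not Cherre's actual logic) called from both the bulk pipeline path and the real-time API path, so the same input yields the same output either way.

```python
def standardize_address(raw: str) -> str:
    # Stand-in for the real standardization logic.
    return " ".join(raw.upper().replace(".", "").split())

def bulk_entry_point(rows: list[dict]) -> list[dict]:
    """Batch path: called from the pipeline over a whole dataset."""
    return [{**row, "std_address": standardize_address(row["address"])} for row in rows]

def api_entry_point(payload: dict) -> dict:
    """Real-time path: called per request from the API or UI."""
    return {"std_address": standardize_address(payload["address"])}

# Both paths route through the same function, so a given input is
# standardized identically in bulk and in real time.
assert (
    bulk_entry_point([{"address": "20 w. 34th st"}])[0]["std_address"]
    == api_entry_point({"address": "20 w. 34th st"})["std_address"]
)
```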
[00:12:29] Unknown:
In terms of being able to actually use the addresses programmatically, what were some of the ways that you were trying to parse the information and be able to translate that within the projects that you were working on? And what were some of the biggest difficulties that were posed by the fact that there weren't consistent ways of representing the data, or the same address might show up in, you know, different structures or different forms based on the source that you were pulling it from? That's a good question. I'm probably gonna answer it in several parts.
[00:13:03] Unknown:
We can start with, like, a related example. When you talked to John about knowledge graphs, I think you probably talked about entity resolution a little bit for property owners' names. So let's say well, let's take a known entity as an example. Right? Let's say Michael Jordan. And let's say that I have 3 unrelated datasets from different sources. 2 of them talk about commercial property owners, and the third 1 has people's contact info. The 1st dataset has owner name Michael Jordan. The 2nd dataset, the owner name is Michael J Jordan. And the 3rd dataset, the 1 for contact info, has the contact person as MJ.
Just MJ. Right? So how do we know that all of these 3 data points refer to the same individual? Right? How would you approach this problem?
[00:13:56] Unknown:
1 way is just kind of making a best guess of splitting on the first and last name and then trying to match up 2 initials, but that can obviously be, you know, a very lossy resolution because multiple people can have the same initials and very different names. And then also, if you add in the middle name, then that also confounds the basic logic. And I know that there's a whole area of research about how to actually properly do entity resolution based on different variables. So
[00:14:25] Unknown:
Exactly. There's a lot of ambiguity. Specifically, for Michael Jordan, I looked it up in the tax assessor data, and Michael Jordan shows up 760 times. And it's not just because he owns a lot of stuff, I'm sure he does, but I'm also sure there's a lot of people called Michael Jordan. Right? So if you just see the name, you can't solve this problem from just the name itself. You need to be able to solve it from context. So for example, if the first 2 datasets have property information, you can say with confidence, okay, these 2 properties are actually the same property. Then it's very likely that Michael Jordan and Michael J Jordan are the same person, since he's listed as the owner of that property. And the 3rd dataset, for example, if you have a mailing address and the mailing address is the same as the 1 you see for the owner of the properties, you can say, oh, okay. So in that case, Michael Jordan, Michael J Jordan, and MJ are all names for the same entity. They're alternative names.
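As a toy illustration of resolving names through shared context, the sketch below groups owner names by a matched property identifier (the field name is hypothetical), which is the kind of signal that lets Michael Jordan, Michael J Jordan, and MJ collapse into one entity.

```python
from collections import defaultdict

# Three records from unrelated sources; property_id is a hypothetical key that
# was obtained by matching the property (or mailing address) across sources.
records = [
    {"dataset": 1, "owner": "Michael Jordan",   "property_id": "LOT-123"},
    {"dataset": 2, "owner": "Michael J Jordan", "property_id": "LOT-123"},
    {"dataset": 3, "owner": "MJ",               "property_id": "LOT-123"},
]

aliases = defaultdict(set)
for rec in records:
    # The shared context, not the name string itself, is what links the records.
    aliases[rec["property_id"]].add(rec["owner"])

print(aliases["LOT-123"])
# {'Michael Jordan', 'Michael J Jordan', 'MJ'} -> alternative names for one entity
```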
Pretty much the same exact problem applies for addresses, except the names for addresses, or the alternative addresses, are not even going to be that similar to each other. So you can have 2 datasets. 1 of them, for example, would have 20 West 34th Street, New York City, and the other 1 is going to have Empire State Building as an address. And then you need to ask yourself, okay, so can I connect these 2 together? Should I connect these 2 together? And I think as I mentioned before, in the Cherre domain model, we talk about 3 types of entities that have addresses. So we talk about tax lots, 1st and foremost.
So first of all, there's the land that the building is on. If you take the Empire State Building, there's the land that the Empire State Building is on, which is the tax lot. And then when it comes to buying and selling properties, this is really what is being sold. You can't buy a building, but you can buy land with a building on it. The second entity that we talk about is the building itself, which you can think about as a property or a feature of the tax lot. So the tax lot is a container for the building. Then the 3rd entity is units within the building.
16th floor, unit 14, for example. Right? In this case, the building is a container for the unit, just like a variable is a container for data values in Python. So going back to the question of the 2 datasets and should I connect them? First of all, I need to look at the type of objects that these datasets are talking about. Are both of them talking about lots or buildings, just so I know how to connect them? Is it going to be a lot to lot connection, or is it going to be a lot to building connection, kinda like parent to child sort of thing? So once we did that, we wanna tag the object in the dataset with all the relevant addresses, so we know that the Empire State Building resides on 20 West 34th Street.
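A minimal sketch of that lot, building, unit containment, using the Empire State Building as the example; the IDs and field names are made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Lot:
    lot_id: str
    addresses: list[str] = field(default_factory=list)

@dataclass
class Building:
    building_id: str
    lot_id: str                      # the lot is the container for the building
    addresses: list[str] = field(default_factory=list)

@dataclass
class Unit:
    unit_id: str
    building_id: str                 # the building is the container for the unit

lot = Lot("LOT-0001", ["20 West 34th Street, New York, NY"])
esb = Building("BLDG-0001", lot.lot_id,
               ["Empire State Building", "20 West 34th Street, New York, NY"])
unit = Unit("UNIT-16-14", esb.building_id)   # e.g. 16th floor, unit 14
```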
And then we wanna find the relations to all the other objects. So if the 2 datasets talk about buildings, this is the type of connection that we're gonna have. If 1 of them talks about a lot and the other 1 talks about a building, we're gonna make it that kind of connection. So parent to child kind of thing. In terms
[00:17:37] Unknown:
of how you're working with the address data, I'm imagining that you probably do this kind of at the point where you're trying to resolve the entity within some given project and that maybe you have some shared libraries for being able to do this in a fairly standardized way. So what was the biggest pain point that still existed that put you down the road of deciding this needs to be part of the data pipeline that I can then just consume as a call to an API to be able to understand what is the canonical representation of this address that I was just fed based on all of the other information that I've been able to build and to sort of turning a full kind of data science project into a component of your data engineering pipeline?
[00:18:24] Unknown:
Several big points. I think the first one's to actually internally be able to come up with an entity resolution data set or a canonical source of truth that makes all these connections, and then we can start talking about these objects in terms of object IDs instead of you know, we have 3 addresses and then go find out if they're related to each other without being able to connect them to some sort of object. Like, just connecting the addresses to coordinates is a good start that would resolve at least the problem of alternative addresses in most cases. But since you're talking about different entities, you also wanna be able to answer that question.
So first off, we needed to build out our canonical source of truth. And then the next thing we needed to deal with is, okay, so now we have this very good dataset for address points, like addresses that connect to specific geographical locations and the type of entities that they connect to. And now we need to be able to take any address string and be able to standardize it and connect it to this source of truth. So that was the second big challenge. This is where we had to start using NLP tools. We actually have a hierarchy of strategies
[00:19:42] Unknown:
that we use as we take in an address string, parse it, standardize it, and then eventually match to our source of truth. So can you talk a bit more about the overall project that you did end up building to integrate into your data pipeline and just some of the particular challenges that were posed because of the fact that you're trying to do it in sort of an automated fashion, where you can consume this as part of the data in flight without having to have a human intervene to be able to sort of answer all the different little edge cases? Yeah. I guess to answer that,
[00:20:15] Unknown:
we kind of want to understand what are the types of problems that we see when we need to handle addresses. If you think about what an address is, it's a tag. And it's a text string tag. Right? So it's a text string. It's usually longer than 20 characters. So as you can imagine, it's gonna have all the classic problems that you can find with text. Right? So you could have typos. You can have missing spaces. You can have random bits of text that actually don't belong to the address. For example, I could have a person's name tacked onto the front of 20 West 34th Street, New York City. So this is the first class of problems, string problems. The second class is actually component problems. So an address is something that's actually very well defined. If you take a mailing address, you can say, okay, this is the house number, this is the street name, street suffix, prefix, city, state, ZIP. You can take the address and then take the individual components.
And the thing is there's no standard way that people write these components, so people tend to abbreviate. So you need to deal with abbreviations. You also need to deal with alternative place names. So for example, if you take New York City, it often gets abbreviated to NYC. It's also often an alternative place name for Manhattan. The same thing happens with abbreviations, like Street that becomes St. So that's something that is, you know, relatively easy to deal with using standard NLP techniques. But then you also need to think about errors such as missing components or incorrect components. Like, for example, someone can type 23 Park Circle instead of 23 Park Court.
So this is the sort of problem that you end up solving in the matching phase. You give it a confidence score saying, okay, we could not find 23 Park Circle, but we can find 23 Park Court in the city that you specified, in the ZIP code that was associated with it. So this is the type of problem we have to deal with and that the service deals with. And then we have to build it in such a way that it does the same thing whether you use it in bulk or whether you plug into it on the fly. So now we had to take that into account. Okay, we wanna build common libraries to do this, so we can use those either within BigQuery, for example, which is what we use to store our data in the back end.
And we also want to be able to use it as a sort of a quick function for the UI. So, yes, we had to do shared libraries, shared libraries that could be used within completely different systems. So that was definitely challenging and interesting. We didn't start with that; it evolved over time.
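Below is a deliberately simplified sketch of the string-cleaning and standardization steps just described; the abbreviation table and rules are illustrative, not the production model.

```python
import re

# Tiny illustrative abbreviation table; the real service handles far more cases.
ABBREVIATIONS = {"ST": "STREET", "AVE": "AVENUE", "W": "WEST", "E": "EAST", "NYC": "NEW YORK"}

def clean(raw: str) -> str:
    """First class of problems: strip stray characters, collapse whitespace."""
    return re.sub(r"[^A-Z0-9 ]", " ", raw.upper()).strip()

def standardize(raw: str) -> str:
    """Second class: tokenize and expand abbreviated components."""
    tokens = clean(raw).split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)

print(standardize("20 w. 34th st, nyc"))
# 20 WEST 34TH STREET NEW YORK
```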
[00:23:05] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours or days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they'll send you a cool water flask. As you decided to go down this path of saying, I'm going to build an automated entity resolution and address normalization component to our data pipeline, what were some of the design considerations? Did you have a specific SLA of it needs to be accurate this percentage of the time? It needs to have this, you know, amount of availability? Or was it just a matter of let's just build something out, see what we're able to do, and kind of take it from there?
[00:24:30] Unknown:
So you definitely want a certain level of accuracy. Usually, I think most people are very used to Google as sort of like an address processing solution, and Google is very good. Right? They have everything built in. They have a very strong NLP model that analyzes the address and removes the extra components of it. And at the end of the day, or at the end of the query, you get mapped to a specific lat long with a pinpoint on a map that says this is your address. This is your location. You need to be at least as good as Google.
But we also wanted to give you the extra perspective, because Google deals with address points. It's not going to tell you that these 2 different addresses are in the same building, or it's not going to tell you that this is the tax lot that the building is sitting on. Right? So we needed to also take this component into account when we built the service. So I guess our SLA was be at least as good as Google, and be able to solve for a typical dataset. We wanna be upwards of 80%. It really depends on how messy the dataset is. Right? We've seen datasets where some of the addresses you just can't use within the service.
You have an address like lot 45, no location provided. Or, yeah, or things like NA NA, New Jersey. Right? So, obviously, you're not gonna be able to do anything about that unless there's some sort of lat long that you could use. But in most cases, we actually take pride in the level of accuracy that we can reach. With the decision to build this in house, it wasn't something that we started with off the bat. We actually evaluated a bunch of external service providers that give you the same sort of solution. And we ended up realizing that, okay, so this service provider is better on this side of things, but they're really not that good in New York. And then this service provider has really good parsing, but they're not very good at entity resolution. So you can't really find a single source that would give us everything that we're looking for. So that's why we decided, okay, we're just gonna bite the bullet, build it ourselves, and see how far we can take it. For being able to
[00:26:50] Unknown:
standardize these address records, what are sort of the broad pieces of the system that you're working with and sort of the technologies that you've chosen to be able to build this system and just a sort of high level view of how an address traverses these different stages of pipeline and transformation and then being able to be used in other downstream pipelines?
[00:27:15] Unknown:
We use Airflow as a scheduler for our pipelines. And then within Airflow, every task that we do, for example, if you do an extract task, load task, transform task, or address service task, it all runs within its own Kubernetes container. So when it comes to the address service, like, it doesn't have to be something that happens in BigQuery. We can build it as a completely different thing. We can build it as a Python script that runs within a pod. Or we can build it as JavaScript, or we can build it as anything else, like Spark for NLP. And we can actually do a cascade of these things as long as we have the boundaries very well defined.
Boundaries as in we know, okay, this is the expected input, whether it comes in from a BigQuery table or whether it comes in from something else. We know that we can take this in. We activate the entity resolution engine on it. And at the end, we expect a BigQuery table to be written out. So this is what happens within the pipeline. The API can actually connect directly to a pod that's running the standardization service constantly. And then it just sends, like, a request to that pod, and the response is a standardized address string together with lat long and object ID.
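To make the orchestration concrete, here is a hedged sketch of an Airflow DAG where each step runs in its own Kubernetes pod. The image names, task IDs, and the operator import path (which varies by provider version) are assumptions, not Cherre's actual configuration.

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG("ingest_with_address_service", start_date=datetime(2021, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:

    extract = KubernetesPodOperator(
        task_id="extract", name="extract", namespace="pipelines",
        image="example.io/extract:latest")            # hypothetical image

    address_service = KubernetesPodOperator(
        task_id="address_service", name="address-service", namespace="pipelines",
        image="example.io/address-service:latest")    # runs the resolution engine on the batch

    load = KubernetesPodOperator(
        task_id="load", name="load", namespace="pipelines",
        image="example.io/load:latest")               # writes the output BigQuery table

    extract >> address_service >> load
```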
[00:28:33] Unknown:
Because of the relative complexity of this sort of system and the stage of your pipelines, I'm curious how that has impacted your overall design and structure of the broader system of DAGs that you're building, and sort of the dependencies for this stage of the pipeline, sort of how you manage triggering or alerting on any failures in being able to source the data, and then downstream being able to monitor for and manage the expectations of the downstream consumers of that stage of the pipeline for being able to then use that cleaned address data in other systems, and then how you manage records that aren't able to be properly normalized because they're just missing too much information?
[00:29:24] Unknown:
I think the key point here, I would say, is about having a clear understanding of the data flow and having very, very well defined service boundaries and module boundaries. Defining exactly what sort of input goes in and what sort of output comes out at every single stage was crucial to being able to set it up in such a way that we can use it everywhere. Basically, the system is built in such a way that it's a microservices approach. Every single part can be taken out, replaced, changed without any change to the overall behavior of the system. So, obviously, you know, if you have a successful match, what you will end up with is a standardized address plus the latitude and the longitude with an accuracy score and an object ID that goes with your address.
Now if you have an address that is not well formatted, or that for some reason we're unable to match to our canonical source of truth or unable to geocode to specific lat longs, what we will give you is basically the input back. Like, we will do our best to standardize it if we can actually parse it into the individual components. For example, if you input an address like 34 w 34 st, we might standardize it to 34 West 34th Street. But if we can't geocode it, it will come back with a geo accuracy code that says not available, basically, or the best guess that it could be. You know, okay, so we're not able to find the exact building, but here's the street level coordinates. So if someone is using it in our UI, which has a map on it, at least the map is gonna zoom in to the street level of where you would expect to find the property.
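The dictionaries below sketch the kinds of response shapes being described, full match, street-level fallback, and echo-the-input; the field names, accuracy labels, and values are assumptions, not the actual Cherre API contract.

```python
full_match = {
    "standardized_address": "34 WEST 34TH STREET, NEW YORK, NY",
    "latitude": 40.7494, "longitude": -73.9859,
    "object_id": "BLDG-12345",
    "geo_accuracy": "rooftop",
}

street_level_fallback = {
    "standardized_address": "34 WEST 34TH STREET, NEW YORK, NY",
    "latitude": 40.7496, "longitude": -73.9850,   # street centroid, not the building
    "object_id": None,
    "geo_accuracy": "street",
}

unresolvable = {
    "standardized_address": "NA NA, NEW JERSEY",  # best effort: echo the input back
    "latitude": None, "longitude": None,
    "object_id": None,
    "geo_accuracy": "not_available",
}
```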
Of course, if the address is something like, you know, NA NA, New Jersey, there's not much we can do about that. We'll send you to New Jersey, and good luck. But, yeah, it's a part of it. The idea is we try not to give you a completely empty response. Like, at least you will get what you put into the system. If we really can't find anything for you, at least you would get your input string. And if there's anything else in the system that we can connect on that,
[00:31:48] Unknown:
it will connect to it. Now that you have rolled out the system as part of your data pipelines, what are some of the other sources of information and other types of problematic data that you work with that you would like to see given a similar treatment and turned into a service to be able to automatically clean it up for you? Yeah. I think the natural candidate for that is going to be
[00:32:13] Unknown:
named entity resolution, specifically the Michael Jordans that we have in our system. So we wanna take an entity approach there as well. For example, we want to be able to resolve companies and the subsidiaries of the company and the subsidiaries of the subsidiaries and so on. So we wanna be able to standardize company and people names, which is a completely different ballgame, really, but you can still take the same service oriented approach. And this is, again, something that we wanna be able to do in bulk in the pipeline, and we also want to be able to do for clients on the fly. Because clients do wanna go into the UI, be able to type in, for example, a company name and see if we can resolve the portfolio for this company, just as an example.
[00:33:00] Unknown:
Going back to 1 of the things that you said very early on when you introduced yourself as a data scientist, I think it's interesting that you're working on this particular class of problem, because in some ways, it's a data engineering problem, but it also has a very data science heavy element to being able to solve this component of the data engineering pipeline. And I'm wondering if you can just talk to sort of the team structures and team dynamics that you have at Cherre and how you view the dividing line between data science and data engineering, and sort of how that plays out within your team, and how you've seen it play out in other organizations.
[00:33:38] Unknown:
There's this kind of a wall between data engineering and data science. And often, you know, the ball gets tossed back and forth over this wall. Data science comes with a solution for something, which, you know, will be done in a Jupyter notebook. And then data engineering, you know, gets that notebook and says, okay. Now how do I put this into a pipeline? Right? Do I wanna do this? Is it going to scale? Probably not gonna scale like that. We need to find a different solution. So that's really a common thread that I've seen in many places.
At Cherre, we try to take a cross functional approach to the teams. So for example, my team, at this point, we have 2 data scientists, 2 data engineers. We have a product manager, and we have a senior data engineer. It's been a very powerful approach to the solution. It really, really helps streamline all these combined projects where data science and data engineering need to be integrated, especially when it comes to services. So my team, I guess you can call it, focuses on the services for the company. We have the address service that we're working on, which is combined with a geocoder service, which is what we use to serve the lat long. We have the name service, which is the next project. And then we have the data scientists that are working on the knowledge graph that is going to use these 2 services in order to get built.
[00:35:04] Unknown:
Because of the fact that you have these cross functional teams, that has allowed you to produce some interesting types of systems that I haven't seen a lot of other teams build towards, where because of the fact that you have data scientists and data engineers working on the same project, you're able to bring this kind of machine learning style component to the pipeline and build it as a service into the overall data flow, whereas most of the time, the pipeline is generally treated as kind of a, not really a sort of dumb system, but, you know, very much reliant on just being very mechanical, and then the machine learning is something that happens at the end. And I think it's interesting to see a bit more of an evolution of machine learning being brought earlier into the life cycle of working with the data sources and being able to use that as a means of providing more high value data assets to, you know, analysts and end users and downstream data science teams.
[00:36:04] Unknown:
Exactly. Yeah. I mean, when Cherre started, it was focused on just New York City data. And the initial solution for addresses was exactly that, something that existed only within the pipeline. It solved a very, very specific problem and did not take into account a lot of the current capabilities that we have. And then we took a more machine learning engineering approach to it. Right? This sort of thing tends to happen at the end of the pipeline, but you can't do that, because addresses are coming in at the source, so you can't just standardize them at the end. We have tried standardizing them later in the process, and it just doesn't work. It just leads to complications.
You really wanna be able to handle them as a part of your standard transform before you bring in the business logic.
[00:36:53] Unknown:
I think too it's interesting that you implemented at least parts of this as user defined functions within BigQuery. And I know that there's been a bit of a trend of moving more machine learning into the database itself rather than having it be something that has to sit on top of it and pull data out and push data back in. And I'm wondering what your thoughts are on some of that or any of the other sort of interesting trends that you're seeing in how teams and technologies are able to facilitate more advanced data workflows?
[00:37:22] Unknown:
Yeah. The cool thing about being able to run machine learning models within BigQuery is scalability. When it comes to dealing with big data, BigQuery is really the natural choice for that, at least for us. Being able to define more powerful user defined functions is key for that. You know, think about trying to write a machine learning model with SQL. It's horrible. Nobody wants to do that.
[00:37:51] Unknown:
Right. Right? So
[00:37:53] Unknown:
being able to use user defined functions to use more natural tools to do this is really key. 2 things that I really wish BigQuery supported within their user defined functions are, a, the ability to make API calls to external services. That could be very powerful, you know, if you think about it. And then the second thing is, currently, the user defined functions don't actually support Python scripts,
[00:38:20] Unknown:
which is what most data scientists are gonna use, you know, right off the bat. Digging a bit more into the fact that you are using user defined functions as a portion of this service, I'm curious how you manage things like testing and versioning and updating the code that lives in the database and just sort of what your deployment pipeline looks like for being able to iterate on this service and be able to grow and evolve it? Man, there's a battery of tests.
[00:38:48] Unknown:
We have unit tests in place for the service. We also have dbt tests. We use dbt when we write the SQL models. So we have dbt tests on the data. Basically, every task that we have in the pipeline, there's either a test task afterwards or the tests are built in. So we're talking about unit tests, integration tests, data tests, end to end tests. We also use Datadog to do monitoring on GraphQL, just 1 of the interfaces that the clients are using. So we monitor GraphQL and the website.
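As an illustration of the unit test layer, here is a small pytest sketch; standardize_address and the address_service module are hypothetical stand-ins for the service's actual code, and the expected strings are invented for the example.

```python
import pytest
from address_service import standardize_address  # hypothetical module

@pytest.mark.parametrize("raw, expected", [
    ("20 w 34 st, nyc",        "20 WEST 34TH STREET, NEW YORK, NY"),
    ("20 West 34th Street NY", "20 WEST 34TH STREET, NEW YORK, NY"),
])
def test_alternative_spellings_standardize_identically(raw, expected):
    assert standardize_address(raw) == expected

def test_unparseable_input_is_echoed_back():
    # The service should degrade gracefully rather than return an empty response.
    assert standardize_address("NA NA, NEW JERSEY") == "NA NA, NEW JERSEY"
```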
[00:39:27] Unknown:
As you have built out the address resolution capabilities and built out the services for being able to manage it, what are some of the most interesting or unexpected or challenging lessons that you learned in that process, and what are some of the most interesting or appalling edge cases that you've run into as far as how the data is formatted?
[00:39:47] Unknown:
Oh, boy. Yeah. I guess you can't build a system without learning how to better build it the next time you approach the problem. There's been a lot of that. Of course, there's a lot of edge cases with this sort of thing. We need to deal with a lot of geospatial data. So we deal with footprints datasets. We deal with address points datasets. I think 1 example was when we decided to bring in open data for the top 20 markets in the US. So a lot of cities have really good open data. For example, New York City has the best open data that I've seen from building permits to transportation, to courts.
Anything you want for New York City is really available. And Nashville was 1 of these cities that we wanted to bring in. So we took in the address points data for Nashville, or the building dataset for Nashville, where we have the addresses and we have the coordinates for a building, and we took it into our address point dataset. We extracted it, loaded it, transformed it. And then when we connected it to our other datasets, we found out that the entire city of Nashville was resolved to a single tax lot in Tennessee, which seems very counterintuitive.
So we plotted the data on a map, and we found out that the entire city was actually mapped to an area which was about 1 square foot in size. So, yeah, 1 foot by 1 foot. It's like a nano Nashville kind of thing. Right. And the reason is because they were using this, like, unique coordinate system, and there was some weird numerical error when we transitioned it to our coordinate system. It mapped to the right place, but just several orders of magnitude smaller.
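As an illustration of the kind of coordinate system handling involved, the snippet below reprojects a point with pyproj. The EPSG codes and coordinates are assumptions (Tennessee State Plane in US survey feet to WGS84); declaring the wrong source CRS, or treating feet as if they were another unit, is the kind of mistake that can collapse a whole city into a tiny footprint.

```python
from pyproj import Transformer

# Declare the coordinate system the source data was actually published in,
# then transform into plain latitude/longitude.
to_wgs84 = Transformer.from_crs("EPSG:2274", "EPSG:4326", always_xy=True)
lon, lat = to_wgs84.transform(1_740_000, 660_000)   # made-up Nashville-area easting/northing
print(lat, lon)
```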
[00:41:50] Unknown:
That's funny.
[00:41:51] Unknown:
Yeah. So this was 1 of the edge cases.
[00:41:54] Unknown:
And now that you have put the system into production and you've sort of learned the lessons of going through the trial and error and making it production ready, as you look back, what are some of the changes that you would make if you were to start over today, or sort of improvements that you would make over the existing system if you were given the time and resources?
[00:42:15] Unknown:
Yeah. So like I said, you can't build a system without learning how to do it better next time. There's a lot of improvements that could be done. And I think the 1 aspect that I'm really happy about is at least the way it is built right now. So the clearly defined service and microservice boundaries, the module integration, and the client interface. These clear definitions actually enable us to make significant changes without really breaking anything or, hopefully, without breaking anything which is in production or client facing. So we can do all of our testing separate from what is exposed to the clients.
We can completely swap out components. We can change the language that we use for the service. We can replace the entire service. But as far as the client is concerned, they would only notice incremental improvements as we work on the system. So I think that's the most important thing when you start working on a service. Try to have a very clear definition of how you interact with it and how the client is interacting with it. So if you have these clear input and output expectations, anything in the middle can be a black box, which means you can do anything you want to it, and you would still get the same output out of it. So are there any other aspects of the work that you're doing at Cherre,
[00:43:44] Unknown:
either in terms of the address resolution and the system that you've built to turn that into a service for your data pipelines, or the overall challenges of messiness in real estate data, or some of the overall trends in being able to move machine learning earlier in the life cycle of the data systems, that we didn't discuss yet that you'd like to cover before we close out the show? Having the ability to do,
[00:44:10] Unknown:
machine learning processing on addresses in this case, but also building the same thing for names, earlier during the pipeline proved to be crucial, at least for us. And then, I think we talked a little bit about the knowledge graphs. These are some very, very cool processing tools, but the real insight comes once you can actually run AI and deep learning algorithms on the knowledge graph itself.
[00:44:39] Unknown:
This is where you start gaining insights and seeing structures that you haven't seen in the data before. Yeah. Knowledge graphs in general are sort of an interesting construct and 1 that I've touched on in a few different episodes in a few different areas. And I definitely look forward to seeing them be used in more contexts because I think that that's going to be able to provide a lot more power to people who are just doing very basic analytics right now. And as knowledge graphs become more approachable and more manageable, I'm interested to see sort of what kinds of insights that helps us surface.
[00:45:12] Unknown:
Exactly. And in our case, the knowledge graph, I think John probably mentioned it, has several billion nodes in it. Yep. So whatever algorithms we want to use, we need to make sure that they scale on the order of the number of nodes. Right. At worst.
[00:45:35] Unknown:
Well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:45:50] Unknown:
Yep. So I think I mentioned it before. We use BigQuery in the back end, and we found it to be an extremely powerful tool. And then we also use a lot of user defined functions. And there are 2 things that I would really love to see support for coming from the Google Cloud Platform for BigQuery. The user defined functions currently don't really support sending external calls to APIs, which I think could be extremely powerful if we could do that. And then the other thing is they also don't support scripting in Python, which means, you know, Python is really the natural language for a lot of data scientists.
So just being able to take in, like, import a Python module and just run with it in there could be very, very, very powerful and very natural.
[00:46:42] Unknown:
Yeah. Definitely, it would be an interesting evolution to the database market. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing at Cherre for being able to push machine learning earlier in the pipeline life cycle and some of the capabilities that that's allowed you to unlock. It's definitely a very interesting project, and I appreciate you taking the time to share it with us and sort of explain the benefits that it's been able to provide. So I appreciate your time and effort, and I hope you enjoy the rest of your day. Yeah. Alright. Thanks for hosting. This was great. Thank you for listening.
Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Tal Galfsky: Introduction and Background
Cherre's Approach to Real Estate Data Management
Challenges in Real Estate Data: Address Normalization and Entity Resolution
Importance of Address Resolution in Real Estate Data
Building an Address Resolution Service
Technologies and Pipeline Integration
Future Projects: Named Entity Resolution
Cross-functional Team Dynamics at Cherre
Machine Learning in Data Pipelines
Lessons Learned and Edge Cases
Improvements and Future Directions
Final Thoughts and Closing Remarks