Summary
Knowledge graphs are a data resource that can answer questions beyond the scope of traditional data analytics. By organizing and storing data to emphasize the relationship between entities, we can discover the complex connections between multiple sources of information. In this episode John Maiden talks about how Cherre builds knowledge graphs that provide powerful insights for their customers and the engineering challenges of building a scalable graph. If you’re wondering how to extract additional business value from existing data, this episode will provide a way to expand your data resources.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. We have partnered with organizations such as ODSC, and Data Council. Upcoming events include ODSC East which has gone virtual starting April 16th. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing John Maiden about how Cherre is building and using a knowledge graph of commercial real estate information
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Cherre is and the role that data plays in the business?
- What are the benefits of a knowledge graph for making real estate investment decisions?
- What are the main ways that you and your customers are using the knowledge graph?
- What are some of the challenges that you face in providing a usable interface for end-users to query the graph?
- What technology are you using for storing and processing the graph?
- What challenges do you face in scaling the complexity and analysis of the graph?
- What are the main sources of data for the knowledge graph?
- What are some of the ways that messiness manifests in the data that you are using to populate the graph?
- How are you managing cleaning of the data and how do you identify and process records that can’t be coerced into the desired structure?
- How do you handle missing attributes or extra attributes in a given record?
- How did you approach the process of determining an effective taxonomy for records in the graph?
- What is involved in performing entity extraction on your data?
- What are some of the most interesting or unexpected questions that you have been able to ask and answer with the graph?
- What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with this data?
- What are some of the near and medium term improvements that you have planned for your knowledge graph?
- What advice do you have for anyone who is interested in building a knowledge graph of their own?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Cherre
- Commercial Real Estate
- Knowledge Graph
- RDF Triple
- DGraph
- Neo4J
- TigerGraph
- Google BigQuery
- Apache Spark
- Entity Extraction/Named Entity Recognition
- NetworkX
- Spark Graph Frames
- Graph Embeddings
- Airflow
- DBT
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you get everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances.
Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing John Maiden about how Cherre is building and using a knowledge graph of commercial real estate information. So, John, can you start by introducing yourself? Hi. I'm John Maiden. I'm a senior data scientist at Cherre. I work on adding insights and features that we can extract from the data we process for our customers. And do you remember how you first got involved in the area of data management?
[00:01:11] Unknown:
So my background is as a data scientist, and I mean, I think most people say, oh, you know, you focus on the science component of the job. But, you know, data is actually a very important component of what we do. You know, especially, you have to think about the types of data you have, the type of information it provides. That's always a big one. You know, you can have lots of data, but if it doesn't really say much, then it's kinda useless. And then, when you get into the field of... people use big data a lot, but, you know, when you're working with terabytes of data, you have to really think hard about what information is in the data, how is it contained, where does it go, is it gonna be something that you care about and be useful, how does it all connect together? I mean, especially if you think about, you know, joining data when you have terabytes of data, it becomes very hard. So data management is a big part of what data scientists need to think about and do. And so
[00:01:55] Unknown:
you are currently working at Cherre. Can you give a bit of a background on what it is that Cherre is building and the role that data plays within the business?
[00:02:03] Unknown:
So Cherre provides data for our customers in the commercial real estate space. What we're doing is making sure that they have all the data they need to be able to execute and, you know, make the best decisions for their business. A lot of data tends to be very siloed. Real estate's one part of it, but a lot of industries have this problem too, where you have the data you have and you know it's got value, but you don't really know how much value you've got until you realize that there's a whole universe of data out there that can also be connected to your data. So what Cherre does is help make those connections. So taking a lot of amazing public data... you know, I'm a big fan of New York City open data. They provide a lot of great information that we can use. There's other cities that also have their own initiatives to provide great open source data that we can use to drive business, connecting that data, as well as third party paid data and data partnership data that we pull in. There's a lot of other companies that provide a wide range of data that is useful for the commercial real estate space. So, besides the usual tax and transaction information, you've also got demographic information, you've got transportation, you know, there's just lots of different aspects, and our customers have many different ways of thinking about the data. Our customers can be brokers, they can be property developers, they can be insurers, they can be financials. And each one of them has different cares and interests in the data. So it's getting all of their interests and all their needs consolidated, putting that data together, as well as being able to combine their data into the pipeline.
So getting out of that silo. So not just saying, oh, you've got that data and, you know, the data you have is useful, but taking that data and combining it across the wide variety of data available really makes a powerful
[00:03:37] Unknown:
offering to customers, to realize that they can have many more insights with the connected data we can get. And so for being able to connect all the data and perform some of the useful analysis for your customers, you have built it into a knowledge graph using some of those open data sets that you mentioned. And I know that you also purchase data sets from different brokers or real estate owners. And I'm curious if you can just discuss a bit about the benefits that a knowledge graph provides in terms of being able to make informed investment decisions in the real estate space and some of the challenges that alternate representations of that data pose?
[00:04:13] Unknown:
Yeah. So, you know, as you said, we get data from public sources. Some of it we pay for, but some of it's from a wide variety of data partners. Knowledge graphs are a useful tool. There's a lot of different aspects of data that the traditional database covers... let's think of it from the real estate space. So you have a property, number of bedrooms, number of bathrooms, square footage, you know, all the traditional aspects of data that you collect. That's very useful. That helps people get insights and information from the data they have. Knowledge graphs are a different way of organizing data. If you look online, there's a lot of descriptions of a knowledge graph. You can look at it from a scientific perspective, from a computational perspective, but at the end of the day, it's about relationships within your data. You know, so you're saying that A is connected to B and this is how they are connected. Now, to do that, you have to think about how your data is structured. It's about organizing the data in certain ways, and it's also about what value you want to get from that data, because there are many different ways that data can be organized and connected.
In our case, you know, we're very interested in how properties are connected to people or corporations. And so extracting and going through those data sets and organizing the data in that way is very useful, and then that allows us to build a lot of great products on top of that connected data, where the emphasis is on the connections. Right? That's why we build a knowledge graph: to show how data is connected. And then in terms of the end users of the knowledge graph and the analysis being performed, I'm wondering if you can give a bit of a flavor of the interactions and the types of questions that are being asked of that knowledge graph and some of the challenges that you face in terms of being able to expose that underlying graph in a way that's intuitive and easy to use? Yeah. So to use the knowledge graph that we have, I would say for us, we see it as an internal resource. We're the internal users for it right now. So a lot of what we're doing is we're querying it through databases. We've got some graph tools as well, and a lot of it is analysis, so we've got a great data science and machine learning team. We're spending a lot of time just analyzing the data and looking at what we wanna get out of it. It's a jumping-off point for other products that we can build. And so, I would say, you know, the visualization tools right now... it's complicated. So, originally, we built a smaller knowledge graph just based on New York City data. When we did that, there were tens of millions of edges, and just visualizing that in a graph database was kinda hard, because we would say, okay, I wanna look at a specific property, and it would have hundreds of connections based on how we collected the data. So, you know, you can filter it down. You can try to emphasize the way that everything's related, but it's still a big data dump that makes it hard to... you know, you can see all these connections, but at the end of the day, what is relevant? So for us, a lot of it is... especially when you get to the national data. National data has billions of rows, or billions of edges. So, with billions of edges, it's even harder. I mean, I'd love to say visualization is important. It's something that helps us understand the data, but, you know, traditionally, what we've been doing so far is just organizing it as a graph with edges. We're doing a lot of analytics, and then we're also doing machine learning on top of that to try to extract insights. So, the knowledge graph by itself at the moment, it's useful. It allows us to see the data, but, you know, visualization gets very hard when you have hundreds of millions of nodes and billions of edges. It's more about aggregate statistics, so you have to think about it in terms of big data.
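To make the edge-centric structure described here a bit more concrete, the following is a minimal sketch of how property-to-company and property-to-address relationships can be kept as simple subject-relationship-object records and loaded into a small graph. The entity IDs and relationship names are invented for illustration; they are not Cherre's actual schema.

```python
# A minimal sketch of a knowledge-graph edge list: each record says
# "subject --relationship--> object". Entities and relationship names
# here are illustrative only.
edges = [
    ("property:123-main-st", "OWNED_BY", "company:123-main-st-llc"),
    ("company:123-main-st-llc", "MANAGED_BY", "person:john-maiden"),
    ("property:123-main-st", "HAS_ADDRESS", "address:123-main-st-ny"),
    ("person:john-maiden", "MAILING_ADDRESS", "address:123-main-st-ny"),
]

import networkx as nx

# Load the triples into a directed multigraph so neighborhood queries are easy.
g = nx.MultiDiGraph()
for subject, relation, obj in edges:
    g.add_edge(subject, obj, relation=relation)

# Who or what is directly connected to the property?
for _, neighbor, data in g.edges("property:123-main-st", data=True):
    print(neighbor, data["relation"])
```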
So if you're trying to use this for something, the knowledge graph should be a driver for your products. So for whatever you're building, what's the coverage you're expecting? What's the accuracy? You have to think about it in more quantitative terms. If you have a specific use case, like if you're looking at a specific property, then, you know, visualizing it is great. But generally, we're working with the data in aggregate and trying to figure out, nationally, how this information helps us. And I know that
[00:07:58] Unknown:
in terms of graph information, there have been a few different approaches to being able to actually store and analyze it, starting with some of the RDF triples and using a triple store or using a graph model that is stored within a relational database. And then there have also been engines that are specifically built for storing and processing graphs and graph algorithms, such as Neo4j being the most notable, but also things like Dgraph or TigerGraph. And I'm curious what you're using
[00:08:27] Unknown:
as the actual storage engine for being able to house this information and query it, and any challenges that you're facing in terms of being able to scale it to the volumes of data and the numbers of queries that you're trying to process. Yeah. So the example I gave would be then when we did the New York City knowledge graph, which was the the ramp up to doing the national 1. So the New York City 1, that was something that, you know, we could process on a single, computer, and it had tens of millions of edges, but you could still do enough analysis. I mean, a lot of it was extracting data from sources. So this is the, you know, the complexity of, first, before you build a good knowledge graph, you just need the data in place to be the in the first case. And so, a lot of our engineering efforts, and we've got a very large dedicated engineering team, is focusing on building out resilient, repeatable, scalable pipelines that pull in everybody's data in a consistent way, in a safe way. And, once you get that data there, then it's very easy for, you know, data scientists, machine learning engineers to go ahead and say, okay. These are the sources I need. I need to pull them out. I need to extract all the pairs. So, you know, we didn't go through the traditional relationship extraction that you normally would do. I mean, for us, at least for New York City, it was very straightforward to just pull the data of, you know, we're looking for properties to people, properties to, addresses.
We had always built on all public data for the New York City 1 because New York City provides a great set of data. So, a bit of Python, a bit of, where Google Cloud shops or BigQuery, that got the data into the right format to start with. In terms of visualization, Neo4j was very important because this was our initial, play. We wanted to figure out, does it make sense? It's, you know, it's hard to imagine, can I do this in with New York City since that's where we're from? That at least was something we could focus on and know clearly, like, you know, we know New York City very well. Did we get this right or not? And you can see the data right in front of you. So, you know, Python, Neo 4 j were great for the initial analysis. Scaling becomes very hard, I agree, and our emphasis is more on the analytics. For us, the knowledge graph is a driver for new products that go in front of our live customers. So, being able to process a graph at scale, as well as being able to do analytics on it is more important than, you know, being able to visualize it all. So our focus has been on, so building out a resilient pipeline, ingesting the data to then, you know, putting everything in BigQuery because obviously, we've got billions of edges to work with. I'm a big fan of Spark and so that's what we've been using to drive it has been, Spark, especially graph frames because that can ingest the data. It can quickly find the patterns we need. It can do a lot of, analysis quickly. And so, that's where I'm a big fan of is I'm more on the the learning analytics side and I think that's a very good general tool for performing bulk operations. I mean, it only takes, you know, the process a couple of 1, 000, 000, 000 edges takes us a couple of hours to run on a decent sized cluster and that makes me very happy. And another issue of being able to build this type of data store and do the entity extraction
[00:11:18] Unknown:
and resolution and being able to establish the edges between the different nodes within the graph, there are a lot of challenges because of the fact that, pulling from multiple different data sources, I'm sure they don't all have the same representations or a common schema or format.
[00:11:42] Unknown:
Yeah. So the data is extremely noisy in many different ways. I mean, I've worked with many large noisy datasets. This one, real estate in particular, because a lot of real estate data is collected at the county level. It's just how things are done in the US, which means that you have varying levels of information and formatting. I mean, zoning codes vary. Everyone's different. So if I say I've got a national database of real estate data, you know that there's gonna be tons of inconsistencies. So some of this stuff just involves having great analysts who really know their data. A good portion of the team has strong real estate knowledge and can quickly look at the data and say, this makes sense or this doesn't, you know, this is relevant for what our customers care about or not. So having subject matter experts is always a critical part, you know. If you're gonna build a knowledge graph, the graph part is cool, but you also need the knowledge. And so, going through the data, determining what is relevant and what's not, with our data, yet again, because we're coming in from multiple sources, sometimes you just have to think about... you know, you have to apply some very strong business logic to the data.
Finding missing data components is very hard. So trying to find ways to either fill them in, or at least provide enough information that you can complete the graph, is useful. Now, I mean, most of the time we have a very complete, comprehensive data set, but, you know, not everything is gonna be perfect, especially when we're trying to combine the data. On the messiness side, just because we have providers, or we ourselves have put the data together into a national database, it doesn't really mean it's always connected. The biggest ones that drive us crazy are addresses and names. And so, to build a powerful knowledge graph... I mean, if anyone's worked with real estate data, you probably know that addresses are very complicated.
You can have multiple addresses that all mean the same thing but are written differently. So, we used to be on 6th Avenue. You can write 6th Avenue with the number, 6th; spelled out, S-I-X-T-H; or as Avenue of the Americas. Those are all valid addresses. But then, if you've got multiple datasets that each use different spellings, those are gonna lead you to different points. So you can't connect the data that way, especially if you have, you know, the wrong state put in, or maybe a transposed zip code. So address standardization is a big effort. You know, to get to the knowledge graph, putting the data together is not as important as being able to clean it up well, and our big effort's been on address standardization. So making sure that all the addresses we get from all the different sources match together, that's a big lift in itself, and that's a service that we provide. We use it internally to clean our data, as well as provide it to our customers to allow them to connect their data. Entity resolution is another big one. Buildings can have multiple addresses. So what most people see for a building is probably the mailing address, but, you know, with a range of addresses for different buildings, you also have to put in time and effort, both from a data science and an engineering perspective, to resolve all the different datasets. I mean, our current office has a street address and an avenue address, which means that depending on which dataset you're using, you would have one data set pointing to our street address and another to our avenue address. You gotta resolve both of those to actually meet in the middle. And the last one that's tricky is names. So everyone, you know, you think that, oh, names should be easy to do. Every different dataset has different ways of formatting names, especially when we're trying to connect them across disparate data sets. So, you know, one data set might have Maiden comma John, another one might have John Maiden, a third set will have John W Maiden. Are these all the same Johns? Is there a different John out there? And if there is a typo and someone had John Q Maiden versus John W Maiden, how do you guarantee that you know that these are the same people, that the middle initial is not important, or that you know enough that they are the same or that they are distinct people and you have to keep them separate? So, name resolution is very important.
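As a rough illustration of the address standardization problem just described, here is a toy normalizer. The suffix, ordinal, and alias tables are invented for the example; a production service would rely on much larger dictionaries or dedicated geocoding tooling.

```python
import re

# Toy address normalizer: collapse a few spelling variants to one canonical key.
SUFFIXES = {"ave": "avenue", "ave.": "avenue", "st": "street", "st.": "street"}
ORDINALS = {"sixth": "6th", "first": "1st", "second": "2nd", "third": "3rd"}
ALIASES = {"avenue of the americas": "6th avenue"}

def normalize_address(raw: str) -> str:
    text = re.sub(r"\s+", " ", raw.lower().strip())  # collapse whitespace
    text = ALIASES.get(text, text)                   # whole-string aliases first
    tokens = [SUFFIXES.get(t, t) for t in text.split()]
    tokens = [ORDINALS.get(t, t) for t in tokens]
    return " ".join(tokens)

# All three variants collapse to the same key, so records from
# different sources can be joined on it.
print(normalize_address("Sixth Ave"))               # -> 6th avenue
print(normalize_address("6th Avenue"))              # -> 6th avenue
print(normalize_address("Avenue of the Americas"))  # -> 6th avenue
```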
The other problem, and this ties back to putting everything into a graph, is that certain names are very common. You know, there are definitely many John Does in New York City. I mean, not John Does per se, but there's a lot of people with very common names. If you look at the data, you could say, you know, this one guy owns 20% of New York City real estate. That's not really the case, because if you then put everything into a graph and you start looking at all of the different networks, you can say, well, obviously, all of these different connections go to this one person, this one name, John Doe, owns all these properties. But if I look at it from a graph perspective, I really see disparate networks.
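The common-name problem can be shown with a toy graph: if person nodes are keyed only on the raw name string, unrelated ownership networks collapse into one apparent mega-owner, while keying on a richer (hypothetical) name-plus-address pair lets the distinct networks separate again. This is just a sketch of the idea, not Cherre's actual entity resolution logic.

```python
import networkx as nx

# Keyed only on the name string: four properties all appear to share one owner.
g = nx.Graph()
for prop in ["property:A", "property:B", "property:C", "property:D"]:
    g.add_edge(prop, "name:John Doe")
print(nx.number_connected_components(g))  # 1 -- everything merges

# Keyed on name plus a (hypothetical) mailing address: the graph separates
# into the two distinct ownership networks that are really there.
h = nx.Graph()
h.add_edge("property:A", "person:John Doe|100 Uptown Ave")
h.add_edge("property:B", "person:John Doe|100 Uptown Ave")
h.add_edge("property:C", "person:John Doe|9 Downtown St")
h.add_edge("property:D", "person:John Doe|9 Downtown St")
print(nx.number_connected_components(h))  # 2 -- two separate owners
```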
So there's a back and forth, and building a knowledge graph is very much an iterative process. There's cleaning the data as well as you can. There's putting everything into a knowledge graph format. Then there's looking at how the data actually connects together, and then doing further cleaning and iteration, because you're never gonna get anything perfect on the first try. You need clean data to build the graph, but once you build the graph, you can still see how noisy your data is and you can do further rounds of iteration. And so being able to do that is very important. A knowledge graph is not just about getting the data in one place. It's also recognizing all the effort and time that goes into getting it there in the first place and making those connections. And for us, you know, address and name standardization is very important to be able to make it as powerful and as useful as we can. Another element of the entity resolution is things like businesses that have multiple
[00:16:45] Unknown:
doing business as names. And I'm curious if you try to collapse those into a single entity as well, or do you keep them as distinct entities and just represent the relationship between them?
[00:17:01] Unknown:
So one of the big features, one of the big use cases for our knowledge graph at the moment, is owner unmasking. A lot of commercial real estate properties, for multiple reasons, tend to be owned by separate LLCs, and especially with some of the big players you'll see, like, if you have 123 Main Street and 124 Main Street and they're both owned by the same company, the owners on the tax rolls will be 123 Main Street LLC and 124 Main Street LLC. It becomes very hard because, you know, even if you know personally these are owned by the same owner, there's no clear way to connect them back. And our customers care about who the true owner is. So that's one of the use cases for our knowledge graph: getting all the data together so that we can do the connection. So going from the property to all these other dots and going all the way back to a true owner. So trying to actually get behind, and get away from, the LLCs and all the intermediaries.
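Owner unmasking, as described here, is essentially a multi-hop walk from a property through intermediary entities until a terminal owner is reached. Below is a minimal sketch with invented entities and a deliberately naive traversal rule; real data would have competing candidate edges and would need scoring rather than a simple walk.

```python
import networkx as nx

# Property -> LLC -> holding company -> person. Entities and relationship
# names are illustrative only.
g = nx.DiGraph()
g.add_edge("property:123-main-st", "company:123-main-st-llc", relation="OWNED_BY")
g.add_edge("company:123-main-st-llc", "company:acme-holdings", relation="MANAGED_BY")
g.add_edge("company:acme-holdings", "person:jane-smith", relation="CONTROLLED_BY")

def unmask_owner(graph: nx.DiGraph, start: str) -> str:
    """Follow ownership edges until reaching a node with no outgoing edges."""
    node = start
    while graph.out_degree(node) > 0:
        # Real data can have several candidate edges per node; this sketch
        # just takes the first one to keep the walk simple.
        node = next(iter(graph.successors(node)))
    return node

print(unmask_owner(g, "property:123-main-st"))  # -> person:jane-smith
```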
Another part of name resolution, which I think you kinda touched on, is that it's not always just one to one. A lot of the data sources we're working from don't just say, you know, Maiden comma John W. It'll say something like, you know, Bank of America on behalf of John Maiden Trust as represented by John W Maiden. And so it's not just about cleaning up the names, it's also about doing multiple entity extraction and also prioritization. Right? Like, in that case, the entity that you care about would be something like the John Maiden Trust, or specifically John Maiden. You wouldn't care about the bank. So there's a lot of logic. It's not purely about making sure that, you know, John Maiden shows up in the correct format. It's also being able to recognize what the true entities are. So, is this a person? Is this a corporation? Is this a trust? And, you know, for the LLCs particularly, is this really the endpoint of the graph, or is there a further connection to go where, you know, this actually resolves up to big player number one? And another element of
[00:18:50] Unknown:
challenges in terms of the consistency of the data is, as you mentioned, you're pulling data from multiple different counties and municipalities and states, and they're all gonna have different systems that they're storing that information in, each of which has its own restrictions on field inputs. And one of the common challenges in basically any input system is, as you said, name resolution and the name formats that are allowed. So in one case, it might be first name and last name only, in which case, you know, maybe the middle name, or if there's a hyphenated last name, it'll get all stuck into the same field. Or you have different cultures where they have different mechanisms or different standards in terms of what the family name is versus the personal name. I'm also sure that there are a lot of cases that you run into of unicode naming being munged into some sort of ASCII representation. And I'm curious if there are any particular horror stories that come to mind in terms of just some of the challenges of being able to clean up some of the specifics of name or entities, and I'm sure also with some of the business or property information as well. Yeah. Generally, I think name resolution can be very tricky,
[00:19:55] Unknown:
especially since there's no one consistent way to look at names. I think there are some locations that are better than others in terms of tracking information. So sometimes you have to go with some ambiguity. You know, it's better to be slightly more ambiguous than to try to join stuff together. Like, going back to the example: if I had a John Q Maiden and a John W Maiden and I really couldn't tell if they were separate people or not, maybe it's a little bit better to have both names. I mean, especially for our customers... for owner unmasking particularly, brokers care about this. So brokers say, I found a property. I really love this property. I think my customers really wanna buy it. I don't know who to contact to make an offer. As long as you have, like, contact information, they don't particularly care if it's John W. Maiden or John Q. Maiden. Now, that doesn't mean that we're done iterating, and it helps the more information we pull in. So the more different data sources you can add in, that helps as well. Triangulation's important. So, you know, the New York City graph, everything was built with public data, but we have such a rich set of data across, you know, tax information, transaction information, building information. There's just so many different rich data sources that it helps narrow it down just from the data itself. It's a balance. Right? So if you've got great data sources, then you don't have to worry as much about using data science or some of the really cutting edge NLP techniques. But that also assumes that you've got New York City, where the city is very consistent and homogeneous in the way it's presenting its data. As you've mentioned, once you start getting outside of the city and you have to go across the country, where you have, you know, thousands of different regimes when it comes to data collection, that's where you wanna get as much good data as possible, but then you have to start thinking about, you know, getting as many smart people as you can into a room to tackle the difficult problems of getting the data there, getting it on time, making sure it's in reliable pipelines, and also, you know, rolling out a lot of cutting edge NLP to try to analyze and resolve as much as you can, with the assumption that you're never gonna be 100% perfect. And as long as you get the customer to where they need to be, if you give them actionable data, I think that's the most important thing: can you give them data that's actionable? And if that's the case, then that's what they care about. Another aspect of messiness in data and the different regimes of data that you're getting the information from are things like missing attributes or extra attributes,
[00:22:11] Unknown:
where one record might have the name and property name and geographic location. Another one might just have the name and property name, and you have to do your own figuring out of where it is in terms of latitude and longitude. And I'm wondering what your strategy is for being able to handle that inconsistency in the availability of specific attributes
[00:22:30] Unknown:
and the approach that you've taken to determining an effective taxonomy for representing all of these different attributes of the data within the graph? Well, I would say that depending on the use case, sometimes you don't need that complicated a knowledge graph. The knowledge graph itself is primarily about relationships. So if you care about A connecting to B, or a property connecting to this person or this company or something like that, sometimes that's just enough. And I would say knowledge graphs are important from the connection perspective. That said, for some of the other use cases that we care about, we do actually care about a lot of the building features and a lot of information that we can gain from adding in other datasets.
You know, sometimes you're just gonna have to... yeah, you have to think about what data is critical. You'd like to have as much data as possible, but sometimes you're gonna have to sacrifice; you know, coverage is more important than accuracy in certain cases. And then you have to pivot to what's the story that you wanna tell. You've got the data. The knowledge graph is there. You're kinda limited by the quality of your data. There's a couple different routes. One might be, I really wanna tell story A and I really need to get this other dataset that will help me get there. And if it's something that's obtainable, then that's an effort we have to put in to make sure that we can actually communicate what we need to communicate with the data. If it turns out it's just not there, I mean, sometimes there are things that can be implied or extracted from other data sources that might be tangential, or it might be that, you know, this use case is a great idea, it's something that we'd love to tackle, but it's just not practical with the current state of knowledge. But it's also iterative. I mean, I think once you start building knowledge graphs, there are certain things that you can learn from them that you can then feed back into the knowledge graph itself. You know, once you've done owner unmasking, you can learn about portfolio ownership across multiple locations, and that becomes an attribute or information that you can use to drive further insights. But it's also that if you have enough data, if there's a small percentage where you're missing coverage, you can potentially impute it or work around the fact that you don't have perfect coverage around the country. I mean,
[00:24:40] Unknown:
there's also understanding how recently this data was acquired. And from that, you also need to be able to go back to the source and say, at what point was this data set generated at the source, to be able to ensure that you have up to date information? Because particularly with real estate, properties might change hands somewhat rapidly, and you wanna make sure that you're contacting the appropriate owners. And I'm wondering what your approach is there in terms of being able to ensure that freshness of data and the data lineage tracking to know how accurate and how up to date that information is. Well, it's a combination of business and technology. So on the business side, it's identifying the partners, whether we're buying the data, obtaining it, or we have a data partnership. So finding the companies that give us the freshest data on a repeatable basis that also give us the best coverage. And, you know, having been in the data space for a long time,
[00:25:28] Unknown:
you might not have one partner that gives you all the exact coverage you need. Maybe they're really good in one vertical but not another. And so you might have to do some piecemeal work to create the best data set from as many different components as possible. And if you can find a partner that gives you everything you need, you know, pursue them, make sure that they're part of your platform. On the technology side, keeping it fresh is investing in engineering and technology, so making sure that you've got a strong team that's empowered to build really good quality, scalable data. I mean, the data we have, some of it's small, but a lot of it is, you know, terabytes of data that gets processed. Making sure that it's ingested automatically, without bugs, on a repeatable process. I mean, data freshness is also about quality control. So it's, you know, finding the data you need, making sure that it can be delivered to customers, but also putting in place monitoring and other control processes to know what's going on with the data. Are we seeing the same data being refreshed again and again? Is it delivering cleanly?
Are we getting the coverage that we expect? Is this, you know, the data we expected to see on a regular basis? And so there's a lot of moving components to guaranteeing freshness.
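A bare-bones example of the kind of freshness and coverage monitoring described above. The thresholds, field names, and inputs are assumptions made for illustration, not a description of Cherre's actual checks.

```python
from datetime import datetime, timedelta, timezone

def check_feed(latest_record: datetime, row_count: int,
               expected_rows: int, max_age_days: int = 7):
    """Return a list of problems found in a single data feed delivery."""
    problems = []
    age = datetime.now(timezone.utc) - latest_record
    if age > timedelta(days=max_age_days):
        problems.append(f"stale: newest record is {age.days} days old")
    if row_count < 0.9 * expected_rows:  # simple coverage floor
        problems.append(f"coverage drop: {row_count} rows vs ~{expected_rows} expected")
    return problems

issues = check_feed(
    latest_record=datetime(2020, 3, 1, tzinfo=timezone.utc),
    row_count=850_000,
    expected_rows=1_000_000,
)
print(issues or "feed looks healthy")
```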
[00:26:36] Unknown:
From the perspective of leveraging the knowledge graph for doing analysis, what have you found to be some of the most interesting or unexpected questions that you've been able to answer or particularly notable insights that you've been able to gain by querying the graph and being able to do some exploration of it?
[00:26:54] Unknown:
So the data in itself... I mean, the knowledge graph, I think, is always cool. I like collecting data in the first place, and I think, you know, sometimes you say, oh, John, you're a data scientist, you gotta do something with it. But, you know, let's recognize the amount of time and effort that's put in... there's a lot of technology effort put in to just getting a large knowledge graph that anyone can see as useful. It's very easy to collect data together, put it in one place and say, oh, we've got our database, we've got tons of data, we can do amazing stuff with it. So the amount of time and effort just to collect it, as well as clean and process it... a lot of engineering work, a lot of data science and machine learning effort to, you know, standardize and clean the data. I mean, in itself, just getting a knowledge graph that works, that's always something that makes me excited. But, yeah, once we start collecting the data, I'm always excited, especially when it comes to big data, like, how do you actually verify big graphs? How do you verify that you did something right? When we're doing owner unmasking, we've got hundreds of millions of properties to check. You can't check every one individually, and you can't guarantee that you got it 100% right. So looking at the top results and saying, yep, I definitely got this going. I mean, I know that's kind of a large scale type of answer, but generally it's like, you know, can you build something at scale and get it to work? And then, when you see the results coming out of it, it's like, yep, that's exactly what I expect to see. That generally makes me happy, but it's also, you know, if I can put in specific properties... I mean, we're based out of New York, so, you know, start querying all the big, famous properties and, like, yep, we got this building right, the building next door, we got that right. You know, I think it's sometimes the small things, even if those aren't necessarily important to our customers. It's being able to get the big picture as well as some of the smaller resolution and seeing, like, yep, the data actually pulled this out. Like, when you see the connections that you wouldn't have seen, I think that's the big one. Especially given that we have a graph, it's not... you know, I think when you think about traditional data sources, you say, okay, well, I'm just gonna put all this data into a database, I'll do a join of table A to table B, and then I'll see that, oh, obviously, this property is owned by that person, because that's how the data connects. It's not as straightforward, and especially when you're doing a graph, it really can be multiple hops or it can be some type of aggregate information on the data. And so when you're seeing results that come from two or three hops out and it's pretty accurate... like, if I get a result quickly from owner unmasking, and I then have to sit down and say, oh, okay, well, I'm gonna look through this and see where the data came from, and I have to look through a couple of different sources to find the connections, but it totally was solid... that's a really powerful graph and that's what makes me happy. It seems too that because of the fact that there is such a dearth of information available to people in the real estate space
[00:29:27] Unknown:
that there is probably going to be a bit of an increased tolerance for some level of ambiguity or uncertainty in the data just by virtue of the fact that they're already getting access to more information than they would have had otherwise. And I'm curious what your thoughts are on that level of tolerance within the real estate industry, and some of the challenges that people in other verticals might be facing if they're trying to leverage a knowledge graph and be able to account for some of these sources of uncertainty. Right. So I think,
[00:29:58] Unknown:
going back to my earlier comment about the data being siloed, like, the data that most companies have, they usually have really great, insightful data, but it tends to cover one very small part of the market. If you've got a broker or a property developer, maybe they only work in one neighborhood. So they care about this neighborhood. They care about this type of vertical. That's their strength. That's their power. That's great for them. But the question is, like, if you think about expanding that information, if you just get a little bit more information, what can you do with that? Could you potentially go to a neighborhood you haven't heard much about before, one you know about but don't really know that much about, and be able to find great properties that also get you what you need? I think it gives you the strength to be able to expand your horizons and look further beyond what you currently know, because there's just so much more data out there. I mean, it's amazing the types of data we can collect normally, but it's also validating a lot of the insights that they have. So because they only have one small piece of the puzzle, it's being able to see that the data has multiple aspects. It's not just about, you know, sales per square foot or lease per square foot. It's about demographics. It's about how the neighborhood is changing. So looking at the data over time, that's something that, you know, is not always available to a lot of these players. So having time information, being able to look at a potential future prospect. So neighborhoods or properties that aren't there yet, but probably would be worth your time in 6 months to a year. So I think that's a very important thing, is that once you open your mind to a lot of the other options, if you expose yourself to different sets of data, then you can make yourself and your business a lot stronger. We recently had a hackathon, and our teams, you know, we had great cross functional teams that were doing a lot of great projects. A lot of what we did was trying to investigate some of the common thoughts in commercial real estate. You know, what are millennials trying to do?
How do we determine what an up and coming neighborhood is? And so, you know, beyond the traditional gut feeling of what people do, and I know there's a lot of data driven players in the space, but, you know, given the data that we have and the insights we can generate, I think, within the company, we had just tons of great ideas. And, you know, from within the data that we can see, I'm sure that our customers will see even more than we can if they have their data, and they put it in one place, and are able to expose themselves to a lot of different ways of viewing the data and seeing it over time. And as far as your experience and your team's experience of working with all of this data
[00:32:19] Unknown:
and the disparities in it and being able to build a useful knowledge graph out of it and build products based on that, what have you found to be some of the most interesting or unexpected or challenging aspects of that work? I mean, from a data perspective,
[00:32:33] Unknown:
so generally, building a knowledge graph, it's trying to determine what the relevant data is. So I think you mentioned earlier about taxonomy. Right? Originally, when we started to build the knowledge graph, our taxonomy was just focused on: we have properties, we have addresses, we have people, and we have companies. And that's kind of a, you know, very stark, let's-do-this original version; build it out and see what happens. As we started building and getting feedback, both internally and externally, we realized that, well, identifying someone as a company is important, but our customers do care a little bit more about fine grained information.
So, adding in... you know, they care about whether this owner is an educational institution. They care about whether it's a government owner. I mean, there's a lot of government owned properties, so those aren't necessarily gonna be on the market. So adding in some richness is very important, and getting feedback was a big thing. So, you know, starting simple and adding feedback is important. Along the way, the challenges were getting it to scale. So having, you know, billions of edges... Google Cloud can handle a lot of this data very cleanly. Getting it into Spark and getting it to process and making sure it doesn't blow up... I mean, if anyone's used Spark, that's a very common problem: you know, just because the code looks clean and you know exactly what it should be doing doesn't mean you're not gonna get lots of weirdness with memory errors and the like. So scalability is always important. If we've got a couple billion edges, we have a process now that runs with no problem. But eventually we wanna add in additional datasets. We're gonna have many more edges. And so on the technology side, scalability is always important, making sure that we can add in additional features, add in more. And definitely, you know, I put in a lot of late nights, and the team has also worked a lot on this, to make sure that we can make something that covers... you know, when we say national coverage, national is big. And scalability, repeatability,
[00:34:15] Unknown:
making sure that it fits in a production system, that's a big challenge for us. Looking to the near and medium term, what are some of the improvements or enhancements that you have planned to the actual content of the knowledge graph itself or the pipeline and tooling that you have to be able to build and power the graph?
[00:34:33] Unknown:
So on the data side, it's to add in additional datasets. I mean, I think we've got a good mix of data so far, focusing on a lot of, you know, real estate data. So it's kind of general. But basically, beyond the traditional tax and transaction information, to be able to do entity resolution you definitely need a good source of information about corporations. So those are always good feedback. But then just getting more data that's relevant to the space. So there's additional datasets we've identified that we think are critical to connecting the dots. I mean, as I said, the more data sets you can use that triangulate or supplement each other, that's important. So, especially when you're doing a graph, it's more about the relationships. It's not just about the relationship one thing has to another, but the interconnectedness.
So, if a property is connected to an address connected to a person, and that person and the property are also similarly connected, that makes it a much stronger connection. You can say, you know, I've got this relationship that I know is solid, as opposed to sometimes your data sources might have some random noise. So, I guess it also ties back into your issue about noise: the more datasets that you can use that empower the existing data you have, that's very important, especially on a graph structure, because you wanna have lots of overlapping data sources. So anything that we can identify that makes those connections stronger, or highlights specific connections as opposed to others, that's what we're really looking for on the data side. On the technology side: productionization, productionization, productionization.
As a data scientist, my big focus has always been on making processes that run scalably, repeatably, not manually. So anything that we can do to make sure that everything runs smoothly: we can ingest all the data sources we need to, we can extract all the pairs, we can clean up the addresses and the names, we can build the graph, we can run the jobs, and then have this process run on a regular basis given all the data that updates. That's my big focus. I think that's a big accomplishment. You know, the data science is cool. I love doing data science and I love the implications it has for customers, but making sure that it runs consistently and repeatedly,
[00:36:31] Unknown:
that's what makes me happy. And for anybody who is interested in building a knowledge graph of their own or in the early phases of that process, what are some of the pieces of advice that you have or any useful references that you can point them to? So building a knowledge graph, there's a couple of good resources.
[00:36:48] Unknown:
Generally, if you look online, you're gonna find lots of great resources pointing you to Google. So I think before you go into building a knowledge graph, you have to ask yourself, what do I need to do? Like, sometimes traditional structured databases work very well for the types of problems you wanna solve. You have to ask yourself: I'm building a knowledge graph because I care about how data is connected, I care about the relationships between data, and I'm looking at more than one hop away. So, you know, if it's simply just A is connected to B, that's something you can get from, you know, working with a traditional database structure. It's more about how nodes connect to each other, how they connect to their neighborhoods, and the structure of the neighborhood. So you have to think about, you know, it's about multiple connections and how each of these connections interplays with the others. So the first question you should ask is, what's my business use case? What am I trying to accomplish? If I think that emphasizing data relationships is very important, then that's something that I need a knowledge graph to be able to build and use. From there, I think the next step is to think about: is the knowledge graph in itself an end goal, or do I need to build stuff from it? Sometimes having the data just in one place is all you need to accomplish your goals. You've got data. You can see it. You can visualize it. Perfect. You're good. So, yeah, I would say think about what the end product is. For us, the knowledge graph is a resource that we use to build off of, so when we think about how the knowledge graph is built and processed, we think about it in terms of our end products. How do we support owner unmasking? How do we support a lot of the other goals that we have for this data set? So, first is the business use case. Then, once you have a clear business use case in mind, it's to think about: where is this going? How do I construct this data? What data do I need? Right? So starting from the business use case, you just take a step back and say, to get there, are there specific datasets I need to build off of or ingest? Are those things that can easily be joined together, or do I have to put in additional processing time and effort? Is this something I can build or buy? And so, before we get into the modeling, I'm sure if you're a data scientist listening in, you're like, but what about the model? What about... Think about the process and what you're trying to accomplish. So getting there is very important, but starting from the end goal is critical.
Otherwise, you have a giant collection of data, and everyone's like, oh, that's really nice, but, you know, what does it do? So business use case first. From there, take a step back and think about what data you need to support it. And then, you know, then you start to get into the, okay, now we definitely need this, let's look at the technical aspects. In that case, you have to look at, well, okay, so how am I gonna support this? Is this gonna be a small amount of data? Is this gonna be big? If I start building this, is this something that I can easily put into something like Neo4j or some of the other great competitors out there? So understanding that from a technical perspective. Like, is this something that I need to visualize, or is this something I need to process? You know, potentially there's some great Python packages out there; I'm a big fan of NetworkX. So if it's small enough, is this something that I could run in NetworkX to get the data I need from it? If I need to do this at scale, then how do I process it? So, I'm a big Spark user, so I kinda default to Spark when it comes to processing large graphs. I know that there are other great packages out there. So Spark and GraphFrames are also great technical resources to understand.
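For the at-scale path, here is a rough sketch of what working with an edge list in Spark GraphFrames can look like, assuming the graphframes package is available on the cluster. The rows, column values, and two-hop motif are illustrative only.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # requires the graphframes package on the cluster

spark = SparkSession.builder.appName("kg-sketch").getOrCreate()

# Toy vertex and edge DataFrames; in practice these would be read from warehouse
# tables with hundreds of millions of rows. GraphFrames expects vertices to have
# an `id` column and edges to have `src` and `dst` columns.
vertices = spark.createDataFrame(
    [("property:1", "property"), ("company:llc-1", "company"), ("person:1", "person")],
    ["id", "kind"],
)
edges = spark.createDataFrame(
    [("property:1", "company:llc-1", "OWNED_BY"),
     ("company:llc-1", "person:1", "MANAGED_BY")],
    ["src", "dst", "relation"],
)
g = GraphFrame(vertices, edges)

# Two-hop motif: property -> intermediary -> possible true owner.
two_hops = g.find("(a)-[e1]->(b); (b)-[e2]->(c)")
two_hops.selectExpr("a.id AS property", "b.id AS intermediary", "c.id AS candidate_owner").show()

# Connected components (requires a checkpoint directory on the SparkContext).
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")
g.connectedComponents().show()
```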
Depending on what your use case is, you might wanna think about getting into more of the machine learning aspects. So these are graphs. Are you gonna be doing graph analytics on this? Are you looking for connectivity? Then maybe you wanna find a good book on the basics of network structure, about how a graph gets analyzed. If you're trying to get more complicated, are you then starting to think about, you know, deep learning? Do you wanna do crazy things like graph embeddings? Do I need to do graph deep learning models? Those are the things that you have to think about, like, how far down the rabbit hole do I need to go to get my business accomplishment done. And so, in that case, you know, it's iterative: start with the business, start thinking about how to get the data and analyze it. And then you think about, can I do this with basic analytics? Do I have to start thinking about much more complicated packages? Do I have to start going through some of the graph literature? I mean, if you're interested, there are good papers out there. I saw a good paper and I'll probably get you a link later. There was a good paper that took a very comprehensive, this-is-everything-to-know-about-knowledge-graphs approach. It is a published paper, so it's gonna, you know, give you a very deep perspective, from a computer science point of view, on what knowledge graphs can do. But, you know, generally, it's: where does this need to go? How do I accomplish it? And then, from there, the engineering and machine learning challenges that will get me to my goal. Are there any other aspects of knowledge graphs in particular, or the work that you're doing at Cherre, or the challenges that you're facing, that we didn't discuss that you'd like to cover before we close out the show? I would just say that building a knowledge graph is a big team effort. I mean, there's a lot of different moving parts. And it's a mix: there's the challenge, on the business side, of identifying what data is relevant and useful for our customers.
There's the challenge of having a strong engineering team to build out the pipelines and get them running in a repeatable, scalable process. And, you know, also having a good data science team that is willing to tackle big problems, particularly at scale. At the end of the day, gathering the data is important, but it is very important to also understand the data. I'm a big fan of understanding. There are some people who say, you know, data is data, I'll throw it into a big model and the model will figure it all out. You know, understanding where the data is coming from, understanding how useful it is, understanding which features or which sources are more relevant than others... having good business knowledge is always very important. And so, when you're trying to build this, when you're trying to do knowledge graphs or you're trying to work with data in general, whether you're at a company or doing this yourself, you wanna make sure that you have a good mix of technology as well as business expertise to really make it a powerful product.
[00:42:23] Unknown:
For anybody who wants to follow along with you or get in touch about the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. So I see a lot of tools where I think everyone's focused on how the data is viewed. So, you know, there's really good tools about tracking data. So as a company, you know, we have
[00:42:48] Unknown:
we use Airflow for our pipelines, we use DBT to manage a lot of our SQL scripts. I think there's a lot of great tooling for visualizing how the data is flowing. So data flows are very well understood. How the data is connected through a pipeline is very well understood. I would say, as a whole through the industry, there tends to be a gap about what the data is. Now, over the years, I've worked with a lot of large engineering teams where I've said, okay, we need to find data that can do X, Y, Z because this is a business need. And they kinda say, well, this is the data diagram. And it's like, okay, so which table gives us what we need? Like, I'm not really sure. I know it's somewhere in this big mess, you know, somewhere here. So I would say, understanding... what was the business value of the data? Because collecting the data is important, but understanding where it came from, how it was used, what its limitations were, you know, going back to data freshness, how often it's updated, who the final users are, which also kinda ties back into what the use case was. So being able to attach business meaning and understanding to datasets,
[00:43:47] Unknown:
I think it's something that, as a whole in terms of data management, most of the time it's focused on data as an object, where data is more of a living, breathing thing. And I think there should be better tooling and focus on providing systems that allow you to recognize that as well. Yeah, it's definitely something that I can agree with, that in engineering it's all too easy to lose sight of the actual business value and business purpose of what it is that you're doing. So I appreciate your insight on that being a continued problem with the actual data assets that we're working with. So thank you very much for your time today. I appreciate all the expertise that you're bringing, and definitely wish you the best of luck on your work at Cherre. And I hope you enjoy the rest of your day. Well, thank you very much for your time as well. Enjoyed talking to you. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Overview of Cherre and Its Data Management
Building a Knowledge Graph for Real Estate
Challenges in Data Visualization and Analysis
Technical Aspects of Storing and Querying Graph Data
Handling Noisy and Inconsistent Data
Owner Unmasking and Entity Resolution
Name Resolution and Address Standardization
Leveraging the Knowledge Graph for Analysis
Challenges and Insights from Building the Knowledge Graph
Future Enhancements and Improvements
Advice for Building a Knowledge Graph
Team Effort and Business Relevance
Biggest Gap in Data Management Tooling