Summary
Scaling the use of data across an organization requires solving a number of challenges related to discovery, governance, and integration. The key to those solutions is a robust and flexible metadata management system. LinkedIn has gone through several iterations in search of the most maintainable and scalable approach to metadata, leading to their current work on DataHub. In this episode Mars Lan and Pardhu Gunnam explain how they designed the platform, how it integrates into their data platforms, and how it is being used to power data discovery and analytics at LinkedIn.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Pardhu Gunnam and Mars Lan about DataHub, LinkedIn’s metadata management and data catalog platform
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what DataHub is and some of its back story?
- What were you using at LinkedIn for metadata management prior to the introduction of DataHub?
- What was lacking in the previous solutions that motivated you to create a new platform?
- There are a large number of other systems available for building data catalogs and tracking metadata, both open source and proprietary. What are the features of DataHub that would lead someone to use it in place of the other options?
- Who is the target audience for DataHub?
- How do the needs of those end users influence or constrain your approach to the design and interfaces provided by DataHub?
- Can you describe how DataHub is architected?
- How has it evolved since you first began working on it?
- What was your motivation for releasing DataHub as an open source project?
- What have been the benefits of that decision?
- What are the challenges that you face in maintaining changes between the public repository and your internally deployed instance?
- What is the workflow for populating metadata into DataHub?
- What are the challenges that you see in managing the format of metadata and establishing consistent models for the information being stored?
- How do you handle discovery of data assets for users of DataHub?
- What are the integration and extension points of the platform?
- What is involved in deploying and maintaining an instance of the DataHub platform?
- What are some of the most interesting or unexpected ways that you have seen DataHub used inside or outside of LinkedIn?
- What are some of the most interesting, unexpected, or challenging lessons that you learned while building and working with DataHub?
- When is DataHub the wrong choice?
- What do you have planned for the future of the project?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- DataHub
- Map/Reduce
- Apache Flume
- LinkedIn Blog Post introducing DataHub
- WhereHows
- Hive Metastore
- Kafka
- CDC == Change Data Capture
- PDL (LinkedIn's schema definition language)
- GraphQL
- Elasticsearch
- Neo4J
- Apache Pinot
- Apache Gobblin
- Apache Samza
- Open Sourcing DataHub Blog Post
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard earned expertise. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. If you've been exploring scalable, cost effective, and secure ways to collect and route data across your organization, RudderStack is the only solution that helps turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open source foundation, fixed pricing, and unlimited volume, they are enterprise ready but accessible to everyone.
Go to dataengineeringpodcast.com/rudder
[00:01:45] Unknown:
today to request a demo and get 1 free month of access to the hosted platform along with a free t shirt. Your host is Tobias Macey. And today, I'm interviewing Pardhu Gunnam and Mars Lan about DataHub, LinkedIn's metadata management and data catalog platform. So, Mars, can you start by introducing yourself? Hello. I'm Mars. I lead the metadata team at LinkedIn. I joined LinkedIn about 3 and a half years ago to lead the, at the time, fledgling metadata team. This was actually kinda my first dabble in the data world. Prior to that, I was mostly working on, you know, cloud infra related projects. But I do believe that coming from a different background, you sort of bring new ideas to,
[00:02:22] Unknown:
this area. Coming from a different background helps bring a new perspective into this area, and I certainly hope that DataHub will be the proof of that. And, Pardhu, how about you? I'm Pardhu Gunnam from the DataHub team. I have been the engineering manager on this team for the last couple of years. In my first 4 years at LinkedIn, I worked on various data analytics platforms and projects. On various occasions, I really felt the necessity for a good, connected metadata story to solve a lot of the data problems holistically across systems. Then I started working on tracking systems, operational tracking systems, and eventually saw the whole big picture around the metadata and moved to the metadata team at LinkedIn. Having the customer perspective of the requirements of such a metadata system definitely helps me to understand the key pain points which DataHub needs to solve for its users.
[00:03:14] Unknown:
And, Mars, you mentioned that you hadn't had much experience in the data space before working on Data Hub, but do you remember how you first got introduced to data management there and a little bit of the experience of getting started with it? Sure. I mean, it's not like I have 0 experience with data, but data at Google is very different from data outside of Google.
[00:03:34] Unknown:
So, I mean, I did work on some, you know, simple sort of MapReduce or Flume jobs, the frameworks adopted at Google for data processing and whatnot. But certainly not in the analytical world, certainly not doing all these fancy queries and that sort of stuff, dashboarding and all that. But I do feel like coming in here, it's fairly easy to relate because, you know, even though I worked in the infra area, it is still technically big data we're working with, except it's mostly sort of online data, with databases and whatnot. So it's not like completely foreign territory. But definitely, you know, joining LinkedIn and joining the data team, you know, opened up a new world to me where I see how, you know, this data can actually be used in all sorts of different ways offline and then produce magical results there. And, Pardhu, do you remember how you first got involved in data management? Yeah. Like I said, I was working on one of the projects to trace, like,
[00:04:31] Unknown:
operational issues for a lot of metrics which are computed on a platform called Unified Metrics Platform at LinkedIn. And a lot of these issues are actually related to the connectedness across systems; it's never a problem within one system. So that single project got me hooked into a lot of metadata related projects and interested in this particular team, which is solely focused on solving data management using metadata.
[00:04:59] Unknown:
And so you mentioned that you're working on metadata management. You're using Data Hub. But can you give a bit more of an overview about what the Data Hub project is and some of its backstory?
[00:05:10] Unknown:
Sure. So we published an engineering blog post on the LinkedIn engineering blog site about a year ago. The title of the post is DataHub: A generalized metadata search and discovery tool. It actually has a pretty decent section on the backstory, and it's quite an interesting backstory. I would definitely encourage your listeners to go and read the full blog. But, you know, just in short, DataHub is kind of a v3 of our attempts to do metadata solutions at LinkedIn. It incorporates all the learnings that we have from the previous attempts, and we do believe that it is a much better product compared to its predecessors.
[00:05:49] Unknown:
So, yeah, definitely, I'll encourage the listeners to find more details in that blog post. And you've mentioned that this is the third iteration of metadata management at LinkedIn. I'm curious what were some of the challenges that you were facing in terms of data discovery and metadata management prior to building the current incarnation of DataHub, and what was lacking in some of the previous solutions that motivated you to rethink and redesign the overall platform?
[00:06:17] Unknown:
Sure. So the whole metadata management effort at LinkedIn started with a project called WhereHows. This was probably about 5 years ago, when the project was first started and actually also got open sourced. And prior to that, there was actually no such thing inside of LinkedIn. That was actually the first of its kind. I honestly don't know; this was prior to my tenure at LinkedIn. I honestly have no idea how people got the job done before then when it came to doing search and discovery, because there were no tools to support that at all. So it was definitely one of the first attempts that we tried at LinkedIn, and it was kind of a natural way that you would approach this problem. And in fact, we do actually see a lot of projects out there that are still following this path. Right? You just focus a lot on, you know, crawling different sources, getting metadata into one place, and then, you know, having a sort of search back end that allows you to do a Google style search around this metadata.
And that's definitely the correct thing to do, in my opinion, if you don't have anything at all. That would definitely be a v1. But, you know, we have evolved so much since then. Once again, you know, the blog post talks about the exact reasons why we went through different phases of evolution. But with DataHub, what we have right now is kind of what we would view as a complete solution. It's not a solution that's just trying to address the search and discovery problem, or just trying to address, you know, the operational metadata problem or whatnot. It's kind of a full solution that's capable of addressing all these problems. And, you know, inside of LinkedIn, we're only scratching the surface right now of the capability of these tools. In terms of the overall problem space of data discovery and tracking metadata, what have you found to be some of the most
[00:08:05] Unknown:
difficult aspects of it and some of the elements of DataHub that have been most helpful for the end users?
[00:08:12] Unknown:
So there's two challenges. Right? One is on the consumer side, which is, like, you know, data scientists, data engineers, you know, AI engineers. They wanna consume this metadata in as efficient a way as possible. Right? Quickly find the stuff that they want, quickly figure out if this is the thing that they need, and quickly figure out how things are related, so on and so forth. So a lot of that sort of boils down to, you know, more or less UI challenges, assuming that you have all the metadata. And it's about how do you present that information back to the user in a useful way. And that's a constant challenge from our perspective as well, where, you know, we are working closely with UX designers and our customers to make sure that, you know, we are presenting information in the way that's most helpful to the users. But on the flip side, there's also the sort of gathering, or producing, if you will, of this metadata in the first place. And that in and of itself can be viewed as an infra problem, because you wanna have a proper infrastructure that's able to ingest all sorts of different metadata from all sorts of different places in a very scalable way. And I think that is a hard problem, because if you don't have that solid foundation, it's very hard to build useful things on top. And I think that's definitely one of DataHub's differentiators in our opinion, where we do believe that we have a very strong infrastructure, and we've created something that's general enough that it can be used to represent pretty much any sort of metadata you wanna throw at it without having to rearchitect the whole thing every time. Metadata management and data catalogs are a concept that have been around for a while in different forms and
[00:09:47] Unknown:
for different generations of data platforms, and they also range between open source and proprietary. I'm curious what was missing in some of those existing solutions that put you down the path of building something internally. I don't know if it was something that's specific to the data platforms that are being used at LinkedIn or the ways that you're processing the information, or if it's just a generational thing where the previously available
[00:10:14] Unknown:
data catalogs and metadata platforms just didn't suit the needs of what the current set of tooling and the overall ecosystem requires. I think I can definitely talk about why we chose an open source solution, or, you know, a homegrown solution, over a proprietary one, but we might need an entire podcast for that. It's sufficient to say that, you know, it's not very uncommon for companies to wanna capture highly customized metadata or to integrate with their in house solutions. Right? So having access to the source code is almost a must in those cases. And also, what we heard fairly commonly from our open source customers is that a lot of the time, they don't wanna lock themselves into some sort of proprietary solution. Right? It doesn't matter how great the solution is. I think, you know, it's kind of a general trend in the industry where people are more wary about, you know, locking themselves into a proprietary solution that they can't get off of. So I would say, you know, open source, in my opinion, is definitely the way to go, especially for something that's gonna be so widely integrated within the organization.
Not having open source is almost, in my opinion, not an acceptable option in that case. I wouldn't dig into what the available proprietary solutions are out there, because, you know, there are quite a few of them, but I don't think it's an apples to apples comparison in that case. So on the open source front, to be honest, there's actually only a handful of projects in this area if you search. And then, you know, we'd like to believe that we are the leading project in this area. And, you know, there's some of the stuff that we hear from our customers about why they choose DataHub, you know, including things like, you know, the architecture being super flexible and extensible, and also, you know, the fact that this system has been battle tested at LinkedIn scale. And also, you know, this is the only system that is stream first when it comes to metadata ingestion. In fact, the whole stream, or Kafka, to be more precise, integration is throughout the system. And then it also builds on top of, you know, a more modern big data architecture where pretty much everything can scale up and down the stack. And, you know, I think also, most importantly, a lot of people told us that we have an awesome and very responsive team behind the project, and that's the reason why they choose us. Because they believe that we will keep driving the project in a direction that will, you know, improve in terms of features and whatnot, and then fit their need if it doesn't fit their need today. So that's kind of it. Yep. And as far as the end users of Data Hub, who's the target audience, and how has that particular focus helped to direct the features and the design of the DataHub platform?
[00:12:54] Unknown:
We have categorized the audience for DataHub into 4 major personas. The first one is data producers and data consumers who interact with the data on a regular basis. The second one is the data operators who need to understand the data flow and the health of the pipelines that process or move this data for those data consumers and producers. The third one is an interesting persona: the organizational leaders, all the way from CISOs or CDOs to engineering managers, who are interested in getting the bigger picture of the data ecosystem with respect to privacy, or usage, or how the data flows within their org or team. The fourth one: DataHub's goal is beyond being just a data catalog. It truly acts as a centralized metadata store for other data systems as well, to operate their data ecosystem better. It acts as a source of metadata for all the systems to interact with other systems. These are the primary target audiences for DataHub today. Now, talking about your second question of how these end users influence the design: if you really look at it, the true goal of DataHub is to democratize data management. It is important for us to build the right metadata foundations, which are very generic and extendable and integrate well with any of these systems. Our actual end users, these personas which I talked about, and the use cases they come up with definitely influence our road map heavily. If you really look at it, we are bringing a very new approach of doing metadata around existing data ecosystems. So it is very important for us to be guided by the impact of the solutions we build to really solve these day to day data user problems. And you mentioned that Data Hub is used as a metadata catalog for other components in the ecosystem.
[00:14:51] Unknown:
I know that there are things like the Hive metastore that act as a source of information for table structures for things like data lakes using S3 or HDFS. I'm wondering if DataHub is also able to provide that same level of capability.
[00:15:06] Unknown:
So, yes, I think we pride ourselves on being a generalized metadata store or service, so that is certainly possible. But I think Hive Metastore is a special case, in the sense that unless you have a drop in replacement for it, it has deep integration with a lot of big data tools out there. So even if you have a comparable service, or an even better service for that matter, it's very hard to replace Hive Metastore in an instant. So I don't think it's our goal to replace Hive Metastore. I really think Hive Metastore is definitely a super important player in the ecosystem.
But gathering that metadata into our services, so we can link it up to other things that Hive Metastore wouldn't have access to, I think that's one of the critical roles DataHub and its system plays, because every system obviously excels at what it does. Every metadata service in a different ecosystem will definitely be truly good at what it's doing. But very few of them will have any interest in linking up their stuff with somebody else's stuff, because that's just not part of their charter. And I think this is where DataHub comes in and says, yes, we will be that player where we gather this metadata from other places. And for the cases where there isn't a metadata service, DataHub has often been used as exactly that. In fact, inside of LinkedIn, there have been multiple cases where a team is creating a brand new metadata service, and DataHub was the de facto choice in that case. Right? They would just create it on top of the same architecture that we build on, and automatically get those integrations.
[00:16:40] Unknown:
And can you dig a bit more into the architecture of Data Hub and how it's designed, and some of the ways that you're able to actually integrate the metadata into the catalog? Because I know that it's push based versus pull based. I don't know if you wanted to dig into that a little bit as well. Most certainly.
[00:16:58] Unknown:
So instead of going into the weeds about how the little things are implemented, I'll stay a little bit high level here and talk about the principles of our architecture, because the details could change over time. We know that, but the principles hopefully won't. So there are 3 principles that we build DataHub on top of. The first one, like you mentioned, is stream based. So everything, you know, all the ingestion, should happen through a push. For a system that doesn't have the capability to push, you know, you can still write a crawler to crawl the system, but then push out those things through Kafka. So the main way of ingesting is through a Kafka stream. The benefit of doing that is, you know, hopefully all your listeners will be able to relate almost immediately, because Kafka is not something new here. First of all, it allows incremental ingestion. Right? So whenever something changes, only at that instant do you actually push out the change. Of course, that's a lot more efficient than trying to do a snapshot every day, or every hour for that matter. And, of course, there's also the near real time benefit as well. Right? When something changes, some metadata changes, you get notified immediately and take action if you want to. And not just the ingestion part of it. Actually, we take this to the extreme. In fact, any metadata that gets gathered or, you know, directly updated on DataHub will also have a change stream coming out of it. Right? So you can imagine a standardized metadata CDC stream coming out of DataHub, where you can then build applications that trigger based on top of it.
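To make the change stream idea concrete, here is a minimal sketch of such a trigger-based consumer. The topic name and the JSON payload shape are illustrative assumptions rather than DataHub's actual wire format (the open source project uses Avro-encoded change events):

```python
# Minimal sketch of a trigger-based consumer of a metadata change stream.
# Topic name and payload shape are assumptions for illustration only.
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "MetadataChangeEvent",                       # assumed topic name
    bootstrap_servers="localhost:9092",
    group_id="metadata-change-reactor",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    urn = event.get("urn")                       # e.g. a dataset URN
    aspect = event.get("aspect")                 # e.g. "ownership" or "schema"
    if aspect == "ownership":
        # Trigger-based action: alert, re-index for search,
        # or kick off a compliance check.
        print(f"Ownership changed for {urn}: {event.get('value')}")
```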
In fact, we do that extensively inside of LinkedIn as well, where we have systems that basically just listen to these, you know, metadata changes and then take action based off of that. And because of this stream based approach, the architecture becomes very, very scalable, because now the problem becomes: how do you scale Kafka? And that problem, you know, is more or less considered solved at scale. Right? So that's the first principle, which is stream based. The second principle is extensibility. Right? So we want a system that is very, very extendable in all sorts of different aspects. And that's why we started with a generalized data model. Right? So we say, hey, look, we know there will be common things in metadata that, you know, everybody will need or depend on, but there are also very specific things that people wanna do just for their organizations. So the metadata model has to be completely generalized and extensible.
Because of that, we allow users to easily onboard new types of metadata regardless of their complexity. And we often joke about that: we know that we are successful when people spend 90% of their time debating how to model their metadata correctly and then spend 10% of the time doing the actual coding. Then we know we are successful. In fact, because of this extensible and generalized model, we were actually able to onboard more than 500 different kinds of metadata inside of LinkedIn, right, for over, you know, 50 different kinds of entities within a matter of 18 months. That's completely unheard of inside of LinkedIn. Right? For any sort of platform, you know, the normal speed of onboarding things is on the order of months, not on the order of days. And we were able to really, really reduce that because we have a very generalized and extendable architecture.
So that's the second point, which is extensibility. And then the third point is decentralization. So we realized that, you know, most tech companies nowadays run microservices. Right? Or even serverless for the more advanced companies. It seems pretty backward to say, hey, because it's metadata, we have to run it in a monolithic service operated by a single team. That just seems very backward. But in fact, that's how most metadata systems are built, because they automatically assume that because we need to gather data in one place, of course it has to be one single centralized giant service. But DataHub is designed from the ground up to support decentralized metadata services. And in fact, that's how we run this inside of LinkedIn. Right? We actually have, you know, 20 plus different metadata services that are part of the DataHub ecosystem, all sort of in charge of different kinds of metadata in that sense.
But for simplicity, in the open source version, we just have a single service, because, you know, that's what people are most familiar with, and we don't wanna overcomplicate the story. But we just wanna emphasize that the architecture is designed to be decentralized. It's not designed to be centralized. And so does that mean that you can have multiple different instances
[00:21:33] Unknown:
of Data Hub deployed across different environments, and then you're able to link them together to be able to do federated searches for data discovery?
[00:21:42] Unknown:
So the details, once again, you know, if you go to our GitHub repository or read the blog post, the details will surface there. But I'll just summarize it here by saying that we take a very microservice sort of way of doing things. So each service will have their own metadata. They will become the source of truth for that metadata, with a typical CRUD interface. But because we have this change stream system built into every service, we are able to gather the change streams in a central place and then use that to build up a search index and graph indexes.
Right? Because of that, yes, the search index and graph index are still, quote, unquote, considered centralized rather than federated, but that was more of a constraint of those systems themselves, because, you know, trying to do a completely distributed graph and a completely distributed search is a little bit harder. And that's why we chose to have this approach. And, you know, the current system is very scalable, so that's considered acceptable in that sense. And another element that you pointed out is
[00:22:43] Unknown:
the modeling of the metadata itself and being able to represent it effectively and discover it. So I'm curious what you have found to be some of the most useful ways of representing the metadata and any enforcement that you have in terms of defining a basic set of structures that are necessary for the records in the catalog to make them useful for the downstream consumers?
[00:23:09] Unknown:
Yeah. You're touching on a very, very good point here. So there are multiple ways of attacking this problem. I think they generally fall into a couple of buckets. One bucket is the sort of lowest common denominator bucket, where we basically say, okay, because there are so many things that are sort of heterogeneous, let's just take the most common thing across everybody and just say, hey, here's the model. It should work for all of you guys because, you know, it's the lowest common denominator, and then just move on with that. Right? But, of course, the problem with that approach is that you're gonna lose all this richness in the metadata, because now you abstract everything away. So everything looks like a, you know, a square hole, right, and you're trying to fit the round peg into it, so to speak.
So that's one approach. The other approach is the other extreme, where you say, okay, you know, we're gonna treat everything as key value pairs, right? Then we can represent anything we want. So everything is just a key and a value, and the value can be arbitrarily complex if you want, or, you know, it can be a map within a map within a map. Right? That approach, of course, is super generalized, and, you know, you can stick any kind of metadata you want in it. But then the consumers will hate you for that. Right? Because not everything looks like a map. Right? So all the logic of interpreting the map, or even trying to validate that schema, will fall onto the consumer side. And then nobody would like that sort of interface, because there's no guarantee on, you know, whether this API will even maintain backward compatibility going forward. So that's the second bucket.
So we're trying to strike a balance between the two extremes and stay in the middle, where we say, okay, we recognize that for metadata itself, it's very hard to give it a specific structure, so to speak. It's like when someone asks you what's a typical structure for data, then you'll be scratching your head and say, well, I don't know. I mean, it can be anything, it can be any shape. The same thing happens with metadata. We're not trying to pretend to know that, hey, metadata must have a certain structure. But then we also wanna put in some ground rules so that at least we make this thing useful for people. So first of all, we say, okay, we're gonna leverage a language that we have at LinkedIn called PDL, so at least we have a strong schema associated with the metadata you wanna capture. And then we also wanna put in some ground rules for how you associate this metadata with, you know, one another, or with the entity that it is supposed to be associated with. How do you do the hierarchy? How do you build a graph on top? How do you do indexes, and so on and so forth? So we put in some ground rules there, so at least things are able to be linked together. But when it comes to what goes into the metadata, we actually say you have complete freedom. You can put whatever you want in there, and that's actually the most powerful part of this. But we do recognize that there will be a subset of metadata that's, you know, once again, sort of common, that everybody will need and want. So the way we do it is we put out these common models, which we say, hey, look, we know these things are always gonna be kind of true-ish for everybody.
So, you know, we don't want you all to come up with, you know, 10 different ways of saying the same thing. Here are the common models. If you like them, use them. If you don't, you know, develop your own or extend them to fit your need.
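As a rough illustration of this entity-and-aspect style of modeling, here is a small Python sketch. It is not the actual PDL schema (those live in the DataHub repository); the field names are assumptions chosen to show how independent aspects such as ownership or schema metadata hang off an entity identified by a URN, and how an organization could add its own aspect without touching a shared lowest common denominator model.

```python
# Illustrative only: Python stand-ins for the kind of strongly typed aspects
# that DataHub defines in PDL. Field names here are assumptions, not the
# project's actual schemas.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Owner:
    owner_urn: str            # e.g. "urn:li:corpuser:jdoe"
    ownership_type: str       # e.g. "DATAOWNER"


@dataclass
class OwnershipAspect:
    owners: List[Owner] = field(default_factory=list)


@dataclass
class SchemaFieldInfo:
    field_path: str           # e.g. "profile.email"
    native_type: str          # e.g. "VARCHAR(255)"
    description: str = ""


@dataclass
class SchemaAspect:
    fields: List[SchemaFieldInfo] = field(default_factory=list)


@dataclass
class DatasetEntity:
    # Entities are identified by URNs; aspects are attached independently
    # and can be extended without touching the rest of the model.
    urn: str                  # e.g. "urn:li:dataset:(urn:li:dataPlatform:hive,tracking.page_views,PROD)"
    ownership: OwnershipAspect = field(default_factory=OwnershipAspect)
    schema: SchemaAspect = field(default_factory=SchemaAspect)
```

A team that needed an organization-specific aspect, say a data quality score, would add another record alongside these rather than bending a lowest common denominator model to fit it.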
[00:26:20] Unknown:
And in terms of actually populating the metadata, you mentioned that everything comes in through a Kafka stream. But I'm wondering if you can just talk through a bit more of the workflow of somebody who is publishing metadata about the datasets that they're collecting and curating, and then some of the ways that that metadata is used and represented at the other side for people to be able to discover useful datasets?
[00:26:46] Unknown:
So a small correction: I didn't say that every ingestion comes through Kafka. I said the majority of them come through Kafka, because for things that are sort of programmatic in nature, Kafka is the right way to go. But the service does also provide a REST API, where you can have the strong guarantee of read after write consistency: when you update something, you get the result back immediately. Right? So that exists as well. And those are generally for the more curated metadata, if you will. Right? When someone types in a description for a dataset, you know, or someone sets themselves as the owner of a dataset. So almost always when a human is involved, you kinda expect that, you know, read after write consistency, and then it goes through a REST API. You don't expect to send to Kafka and then, you know, wait for a few seconds and see it appear. That's just not natural, especially when it comes to a web app. Right? It's kind of the expectation, you know, that if I change something, I expect to see it change immediately.
So we have both avenues, and they are sort of, you know, dedicated to different purposes. Yeah. So that's kind of the integration side. Sorry, what was the other question? I think you were asking 2 questions.
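As a simplified illustration of those two avenues, the sketch below pushes a programmatic metadata change onto Kafka and makes a synchronous REST call for a human-curated edit. The topic name, endpoint path, and payload fields are assumptions for illustration, not DataHub's actual API contract.

```python
# Sketch of the two write paths described above. All names (topic, URL path,
# payload fields) are illustrative assumptions, not DataHub's real contract.
import json

import requests                      # pip install requests
from kafka import KafkaProducer      # pip install kafka-python

DATASET_URN = "urn:li:dataset:(urn:li:dataPlatform:hive,tracking.page_views,PROD)"

# 1) Programmatic ingestion: fire-and-forget push through Kafka, suited to
#    crawlers and pipelines that emit changes as they happen.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(
    "MetadataChangeEvent",           # assumed topic name
    {
        "urn": DATASET_URN,
        "aspect": "schema",
        "value": {"fields": [{"fieldPath": "profile.email", "nativeType": "VARCHAR"}]},
    },
)
producer.flush()

# 2) Curated edits: a synchronous REST call so that a human editing ownership
#    or a description in the UI gets read-after-write consistency.
resp = requests.post(
    "http://localhost:8080/metadata/aspects",    # assumed endpoint path
    json={
        "urn": DATASET_URN,
        "aspect": "ownership",
        "value": {"owners": [{"owner": "urn:li:corpuser:jdoe", "type": "DATAOWNER"}]},
    },
    timeout=10,
)
resp.raise_for_status()
```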
[00:27:56] Unknown:
And then the other question was on the other end, where somebody is consuming the metadata to try and discover datasets, or determine the health of a particular set of records, or the relative popularity or utility of some data, and how you're representing that on the user facing side of it.
[00:28:15] Unknown:
So on the consumer side, essentially, how do you consume the metadata? Once again, there are actually at least 3 different ways you can do it. There is, similar to the write path, the API path, where you can call the REST API to get things. And we're actually working on a GraphQL based API to make this even easier for everyone. So that is definitely there, and I think that will be the majority of the cases, where the web app will be calling those APIs. And the nice thing about it is that your metadata is structured, so your API is also structured. They actually look exactly the same as your model. Everything is generated, so to speak, from the same model. So that way, you know exactly what to expect to get in return from the API. The second path is, I think I mentioned before, the Kafka stream, the change stream. That's more for near real time, sort of, you know, action based triggers, and not necessarily for the web app, but, you know, the web app can leverage that as well if it wants to. And once again, whatever goes onto that stream is exactly the same model that you defined in the first place. So there's no surprise from the consumer's point of view. They don't have to transform the data further in order to map it into their expectation, because the expectation is what went in there in the first place. And then the third one, which is, you know, probably less obvious: because of the change stream that we have, we actually do stream that to offline and then create essentially a delayed, but, you know, close to real time, offline dataset of the metadata. And this is geared obviously towards, like, you know, analytical use cases, where people wanna say, hey, I wanna look at all the datasets or all the dashboards that we have that happen to be, you know, not used by anyone for the past 12 months, and so on, etcetera. So those are kind of the 3 ways to consume the metadata, if you will. Yeah. Just to add to that, part of the question you asked was also about search relevance and the relevant data, how do you provide it to the consumers? There are certain aspects in our web app consumption where
[00:30:18] Unknown:
search relevance is kind of provisioned, where the domain owners or the data producers can tune the relevance for their particular data type. And there is also some work around personalization based on what you have done or how you interact as a consumer with the rest of the data ecosystem. Those signals are taken in, like what team you belong to and things like that, to provide more personalized data discovery for you. And then another element of metadata management
[00:30:50] Unknown:
is versioning of that information, so that you can track when schema changes occur and then be able to ensure that the downstream consumers are using an appropriate version of that metadata for the point in time at which the source data was produced. So I'm wondering how you manage to store and represent the versioning of that metadata so that you can retain that information for the
[00:31:16] Unknown:
useful duration of its life cycle? I'm glad you asked that, because that's once again what I believe is one of the main differentiators of Data Hub: everything is versioned by default. In fact, we get complaints from people saying, you guys version too many things, can I tune it down? And the answer is yes. You can choose how many versions you wanna retain, or even do a time based retention if you want to. One of the early design choices that we made was to make all the metadata immutable, so every change results in a new version being captured.
And this was kind of critical. You know, I think the schema example you gave was a good example, but also, when it comes to, you know, privacy and compliance related stuff that humans generated, you do wanna keep, like, a detailed track record of how the metadata evolved, especially if the metadata is actually being used programmatically to make things happen. And if something breaks, you wanna know why it breaks. And when someone, you know, provided the wrong metadata, you wanna know why and when that metadata went wrong. So all of that you get out of the box. In fact, if you don't want it, you have to disable it. So that being said, that's what we normally call internally automatic versioning. Right? You don't need to explicitly tell people, hey, I'm on version 1 or version 2. By virtue of me changing the metadata, there's a new version that's generated automatically.
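A minimal sketch of what immutable aspects with automatic versioning can look like at the storage layer, assuming a simple relational table keyed by entity URN, aspect name, and version; the table layout and helper functions here are illustrative, not DataHub's exact schema:

```python
# Illustrative sketch of automatic versioning via immutable aspect rows.
# Table layout and helpers are assumptions, not DataHub's actual storage schema.
import json
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE metadata_aspect (
           urn     TEXT NOT NULL,
           aspect  TEXT NOT NULL,
           version INTEGER NOT NULL,
           value   TEXT NOT NULL,
           PRIMARY KEY (urn, aspect, version)
       )"""
)


def write_aspect(urn: str, aspect: str, value: dict) -> int:
    """Every write inserts a new immutable row; nothing is updated in place."""
    (current,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM metadata_aspect WHERE urn = ? AND aspect = ?",
        (urn, aspect),
    ).fetchone()
    next_version = current + 1
    conn.execute(
        "INSERT INTO metadata_aspect (urn, aspect, version, value) VALUES (?, ?, ?, ?)",
        (urn, aspect, next_version, json.dumps(value)),
    )
    return next_version


def read_aspect(urn: str, aspect: str, version: Optional[int] = None) -> dict:
    """Latest version by default, or any historical version for auditing."""
    if version is None:
        row = conn.execute(
            "SELECT value FROM metadata_aspect WHERE urn = ? AND aspect = ? "
            "ORDER BY version DESC LIMIT 1",
            (urn, aspect),
        ).fetchone()
    else:
        row = conn.execute(
            "SELECT value FROM metadata_aspect WHERE urn = ? AND aspect = ? AND version = ?",
            (urn, aspect, version),
        ).fetchone()
    return json.loads(row[0])


urn = "urn:li:dataset:(urn:li:dataPlatform:hive,tracking.page_views,PROD)"
write_aspect(urn, "ownership", {"owners": ["urn:li:corpuser:jdoe"]})
write_aspect(urn, "ownership", {"owners": ["urn:li:corpuser:asmith"]})
assert read_aspect(urn, "ownership", version=1) == {"owners": ["urn:li:corpuser:jdoe"]}
assert read_aspect(urn, "ownership") == {"owners": ["urn:li:corpuser:asmith"]}
```

Because rows are never updated in place, every historical value stays available for the kind of auditing and debugging described above.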
There is also explicit versioning, where you say, look, I know my version, I'm on version x.y.z for, you know, the schema or whatever. That is also supported, through a different mechanism, but that requires a bit more thinking in terms of how you model it and all that, so I won't get into the details of it. But I think the point is you get automatic versioning out of the box,
[00:33:06] Unknown:
without doing anything at all. And for somebody who's interested in deploying and maintaining an extension points that they should be aware of for a about and the integration and extension points that they should be aware of for ensuring that it fits well into their deployed ecosystem.
[00:33:27] Unknown:
Sure. So to simplify the deployment of Data Hub, you know, we have created Docker images in the open source, where people are able to just use it, whether they're using it in sort of a raw Docker environment, or they can use it in a Kubernetes environment if they want to. Actually, in fact, the community contributed the Kubernetes and Helm chart support, because we don't use that internally inside of LinkedIn, so we're not in the best position to provide that. And, you know, we are also planning to create examples of how you can deploy Data Hub to, you know, popular cloud providers, you know, Azure, AWS, GCP, and maybe even have some sort of one click deployment option available where people can just click and do it. So on the deployment side, I think it's pretty standard, and people should be able to just use it as they wish. In terms of integration, there are actually quite a few components used inside of Data Hub. You know, of course, there is the app side, you know, the website or the back end and whatnot. But there's also storage. Right? So, in fact, we support different storages, sort of SQL based storages if you want. So MySQL is kind of the default, but you can choose MariaDB or Postgres if you want. For storing the metadata, we're also working on the NoSQL equivalent, so you can store it in MongoDB if you want. So that's the storing of the raw metadata.
But we also use a search index as well as a graph. And this is once again the strength of Data Hub: when we designed it, we made sure that we never marry ourselves to one specific technology. So we wanna be able to say, hey, you can pick and choose whatever search engine you want, or you can pick and choose whatever graph DB you want, you know, through a high level of abstraction. Even though right now we use Elasticsearch as the default search engine and Neo4j as the default graph DB, we are working on making this even more generalized, so you can pretty much apply it to any existing graph DB or existing search index that you have that is not the default. So I think in terms of integrating that part, it will only get better, if it's not already fitting, you know, 90% of the use cases out there. And another interesting aspect of the work you're doing with Data Hub is the way that you're
[00:35:34] Unknown:
managing the relationship between the internally deployed and internally used code that you're building at LinkedIn and the open source project and some of the conflicts and challenges of trying to keep them both in sync. So I'm wondering if you can talk a bit more about the benefits and motivations for releasing Data Hub as open source and some of the ways that you're approaching the ongoing maintenance
[00:35:57] Unknown:
of the internal and the public code. Let me first talk about the motivations, like why DataHub was even chosen to be an open source project. Right? Like, the answer to it really lies in the history of some of the good projects which happened at LinkedIn. Definitely, LinkedIn has always been a champion of open source culture. And some of the popular projects like Kafka, Pinot, Gobblin, and Samza have set good examples for us and drive the motivation around open source. Being a data driven company, at LinkedIn we definitely have learned and solved a lot of metadata related problems at scale. We enjoy sharing our learnings and software with other companies, so as an industry, we don't get into this, like, reinvention of the same solutions around the common unmet needs around data or metadata.
We were actually pleasantly surprised with a lot of good contributions already happening to Data Hub within the short span of open sourcing, which includes, like, Kubernetes support and, not to discount them, the great design conversations and modeling prototypes and proposals. The community is already kind of driving some key aspects of the road map and product, and helping both LinkedIn as well as everyone out there. About how we deal with the conflicts
[00:37:25] Unknown:
or how we go about the merges and all, like, maybe Mars, you can give a better technical answer to it. And I'm glad you bring this point up, Tobias. Once again, I will probably need an entire podcast to talk about this in detail, but it is by no means a joke. Right? I mean, it is not a simple task. And I know for, like, a new startup or a cloud native company, it probably seems like a very trivial thing. Right? I mean, open sourcing stuff is like, let me change my private GitHub repository to public and then it's open source. But the more established companies, you know, big tech companies, they tend to have, like, very, very deep integration with their in house solutions. And, you know, even when it comes to source code management and, you know, CI/CD and all that sort of stuff, they're not using, like, what everybody else is using out there in terms of what's available. So trying to maintain that is definitely no joke. I think we actually have a blog post on that specifically. It's called Open sourcing DataHub: LinkedIn's metadata search and discovery platform, also on LinkedIn's engineering blog, so you can find the details there. But in short, we actually had to develop tooling for us in order to be able to keep the 2 repositories in sync. And so it is definitely a lot of work upfront, but it actually gives us a huge benefit in terms of being able to open source at all. Because a lot of the time, you know, if you don't develop a project as open source first, trying to open source it at a later stage can be a very, very painful task, because you would have probably taken on a lot of internal dependencies, and then it's very hard, unless you are very, very conscious about every single design that you put out, to put everything behind a nice interface. But if you say, okay, we can only open source when everything is nice and clean, guess what? That would never happen, because nobody will say, let's pause all the development for, you know, a quarter or 2 quarters just to do open source. Right? That is not possible in reality. So the tooling that we developed, once again, tries to strike a balance, making sure that we can move very fast internally without, you know, being slowed down by open source, so to speak, because we have to do all these extra things and whatnot. But at the same time, we can also open source our code as quickly as we can. And on top of that, pave a pathway where we can refactor all of these non-interfaced things into nice interfaces over time. So the end goal is that whatever runs in, you know, open source should be 90% similar to whatever we run internally, and everything that's different can be easily loaded in a plug in sort of style. That is definitely the holy grail that we're moving towards. And developing this tooling allows us to move towards that, rather than saying that we have to get there first before we can even open source the first line of source code. And as far as the ways that you're seeing Data Hub being used inside and outside of LinkedIn, what are some of the most interesting or unexpected or innovative ways that you've seen it deployed and implemented?
[00:40:20] Unknown:
Definitely a lot of interesting lessons throughout this journey. First of all, I would like to call out that building Data Hub definitely came from the need for such a scalable metadata system, and also from the point around, like, how much synergy we can unlock across these various data systems with a connected metadata story. At the rate the data space is really growing, in terms of scale and complexity, I think we all need a well connected metadata graph and metadata insights now more than ever before. But the challenge is the new paradigm shift in how companies think about metadata. This is going to take some time and effort for all of these things to converge onto a common path, but we are really happy to see, like, a lot of data teams and companies already feeling the connection to this connected metadata story. It's really what we are building towards. And in terms of internal implementation,
[00:41:16] Unknown:
like adoption of Data Hub, like I mentioned, we were pleasantly surprised at how receptive people are in saying, hey, we do wanna build something from day 1 to sort of fit this connected story. Even though we don't know for sure that we're gonna use it, knowing that, you know, we built it the right way, so if there's a use case we wanna unlock, we can easily do that without having to rearchitect the whole thing. That, you know, appeals to a lot of people. So internally, we have a lot of AI infrastructure built out, sort of, you know, based on the Data Hub architecture. And all that rich metadata is sort of available at our
[00:41:58] Unknown:
fingertips if we wanna get access to it. We have too much on our plate right now in terms of use cases and whatnot, but I'm pretty sure later on, someone will start asking questions like, hey, can we leverage that metadata to do other things? And we'll be very glad to help them. Yep. We have it all already linked up. We just need to find the right query to query it, and then the right way to present the data back. Just to add, I think doing this type of metadata is more like a discipline, or like a culture, which we should drive to reap the benefits in the long term, rather than, like, the very short term use cases which we are seeing right now. So like I said, it's definitely gonna take a lot more time, but it's gonna benefit us all in the long term, definitely. What are the cases when Data Hub is the wrong choice and somebody would be better served just relying on maybe the Hive metastore or some of the built in metadata management for their database or data warehouse, or something along the lines of the Marquez project? So I think on the technical front, right, there are a couple of angles. One is, many times people confuse the data store and the metadata store. Just to clarify: this is purely a metadata store, and you still do need, like, data stores which you can contact to store your data and, like, operate on your data. Right? This works in connection with those data stores. Some of these data stores are also built for very specific purposes based on the nature of the data, like an OLAP store, or a key value based store, or a relational DB. Those things cannot be replaced. Right? This metadata has to work in tandem with that. And on the organizational front, this kind of a scaled metadata system, or a Data Hub kind of solution, really doesn't give a lot of return on investment if you don't have, like, more than 10 people in your organization who are dealing with data. So we have found that it definitely yields a lot more benefits beyond that scale, but we have also seen a lot of startups who are planning to grow their data ecosystem coming and doing an early adoption of Data Hub, with the intention of having an early engagement or, like, building that metadata discipline early into their data culture.
[00:44:06] Unknown:
In other words, if you have a guy or a girl in the company that you can just go to and say, hey, tell me everything about all the data assets that you have in the company and how they're related.
[00:44:17] Unknown:
If that person can answer that easily, then you probably don't need a catalog or a metadata system at that point. As you continue to build out Data Hub and support its use within LinkedIn and engage with the open source community, what do you have planned for the future of the project, or particular areas of contribution
[00:44:36] Unknown:
or feedback that you're looking for? Yeah. So, like, as I mentioned earlier, it's only been 6 months since we open sourced Data Hub, but we are really excited to see a lot of interest and early adoption happening through our open source community. The community is definitely playing a good role in driving the road map and adoption. Right? We conduct, like, monthly town halls now, and there is good participation and, like, very interesting questions being asked, and, like, engagement happening on Slack channels or in individual meeting sessions on things. With great support from LinkedIn's leadership and, like, collaboration with our open source partners, we are definitely working towards true democratization of data management. So we have shared our, like, road map and stuff on the open source project as well, which speaks to the same things that I've mentioned here. Are there any other aspects of the work that you're doing on Data Hub or the overall space of metadata management and data catalogs that we didn't discuss yet that you'd like to cover before we close out the show? So I think I wanna
[00:45:33] Unknown:
reiterate that Data Hub is something that we've worked on for, you know, the good part of the last 3 and a half years. It's kind of the combination of all the learnings that we've had at LinkedIn. And I think the system is the way it is today for a reason. A lot of the design choices that we made, we didn't make lightly. We made them because, you know, we needed to make those choices. So on the surface, when someone looks at this, they might say, hey, it looks awfully complicated. Right? I mean, why do I need something like this when I can just quickly write a couple of Python scripts and then, you know, create a Flask app and just run with it? Right? Why do I have to go through all this trouble of, you know, being disciplined about getting the metadata in and then trying to, you know, create a graph out of it and all that sort of thing? And I think the answer is very simple. We've been down there, we've been through that, and we got burned pretty badly as a result of it. And metadata is one of those strange things where it is cross discipline, and it's cross org, cross team by nature. Because without those integrations, without that ingestion, the metadata itself doesn't have any value.
It's when you are able to start linking it together, and keep growing it in all sorts of different dimensions, in scale, complexity, richness, and all that, that you are able to really truly unlock the value of metadata. And we do believe that Data Hub has that capability. And, you know, maybe a year down the track, you ask us to come back and present our v4. Who knows? But, hopefully, for now, we're pretty sure that the architecture is battle proven, and it's definitely good for the foreseeable future. So that's just kind of the, sort of the closing,
[00:47:16] Unknown:
thoughts I have on DataHub. Alright. Well, for anybody who wants to get in touch with you or follow along with the work that you're doing or contribute to the work that you're doing on DataHub, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Yeah. Thanks for asking this. Like, I think the answer to this lies in the true nature of metadata, which is really
[00:47:43] Unknown:
highly connected, highly connected with each other. Right? So most of the tools and technologies out there really focus more on very specific use cases of metadata, like either cataloging, or compliance, or just lineage, or access control, etcetera. However, we believe that these are all different verticals built on top of a connected metadata graph, and they cannot work in silos. Our focus is really to build world class metadata infrastructure that surfaces the metadata graph to support any of these use cases. You can view the DataHub web app, what we have on the surface, as just the search and discovery use case on top of this graph, and you can see a lot more applications like this which can be built upon the fundamental metadata graph which we are building underneath. That's how I see the major differentiator, or the biggest gap, with respect to
[00:48:40] Unknown:
what we have outside and what we are working towards. I can also add a little bit more after this. So one thing that we did observe in the industry is that you have sort of two groups of metadata, quote, unquote, services or applications. Right? One is very much geared towards operational. So, like, Hive Metastore would be the typical example, where it's all about, you know, operational metadata, and it's gonna be used extensively in sort of a programmatic way in production. On the other side, there are also groups of products that focus purely on search and discovery. Right? Saying that, hey, the treatment is mostly read only at that point, and then the consumption is all human based. Our point here is that there shouldn't be two different systems. You should have one system that's able to support both scenarios, which means the system has to be super scalable and reliable, because it has to support operational things. But then it should also be super flexible so that you can build all sorts of fancy apps on top. Right? So we do believe that, you know, Data Hub is one of the special
[00:49:46] Unknown:
projects out there where we've tried to target both, not just, you know, one or the other. Well, thank you both very much for taking the time today to join me and discuss the work that you've been doing with Data Hub. It's definitely a very interesting project and one that I've been keeping an eye on since you first released it. And I look forward to seeing where you take it in the future, and I definitely plan on taking a closer look for my own purposes. So thank you both for all the time and effort you put into that, and I hope you enjoy the rest of your day. Thanks, Tobias. Thanks, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and
[00:50:52] Unknown:
coworkers.
Introduction and Career Advice for Data Engineers
Interview with Pardhu Gunnam and Mars Lan
Getting Started with Data Management
Overview and Backstory of DataHub
Challenges in Data Discovery and Metadata Management
Target Audience and Use Cases for DataHub
Architecture and Integration of DataHub
Modeling and Versioning Metadata
Deployment and Maintenance of DataHub
Open Source Contributions and Community Engagement
Future Plans and Roadmap for DataHub