Summary
A significant source of friction and wasted effort in building and integrating data management systems is the fragmentation of metadata across various tools. After experiencing the impacts of fragmented metadata and previous attempts at building a solution, Suresh Srinivas and Sriharsha Chintalapani created the OpenMetadata project. In this episode they share the lessons that they have learned through their previous attempts and the positive impact that a unified metadata layer had during their time at Uber. They also explain how the OpenMetadata project is aiming to be a common standard for defining and storing metadata for every use case in data platforms and the ways that they are architecting the reference implementation to simplify its adoption. This is an ambitious and exciting project, so listen and try it out today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Your host is Tobias Macey and today I’m interviewing Sriharsha Chintalapani and Suresh Srinivas about OpenMetadata, an open standard for metadata and a reference implementation for a central metadata store
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what the OpenMetadata project is and the story behind it?
- What are the goals of the project?
- What are the common challenges faced by engineers and data practitioners in organizing the metadata for their systems?
- What are the capabilities that a centralized and holistic view of a platform’s metadata can enable?
- How would you characterize the current state and progress on the open source initiative around OpenMetadata?
- How does OpenMetadata compare to the OpenLineage project and other similar systems?
- What opportunities do you see for collaborating with or learning from their efforts?
- What are the schema elements that you have identified as critical to a holistic view of an organization’s metadata?
- For an organization with an existing data platform, what is the role that OpenMetadata plays, and what are the points of integration across the different components?
- Can you describe the implementation of the OpenMetadata architecture?
- What are the user experience and operational characteristics that you are trying to optimize for as you iterate on the project?
- What are the challenges that you face in balancing the generality and specificity of the core schemas for metadata objects?
- There are a large and growing number of businesses that create systems on top of an organization's metadata in the form of catalogs, observability, governance, data quality, etc. What do you see as the role of the OpenMetadata project across that ecosystem of products?
- How has your perspective on the domain of metadata management and the associated challenges changed or evolved as you have been working on this project?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on OpenMetadata?
- When is OpenMetadata the wrong choice?
- What do you have planned for the future of OpenMetadata?
Contact Info
- Suresh
- @suresh_m_s on Twitter
- sureshms on GitHub
- Sriharsha
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- OpenMetadata
- Apache Storm
- Apache Kafka
- Hortonworks
- Apache Atlas
- OpenMetadata Sandbox
- OpenLineage
- Egeria
- JSON Schema
- Amundsen
- DataHub
- JanusGraph
- Titan Graph Database
- HBase
- Jetty
- DropWizard
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to data engineering podcast.com/linode today. That's l I n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Sriharsha Chintalapani and Suresh Srinivas about open metadata, an open standard for metadata and a reference implementation for a central metadata store. So, Sriharsha, can you start by introducing yourself? Hi, Tobias. Thanks for inviting us to the Data Engineering Podcast.
[00:01:15] Unknown:
And I'm Sriharsha. I've been in data for the past 10 years, mainly on the data infrastructure side. You know, I'm a streaming engineer; I started out at Mozilla, then Hortonworks, and worked at Uber as well, leading a bunch of data teams there. I'm also an open source committer on Apache Storm and Apache Kafka and a bunch of other projects. So that's how my data career started. And, Suresh, how about yourself? Hi, Tobias. Thank you for
[00:01:41] Unknown:
inviting us to the podcast, and hello, listeners. My name is Suresh Srinivas. I've been in the data space for a long time. Started out, you know, in the data space at Yahoo, where I was on the initial team that built Hadoop. And then Hadoop started getting traction beyond web companies. You know, as data became big data, a lot of enterprises started looking at data as a significant tool. Hadoop started getting traction; you know, it paved the way for big data. Along the way, we started Hortonworks. I was 1 of the cofounders of Hortonworks. We started in 2011. And then in 2017 or so, having built a lot of complex platforms, I was actually bothered by the fact that, even though people had all these scalable systems, they were finding it really hard to realize value from their data. So it was just amazing to see how hard it is to get data right.
So I joined Uber, the other side of data, right, having built platforms. I switched over to 1 of the biggest big data platform consumers, Uber, to look at why it is hard to get the data right and, you know, what are the things that are required to get the data right. So since 2017 or so, my focus has been data experience, you know, improving data tools, making it easy to consume, easy to get the data right. You know, now OpenMetadata is a continuation of that theme.
[00:03:11] Unknown:
That brings us to what you're both working on now, which is the open metadata project. I'm wondering if you can just give a bit of context about what it is that you're building there, because I know it's both an open standard that you're looking to build and promote as well as a reference implementation of that standard. So I'm wondering if you can just talk through some of the sort of story behind the project and the main goals that you have for it. In 2019
[00:03:33] Unknown:
or so at Uber. You know, Uber is a data driven company. Everything at Uber is driven by data, including, you know, matching riders and drivers, and demand and supply. All of it is driven by data. We were finding some basic problems with data. Some of the key metrics, right, number of trips completed, things like that: depending on who you ask, you would get different answers. And we had tried to solve this problem through piecemeal approaches. Right? So try to get the data quality higher in data warehouses. Maybe, you know, try to improve how we are reporting things by adding more tests to reporting.
All of this had taken a toll on data productivity. So we were using a lot of data scientists to make sense of data, and a lot of their time was being spent on unproductive tasks. Right? Running behind data quality issues, you know, missing data issues, things like that. So we decided that we needed a different, holistic approach to solve the data problems. And that included starting from the source of the data, that is where, you know, mobile phones and apps are producing events, online services that are storing the online transactional data, and then how the data is flowing from those systems into the offline systems, how it is getting curated, modeled, you know, and then finally, how people are consuming this data to build different data assets, like reports, dashboards, metrics, machine learning features.
So as we were looking into this, you know, it was very clear for us. Right? If the data is very poor at the source and it is polluted upstream, you cannot actually clean it up downstream. You cannot create quality downstream. It has to start at the source. So we took an end to end approach. We started a project that included everyone, all the way from mobile developers and online services developers to people who are in the big data world ingesting their data, creating offline models, all the way to data scientists. Now through that, there were a couple of things that became very clear. Right? So you cannot actually address the data issues through one-time efforts. Right? So, culturally, the data culture of a company needs to change and treat data as important.
So the band aid approaches were not working. The second thing is there were too many tools. Right? You've seen all these tools, you know, that keep coming up every day. There were too many tools in the organization, but there was no way to get a holistic picture of data. So you would go from for discovery, you go to a catalog kind of a thing, and then you jump to a data quality and then a query and then a metric system. All of the systems were different. Right? So people had to jump from 1 tool to the other. So tool disconnect was a problem. Then data is a team game. Right? Somebody is producing it. Somebody is curating it. Somebody is modeling it. Somebody is consuming it. The team needs to come together and work together to have successful data outcomes.
Without that, you know, people make assumptions and get data wrong or may not end up using the data. So the people disconnect was another problem that we had, where people didn't know each other. Right? There were no names attached to their datasets. As we started looking at it, there were a lot of things that we were doing manually, taking people who have a lot of expertise in data science or, you know, business intelligence and using their time to do some manual cleanup and a lot of toil. So as we started looking at it, at the core of it, it emerged that discovery is a problem. Right? And discovery not as in the traditional sense of a data catalog where it is focused on, say, databases and tables.
Data is used for creating a lot of data assets. Right? And all these data assets should be discoverable. An example is there is a metric that is the number of trips taken, right, along certain dimensions. There were many such metrics, and each metric had its own definition. Right? There are different dashboards trying to answer the same thing because they are using different datasets. So discovering all your data assets in a single place was 1 problem that emerged. The second problem was the people disconnect. Right? When you're using a data asset, you don't know who produced it, why they're producing it in a certain way. You have questions. You don't know, you know, whom to ask. And so consumers of this data did not know who is producing it, why they are producing it, what guarantees come with it. There was no ownership at all for data assets. And then producers of the data did not know that their data was being used for some critical business purposes. And so they would randomly make a certain change, and then, you know, for a couple of weeks, tickets would go from 1 team to the other. And then they finally realized that their data is being used for some super important, you know, business purpose.
And so they did not have visibility into who is using it, why they are using it, for what critical business purposes. So that was a, you know, clear disconnect. And then finally, data must have context. Right? Without context, data is just bits and bytes. And lack of this context made many people not use the data that we had, and many people used it incorrectly because they were making wrong assumptions. And then finally, right, you hear from data mesh about data as a product. Right? A lot of this data was produced without considering how it is being consumed. And as a result, data was hard to use.
The schema was not, you know, well designed. Because of that, some of our ETL pipelines had 5,000 lines of code implementing a state machine to rebuild the picture of reality instead of just capturing it right at the source. So these were the problems that came out, and then, you know, tooling. Right? People need better tools and automation so that they can focus on more important stuff. So as we started looking at it, you know, cutting a long story short (you know, the story is already long), it emerged that we were not managing metadata well. And managing the metadata, where you centralize all the metadata about all the data assets, about user activity, about user inputs and feedback, all of that became super important to solve this problem. So we built a metadata system that centralized all the metadata within Uber.
And then as we were looking at it, it started becoming the center of the data universe at Uber. And we started building a lot of automation, a lot of tooling that improved consuming the metadata. A lot of tools became a lot simpler to build because there was a centralized metadata store. And Harsha and I were thinking, you know, this is a problem that every company has, and perhaps we should build a centralized metadata store where metadata is shareable through very well designed metadata APIs, and that can actually transform the data landscape. Right? So that is the reason why we decided that we would come and build it outside as open source, because we have a lot of open source background.
[00:10:50] Unknown:
Open source is the right solution here. The OpenMetadata project is definitely very ambitious in that it is trying to be this kind of universal approach to metadata. It's trying to be this sort of central store to integrate metadata across all of the different components of the data ecosystem, which is, to some estimation, an intractable problem because every different system has some concept of metadata and the ownership of that metadata. And so being able to expose that and share it and unlock it and unify the schema of it is a very complex and multifaceted problem. And I'm wondering if you can talk through some of the common challenges that engineers face in trying to collect and organize all of the metadata across their systems, and some of the ways that the existing state of the tooling in the industry and the various generational shifts of those tools contribute to some of those challenges and blockades in being able to actually unlock the capabilities of that metadata?
[00:11:52] Unknown:
We are solving a hard problem. You know, that is what is fun. Right? But solving this problem can transform the data world. So let me give you some thoughts on where we are today. It's not like the metadata is not collected in a centralized manner. To a certain extent, it's being done by catalogs. The problem is the metadata models: a lot of times there are no metadata models. Right? It's key value pairs, and, you know, there's no clear definition of what shape of metadata you can get out of the system. The second thing is lack of APIs. So if you look at the picture today, data catalogs have most of the metadata in an organization today. But it is not shareable. It is not modeled well. And there are no APIs, and there is no extensibility with most of the data catalogs.
Because of that, what ends up happening is a tool that is building, let's say, data quality has to build a separate metadata subsystem, right, within its tool, because the metadata that is already available in another system is not shareable. So they end up integrating with the other tools, collecting all the metadata, and storing their own copy of metadata just so that they can add the special metadata that the tool is focusing on, which is, let's say, tests and then test results. Right? And then maybe possibly summarize it into a quality score. So imagine if the centralized metadata store was shareable and accessible through great APIs: a quality tool could have just focused on providing an interface for building tests. Those tests can be shared in a central metadata store instead of having your own system for storing them. The quality tool can run those tests, produce the test results, and then write them back to the central metadata store. With that, the quality tool can now focus on what special sauce it is bringing instead of trying to build a metadata system. Right? So this is true with data observability, data management, data governance.
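To make that workflow concrete, here is a rough sketch of how a data quality tool might lean on a shared metadata service instead of carrying its own metadata copy. The base URL, endpoint paths, auth scheme, and payload shapes below are assumptions for illustration, not the actual OpenMetadata API.

```python
# Hypothetical sketch: a data quality tool using a shared metadata
# service instead of maintaining its own copy of metadata.
# Endpoint paths, payload shapes, and the token are illustrative only.
import requests

METADATA_API = "http://localhost:8585/api/v1"  # assumed base URL
HEADERS = {"Authorization": "Bearer <token>"}  # assumed auth scheme

def register_test(table_name: str, test_definition: dict) -> dict:
    """Store a test definition against a table in the central store."""
    resp = requests.post(
        f"{METADATA_API}/tables/{table_name}/tests",
        json=test_definition,
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()

def publish_results(table_name: str, test_id: str, results: dict) -> None:
    """Write test results back so every other tool can see them."""
    resp = requests.put(
        f"{METADATA_API}/tables/{table_name}/tests/{test_id}/results",
        json=results,
        headers=HEADERS,
    )
    resp.raise_for_status()
```

The point of the sketch is the division of labor: the quality tool supplies the tests and the scoring logic, while storage and sharing of that metadata stay in the central store.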
Because the central metadata system is not shareable and doesn't have great APIs, everybody has to build their own copy of the metadata system. Now the outcome of this is there are multiple copies of metadata within an organization. Right? So that metadata is fragmented. Some metadata is available in tool X, some available in tool Y. There is no central, holistic picture of data in an organization. So there is fragmentation. Right? There is duplication. With duplication comes inconsistency. Right? You added a tag here; the tag is missing there. You have a description here; the description is missing there. Because of that, you have to jump from tool to tool, and that causes a lot of user frustration.
Right? So I believe that many tools have integrations built which can be avoided if there is a central metadata repository that focuses on building integrations: you configure that tool with these systems, centralize your metadata, make it shareable, make it extensible. Many tools can make use of the centralized metadata. And then finally, right, if you look at what is going on in the data landscape, especially because of great IPOs in the space and a lot of investment in the space, every day many different siloed, narrow tools are cropping up. Right? And each of these tools has very small functionality, maybe observability or quality or something like that.
And they end up being full fledged subsystems, when, if they could use the centralized metadata store, they would be simple workflows. And imagine, you know, installing a system like this, integrating it with the rest of your data ecosystem, operationalizing it. Right? And then not to mention the cost, right, of, you know, adding this new service into your stack. All of that can be eliminated, and many of these tools can become simple workflows. And so I think it's not a challenging problem to centralize the metadata. What is gonna be challenging is standardizing, agreeing upon a standard.
And this takes time, and this takes collaboration with many other communities. And then, most importantly, it takes the project being successful and adopted. Right? As we get more adoption and people realize the value of the metadata schemas and metadata APIs, I believe we will start shifting towards adopting these schemas as metadata standards.
[00:16:15] Unknown:
Yeah. To add to what Suresh said, we've seen this problem multiple times at Uber when we entered the data infra space. You know, there are siloed tools kind of doing overlapping work and kind of storing overlapping metadata for each of those use cases. When we started DataNG, that is 1 of the projects where we said, like, hey, centralize the metadata, build APIs, and let users as well as tools come together. That is the important part: when you have users actually consuming it, you get enriched metadata, but without that, the tools go their own way and everyone has to copy the metadata again and again. So we made a conscious decision there to kind of build this single metadata store, build data quality as 1 of the applications, lineage as another application, and collaboration aspects where a user can ask questions. A user can raise questions or issues on the metadata. So that proved to be a huge success at Uber.
And it also proved to be a stepping stone to the next level, onto the platforms themselves. For example, 1 of the things we did at Uber was to identify the important data assets. Right? So that will give, on the user side: hey, what are my important data assets out of, let's say, you know, hundreds of thousands of data assets at Uber? But also on the platform side, what it enabled us to do is, you know, allocate more resources to important data assets. Make sure those are the ones that are running fine. Before that, you know, everything got the same priority. If you are a user running a query on a random table, it would be getting the same priority, the same resources, as, you know, the trips table, which is the highest priority at Uber.
By doing this centralized metadata, by categorizing it well, we dispersed this into the platforms as well as to users. So it proved to be really successful at Uber.
[00:17:58] Unknown:
Another interesting element of this is, as you said, if you have this global metadata store that is shared across the entire infrastructure and all the tools, a lot of the seemingly complex efforts of building a data quality tool or building a lineage tool become much simpler because you don't have to rebuild that metadata store over and over again. And I'm wondering what you see as the kind of vision of and the opportunity for open metadata to be this kind of out of the box metadata store for other tools to be able to build on top of and just add their own unique layer, so that maybe, for a standalone tool, you can say: I'm going to build a data quality platform. I take open metadata off the shelf. I add in my specific interpretation of the metadata I'm collecting as it pertains to data quality.
But then if I want to deploy this data quality tool into a broader data platform, I swap out my internally vendored version of open metadata, and I instead connect up to the globally deployed open metadata system. Everything else operates out of the box. And so rather than each company having to build that metadata layer over and over again, they can use this reference implementation, which can also be swapped out for a company that already has 1 deployed. But what are the sort of limitations of that as the capability of saying, this is the metadata store for x? Because, obviously, you're not going to use open metadata as the way to describe the table schemas in your Postgres database. And I'm wondering if you can maybe break down what you see as kind of the boundary conditions for metadata as a native component to a piece of the infrastructure versus metadata as this shared resource across the platform?
[00:19:43] Unknown:
So what we are focused on is building innovation around metadata. And we started looking at how it is currently solved. This is the 3rd generation of metadata system that I'm building, having, you know, worked on Apache Atlas, and then the second system that we built at Uber, called uMetadata. This is the 3rd system that we are building. What I feel is metadata can enable a lot of innovations. Right? Innovations in how people experience the data, what kind of tooling and automation can be built around data. It can actually create the next generation of data tools.
And when we were thinking about that innovation, we saw that many metadata systems are being built in the open source today. Many vendor solutions exist. And there are many in house metadata systems that are being built. And what we feel is metadata should be a solved problem by now. And there should be a well modeled, API first, schema first approach to metadata that becomes available as an open source project that anybody can bundle and take and build their own innovation around. Right? If we do that, many systems can now focus on their innovation instead of building a metadata subsystem. So that's the main goal, and that's the reason why we are open sourcing it. And we would like to innovate around open metadata ourselves, build some delightful experiences, automation, collaboration, workflows.
But anybody can take open metadata and then use it. So, you know, that's the reason why it is open source.
[00:21:19] Unknown:
As far as that open source project, I know that it's a fairly recent undertaking that you've started building out. I think I first heard about it a few months ago. And I'm wondering if you can just talk through and characterize the current state of that open source project and the community that you're starting to grow around it and some of the progress that you've been able to make over the few months that you've been working on it. Since we open sourced open metadata, you know, our community has grown incredibly well.
[00:21:45] Unknown:
So some of the, like, high level numbers are, like, you know, we have hundreds of users joining our Slack community, trying out open metadata, giving us feedback, you know, participating in discussions on the UI and tooling as well. And we set up a sandbox so that, you know, anyone can actually visit the sandbox and easily play around and understand the APIs and the UI that we are building around it as well. So since we started the sandbox, we have had thousands of users coming in and kind of playing around with the sandbox itself. And more importantly, our contributor numbers are growing quite a bit, and we have around 30 plus outside contributors coming in and, you know, sending patches, sending features, all that stuff, on GitHub itself.
So 1 of the goals we set out with open source is to focus on delivering value to the users and ship features as quickly as possible. With that in mind, we said, like, you know, at the beginning itself: hey, we're gonna do monthly releases, and we're gonna ship, you know, substantial features. And the value of the community is that, you know, what might have taken us, like, months to build, because of our external contributors coming in and shipping these features, we're able to kind of ship substantial features in each release. So far, we are 3 releases into open source. We are coming up on the next 1. Again, we believe, like, you know, the pace of the community has been incredible, and we are really thankful for the community participation here.
And just to quote a number, like, you know, we are at around 250 commits per month. So there's quite a bit of, you know, additions and features coming up. So yeah. Yeah. 1 thing that I wanna call out is
[00:23:14] Unknown:
a lot of people equate the time for which a project has been open source with the maturity of the project. Right? But maturity doesn't come by time itself. Right? It also depends on what experiences you are bringing to the table. Right? In building a system, what learnings are you applying? The past decade has been the decade of big data, and that space has transformed tremendously. Right? And there's a lot of learning if you have looked at it. Those learnings can be applied in building a system. And so this is the third iteration of the metadata system that we have built, which is to say we have made our share of mistakes. And hopefully, we are going to make a new set of mistakes instead of, you know, the same old ones, right, from the past 2 iterations.
And so I would say the maturity of the project is also dependent on what you bring to the table, what learnings, right, and how you have employed those learnings.
[00:24:17] Unknown:
An interesting element of what you're building with open metadata and sort of the current state of the data ecosystem and community is that there has been a very broad sort of willingness to explore these open APIs and open integration points. And 1 of the other manifestations of that is the OpenLineage project, which is focused on metadata as it pertains to lineage of the various data pipelines that we're building across our infrastructures. I'm wondering if you can give your perspective on some of the ways that the OpenMetadata project compares to the goals of OpenLineage and maybe some ways that the 2 can sort of collaborate with each other or learn from each other, and any of the other efforts that you've seen that are similar to OpenLineage or the work you're doing with OpenMetadata? I know that Egeria is another effort along that line. With lineage. Right? Lineage is
[00:25:12] Unknown:
a small but important part of metadata. But the metadata universe itself is a much bigger thing compared to lineage. Specific to OpenLineage, I think the project has the right goals, similar to what we are saying about open metadata: why keep building new metadata systems? They want to actually solve the problem of lineage integration as it pertains to getting details of runs and jobs from various workflow systems. So that is what they are focused on. However, when you look at a metadata system, right, it needs to integrate with a lot of data sources anyway, not just workflow systems.
The second thing is it needs to capture as much metadata as possible, not just run events and jobs and stuff like that. So from that perspective, because we already have integrations with a lot of these tools to capture not just lineage but other information as well, using OpenLineage for just that purpose is not that significant. Right? However, we've been talking to the OpenLineage community, at least over Twitter, on possibly standardizing schemas related to at least runs, jobs, and events. And then I think currently OpenLineage does not have a lineage graph definition, which we have. Maybe we can collaborate on adopting a lineage graph definition as well. Now coming to the other efforts as well. Right? Not just lineage, but metadata. Right?
The challenge that I've seen is, and this is sort of like a moment where, you know, I felt I was not thinking through clearly: we are data people. Right? And in order to use data really well, schemas are required. Not only that, if you want to use it efficiently, well designed, well modeled schemas are required. Right? Otherwise, you won't be able to use it efficiently. You might even make wrong assumptions and get it wrong. Now as data people, when, you know, we have built metadata systems in the past, we ourselves did not consider schemas as important. Right? So we have just put any shape to the data. Right? There's a property here, there's an object there, and then key value pairs and things like that.
What makes it hard then is people don't know the schema. The schema is not modeled correctly. There is no strong typing. So you have to write code like: if this field name is this, do this; and you might get any value back. And so the realization is that schema is super important. We knew schema is super important for data. But metadata is data about data. Right? And so schemas for metadata are also super important, and strongly typed schemas are super important. Right? In order to make metadata shareable and reusable across tools, it cannot be key value pairs, and it has to have, you know, strong types and, you know, a proper shape and all of that. And so: a schema first approach. Right? Which is the reason why, you know, we ended up doing open metadata.
It paid off; it gave us a lot of benefits at Uber. Right? Just to give you a small story behind why schema first and metadata vocabularies are important. When we were looking at data problems at Uber, Uber has lots of microservices defining their own events, schemas, and things like that. We saw close to a hundred definitions of what a location is, for a company like Uber where location is a central vocabulary word. Right? Location, point, currency, the exchange rate, things like that. Right? Core concepts had their own schemas. Sometimes somebody called a schema location when they meant something else.
So you had confusion because the same definitions were not used. It was inconsistent. In some cases, it was confusing because it was a totally different concept. So we did some work at Uber where we took the core vocabulary and modeled it in a single place, once, for all the schemas to consume, which we called a data standardization effort. Through that, we realized that even metadata requires the same standardization. Right? When you call something an interval, it should mean interval across all different metadata entities. Ownership must mean the same thing.
Right? You know, when you say tag, tags must mean the same thing. If you don't have that, definitions change from system to system. Without the right vocabulary, if people are saying the same word meaning different things, there is no collaboration or communication possible. And all it results in is confusion
[00:29:46] Unknown:
and a lot of toil. To that point of the schema being very important and properly typed and very explicit about what is meant semantically and syntactically by those different elements within the schema, I'm wondering what you have seen as some of the useful patterns for naming conventions. Because I know that particularly in sort of the early days of computing, space was limited, so we used very short variable names and very short, you know, names of binaries, because, you know, we didn't have the ability to easily correct our typing in, you know, the early days of UNIX. And I'm wondering what you see as maybe the opportunity for increasing the verbosity of the naming that we're using in these schemas so that they're much clearer and there is less opportunity for confusion.
And in the metadata space specifically, what do you see as the sort of core elements that are required for being able to build a well designed and usable schema and API for collecting and organizing metadata?
[00:30:49] Unknown:
I don't know if verbosity means clarity. So, you know, there is a balance. Right? You know, a lot of the schemas get converted into code, and then readability becomes a problem. So descriptive, long names, but just long enough. Right? Not too long. You want to capture only certain concerns. So from that perspective, right, if you look at what a name is, every name captures maybe a paragraph of concept. Right? So the name is a short form of a lot of knowledge you have accumulated around it. You are capturing it succinctly, right, with the name. And so the name must have a clear definition, description, things like that. Right? Sort of like, you know, I'm really surprised whenever I look at some very familiar words in the dictionary and I look at the precise definition.
I'm just always, you know, blown away, right, by how clear the definitions are. Similar things are required, right, for schema names and type names and things like that. So there must be a succinct type name, not something that is unreadable and, you know, has a lot of acronyms and, you know, user-specific, hard-to-understand short forms and abbreviations and all of that. But at the same time, it requires a clear description. Right? Without the description being there, people won't be able to understand it. Right? So let's look at a few metadata systems that have been designed in the past. So let's take the example of Apache Atlas.
What Apache Atlas ended up doing: it did a great job for its time. Right? It built extensible metadata schemas, and it provided 2 kinds of APIs. 1 is metadata modeling APIs, and then metadata consumption APIs. And the system came out with very few types. Right? Maybe a Hive table or something like that; a table and, you know, a few types were there. Most of these types were basic types. Now you want to define, let's say, a Presto table. Right? I'm just taking it as an example. You have to define a new entity called Presto table. You can copy some of the building blocks from the Hive table, but you create a new Presto table. So what ended up happening is every organization, for most of their metadata needs, had to use the metadata modeling APIs to model the entities.
When you model the entities, a few things happen. Right? 1 is different people bring different expertise in modeling. And then the second thing is user X models a certain entity, and he understands what the field names mean and all of those things, but then user Y cannot understand it because that definition is not captured anywhere. Right? So finally, the biggest problem was every organization will model the entities differently, and then 1 organization's metadata looks different from another organization's metadata. So a tool working with 1 organization cannot work with another organization. That is the reason why we say the system must have all the entities that are required for a metadata system strongly typed and modeled, right, off the shelf. Right? They should be modeled by people who have modeling expertise, and all the entities should be available.
And the second thing is all these entities must have core components, core attributes, and relationships defined in the system already. Right? Now if it is required, there must be extension points in these entities where an organization can extend them. But the core of the attributes and the entities must already be defined in the system. That's the approach that we have taken. And then finally, right, in terms of schema modeling, there are different schema modeling languages. At Uber, when we were modeling, we had used YAML, from which we were generating, you know, other schemas, let's say protobuf, things like that. So you model it once and you generate, from the same modeling effort, the schemas required in other language bindings.
So open metadata takes a schema first approach. Right? In a lot of systems, you build the implementation and you expose whatever the implementation details are as the schema, right, or as the API. We start with schemas. And what I was talking about previously was that we had a schema neutral language at Uber to generate schemas for, you know, other schema languages. In the same way, we ended up choosing JSON Schema; huge kudos to the JSON Schema community. JSON schemas are a powerful way to model all your schemas. And you can not only model the schemas, you can reuse types, you can build relationships, you can reuse other JSON schemas that you have built.
Finally, JSON Schema has super good tooling support. Right? So the way we do things is we model entities and types in JSON Schema, and they are reusable across all the different entities. And then from this JSON Schema, we generate Java code and Python code. Even our UI code is generated from JSON Schema. And then our documentation, how we store the data, everything is driven by JSON Schema. And the power of this is a lot of boilerplate coding just goes away. Not only that, if you make a schema change, every subsystem within open metadata automatically gets updated.
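As a minimal illustration of the schema-first, shared-vocabulary idea being described, the sketch below defines one reusable type, references it from an entity schema, and validates instances with the `jsonschema` Python library. The type and field names are invented for this example; they are not the actual OpenMetadata schema definitions.

```python
# Minimal illustration of schema-first metadata with a shared type.
# Names here are invented for the example, not OpenMetadata's schemas.
from jsonschema import validate, ValidationError

# A shared vocabulary type, defined once and reused everywhere, so
# "ownership" means the same thing across every entity that uses it.
SHARED_DEFS = {
    "entityReference": {
        "type": "object",
        "properties": {
            "id": {"type": "string"},
            "type": {"type": "string"},
        },
        "required": ["id", "type"],
    }
}

TABLE_SCHEMA = {
    "$defs": SHARED_DEFS,
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        # Reuse the shared shape via $ref instead of redefining it.
        "owner": {"$ref": "#/$defs/entityReference"},
    },
    "required": ["name"],
}

good = {"name": "trips", "owner": {"id": "u1", "type": "user"}}
bad = {"name": "trips", "owner": "alice"}  # wrong shape: not a reference

validate(good, TABLE_SCHEMA)  # passes silently
try:
    validate(bad, TABLE_SCHEMA)
except ValidationError as err:
    print("caught malformed metadata:", err.message)
```

Because the `owner` shape lives in one place, every entity that references it agrees on what ownership looks like, which is the vocabulary consistency discussed above.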
So JSON schema has been an amazing
[00:36:15] Unknown:
building block for us, and that choice has worked out greatly for us. Yeah. This also works into our, you know, vision of, like, building the standard. Right? So when you build based on JSON Schema, it's easy for us to build language bindings and others. So we can build Python, we can build Go, Java, whatnot. So what that drives is, like, the integration play. Now you have a centralized metastore. You can actually build services just by embedding a client there. Whether you're reading data quality or you're reading some tags around the metadata itself, you can easily embed all of this data in other services.
So that's a great investment we actually made into OpenMetadata there. Talking about the integration points, I'm wondering if you can discuss
[00:36:58] Unknown:
the process for an organization that already has an existing data platform. They have a number of different point to point integrations across their various systems. Each of those has some level of control over their own metadata. Maybe they're using a data catalog or a system like Amundsen or DataHub to try to build out this metadata graph. What do you see as the unique benefits that open metadata provides in that environment, and the process to be able to actually start connecting the entirety of their platform into the open metadata system to start realizing the benefits of this unified view of all of the context of their data across those different boundaries, both technical and organizational?
[00:37:38] Unknown:
Yeah. Open metadata can centralize all the metadata in a single place. Right? Now our vision for open metadata is not to be a data catalog. Right? It is more than that. It has to be a collaboration point for all the users. Right? Once you collect all the data context, then users can come around it and, within the metadata system, collaborate with each other. That way, a lot of user generated, you know, discussions, knowledge, and all of those things can also be captured as metadata. So we want to go beyond a data catalog. A data catalog is a simple application of open metadata. But then people collaboration is what we are focusing on. That can reduce the friction of the multiple tools that are there today within an organization.
The second thing is the data context that we have provided is not just for people. You can also use it for building tools around it. Right? Now specifically to your question, I think the Amundsen and DataHub folks are centralizing the metadata quite well. But I think the differences that I called out are the metadata APIs, right, the metadata schemas. We believe that will make it easy for you to build things around metadata using open metadata. And then finally, right, they can also coexist. Different tools can bring different functionality. So these tools can coexist. We can integrate with each other and, you know, maybe capture the metadata in another catalog and then centralize it as well, because there's a lot of metadata that is already generated in some of those systems. Right? That needs to be brought in as well. So we'll build those integrations.
[00:39:18] Unknown:
Yeah. So a couple of points I would like to make here: what we're building is the foundation layer of the platform itself. So discovery, lineage, quality become applications on top of it. So if you look at, like, what Amundsen is doing, a cataloging and discovery experience could be built on top of the metadata. The benefit of the platform itself is, again, as we said earlier, like, you know, it doesn't need to be isolated. Right? You build this foundation layer. You build discovery and other things. Other experiences, quality, you know, can come on top of that. So that actually is a foundational play for organizations to use.
[00:39:57] Unknown:
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at data engineering podcast.com/hightouch. In terms of the technical foundation of this platform of open metadata and the reference implementation that you're building, I'm wondering if you can discuss how you're actually approaching that construction. I know you mentioned that you've invested very heavily in JSON Schema as the core sort of building block for it. I'm wondering if you can talk about the overall architecture and maybe some of the ways that you are working to delineate the schema specification from the reference implementation and sort of the overall goals of that reference implementation.
If it is intended to be kind of the canonical point of if you want to use open metadata, this is the tool that you use, or this is an example, but these are the core APIs and this is the more important piece of it. And just maybe talk about the technical and architectural design and the sort of schema and specification design and how those 2 are playing off of each other.
[00:41:23] Unknown:
Let's start with the architecture of OpenMetadata itself, and then we'll think about the clients. When we built OpenMetadata, you know, we did consider open sourcing what we had built at Uber and starting with that. There are a couple of choices that we made. 1 is, when systems are built in large companies that have a large number of developers and some people to support them and all of those things, certain choices you make in terms of architecture are not gonna work for small companies. For example, when we built Apache Atlas, we ended up using JanusGraph (Apache Titan at that point in time), and then HBase, and then HDFS for storage of metadata.
Just operationalizing this, just for metadata, is a nightmare. Right? The second thing is we did not add Kafka as a requirement or a graph database as a requirement, because not many people want to operationalize those just to have a metadata system. However, our system is extensible to include Kafka where it is required for publishing metadata changes in real time. But it is not a must have, right, for everybody. Right? And so: fewer dependencies, fewer moving parts. And when we made the choice of what systems we would build on, we made the choice of something that is well understood, well known; people know how to operationalize them. So from that perspective, the things that we depend on are MySQL and Elasticsearch.
Nothing else. If you are a big company, you can operationalize it. You can add Kafka and things like that, but it is not required for most companies. Thus making the operational complexity and the number of moving parts less. The second thing is looking at our own open metadata implementation. As you said, we start with schemas. And using schemas, we generate all the code. And so for providing REST APIs, we use Jetty, DropWizard, and JDBI for writing to MySQL. So this is a fairly standard, well known stack. So that's what we use, and that provides the REST APIs. And all the REST APIs are provided such that the API parameters and the request responses, everything, come off of JSON Schema. Right? And then the second thing is we built an ingestion framework using Python.
And it follows the usual, you know, connect to a source and then get the data, then process it, put it into a different shape, and then sink, which is writing it back to the metadata system using our APIs. So I wanna call out an important distinction here. A lot of systems end up doing ingestion directly: they write it into the database, and they've also exposed their database APIs, let's say graph database APIs, as the APIs for accessing metadata. That ties you into the implementation detail. You want to access everything through public interfaces, not go directly to implementation details. And many of the projects are finding that some of the graph databases that they have used are becoming issues for some of the, you know, users that are using their system. And then when you try to change the implementation detail, you're gonna affect a lot of other things. So we do ingestion through our open APIs.
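A bare-bones sketch of that source, process, sink pattern might look like the following. The connector logic is stubbed out and the endpoint is an assumption, but the shape is the point: the sink talks to the public REST API rather than writing to the database underneath it.

```python
# Sketch of the source -> process -> sink ingestion pattern described
# above, writing back through public APIs rather than to the database.
# Connector details and the endpoint are assumptions for illustration.
import requests

METADATA_API = "http://localhost:8585/api/v1"  # assumed base URL

def source(connection_uri: str):
    """Connect to a source and yield raw table records (stubbed here)."""
    yield {"table": "trips", "columns": [{"name": "trip_id", "type": "BIGINT"}]}

def process(record: dict) -> dict:
    """Reshape a raw record into the metadata entity form."""
    return {"name": record["table"], "columns": record["columns"]}

def sink(entity: dict) -> None:
    """Write the entity back through the public REST API."""
    requests.put(f"{METADATA_API}/tables", json=entity).raise_for_status()

def run_pipeline(connection_uri: str) -> None:
    for record in source(connection_uri):
        sink(process(record))
```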
And then we are using TypeScript and, yeah, all the, you know, regular, newer JavaScript frameworks to build our UI. And then, again, the UI uses JSON Schema to generate TypeScript. So that's the architecture. And then 1 thing I want to call out is our ingestion framework, which is written in Python, is flexible. We have put in a lot of effort to write it in an extensible manner. That has helped us a great deal. Right? We can actually add an integration in under a day. Right? And we already have 20 integrations now, for a young project, starting from data warehouses, all the popular data warehouses, popular transactional databases, dashboard systems, and then the, you know, workflow systems, Kafka included.
We have a lot of integrations. Now let's think about, from the client side, how they can use open metadata. Instead of writing code where if the key name is equal to blah, I expect a certain value, and then you cast the value to some, you know, thing that you're expecting, and then if anything changes, your debugging becomes run time debugging; instead, the power of JSON Schema is there are enough tools that, as Harsha was saying previously, you can generate language bindings in any language of your choice. And, you know, you can generate your code and use that code in a lot of the things that you are building. If you regenerate the code, you will find compile time errors instead of run time errors. Right? So that is for the clients.
So clients can use our JSON schemas and embed our JSON schemas in their JSON schemas, because, you know, JSON Schema has a great importing mechanism through references. So they can use our JSON schemas in their JSON schemas and generate code using our JSON schemas. So they can take our schema models and make them their own. And any project can make these their own. They can take the vocabulary if they like. They can reuse the vocabulary, the types and entities and all of that. Very well defined, well documented. People can use any of the things that we have. You can use it from the web: you point it to an HTTP URL. You know, you can use JSON keywords.
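For example, a downstream tool's own schema could pull in a published entity definition by reference rather than redefining it. This is a sketch, and the URL is a placeholder for wherever the schemas you depend on are actually hosted:

```python
# Sketch: composing your tool's schema out of a published schema via
# JSON Schema's $ref. The URL below is a placeholder, not a real host.
MY_TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        # Reuse the shared table definition instead of redefining it.
        "target": {"$ref": "https://example.com/schemas/entity/table.json"},
        # Add only the metadata this tool is responsible for.
        "qualityScore": {"type": "number", "minimum": 0, "maximum": 1},
    },
}
```

Generators such as datamodel-code-generator can then turn composed schemas like this into typed Python or TypeScript models, which is how a schema change shows up as a compile-time error rather than a run-time surprise.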
[00:46:56] Unknown:
Yeah. So 1 of the important things: when systems are getting built at big companies and getting open sourced, there is a blind side to how they are designed. Like, for example, at Uber, there's a Kafka team as infrastructure, there are Hive and HDFS teams as infrastructure, just to keep those services up and running. So, you know, anyone who's building an application inside Uber thinks Kafka is just there: I can just publish to it, it's free, because they're not investing in it, they're not worrying about keeping it up and running. So when we were actually designing open metadata, we wanted to pay attention to the rest of the industry, who have to bring it up and running themselves and don't want to say: hey, you know, just to get the metadata system up and running, we need to get Kafka and a graph database and everything else.
And we want to especially call out that, you know, this is without sacrificing any scalability aspects. A lot of designs go above and beyond and try to include complex projects, complex services, just in the name of scalability. But we actually did, you know, take extra steps there and built a sandbox with 1,800,000 entities. These are, like, 1,000,000 table entities, 500 dashboards, 100 topics, and so on, along with 4,500,000 relations, within a single instance of open metadata. And we are not overly trying to tune it; this is the default configuration that we ship. And we are able to demonstrate that, you know, and publish the benchmarking, with hundreds of users simultaneously accessing the sandbox.
So it scales well, and, you know, special attention is given to simplicity and the ability to keep it up and running with minimal effort in production.
[00:48:32] Unknown:
And the point of scalability is interesting because there are a number of different ways to think about that problem, where 1 is the volume of data, the number of users as you were just discussing, but another is the scaling of organizational and cognitive complexity: how do you structure any, you know, siloing of metadata, if that's necessary within your organization, to be able to represent the different sort of geographic boundaries or business domains, and maybe being able to federate installations of open metadata to be able to do discovery of data assets across those organizational boundaries. Where maybe 1 business unit says, this is our open metadata installation that has everything to do with the data contained within our platform for our particular business unit. But then, you know, there's a different unit within the enterprise that says, we have our own installation of open metadata, but we need to be able to do discovery across those 2 because there are some interchanges of, you know, we're handing off, you know, a CSV file over FTP. We need to make sure that the structures of the way that we're exporting the data match with the way that you're importing the data, and being able to handle some of those, you know, complexities of scale. That's a great question. I think for scalability itself, I think we must be able to
[00:49:48] Unknown:
handle the scalability of most companies on the planet, barring 1 or 2. So scalability is not a challenge. However, organizational boundaries. Right? Maybe, you know, the servers and systems that some part of the organization is using are in a certain VPC, and they don't give access from another VPC, things like that. That is an issue for very large organizations with a lot of lines of business and sub-orgs. What we think we could do there is, you know, this scales. Right? You can have different open metadata installations, but then you can still centralize the metadata that you choose to centralize into a single metadata store. That way the entire organization's, right, the entire company's data, how you are doing with the data, is visible. Right? So there's a concept of, you know, domains and, you know, all kinds of things that are coming, right, as an organizational unit. You should be able to have multiple open metadata installations, and then, you know, you can ship that metadata into a centralized place.
That way, it's the metadata being integrated into a central place, rather than having to integrate the central OpenMetadata with all of your systems and expose those systems. So you could do that. That is certainly an option where these restrictions exist within a company.
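To make that federation concrete, here is a minimal sketch of shipping metadata from a business unit's installation into a central store over the REST API. The endpoint paths, auth header, and paging fields here are illustrative assumptions, not OpenMetadata's documented contract:

```python
# Sketch: periodically copy table entities from a business unit's
# OpenMetadata instance into a central one. Hostnames, token, and
# payload fields are hypothetical placeholders.
import requests

LOCAL = "https://metadata.unit-a.example.com/api/v1"
CENTRAL = "https://metadata.central.example.com/api/v1"
HEADERS = {"Authorization": "Bearer <token>"}  # hypothetical auth token

def sync_tables(batch_size: int = 100) -> None:
    """Page through the local instance's tables and upsert them centrally."""
    after = None
    while True:
        params = {"limit": batch_size}
        if after:
            params["after"] = after  # cursor-style paging, assumed
        resp = requests.get(f"{LOCAL}/tables", params=params, headers=HEADERS)
        resp.raise_for_status()
        body = resp.json()
        for table in body.get("data", []):
            # PUT as create-or-update so repeated syncs stay idempotent.
            requests.put(f"{CENTRAL}/tables", json=table, headers=HEADERS).raise_for_status()
        after = body.get("paging", {}).get("after")
        if not after:
            break

if __name__ == "__main__":
    sync_tables()
```

The key design point is that metadata flows outward to the central store on a schedule, so the central instance never needs network access into each unit's VPC.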
[00:51:04] Unknown:
In terms of the organizational complexity of modeling those different business domains, as well as all of the schematic elements of the data that you're dealing with, I'm wondering what some of the challenges have been in balancing the generality and flexibility of the system against being specific and appropriately constraining in the types of metadata you consume, so that you can enforce a meaningful and useful structure without it just being "you can write any key-value you want."
[00:51:40] Unknown:
I think we talked about, let's say, Apache Atlas as an example. That is the reason why a metadata system must model all the common entities. And when new entities come up, my thought process is: work with the community to add a new entity, because if you have a need, most likely other companies have that need as well. So get the entities that are required standardized in the metadata system and make them available. Otherwise, if you keep modeling your own metadata entities, they model the same entity concept, but the model looks different, and that runs against the OpenMetadata standardization. And then tools have to deal with varieties of metadata. The other thing I would say is that we ship some tags and categories off the shelf.
We would like companies in different industry segments to work with us on defining their tag vocabularies. That way, if they collaborate to write these tag categories and business glossaries, the result becomes available for somebody else, who can then add to it. So there is a way to standardize even some of the business vocabulary as open source. But so far, our approach has been strongly typed entities: make entities available to capture all your use cases. If any extensibility is required, like I was telling you earlier, the core aspects of metadata should be common. The core attributes of an entity should be common.
There might be extension points that are specific to your organization or your group, but those won't be commonly used. Those extension points, we will make available.
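As an illustration of that "common core plus extension points" idea, here is a minimal sketch in Python. The field names are simplified stand-ins, not OpenMetadata's exact entity schema:

```python
# Sketch: every entity shares strongly typed core attributes, while
# org-specific fields live in a clearly separated extension map instead
# of reshaping the shared model. Field names are illustrative.
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class EntityReference:
    id: str    # unique identifier of the referenced entity
    type: str  # e.g. "user" or "team"

@dataclass
class Table:
    id: str
    name: str
    fullyQualifiedName: str  # e.g. "snowflake.sales.public.orders"
    description: Optional[str] = None
    owner: Optional[EntityReference] = None
    tags: List[str] = field(default_factory=list)
    # Extension point: org-specific attributes that are not part of the
    # shared standard, so common tooling never depends on them.
    extension: Dict[str, Any] = field(default_factory=dict)
```

Tools built against the standard can rely on the core attributes being present and identically shaped everywhere, while each organization keeps its custom fields out of the shared contract.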
[00:53:14] Unknown:
Going back to what we were discussing earlier: there are so many different layers to the data ecosystem, and businesses are building special-purpose tools that are all, at their core, built around different mutations, manipulations, and interpretations of metadata, with data quality being one category, along with discovery and data governance. They are all using metadata about your systems to build their specific capabilities. I'm wondering what you see as the role of OpenMetadata across that ecosystem of projects, and some of the potential for simplifying integration across those systems. Maybe there are particular industry verticals that become obviated if OpenMetadata becomes ubiquitous, and maybe there is potential for new efforts and business domains to grow on top of this shared specification?
[00:54:18] Unknown:
Yeah. What I feel is that, instead of tools becoming obviated, tools become a lot simpler. If the metadata already exists, a lot of these tools can be built quickly, and that will increase the competition in this space. The ones building the best tools, the best experiences, the best solutions to the problems, they will win. Versus today, where a lot of the metadata is trapped in proprietary formats, and just entering that kind of deployment is a lot harder, because in order to just become a quality tool, you first have to crawl and integrate with everything. What if that metadata were shareable? Then there could be a lot of quality tools, and a lot of those tools could be a lot simpler. The barrier to building a solution around metadata decreases significantly, and we believe that will actually foster innovation.
This helps in making our data tool space better. So the whole idea of OpenMetadata is: don't let your metadata, which is a significant asset, probably more important than your data itself, sit trapped in vendor-proprietary formats in certain systems. Set it free. Make it easily shareable and available so that innovation can happen. I believe this can foster innovation.
[00:55:49] Unknown:
As you have been building the OpenMetadata project, working to grow the community, and exploring this space anew, I'm wondering how that has influenced or shifted your perspective on the potential available in this space and the challenges inherent to metadata management, and maybe some of the ways that your initial ideas and assumptions about the goals and capabilities of this project have been challenged as you have started to build out the implementation.
[00:56:21] Unknown:
Yeah. So if you look at the metadata space today, metadata is only used for discovery and, to a certain extent, governance. But there are a lot more applications of metadata, as we discussed. There has been an overwhelming response to our OpenMetadata announcement, where people have come back to us saying, hey, metadata is not a solved problem; there are a lot of applications of metadata that are possible. As you are saying, we completely concur. A lot of people have also come back saying that many metadata systems are just called catalogs. There's a realization that a catalog is just one application of metadata. So these are things we have discovered as we worked through the project, not only as our own thoughts, but as thoughts reflected back to us by many people in the community.
[00:57:17] Unknown:
Yeah. And we have seen quite a few community members coming in, adding new entities, and taking a liking to the ingestion framework and how easy it is to add a new entity and get ingestion up and running. So in a way, it's a plus-one vote: hey, this is great, this is what we wanted, and the APIs and schemas are looking great. Those were the stumbling blocks before to getting any integration working with the metadata we already had. And this is a project that has the right goals and the right direction, and it is going the right way. So far, it has been more of: this is great, we want to start using it and start contributing to it.
[00:58:15] Unknown:
So for people who are looking to simplify the management of their metadata across their data platforms and data systems, what are the cases where OpenMetadata is the wrong choice and they might be better served by a data catalog like DataHub, or one of these other solutions for metadata management in a limited domain?
[00:58:19] Unknown:
You know the answer to this: OpenMetadata is never a bad choice. Where I think things could be different is, one, today, organizations are not managing their metadata at all. Metadata is at the center of data mesh, data culture, data observability, a whole bunch of things that people are grappling to express as a clear problem statement. But metadata is at the center of it. Where I think things can be different is, if you have a centralized metadata system and somebody else builds a tool that is delightful, that solves your problem better, different tools can be used. Hopefully they are thin tools, not ones building their own separate metadata systems.
So there are certain experiences, the user experience and the collaboration experience, that come with OpenMetadata. But if somebody builds something much better than that, something very focused on a specific persona, those tools should be adopted.
[00:59:23] Unknown:
Yeah. So when organizations go down this road, they tend to go in a serial fashion. They realize: I have a day-one problem of discovery, so let me bring up a catalog. Now I have a catalog, but I have a quality problem, so I'm going to get another tool. Now I have governance, and so on. What we are saying is: hey, we have seen these problems many times, and eventually you end up with siloed tools solving these problems repeatedly. Invest in OpenMetadata instead. You get the metadata platform, the schemas, and the APIs, and all of these experiences become part of it, rather than arriving in a siloed fashion. So if you invest in OpenMetadata, your vision of realizing an entire data culture will become a reality as we improve this product and continue to iterate on it.
[00:59:59] Unknown:
And as you continue to iterate on the problem and build out the OpenMetadata specification and implementation, what are some of the things you have planned for the near to medium term, or any upcoming projects that you're particularly excited about?
[01:00:19] Unknown:
So lineage is something that we are working on; there's a lot of work going on for our 0.5 release. We started off with lineage from different workflow systems as part of the system, and then we are building versioning and eventing. What we believe is that today, metadata versioning is not tracked at all. As the metadata changes, it reflects how your particular dataset is changing: the owner, the columns, the description, and things like that. So we are building a feature that we are super excited about: you will have versioned metadata.
You will be able to go through what has happened in the life cycle of a table, starting from day one when it got created: how it changed over a period of time, what tags got added, what tags got deleted, what the previous description was, and how the description improved, through versioning of every entity. And from that versioning of every entity, we will generate change events, which can be used for building bots and applications where you subscribe to a certain type of change: the table got created, the table got a PII tag added, table ownership changed. Then you can start building your own internal workflows to react to how the metadata is changing.
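As a rough illustration of what such a workflow might look like, here is a minimal sketch of a webhook that reacts to change events, assuming the eventing feature delivers JSON payloads to an endpoint you register. The payload shape (eventType, addedTags, entityFQN) is an illustrative assumption, not the published event schema:

```python
# Sketch: react to metadata change events delivered as JSON over HTTP.
# Event field names are hypothetical placeholders for illustration.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

class ChangeEventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))

        # React to the kinds of changes mentioned in the conversation.
        if event.get("eventType") == "entityCreated":
            print(f"New {event.get('entityType')} created: {event.get('entityFQN')}")
        elif "PII" in event.get("addedTags", []):
            # e.g. notify the governance team, open a ticket, etc.
            print(f"PII tag added to {event.get('entityFQN')}")
        elif event.get("eventType") == "ownershipChanged":
            print(f"Owner changed on {event.get('entityFQN')}")

        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), ChangeEventHandler).serve_forever()
```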
So those are the features, and then we'll continue to build a lot of integrations with the tools available in the data ecosystem. But where we are right now in terms of user experience: we have discovery, which is the starting point; you must discover your data as a first step. And we have built an experience where you can have great descriptions and things like that, so people can understand what they have discovered. Next we'll start building experiences where people can collaborate, where they can understand the data better through questions, comments, and feedback, and where they can ask for features, additional columns, things like that.
That is for people: the collaboration experience. Then we'll also start doing some reliability work. There are tools that tell you whether a schema change is backward compatible or incompatible. Now that we are versioning, we'll be able to let people know: if you depend on this data, it has a backward-incompatible change. Some of the observability aspects: we can already say what your data distribution looks like today, which is a feature we support, but then we'll start using that information to say there is something missing here. Maybe the number of rows coming into the table dropped significantly; maybe there is missing data. So some kind of alerting like that. And then we'll also start adding support for test metadata, so that testing and quality tools can integrate with OpenMetadata. In a nutshell, data catalogs have gotten stuck at discovery. Go beyond discovery to collaboration and then automation.
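To make the backward-compatibility idea concrete, here is a minimal sketch of comparing two versions of a table's columns and flagging changes that would break downstream consumers. The column shape is simplified for illustration and is not OpenMetadata's exact schema:

```python
# Sketch: diff two versions of a table's column list and report
# backward-incompatible changes (removed columns, changed types).
from typing import Dict, List

def breaking_changes(old_cols: List[Dict], new_cols: List[Dict]) -> List[str]:
    """Return human-readable descriptions of backward-incompatible changes."""
    old = {c["name"]: c for c in old_cols}
    new = {c["name"]: c for c in new_cols}
    problems = []
    for name, col in old.items():
        if name not in new:
            problems.append(f"column '{name}' was removed")
        elif new[name]["dataType"] != col["dataType"]:
            problems.append(
                f"column '{name}' changed type "
                f"{col['dataType']} -> {new[name]['dataType']}"
            )
    return problems

# Example: the new version dropped a column and retyped another.
v1 = [{"name": "id", "dataType": "BIGINT"}, {"name": "email", "dataType": "VARCHAR"}]
v2 = [{"name": "id", "dataType": "VARCHAR"}]
for issue in breaking_changes(v1, v2):
    print(issue)
```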
And then the second thing, one of the automations we also want to do: in a lot of data catalogs, description and tagging work is driven by somebody at scale who says, you need to complete this by this date. People tag the data, and after that, they forget about keeping it up to date. Then your descriptions are stale, your metadata is stale, your tags are not correct. What we did at Uber was build a tool that, on a weekly basis, for every data owner, would say: your description coverage is this, your quality coverage is this, and you are doing well, or poorly, against the SLA you defined. That was a way of constantly nudging people, without big mandates and two weeks of doing things with fanfare, toward continuous improvement of data as a culture. So we'll build some automations where people get both positive and negative feedback: your descriptions are missing here, versus, hey, you did great, your description coverage is in the top such-and-such percentage within the organization.
Data is a thankless job for most of the people doing it. We want to bring some joy to the people working on data through these kinds of experiences, where the organization's data continuously improves through quality feedback.
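As a small sketch of what that weekly nudge could compute, here is description coverage per owner over a set of table entities. The entity shape (owner and description fields) is simplified for illustration:

```python
# Sketch: compute each owner's description coverage and produce a
# positive or negative nudge, in the spirit of the weekly report
# described above. Entity fields are hypothetical placeholders.
from collections import defaultdict
from typing import Dict, List

def description_coverage(tables: List[Dict]) -> Dict[str, float]:
    """Map each owner to the fraction of their tables that have a description."""
    owned = defaultdict(list)
    for t in tables:
        owned[t.get("owner", "unowned")].append(bool(t.get("description")))
    return {o: sum(flags) / len(flags) for o, flags in owned.items()}

tables = [
    {"name": "orders", "owner": "sales-team", "description": "Customer orders"},
    {"name": "orders_raw", "owner": "sales-team", "description": None},
    {"name": "rides", "owner": "trips-team", "description": "Completed rides"},
]
for owner, coverage in description_coverage(tables).items():
    nudge = "great job!" if coverage >= 0.9 else "descriptions are missing"
    print(f"{owner}: {coverage:.0%} description coverage; {nudge}")
```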
[01:04:35] Unknown:
Well, for anybody who wants to get in touch with you, follow along with the work that you're doing, and get involved with the project, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as the biggest gap in the tooling or technology that's available for data management today.
[01:04:57] Unknown:
What I feel is that, more than a gap, there are too many tools that don't talk to each other. And I think it's just like platforms: billions of dollars went into platforms, and now winners are emerging and consolidation is happening. In the tooling space there are likewise too many tools; winners will emerge and consolidation will happen, so that the confusing noise dies down and clarity comes through.
[01:05:17] Unknown:
I think the gap is open metadata itself. We'd be thrilled to have your listeners, your audience, come join us to build this. This is one of the foundational things that we're building, and the reason to keep it open source is to bring others into the community. So we'll be really thrilled to have your audience come over and work with us. Metadata should be a solved problem, and let's not keep rebuilding the same thing. Let's move to the next level of innovation.
[01:05:40] Unknown:
Absolutely.
[01:05:46] Unknown:
Well, thank you both very much for taking the time today to join me and share the work that you're doing on OpenMetadata. It's definitely a very interesting project, and I look forward to seeing it succeed, grow, and be adopted. I appreciate all of the time and energy you're putting into it, and I hope you enjoy the rest of your day. Thank you. Thank you, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Sriharsha Chintalapani's Background
Suresh Srinivas's Background
Introduction to OpenMetadata
Challenges in Metadata Collection and Organization
OpenMetadata's Approach and Goals
Benefits and Integration of OpenMetadata
Technical Foundation and Architecture
Scalability and Organizational Complexity
When OpenMetadata Might Not Be the Right Choice
Future Plans and Exciting Projects
Biggest Gaps in Data Management Tooling
Closing Remarks