Summary
A significant source of friction and wasted effort in building and integrating data management systems is the fragmentation of metadata across various tools. After experiencing the impacts of fragmented metadata and previous attempts at building a solution, Suresh Srinivas and Sriharsha Chintalapani created the OpenMetadata project. In this episode they share the lessons that they have learned through their previous attempts and the positive impact that a unified metadata layer had during their time at Uber. They also explain how the OpenMetadata project is aiming to be a common standard for defining and storing metadata for every use case in data platforms and the ways that they are architecting the reference implementation to simplify its adoption. This is an ambitious and exciting project, so listen and try it out today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Your host is Tobias Macey and today I’m interviewing Sriharsha Chintalapani and Suresh Srinivas about OpenMetadata, an open standard for metadata and a reference implementation for a central metadata store
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what the OpenMetadata project is and the story behind it?
- What are the goals of the project?
- What are the common challenges faced by engineers and data practitioners in organizing the metadata for their systems?
- What are the capabilities that a centralized and holistic view of a platform’s metadata can enable?
- How would you characterize the current state and progress on the open source initiative around OpenMetadata?
- How does OpenMetadata compare to the OpenLineage project and other similar systems?
- What opportunities do you see for collaborating with or learning from their efforts?
- What are the schema elements that you have identified as critical to a holistic view of an organization’s metadata?
- For an organization with an existing data platform, what is the role that OpenMetadata plays, and what are the points of integration across the different components?
- Can you describe the implementation of the OpenMetadata architecture?
- What are the user experience and operational characteristics that you are trying to optimize for as you iterate on the project?
- What are the challenges that you face in balancing the generality and specificity of the core schemas for metadata objects?
- There are a large and growing number of businesses that create systems on top of an organization's metadata in the form of catalogs, observability, governance, data quality, etc. What do you see as the role of the OpenMetadata project across that ecosystem of products?
- How has your perspective on the domain of metadata management and the associated challenges changed or evolved as you have been working on this project?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on OpenMetadata?
- When is OpenMetadata the wrong choice?
- What do you have planned for the future of OpenMetadata?
Contact Info
- Suresh
- @suresh_m_s on Twitter
- sureshms on GitHub
- Sriharsha
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- OpenMetadata
- Apache Storm
- Apache Kafka
- Hortonworks
- Apache Atlas
- OpenMetadata Sandbox
- OpenLineage
- Egeria
- JSON Schema
- Amundsen
- DataHub
- JanusGraph
- Titan Graph Database
- HBase
- Jetty
- DropWizard
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to data engineering podcast.com/linode today. That's l I n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Sriharsha Chintalapani and Suresh Srinivas about open metadata, an open standard for metadata and a reference implementation for a central metadata store. So, Sriharsha, can you start by introducing yourself? Hi, Tobias. Thanks for inviting us to the Data Engineering Podcast.
[00:01:15] Unknown:
And I'm Sriharsha. I've been in data for the past 10 years, mainly on the data infrastructure side. You know, I'm a streaming engineer; I started out at Mozilla, then Hortonworks, and worked at Uber as well, leading a bunch of data teams there. I'm also an open source committer on Apache Storm and Apache Kafka and a bunch of other projects. So that's how my data career started. And, Suresh, how about yourself? Hi, Tobias. Thank you for
[00:01:41] Unknown:
inviting us to the podcast, and hello, listeners. My name is Suresh Srinivas. I've been in the data space for a long time. Started out, you know, in the data space at Yahoo, where I was on the initial team that built Hadoop. And then Hadoop started getting traction beyond web companies. You know, as data became big data, a lot of enterprises started looking at data as a significant tool. Hadoop started getting traction; you know, it paved the way for big data. Along the way, we started Hortonworks. I was 1 of the cofounders of Hortonworks. We started in 2011. And then in 2017 or so, having built a lot of complex platforms, I was actually bothered by the fact that, even though people had all these scalable systems, they were finding it really hard to realize value from their data. So it was just amazing to see how hard it is to get data right.
So I joined Uber, the other side of data, right, having built platforms. I switched over to 1 of the biggest big data platform consumers, Uber, to look at why it is hard to get the data right and, you know, what are the things that are required to get the data right. So since 2017 or so, my focus has been data experience, you know, improving data tools, making it easy to consume, easy to get the data right. You know, now OpenMetadata is a continuation of that theme.
[00:03:11] Unknown:
That brings us to what you're both working on now, which is the open metadata project. I'm wondering if you can just give a bit of context about what it is that you're building there, because I know it's both an open standard that you're looking to build and promote as well as a reference implementation of that standard. So I'm wondering if you can just talk through some of the sort of story behind the project and the main goals that you have for it. In 2019
[00:03:33] Unknown:
or so at Uber. You know, Uber is a data driven company. Everything at Uber is driven by data, including, you know, matching riders and drivers, and demand and supply. All of it is driven by data. We were finding some basic problems with data. Some of the key metrics, right, number of trips completed, things like that: depending on who you ask, you would get different answers. And we had tried to solve this problem through piecemeal approaches. Right? So try to get the data quality higher in data warehouses. Maybe, you know, try to improve how we are reporting things by adding more tests to reporting.
All of this had taken a toll on data productivity. So we were using a lot of data scientists to make sense of data, and a lot of their time was being spent on unproductive tasks. Right? Running behind data quality issues, you know, missing data issues, things like that. So we decided that we needed a different, holistic approach to solve the data problems. And that included starting from the source of the data, that is where, you know, mobile phones and apps are producing events, online services that are storing the online transactional data, and then how the data is flowing from those systems into the offline systems, how it is getting curated, modeled, you know, and then finally, how people are consuming this data to build different data assets, like reports, dashboards, metrics, machine learning features.
So as we were looking into this, you know, it was very clear for us. Right? If the data is very poor at the source and it is polluted upstream, you cannot actually clean it up downstream. You cannot create quality downstream. It has to start at the source. So we took an end to end approach. We started a project that included everyone, all the way from mobile developers and online services developers to people who are in the big data world ingesting their data, creating offline models, all the way to data scientists. Now through that, there were a couple of things that became very clear. Right? So you cannot actually address the data issues through one-time efforts. Right? So, culturally, the data culture of a company needs to change and treat data as important.
So the band aid approaches were not working. The second thing is there were too many tools. Right? You've seen all these tools, you know, that keep coming up every day. There were too many tools in the organization, but there was no way to get a holistic picture of data. So you would go from for discovery, you go to a catalog kind of a thing, and then you jump to a data quality and then a query and then a metric system. All of the systems were different. Right? So people had to jump from 1 tool to the other. So tool disconnect was a problem. Then data is a team game. Right? Somebody is producing it. Somebody is curating it. Somebody is modeling it. Somebody is consuming it. The team needs to come together and work together to have successful data outcomes.
Without that, you know, people make assumptions and get data wrong or may not end up using the data. So the people disconnect was another problem that we had, where people didn't know each other. Right? There were no names attached to their datasets. As we started looking at it, there were a lot of things that we were doing manually, taking people who have a lot of expertise in data science or, you know, business intelligence and using their time to do some manual cleanup and a lot of toil. So as we started looking at it, at the core of it, it emerged that discovery is a problem. Right? And discovery not as in the traditional sense of a data catalog where it is focused on, say, databases and tables.
Data is used for creating a lot of data assets. Right? And all these data assets should be discoverable. An example is there is a metric that is the number of trips taken, right, along certain dimensions. There were many such metrics, and each metric had its own definition. Right? There are different dashboards trying to answer the same thing because they are using different datasets. So discovering all your data assets in a single place was 1 problem that emerged. The second problem was the people disconnect. Right? When you're using a data asset, you don't know who produced it, why they're producing it in a certain way. You have questions. You don't know, you know, whom to ask. And so consumers of this data did not know who is producing it, why they are producing it, what guarantees come with it. There was no ownership at all for data assets. And then producers of the data did not know that their data was being used for some critical business purposes. And so they would randomly make a certain change, and then, you know, for a couple of weeks, tickets would go from 1 team to the other. And then they finally realized that their data is being used for some super important, you know, business purpose.
And so they did not have visibility into who is using it, why they are using it, for what critical business purposes. So that was a, you know, clear disconnect. And then finally, data must have context. Right? Without context, data is just bits and bytes. And lack of this context made many people not use the data that we had, and many people used it incorrectly because they were making wrong assumptions. And then finally, right, you hear from data mesh about data as a product. Right? A lot of this data was produced without considering how it is being consumed. And as a result, data was hard to use.
The schema was not, you know, well designed. Because of that, some of our ETL pipelines had 5,000 lines of code implementing a state machine to rebuild the picture of reality instead of just capturing it right at the source. So these were the problems that came out, and then, you know, tooling. Right? People need better tools and automation so that they can focus on more important stuff. So as we started looking at it, you know, cutting a long story short (you know, the story is already long), it emerged that we were not managing metadata well. And managing the metadata, where you centralize all the metadata about all the data assets, about user activity, about user inputs and feedback, all of that became super important to solve this problem. So we built a metadata system that centralized all the metadata within Uber.
And then as we were looking at it, it started becoming the center of the data universe at Uber. And we started building a lot of automation, a lot of tooling that improved consuming the metadata. A lot of tools became a lot simpler to build because there was a centralized metadata store. And Harsha and I were thinking, you know, this is a problem that every company has, and perhaps we should build a centralized metadata store where metadata is shareable through very well designed metadata APIs, and that can actually transform the data landscape. Right? So that is the reason why we decided that we would come and build it outside as open source, because we have a lot of open source background.
[00:10:50] Unknown:
Open source is the right solution here. The OpenMetadata project is definitely very ambitious in that it is trying to be this kind of universal approach to metadata. It's trying to be this sort of central store to integrate metadata across all of the different components of the data ecosystem, which is, to some estimation, an intractable problem because every different system has some concept of metadata and the ownership of that metadata. And so being able to expose that and share it and unlock it and unify the schema of it is a very complex and multifaceted problem. And I'm wondering if you can talk through some of the common challenges that engineers face in trying to collect and organize all of the metadata across their systems, and some of the ways that the existing state of the tooling in the industry and the various generational shifts of those tools contribute to some of those challenges and blockades in being able to actually unlock the capabilities of that metadata?
[00:11:52] Unknown:
We are solving a hard problem. You know, that is what is fun. Right? But solving this problem can transform the data world. So let me give you some thoughts on where we are today. It's not like the metadata is not collected in a centralized manner. To a certain extent, it's being done by catalogs. The problem is the metadata models: a lot of times there are no metadata models. Right? It's key value pairs, and, you know, there's no clear definition of what shape of metadata you can get out of the system. The second thing is lack of APIs. So if you look at the picture today, data catalogs have most of the metadata in an organization today. But it is not shareable. It is not modeled well. And there are no APIs, and there is no extensibility with most of the data catalogs.
Because of that, what ends up happening is a tool that is building, let's say, data quality has to build a separate metadata subsystem, right, within its tool, because the metadata that is already available in another system is not shareable. So they end up integrating with the other tools, collecting all the metadata, and storing their own copy of metadata just so that they can add the special metadata that the tool is focusing on, which is, let's say, tests and then test results. Right? And then maybe possibly summarize it into a quality score. So imagine if the centralized metadata store was shareable and accessible through great APIs: a quality tool could have just focused on providing an interface for building tests. Those tests can be shared in a central metadata store instead of having your own system for storing them. The quality tool can run those tests, produce the test results, and then write them back to the central metadata store. With that, the quality tool can now focus on what special sauce it is bringing instead of trying to build a metadata system. Right? So this is true with data observability, data management, data governance.
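To make that workflow concrete, here is a rough sketch of how a data quality tool might lean on a shared metadata service instead of carrying its own metadata copy. The base URL, endpoint paths, auth scheme, and payload shapes below are assumptions for illustration, not the actual OpenMetadata API.

```python
# Hypothetical sketch: a data quality tool using a shared metadata
# service instead of maintaining its own copy of metadata.
# Endpoint paths, payload shapes, and the token are illustrative only.
import requests

METADATA_API = "http://localhost:8585/api/v1"  # assumed base URL
HEADERS = {"Authorization": "Bearer <token>"}  # assumed auth scheme

def register_test(table_name: str, test_definition: dict) -> dict:
    """Store a test definition against a table in the central store."""
    resp = requests.post(
        f"{METADATA_API}/tables/{table_name}/tests",
        json=test_definition,
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()

def publish_results(table_name: str, test_id: str, results: dict) -> None:
    """Write test results back so every other tool can see them."""
    resp = requests.put(
        f"{METADATA_API}/tables/{table_name}/tests/{test_id}/results",
        json=results,
        headers=HEADERS,
    )
    resp.raise_for_status()
```

The point of the sketch is the division of labor: the quality tool supplies the tests and the scoring logic, while storage and sharing of that metadata stay in the central store.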
Because the central metadata system is not shareable and doesn't have great APIs, everybody has to build their own copy of the metadata system. Now the outcome of this is there are multiple copies of metadata within an organization. Right? So that metadata is fragmented. Some metadata is available in tool X, some available in tool Y. There is no central, holistic picture of data in an organization. So there is fragmentation. Right? There is duplication. With duplication comes inconsistency. Right? You added a tag here; the tag is missing there. You have a description here; the description is missing there. Because of that, you have to jump from tool to tool, and that causes a lot of user frustration.
Right? So I believe that many tools have integrations built which can be avoided if there is a central metadata repository that focuses on building integrations: you configure that tool with these systems, centralize your metadata, make it shareable, make it extensible. Many tools can make use of the centralized metadata. And then finally, right, if you look at what is going on in the data landscape, especially because of great IPOs in the space and a lot of investment in the space, every day many different siloed, narrow tools are cropping up. Right? And each of these tools has very small functionality, maybe observability or quality or something like that.
And they end up being full fledged subsystems, when, if they could use the centralized metadata store, they would be simple workflows. And imagine, you know, installing a system like this, integrating it with the rest of your data ecosystem, operationalizing it. Right? And then not to mention the cost, right, of, you know, adding this new service into your stack. All of that can be eliminated, and many of these tools can become simple workflows. And so I think it's not a challenging problem to centralize the metadata. What is gonna be challenging is standardizing, agreeing upon a standard.
And this takes time, and this takes collaboration with many other communities. And then, most importantly, it takes the project being successful and adopted. Right? As we get more adoption and people realize the value of the metadata schemas and metadata APIs, I believe we will start shifting towards adopting these schemas as metadata standards.
[00:16:15] Unknown:
Yeah. To add to what Suresh said, we've seen this problem multiple times at Uber when we entered the data infra space. You know, there are siloed tools kind of doing overlapping work and kind of storing overlapping metadata for each of those use cases. When we started DataNG, that is 1 of the projects where we said, like, hey, centralize the metadata, build APIs, and let users as well as tools come together. That is the important part: when you have users actually consuming it, you get enriched metadata, but without that, the tools go their own way and everyone has to copy the metadata again and again. So we made a conscious decision there to kind of build this single metadata store, build data quality as 1 of the applications, lineage as another application, and collaboration aspects where a user can ask questions. A user can raise questions or issues on the metadata. So that proved to be a huge success at Uber.
And it also proved to be a stepping stone to the next level, onto the platforms themselves. For example, 1 of the things we did at Uber was to identify the important data assets. Right? So that will give, on the user side: hey, what are my important data assets out of, let's say, you know, hundreds of thousands of data assets at Uber? But also on the platform side, what it enabled us to do is, you know, allocate more resources to important data assets. Make sure those are the ones that are running fine. Before that, you know, everything got the same priority. If you are a user running a query on a random table, it would be getting the same priority, the same resources, as, you know, the trips table, which is the highest priority at Uber.
By doing this centralized metadata, by categorizing it well, we dispersed this into the platforms as well as to users. So it proved to be really successful at Uber.
[00:17:58] Unknown:
Another interesting element of this is, as you said, if you have this global metadata store that is shared across the entire infrastructure and all the tools, a lot of the seemingly complex efforts of building a data quality tool or building a lineage tool become much simpler because you don't have to rebuild that metadata store over and over again. And I'm wondering what you see as the kind of vision of and the opportunity for open metadata to be this kind of out of the box metadata store for other tools to be able to build on top of and just add their own unique layer, so that maybe, for a standalone tool, you can say: I'm going to build a data quality platform. I take open metadata off the shelf. I add in my specific interpretation of the metadata I'm collecting as it pertains to data quality.
But then if I want to deploy this data quality tool into a broader data platform, I swap out my internally vendored version of open metadata, and I instead connect up to the globally deployed open metadata system. Everything else operates out of the box. And so rather than each company having to build that metadata layer over and over again, they can use this reference implementation, which can also be swapped out for a company that already has 1 deployed. But what are the sort of limitations of that as the capability of saying, this is the metadata store for x? Because, obviously, you're not going to use open metadata as the way to describe the table schemas in your Postgres database. And I'm wondering if you can maybe break down what you see as kind of the boundary conditions for metadata as a native component to a piece of the infrastructure versus metadata as this shared resource across the platform?
[00:19:43] Unknown:
So what we are focused on is building innovation around metadata. And we started looking at how it is currently solved. This is the 3rd generation of metadata system that I'm building, having, you know, worked on Apache Atlas, and then the second system that we built at Uber, called uMetadata. This is the 3rd system that we are building. What I feel is metadata can enable a lot of innovations. Right? Innovations in how people experience the data, what kind of tooling and automation can be built around data. It can actually create the next generation of data tools.
And when we were thinking about that innovation, we saw that many metadata systems are being built in the open source today. Many vendor solutions exist. And there are many in house metadata systems that are being built. And what we feel is metadata should be a solved problem by now. And there should be a well modeled, API first, schema first approach to metadata that becomes available as an open source project that anybody can bundle and take and build their own innovation around. Right? If we do that, many systems can now focus on their innovation instead of building a metadata subsystem. So that's the main goal, and that's the reason why we are open sourcing it. And we would like to innovate around open metadata ourselves, build some delightful experiences, automation, collaboration, workflows.
But anybody can take open metadata and then use it. So, you know, that's the reason why it is open source.
[00:21:19] Unknown:
As far as that open source project, I know that it's a fairly recent undertaking that you've started building out. I think I first heard about it a few months ago. And I'm wondering if you can just talk through and characterize the current state of that open source project and the community that you're starting to grow around it and some of the progress that you've been able to make over the few months that you've been working on it. Since we open sourced open metadata, you know, our community has grown incredibly well.
[00:21:45] Unknown:
So some of the, like, high level numbers are, like, you know, we have hundreds of users joining our Slack community, trying out open metadata, giving us feedback, you know, participating in discussions on the UI and tooling as well. And we set up a sandbox so that, you know, anyone can actually visit the sandbox and easily play around and understand the APIs and the UI that we are building around it as well. So since we started the sandbox, we have had thousands of users coming in and kind of playing around with the sandbox itself. And more importantly, our contributor numbers are growing quite a bit, and we have around 30 plus outside contributors coming in and, you know, sending patches, sending features, all that stuff, on GitHub itself.
So 1 of the goals we set out with open source is to focus on delivering value to the users and ship features as quickly as possible. With that in mind, we said, like, you know, at the beginning itself: hey, we're gonna do monthly releases, and we're gonna ship, you know, substantial features. And the value of the community is that, you know, what might have taken us, like, months to build, because of our external contributors coming in and shipping these features, we're able to kind of ship substantial features in each release. So far, we are 3 releases into open source. We are coming up on the next 1. Again, we believe, like, you know, the pace of the community has been incredible, and we are really thankful for the community participation here.
And just to quote a number, like, you know, we are at around 250 commits per month. So there's quite a bit of, you know, additions and features coming up. So yeah. Yeah. 1 thing that I wanna call out is
[00:23:14] Unknown:
a lot of people equate the time for which a project has been open source with the maturity of the project. Right? But maturity doesn't come by time itself. Right? It also depends on what experiences you are bringing to the table. Right? In building a system, what learnings are you applying? The past decade has been the decade of big data, and that space has transformed tremendously. Right? And there's a lot of learning if you have looked at it. Those learnings can be applied in building a system. And so this is the third iteration of the metadata system that we have built, which is to say we have made our share of mistakes. And hopefully, we are going to make a new set of mistakes instead of, you know, the same old ones, right, from the past 2 iterations.
And so I would say the maturity of the project is also dependent on what you bring to the table, what learnings, right, and how you have employed those learnings.
[00:24:17] Unknown:
An interesting element of what you're building with open metadata and sort of the current state of the data ecosystem and community is that there has been a very broad sort of willingness to explore these open APIs and open integration points. And 1 of the other manifestations of that is the OpenLineage project, which is focused on metadata as it pertains to lineage of the various data pipelines that we're building across our infrastructures. I'm wondering if you can give your perspective on some of the ways that the OpenMetadata project compares to the goals of OpenLineage and maybe some ways that the 2 can sort of collaborate with each other or learn from each other, and any of the other efforts that you've seen that are similar to OpenLineage or the work you're doing with OpenMetadata? I know that Egeria is another effort along that line. With lineage. Right? Lineage is
[00:25:12] Unknown:
a small but important part of metadata. But the metadata universe itself is a much bigger thing compared to lineage. Specific to OpenLineage, I think the project has the right goals, similar to what we are saying about open metadata: why keep building new metadata systems? They want to actually solve the problem of lineage integration as it pertains to getting details of runs and jobs from various workflow systems. So that is what they are focused on. However, when you look at a metadata system, right, it needs to integrate with a lot of data sources anyway, not just workflow systems.
The second thing is it needs to capture as much metadata as possible, not just run events and jobs and stuff like that. So from that perspective, because we already have integrations with a lot of these tools to capture not just lineage but other information as well, using OpenLineage for just that purpose is not that significant. Right? However, we've been talking to the OpenLineage community, at least over Twitter, on possibly standardizing schemas related to at least runs, jobs, and events. And then I think currently OpenLineage does not have a lineage graph definition, which we have. Maybe we can collaborate on adopting a lineage graph definition as well. Now coming to the other efforts as well. Right? Not just lineage, but metadata. Right?
The challenge that I've seen is, and this is sort of like a moment where, you know, I felt I was not thinking through clearly: we are data people. Right? And in order to use data really well, schemas are required. Not only that, if you want to use it efficiently, well designed, well modeled schemas are required. Right? Otherwise, you won't be able to use it efficiently. You might even make wrong assumptions and get it wrong. Now as data people, when, you know, we have built metadata systems in the past, we ourselves did not consider schemas as important. Right? So we have just put any shape to the data. Right? There's a property here, there's an object there, and then key value pairs and things like that.
What makes it hard then is people don't know the schema. The schema is not modeled correctly. There is no strong typing. So you have to write code like: if this field name is this, do this; and you might get any value back. And so the realization is that schema is super important. We knew schema is super important for data. But metadata is data about data. Right? And so schemas for metadata are also super important, and strongly typed schemas are super important. Right? In order to make metadata shareable and reusable across tools, it cannot be key value pairs, and it has to have, you know, strong types and, you know, a proper shape and all of that. And so: a schema first approach. Right? Which is the reason why, you know, we ended up doing open metadata.
It paid off; it gave us a lot of benefits at Uber. Right? Just to give you a small story behind why schema first and metadata vocabularies are important. When we were looking at data problems at Uber, Uber has lots of microservices defining their own events, schemas, and things like that. We saw close to a hundred definitions of what a location is, for a company like Uber where location is a central vocabulary word. Right? Location, point, currency, the exchange rate, things like that. Right? Core concepts had their own schemas. Sometimes somebody called a schema location when they meant something else.
So you had confusion because the same definitions were not used. It was inconsistent. In some cases, it was confusing because it was a totally different concept. So we did some work at Uber where we took the core vocabulary and modeled it in a single place, once, for all the schemas to consume, which we called a data standardization effort. Through that, we realized that even metadata requires the same standardization. Right? When you call something an interval, it should mean interval across all different metadata entities. Ownership must mean the same thing.
Right? You know, when you say tag, tags must mean the same thing. If you don't have that, definitions change from system to system. Without the right vocabulary, if people are saying the same word meaning different things, there is no collaboration or communication possible. And all it results in is confusion
[00:29:46] Unknown:
and a lot of toil. To that point of the schema being very important and properly typed and very explicit about what is meant semantically and syntactically by those different elements within the schema, I'm wondering what you have seen as some of the useful patterns for naming conventions. Because I know that particularly in sort of the early days of computing, space was limited, so we used very short variable names and very short, you know, names of binaries, because, you know, we didn't have the ability to easily correct our typing in, you know, the early days of UNIX. And I'm wondering what you see as maybe the opportunity for increasing the verbosity of the naming that we're using in these schemas so that they're much clearer and there is less opportunity for confusion.
And in the metadata space specifically, what do you see as the sort of core elements that are required for being able to build a well designed and usable schema and API for collecting and organizing metadata?
[00:30:49] Unknown:
I don't know if verbosity means clarity. So, you know, there is a balance. Right? You know, a lot of the schemas get converted into code, and then readability becomes a problem. So descriptive, long names, but just long enough. Right? Not too long. You want to capture only certain concerns. So from that perspective, right, if you look at what a name is, every name captures maybe a paragraph of concept. Right? So the name is a short form of a lot of knowledge you have accumulated around it. You are capturing it succinctly, right, with the name. And so the name must have a clear definition, description, things like that. Right? Sort of like, you know, I'm really surprised whenever I look at some very familiar words in the dictionary and I look at the precise definition.
I'm just always, you know, blown away, right, by how clear the definitions are. Similar things are required, right, for schema names and type names and things like that. So there must be a succinct type name, not something that is unreadable and, you know, has a lot of acronyms and, you know, user-specific, hard-to-understand short forms and abbreviations and all of that. But at the same time, it requires a clear description. Right? Without the description being there, people won't be able to understand it. Right? So let's look at a few metadata systems that have been designed in the past. So let's take the example of Apache Atlas.
What Apache Atlas ended up doing: it did a great job for its time. Right? It built extensible metadata schemas, and it provided 2 kinds of APIs. 1 is metadata modeling APIs, and then metadata consumption APIs. And the system came out with very few types. Right? Maybe a Hive table or something like that; a table and, you know, a few types were there. Most of these types were basic types. Now you want to define, let's say, a Presto table. Right? I'm just taking it as an example. You have to define a new entity called Presto table. You can copy some of the building blocks from the Hive table, but you create a new Presto table. So what ended up happening is every organization, for most of their metadata needs, had to use the metadata modeling APIs to model the entities.
When you model the entities, a few things happen. Right? 1 is different people bring different expertise in modeling. And then the second thing is user X models a certain entity, and he understands what the field names mean and all of those things, but then user Y cannot understand it because that definition is not captured anywhere. Right? So finally, the biggest problem was every organization will model the entities differently, and then 1 organization's metadata looks different from another organization's metadata. So a tool working with 1 organization cannot work with another organization. That is the reason why we say the system must have all the entities that are required for a metadata system strongly typed and modeled, right, off the shelf. Right? They should be modeled by people who have modeling expertise, and all the entities should be available.
And the second thing is all these entities must have core components, core attributes, and relationships defined in the system already. Right? Now if it is required, there must be extension points in these entities where an organization can extend them. But the core of the attributes and the entities must already be defined in the system. That's the approach that we have taken. And then finally, right, in terms of schema modeling, there are different schema modeling languages. At Uber, when we were modeling, we had used YAML, from which we were generating, you know, other schemas, let's say protobuf, things like that. So you model it once and you generate, from the same modeling effort, the schemas required in other language bindings.
So open metadata takes a schema first approach. Right? In a lot of systems, you build the implementation and you expose whatever the implementation details are as the schema, right, or as the API. We start with schemas. And what I was talking about previously was that we had a schema neutral language at Uber to generate schemas for, you know, other schema languages. In the same way, we ended up choosing JSON Schema; huge kudos to the JSON Schema community. JSON schemas are a powerful way to model all your schemas. And you can not only model the schemas, you can reuse types, you can build relationships, you can reuse other JSON schemas that you have built.
Finally, JSON Schema has super good tooling support. Right? So the way we do things is we model entities and types in JSON Schema, and they are reusable across all the different entities. And then from this JSON Schema, we generate Java code and Python code. Even our UI code is generated from JSON Schema. And then our documentation, how we store the data, everything is driven by JSON Schema. And the power of this is a lot of boilerplate coding just goes away. Not only that, if you make a schema change, every subsystem within open metadata automatically gets updated.
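As a minimal illustration of the schema-first, shared-vocabulary idea being described, the sketch below defines one reusable type, references it from an entity schema, and validates instances with the `jsonschema` Python library. The type and field names are invented for this example; they are not the actual OpenMetadata schema definitions.

```python
# Minimal illustration of schema-first metadata with a shared type.
# Names here are invented for the example, not OpenMetadata's schemas.
from jsonschema import validate, ValidationError

# A shared vocabulary type, defined once and reused everywhere, so
# "ownership" means the same thing across every entity that uses it.
SHARED_DEFS = {
    "entityReference": {
        "type": "object",
        "properties": {
            "id": {"type": "string"},
            "type": {"type": "string"},
        },
        "required": ["id", "type"],
    }
}

TABLE_SCHEMA = {
    "$defs": SHARED_DEFS,
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        # Reuse the shared shape via $ref instead of redefining it.
        "owner": {"$ref": "#/$defs/entityReference"},
    },
    "required": ["name"],
}

good = {"name": "trips", "owner": {"id": "u1", "type": "user"}}
bad = {"name": "trips", "owner": "alice"}  # wrong shape: not a reference

validate(good, TABLE_SCHEMA)  # passes silently
try:
    validate(bad, TABLE_SCHEMA)
except ValidationError as err:
    print("caught malformed metadata:", err.message)
```

Because the `owner` shape lives in one place, every entity that references it agrees on what ownership looks like, which is the vocabulary consistency discussed above.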
So JSON schema has been an amazing
[00:36:15] Unknown:
building block for us, and that choice has worked out greatly for us. Yeah. This also works into our, you know, vision of, like, building the standard. Right? So when you build based on JSON Schema, it's easy for us to build language bindings and others. So we can build Python, we can build Go, Java, whatnot. So what that drives is, like, the integration play. Now you have a centralized metastore. You can actually build services just by embedding a client there. Whether you're reading data quality or you're reading some tags around the metadata itself, you can easily embed all of this data in other services.
So that's a great investment we actually made into OpenMetadata there. Talking about the integration points, I'm wondering if you can discuss
[00:36:58] Unknown:
the process for an organization that already has an existing data platform. They have a number of different point to point integrations across their various systems. Each of those has some level of control over their own metadata. Maybe they're using a data catalog or a system like Amundsen or DataHub to try to build out this metadata graph. What do you see as the unique benefits that open metadata provides in that environment, and the process to be able to actually start connecting the entirety of their platform into the open metadata system to start realizing the benefits of this unified view of all of the context of their data across those different boundaries, both technical and organizational?
[00:37:38] Unknown:
Yeah. Open metadata can centralize all the metadata in a single place. Right? Now our vision for open metadata is not to be a data catalog. Right? It is more than that. It has to be a collaboration point for all the users. Right? Once you collect all the data context, then users can come around it and, within the metadata system, collaborate with each other. That way, a lot of user generated, you know, discussions, knowledge, and all of those things can also be captured as metadata. So we want to go beyond a data catalog. A data catalog is a simple application of open metadata. But then people collaboration is what we are focusing on. That can reduce the friction of the multiple tools that are there today within an organization.
The second thing is the data context that we have provided is not just for people. You can also use it for building tools around it. Right? Now specifically to your question, I think the Amundsen and DataHub folks are centralizing the metadata quite well. But I think the differences that I called out are the metadata APIs, right, the metadata schemas. We believe that will make it easy for you to build things around metadata using open metadata. And then finally, right, they can also coexist. Different tools can bring different functionality. So these tools can coexist. We can integrate with each other and, you know, maybe capture the metadata in another catalog and then centralize it as well, because there's a lot of metadata that is already generated in some of those systems. Right? That needs to be brought in as well. So we'll build those integrations.
[00:39:18] Unknown:
Yeah. So a couple of points I would like to make here: what we're building is the foundation layer of the platform itself. So discovery, lineage, quality become applications on top of it. So if you look at, like, what Amundsen is doing, a cataloging and discovery experience could be built on top of the metadata. The benefit of the platform itself is, again, as we said earlier, like, you know, it doesn't need to be isolated. Right? You build this foundation layer. You build discovery and other things. Other experiences, quality, you know, can come on top of that. So that actually is a foundational play for organizations to use.
[00:39:57] Unknown:
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at data engineering podcast.com/hightouch. In terms of the technical foundation of this platform of open metadata and the reference implementation that you're building, I'm wondering if you can discuss how you're actually approaching that construction. I know you mentioned that you've invested very heavily in JSON Schema as the core sort of building block for it. I'm wondering if you can talk about the overall architecture and maybe some of the ways that you are working to delineate the schema specification from the reference implementation and sort of the overall goals of that reference implementation.
If it is intended to be kind of the canonical point of if you want to use open metadata, this is the tool that you use, or this is an example, but these are the core APIs and this is the more important piece of it. And just maybe talk about the technical and architectural design and the sort of schema and specification design and how those 2 are playing off of each other.
[00:41:23] Unknown:
Let's start with the architecture of OpenMetadata itself, and then we'll think about the clients. When we built OpenMetadata, you know, we did consider open sourcing what we had built at Uber and starting with that. There are a couple of choices that we made. 1 is, when systems are built in large companies that have a large number of developers and some people to support them and all of those things, certain choices you make in terms of architecture are not gonna work for small companies. For example, when we built Apache Atlas, we ended up using JanusGraph (Apache Titan at that point in time), and then HBase, and then HDFS for storage of metadata.
Just operationalizing this, just for metadata, is a nightmare. Right? The second thing is we did not add Kafka as a requirement or a graph database as a requirement, because not many people want to operationalize those just to have a metadata system. However, our system is extensible to include Kafka where it is required for publishing metadata changes in real time. But it is not a must have, right, for everybody. Right? And so: fewer dependencies, fewer moving parts. And when we made the choice of what systems we would build on, we made the choice of something that is well understood, well known; people know how to operationalize them. So from that perspective, the things that we depend on are MySQL and Elasticsearch.
Nothing else. If you are a big company, you can operationalize it. You can add Kafka and things like that, but it is not required for most companies. Thus making the operational complexity and the number of moving parts less. The second thing is looking at our own open metadata implementation. As you said, we start with schemas. And using schemas, we generate all the code. And so for providing REST APIs, we use Jetty, DropWizard, and JDBI for writing to MySQL. So this is a fairly standard, well known stack. So that's what we use, and that provides the REST APIs. And all the REST APIs are provided such that the API parameters and the request responses, everything, come off of JSON Schema. Right? And then the second thing is we built an ingestion framework using Python.
And it follows the usual, you know, connect to a source and then get the data, then process it, put it into a different shape, and then sink, which is writing it back to the metadata system using our APIs. So I wanna call out an important distinction here. A lot of systems end up doing ingestion directly: they write it into the database, and they've also exposed their database APIs, let's say graph database APIs, as the APIs for accessing metadata. That ties you into the implementation detail. You want to access everything through public interfaces, not go directly to implementation details. And many of the projects are finding that some of the graph databases that they have used are becoming issues for some of the, you know, users that are using their system. And then when you try to change the implementation detail, you're gonna affect a lot of other things. So we do ingestion through our open APIs.
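A bare-bones sketch of that source, process, sink pattern might look like the following. The connector logic is stubbed out and the endpoint is an assumption, but the shape is the point: the sink talks to the public REST API rather than writing to the database underneath it.

```python
# Sketch of the source -> process -> sink ingestion pattern described
# above, writing back through public APIs rather than to the database.
# Connector details and the endpoint are assumptions for illustration.
import requests

METADATA_API = "http://localhost:8585/api/v1"  # assumed base URL

def source(connection_uri: str):
    """Connect to a source and yield raw table records (stubbed here)."""
    yield {"table": "trips", "columns": [{"name": "trip_id", "type": "BIGINT"}]}

def process(record: dict) -> dict:
    """Reshape a raw record into the metadata entity form."""
    return {"name": record["table"], "columns": record["columns"]}

def sink(entity: dict) -> None:
    """Write the entity back through the public REST API."""
    requests.put(f"{METADATA_API}/tables", json=entity).raise_for_status()

def run_pipeline(connection_uri: str) -> None:
    for record in source(connection_uri):
        sink(process(record))
```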
And then we are using TypeScript and, yeah, all the, you know, regular, newer JavaScript frameworks to build our UI. And then, again, the UI uses JSON Schema to generate TypeScript. So that's the architecture. And then 1 thing I want to call out is our ingestion framework, which is written in Python, is flexible. We have put in a lot of effort to write it in an extensible manner. That has helped us a great deal. Right? We can actually add an integration in under a day. Right? And we already have 20 integrations now, for a young project, starting from data warehouses, all the popular data warehouses, popular transactional databases, dashboard systems, and then the, you know, workflow systems, Kafka included.
We have a lot of integrations. Now let's think about, from the client side, how they can use open metadata. Instead of writing code where if the key name is equal to blah, I expect a certain value, and then you cast the value to some, you know, thing that you're expecting, and then if anything changes, your debugging becomes run time debugging; instead, the power of JSON Schema is there are enough tools that, as Harsha was saying previously, you can generate language bindings in any language of your choice. And, you know, you can generate your code and use that code in a lot of the things that you are building. If you regenerate the code, you will find compile time errors instead of run time errors. Right? So that is for the clients.
So clients can use our JSON schemas and embed our JSON schemas in their JSON schemas, because, you know, JSON Schema has a great importing mechanism through references. So they can use our JSON schemas in their JSON schemas and generate code using our JSON schemas. So they can take our schema models and make them their own. And any project can make these their own. They can take the vocabulary if they like. They can reuse the vocabulary, the types and entities and all of that. Very well defined, well documented. People can use any of the things that we have. You can use it from the web: you point it to an HTTP URL. You know, you can use JSON keywords.
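For example, a downstream tool's own schema could pull in a published entity definition by reference rather than redefining it. This is a sketch, and the URL is a placeholder for wherever the schemas you depend on are actually hosted:

```python
# Sketch: composing your tool's schema out of a published schema via
# JSON Schema's $ref. The URL below is a placeholder, not a real host.
MY_TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        # Reuse the shared table definition instead of redefining it.
        "target": {"$ref": "https://example.com/schemas/entity/table.json"},
        # Add only the metadata this tool is responsible for.
        "qualityScore": {"type": "number", "minimum": 0, "maximum": 1},
    },
}
```

Generators such as datamodel-code-generator can then turn composed schemas like this into typed Python or TypeScript models, which is how a schema change shows up as a compile-time error rather than a run-time surprise.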
[00:46:56] Unknown:
Yeah. So 1 of the important things: when systems are getting built at big companies and getting open sourced, there is a blind side to how they are designed. Like, for example, at Uber, there's a Kafka team as infrastructure, there are Hive and HDFS teams as infrastructure, just to keep those services up and running. So, you know, anyone who's building an application inside Uber thinks Kafka is just there: I can just publish to it, it's free, because they're not investing in it, they're not worrying about keeping it up and running. So when we were actually designing open metadata, we wanted to pay attention to the rest of the industry, who have to bring it up and running themselves and don't want to say: hey, you know, just to get the metadata system up and running, we need to get Kafka and a graph database and everything else.
And we want to especially call out that, you know, this is without sacrificing any scalability aspects. A lot of designs go above and beyond and try to include complex projects, complex services, just in the name of scalability. But we actually did, you know, take extra steps there and built a sandbox with 1,800,000 entities. These are, like, 1,000,000 table entities, 500 dashboards, 100 topics, and so on, along with 4,500,000 relations, within a single instance of open metadata. And we are not overly trying to tune it; this is the default configuration that we ship. And we are able to demonstrate that, you know, and publish the benchmarking, with hundreds of users simultaneously accessing the sandbox.
So it scales well, and, you know, special attention is given to simplicity and the ability to keep it up and running with minimal effort in production.
[00:48:32] Unknown:
And the point of scalability is interesting because there are a number of different ways to think about that problem, where 1 is the volume of data, the number of users as you were just discussing, but another is the scaling of organizational and cognitive complexity: how do you structure any, you know, siloing of metadata, if that's necessary within your organization, to be able to represent the different sort of geographic boundaries or business domains, and maybe being able to federate installations of open metadata to be able to do discovery of data assets across those organizational boundaries. Where maybe 1 business unit says, this is our open metadata installation that has everything to do with the data contained within our platform for our particular business unit. But then, you know, there's a different unit within the enterprise that says, we have our own installation of open metadata, but we need to be able to do discovery across those 2 because there are some interchanges of, you know, we're handing off, you know, a CSV file over FTP. We need to make sure that the structures of the way that we're exporting the data match with the way that you're importing the data, and being able to handle some of those, you know, complexities of scale. That's a great question. I think for scalability itself, I think we must be able to
[00:49:48] Unknown:
handle the scalability of most companies on the planet, barring 1 or 2. So scalability is not a challenge. However, organizational boundaries. Right? Maybe, you know, the servers and systems that some part of the organization is using are in a certain VPC, and they don't give access from another VPC, things like that. That is an issue for very large organizations with a lot of lines of business and sub-orgs. What we think we could do there is, you know, this scales. Right? You can have different open metadata installations, but then you can still centralize the metadata that you choose to centralize into a single metadata store. That way the entire organization's, right, the entire company's data, how you are doing with the data, is visible. Right? So there's a concept of, you know, domains and, you know, all kinds of things that are coming, right, as an organizational unit. You should be able to have multiple open metadata installations, and then, you know, you can ship that metadata into a centralized place.
That way, it's the metadata being integrated into a central place, rather than having to integrate the central OpenMetadata with all of your systems and expose those systems. So you could do that. That is certainly an option where these restrictions exist within a company.
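To make that federation concrete, here is a minimal sketch of shipping metadata from a business unit's installation into a central store over the REST API. The endpoint paths, auth header, and paging fields here are illustrative assumptions, not OpenMetadata's documented contract:

```python
# Sketch: periodically copy table entities from a business unit's
# OpenMetadata instance into a central one. Hostnames, token, and
# payload fields are hypothetical placeholders.
import requests

LOCAL = "https://metadata.unit-a.example.com/api/v1"
CENTRAL = "https://metadata.central.example.com/api/v1"
HEADERS = {"Authorization": "Bearer <token>"}  # hypothetical auth token

def sync_tables(batch_size: int = 100) -> None:
    """Page through the local instance's tables and upsert them centrally."""
    after = None
    while True:
        params = {"limit": batch_size}
        if after:
            params["after"] = after  # cursor-style paging, assumed
        resp = requests.get(f"{LOCAL}/tables", params=params, headers=HEADERS)
        resp.raise_for_status()
        body = resp.json()
        for table in body.get("data", []):
            # PUT as create-or-update so repeated syncs stay idempotent.
            requests.put(f"{CENTRAL}/tables", json=table, headers=HEADERS).raise_for_status()
        after = body.get("paging", {}).get("after")
        if not after:
            break

if __name__ == "__main__":
    sync_tables()
```

The key design point is that metadata flows outward to the central store on a schedule, so the central instance never needs network access into each unit's VPC.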
[00:51:04] Unknown:
In terms of the organizational complexity of modeling those different business domains, as well as all of the schematic elements of the data that you're dealing with, I'm wondering what some of the challenges have been in balancing the generality and flexibility of the system against being specific and appropriately constraining in the types of metadata you consume, so that you can enforce a meaningful and useful structure without it just being "you can write any key-value you want."
[00:51:40] Unknown:
I think we talked about, let's say, Apache Atlas as an example. That is the reason why a metadata system must model all the common entities. And when new entities come up, my thought process is: work with the community to add a new entity, because if you have a need, most likely other companies have that need as well. So get the entities that are required standardized in the metadata system and make them available. Otherwise, if you keep modeling your own metadata entities, they model the same entity concept, but the model looks different, and that runs against the OpenMetadata standardization. And then tools have to deal with varieties of metadata. The other thing I would say is that we ship some tags and categories off the shelf.
We would like companies in different industry segments to work with us on defining their tag vocabularies. That way, if they collaborate to write these tag categories and business glossaries, the result becomes available for somebody else, who can then add to it. So there is a way to standardize even some of the business vocabulary as open source. But so far, our approach has been strongly typed entities: make entities available to capture all your use cases. If any extensibility is required, like I was telling you earlier, the core aspects of metadata should be common. The core attributes of an entity should be common.
There might be extension points that are specific to your organization or your group, but those won't be commonly used. Those extension points, we will make available.
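As an illustration of that "common core plus extension points" idea, here is a minimal sketch in Python. The field names are simplified stand-ins, not OpenMetadata's exact entity schema:

```python
# Sketch: every entity shares strongly typed core attributes, while
# org-specific fields live in a clearly separated extension map instead
# of reshaping the shared model. Field names are illustrative.
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class EntityReference:
    id: str    # unique identifier of the referenced entity
    type: str  # e.g. "user" or "team"

@dataclass
class Table:
    id: str
    name: str
    fullyQualifiedName: str  # e.g. "snowflake.sales.public.orders"
    description: Optional[str] = None
    owner: Optional[EntityReference] = None
    tags: List[str] = field(default_factory=list)
    # Extension point: org-specific attributes that are not part of the
    # shared standard, so common tooling never depends on them.
    extension: Dict[str, Any] = field(default_factory=dict)
```

Tools built against the standard can rely on the core attributes being present and identically shaped everywhere, while each organization keeps its custom fields out of the shared contract.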
[00:53:14] Unknown:
Going back to what we were discussing earlier: there are so many different layers to the data ecosystem, and businesses are building special-purpose tools that are all, at their core, built around different mutations, manipulations, and interpretations of metadata, with data quality being one category, along with discovery and data governance. They are all using metadata about your systems to build their specific capabilities. I'm wondering what you see as the role of OpenMetadata across that ecosystem of projects, and some of the potential for simplifying integration across those systems. Maybe there are particular industry verticals that become obviated if OpenMetadata becomes ubiquitous, and maybe there is potential for new efforts and business domains to grow on top of this shared specification?
[00:54:18] Unknown:
Yeah. What I feel is that, instead of tools becoming obviated, tools become a lot simpler. If the metadata already exists, a lot of these tools can be built quickly, and that will increase the competition in this space. The ones building the best tools, the best experiences, the best solutions to the problems, they will win. Versus today, where a lot of the metadata is trapped in proprietary formats, and just entering that kind of deployment is a lot harder, because in order to just become a quality tool, you first have to crawl and integrate with everything. What if that metadata were shareable? Then there could be a lot of quality tools, and a lot of those tools could be a lot simpler. The barrier to building a solution around metadata decreases significantly, and we believe that will actually foster innovation.
This helps in making our data tool space better. So the whole idea of OpenMetadata is: don't let your metadata, which is a significant asset, probably more important than your data itself, sit trapped in vendor-proprietary formats in certain systems. Set it free. Make it easily shareable and available so that innovation can happen. I believe this can foster innovation.
[00:55:49] Unknown:
As you have been building the OpenMetadata project, working to grow the community, and exploring this space anew, I'm wondering how that has influenced or shifted your perspective on the potential available in this space and the challenges inherent to metadata management, and maybe some of the ways that your initial ideas and assumptions about the goals and capabilities of this project have been challenged as you have started to build out the implementation.
[00:56:21] Unknown:
Yeah. So if you look at the metadata space today, metadata is only used for discovery and, to a certain extent, governance. But there are a lot more applications of metadata, as we discussed. There has been an overwhelming response to our OpenMetadata announcement, where people have come back to us saying, hey, metadata is not a solved problem; there are a lot of applications of metadata that are possible. As you are saying, we completely concur. A lot of people have also come back saying that many metadata systems are just called catalogs. There's a realization that a catalog is just one application of metadata. So these are things we have discovered as we worked through the project, not only as our own thoughts, but as thoughts reflected back to us by many people in the community.
[00:57:17] Unknown:
Yeah. And we have seen quite a few community members coming in, adding new entities, and taking a liking to the ingestion framework and how easy it is to add a new entity and get ingestion up and running. So in a way, it's a plus-one vote: hey, this is great, this is what we wanted, and the APIs and schemas are looking great. Those were the stumbling blocks before to getting any integration working with the metadata we already had. And this is a project that has the right goals and the right direction, and it is going the right way. So far, it has been more of: this is great, we want to start using it and start contributing to it.
[00:58:15] Unknown:
So for people who are looking to simplify the management of their metadata across their data platforms and data systems, what are the cases where OpenMetadata is the wrong choice and they might be better served by a data catalog like DataHub, or one of these other solutions for metadata management in a limited domain?
[00:58:19] Unknown:
You know the answer to this: OpenMetadata is never a bad choice. Where I think things could be different is, one, today, organizations are not managing their metadata at all. Metadata is at the center of data mesh, data culture, data observability, a whole bunch of things that people are grappling to express as a clear problem statement. But metadata is at the center of it. Where I think things can be different is, if you have a centralized metadata system and somebody else builds a tool that is delightful, that solves your problem better, different tools can be used. Hopefully they are thin tools, not ones building their own separate metadata systems.
So there are certain experiences, the user experience and the collaboration experience, that come with OpenMetadata. But if somebody builds something much better than that, something very focused on a specific persona, those tools should be adopted.
[00:59:23] Unknown:
Yeah. So when organizations go down this road, they tend to go in a serial fashion. They realize: I have a day-one problem of discovery, so let me bring up a catalog. Now I have a catalog, but I have a quality problem, so I'm going to get another tool. Now I have governance, and so on. What we are saying is: hey, we have seen these problems many times, and eventually you end up with siloed tools solving these problems repeatedly. Invest in OpenMetadata instead. You get the metadata platform, the schemas, and the APIs, and all of these experiences become part of it, rather than arriving in a siloed fashion. So if you invest in OpenMetadata, your vision of realizing an entire data culture will become a reality as we improve this product and continue to iterate on it.
[00:59:59] Unknown:
And as you continue to iterate on the problem and build out the OpenMetadata specification and implementation, what are some of the things you have planned for the near to medium term, or any upcoming projects that you're particularly excited about?
[01:00:19] Unknown:
So lineage is something that we are working on; there's a lot of work going on for our 0.5 release. We started off with lineage from different workflow systems as part of the system, and then we are building versioning and eventing. What we believe is that today, metadata versioning is not tracked at all. As the metadata changes, it reflects how your particular dataset is changing: the owner, the columns, the description, and things like that. So we are building a feature that we are super excited about: you will have versioned metadata.
You will be able to go through what has happened in the life cycle of a table, starting from day one when it got created: how it changed over a period of time, what tags got added, what tags got deleted, what the previous description was, and how the description improved, through versioning of every entity. And from that versioning of every entity, we will generate change events, which can be used for building bots and applications where you subscribe to a certain type of change: the table got created, the table got a PII tag added, table ownership changed. Then you can start building your own internal workflows to react to how the metadata is changing.
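As a rough illustration of what such a workflow might look like, here is a minimal sketch of a webhook that reacts to change events, assuming the eventing feature delivers JSON payloads to an endpoint you register. The payload shape (eventType, addedTags, entityFQN) is an illustrative assumption, not the published event schema:

```python
# Sketch: react to metadata change events delivered as JSON over HTTP.
# Event field names are hypothetical placeholders for illustration.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

class ChangeEventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))

        # React to the kinds of changes mentioned in the conversation.
        if event.get("eventType") == "entityCreated":
            print(f"New {event.get('entityType')} created: {event.get('entityFQN')}")
        elif "PII" in event.get("addedTags", []):
            # e.g. notify the governance team, open a ticket, etc.
            print(f"PII tag added to {event.get('entityFQN')}")
        elif event.get("eventType") == "ownershipChanged":
            print(f"Owner changed on {event.get('entityFQN')}")

        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), ChangeEventHandler).serve_forever()
```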
So those are the features, and then we'll continue to build a lot of integrations with the tools available in the data ecosystem. But where we are right now in terms of user experience: we have discovery, which is the starting point; you must discover your data as a first step. And we have built an experience where you can have great descriptions and things like that, so people can understand what they have discovered. Next we'll start building experiences where people can collaborate, where they can understand the data better through questions, comments, and feedback, and where they can ask for features, additional columns, things like that.
That is for people: the collaboration experience. Then we'll also start doing some reliability work. There are tools that tell you whether a schema change is backward compatible or incompatible. Now that we are versioning, we'll be able to let people know: if you depend on this data, it has a backward-incompatible change. Some of the observability aspects: we can already say what your data distribution looks like today, which is a feature we support, but then we'll start using that information to say there is something missing here. Maybe the number of rows coming into the table dropped significantly; maybe there is missing data. So some kind of alerting like that. And then we'll also start adding support for test metadata, so that testing and quality tools can integrate with OpenMetadata. In a nutshell, data catalogs have gotten stuck at discovery. Go beyond discovery to collaboration and then automation.
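To make the backward-compatibility idea concrete, here is a minimal sketch of comparing two versions of a table's columns and flagging changes that would break downstream consumers. The column shape is simplified for illustration and is not OpenMetadata's exact schema:

```python
# Sketch: diff two versions of a table's column list and report
# backward-incompatible changes (removed columns, changed types).
from typing import Dict, List

def breaking_changes(old_cols: List[Dict], new_cols: List[Dict]) -> List[str]:
    """Return human-readable descriptions of backward-incompatible changes."""
    old = {c["name"]: c for c in old_cols}
    new = {c["name"]: c for c in new_cols}
    problems = []
    for name, col in old.items():
        if name not in new:
            problems.append(f"column '{name}' was removed")
        elif new[name]["dataType"] != col["dataType"]:
            problems.append(
                f"column '{name}' changed type "
                f"{col['dataType']} -> {new[name]['dataType']}"
            )
    return problems

# Example: the new version dropped a column and retyped another.
v1 = [{"name": "id", "dataType": "BIGINT"}, {"name": "email", "dataType": "VARCHAR"}]
v2 = [{"name": "id", "dataType": "VARCHAR"}]
for issue in breaking_changes(v1, v2):
    print(issue)
```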
And then the second thing, one of the automations we also want to do: in a lot of data catalogs, description and tagging work is driven by somebody at scale who says, you need to complete this by this date. People tag the data, and after that, they forget about keeping it up to date. Then your descriptions are stale, your metadata is stale, your tags are not correct. What we did at Uber was build a tool that, on a weekly basis, for every data owner, would say: your description coverage is this, your quality coverage is this, and you are doing well, or poorly, against the SLA you defined. That was a way of constantly nudging people, without big mandates and two weeks of doing things with fanfare, toward continuous improvement of data as a culture. So we'll build some automations where people get both positive and negative feedback: your descriptions are missing here, versus, hey, you did great, your description coverage is in the top such-and-such percentage within the organization.
Data is a thankless job for most of the people doing it. We want to bring some joy to the people working on data through these kinds of experiences, where the organization's data continuously improves through quality feedback.
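As a small sketch of what that weekly nudge could compute, here is description coverage per owner over a set of table entities. The entity shape (owner and description fields) is simplified for illustration:

```python
# Sketch: compute each owner's description coverage and produce a
# positive or negative nudge, in the spirit of the weekly report
# described above. Entity fields are hypothetical placeholders.
from collections import defaultdict
from typing import Dict, List

def description_coverage(tables: List[Dict]) -> Dict[str, float]:
    """Map each owner to the fraction of their tables that have a description."""
    owned = defaultdict(list)
    for t in tables:
        owned[t.get("owner", "unowned")].append(bool(t.get("description")))
    return {o: sum(flags) / len(flags) for o, flags in owned.items()}

tables = [
    {"name": "orders", "owner": "sales-team", "description": "Customer orders"},
    {"name": "orders_raw", "owner": "sales-team", "description": None},
    {"name": "rides", "owner": "trips-team", "description": "Completed rides"},
]
for owner, coverage in description_coverage(tables).items():
    nudge = "great job!" if coverage >= 0.9 else "descriptions are missing"
    print(f"{owner}: {coverage:.0%} description coverage; {nudge}")
```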
[01:04:35] Unknown:
Well, for anybody who wants to get in touch with you, follow along with the work that you're doing, and get involved with the project, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as the biggest gap in the tooling or technology that's available for data management today.
[01:04:57] Unknown:
What I feel is that, more than a gap, there are too many tools that don't talk to each other. And I think it's just like platforms: billions of dollars went into platforms, and now winners are emerging and consolidation is happening. In the tooling space there are likewise too many tools; winners will emerge and consolidation will happen, so that the confusing noise dies down and clarity comes through.
[01:05:17] Unknown:
I think the gap is open metadata itself. We'd be thrilled to have your listeners, your audience, come join us to build this. This is one of the foundational things that we're building, and the reason to keep it open source is to bring others into the community. So we'll be really thrilled to have your audience come over and work with us. Metadata should be a solved problem, and let's not keep rebuilding the same thing. Let's move to the next level of innovation.
[01:05:40] Unknown:
Absolutely.
[01:05:46] Unknown:
Well, thank you both very much for taking the time today to join me and share the work that you're doing on OpenMetadata. It's definitely a very interesting project, and I look forward to seeing it succeed, grow, and be adopted. I appreciate all of the time and energy you're putting into it, and I hope you enjoy the rest of your day. Thank you. Thank you, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Sriharsha Chintalapani's Background
Suresh Srinivas's Background
Introduction to OpenMetadata
Challenges in Metadata Collection and Organization
OpenMetadata's Approach and Goals
Benefits and Integration of OpenMetadata
Technical Foundation and Architecture
Scalability and Organizational Complexity
When OpenMetadata Might Not Be the Right Choice
Future Plans and Exciting Projects
Biggest Gaps in Data Management Tooling
Closing Remarks