Summary
As data architectures become more elaborate and the number of applications of data increases, it becomes increasingly challenging to locate and access the underlying data. Gravitino was created to provide a single interface to locate and query your data. In this episode Junping Du explains how Gravitino works, the capabilities that it unlocks, and how it fits into your data platform.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Your host is Tobias Macey and today I'm interviewing Junping Du about Gravitino, an open source metadata service for a unified view of all of your schemas
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Gravitino is and the story behind it?
- What problems are you solving with Gravitino?
- What are the methods that teams have relied on in the absence of Gravitino to address those use cases?
- What led to the Hive Metastore being the default for so long?
- What are the opportunities for innovation and new functionality in the metadata service?
- The documentation suggests that Gravitino has overlap with a number of tool categories such as table schema (Hive metastore), metadata repository (Open Metadata), data federation (Trino/Alluxio). What are the capabilities that it can completely replace, and which will require other systems for more comprehensive functionality?
- What are the capabilities that you are explicitly keeping out of scope for Gravitino?
- Can you describe the technical architecture of Gravitino?
- How have the design and scope evolved from when you first started working on it?
- Can you describe how Gravitino integrates into an overall data platform?
- In a typical day, what are the different ways that a data engineer or data analyst might interact with Gravitino?
- One of the features that you highlight is centralized permissions management. Can you describe the access control model that you use for unifying across underlying sources?
- What are the most interesting, innovative, or unexpected ways that you have seen Gravitino used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Gravitino?
- When is Gravitino the wrong choice?
- What do you have planned for the future of Gravitino?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- Gravitino
- Hadoop
- Datastrato
- PyTorch
- Ray
- Data Fabric
- Hive
- Iceberg
- Hive Metastore
- Trino
- OpenMetadata
- Alluxio
- Atlan
- Spark
- Thrift
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey, and today I'm interviewing Junping Du about Gravitino, an open source metadata service for a unified view of all of your schemas. So, Junping, can you start by introducing yourself?
[00:00:29] Junping Du:
Thanks, Tobias. This is Junping Du, and I have over 15 years working in open source and the data industry. I served at companies such as VMware, Hortonworks, and Tencent. I used to be a Hadoop guy: a long-term contributor, committer, and also a release manager of Hadoop. So I'm a super fan of open source technologies, especially in the data and AI area. Now we are building a startup called Datastrato, which we started in 2023, and Gravitino, an open source data catalog, is our main focus. We try to break down all kinds of data silos, whether they come from different data lakes, different cloud vendors, or the separation of data and AI in the software stack.
[00:01:23] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:26] Junping Du:
Oh, I still remember. That was over 13 or 14 years ago, when I started to look at how Hadoop worked while I was at VMware. I thought, that's great: at VMware we were doing virtualization, but now a lot of data could be stored together, with thousands of machines working as one. It's fantastic. With virtualization we try to break a big machine into pieces, but Hadoop tried to merge a lot of data, engines, and machines into one giant machine. I thought it was a very cool technology, and I tried to combine the two. That's why I started an internal VMware project on how to run Hadoop efficiently and scalably on VMware's virtualization technology. That was my start, moving from cloud technology to data technology.
[00:02:37] Tobias Macey:
Now digging more into Gravitino, I'm wondering if you can give a bit of an overview about what it is and some of the story behind how it came to be and why you decided that it was worth your time and energy to invest in building this technology.
[00:02:50] Junping Du:
Of course. As I mentioned, I worked on Hadoop technology for a long time, and afterwards I worked on cloud data warehousing and building cloud data lakes. Along the way we made several interesting findings. First is the data silo problem. That doesn't only mean data siloed across different databases and data warehouses; sometimes it's siloed across multi-cloud or hybrid cloud scenarios. We also noticed that the revolution in generative AI created new patterns for accessing data, especially unstructured data. Previously, the use of structured data was planned in advance, so we could do a lot of data preparation ahead of time. Now much more data access is ad hoc and on demand.
And third, we found that data technology is actually growing a little more slowly than GPU power and large language model technology. We definitely don't want data to become the bottleneck in this large language model evolution, so we want to join the effort to accelerate it. We think this is like a Tesla moment for data platform technology: the autopilot is moving faster than the engine, and the data platform is at the same stage now. So we asked ourselves, what is the intelligence about the data?
I believe any data engineer, any data person, would say it's the metadata. Right? It has all the knowledge about the data. So that's why we are trying to build a metadata lake, whether it's for transactional databases, analytics, or AI workloads. That metadata lake is Gravitino.
[00:04:57] Tobias Macey:
You mentioned the metadata catalogs that are in use right now for different query engines for tabular data. You also mentioned the application of Gravitino to unstructured data sources for use in more AI oriented workloads. And I'm wondering if you can talk to some of the ways that you think about the expansion of that metadata store beyond just the tabular structures and beyond the constraints of a single database engine or query engine?
[00:05:28] Junping Du:
Yeah, I can give an example. With tabular data, you manage everything with tables, right? Table schemas. Now we also support fileset management, a kind of dataset management. So if PyTorch wants to consume some data sources or datasets, we can manage them, and we can serve PyTorch so it reads them directly. With this, we can manage structured and unstructured data in the same place. Then the data engineers will know how the AI team, the AI engineers, consume this kind of data. They can do access control; they can monitor how the AI team is using the data. They'll know, okay, some of this unstructured data is not useful anymore because no models consume it anymore.
So they can do better cost saving, data quality monitoring, things like that.
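To make the fileset idea concrete, here is a minimal sketch of the pattern being described: training code asks a catalog for a fileset by logical name instead of hard-coding storage paths, and the catalog records who read what. All names here (the `FilesetCatalog` class, the identifiers, the paths) are illustrative, not Gravitino's actual API.

```python
class FilesetCatalog:
    """Toy metadata service mapping logical fileset names to storage locations."""

    def __init__(self):
        self._filesets = {}    # identifier -> storage location
        self._access_log = []  # (identifier, consumer) pairs

    def register(self, identifier: str, location: str) -> None:
        self._filesets[identifier] = location

    def resolve(self, identifier: str, consumer: str) -> str:
        # Recording who reads what is what later enables access auditing
        # and "no model uses this anymore" cleanup.
        self._access_log.append((identifier, consumer))
        return self._filesets[identifier]


catalog = FilesetCatalog()
catalog.register("lake.ai.training.images_v2", "s3://bucket/datasets/images_v2/")

# A PyTorch-style dataset would resolve the path at load time:
path = catalog.resolve("lake.ai.training.images_v2", consumer="resnet-train-job")
print(path)  # s3://bucket/datasets/images_v2/
```

Because every read goes through `resolve`, the access log becomes the raw material for the cost-saving and quality-monitoring use cases mentioned above.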
[00:06:33] Tobias Macey:
And I'm definitely interested in digging more into some of the specifics about how to populate and how to make use of that metadata for unstructured storage. But before we get too deep in the weeds on that direction, I also wanna talk a bit more about some of the problems that you are looking to solve with Gravitino and some of the ways that you have seen teams address those problems in the absence of Gravitino.
[00:06:58] Junping Du:
Yeah. So Gravitino, as I mentioned, tries to unify different data formats and sources across different data engines. We create a new layer that we call a modern open data catalog. I think it's the same concept as a data fabric. With this kind of open data catalog built by Gravitino, no matter where the data lives, on which cloud, in which format, whether it's Hive tables or Iceberg tables or a vector store, you can use a mainstream data engine or AI engine to access it. Not only reads, but also writes, updates, appends, whatever it is. That's the problem we're trying to solve with Gravitino. Previously, a lot of engines and data products tried to build a vertical stack for data, but we would rather build a layer that serves different verticals, so we break down the silos in the access patterns of the data.
[00:08:14] Tobias Macey:
In the category of tabular structures, one of the longest running projects out there, at least for the case where you're not using a specific database engine, is the Hive metastore that was intended to address this cloud data lake or Hadoop style architecture where you have lots of files everywhere. You have different table schemas or table formats. And so the Hive metastore was used as a means of being able to keep track of all of them and to have different methods of populating that metadata, whether it's from Spark or from data crawlers or from API calls. And I'm wondering what you see as the reason that that was such a lasting technology that stuck around long after Hive started to fade away and some of the opportunities that you see for innovation in that metadata layer for being able to improve the overall experience around how to interact with these different data systems?
[00:09:16] Junping Du:
Yeah. The Hive Metastore really lasted for quite a long time. It became the de facto standard for the industry for a very long time, since Hive appeared maybe more than 10 years ago. It lasted because a lot of engines use the Hive Metastore to store all their metadata on top of it. But things have changed in the last few years. More modern data engines try to build their own metadata or catalog, such as StarRocks and some other new engines. And also, the Hive Metastore cannot manage AI capabilities such as unstructured data. It can't manage that directly.
That's the reason: HMS is not a fast-growing community, and it's not solving the problems we're facing today. So we ended up with a choice: either join the Hive community and build something new on top of a lot of legacy code, or start a new project that inherits some capabilities from HMS but extends into brand-new features and a new product. We chose the latter, which is Gravitino: building a new product that is compatible with HMS. I think that's also the logic for some other products: be compatible with HMS while building something new. This is driven by new requirements, but HMS also has a long legacy that has become a standard we have to follow.
[00:11:20] Tobias Macey:
In that general theme of Hive, where it started off as a query engine, it brought along the metastore as a means of keeping track of the different table metadata. Looking at the documentation for Gravitino, it seems that it is also doing a few different things at once, where it has the metadata storage for being able to keep track of tabular and unstructured data so that you know where it lives. You have some capability for data federation for being able to query across these different data sources, similar to Trino or Alluxio. And it also acts as a way to do some measure of discovery of data in the vein of an OpenMetadata or an Atlan. And I'm wondering if you can just talk to some of the ways that you think about how Gravitino overlaps with some of those different areas of functionality and some of the ways that you think about how it can either supplement or potentially even replace various technologies in different use cases.
[00:12:21] Junping Du:
Yeah, I think that's a good question. We do have some overlap with existing categories. Take the Hive Metastore for example. As I mentioned, today we can wrap an HMS. That way we don't have to replace it; instead, we manage it and upgrade it with new capabilities. That's very important, because HMS has served as quite an important component in the open source world for a long time. So we can be compatible first and replace it later, after users feel more comfortable. As for the metadata repository part, OpenMetadata is more like a traditional data governance tool. It can copy metadata from one place to another and do some access control and other data lineage work, but it cannot serve a computing engine directly. Those are different scenarios from Gravitino.
About data federation: we don't do data federation directly, but we support engines such as Trino and Spark doing federated queries over multiple Hive Metastores or multiple data catalogs, crossing different clouds or different data lakes, which is super cool technology on top of that.
[00:13:42] Tobias Macey:
And as you are building Gravitino as it has some of these different capabilities, there's always the challenge of scope creep where you say, oh, it'd be really neat if it could do this thing over here. And then you start to expand the range of things that it can do, which increases the overall load for maintenance and the complexity of the project. And I'm curious how you're thinking about what to explicitly keep out of scope for Gravitino and the things that you definitely don't want it to grow into.
[00:14:14] Junping Du:
Yeah, thanks. I think that's also a good question. We are serving as a unified metadata center, right? That means our scope is already big enough. We don't want to grow the scope into something like a query engine. We don't want to be a query engine project, or to serve the optimization of one dedicated query engine. So we treat Spark, Trino, Dremio, ClickHouse, whatever computing engine we support, as equally important. I think this is a very important, unique value for us compared with other new catalogs that are backed by a single query engine vendor.
I think that's fairly important.
[00:15:04] Tobias Macey:
In terms of the technical implementation, I'm wondering if you can give a bit of an overview of the architectural components of Gravitino and some of the ways that you thought about its design and implementation in order to achieve the goals that you set out with.
[00:15:22] Junping Du:
Yeah. In essence, Gravitino has a layered architecture. The core layer is the abstraction of the catalog layer, which can support a database catalog; a fileset catalog, which is for datasets; a streaming catalog, which supports Kafka and other streaming engines; and a model catalog, which supports managing AI models. We can support more catalog types in the future on demand, but currently this is wide enough to cover the mainstream scenarios we see in the community. Underneath is a data connection layer, which connects to different data sources: filesets, tabular data, streaming data. And on top of the core layer there is an interface layer.
It supports a REST interface, a JDBC interface, and potentially a Thrift interface as well, which lets different computing engines use Gravitino. Those three layers are the core of Gravitino. The topmost layer is a functionality layer, which provides data management and data governance features such as access control, data lineage, data quality, et cetera. That is the technical architecture of Gravitino. We started working on it at the very beginning, before we started the company.
We thought the most important thing was to figure out what to do instead of how to do it. We discussed a lot and decided to build the metadata lake first, because there's no reason for the world to have yet another engine, whether query engine or compute engine; there are too many engines already. What's important is a single layer of metadata that makes the different engines work with different data sources. That makes the design of the data models very important: how to abstract the models for different types of engines and different types of data, especially structured and unstructured data. So we made a very careful and flexible design, and it has continued to evolve to today, so we can support different ways of manipulating and computing on data. That is the core part.
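The interface layer is easiest to picture through its REST surface, which exposes the metalake, catalog, schema, object hierarchy just described. The sketch below builds those endpoint paths; the exact path shapes and the port are assumptions for illustration, so consult the Gravitino REST API reference for the real ones.

```python
# Illustrative helpers that build REST paths for the
# metalake -> catalog -> schema -> table hierarchy.
BASE = "http://localhost:8090/api"

def catalogs_url(metalake: str) -> str:
    # List or create catalogs inside a metalake.
    return f"{BASE}/metalakes/{metalake}/catalogs"

def schemas_url(metalake: str, catalog: str) -> str:
    # List or create schemas inside a catalog.
    return f"{catalogs_url(metalake)}/{catalog}/schemas"

def tables_url(metalake: str, catalog: str, schema: str) -> str:
    # List or create tables inside a schema.
    return f"{schemas_url(metalake, catalog)}/{schema}/tables"

print(tables_url("lake", "hive_prod", "sales"))
```

A computing engine's connector would issue GET/POST requests against URLs of this shape instead of talking to each metastore's native protocol.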
[00:18:27] Tobias Macey:
As far as the role of Gravitino in an overarching data platform architecture, I'm curious if you can talk to the process of integrating it into an existing set of systems, some of the types of technologies that you would want to layer on top of Gravitino to be able to take advantage of the information that it holds, and also maybe some of the ways that you're using Gravitino in your own work at DataStrato.
[00:18:55] Junping Du:
Yeah, of course. At Datastrato, we are building Gravitino; it's our first product and our main focus. Gravitino is a platform service. It can run separately, but it's easy to integrate into a mainstream open data architecture seamlessly. It can support Spark, Flink, Trino, Doris, Iceberg tables, and traditional Hive tables. So if you have an existing system, keep it going, and when you launch Gravitino you'll be surprised to find it works very well with a lot of your existing components. It can merge different metadata views into one single metadata view, and your engines can go through Gravitino to query or compute on data stores and data sources that you previously couldn't attach to. This has been quite an interesting journey for us.
A lot of community users and customers have given us feedback.
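As a sketch of what that integration might look like from the Spark side, the settings below show the general shape of pointing a Spark session at a Gravitino server. The property names and plugin class are assumptions for illustration, not verified configuration keys; check the Gravitino Spark connector documentation for the real ones.

```python
# Hypothetical configuration for exposing Gravitino-managed catalogs
# to Spark. Keys and the plugin class name are illustrative.
spark_conf = {
    # Load the (assumed) Gravitino plugin so catalogs registered in
    # Gravitino appear as Spark catalogs automatically.
    "spark.plugins": "org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin",
    "spark.sql.gravitino.uri": "http://gravitino-server:8090",
    "spark.sql.gravitino.metalake": "lake",
}

# With a real SparkSession, applying it would look roughly like:
#   builder = SparkSession.builder.appName("demo")
#   for k, v in spark_conf.items():
#       builder = builder.config(k, v)
#   spark = builder.getOrCreate()
#   spark.sql("SELECT * FROM hive_prod.sales.orders LIMIT 10")

print(sorted(spark_conf))
```

The point of the pattern is that existing Spark SQL keeps working; the catalog names it references are simply resolved through Gravitino instead of a single hard-wired metastore.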
[00:20:14] Tobias Macey:
Once a team has integrated Gravitino into their architecture and started using it to store, populate, and query metadata, I'm wondering if you can talk to some of the ways that they might interact with Gravitino in a typical workday and how it fits into the overall workflow of using the underlying data that Gravitino points to.
[00:20:37] Junping Du:
Yeah. If you've been using the Hive Metastore for a long time, it's almost seamless compared to your previous experience. Your Spark jobs and your Trino jobs continue to work with your previous data sources, which is fine. And if you're a data engineer working on some ETL, merging multiple tables from different places into a single table and then doing additional work in your data pipeline, you'll find it gets very powerful, because some of the unnecessary ETL can be skipped.
Using Gravitino, you can see all the tables there and build your final table while skipping some of the intermediate tables. That is definitely a change. Additionally, because we build a centralized metadata lake, you know all the access patterns of your data pipelines, and you know how your metadata and data are accessed in different ways. So you can put some monitoring on that. You'll know, okay, some tables, especially intermediate tables, are not needed anymore because no data pipeline is actually using them. You can delete them or drop them.
You can do more fine-grained data lifecycle management on top of that.
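The lifecycle idea above reduces to a simple policy once access logs are centralized: any table no pipeline has read recently becomes a candidate for dropping. A minimal sketch, with illustrative table names and an assumed 30-day retention policy:

```python
from datetime import datetime, timedelta

def stale_tables(last_access: dict, now: datetime, max_age_days: int = 30) -> list:
    """Return tables whose most recent read is older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(t for t, ts in last_access.items() if ts < cutoff)

now = datetime(2024, 6, 1)
last_access = {
    "sales.orders_final": datetime(2024, 5, 30),  # actively read: keep
    "sales.tmp_stage_1": datetime(2024, 3, 2),    # abandoned intermediate table
    "sales.tmp_stage_2": datetime(2024, 2, 11),   # abandoned intermediate table
}
print(stale_tables(last_access, now))  # ['sales.tmp_stage_1', 'sales.tmp_stage_2']
```

In practice the `last_access` map would be derived from the catalog's audit log rather than hand-written, and the candidates would be reviewed before being dropped.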
[00:22:10] Tobias Macey:
And now in terms of the unstructured data flow, tabular data is fairly well understood. Lots of different tools interact with it and interoperate with it. Unstructured data has been around for a long time. People have different solutions, but it has continually been a challenge to work through. And I'm wondering if you can talk to some of the ways that Gravitino helps to address some of those challenges and the workflow for being able to locate and catalog and interact with that unstructured data and managing the organization of those unstructured sources?
[00:22:48] Junping Du:
Of course. Structured data has been managed for a long time; we have many, many years of experience managing that kind of data. For unstructured data, it's quite a new thing. Previously, we used ETL to turn unstructured data into structured data. That's what the Hadoop age did for over 10 years. But today, a lot of AI models actually want to go directly to the unstructured data and consume it in various ways.
That makes the requirements for managing unstructured data harder to meet. Just managing unstructured data with an S3 link or some storage-location link is definitely not enough. So the first thing is we need centralized governance for unstructured data: who can access it? That's number one. Number two: can we attach richer metadata than just a link? Can we have a description of what kind of unstructured data it is, how to use it, and how to make it consumable by your models or your feature stores?
That's number two. The third is how to make your structured and unstructured data work together. Unstructured data isn't only for the training stage; sometimes it's useful in your RAG system, where you leverage your unstructured data to give answers to your questions when you go to the large language model. So you need a centralized way to manage both structured and unstructured data and make them work together. Those are typically the three cases we can see from today's requirements.
[00:25:14] Tobias Macey:
Earlier, you also mentioned being able to have some insight into whether a model or engine is actually accessing some of that underlying data, to determine how long you want to keep it around or whether you can cull that data to save on storage costs. And I'm wondering about the different access data that you are able to use to provide insights to the end user as far as what data is valuable, what data is just taking up space and cost, and some of the higher-order workflows that people are able to build from that visibility.
[00:25:51] Junping Du:
Of course, I think that's important, and some community users already leverage these features. They use the fileset to manage their unstructured data, and they can see the access patterns: how their AI team consumes this data and at what frequency. Based on these statistics, they use different backend storage. They have tiered backend storage: some storage for hot data, some for warm data, some for cold data. So they can move the most frequently accessed data to hot storage.
And unstructured data with no active access is moved to cold storage, which saves a lot of cost in this kind of unstructured data revolution. So this is definitely a showcase of how they leverage these features to achieve fine-grained management of unstructured data.
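The tiering policy described above can be sketched in a few lines: classify each fileset by its access frequency and place it on matching storage. The thresholds here are illustrative; a real deployment would derive them from the catalog's access statistics.

```python
def assign_tier(reads_last_30d: int) -> str:
    """Map access frequency to a storage tier (thresholds are illustrative)."""
    if reads_last_30d >= 100:
        return "hot"    # keep on fast, expensive storage
    if reads_last_30d >= 10:
        return "warm"
    return "cold"       # move to cheap archival storage

# Hypothetical per-fileset read counts collected from the access log:
usage = {"images_v2": 450, "logs_2023": 12, "crawl_2021": 0}
tiers = {name: assign_tier(n) for name, n in usage.items()}
print(tiers)  # {'images_v2': 'hot', 'logs_2023': 'warm', 'crawl_2021': 'cold'}
```

The output of such a policy would then drive the actual data movement between storage backends, which the catalog records by updating each fileset's location.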
[00:27:03] Tobias Macey:
Another piece that you touched on briefly is the governance of that underlying data. And when I was looking at Gravitino, it mentions having a centralized access control capability. And I'm curious if you can talk to some of the permissions management and some of the ways that that feeds into some of the other systems that are relying on Gravitino and just some of the overall challenges of managing some of that permissions and access control across different layers of the data stack?
[00:27:38] Junping Du:
Yep, that's been a real pain point. You had to go to different data engines and set up permission and access control settings in each one. Gravitino is trying to solve that in a more unified way. In Gravitino's design we have separate authentication and authorization workflows, and through authorization we can work with different underlying platforms. In the Hadoop ecosystem you may be using Kerberos; some cloud engines like BigQuery or Snowflake definitely use some IAM-related mechanism.
We work with these different mechanisms so that the token Gravitino takes can be used in the different underlying systems. That's the high-level design principle. With unified, centralized permission management, customers don't have to set up access control in different places. You can set it directly in Gravitino, and after that the permissions, whether for fileset management, table management, or even fine-grained low-level access control, are supported and pushed down into the underlying systems. We are not the storage layer, but we can help set the storage layer's access control. That's the design principle for us.
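A toy model of that centralized approach: grants live in one place, and enforcement (or push-down to each underlying store) works from that single source of truth. Roles, privileges, and object names below are all illustrative, not Gravitino's actual access model.

```python
class AccessControl:
    """Minimal role-based access control with a single central grant table."""

    def __init__(self):
        self._grants = set()  # (role, privilege, securable object)
        self._roles = {}      # user -> set of roles

    def grant(self, role: str, privilege: str, obj: str) -> None:
        self._grants.add((role, privilege, obj))

    def assign(self, user: str, role: str) -> None:
        self._roles.setdefault(user, set()).add(role)

    def allowed(self, user: str, privilege: str, obj: str) -> bool:
        # A push-down layer would translate these same grants into each
        # underlying system's native ACLs instead of checking inline.
        return any((r, privilege, obj) in self._grants
                   for r in self._roles.get(user, ()))


acl = AccessControl()
acl.grant("analyst", "SELECT", "lake.hive_prod.sales.orders")
acl.assign("alice", "analyst")

print(acl.allowed("alice", "SELECT", "lake.hive_prod.sales.orders"))  # True
print(acl.allowed("alice", "DROP", "lake.hive_prod.sales.orders"))    # False
```

The key property is that revoking `analyst` in one place revokes it everywhere, rather than having to chase the same rule through Kerberos, IAM, and each engine's own grant syntax.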
[00:29:25] Tobias Macey:
As you have been building Gravitino, using it in your own use cases, helping to support other people who are adopting it, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:29:41] Junping Du:
That's interesting, because previously we wanted to build some very generic catalog support. We support JDBC; we support REST catalogs in general. But eventually, community users asked: can you build catalogs for data lake formats, such as an Iceberg catalog or a Hudi catalog? So about half a year ago we started building the Iceberg REST catalog capability into Gravitino, and now we support it. You may see two or three other open source data catalogs that can support Iceberg right now, but we did it first, and that's because community users asked for it. As the community continues to grow, we'll find more and more scenarios we can potentially address just by following the community users' needs, and we'll definitely find something new and interesting to follow.
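As a rough illustration of what consuming that Iceberg REST capability might look like from a client, the sketch below builds the catalog properties an Iceberg REST client expects. The endpoint path and port are assumptions for illustration; check the Gravitino Iceberg REST service documentation for the actual URI.

```python
def iceberg_rest_props(host: str, port: int = 9001) -> dict:
    """Build Iceberg-REST-style catalog properties for a (hypothetical) endpoint."""
    return {
        "type": "rest",
        "uri": f"http://{host}:{port}/iceberg/",
    }

props = iceberg_rest_props("gravitino-server")

# With pyiceberg installed, usage would look roughly like:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("gravitino", **props)
#   catalog.list_namespaces()

print(props["uri"])  # http://gravitino-server:9001/iceberg/
```

Because the Iceberg REST protocol is engine-neutral, the same endpoint can serve Spark, Trino, or a Python client without each one needing a bespoke connector.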
[00:30:55] Tobias Macey:
And in your own experience of building this project, working with the community, building this business that relies on Gravitino, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:31:12] Junping Du:
Yeah. To be honest, initially we tried to totally replace HMS. We thought HMS had lasted too long, just as you said. We built a very early-stage initial version that way, but a lot of community users said, hey, we want you to be compatible with HMS, because we've been using it for quite a long time. We may retire it someday, but not now. Can you work with us? So this was unexpected, but we respect what the community requests and requires.
I think that's an important lesson. We also learned a lot of lessons on the AI side. The fileset management feature is not a very complicated technology, but after adding it, we found that a lot of users are really interested in this feature, and they had been suffering from that pain point for a long time. So I think this is something quite interesting as well.
[00:32:32] Tobias Macey:
For people who are looking to either expand or improve their experience of catalog management, what are the cases where Gravitino is the wrong choice?
[00:32:44] Junping Du:
Yeah. If they have very limited data sources and their scenarios are very simple, like they just have one data engine, maybe on top of one cloud, I think they don't have to use Gravitino. And if there's no AI workload involved, just pure data analytics or data engineering, with no governance required, I think Gravitino may not be very useful in that case. It doesn't do any harm, definitely, but it's not very useful.
[00:33:20] Tobias Macey:
And as you continue to build and evolve and improve on Gravitino, what are some of the things that you have planned for the near to medium term or any particular projects or problem areas or features that you're excited to dig into?
[00:33:34] Junping Du:
Yeah. We will continue building more AI capability on top of Gravitino, especially more features for unstructured data management, including lineage. For unstructured data and models, how do we capture the lineage from unstructured data to features, and from features to models? We're also trying to build more governance capability, including data lineage and maybe data sharing at some point, to make Gravitino more useful in a lot of cases.
We really think that in the future, as I just mentioned, data could be a bottleneck for this AI revolution. What does that mean? It means we lack enough data, and we lack enough high-quality data. So part of Gravitino's mission is to unify all the possibly accessible data to make it more accessible, and also to monitor data quality to increase visibility into the quality of the data, whether it's structured or unstructured. I think this mission is quite important to Gravitino. This is also one of the most important reasons we donated it to Apache to become an openly governed Apache project: we want it to be open, and we want it to address the real data challenges of the AI era.
[00:35:20] Tobias Macey:
Are there any other aspects of the Gravitino project, the overall space of catalog metadata, unstructured data management, or the ways that AI workflows are evolving the needs for data cataloging that we didn't discuss yet that you'd like to cover before we close out the show?
[00:35:37] Junping Du:
Yeah, I think we've discussed a lot of this already. We can discuss more in the future once we've made further progress.
[00:35:50] Tobias Macey:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:36:06] Junping Du:
Yeah. I think, definitely, the data people don't know AI very well, and the AI people lack the data technology background and perspective. That, I would say, is today's big gap. So we try to be a unified layer, not only to unify different technologies, but actually to unify the data engineers and the AI engineers, to help the two teams understand each other. Only if they're using a single data platform and shared data tools can they understand each other. If they keep using different tools, how can these two groups of people know each other's work? So this is our mission, and I think it's also our dream.
[00:37:01] Tobias Macey:
Absolutely. Yeah. There's definitely a lot of incidental complexity coming up as a result of the increased usage of AI and machine learning, and the fact that the technology stacks are being built independently and in isolation from each other. There hasn't been a lot of bridging that has happened yet.
[00:37:20] Junping Du:
Of course.
[00:37:21] Tobias Macey:
Of course. Yeah. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on Gravitino. It's definitely a very interesting project, and it's great to see more innovation and investment in this space of cataloging because, as we noted before, it had been stagnant for far too long. So it's great to see you out there helping to push the space forward. Thank you again for your time, and I hope you enjoy the rest of your day.
[00:37:46] Junping Du:
Yeah. Thanks, Tobias. It was a very good conversation with you, and I wish you a good one. Thank you.
[00:38:01] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language and its community, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email host@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey, and today I'm interviewing Junping Du about Gravitino, an open source metadata service for a unified view of all of your schemas. So, Junping, can you start by introducing yourself?
[00:00:29] Junping Du:
Thanks, Tobias. This is Junping Du, and I have over 15 years working in open source and the data industry. I've served at companies such as VMware, Hortonworks, and Tencent. I used to be a Hadoop guy: a long-term contributor, committer, and also a release manager of Hadoop. So I'm a super fan of open source technologies, especially in the data and AI area. Now we are building a startup called Datastrato, which started in 2023, and Gravitino, an open source data catalog, is our main focus: to break down all kinds of data silos caused by different data lakes, cloud vendors, or the separation of data and AI in the software stack.
[00:01:23] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:26] Junping Du:
Oh, I still remember. That was over 13 or 14 years ago, when I started to look at how Hadoop worked while I was at VMware. I was thinking, that's great: at VMware we were doing virtualization, but now a lot of data stored across thousands of machines could work together. It's fantastic. Virtualization tries to break a big machine into pieces, but Hadoop tried to merge a lot of machines into one big, giant machine. I thought it was a very cool technology, and I tried to combine the two. That's why I started an internal VMware project on how to run Hadoop efficiently and scalably on VMware's virtualization technology. That was my start, moving from cloud technology to data technology.
[00:02:37] Tobias Macey:
Now digging more into Gravitino, I'm wondering if you can give a bit of an overview about what it is and some of the story behind how it came to be and why you decided that it was worth your time and energy to invest in building this technology.
[00:02:50] Junping Du:
Of course. As I mentioned, I had been working on Hadoop technology for a long time, and afterwards I worked on cloud data warehousing and cloud data lake building. Along the way we made several interesting findings. First is the data silo problem. That means data is siloed not only across different databases and data warehouses, but sometimes also across multi-cloud or hybrid-cloud scenarios. Second, we noticed that the generative AI revolution has really created new patterns for accessing data, especially unstructured data. Previously, access to structured data was planned ahead, so we could do a lot of data preparation in advance. Now, much more data access is ad hoc and on demand.
And third, we found that data technology is actually growing a little more slowly than, and falling behind, the growth in GPU power and large language model technology. We definitely don't want data to become the bottleneck in this large language model evolution, so we want to join the effort to accelerate it. We think of it as something like a Tesla moment for data platform technology: just as autopilot arrived on top of ever-faster engines, the data platform is at the same stage now. So we asked ourselves: what is the intelligence about data?
I believe any data engineer would say it's the metadata; it has all the knowledge about the data. So that's why we're trying to build a metadata lake, whether the workload is a transactional database, analytics, or AI. That metadata lake is Gravitino.
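Gravitino's documentation describes a multi-level namespace (a metalake containing catalogs, schemas, and objects). The toy sketch below illustrates that "metadata lake" idea in plain Python; it is not the actual Gravitino client API, just an illustration of how one namespace can span heterogeneous sources.

```python
# Illustrative sketch (not the Gravitino client API): a four-level naming
# hierarchy -- metalake -> catalog -> schema -> object -- lets one address
# space cover transactional, analytical, and AI data sources at once.
from dataclasses import dataclass, field


@dataclass
class MetadataLake:
    """A toy 'metadata lake': one namespace over many heterogeneous catalogs."""
    name: str
    catalogs: dict = field(default_factory=dict)

    def register(self, catalog: str, schema: str, obj: str, kind: str) -> str:
        """Record an object (table, fileset, topic, model) and return its
        fully qualified name."""
        self.catalogs.setdefault(catalog, {}).setdefault(schema, {})[obj] = kind
        return f"{self.name}.{catalog}.{schema}.{obj}"


lake = MetadataLake("demo_metalake")
fqn = lake.register("hive_prod", "sales", "orders", kind="table")
print(fqn)  # demo_metalake.hive_prod.sales.orders
```

The point of the hierarchy is that the same addressing scheme works whether the object behind the name is a Hive table, a fileset of images, or a Kafka topic.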
[00:04:57] Tobias Macey:
You mentioned the metadata catalogs that are in use right now for different query engines for tabular data. You also mentioned the application of Gravitino to unstructured data sources for use in more AI oriented workloads. And I'm wondering if you can talk to some of the ways that you think about the expansion of that metadata store beyond just the tabular structures and beyond the constraints of a single database engine or query engine?
[00:05:28] Junping Du:
Yeah. Let me take an example. With tabular data, you manage everything through tables and table schemas. Now we also support fileset management, which is a kind of dataset management. So if, say, PyTorch wants to consume some data sources as datasets, we can manage those and serve them so PyTorch can read them directly. With this, we can manage structured and unstructured data in the same place. Then the data engineers will know how the AI team consumes that data. They can do access control, and they can monitor how the AI team is using the data. They'll know, okay, some of this fileset data isn't useful anymore because no models consume it.
So they can do better cost saving, data quality monitoring, things like that.
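The fileset workflow Junping describes can be sketched roughly as below. The class, method names, and storage paths are invented for illustration (this is not the Gravitino Python API): a fileset is registered once, consumers resolve it to a physical location, and the reads are audited so unused data becomes visible.

```python
# Hypothetical sketch of fileset management with access auditing.
import time


class FilesetRegistry:
    def __init__(self):
        self._filesets = {}    # name -> storage location
        self._access_log = []  # (fileset, consumer, timestamp)

    def create_fileset(self, name: str, location: str) -> None:
        self._filesets[name] = location

    def resolve(self, name: str, consumer: str) -> str:
        """A training job (e.g. a PyTorch DataLoader) resolves the fileset to
        its physical location; the read is audited as a side effect."""
        self._access_log.append((name, consumer, time.time()))
        return self._filesets[name]

    def unused(self):
        """Filesets no consumer has ever read: candidates for cleanup."""
        read = {name for name, _, _ in self._access_log}
        return sorted(set(self._filesets) - read)


reg = FilesetRegistry()
reg.create_fileset("training_images", "s3://bucket/raw/images/")
reg.create_fileset("old_embeddings", "s3://bucket/raw/embeddings/")
path = reg.resolve("training_images", consumer="pytorch-job-42")
print(path)          # s3://bucket/raw/images/
print(reg.unused())  # ['old_embeddings']
```

The audit log is what lets data engineers answer "which models read this data, and how often?" without instrumenting each AI pipeline separately.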
[00:06:33] Tobias Macey:
And I'm definitely interested in digging more into some of the specifics about how to populate and how to make use of that metadata for unstructured storage. But before we get too deep in the weeds on that direction, I also wanna talk a bit more about some of the problems that you are looking to solve with Gravitino and some of the ways that you have seen teams address those problems in the absence of Gravitino.
[00:06:58] Junping Du:
Yeah. So with Gravitino, just as I mentioned, we try to unify different data formats and sources across different data engines. We create a new layer that we call the modern open data catalog; I think it's a similar concept to a data fabric. With this kind of open data catalog built by Gravitino, no matter what kind of data it is, which cloud it lives on, or which format it's using (Hive tables, Iceberg tables, or some vector store), you can use a mainstream data engine or AI engine to access it: not only reads, but also writes, updates, appends, whatever. That's the problem we're trying to solve with Gravitino. Previously, a lot of engines and data products tried to build a vertical stack for data, but we would rather build a layer that serves the different verticals, so that we break down the silos in how data is accessed.
[00:08:14] Tobias Macey:
In the category of tabular structures, one of the longest running projects out there, at least for the case where you're not using a specific database engine, is the Hive metastore that was intended to address this cloud data lake or Hadoop style architecture where you have lots of files everywhere. You have different table schemas or table formats. And so the Hive metastore was used as a means of being able to keep track of all of them and to have different methods of populating that metadata, whether it's from Spark or from data crawlers or from API calls. And I'm wondering what you see as the reason that that was such a lasting technology that stuck around long after Hive started to fade away and some of the opportunities that you see for innovation in that metadata layer for being able to improve the overall experience around how to interact with these different data systems?
[00:09:16] Junping Du:
Yeah. The Hive Metastore really has lasted for quite a long time. It became the de facto standard for the industry ever since Hive appeared, maybe more than 10 years ago. It lasted because so many engines use the Hive Metastore to store all their metadata. But things have changed over the past few years. Modern data engines try to build their own metadata or catalog layer, such as StarRocks and some other new engines. And the Hive Metastore cannot directly manage AI assets such as unstructured data.
That's the reason: the HMS community is not growing quickly anymore, and it's not solving the problems we face today. So we ended up with a choice: either join the Hive community and build something new on top of a lot of legacy code, or start a new project that inherits some capability from HMS but extends it into a brand-new product with brand-new features. We chose the latter, which is Gravitino: building a new product that can stay compatible with HMS. I think that's also the logic for some other products: stay compatible with HMS, but build something new. This is driven by new requirements, and also by the long history of legacy that HMS has already accumulated; we have to treat that as a standard to follow.
[00:11:20] Tobias Macey:
In that general theme of Hive where it started off as a query engine, it brought along the metastore as a means of keeping track of the different table metadata. Looking at the documentation for Gravitino, it seems that it is also doing a few different things at once, where it has the metadata storage for being able to keep track of tabular and unstructured data so that you know where it lives. You have some capability for data federation for being able to query across these different data sources, similar to Trino or Alluxio. And it also acts as a way to do some measure of discovery of data in the vein of an OpenMetadata or an Atlan. And I'm wondering if you can just talk to some of the ways that you think about how Gravitino overlaps with some of those different areas of functionality, and some of the ways that you think about how it can either supplement or potentially even replace various technologies in different use cases.
[00:12:21] Junping Du:
Yeah, I think that's a good question. We do have some overlap with existing categories. Take the Hive Metastore for example: just as I mentioned, today we wrap HMS. That way we don't have to replace it; instead we manage it and upgrade it with new capabilities. That's very important, since HMS has served as quite an important component in the open source world for a long time. So we can be compatible first, and replace it later, once users feel more comfortable. As for the metadata repository part, OpenMetadata is more like a traditional data governance tool. It can copy metadata from one place to another and do some access control and data lineage work, but it cannot serve a computing engine directly. Those are different scenarios from Gravitino's.
As for data federation, we actually support Trino. We don't do data federation directly, but we let engines such as Trino and Spark run federated queries over multiple Hive metastores or multiple data catalogs, crossing different clouds or different data lakes, which is super cool technology to have on top.
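The federation pattern described here, where the engine sends a catalog-qualified name to one metadata service that dispatches to whichever backing store actually owns it, can be sketched as follows. The catalog names and locations are invented, and plain dicts stand in for real metastores.

```python
# Minimal sketch of catalog-qualified name routing across backing stores.
def make_router(backends: dict):
    """backends maps catalog name -> {table name: physical location}."""
    def resolve(qualified_name: str) -> str:
        # Split off the catalog prefix; the rest stays in the backend's terms.
        catalog, table = qualified_name.split(".", 1)
        return backends[catalog][table]
    return resolve


resolve = make_router({
    "hms_onprem":  {"sales.orders":    "hdfs://nn/warehouse/sales/orders"},
    "iceberg_aws": {"sales.orders_v2": "s3://lake/sales/orders_v2"},
})

# One engine, one interface, two clouds:
print(resolve("hms_onprem.sales.orders"))     # hdfs://nn/warehouse/sales/orders
print(resolve("iceberg_aws.sales.orders_v2")) # s3://lake/sales/orders_v2
```

In the real setup the "router" is the metadata service and the engine (Trino, Spark) joins across the resolved sources; the sketch only shows the name resolution step.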
[00:13:42] Tobias Macey:
And as you are building Gravitino as it has some of these different capabilities, there's always the challenge of scope creep where you say, oh, it'd be really neat if it could do this thing over here. And then you start to expand the range of things that it can do, which increases the overall load for maintenance and the complexity of the project. And I'm curious how you're thinking about what to explicitly keep out of scope for Gravitino and the things that you definitely don't want it to grow into.
[00:14:14] Junping Du:
Yeah, thanks. I think that's also a good question. We serve as a unified metadata center, which means our scope is already big enough. We don't want to grow the scope into something like a query engine. We don't want Gravitino to be a query engine project that serves the optimization of one dedicated query engine. So we treat Spark, Trino, ClickHouse, and whatever other computing engines we support as equally important. I think this is a very important, unique value for us compared with other new catalogs that are backed by a single query engine vendor.
I think that's fairly important.
[00:15:04] Tobias Macey:
In terms of the technical implementation, I'm wondering if you can give a bit of an overview of the architectural components of Gravitino, and some of the ways that you thought about the design and implementation in order to achieve the goals that you set out with?
[00:15:22] Junping Du:
Yeah. At a high level, Gravitino has a layered architecture. The core layer is the catalog abstraction layer, which supports database catalogs; fileset catalogs, which handle datasets; streaming catalogs, which support Kafka and other streaming engines; and model catalogs, which support managing AI models. We can support more catalog types in the future on demand, but currently this is wide enough to cover the mainstream scenarios we see in the community. Underneath is a data connection layer, which connects to the different data sources: filesets, tabular data, streaming data. And on top of the core layer is an interface layer.
It supports a REST interface, a JDBC interface, and potentially a Thrift interface as well, which lets different computing engines use Gravitino. Those three layers are the core of Gravitino. The topmost layer is a functionality layer, which provides data management and data governance capabilities such as access control, data lineage, data quality, et cetera. That is the technical architecture of Gravitino. We started working on it at the very beginning, before we started the company.
We think the most important thing is to figure out what to do rather than how to do it. We discussed a lot and decided to build the metadata lake first, because there's no reason for the world to have yet another query engine or compute engine; there are too many engines already. What's important is to have a single metadata layer so that the different engines can work with the different data sources. That makes the core design, the data models, very important: how to abstract the models for different types of engines and different types of data, especially structured versus unstructured. So we did a very careful and flexible design, and it has continued to evolve to today, so we can support different ways of manipulating and computing on data. That's the core part.
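The core abstraction described above, one catalog interface with concrete flavors per workload, might be sketched like this. The class names mirror the description in the conversation, not the project's actual Java classes.

```python
# Rough sketch of the catalog abstraction layer: one interface, several
# workload-specific catalog types (tables, filesets, streaming topics, models).
from abc import ABC, abstractmethod


class Catalog(ABC):
    @abstractmethod
    def object_type(self) -> str:
        """The kind of object this catalog manages."""


class RelationalCatalog(Catalog):   # databases and table schemas
    def object_type(self) -> str:
        return "table"


class FilesetCatalog(Catalog):      # unstructured datasets
    def object_type(self) -> str:
        return "fileset"


class MessagingCatalog(Catalog):    # e.g. Kafka topics
    def object_type(self) -> str:
        return "topic"


class ModelCatalog(Catalog):        # registered AI models
    def object_type(self) -> str:
        return "model"


kinds = [c().object_type() for c in
         (RelationalCatalog, FilesetCatalog, MessagingCatalog, ModelCatalog)]
print(kinds)  # ['table', 'fileset', 'topic', 'model']
```

The design payoff is that engines and the functionality layer (access control, lineage) program against `Catalog` and never care which concrete type sits underneath.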
[00:18:27] Tobias Macey:
As far as the role of Gravitino in an overarching data platform architecture, I'm curious if you can talk to the process of integrating it into an existing set of systems, some of the types of technologies that you would want to layer on top of Gravitino to be able to take advantage of the information that it holds, and also maybe some of the ways that you're using Gravitino in your own work at Datastrato.
[00:18:55] Junping Du:
Yeah, of course. At Datastrato, Gravitino is our first product and our main focus now. Gravitino is definitely a platform service. It can run separately, but it integrates into a mainstream open data architecture seamlessly: it supports Spark, Flink, Trino, Doris, Hudi tables, and traditional Hive tables. So if you have an existing system, keep it going. When you launch Gravitino, you'll be surprised to find it works very well with a lot of your existing components. It can merge the different metadata views into a single metadata view, and your engines can go through Gravitino to query or compute on data stores and data sources that you previously couldn't attach to. That's a quite interesting journey we've seen from the community.
A lot of community users and customers have given us that feedback.
[00:20:14] Tobias Macey:
Once a team has integrated Gravitino into their architecture and they're starting to use it to store, populate, and query metadata, I'm wondering if you can talk to some of the ways that they might interact with Gravitino in a typical workday, and how it fits into the overall workflow of being able to use the underlying data that Gravitino points to.
[00:20:37] Junping Du:
Yeah. If you've been using the Hive Metastore for a long time, it's almost seamless compared with your previous experience. Your Spark jobs and your Trino jobs continue to work with their previous data sources, which is fine. And if you're a data engineer working on some ETL, merging multiple tables from different places into a single table and then doing additional work in your data pipeline, you'll find it becomes very powerful, because some unnecessary ETL can be skipped.
Using Gravitino, you can see all the tables that exist and build toward your final table directly, skipping some of the intermediate tables. That's definitely a change. Additionally, because we're building a centralized metadata lake, you know all the access patterns of your data pipelines and how your metadata and data are accessed in different ways. So you can put some monitoring on that. You can see that some tables, especially intermediate tables, aren't needed anymore because no data pipeline is actually using them, and you can drop them.
You can do more fine-grained data lifecycle management on top of that.
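The lifecycle idea above can be sketched simply: when every pipeline read flows through one metadata layer, intermediate tables that nothing reads anymore become visible and can be dropped. Table names and read events here are invented for illustration.

```python
# Sketch: flag intermediate tables that no downstream job reads anymore.
from collections import Counter


def stale_tables(all_tables, read_events, final_outputs):
    """Tables that are neither read downstream nor kept as final outputs."""
    reads = Counter(read_events)
    return sorted(t for t in all_tables
                  if reads[t] == 0 and t not in final_outputs)


tables = ["raw.orders", "staging.orders_clean",
          "staging.orders_tmp", "mart.revenue"]
reads = ["raw.orders", "staging.orders_clean"]  # access log for the period
candidates = stale_tables(tables, reads, final_outputs={"mart.revenue"})
print(candidates)  # ['staging.orders_tmp']
```

In practice the read events would come from the centralized access log rather than a hand-written list, but the pruning logic is the same.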
[00:22:10] Tobias Macey:
And now in terms of the unstructured data flow, tabular data is fairly well understood. Lots of different tools interact with it and interoperate with it. Unstructured data has been around for a long time. People have different solutions, but it has continually been a challenge to work through. And I'm wondering if you can talk to some of the ways that Gravitino helps to address some of those challenges and the workflow for being able to locate and catalog and interact with that unstructured data and managing the organization of those unstructured sources?
[00:22:48] Junping Du:
Of course. Structured data has a long history of being managed; we have many, many years of experience with that kind of data. Unstructured data is quite a new thing. Previously, we used ETL to turn unstructured data into structured data; that's what the Hadoop age did for over 10 years. But today a lot of AI models actually want to go directly to the unstructured data and consume it in various ways.
That makes the requirements for managing unstructured data much harder to meet. Managing unstructured data with just an S3 link, or a link to some storage location, is definitely not enough anymore. So the first thing is that we need centralized governance for unstructured data: who can access it? That's number one. Number two, can we attach richer metadata than just a link? Can we have a description of what kind of unstructured data it is, how to use it, and how to turn it into something consumable by your models or your feature store?
That's number two. The third is how to make your structured and unstructured data work together. Unstructured data isn't only used at the training stage; sometimes it's useful in your RAG system, where you leverage your unstructured data to give answers to the questions you put to the large language model. So you need a centralized way to manage both structured and unstructured data and make them work together. Those are typically the three cases we see from today's requirements.
[00:25:14] Tobias Macey:
Earlier, you also mentioned being able to have some insight into whether a model or engine is actually accessing some of that underlying data, to determine how long you want to keep it around or whether you can cull that data to save on storage cost. And I'm wondering about some of the different access data that you're able to use to provide insights to the end user as far as what data is valuable, what data is just taking up space and cost, and some of the higher-order workflows that people are able to build from that visibility.
[00:25:51] Junping Du:
Of course. I think that's important, and some community users already leverage these features. They use filesets to manage their unstructured data, and they track the access patterns: how their AI team consumes this data and at what frequency. Based on those statistics, they use different backend storage. They have tiered backend storage: some storage for hot data, some for warm data, some for cold data. So they can move the most frequently accessed data to hot storage.
And the unstructured data that isn't actively accessed moves to cold storage, which saves a lot of cost in this unstructured data revolution. That's definitely a showcase of how to leverage these features to achieve fine-grained management of unstructured data.
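The tiering policy described can be sketched as a simple function of observed access frequency. The thresholds and fileset names below are invented for illustration; a real policy would be tuned to the storage backends in use.

```python
# Sketch: route each fileset to hot, warm, or cold storage based on how
# often AI consumers actually read it (counts come from the access log).
def storage_tier(accesses_per_week: int) -> str:
    if accesses_per_week >= 50:
        return "hot"    # e.g. local NVMe or premium object storage
    if accesses_per_week >= 5:
        return "warm"   # standard object storage
    return "cold"       # archival tier


usage = {"training_images": 120, "eval_sets": 12, "2021_crawl": 0}
placement = {name: storage_tier(n) for name, n in usage.items()}
print(placement)
# {'training_images': 'hot', 'eval_sets': 'warm', '2021_crawl': 'cold'}
```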
[00:27:03] Tobias Macey:
Another piece that you touched on briefly is the governance of that underlying data. When I was looking at Gravitino, it mentions having a centralized access control capability. And I'm curious if you can talk to some of the permissions management, some of the ways that that feeds into the other systems that are relying on Gravitino, and some of the overall challenges of managing permissions and access control across different layers of the data stack.
[00:27:38] Junping Du:
Yep, that was really the pain point before: you had to go to the different data engines and set up each one's permissions and access control settings. Gravitino tries to solve that in a more unified way. The design of Gravitino is that we have separate authentication and authorization workflows. Through authorization, we can work with the different underlying platforms: the Hadoop ecosystem, where you may be using Kerberos, and cloud engines like BigQuery or Snowflake, which definitely use some IAM-related mechanism.
We work with these different mechanisms so that the credentials Gravitino takes can be used in the different underlying systems. That's the high-level design principle. With unified, centralized permission management, customers don't have to configure access control in all these different places; you can set it directly in Gravitino. After that, the permissions, whether for fileset management, table management, or even fine-grained low-level access control, are set in the underlying systems. We are not the storage layer, but we can help set the storage layer's access control. That's our design principle.
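The push-down idea, one grant recorded centrally and translated into each underlying system's own mechanism, can be sketched like this. The translation table and statement formats are invented for illustration and do not reflect how Gravitino actually talks to Ranger or Snowflake.

```python
# Sketch: one central grant, translated into per-backend enforcement actions.
PUSHDOWN = {
    "hive":      lambda p: f"Ranger policy: {p['role']} {p['priv']} {p['obj']}",
    "snowflake": lambda p: (f"GRANT {p['priv'].upper()} ON {p['obj']} "
                            f"TO ROLE {p['role']}"),
}


def grant(role: str, priv: str, obj: str, backends):
    """Record one central grant and emit each backend's enforcement form."""
    policy = {"role": role, "priv": priv, "obj": obj}
    return [PUSHDOWN[b](policy) for b in backends]


stmts = grant("ml_team", "select", "sales.orders", ["hive", "snowflake"])
for stmt in stmts:
    print(stmt)
```

The design benefit is the same one Junping describes: the grant is authored once, in one place, and each storage or engine layer still enforces it natively.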
[00:29:25] Tobias Macey:
As you have been building Gravitino, using it in your own use cases, and helping to support other people who are adopting it, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:29:41] Junping Du:
That's interesting, because previously we wanted to build very generic catalog support: we support JDBC, and we support REST catalogs in general. But eventually, community users asked, okay, can you build catalogs for the data lake formats, such as an Iceberg catalog or a Hudi catalog? So about half a year ago we started building the Iceberg REST catalog, integrated that capability into Gravitino, and now we support it. You may see two or three other open source data catalogs that can support Iceberg now, but we did it first, and we did it because the community asked for it. As the community continues to grow, we'll find more and more scenarios we can potentially address just by following community users' needs, and we'll definitely find something new and interesting to follow.
[00:30:55] Tobias Macey:
And in your own experience of building this project, working with the community, building this business that relies on Gravitino, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:31:12] Junping Du:
Yeah. So I think, initially, we tried to totally replace HMS, to be honest. So we think the HMS is the last thing for too long time, just you said. Right? But after we do that, we have a very early stage, initial version. But a lot of community users say, hey. You know, I want we want you to compatible with HMS because we're using it for a long quite a long time. We don't want we may re to type it in some day, but not now. So can you work with us? So this is something unexpected, but, you know, we respect what the communities request or or required.
I think this, important lesson. And, also, a lot of, lessons we learn from it, among the, you know, AI part. You know, we don't realize, it's because the FISAT management stuff is not so it's not a very complicated technology. Right? But with this kind of adding features, we're finding a lot a lot of users is really interesting on this feature. And, really, they're suffering a long term, pinpoint from that. So, I think this is something, quite interesting as well.
[00:32:32] Tobias Macey:
For people who are looking to either expand or improve their experience of catalog management, what are the cases where Gravitino is the wrong choice?
[00:32:44] Junping Du:
Yeah. So if they have very limited data sources and their scenarios are very simple, like they just have one data engine, maybe on top of one cloud, I think they don't have to use Gravitino. Right? And if there's no AI workload involved, just pure data analytics or data engineering with no governance required, I think Gravitino may not be very useful in that case. It doesn't do any harm, definitely, but it's not very useful.
[00:33:20] Tobias Macey:
And as you continue to build and evolve and improve on Gravitino, what are some of the things that you have planned for the near to medium term or any particular projects or problem areas or features that you're excited to dig into?
[00:33:34] Junping Du:
Yeah. So we'll continue building more AI capability on top of that. Especially, we are adding more features for unstructured data management, including lineage. Right? So how do unstructured data and models connect, from unstructured data to features, and from features to models? That's a kind of lineage capability. We're trying to build a lot more governance and lineage capability, also including data sharing at some point, to make Gravitino more useful in many kinds of cases.
We really think that in the future, as I just mentioned, data could be a bottleneck for this AI revolution. What does that mean? It means we're lacking enough data, and we're lacking enough high quality data. So part of Gravitino's mission is to unify all the possibly accessible data, to make it more accessible. We also monitor data quality, to increase the visibility of the quality of the data, no matter whether it's structured or unstructured. I think this mission is quite important to Gravitino. And this is also a very important reason why we donated it to Apache, to become an Apache open governance project, because we want it to be open. We want to address the real data challenges in the AI era.
[00:35:20] Tobias Macey:
Are there any other aspects of the Gravitino project, the overall space of catalog metadata, unstructured data management, or the ways that AI workflows are evolving the needs for data cataloging that we didn't discuss yet that you'd like to cover before we close out the show?
[00:35:37] Junping Du:
Yeah. I think we've discussed a lot of this kind of stuff. We can discuss more later in the future, once we've made more progress.
[00:35:50] Tobias Macey:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:36:06] Junping Du:
Yeah. I think, definitely, it's that the data folks don't know AI too much, and the AI folks lack the data technology background or perspective. That's today's big gap, I would say. So we try to be a unified layer, not only a product to unify different technologies, but actually to unify the data engineers and the AI engineers, to make the two teams understand each other. Only if they're using a single data platform or data tools can they understand each other. If they continue using different tools, how can these two groups of people know each other? So this is our mission. I think it's also our dream.
[00:37:01] Tobias Macey:
Absolutely. Yeah. There's definitely a lot of incidental complexity coming up as a result of the increased usage of AI and machine learning, and the fact that the technology stacks are being built independently and in isolation of each other. There hasn't been a lot of bridging that has happened yet.
[00:37:20] Junping Du:
Of course.
[00:37:21] Tobias Macey:
Yeah. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on Gravitino. It's definitely a very interesting project, and it's great to see more innovation and investment in this space of cataloging, because as we noted before, it has been stagnating for far too long. So it's great to see you out there helping to push the space forward. Thank you again for your time, and I hope you enjoy the rest of your day.
[00:37:46] Junping Du:
Yeah. Thanks. Thanks, Tobias. It was a very good conversation with you, and I wish you a good one. Thank you.
[00:38:01] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language and its community, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Junping Du's Background and Career
Overview of Gravitino
Problems Addressed by Gravitino
Hive Metastore and Metadata Layer Innovations
Gravitino's Capabilities and Scope
Technical Implementation and Architecture
Integration into Existing Systems
Managing Unstructured Data
Access Control and Permissions Management
Community Feedback and Unexpected Use Cases
Future Plans and Features
Closing Thoughts and Contact Information