Summary
As data architectures become more elaborate and the number of applications of data increases, it becomes increasingly challenging to locate and access the underlying data. Gravitino was created to provide a single interface to locate and query your data. In this episode Junping Du explains how Gravitino works, the capabilities that it unlocks, and how it fits into your data platform.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Your host is Tobias Macey and today I'm interviewing Junping Du about Gravitino, an open source metadata service for a unified view of all of your schemas
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Gravitino is and the story behind it?
- What problems are you solving with Gravitino?
- What are the methods that teams have relied on in the absence of Gravitino to address those use cases?
- What led to the Hive Metastore being the default for so long?
- What are the opportunities for innovation and new functionality in the metadata service?
- The documentation suggests that Gravitino has overlap with a number of tool categories such as table schema (Hive metastore), metadata repository (Open Metadata), data federation (Trino/Alluxio). What are the capabilities that it can completely replace, and which will require other systems for more comprehensive functionality?
- What are the capabilities that you are explicitly keeping out of scope for Gravitino?
- Can you describe the technical architecture of Gravitino?
- How have the design and scope evolved from when you first started working on it?
- Can you describe how Gravitino integrates into an overall data platform?
- In a typical day, what are the different ways that a data engineer or data analyst might interact with Gravitino?
- One of the features that you highlight is centralized permissions management. Can you describe the access control model that you use for unifying across underlying sources?
- What are the most interesting, innovative, or unexpected ways that you have seen Gravitino used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Gravitino?
- When is Gravitino the wrong choice?
- What do you have planned for the future of Gravitino?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- Gravitino
- Hadoop
- Datastrato
- PyTorch
- Ray
- Data Fabric
- Hive
- Iceberg
- Hive Metastore
- Trino
- OpenMetadata
- Alluxio
- Atlan
- Spark
- Thrift
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey, and today I'm interviewing Junping Du about Gravitino, an open source metadata service for a unified view of all of your schemas. So, Junping, can you start by introducing yourself?
[00:00:29] Junping Du:
Thanks, Tobias. This is Junping Du, and I have over 15 years working in open source and the data industry. I served at companies such as VMware, Hortonworks, and Tencent. I used to be a Hadoop guy: a long-term contributor, committer, and also a release manager of Hadoop. So I'm a super fan of open source technologies, especially in the data and AI area. Now we are building a startup called Datastrato, which we started in 2023, and Gravitino, an open source data catalog, is our main focus. We try to break down all kinds of data silos, whether they come from different data lakes, different cloud vendors, or the separation of data and AI in the software stack.
[00:01:23] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:26] Junping Du:
Oh, I still remember. That was over 13 or 14 years ago, when I started to look at how Hadoop worked while I was at VMware. I thought, that's great: at VMware we were doing virtualization, but now a lot of data could be stored together, with thousands of machines working as one. It's fantastic. With virtualization we try to break a big machine into pieces, but Hadoop tried to merge a lot of data, engines, and machines into one giant machine. I thought it was a very cool technology, and I tried to combine the two. That's why I started an internal VMware project on how to run Hadoop efficiently and scalably on VMware's virtualization technology. That was my start, moving from cloud technology to data technology.
[00:02:37] Tobias Macey:
Now digging more into Gravitino, I'm wondering if you can give a bit of an overview about what it is and some of the story behind how it came to be and why you decided that it was worth your time and energy to invest in building this technology.
[00:02:50] Junping Du:
Of course. As I mentioned, I worked on Hadoop technology for a long time, and afterwards I worked on cloud data warehousing and building cloud data lakes. Along the way we made several interesting findings. First is the data silo problem. That doesn't only mean data siloed across different databases and data warehouses; sometimes it's siloed across multi-cloud or hybrid cloud scenarios. We also noticed that the revolution in generative AI created new patterns for accessing data, especially unstructured data. Previously, the use of structured data was planned in advance, so we could do a lot of data preparation ahead of time. Now much more data access is ad hoc and on demand.
And third, we found that data technology is actually growing a little more slowly than GPU power and large language model technology. We definitely don't want data to become the bottleneck in this large language model evolution, so we want to join the effort to accelerate it. We think this is like a Tesla moment for data platform technology: the autopilot is moving faster than the engine, and the data platform is at the same stage now. So we asked ourselves, what is the intelligence about the data?
I believe any data engineer, any data person, would say it's the metadata. Right? It has all the knowledge about the data. So that's why we are trying to build a metadata lake, whether it's for transactional databases, analytics, or AI workloads. That metadata lake is Gravitino.
[00:04:57] Tobias Macey:
You mentioned the metadata catalogs that are in use right now for different query engines for tabular data. You also mentioned the application of Gravitino to unstructured data sources for use in more AI oriented workloads. And I'm wondering if you can talk to some of the ways that you think about the expansion of that metadata store beyond just the tabular structures and beyond the constraints of a single database engine or query engine?
[00:05:28] Junping Du:
Yeah, I can give an example. With tabular data, you manage everything with tables, right? Table schemas. Now we also support fileset management, a kind of dataset management. So if PyTorch wants to consume some data sources or datasets, we can manage them, and we can serve PyTorch so it reads them directly. With this, we can manage structured and unstructured data in the same place. Then the data engineers will know how the AI team, the AI engineers, consume this kind of data. They can do access control; they can monitor how the AI team is using the data. They'll know, okay, some of this unstructured data is not useful anymore because no models consume it anymore.
So they can do better cost saving, data quality monitoring, things like that.
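To make the fileset idea concrete, here is a minimal sketch of the pattern being described: training code asks a catalog for a fileset by logical name instead of hard-coding storage paths, and the catalog records who read what. All names here (the `FilesetCatalog` class, the identifiers, the paths) are illustrative, not Gravitino's actual API.

```python
class FilesetCatalog:
    """Toy metadata service mapping logical fileset names to storage locations."""

    def __init__(self):
        self._filesets = {}    # identifier -> storage location
        self._access_log = []  # (identifier, consumer) pairs

    def register(self, identifier: str, location: str) -> None:
        self._filesets[identifier] = location

    def resolve(self, identifier: str, consumer: str) -> str:
        # Recording who reads what is what later enables access auditing
        # and "no model uses this anymore" cleanup.
        self._access_log.append((identifier, consumer))
        return self._filesets[identifier]


catalog = FilesetCatalog()
catalog.register("lake.ai.training.images_v2", "s3://bucket/datasets/images_v2/")

# A PyTorch-style dataset would resolve the path at load time:
path = catalog.resolve("lake.ai.training.images_v2", consumer="resnet-train-job")
print(path)  # s3://bucket/datasets/images_v2/
```

Because every read goes through `resolve`, the access log becomes the raw material for the cost-saving and quality-monitoring use cases mentioned above.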
[00:06:33] Tobias Macey:
And I'm definitely interested in digging more into some of the specifics about how to populate and how to make use of that metadata for unstructured storage. But before we get too deep in the weeds on that direction, I also wanna talk a bit more about some of the problems that you are looking to solve with Gravitino and some of the ways that you have seen teams address those problems in the absence of Gravitino.
[00:06:58] Junping Du:
Yeah. So Gravitino, as I mentioned, tries to unify different data formats and sources across different data engines. We create a new layer that we call a modern open data catalog. I think it's the same concept as a data fabric. With this kind of open data catalog built by Gravitino, no matter where the data lives, on which cloud, in which format, whether it's Hive tables or Iceberg tables or a vector store, you can use a mainstream data engine or AI engine to access it. Not only reads, but also writes, updates, appends, whatever it is. That's the problem we're trying to solve with Gravitino. Previously, a lot of engines and data products tried to build a vertical stack for data, but we would rather build a layer that serves different verticals, so we break down the silos in the access patterns of the data.
[00:08:14] Tobias Macey:
In the category of tabular structures, one of the longest running projects out there, at least for the case where you're not using a specific database engine, is the Hive metastore that was intended to address this cloud data lake or Hadoop style architecture where you have lots of files everywhere. You have different table schemas or table formats. And so the Hive metastore was used as a means of being able to keep track of all of them and to have different methods of populating that metadata, whether it's from Spark or from data crawlers or from API calls. And I'm wondering what you see as the reason that that was such a lasting technology that stuck around long after Hive started to fade away and some of the opportunities that you see for innovation in that metadata layer for being able to improve the overall experience around how to interact with these different data systems?
[00:09:16] Junping Du:
Yeah. The Hive Metastore really lasted for quite a long time. It became the de facto standard for the industry for a very long time, since Hive appeared maybe more than 10 years ago. It lasted because a lot of engines use the Hive Metastore to store all their metadata on top of it. But things have changed in the last few years. More modern data engines try to build their own metadata or catalog, such as StarRocks and some other new engines. And also, the Hive Metastore cannot manage AI capabilities such as unstructured data. It can't manage that directly.
That's the reason: HMS is not a fast-growing community, and it's not solving the problems we're facing today. So we ended up with a choice: either join the Hive community and build something new on top of a lot of legacy code, or start a new project that inherits some capabilities from HMS but extends into brand-new features and a new product. We chose the latter, which is Gravitino: building a new product that is compatible with HMS. I think that's also the logic for some other products: be compatible with HMS while building something new. This is driven by new requirements, but HMS also has a long legacy that has become a standard we have to follow.
[00:11:20] Tobias Macey:
In that general theme of Hive, where it started off as a query engine, it brought along the metastore as a means of keeping track of the different table metadata. Looking at the documentation for Gravitino, it seems that it is also doing a few different things at once, where it has the metadata storage for being able to keep track of tabular and unstructured data so that you know where it lives. You have some capability for data federation for being able to query across these different data sources, similar to Trino or Alluxio. And it also acts as a way to do some measure of discovery of data in the vein of an OpenMetadata or an Atlan. And I'm wondering if you can just talk to some of the ways that you think about how Gravitino overlaps with some of those different areas of functionality and some of the ways that you think about how it can either supplement or potentially even replace various technologies in different use cases.
[00:12:21] Junping Du:
Yeah, I think that's a good question. We do have some overlap with existing categories. Take the Hive Metastore for example. As I mentioned, today we can wrap an HMS. That way we don't have to replace it; instead, we manage it and upgrade it with new capabilities. That's very important, because HMS has served as quite an important component in the open source world for a long time. So we can be compatible first and replace it later, after users feel more comfortable. As for the metadata repository part, OpenMetadata is more like a traditional data governance tool. It can copy metadata from one place to another and do some access control and other data lineage work, but it cannot serve a computing engine directly. Those are different scenarios from Gravitino.
About data federation: we don't do data federation directly, but we support engines such as Trino and Spark doing federated queries over multiple Hive Metastores or multiple data catalogs, crossing different clouds or different data lakes, which is super cool technology on top of that.
[00:13:42] Tobias Macey:
And as you are building Gravitino as it has some of these different capabilities, there's always the challenge of scope creep where you say, oh, it'd be really neat if it could do this thing over here. And then you start to expand the range of things that it can do, which increases the overall load for maintenance and the complexity of the project. And I'm curious how you're thinking about what to explicitly keep out of scope for Gravitino and the things that you definitely don't want it to grow into.
[00:14:14] Junping Du:
Yeah, thanks. I think that's also a good question. We are serving as a unified metadata center, right? That means our scope is already big enough. We don't want to grow the scope into something like a query engine. We don't want to be a query engine project, or to serve the optimization of one dedicated query engine. So we treat Spark, Trino, Dremio, ClickHouse, whatever computing engine we support, as equally important. I think this is a very important, unique value for us compared with other new catalogs that are backed by a single query engine vendor.
I think that's fairly important.
[00:15:04] Tobias Macey:
In terms of the technical implementation, I'm wondering if you can give a bit of an overview of the architectural components of Gravitino and some of the ways that you thought about its design and implementation in order to achieve the goals that you set out with.
[00:15:22] Junping Du:
Yeah. In essence, Gravitino has a layered architecture. The core layer is the abstraction of the catalog layer, which can support a database catalog; a fileset catalog, which is for datasets; a streaming catalog, which supports Kafka and other streaming engines; and a model catalog, which supports managing AI models. We can support more catalog types in the future on demand, but currently this is wide enough to cover the mainstream scenarios we see in the community. Underneath is a data connection layer, which connects to different data sources: filesets, tabular data, streaming data. And on top of the core layer there is an interface layer.
It supports a REST interface, a JDBC interface, and potentially a Thrift interface as well, which lets different computing engines use Gravitino. Those three layers are the core of Gravitino. The topmost layer is a functionality layer, which provides data management and data governance features such as access control, data lineage, data quality, et cetera. That is the technical architecture of Gravitino. We started working on it at the very beginning, before we started the company.
We thought the most important thing was to figure out what to do instead of how to do it. We discussed a lot and decided to build the metadata lake first, because there's no reason for the world to have yet another engine, whether query engine or compute engine; there are too many engines already. What's important is a single layer of metadata that makes the different engines work with different data sources. That makes the design of the data models very important: how to abstract the models for different types of engines and different types of data, especially structured and unstructured data. So we made a very careful and flexible design, and it has continued to evolve to today, so we can support different ways of manipulating and computing on data. That is the core part.
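The interface layer is easiest to picture through its REST surface, which exposes the metalake, catalog, schema, object hierarchy just described. The sketch below builds those endpoint paths; the exact path shapes and the port are assumptions for illustration, so consult the Gravitino REST API reference for the real ones.

```python
# Illustrative helpers that build REST paths for the
# metalake -> catalog -> schema -> table hierarchy.
BASE = "http://localhost:8090/api"

def catalogs_url(metalake: str) -> str:
    # List or create catalogs inside a metalake.
    return f"{BASE}/metalakes/{metalake}/catalogs"

def schemas_url(metalake: str, catalog: str) -> str:
    # List or create schemas inside a catalog.
    return f"{catalogs_url(metalake)}/{catalog}/schemas"

def tables_url(metalake: str, catalog: str, schema: str) -> str:
    # List or create tables inside a schema.
    return f"{schemas_url(metalake, catalog)}/{schema}/tables"

print(tables_url("lake", "hive_prod", "sales"))
```

A computing engine's connector would issue GET/POST requests against URLs of this shape instead of talking to each metastore's native protocol.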
[00:18:27] Tobias Macey:
As far as the role of Gravitino in an overarching data platform architecture, I'm curious if you can talk to the process of integrating it into an existing set of systems, some of the types of technologies that you would want to layer on top of Gravitino to be able to take advantage of the information that it holds, and also maybe some of the ways that you're using Gravitino in your own work at DataStrato.
[00:18:55] Junping Du:
Yeah, of course. At Datastrato, we are building Gravitino; it's our first product and our main focus. Gravitino is a platform service. It can run separately, but it's easy to integrate into a mainstream open data architecture seamlessly. It can support Spark, Flink, Trino, Doris, Iceberg tables, and traditional Hive tables. So if you have an existing system, keep it going, and when you launch Gravitino you'll be surprised to find it works very well with a lot of your existing components. It can merge different metadata views into one single metadata view, and your engines can go through Gravitino to query or compute on data stores and data sources that you previously couldn't attach to. This has been quite an interesting journey for us.
A lot of community users and customers have given us feedback.
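As a sketch of what that integration might look like from the Spark side, the settings below show the general shape of pointing a Spark session at a Gravitino server. The property names and plugin class are assumptions for illustration, not verified configuration keys; check the Gravitino Spark connector documentation for the real ones.

```python
# Hypothetical configuration for exposing Gravitino-managed catalogs
# to Spark. Keys and the plugin class name are illustrative.
spark_conf = {
    # Load the (assumed) Gravitino plugin so catalogs registered in
    # Gravitino appear as Spark catalogs automatically.
    "spark.plugins": "org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin",
    "spark.sql.gravitino.uri": "http://gravitino-server:8090",
    "spark.sql.gravitino.metalake": "lake",
}

# With a real SparkSession, applying it would look roughly like:
#   builder = SparkSession.builder.appName("demo")
#   for k, v in spark_conf.items():
#       builder = builder.config(k, v)
#   spark = builder.getOrCreate()
#   spark.sql("SELECT * FROM hive_prod.sales.orders LIMIT 10")

print(sorted(spark_conf))
```

The point of the pattern is that existing Spark SQL keeps working; the catalog names it references are simply resolved through Gravitino instead of a single hard-wired metastore.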
[00:20:14] Tobias Macey:
Once a team has integrated Gravitino into their architecture and started using it to store, populate, and query metadata, I'm wondering if you can talk to some of the ways that they might interact with Gravitino in a typical workday and how it fits into the overall workflow of using the underlying data that Gravitino points to.
[00:20:37] Junping Du:
Yeah. If you've been using the Hive Metastore for a long time, it's almost seamless compared to your previous experience. Your Spark jobs and your Trino jobs continue to work with your previous data sources, which is fine. And if you're a data engineer working on some ETL, merging multiple tables from different places into a single table and then doing additional work in your data pipeline, you'll find it gets very powerful, because some of the unnecessary ETL can be skipped.
Using Gravitino, you can see all the tables there and build your final table while skipping some of the intermediate tables. That is definitely a change. Additionally, because we build a centralized metadata lake, you know all the access patterns of your data pipelines, and you know how your metadata and data are accessed in different ways. So you can put some monitoring on that. You'll know, okay, some tables, especially intermediate tables, are not needed anymore because no data pipeline is actually using them. You can delete them or drop them.
You can do more fine-grained data lifecycle management on top of that.
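The lifecycle idea above reduces to a simple policy once access logs are centralized: any table no pipeline has read recently becomes a candidate for dropping. A minimal sketch, with illustrative table names and an assumed 30-day retention policy:

```python
from datetime import datetime, timedelta

def stale_tables(last_access: dict, now: datetime, max_age_days: int = 30) -> list:
    """Return tables whose most recent read is older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(t for t, ts in last_access.items() if ts < cutoff)

now = datetime(2024, 6, 1)
last_access = {
    "sales.orders_final": datetime(2024, 5, 30),  # actively read: keep
    "sales.tmp_stage_1": datetime(2024, 3, 2),    # abandoned intermediate table
    "sales.tmp_stage_2": datetime(2024, 2, 11),   # abandoned intermediate table
}
print(stale_tables(last_access, now))  # ['sales.tmp_stage_1', 'sales.tmp_stage_2']
```

In practice the `last_access` map would be derived from the catalog's audit log rather than hand-written, and the candidates would be reviewed before being dropped.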
[00:22:10] Tobias Macey:
And now in terms of the unstructured data flow, tabular data is fairly well understood. Lots of different tools interact with it and interoperate with it. Unstructured data has been around for a long time. People have different solutions, but it has continually been a challenge to work through. And I'm wondering if you can talk to some of the ways that Gravitino helps to address some of those challenges and the workflow for being able to locate and catalog and interact with that unstructured data and managing the organization of those unstructured sources?
[00:22:48] Junping Du:
Of course. Structured data has been managed for a long time; we have many, many years of experience managing that kind of data. For unstructured data, it's quite a new thing. Previously, we used ETL to turn unstructured data into structured data. That's what the Hadoop age did for over 10 years. But today, a lot of AI models actually want to go directly to the unstructured data and consume it in various ways.
That makes the requirements for managing unstructured data harder to meet. Just managing unstructured data with an S3 link or some storage-location link is definitely not enough. So the first thing is we need centralized governance for unstructured data: who can access it? That's number one. Number two: can we attach richer metadata than just a link? Can we have a description of what kind of unstructured data it is, how to use it, and how to make it consumable by your models or your feature stores?
That's number two. The third is how to make your structured and unstructured data work together. Unstructured data isn't only for the training stage; sometimes it's useful in your RAG system, where you leverage your unstructured data to give answers to your questions when you go to the large language model. So you need a centralized way to manage both structured and unstructured data and make them work together. Those are typically the three cases we can see from today's requirements.
[00:25:14] Tobias Macey:
Earlier, you also mentioned being able to have some insight into whether a model or engine is actually accessing some of that underlying data, to determine how long you want to keep it around or whether you can cull that data to save on storage costs. And I'm wondering about the different access data that you are able to use to provide insights to the end user as far as what data is valuable, what data is just taking up space and cost, and some of the higher-order workflows that people are able to build from that visibility.
[00:25:51] Junping Du:
Of course, I think that's important, and some community users already leverage these features. They use the fileset to manage their unstructured data, and they can see the access patterns: how their AI team consumes this data and at what frequency. Based on these statistics, they use different backend storage. They have tiered backend storage: some storage for hot data, some for warm data, some for cold data. So they can move the most frequently accessed data to hot storage.
And unstructured data with no active access is moved to cold storage, which saves a lot of cost in this kind of unstructured data revolution. So this is definitely a showcase of how they leverage these features to achieve fine-grained management of unstructured data.
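The tiering policy described above can be sketched in a few lines: classify each fileset by its access frequency and place it on matching storage. The thresholds here are illustrative; a real deployment would derive them from the catalog's access statistics.

```python
def assign_tier(reads_last_30d: int) -> str:
    """Map access frequency to a storage tier (thresholds are illustrative)."""
    if reads_last_30d >= 100:
        return "hot"    # keep on fast, expensive storage
    if reads_last_30d >= 10:
        return "warm"
    return "cold"       # move to cheap archival storage

# Hypothetical per-fileset read counts collected from the access log:
usage = {"images_v2": 450, "logs_2023": 12, "crawl_2021": 0}
tiers = {name: assign_tier(n) for name, n in usage.items()}
print(tiers)  # {'images_v2': 'hot', 'logs_2023': 'warm', 'crawl_2021': 'cold'}
```

The output of such a policy would then drive the actual data movement between storage backends, which the catalog records by updating each fileset's location.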
[00:27:03] Tobias Macey:
Another piece that you touched on briefly is the governance of that underlying data. And when I was looking at Gravitino, it mentions having a centralized access control capability. And I'm curious if you can talk to some of the permissions management and some of the ways that that feeds into some of the other systems that are relying on Gravitino and just some of the overall challenges of managing some of that permissions and access control across different layers of the data stack?
[00:27:38] Junping Du:
Yep, that's been a real pain point. You had to go to different data engines and set up permission and access control settings in each one. Gravitino is trying to solve that in a more unified way. In Gravitino's design we have separate authentication and authorization workflows, and through authorization we can work with different underlying platforms. In the Hadoop ecosystem you may be using Kerberos; some cloud engines like BigQuery or Snowflake definitely use some IAM-related mechanism.
We work with these different mechanisms so that the token Gravitino takes can be used in the different underlying systems. That's the high-level design principle. With unified, centralized permission management, customers don't have to set up access control in different places. You can set it directly in Gravitino, and after that the permissions, whether for fileset management, table management, or even fine-grained low-level access control, are supported and pushed down into the underlying systems. We are not the storage layer, but we can help set the storage layer's access control. That's the design principle for us.
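A toy model of that centralized approach: grants live in one place, and enforcement (or push-down to each underlying store) works from that single source of truth. Roles, privileges, and object names below are all illustrative, not Gravitino's actual access model.

```python
class AccessControl:
    """Minimal role-based access control with a single central grant table."""

    def __init__(self):
        self._grants = set()  # (role, privilege, securable object)
        self._roles = {}      # user -> set of roles

    def grant(self, role: str, privilege: str, obj: str) -> None:
        self._grants.add((role, privilege, obj))

    def assign(self, user: str, role: str) -> None:
        self._roles.setdefault(user, set()).add(role)

    def allowed(self, user: str, privilege: str, obj: str) -> bool:
        # A push-down layer would translate these same grants into each
        # underlying system's native ACLs instead of checking inline.
        return any((r, privilege, obj) in self._grants
                   for r in self._roles.get(user, ()))


acl = AccessControl()
acl.grant("analyst", "SELECT", "lake.hive_prod.sales.orders")
acl.assign("alice", "analyst")

print(acl.allowed("alice", "SELECT", "lake.hive_prod.sales.orders"))  # True
print(acl.allowed("alice", "DROP", "lake.hive_prod.sales.orders"))    # False
```

The key property is that revoking `analyst` in one place revokes it everywhere, rather than having to chase the same rule through Kerberos, IAM, and each engine's own grant syntax.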
[00:29:25] Tobias Macey:
As you have been building Gravitino, using it in your own use cases, helping to support other people who are adopting it, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:29:41] Junping Du:
That's interesting, because previously we wanted to build some very generic catalog support. We support JDBC; we support REST catalogs in general. But eventually, community users asked: can you build catalogs for data lake formats, such as an Iceberg catalog or a Hudi catalog? So about half a year ago we started building the Iceberg REST catalog capability into Gravitino, and now we support it. You may see two or three other open source data catalogs that can support Iceberg right now, but we did it first, and that's because community users asked for it. As the community continues to grow, we'll find more and more scenarios we can potentially address just by following the community users' needs, and we'll definitely find something new and interesting to follow.
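As a rough illustration of what consuming that Iceberg REST capability might look like from a client, the sketch below builds the catalog properties an Iceberg REST client expects. The endpoint path and port are assumptions for illustration; check the Gravitino Iceberg REST service documentation for the actual URI.

```python
def iceberg_rest_props(host: str, port: int = 9001) -> dict:
    """Build Iceberg-REST-style catalog properties for a (hypothetical) endpoint."""
    return {
        "type": "rest",
        "uri": f"http://{host}:{port}/iceberg/",
    }

props = iceberg_rest_props("gravitino-server")

# With pyiceberg installed, usage would look roughly like:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("gravitino", **props)
#   catalog.list_namespaces()

print(props["uri"])  # http://gravitino-server:9001/iceberg/
```

Because the Iceberg REST protocol is engine-neutral, the same endpoint can serve Spark, Trino, or a Python client without each one needing a bespoke connector.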
[00:30:55] Tobias Macey:
And in your own experience of building this project, working with the community, building this business that relies on Gravitino, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:31:12] Junping Du:
Yeah. To be honest, initially we tried to totally replace HMS. We thought HMS had lasted too long, just as you said. We built a very early-stage initial version that way, but a lot of community users said, hey, we want you to be compatible with HMS, because we've been using it for quite a long time. We may retire it someday, but not now. Can you work with us? So this was unexpected, but we respect what the community requests and requires.
I think that's an important lesson. We also learned a lot of lessons on the AI side. The fileset management feature is not a very complicated technology, but after adding it, we found that a lot of users are really interested in this feature, and they had been suffering from that pain point for a long time. So I think this is something quite interesting as well.
[00:32:32] Tobias Macey:
For people who are looking to either expand or improve their experience of catalog management, what are the cases where Gravitino is the wrong choice?
[00:32:44] Junping Du:
Yeah. If they have very limited data sources and their scenarios are very simple, like they just have one data engine, maybe on top of one cloud, I think they don't have to use Gravitino. And if there's no AI workload involved, just pure data analytics or data engineering, with no governance required, I think Gravitino may not be very useful in that case. It doesn't do any harm, definitely, but it's not very useful.
[00:33:20] Tobias Macey:
And as you continue to build and evolve and improve on Gravitino, what are some of the things that you have planned for the near to medium term or any particular projects or problem areas or features that you're excited to dig into?
[00:33:34] Junping Du:
Yeah. We will continue building more AI capability on top of Gravitino, especially more features for unstructured data management, including lineage. For unstructured data and models, how do we capture the lineage from unstructured data to features, and from features to models? We're also trying to build more governance capability, including data lineage and maybe data sharing at some point, to make Gravitino more useful in a lot of cases.
We really think that in the future, as I just mentioned, data could be a bottleneck for this AI revolution. What does that mean? It means we lack enough data, and we lack enough high-quality data. So part of Gravitino's mission is to unify all the possibly accessible data to make it more accessible, and also to monitor data quality to increase visibility into the quality of the data, whether it's structured or unstructured. I think this mission is quite important to Gravitino. This is also one of the most important reasons we donated it to Apache to become an openly governed Apache project: we want it to be open, and we want it to address the real data challenges of the AI era.
[00:35:20] Tobias Macey:
Are there any other aspects of the Gravitino project, the overall space of catalog metadata, unstructured data management, or the ways that AI workflows are evolving the needs for data cataloging that we didn't discuss yet that you'd like to cover before we close out the show?
[00:35:37] Junping Du:
Yeah, I think we've discussed a lot of this already. We can discuss more in the future once we've made further progress.
[00:35:50] Tobias Macey:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:36:06] Junping Du:
Yeah. I think, definitely, the data people don't know AI very well, and the AI people lack the data technology background and perspective. That, I would say, is today's big gap. So we try to be a unified layer, not only to unify different technologies, but actually to unify the data engineers and the AI engineers, to help the two teams understand each other. Only if they're using a single data platform and shared data tools can they understand each other. If they keep using different tools, how can these two groups of people know each other's work? So this is our mission, and I think it's also our dream.
[00:37:01] Tobias Macey:
Absolutely. Yeah. There's definitely a lot of incidental complexity coming up as a result of the increased usage of AI and machine learning, and the fact that the technology stacks are being built independently and in isolation from each other. There hasn't been a lot of bridging that has happened yet.
[00:37:20] Junping Du:
Of course.
[00:37:21] Tobias Macey:
Of course. Yeah. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on Gravitino. It's definitely a very interesting project, and it's great to see more innovation and investment in this space of cataloging because, as we noted before, it had been stagnant for far too long. So it's great to see you out there helping to push the space forward. Thank you again for your time, and I hope you enjoy the rest of your day.
[00:37:46] Junping Du:
Yeah. Thanks, Tobias. It was a very good conversation with you, and I wish you a good one. Thank you.
[00:38:01] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language and its community, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email host@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey, and today I'm interviewing Junping Du about Gravitino, an open source metadata service for a unified view of all of your schemas. So, Junping, can you start by introducing yourself?
[00:00:29] Junping Du:
Thanks, Tobias. This is Junping Du, and I have over 15 years working in open source and the data industry. I've served at companies such as VMware, Hortonworks, and Tencent. I used to be a Hadoop guy: a long-term contributor, committer, and also a release manager of Hadoop. So I'm a super fan of open source technologies, especially in the data and AI area. Now we are building a startup called Datastrato, which started in 2023, and Gravitino, an open source data catalog, is our main focus: to break down all kinds of data silos caused by different data lakes, cloud vendors, or the separation of data and AI in the software stack.
[00:01:23] Tobias Macey:
And do you remember how you first got started working in data?
[00:01:26] Junping Du:
Oh, I still remember. That was over 13 or 14 years ago, when I started to look at how Hadoop worked while I was at VMware. I was thinking, that's great: at VMware we were doing virtualization, but now a lot of data stored across thousands of machines could work together. It's fantastic. Virtualization tries to break a big machine into pieces, but Hadoop tried to merge a lot of machines into one big, giant machine. I thought it was a very cool technology, and I tried to combine the two. That's why I started an internal VMware project on how to run Hadoop efficiently and scalably on VMware's virtualization technology. That was my start, moving from cloud technology to data technology.
[00:02:37] Tobias Macey:
Now digging more into Gravitino, I'm wondering if you can give a bit of an overview about what it is and some of the story behind how it came to be and why you decided that it was worth your time and energy to invest in building this technology.
[00:02:50] Junping Du:
Of course. As I mentioned, I had been working on Hadoop technology for a long time, and afterwards I worked on cloud data warehousing and cloud data lake building. Along the way we made several interesting findings. First is the data silo problem. That means data is siloed not only across different databases and data warehouses, but sometimes also across multi-cloud or hybrid-cloud scenarios. Second, we noticed that the generative AI revolution has really created new patterns for accessing data, especially unstructured data. Previously, access to structured data was planned ahead, so we could do a lot of data preparation in advance. Now, much more data access is ad hoc and on demand.
And third, we found that data technology is actually growing a little more slowly than, and falling behind, the growth in GPU power and large language model technology. We definitely don't want data to become the bottleneck in this large language model evolution, so we want to join the effort to accelerate it. We think of it as something like a Tesla moment for data platform technology: just as autopilot arrived on top of ever-faster engines, the data platform is at the same stage now. So we asked ourselves: what is the intelligence about data?
I believe any data engineer would say it's the metadata; it has all the knowledge about the data. So that's why we're trying to build a metadata lake, whether the workload is a transactional database, analytics, or AI. That metadata lake is Gravitino.
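Gravitino's documentation describes a multi-level namespace (a metalake containing catalogs, schemas, and objects). The toy sketch below illustrates that "metadata lake" idea in plain Python; it is not the actual Gravitino client API, just an illustration of how one namespace can span heterogeneous sources.

```python
# Illustrative sketch (not the Gravitino client API): a four-level naming
# hierarchy -- metalake -> catalog -> schema -> object -- lets one address
# space cover transactional, analytical, and AI data sources at once.
from dataclasses import dataclass, field


@dataclass
class MetadataLake:
    """A toy 'metadata lake': one namespace over many heterogeneous catalogs."""
    name: str
    catalogs: dict = field(default_factory=dict)

    def register(self, catalog: str, schema: str, obj: str, kind: str) -> str:
        """Record an object (table, fileset, topic, model) and return its
        fully qualified name."""
        self.catalogs.setdefault(catalog, {}).setdefault(schema, {})[obj] = kind
        return f"{self.name}.{catalog}.{schema}.{obj}"


lake = MetadataLake("demo_metalake")
fqn = lake.register("hive_prod", "sales", "orders", kind="table")
print(fqn)  # demo_metalake.hive_prod.sales.orders
```

The point of the hierarchy is that the same addressing scheme works whether the object behind the name is a Hive table, a fileset of images, or a Kafka topic.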
[00:04:57] Tobias Macey:
You mentioned the metadata catalogs that are in use right now for different query engines for tabular data. You also mentioned the application of Gravitino to unstructured data sources for use in more AI oriented workloads. And I'm wondering if you can talk to some of the ways that you think about the expansion of that metadata store beyond just the tabular structures and beyond the constraints of a single database engine or query engine?
[00:05:28] Junping Du:
Yeah. Let me take an example. With tabular data, you manage everything through tables and table schemas. Now we also support fileset management, which is a kind of dataset management. So if, say, PyTorch wants to consume some data sources as datasets, we can manage those and serve them so PyTorch can read them directly. With this, we can manage structured and unstructured data in the same place. Then the data engineers will know how the AI team consumes that data. They can do access control, and they can monitor how the AI team is using the data. They'll know, okay, some of this fileset data isn't useful anymore because no models consume it.
So they can do better cost saving, data quality monitoring, things like that.
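The fileset workflow Junping describes can be sketched roughly as below. The class, method names, and storage paths are invented for illustration (this is not the Gravitino Python API): a fileset is registered once, consumers resolve it to a physical location, and the reads are audited so unused data becomes visible.

```python
# Hypothetical sketch of fileset management with access auditing.
import time


class FilesetRegistry:
    def __init__(self):
        self._filesets = {}    # name -> storage location
        self._access_log = []  # (fileset, consumer, timestamp)

    def create_fileset(self, name: str, location: str) -> None:
        self._filesets[name] = location

    def resolve(self, name: str, consumer: str) -> str:
        """A training job (e.g. a PyTorch DataLoader) resolves the fileset to
        its physical location; the read is audited as a side effect."""
        self._access_log.append((name, consumer, time.time()))
        return self._filesets[name]

    def unused(self):
        """Filesets no consumer has ever read: candidates for cleanup."""
        read = {name for name, _, _ in self._access_log}
        return sorted(set(self._filesets) - read)


reg = FilesetRegistry()
reg.create_fileset("training_images", "s3://bucket/raw/images/")
reg.create_fileset("old_embeddings", "s3://bucket/raw/embeddings/")
path = reg.resolve("training_images", consumer="pytorch-job-42")
print(path)          # s3://bucket/raw/images/
print(reg.unused())  # ['old_embeddings']
```

The audit log is what lets data engineers answer "which models read this data, and how often?" without instrumenting each AI pipeline separately.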
[00:06:33] Tobias Macey:
And I'm definitely interested in digging more into some of the specifics about how to populate and how to make use of that metadata for unstructured storage. But before we get too deep in the weeds on that direction, I also wanna talk a bit more about some of the problems that you are looking to solve with Gravitino and some of the ways that you have seen teams address those problems in the absence of Gravitino.
[00:06:58] Junping Du:
Yeah. So with Gravitino, just as I mentioned, we try to unify different data formats and sources across different data engines. We create a new layer that we call the modern open data catalog; I think it's a similar concept to a data fabric. With this kind of open data catalog built by Gravitino, no matter what kind of data it is, which cloud it lives on, or which format it's using (Hive tables, Iceberg tables, or some vector store), you can use a mainstream data engine or AI engine to access it: not only reads, but also writes, updates, appends, whatever. That's the problem we're trying to solve with Gravitino. Previously, a lot of engines and data products tried to build a vertical stack for data, but we would rather build a layer that serves the different verticals, so that we break down the silos in how data is accessed.
[00:08:14] Tobias Macey:
In the category of tabular structures, one of the longest running projects out there, at least for the case where you're not using a specific database engine, is the Hive metastore that was intended to address this cloud data lake or Hadoop style architecture where you have lots of files everywhere. You have different table schemas or table formats. And so the Hive metastore was used as a means of being able to keep track of all of them and to have different methods of populating that metadata, whether it's from Spark or from data crawlers or from API calls. And I'm wondering what you see as the reason that that was such a lasting technology that stuck around long after Hive started to fade away and some of the opportunities that you see for innovation in that metadata layer for being able to improve the overall experience around how to interact with these different data systems?
[00:09:16] Junping Du:
Yeah. The Hive Metastore really has lasted for quite a long time. It became the de facto standard for the industry ever since Hive appeared, maybe more than 10 years ago. It lasted because so many engines use the Hive Metastore to store all their metadata. But things have changed over the past few years. Modern data engines try to build their own metadata or catalog layer, such as StarRocks and some other new engines. And the Hive Metastore cannot directly manage AI assets such as unstructured data.
That's the reason: the HMS community is not growing quickly anymore, and it's not solving the problems we face today. So we ended up with a choice: either join the Hive community and build something new on top of a lot of legacy code, or start a new project that inherits some capability from HMS but extends it into a brand-new product with brand-new features. We chose the latter, which is Gravitino: building a new product that can stay compatible with HMS. I think that's also the logic for some other products: stay compatible with HMS, but build something new. This is driven by new requirements, and also by the long history of legacy that HMS has already accumulated; we have to treat that as a standard to follow.
[00:11:20] Tobias Macey:
In that general theme of Hive where it started off as a query engine, it brought along the metastore as a means of keeping track of the different table metadata. Looking at the documentation for Gravitino, it seems that it is also doing a few different things at once, where it has the metadata storage for being able to keep track of tabular and unstructured data so that you know where it lives. You have some capability for data federation for being able to query across these different data sources, similar to Trino or Alluxio. And it also acts as a way to do some measure of discovery of data in the vein of an OpenMetadata or an Atlan. And I'm wondering if you can just talk to some of the ways that you think about how Gravitino overlaps with some of those different areas of functionality, and some of the ways that you think about how it can either supplement or potentially even replace various technologies in different use cases.
[00:12:21] Junping Du:
Yeah, I think that's a good question. We do have some overlap with existing categories. Take the Hive Metastore for example: just as I mentioned, today we wrap HMS. That way we don't have to replace it; instead we manage it and upgrade it with new capabilities. That's very important, since HMS has served as quite an important component in the open source world for a long time. So we can be compatible first, and replace it later, once users feel more comfortable. As for the metadata repository part, OpenMetadata is more like a traditional data governance tool. It can copy metadata from one place to another and do some access control and data lineage work, but it cannot serve a computing engine directly. Those are different scenarios from Gravitino's.
As for data federation, we actually support Trino. We don't do data federation directly, but we let engines such as Trino and Spark run federated queries over multiple Hive metastores or multiple data catalogs, crossing different clouds or different data lakes, which is super cool technology to have on top.
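The federation pattern described here, where the engine sends a catalog-qualified name to one metadata service that dispatches to whichever backing store actually owns it, can be sketched as follows. The catalog names and locations are invented, and plain dicts stand in for real metastores.

```python
# Minimal sketch of catalog-qualified name routing across backing stores.
def make_router(backends: dict):
    """backends maps catalog name -> {table name: physical location}."""
    def resolve(qualified_name: str) -> str:
        # Split off the catalog prefix; the rest stays in the backend's terms.
        catalog, table = qualified_name.split(".", 1)
        return backends[catalog][table]
    return resolve


resolve = make_router({
    "hms_onprem":  {"sales.orders":    "hdfs://nn/warehouse/sales/orders"},
    "iceberg_aws": {"sales.orders_v2": "s3://lake/sales/orders_v2"},
})

# One engine, one interface, two clouds:
print(resolve("hms_onprem.sales.orders"))     # hdfs://nn/warehouse/sales/orders
print(resolve("iceberg_aws.sales.orders_v2")) # s3://lake/sales/orders_v2
```

In the real setup the "router" is the metadata service and the engine (Trino, Spark) joins across the resolved sources; the sketch only shows the name resolution step.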
[00:13:42] Tobias Macey:
And as you are building Gravitino as it has some of these different capabilities, there's always the challenge of scope creep where you say, oh, it'd be really neat if it could do this thing over here. And then you start to expand the range of things that it can do, which increases the overall load for maintenance and the complexity of the project. And I'm curious how you're thinking about what to explicitly keep out of scope for Gravitino and the things that you definitely don't want it to grow into.
[00:14:14] Junping Du:
Yeah, thanks. I think that's also a good question. We serve as a unified metadata center, which means our scope is already big enough. We don't want to grow the scope into something like a query engine. We don't want Gravitino to be a query engine project that serves the optimization of one dedicated query engine. So we treat Spark, Trino, ClickHouse, and whatever other computing engines we support as equally important. I think this is a very important, unique value for us compared with other new catalogs that are backed by a single query engine vendor.
I think that's fairly important.
[00:15:04] Tobias Macey:
In terms of the technical implementation, I'm wondering if you can give a bit of an overview of the architectural components of Gravitino, and some of the ways that you thought about the design and implementation in order to achieve the goals that you set out with?
[00:15:22] Junping Du:
Yeah. At a high level, Gravitino has a layered architecture. The core layer is the catalog abstraction layer, which supports database catalogs; fileset catalogs, which handle datasets; streaming catalogs, which support Kafka and other streaming engines; and model catalogs, which support managing AI models. We can support more catalog types in the future on demand, but currently this is wide enough to cover the mainstream scenarios we see in the community. Underneath is a data connection layer, which connects to the different data sources: filesets, tabular data, streaming data. And on top of the core layer is an interface layer.
It supports a REST interface, a JDBC interface, and potentially a Thrift interface as well, which lets different computing engines use Gravitino. Those three layers are the core of Gravitino. The topmost layer is a functionality layer, which provides data management and data governance capabilities such as access control, data lineage, data quality, et cetera. That is the technical architecture of Gravitino. We started working on it at the very beginning, before we started the company.
We think the most important thing is to figure out what to do rather than how to do it. We discussed a lot and decided to build the metadata lake first, because there's no reason for the world to have yet another query engine or compute engine; there are too many engines already. What's important is to have a single metadata layer so that the different engines can work with the different data sources. That makes the core design, the data models, very important: how to abstract the models for different types of engines and different types of data, especially structured versus unstructured. So we did a very careful and flexible design, and it has continued to evolve to today, so we can support different ways of manipulating and computing on data. That's the core part.
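The core abstraction described above, one catalog interface with concrete flavors per workload, might be sketched like this. The class names mirror the description in the conversation, not the project's actual Java classes.

```python
# Rough sketch of the catalog abstraction layer: one interface, several
# workload-specific catalog types (tables, filesets, streaming topics, models).
from abc import ABC, abstractmethod


class Catalog(ABC):
    @abstractmethod
    def object_type(self) -> str:
        """The kind of object this catalog manages."""


class RelationalCatalog(Catalog):   # databases and table schemas
    def object_type(self) -> str:
        return "table"


class FilesetCatalog(Catalog):      # unstructured datasets
    def object_type(self) -> str:
        return "fileset"


class MessagingCatalog(Catalog):    # e.g. Kafka topics
    def object_type(self) -> str:
        return "topic"


class ModelCatalog(Catalog):        # registered AI models
    def object_type(self) -> str:
        return "model"


kinds = [c().object_type() for c in
         (RelationalCatalog, FilesetCatalog, MessagingCatalog, ModelCatalog)]
print(kinds)  # ['table', 'fileset', 'topic', 'model']
```

The design payoff is that engines and the functionality layer (access control, lineage) program against `Catalog` and never care which concrete type sits underneath.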
[00:18:27] Tobias Macey:
As far as the role of Gravitino in an overarching data platform architecture, I'm curious if you can talk to the process of integrating it into an existing set of systems, some of the types of technologies that you would want to layer on top of Gravitino to be able to take advantage of the information that it holds, and also maybe some of the ways that you're using Gravitino in your own work at Datastrato.
[00:18:55] Junping Du:
Yeah, of course. At Datastrato, Gravitino is our first product and our main focus now. Gravitino is definitely a platform service. It can run separately, but it integrates into a mainstream open data architecture seamlessly: it supports Spark, Flink, Trino, Doris, Hudi tables, and traditional Hive tables. So if you have an existing system, keep it going. When you launch Gravitino, you'll be surprised to find it works very well with a lot of your existing components. It can merge the different metadata views into a single metadata view, and your engines can go through Gravitino to query or compute on data stores and data sources that you previously couldn't attach to. That's a quite interesting journey we've seen from the community.
A lot of community users and customers have given us that feedback.
[00:20:14] Tobias Macey:
Once a team has integrated Gravitino into their architecture and they're starting to use it to store, populate, and query metadata, I'm wondering if you can talk to some of the ways that they might interact with Gravitino in a typical workday, and how it fits into the overall workflow of being able to use the underlying data that Gravitino points to.
[00:20:37] Junping Du:
Yeah. If you've been using the Hive Metastore for a long time, it's almost seamless compared with your previous experience. Your Spark jobs and your Trino jobs continue to work with their previous data sources, which is fine. And if you're a data engineer working on some ETL, merging multiple tables from different places into a single table and then doing additional work in your data pipeline, you'll find it becomes very powerful, because some unnecessary ETL can be skipped.
Using Gravitino, you can see all the tables that exist and build toward your final table directly, skipping some of the intermediate tables. That's definitely a change. Additionally, because we're building a centralized metadata lake, you know all the access patterns of your data pipelines and how your metadata and data are accessed in different ways. So you can put some monitoring on that. You can see that some tables, especially intermediate tables, aren't needed anymore because no data pipeline is actually using them, and you can drop them.
You can do more fine-grained data lifecycle management on top of that.
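The lifecycle idea above can be sketched simply: when every pipeline read flows through one metadata layer, intermediate tables that nothing reads anymore become visible and can be dropped. Table names and read events here are invented for illustration.

```python
# Sketch: flag intermediate tables that no downstream job reads anymore.
from collections import Counter


def stale_tables(all_tables, read_events, final_outputs):
    """Tables that are neither read downstream nor kept as final outputs."""
    reads = Counter(read_events)
    return sorted(t for t in all_tables
                  if reads[t] == 0 and t not in final_outputs)


tables = ["raw.orders", "staging.orders_clean",
          "staging.orders_tmp", "mart.revenue"]
reads = ["raw.orders", "staging.orders_clean"]  # access log for the period
candidates = stale_tables(tables, reads, final_outputs={"mart.revenue"})
print(candidates)  # ['staging.orders_tmp']
```

In practice the read events would come from the centralized access log rather than a hand-written list, but the pruning logic is the same.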
[00:22:10] Tobias Macey:
And now in terms of the unstructured data flow, tabular data is fairly well understood. Lots of different tools interact with it and interoperate with it. Unstructured data has been around for a long time. People have different solutions, but it has continually been a challenge to work through. And I'm wondering if you can talk to some of the ways that Gravitino helps to address some of those challenges and the workflow for being able to locate and catalog and interact with that unstructured data and managing the organization of those unstructured sources?
[00:22:48] Junping Du:
Of course. Structured data has a long history of being managed; we have many, many years of experience with that kind of data. Unstructured data is quite a new thing. Previously, we used ETL to turn unstructured data into structured data; that's what the Hadoop age did for over 10 years. But today a lot of AI models actually want to go directly to the unstructured data and consume it in various ways.
That makes the requirements for managing unstructured data much harder to meet. Managing unstructured data with just an S3 link, or a link to some storage location, is definitely not enough anymore. So the first thing is that we need centralized governance for unstructured data: who can access it? That's number one. Number two, can we attach richer metadata than just a link? Can we have a description of what kind of unstructured data it is, how to use it, and how to turn it into something consumable by your models or your feature store?
That's number two. The third is how to make your structured and unstructured data work together. Unstructured data isn't only used at the training stage; sometimes it's useful in your RAG system, where you leverage your unstructured data to give answers to the questions you put to the large language model. So you need a centralized way to manage both structured and unstructured data and make them work together. Those are typically the three cases we see from today's requirements.
[00:25:14] Tobias Macey:
Earlier, you also mentioned being able to have some insight into whether a model or engine is actually accessing some of that underlying data, to determine how long you want to keep it around or whether you can cull that data to save on storage cost. And I'm wondering about some of the different access data that you're able to use to provide insights to the end user as far as what data is valuable, what data is just taking up space and cost, and some of the higher-order workflows that people are able to build from that visibility.
[00:25:51] Junping Du:
Of course. I think that's important, and some community users already leverage these features. They use filesets to manage their unstructured data, and they track the access patterns: how their AI team consumes this data and at what frequency. Based on those statistics, they use different backend storage. They have tiered backend storage: some storage for hot data, some for warm data, some for cold data. So they can move the most frequently accessed data to hot storage.
And the unstructured data that isn't actively accessed moves to cold storage, which saves a lot of cost in this unstructured data revolution. That's definitely a showcase of how to leverage these features to achieve fine-grained management of unstructured data.
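The tiering policy described can be sketched as a simple function of observed access frequency. The thresholds and fileset names below are invented for illustration; a real policy would be tuned to the storage backends in use.

```python
# Sketch: route each fileset to hot, warm, or cold storage based on how
# often AI consumers actually read it (counts come from the access log).
def storage_tier(accesses_per_week: int) -> str:
    if accesses_per_week >= 50:
        return "hot"    # e.g. local NVMe or premium object storage
    if accesses_per_week >= 5:
        return "warm"   # standard object storage
    return "cold"       # archival tier


usage = {"training_images": 120, "eval_sets": 12, "2021_crawl": 0}
placement = {name: storage_tier(n) for name, n in usage.items()}
print(placement)
# {'training_images': 'hot', 'eval_sets': 'warm', '2021_crawl': 'cold'}
```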
[00:27:03] Tobias Macey:
Another piece that you touched on briefly is the governance of that underlying data. When I was looking at Gravitino, it mentions having a centralized access control capability. And I'm curious if you can talk to some of the permissions management, some of the ways that that feeds into the other systems that are relying on Gravitino, and some of the overall challenges of managing permissions and access control across different layers of the data stack.
[00:27:38] Junping Du:
Yep, that was really the pain point before: you had to go to the different data engines and set up each one's permissions and access control settings. Gravitino tries to solve that in a more unified way. The design of Gravitino is that we have separate authentication and authorization workflows. Through authorization, we can work with the different underlying platforms: the Hadoop ecosystem, where you may be using Kerberos, and cloud engines like BigQuery or Snowflake, which definitely use some IAM-related mechanism.
We work with these different mechanisms so that the credentials Gravitino takes can be used in the different underlying systems. That's the high-level design principle. With unified, centralized permission management, customers don't have to configure access control in all these different places; you can set it directly in Gravitino. After that, the permissions, whether for fileset management, table management, or even fine-grained low-level access control, are set in the underlying systems. We are not the storage layer, but we can help set the storage layer's access control. That's our design principle.
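The push-down idea, one grant recorded centrally and translated into each underlying system's own mechanism, can be sketched like this. The translation table and statement formats are invented for illustration and do not reflect how Gravitino actually talks to Ranger or Snowflake.

```python
# Sketch: one central grant, translated into per-backend enforcement actions.
PUSHDOWN = {
    "hive":      lambda p: f"Ranger policy: {p['role']} {p['priv']} {p['obj']}",
    "snowflake": lambda p: (f"GRANT {p['priv'].upper()} ON {p['obj']} "
                            f"TO ROLE {p['role']}"),
}


def grant(role: str, priv: str, obj: str, backends):
    """Record one central grant and emit each backend's enforcement form."""
    policy = {"role": role, "priv": priv, "obj": obj}
    return [PUSHDOWN[b](policy) for b in backends]


stmts = grant("ml_team", "select", "sales.orders", ["hive", "snowflake"])
for stmt in stmts:
    print(stmt)
```

The design benefit is the same one Junping describes: the grant is authored once, in one place, and each storage or engine layer still enforces it natively.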
[00:29:25] Tobias Macey:
As you have been building Gravitino, using it in your own use cases, and helping to support other people who are adopting it, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:29:41] Junping Du:
That's interesting, because previously we wanted to build very generic catalog support: we support JDBC, and we support REST catalogs in general. But eventually, community users asked, okay, can you build catalogs for the data lake formats, such as an Iceberg catalog or a Hudi catalog? So about half a year ago we started building the Iceberg REST catalog, integrated that capability into Gravitino, and now we support it. You may see two or three other open source data catalogs that can support Iceberg now, but we did it first, and we did it because the community asked for it. As the community continues to grow, we'll find more and more scenarios we can potentially address just by following community users' needs, and we'll definitely find something new and interesting to follow.
[00:30:55] Tobias Macey:
And in your own experience of building this project, working with the community, building this business that relies on Gravitino, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:31:12] Junping Du:
Yeah. So I think, initially, we tried to totally replace HMS, to be honest. So we think the HMS is the last thing for too long time, just you said. Right? But after we do that, we have a very early stage, initial version. But a lot of community users say, hey. You know, I want we want you to compatible with HMS because we're using it for a long quite a long time. We don't want we may re to type it in some day, but not now. So can you work with us? So this is something unexpected, but, you know, we respect what the communities request or or required.
I think this, important lesson. And, also, a lot of, lessons we learn from it, among the, you know, AI part. You know, we don't realize, it's because the FISAT management stuff is not so it's not a very complicated technology. Right? But with this kind of adding features, we're finding a lot a lot of users is really interesting on this feature. And, really, they're suffering a long term, pinpoint from that. So, I think this is something, quite interesting as well.
[00:32:32] Tobias Macey:
For people who are looking to either expand or improve their experience of catalog management, what are the cases where Gravitino is the wrong choice?
[00:32:44] Junping Du:
Yeah. So if they have very limited data sources and their scenarios are very simple, like they just have one data engine, maybe on top of one cloud, I think they don't have to use Gravitino. Right? And if there's no AI workload involved, just pure data analytics or data engineering with no governance required, I think Gravitino may not be very useful in that case. It doesn't do any harm, definitely, but it's not very useful.
[00:33:20] Tobias Macey:
And as you continue to build and evolve and improve on Gravitino, what are some of the things that you have planned for the near to medium term or any particular projects or problem areas or features that you're excited to dig into?
[00:33:34] Junping Du:
Yeah. So we'll continue building more AI capability on top of that. Especially, we are adding more features for unstructured data management, including lineage. Right? So how do unstructured data and models connect, from unstructured data to features, and from features to models? That's a kind of lineage capability. We're trying to build a lot more governance and lineage capability, also including data sharing at some point, to make Gravitino more useful in many kinds of cases.
We really think that in the future, as I just mentioned, data could be a bottleneck for this AI revolution. What does that mean? It means we're lacking enough data, and we're lacking enough high quality data. So part of Gravitino's mission is to unify all the possibly accessible data, to make it more accessible. We also monitor data quality, to increase the visibility of the quality of the data, no matter whether it's structured or unstructured. I think this mission is quite important to Gravitino. And this is also a very important reason why we donated it to Apache, to become an Apache open governance project, because we want it to be open. We want to address the real data challenges in the AI era.
[00:35:20] Tobias Macey:
Are there any other aspects of the Gravitino project, the overall space of catalog metadata, unstructured data management, or the ways that AI workflows are evolving the needs for data cataloging that we didn't discuss yet that you'd like to cover before we close out the show?
[00:35:37] Junping Du:
Yeah. I think we've discussed a lot of this kind of stuff. We can discuss more later in the future, once we've made more progress.
[00:35:50] Tobias Macey:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:36:06] Junping Du:
Yeah. I think, definitely, it's that the data folks don't know AI too much, and the AI folks lack the data technology background or perspective. That's today's big gap, I would say. So we try to be a unified layer, not only a product to unify different technologies, but actually to unify the data engineers and the AI engineers, to make the two teams understand each other. Only if they're using a single data platform or data tools can they understand each other. If they continue using different tools, how can these two groups of people know each other? So this is our mission. I think it's also our dream.
[00:37:01] Tobias Macey:
Absolutely. Yeah. There's definitely a lot of incidental complexity coming up as a result of the increased usage of AI and machine learning, and the fact that the technology stacks are being built independently and in isolation of each other. There hasn't been a lot of bridging that has happened yet.
[00:37:20] Junping Du:
Of course.
[00:37:21] Tobias Macey:
Yeah. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on Gravitino. It's definitely a very interesting project, and it's great to see more innovation and investment in this space of cataloging, because as we noted before, it has been stagnating for far too long. So it's great to see you out there helping to push the space forward. Thank you again for your time, and I hope you enjoy the rest of your day.
[00:37:46] Junping Du:
Yeah. Thanks. Thanks, Tobias. It was a very good conversation with you, and I wish you a good one. Thank you.
[00:38:01] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language and its community, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Junping Du's Background and Career
Overview of Gravitino
Problems Addressed by Gravitino
Hive Metastore and Metadata Layer Innovations
Gravitino's Capabilities and Scope
Technical Implementation and Architecture
Integration into Existing Systems
Managing Unstructured Data
Access Control and Permissions Management
Community Feedback and Unexpected Use Cases
Future Plans and Features
Closing Thoughts and Contact Information