Advanced Lakehouse Management With The LakeKeeper Iceberg REST Catalog

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale.

DataFold's AI powered migration agent changes all that.

Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches.

And they're so confident in their solution, they'll actually guarantee your timeline in writing.

Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafolds

today for the details.

Your host is Tobias Macy, and today I'm interviewing Victor Kessler about architectural patterns in the lake house that are unlocked by a fast and feature rich iceberg catalog. So, Victor, can you start by introducing yourself?

Absolutely. Well, first of all, Tobias, thank you for, for inviting me to to your podcast and and giving the opportunity to talk about the

lake houses, icebergs, and all all that, stuff.

But, yeah, maybe just couple of words of myself. So my name is Victor. I'm based out of Germany.

And,

I'm one of the cofounder of, startup, Vakama, and we developed

a very cool technology named Lakekeeper, but more details on that one.

Well, from a background, maybe, just to give you some some idea, I I do have a computer science background, and I spent,

around, like, ten years building data warehouses.

I guess some people don't even know what is a data warehouse

nowadays. So everyone knows about data lake and lakehouses, but it's where where I started.

And,

from from my career, I was,

in a in a financial industry, but the last last

career stops where,

I was at MongoDB. I was at Dremio. So all the time, used to work with data. And do you remember how you first got started working in data?

Oh, yeah. Absolutely. I actually,

was even in in my study and, you know, like,

I'm I'm in Germany, and one of my,

jobs or, like, internship was at SAP.

And,

when when I was at SAP, I I got some data. And,

well, believe me or not,

my my instrument, my BI tool was,

Microsoft Excel and Microsoft

Access.

And that's how I started just to crunch some data to to prepare some KPI metrics for projects.

Well, there have been a few different,

anecdotal

statements that I've heard over the years that the most widely used programming language is Excel, so you're in good company.

Absolutely. Visual Basic

for VBA is is a quite quite interesting language. Yeah.

Digging now into

Lakekeeper

and what you're building and some of the story behind that, I'm wondering if you can just give a summary about what it is and why you decided that it was a project that you wanted to invest

in. Well, first of all, Lakekeeper is an open source,

project on a GitHub. And let let me maybe start with that idea from open source, and everyone who listens just wanna motivate to to become a contributor of open source. And, you know, like, contribution is not just about to write a code. You can just go to GitHub, search for Lakey Lakekeeper, for instance,

and and just test it, give us a feedback, and and, you know, like, give us a star. That's the easiest way how to contribute.

But back to the Lakekeeper. Lakekeeper is an open source, Apache2.0

licensed software. And Lakekeeper is a Apache,

Icebrooke REST catalog

written in Rust. That's quite interesting as well. And it's super easy just to to use it in order to build a lakehouse because, you know, like, if you wanna build a lakehouse, you need three main components.

The first one is, storage. So you can go to Amazon. You can go to Google or Azure. The second component of Lakehouse is compute. So you can use either PyIcebergSpark.

You can use, Dremio, Presto. You can use Starburst, Drino.

And,

the third component is, is a catalog because,

what what we have here, we have a table format, Apache Iceberg.

And in order to create a table and in order to to manage the life cycle,

you need you need a catalog. It's

unfortunately

a a misused word, catalog. But in essence, what we have here is something what we what we know from a database like information schema, and that's something what we built,

in the first place so everyone now can can actually build a lakehouse.

Digging more into that element of the catalog

that, as you said, has been part of database architecture

since databases were created.

And

in the

data lake, data lake house ecosystem,

the predominant

manifestation of that has been the Hive metastore,

basically, as an accident of history because it was something that was there. Everybody just kept using it

and adding to it as the Hadoop ecosystem

grew and expanded. It eventually fizzled out, but the the Metastore

stayed.

And in the past,

I'd say, two years, there's been a lot of renewed interest

in that catalog ecosystem

and the different types of features, the use cases for the catalog as an standalone component both within the lake house and also now more broadly in this

expanding ecosystem

of data assets as they

grow to encompass unstructured data, AI models, etcetera, etcetera.

And I'm wondering what you see as some of the

main motivating forces that have led to that renewed focus and renewed energy around the catalog as an actual product in and of itself?

Yeah. So it's quite

a lot of stuff happened in the in the last couple of years. You know? Well, the first thing maybe even to start before Hadoop and and Hive, if you look at at a classical data warehouse, so you had your Oracle, your Postgres,

and it was quite easy and comfortable just to go in and create some,

data warehouse core and data margin and store the data. But, then can Hadoop happened or came came by. And and what we've got is, that we started actually to decompose that monolithical system what we know, like, from a classical database.

Like, my my cofounder, Christian, is, explaining that if you will take a a Thor hammer and you will just hit the database and that will fall apart, you will get, like, first of all, a storage. It's what what we have, like, with HDFS or with Amazon S3.

The second part would be probably,

like, op local local database, which what we know, like, is a Kafka messaging system, NATS messaging system, lot lot of different messaging systems.

Then the third part of of that composition,

I would say, is,

a compute, and it's why we've got Spark, Python.

We got Trino, Presto, all the different compute engines. And and as of now, right now, we we see a lot of innovation in that space as well. So if you look at, Polar, Stata Fusion, DuckDuty,

lot of compute happening here as well. And the last piece was the catalog. And, you know, like, when we started with Hadoop, we started to decompose the monolith, and we've got actually Hive metastore. And as as you just described, like, okay, we have something which gave us

an opportunity or possibility to manage the table state, the metadata of a table, because table consists like of two main components of the metadata and of data itself. And,

we had our our catalog Hive Metastore as a facade, which,

we could we could use to manage the table. Unfortunately,

the main focus was always on the data

and

almost less on metadata.

And, therefore, if you look at the Hive metastore and everyone who played around the metadata,

it was not that kind of crucial and main thing. The main point was always like, okay. Let's look at the data. So someone was on the website, so let's send send him an email. Or he opened the email, let's send them a discount. So, you know, like, the whole actionability was on the data. And and if you look at the metadata, it's stuck in the past. It was more like necessity. Oh, let's go and and just type the information about who is the owner of that table. And it was like John Doe was the owner of the table. And if someone was like asking, we need to make a change to that table, let's go to John and ask him about that table. Where is John? Well, John left company for two years. No one knows who manages that table anymore. And at the same time, we we have kind of a challenge

even today with, well, ownership,

with governance, with equality,

outdated statistics, outdated results. So every challenge what we actually have is more related to the metadata. And and that was kind of a trigger

where the community was more concentrating how to solve the metadata,

how to solve that that all the problems and the solution is is is in the metadata,

and especially to to become the metadata actionable as as as the data itself. And,

the quite interesting thing if you look at the at the past, like, okay, databases. We had information schema, very

rigid, bureaucratic,

very hard to to do any stuff. Then we got the Hadoop ecosystem with Hive Metastore, very flexible but kind of chaotic, so really hard to manage the quality and ownership. And now we have, kind of a new player,

a lakehouse

with Apache Icebrooke as as a table format.

And exactly that is a promise of Icebrooke and a lake house that they can actually get to a place and to to a point where

the metadata will become actionable and the data will be actionable. But

the question of how to make the metadata actionable

is the responsibility of the catalog. And now we'll be getting, like, to to to the catalogs of Apache Iceberg, and there are multiple different ways of of catalogs. So, you you know, like, we started even with Hive Metastore, but the evolution of Iceberg catalogs get to the REST catalog and Lake EBay's REST catalog, which

has very cool features about how to make the metadata actionable. Another

challenging aspect of the naming of all of this, both in terms of catalog and the metadata management, is that there's been a lot of overload in terms of that concept of a data catalog, where there is the catalog

within the database engine and the compute layer and the fact that the compute engine needs to have that catalog in place to be able to find the data to operate on and execute queries against.

But there is also

and there still is, but for a few years, particularly around 2019

to 2022,

there was a lot of activity around data catalogs as a higher order product focused on

cataloging all of the data that you have across your entire data ecosystem,

including tables, but also including

files in s three or data models or various other database engines. So cataloging across the entire set of technologies that you're using, not constrained within

a particular

technology

or set of technologies. And I think that that has also led to a lot of confusion

around

when you say catalog, which type of catalog are you referring to, and what is the purpose of it? So in this context, we're talking about the catalog and the database, but then somebody might also have a catalog such as an Atlan or an open metadata for everything across that, including what's in your rest catalog and wondering how that has muddied the waters for somebody who is building that more focused and targeted catalog technology.

Yeah. You you you know, you you just,

describe my my kind of usual day. You know, like, I'm just starting to to explain what what we actually develop and what we do, and then or is it something like open metadata

or, like, Apache Atlas or whatever?

No, guys. We we have, like, a different focus. Unfortunately,

the name is catalog, but then, yes,

there's a there's a different focus and more

a technical focus and and especially on the technical metadata. But quite interesting thing is, though, that we started with Apache Iceberg, and we started with, a metadata of of Apache Iceberg.

But because catalog is in in so simple place, and especially if you look, on on compute and you have, like, a write compute and you have read compute. So you have, like,

Spark,

which which writes to a table, and then you have Trina, which will read from a table. And then you have optimized compute because what you need to do is to optimize your Iceberg table, like, every hour every day in order to get the performance of that table. But if you look at at that scenario, so you have your storage, you have your Amazon S3, and then you have, like, all the different computes on on on the left, something which writes to Iceberg, on the right, something what reads from Iceberg, and catalog is in middle of that. And the question is like, okay. So if that's not an open metadata catalog, which just has a description about the metadata, but it's a catalog which actually allows me to govern the access. And the question is

usually in in a large organizations, like every department has its own compute. So how to manage the access to to that table? And the answer is actually inside of the metadata because because we we able to manage the access. So inside of Lakekeeper, we implemented,

OpenFGA,

so open fine grained access control. It's open source technology as well, which allows us

to to give access to a specific group, let's say, marketing,

who manages the customer table, and marketing will use PySpoke to write to the table. And then we have someone from sales. They're using

Trino. So the Trino users will,

just only have access control to read the table. And inside of a catalog, we can just go and create the access for the metadata

to that table specified

on the groups and and give specific access control to to to to those groups. And and just to to finalize my my my kind of idea and an explanation to the customers about their catalogs, right now, you have a

a lakehouse catalog for Apache Iceberg, and Lakekeeper can manage your access. But then we have all the data scientists, and and they just want to use, like, ML, LLMs, and they would like you actually to access the volumes where you don't have an Iceberg table, but maybe you have just PDFs or or images, and you would like actually to run your your model training on on top of that data. And the question is from customers, okay. Now we manage

the access for metadata,

and why not just to extend the catalog such like Lakekeeper? So Lakekeeper will manage access for

AI objects as well, for volumes, metadata.

And,

that's something how we extend actually the the idea of catalog. And then that is, like, an additional thing how to explain. We started with Lakekeep of Apache Iceberg, but the future will go beyond the Apache Iceberg,

but, all motivated by by the metadata

to to provide access control, for instance.

The focus on Iceberg as well brings up the broader question too of cataloging across table formats where some of the most notable ones in the tabular dataset are Hudi and Delta Lake,

but now there is also additional

options.

Maybe one of the more notable ones is the Lance table format, which is a a sort of a superset of what Iceberg offers with the inclusion of vectors as a data type and,

some changes in terms of how it manages,

inserts and updates in the table structure.

And, there are a few others that are escaping my memory at the moment, and I'm curious, I guess, what are some of the opportunities

for, maybe not consolidation in terms of the tables, but,

being able to agree upon the interfaces for the catalog to be able to manage operations across table formats as well, for being able to act as that unifying layer across compute engines and table formats?

It's quite interesting to think about that topic.

You know, right now, we have, like, the three main table formats, Icebroke, Delta, and Hudi. And we can argue, like, which one is more popular, which one is more

performance, which one is more used by, engineers, and so on and so forth. And then we have, like, all that innovation because someone is not satisfied with Iceberg or with Delta, and they're just starting to do some additional stuff because the community,

either of Iceberg or other communities, are too slow. And and that that's kind of the thing about the open source that there is a a very

sometimes bureaucratic process to to to get new features. The thing is, you know, I would like to maybe go to my kind of a previous job as a solutions architect.

And you need to ask, like, two questions. If someone is just starting to do something, and developing and making innovation,

you need to ask, like, so what? So what is the business impact on on that feature? And

then then it it will be more understandable

why

some format will will be more

used by, by by industry. And the answer on, so what an iceberg is quite, clear because the adoption is very great. And especially for the REST catalog, the adoption got

very, very easy. So everyone is just to need to adopt the REST specification,

and then you can read and write to Iceberg. And that's why you you see,

as of now, a lot of, implementation based on Iceberg Java. We are, trying to contribute to Iceberg Rust. And I know, like, DuckDV is now working on Rust catalog implementation as well, so they have already PR. And therefore, Iceberg is quite

well positioned and,

I would say, dominates

the scene. But

I'm biased a little bit, I would say. But still,

the thing is if Iceberg is not capable to answer some business challenges

and is too slow to implement it, at the end, someone will implement it and maybe create a new project, which will end up in a new table format. And then is the question is how easy to adapt that new table format. So you mentioned a couple of them. And, I I think if that new table format will bring a lot of,

new business opportunities,

so BusinessNow can can make make a fast decision, or maybe we will go to the AI agent

times and and operators times that where ISB is not sufficient and we need to do something else, then that new format will be adopted,

as well. But as as long the question is, so what is not answered and and who cares about that new technology is not answered as well, it it will be kind of difficult to adopt that that new format.

And,

as of now, Icebrooke,

I I would say, implements

a lot of new stuff, especially around the

around streaming and then kind of real or near real time capabilities. And if if you look at the, SPAC b three and every every new innovation around Iceberg,

so I I would say we we can actually answer

a lot of data related

challenges with Iceberg, but not all of them. So we still have some some kind of thing around embeddings. And I have discussions, for instance, with,

vector databases,

and I understood that they actually,

well, they started similar, like, data warehouse in in in a way of,

monolithic approaches where compute and storage or data is all tied together. But right now, we're they experience same same challenges that they need to

split compute and and data.

So they will probably go similar way or maybe Aisto can answer the the answer,

the the challenge for, vector databases so they can adapt that as well. And and, it it might that we're gonna end with a new format, or it might that we're gonna have, like, ISPO tool at all for

warehouse, for analytics,

and for AI. Spending a little bit more time on the current

competitive ecosystem

of the catalog space,

probably the

two most widely

marketed, at least, options in the ecosystem right now come from the two behemoths in the space, Snowflake and Databricks. So Snowflake has their Polaris catalog,

Databricks released their Unity catalog.

And in that overall space of catalogs, the ecosystem around it, what do you see as the main

categories

of differentiation that the different players are aiming for and the opportunities

for innovation

and feature addition at the catalog layer?

Yeah. So the the idea of Unity was initially

a Delta table support and and to support

Databricks ecosystem. So right now, we're opening that, catalog for, for the community as well. So it's,

CNCF or it's it's Linux,

foundation, which,

takes care of Unity right now. And

the focus is still,

towards the compute engine. Right? So that Databricks can provide the best way how we can read and write to, well, at the moment, Delta, and I know that Icebook is just about to start. So I think in in around June, July when Databricks has its own conference, they're gonna announce the support for iceberg reading and writing. And and still, it's all about how to use compute and and your catalog and especially how to to manage the access, to to the data. And I know that Unity looks towards AI as well, so they provide

the access for volumes. So if you if if any AI use cases, you can actually use Unity. But the best integration is,

from from my understanding, is if you're inside of Databricks ecosystem.

On on the other hand, we have Polaris with Snowflake,

and here's similar story without focus on Delta. So we like, from scratch,

Polaris started with,

Iceberg. And

here's kind of a similar story, that Polaris is

well supported,

as a managed catalog inside of a Snowflake.

So you can actually go and then you don't need to think about how to operate, Polaris. You will just get the managed service by provided by Snowflake. But, again, here is the best integration. If you're inside of,

Snowflake ecosystem,

and and that's kind of like the corner point. So if you're inside of ecosystem of Databricks, inside of,

Snowflake, inside of Amazon, for instance, because Glucadrill provides that as well, I will I would say, that you you can become a very happy customer because because everything is integrated.

The,

the the proof is

that if you go to enterprises,

you will actually find the whole zoo of possible tools, Databricks, Northlake, Amazon. It's multi cloud. It's hybrid cloud. We have on premise. And the challenge of CIO,

is how to manage that zoo. Like, everyone is happy and then, you know, like, someone is not hurting someone,

because,

the ultimate goal of that organization is,

well, to make money. Right? So that that organization needs to be success filled. And and getting back to that kind of poor person,

CEO, CCO of of that organization, they need to manage all of that. And

what what we wanna just provide from our perspective is that become more agnostic.

So and if catalog is a central piece,

so maybe you need some some kind of a way that is,

not tied to any compute engine, not tied to any, let's let's say, ideology

of of compute engine. Like, oh, we use Spark and we use maybe not Spark. We use something different. And and that's the the main difference what I see from from a c level and from management level that they need to to make a decision

how they're gonna do that because there is a difference about where you store your data, how you, you know, like, read and write your data, and how you manage your data. Because for managing of data, you need the metadata, and that's, the thing of a catalog.

Again, maybe I'm I'm biased, and I am biased.

What what I see, there are plenty of different rest catalogs for Iceberg who just not concentrating on a specific compute engine technology.

Like, it's like our slate keeper and then there's Apache Gravitino as well. So the the idea is a bit different. Like, okay, we wanna create a catalog for

every compute engine. So you can you can actually go and use DuckDuty, and then you don't need, like, any big

compute maybe at all. The the thing is, if you look at us at at, like, more neutral player, agnostic players of catalogs,

the direction

is less on Compute Engine, but more on the metadata. And there are a lot of innovation around the metadata, which we have not in the last twenty years probably. And that's the main differentiator if you look at the pure

catalog players within ACE and metadata. And And if you look at the catalogs which are tied to to any compute engine, the DNAs data. And that's kind of a main difference that you have, like, metadata companies and you have data companies.

Spending a bit more time on authorization,

security, authentication,

You mentioned that Lakekeeper is integrating with OpenFGA for being able to give that granular control.

But more broadly,

as far as the integration

up and down the stack of authentication and authorization. That's something that I have personally run into

numerous places and numerous technologies of being able to manage that,

and let the different pieces of the tech stack know about where that authorization is happening. Because, for instance, if you're using

Iceberg with AWS Glue or with Lakekeeper,

with Trino as compute and Superset as the BI layer,

Every one of those things has its own ideas about permission and where it comes from and who owns it and what is the appropriate layer at which to apply those permissions.

And I'm wondering what you're seeing as the capability

of the

pieces that are higher up the stack from Lakekeeper

to be able to understand and reflect those permissions without just throwing obscure errors that lead to days of debugging as to why can't I do this thing that I thought I should be able to do.

Yeah. Absolutely. You know, like, the the thing is again so why we end up in that situation that everyone has an opinion, like, every tool has an opinion about the authentication and authorization,

is because we started to decompose the monolith. So before that, we had, like, our Postgres and then we had the user on the Postgres and everyone was happy. And then that worked. But then we started to decompose that monolith. Everyone started to build the tool and, business was asking, okay, how about authentication and authorization?

And it's how we we end up that in every tool, you will go and you will find some kind of implementation

about how to give access to to to an object.

So right now, like, as as pendulum goes in into one direction, now it goes in a different direction. So what we need to to understand

is,

the computation

and storage can be decentralized,

and they can use massive parallel processing and everything.

But there are components

which need to be centralized,

and the authorization is one of those components. So you need some kind of a way that you have, like, an essential place where you can just say that, Tobias can go and write to customer table and Victor

can go and read the customer table. And and, and you don't wanna end up like, okay. So in Superset, I can set that setting, and maybe I can go to Trino. But what about spark and PySpark?

So you you will end up like, that you will have some somewhere in your ecosystem inside of your company,

kind of a hole that someone will go and read your data or manipulate your data and evolves, like, that you have,

control of that. And and then that's something what we need to avoid, like, 200%.

There's a single system which authenticate someone, and that system is,

IDPs or, like, Android,

Okta,

Auth0,

Keycloak. I I don't know. There are a lot of different systems, but it's a single system in a in a whole organization which helps,

the identity of a person, of principle,

and it can issue,

the,

identity, like, token, for instance. Right? So we are all and on the second thing is about how to access that specific object.

And, since

catalog,

a REST catalog of, Apache Iceberg

manages the metadata and everyone needs to go to the catalog to ask where where is the data, That's

the one logical place how where you can actually put the information. If someone is asking for the customer table, let's ask who are you actually? So are you Tobias or Victor? And then the catalog will actually make a decision if, you can do that specific operation, what you're asking the permissions for or where the cable location

is actually. And the quite interesting thing, what we developed

with Trino

and Lakekeeper,

we took the next mechanism to OpenFJ. We took the open policy agent and we built a bridge

which allows you actually to use Trino as a multi user system where Trino will provide the information to Lakekeeper that we know that you as Tobias trying to to read the table. And,

then Lakekeeper will look inside of OpenFGA

what what is your permission state and will actually make a decision that you can read the table. So to put that all together, you have, like, Amazon S3 storage where Parquet file, the part of,

Iceberg table, is stored. Then we have Lakekeeper, which manages

the metadata of that Iceberg table. Next to, Lakekeeper is OpenFJ. So, again, it's not built in inside of Lakekeeper, but it's very close to Lakekeeper.

And, actually, you can you can actually take and replace OpenFJ for LDAP or, I don't know, like Ranger maybe. There are different ways how how to manage that

authorization information. But right now, we have OpenFJ.

And

then there is an OPA, like open policy agent,

which connects to TreeNo and to LakeKeeper.

And then if you run your select statement from open,

from Superset, we will understand who is running that,

query through for a whole setup, and which means you can actually build the lakehouse

for open source technology stack without spending any licenses at the at the moment, and you will, you will actually get the,

well, enterprise ready secured system for multiuser,

usage of of your enterprise.

So digging more into Lakekeeper itself,

you mentioned that it's built on Rust. You mentioned some of the ecosystem that it's integrating with as far as open FGA, open policy agent, the compute layers. Can you describe a bit more about the

design and implementation

of Lakekeeper itself and some of the ways that that design and the goals of the project have evolved from when you first started working on it? Yeah. That's quite interesting because we started,

almost for for a year. And the idea was, okay. We wanna build a lake house, and and we need the lake house for our platform. And at that at that time, you can actually go and use Tableau.

So we we made a decision that, this is gonna be a core component of our platform, so we need to actually to build our own,

catalog.

And we don't wanna use Hive Metastore or anything on those. And the question was, like, in in which language we're gonna build that? Because, you know, like, if you look at Iceberg, Iceberg Java is very mature. You can use it, right away and build your your,

catalog. So it's, super fast to to implement the catalog based on, on Java.

But

from our understanding and on from requirements, what we,

estimated that this is a very crucial

piece of infrastructure, of data infrastructure.

And to get, like, the most possible performance and stability, we decided to to use Rust. And Rust is kind of known for for especially for that. Right? So if you if you need something robust,

then you will use Rust. So we started to build, our catalog in Rust. And, we made a decision to to make that as fast as possible modular. So which means, like, if if you look inside of our code, you will find, like, a lot of stuff is modular. So there's

a part which is required for Iceberg, then there's a part which is for authorization. And right now, we're implementation for,

OpenEdge.

We made a decision that the store of the data is gonna be stored inside of Postgres. But it's, again, easy to just to to change because right now we use SQLX, and then you can go and build something your I don't know, like MySQL or, like, different dialects. So it's super easy just to to adjust for that or if you wanna use MongoDB. So it it would be not not a kind of a a big change that you need to rebuild the whole, catalog. It's just a module which you need to just to to implement inside of Lakekeep. And and that's that's kind of a design of Lakekeeper.

And the additional thing would be, decided at the beginning that it shouldn't be only for a cloud, like as a service.

But everyone who is,

like, trying to manage its own environment through Kubernetes,

we want we wanted to support that as well. And if you look at, the support of Kubernetes inside of Lakekeeper, it's it's super integrated. I would say it natively runs inside of Kubernetes.

We provide Helm Chatter as a Kubernetes operator as well. And,

we have even the integration with Kubernetes IDP. So here here for operator, we we can actually go and then create warehouses, create a namespace, and all that different stuff. So just to sum it up,

if you,

run your environment

with your Kubernetes, so vanilla Kubernetes, Akash, EKS, all the different Kubernetes',

outside, you can just take Helm Chart, deploy it, You can

use Kubernetes operator if you wanna be completely

out of tech, or you can use, like, some RDS from Amazon as well and,

just deploy Lakekeeper. And it take you it will take you, like, I don't know, like, five minutes to deploy the Lakekeeper.

And the plan is in in the future to provide soft software as a service as well. But right now, we have a lot of, proof of concepts and, I mean, production clusters on on Kubernetes.

For people who have already

invested in building out a set of tables, they have an existing catalog, whether that's Glue or Hive Metastore or some other rest implementation.

What does the migration path look like? Or maybe they have a bunch of iceberg tables and no catalog. And I'm just wondering

how you are thinking about the onboarding and the adoption process and some of the ways that you have worked to make that as smooth as possible.

Well, my first question would be if if it's required to to migrate. So if someone is in AWS Glue, you you don't need to migrate to any catalog maybe. So because if you're all integrated, it's all always good. You can just use AWS Glue. The thing is is getting more tricky if

somehow your management made a decision go to multi cloud because, you know, like, the price negotiation of the cloud to hyperscale is is not that easy. And and management is usually tending, like, to say, okay. Now we have Google or now we have Microsoft just as a second provider. Then it is like, okay. We need to think about especially that part. And,

then then there's a reason why you will migrate from AWS Glue, for instance. But you you don't need to migrate at all. On Hive Metastore, it's it's

might be different because if you look at the community of Hive MetaStore, that it's not that active anymore as as it it was. And,

maybe some some vulnerabilities on the Hive MetaStore, and and there's a reason to to migrate. And it's quite easy to migrate if you're already on Iceberg because,

everything what we need is to migrate the metadata.

The data can remain inside of, your object storage. We're just about to,

implement the HDFS support as well. So for everyone who is on Hive metastore with HDFS,

they actually cannot move easily to, Iceberg, like, because

Iceberg was intended to support object storages, but, we we just think about to support HDFS. So if you if you're on HDFS and you would like to use a modern catalog, then you can go to Lakekeeper. And there is a lot of tools.

So if if if you're not able to find it, just reach out to us or go to our Discord channel and ask us so we can provide you a migration sweep,

which which is going

to mainly

read the Hive Metastore metadata,

go to Lakekeeper,

register API, and register that table. That's it.

It's quite

trickier if you

if you're on Hive Metastore and you have, like, CSV, JSON, or Parquet based table. That that's something what will require more work because you need somehow to migrate your table.

And, especially,

we need to look at your ETL job. So for instance, if you use, like, sparkle, NiFi, whatever

by the way, NiFi dropped,

support for a Hive Metasflow in two dot o version or something like that. So you you need to take action,

in any any any case. But in in that situation, you will need probably to,

to make kind of a more migration from a data perspective as well. So you need just to go and then copy the data and rebuild your ETL in order to to write to Iceberg. But that will give you a lot of advantages. I mean, that is something what you need to invest at the beginning.

But through Iceberg, through, compaction,

vacuuming,

like optimizing the table, you will get way more,

advantages. And especially if I look at the parquet base or CSV or JSON based tables,

and that you need to metadata refreshes and reconciliation

over schemas and all that stuff. So it's all gone because you use just a table. Therefore, there is a a easy way with Iceberg.

There is a bit harder way if you don't use Iceberg. So it's it's very depending on on the situation of every use case. On that point of table maintenance, compaction,

vacuuming,

I'm wondering what your thoughts are on the role of the catalog

versus other tools. I'm thinking in particular about the new s three tables functionality

offered by Amazon and just where that,

maintenance is best managed.

Well, I can argue that,

there is a catalog with which can manage that in the future. So right now, none of the catalog can manage that. So

the

feedback feedback from customers and then, like, companies, what I have is that they need to manage that by themselves, like, every huge spark cluster because it depend depends, like, how you you make it, optimization of a table. So there are, like, five tasks for optimization of iceberg table, four of them just based on the metadata. So we're gonna build that pretty soon inside of the catalog, so you don't need to take care of that as well. The hard part is the compaction. So, you know, like, if you have, some kind of a streaming job which writes every five minutes, you will end up with a huge amount of a small files. And what you need to do is to go and,

let's say you you write thousand files a day, each is, like, one MB. It's better just to go and compact that to, like, four files, everyone 200,

64, 50 six MB. So you will end up with that, four files because it's not just about the performance,

but what everyone forgets.

For every

file, you will go to Amazon s three, and you will make a get a list. It will cost you money. And and it's very expensive at the end of the day if you don't if you don't do compaction. But, again, if you don't do compaction and, like, you have money, but the performance will suffer a lot if you will just go, like, from Trino, and you will just try to to read that table. So a lot of IO needs to to go down and read, like, every file. So it's why you need to make a compaction.

And the answer today is, like, okay. Kettle cannot do that. So you will probably go and use some external engines. So you can use Spark. You can use Trino to do the compaction.

What we

think that it would be super easy in that kind of a feedback, what what I got so far. And, guys, like, if if someone listens, give me feedback as well. That it would be cool if I can just go to into catalog,

to the table, and say, please make an optimization of that table, like, every hour or every day or once a week. And inside of the catalog, we can actually achieve that, and we're thinking to implement

the compaction based on the Rust with,

Apache Data Fusion. And it's way more efficient compared to the GBM based, way of a compaction,

which will again save you money and time on on your resources, on your cloud costs.

And, you you don't need to think about that because inside of the catalog, we track every every change, and we probably can end up even without enormous compaction for you. So we can just understand if something happening on that table. We can understand what's going on, what type of queries you run because catalog,

can can provide for scan API as well the way how how to access the data. So if we have all that, information,

which, again, is a metadata information, and then catalog can actually optimize your table

based on on your real requirement,

then the catalog is,

from from my perspective, is the best place to organize the optimization and compaction of the table.

And for people who migrate to Lakekeeper,

they're using the REST catalog,

they get the advantages of

the granular permissions,

some of the optimizations that you're working on implementing.

What are some of the ways

that moving from, for instance, Hive Metastore or AWS Glue changes

or

adds to the workflow that they have for working with their iceberg tables, some of the new capabilities that it unlocked? Well, the first thing is that if they use ResKettle today and they will move to Lakekeeper, there is no much effort. So there is no huge change or whatever. They don't need to to change actually a lot. They will just replace with URL and and maybe kind of a credential stuff, and everything should just run as as it was. But the question is like, okay, what what will I get to, if if I will move to Lakekeeper? And, you know, like, we built

a quite interesting feature inside of Lakekeeper with what we call change events. So which means every time you will run, let's say, Alta table or, like, whatever

type of a metadata change, we can push that change out of a Lakekeeper in in a messaging queue, in Kafka or in Nutz, and you can build all that reactivity

based on on the metadata. It's how we think that the metadata

data can become actionable. Because if that's somewhere in a queue, we can actually take the,

some kind of action on that. And the thing what we built is what we call an actionable data contract. And to explain you the idea behind that so imagine I'm the owner of the table, so I am a data producer. And let's call that table, a customer table. And Utobize, you you need that table. You need to build a report on that table because you will go to your management, and you you you will make a decision on that table because you need to make an investment decision. You probably wanna sell something and so on and so forth. So what we see in every enterprise that the data producer and data consumer, they usually don't know each other, and they don't know, like, that you rely on that table and that data every day. And what the data producers usually do, they have, like, the way that they need to change the tables. And,

and what I can do, I can just go and make drop filter, alter table, alter schema. So I can just do some changes,

which will break your report, so which means you have a business impact. And, you know, if if you look then you can you cannot make a decision. You cannot make a business decision, so you're not able actually to increase your revenue or you're not able to make cost reduction, whatever different targets on on your business side could be. And the idea is for that change event mechanism

to build

a contract

where you and I, it's not just a PDF or kind of a Wiki page contract, but you and I, we will just go and assign a contract where you will tell me that you use my table, like, every day and you need a stable schema. And that is an SLO. That's an objective. A table schema should remain

in in place.

And that is gonna be like a computational unit stored inside of the contract engine. And what Lakekeeper can do, if I will run the older table, Lakekeeper will communicate, first of all, with OpenFGA,

and OpenFGA will tell Lakekeeper, yeah, Victor is in honor of that table, so he can do that change. But on the second step, late keeper will communicate with the contract engine and ask, okay. Do we have some kind of a business constraint here in that table? And that contract engine will tell,

late keeper, yeah, Tobias

has an SLOs table schema. So you cannot change that table. So I will get a conflict. At the end of that operation,

you and I, we will be informed that there's a change probably in the future. And,

what what I can do, I can go and churn the contract. So you will probably get, like, a grace period, seven days, let's say. So you can go and change your report. But the main idea,

of all that

contracting thing is to build the unbreakable

consumption pipeline. So you can rely on report. You can make your business decision. We we need to to do everything

in order to to keep your business process just intact. That that that's not gonna be broken by someone.

In your work of building Lakekeeper,

working with the ecosystem

of

storage and compute and consumers and data ingest,

what are some of the most interesting or innovative or unexpected ways that you've seen the Lakekeeper capabilities applied? Well, first of all, what what we see is a contribution. So for instance, like, Kubernetes operator was contributed, and that was very surprising that people were

just, wanna wanna do stuff, and then they are very happy to contribute. But

the last thing,

which is about probably to happen pretty soon, is

that one of,

from from, from a community

offered to build DuckDB terabytes inside of a LakeKeeper. And I am quite interested in that one because, you know, like, what what you have nowadays, like, a very powerful laptop or notebooks on, every every desktop. Right? So you

could just imagine you have, like, sixteen, thirty two gig of RAM. And so if you look at,

the last, research of,

like, how large is is is your little query, so it's usually, like, I don't know, like, maybe a gig or something or not not even that. And that's quite interesting to see that inside of a Lake Cooper UI,

you will just open, like, SQL editor, which will use the Wasm and DactDB. And then you can actually go through REST catalog and and make your queries.

And

that is which will make your local computer, like, a query engine, so you don't need, like, a query engine at all. And that's

intriguing,

idea that you can you can manage a lot of your queries

on your local computer without having,

any big query engine in a in a back end somewhere.

Yeah. The ability

to expand the possible set of compute and execution environments is definitely

very interesting and one of the things that I like about the Iceberg ecosystem and the fact that it is growing and evolving to include so many different computation paradigms.

Absolutely.

And in your work of building Lakekeeper

and

releasing it as open source,

working on building a commercial entity around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?

The challenging thing was about,

Iceberg Rust and, like, Rust in general. So it's still

very popular and and very cool, but somehow,

I have a feeling that it's it's not that adopted as as Java or, like, JavaScript

or Visual Basic or

Excel.

And and that was kind of a a challenge for us, to to see that that ecosystem is still, under development.

So we needed to invest ourselves

a lot inside of ice per crust. But right now, if you look at ice per crust, you see how how ice per crust evolves and and getting,

even more popular and mature. So if I see there is a even the

kind of discussion,

why not to build Iceberg Rust and then develop

a a a client for, like, PyTorch or c plus plus or, like, all the different languages. So, so it might even become, like, a central piece,

which will,

then then serve all other different languages and then just you can you can create your client for specific language.

Yeah. And and that's something from a technical perspective. But maybe to give you like a business challenge, what what I see is well, first of all, it's hard to explain what type of catalog we develop.

But I already mentioned that. And and still, I I see that there is a lot of tools and and there is a lot of different

technologies out there. So I think if you look at the customer and let's maybe move from a catalog to the compute engines.

So if if I would be a customer today, so I I don't even know how to pick a compute because there are so many. Right? So starting prem from, like, Spark by Iceberg going to Trino, Presto. Then you have StarRocks. You have Rising Wave. You have Databend. You have Pollers. You have Data Fusion. And just Data Fusion has, I don't know, like, a hundred different compute engines,

implementation.

So, it's it's super hard for the customer. And I think that is kind of a main challenge in our time to explain or to give a reference architecture, like, okay, that's how you're gonna start. It's not gonna cost you a million at the beginning, so So it's probably not cost you at all from a licenses perspective. Everything what you need is an infrastructure.

So go and get some EC2 machines, and and then let's start and build that stuff. So and the second lessons learned, what what I give and talk to to customers,

there is no silver bullet, and there is no, like, oh, that's the solution. You can go, like, to Big One, Databricks, Snowflake. You will end up that, well, everyone,

like what we say in Germany, cook with water. And so they cook with water as well. And at the end of the day, you will probably end up where those solutions are not sufficient. They're

maybe too slow, maybe too costly.

And, what what I learned is that

data is a specific thing, and and we need to be more, well,

data aware

and use all the different tools and then just

experiment and not trying to plan for the next five years. That was like just this week, I had a conversation like, Oh, we have a plan for the next five years. And I'm like, Oh, that's gonna be hard because in next five years, we will have, like, all all the different type of technologies. So what what you're gonna plan today is already obsolete probably in a in a six months, and that's something what companies need to to adjust themselves. You know? Like, the companies needs to become more startup

like. Yeah.

And you've addressed this a little bit already, but what are the cases where Lakekeeper is the wrong choice?

Well, first of all, if you,

all integrated

in in in a solution, so I would probably not

recommend you to use Lake EPU or, like, any different catalogs outside outside of your ecosystem

if that is the case, but it's usually not the case. The not a good choice is if you have, like, a delta table. So we don't support Delta, at least for now. And, therefore, I wouldn't say, like, use, Lakekeeper. And Lakekeeper catalog is not a good solution if you want to build something like what open metadata,

provides. So So it's a different type of solution. So we we can we can collect a lot of metadata, but that's not a business type of the metadata, what what a business would expect usually. And you can actually run BladeKeeper, create a table, and there's a tab row. You can open that tab row, and you can see

what type of metadata I'm actually talking about. And and then that's a super different stuff compared to a business metadata. Yeah. As you continue to build and grow and evolve the capabilities of Lakekeeper, what are some of the things you have planned for the near to medium term or any particular projects or problem areas you're excited to explore?

So the first one is,

I already mentioned the HDFS supported. I I was asked a lot of of time to to support that one because still a lot a lot of customers on HDFS.

The second thing, what we are still considering, is how to help customers on Hive metastore

with, a non Iceberg table. And, there are different ways how we can achieve that, maybe to to provide some kind of a partial Hive metastore API so you can actually communicate through Lakekeeper to your CSV

stuff. So that's that's the thing in order to replace completely Hive metastore. But more interesting is that,

we've been asked about the central access governance,

for iceberg tables and AI. So it's why we started to build,

something around the volumes because what what you need to end up is some sort of a a similar mechanism what you probably know from a Delta share. And they're thinking about, like, to provide something like lake share or lake house share. And that will allow you actually to create a a just single catalog and manage the whole access, for both,

for data and for AI. And the last thing what I would like to mention is that what we would like to do is we would like to,

kind of connect that two silos, a data silo and AI silo for the metadata. So just to give an example, if your ML model

somehow,

depends on the Iceberg table, so nowadays, you actually have no idea if something is gonna change inside of Iceberg table. And I'm not just talking about the new data, but if,

structure of your data has changed. You have, like, new data type or whatever, so we need to rerun your model training, and and that's something what what we would like to build inside of Lakekeeper as well. So the siloed data and siloed AI needs to be connected, and we truly believe that metadata level is exactly the way how to achieve that. Are there any other aspects of the work that you're doing on Lakekeeper,

the overall ecosystem

of Iceberg and REST catalogs,

or data lake catalogs in general that we didn't discuss yet that you'd like to cover before we close out the show? The last thing what I would like, to to say here is as I opened about the open source, guys,

just really, really would love to motivate you to become a contributor. Think about all that great things, what what we've got in the last, I don't know, like, couple twenty years, around the open source. That open source is a real innovator.

So everyone who, somehow connected to open source, and I'm just talking to everyone, even a developer,

if you have,

some some contributor level, PMC level, it will boost your career. It will help the community. It will help to everyone. So it's a win win situation.

So please,

go to a project.

Lakekeeper is one of them, but there are different projects as well, like Apache Icebrock. Go, test them out, write a new documentation,

read the documentation. You will definitely find a a spelling bug. Open the jar. Star will get Treppo. It's it's what I would like just to mention. Yeah. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing on Lakekeeper. I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the technology or tooling that's available for data management today. Yeah. You know, like, that's quite interesting because, I I do have kind of, a lot of discussions with business people. And,

you know, like, if you go to inside of a company and you will see usually, like, hundred people, so 10 people understand,

all the data management thing, but 90 they don't understand.

And what what I see as a gap is how we can actually explain them or maybe make a connection to the business process to to the data management. And I think there is a solution on the technical side, which we don't have at the moment. And, I hope that we're gonna develop something that, like, everyone in the organization understands how data can drive revenue or how data can help their organization

to be more successful.

Well, thank you very much for taking the time today to join me and share the work that you and your team are doing on Lakekeeper. It's a very interesting project. It's great to see the innovation and investment in that space.

I definitely look forward to seeing how it progresses and watching it grow. So thank you for taking the time and energy to help move that ecosystem forward and expand on its capabilities, and I hope you enjoy the rest of your day. Thank you, Luis. Thank you for everyone who listens. Bye.

Thank you for listening, and don't forget to check out our other shows. Podcast.net

covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com

with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and

coworkers.