Summary
Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning semantics for your data lakehouse that you are used to from Git.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today!
- Your host is Tobias Macey and today I'm interviewing Alex Merced, developer advocate at Dremio and co-author of the upcoming O'Reilly book "Apache Iceberg: The Definitive Guide", about Nessie, a Git-like versioned catalog for data lakes using Apache Iceberg
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Nessie is and the story behind it?
- What are the core problems/complexities that Nessie is designed to solve?
- The closest analogue to Nessie that I've seen in the ecosystem is LakeFS. What are the features that would lead someone to choose one or the other for a given use case?
- Why would someone choose Nessie over native table-level branching in the Apache Iceberg spec?
- How do the versioning capabilities compare to/augment the data versioning in Iceberg?
- What are some of the sources of, and challenges in resolving, merge conflicts between table branches?
- Can you describe the architecture of Nessie?
- How have the design and goals of the project changed since it was first created?
- What is involved in integrating Nessie into a given data stack?
- For cases where a given query/compute engine doesn't natively support Nessie, what are the options for using it effectively?
- How does the inclusion of Nessie in a data lake influence the overall workflow of developing/deploying/evolving processing flows?
- What are the most interesting, innovative, or unexpected ways that you have seen Nessie used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with Nessie?
- When is Nessie the wrong choice?
- What have you heard is planned for the future of Nessie?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- Project Nessie
- Article: What is Nessie, Catalog Versioning and Git-for-Data?
- Article: What is Lakehouse Management?: Git-for-Data, Automated Apache Iceberg Table Maintenance and more
- Free Early Release Copy of "Apache Iceberg: The Definitive Guide"
- Iceberg
- Arrow
- Data Lakehouse
- LakeFS
- AWS Glue
- Tabular
- Trino
- Presto
- Dremio
- RocksDB
- Delta Lake
- Hive Metastore
- PyIceberg
- Optimistic Concurrency Control
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Data Council: ![Data Council Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/3WD2in1j.png) Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit [dataengineeringpodcast.com/data-council](https://www.dataengineeringpodcast.com/data-council) and use code **dataengpod20** to register today! Promo Code: dataengpod20
- Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png) Data teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. Dagster is an open-source orchestration solution that helps data teams rein in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to [dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast) today to get your first 30 days free!
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data.
Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open source, cloud native orchestrator for the whole development lifecycle with integrated lineage and observability, a declarative programming model, and best in class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise class hosted solution that offers serverless and hybrid deployments, enhanced security, and on demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started, and your first 30 days are free.
[00:01:40] Unknown:
Your host is Tobias Macey. And today I'm interviewing Alex Merced about Nessie, a Git-like versioned catalog for data lakes using Apache Iceberg. So, Alex, can you start by introducing yourself? Hey, everybody. My name is Alex Merced. I'm a developer advocate for Dremio. So for the last several years, I've been talking about data lakehouses, about open source technologies like Apache Iceberg, Apache Arrow, Nessie, which will be something I'm gonna love to talk about today, but all about the lakehouse, even so much as being one of the coauthors of Apache Iceberg: The Definitive Guide, an upcoming book from O'Reilly. And do you remember how you first got started working in data? It's a fun story. I have a very long, not traditional way I kinda got here. So the long and short of it is basically, and, you know, I definitely have a longer version of the story in other places, but basically, I did start off as a computer science major, but then I got really into music and kinda went into this completely different world of, like, culture and marketing, which somehow led me into a career training people in finance. And I ended up training people in finance for 10 years. So I spent a lot of time breaking down really complex ideas and helping people kinda understand them in a more accessible way. But I then eventually ended up back in software and came back as a software developer and did that for a few years and also trained software developers.
[00:02:52] Unknown:
But I was always a big fan of,
[00:02:55] Unknown:
working with databases. So, like, some of my favorite projects were finding ways to optimize the database, finding ways to offload workloads and business logic from the wrong places, when people put maybe too much of that stuff in the client side of their websites. So I started kinda gravitating more and more toward the data space. And then I also started gravitating more towards, like, DevRel and advocacy, because I was always naturally someone who likes to teach. I like to create content. I like to break down ideas. So I decided to make the shift from software development into the DevRel and advocacy world. And I ended up finding a home in Dremio, where I get to spend a lot of time learning about this really cool, exciting thing called the data lakehouse. And that definitely makes me wake up really excited every day. And now I get to help people understand that and bring that understanding of not only what it is, but how to implement it, the technologies around it, and so forth.
[00:03:43] Unknown:
And for the conversation today, we're focused on the Nessie project, and I'm wondering if you can describe a bit about what it is, some of the story behind it, and where it fits in that context of the data lakehouse.
[00:03:55] Unknown:
Got it. Okay. So bottom line is the Nessie project at its core is a catalog. So when it comes to the Apache Iceberg table format, there is a need for a mechanism to act as a catalog. So it tracks all the different tables, and primarily what it does is track a reference to what is the most current metadata.json file for that particular table. So at the core, that's what Nessie does. But Nessie provides the additional ability to actually create commits not at the individual table level, but at the catalog level. So every time one of those catalog references changes, it basically treats it like a commit, which means it allows you to have the same sort of semantics as Git as far as being able to do branching and tagging.
And this kind of changes the dynamics of how you interact with your catalog and how you plan sort of DataOps-type practices, where you wanna isolate developer environments or roll back when it comes to disaster recovery. It changes a lot of things, oftentimes makes them easier, and creates sort of new patterns when it comes to the data lakehouse.
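To make those Git-like semantics concrete, here is a minimal sketch of the kind of SQL that Nessie's Spark SQL extensions expose. The catalog name (`nessie`), namespace, table, branch, and tag names are placeholders, and the exact grammar can vary between Nessie releases, so treat this as illustrative rather than canonical.

```python
# Minimal sketch of Nessie's Git-like SQL primitives via the Nessie Spark SQL
# extensions. Catalog, namespace, table, branch, and tag names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Iceberg + Nessie extensions are configured

# Branch the entire catalog (a metadata-only operation, no files are copied).
spark.sql("CREATE BRANCH IF NOT EXISTS etl_dev IN nessie FROM main")

# Point subsequent statements at that branch and write to a table on it.
spark.sql("USE REFERENCE etl_dev IN nessie")
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.sales")
spark.sql("CREATE TABLE IF NOT EXISTS nessie.sales.orders (id BIGINT, note STRING) USING iceberg")
spark.sql("INSERT INTO nessie.sales.orders VALUES (1, 'first order')")

# Tag the current state of the catalog, e.g. to mark a known-good point.
spark.sql("CREATE TAG end_of_day IN nessie FROM etl_dev")

# List every branch and tag the catalog knows about.
spark.sql("LIST REFERENCES IN nessie").show()
```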
[00:04:55] Unknown:
You mentioned the ability to do branching and committing and merging and tagging. And I'm wondering, in the context of data lakehouses and the overall data pipelining and workflows, what are some of the core problems and complexities that Nessie is designed to solve for? I mean, bottom line, like, a couple different situations where Nessie becomes really useful are,
[00:05:18] Unknown:
the probably lowest hanging fruit is, like, data rollback. So basically, you have maybe a pipeline that fails, and now you have bad data or partial or inconsistent data in, let's say, a handful or a dozen tables. Now you technically can roll back those tables directly from the table format in Apache Iceberg, but you have to do each table one by one. By having a catalog-level abstraction, I can just roll back the catalog to, like, the last clean commit, and I can do that all in one fell swoop and move the whole catalog back to before that ingestion job. But also, what happens a lot of times is that people would create duplicates of their data, like for a developer environment. And then they would do all their work there and then merge those environments, but it was harder to create these environments and more costly because of the storage.
But with versioning like Nessie, I can basically create that isolated branch environment without creating a single duplicate of my existing data. It would just basically isolate the new snapshots going forward, so the only new data is really the data of those new transactions.
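As a rough illustration of the rollback and isolation benefit described here, the sketch below runs a multi-table load on its own branch and either merges it or simply abandons the branch on failure, so production is never left with partial data. The catalog name, table names, and staging sources are hypothetical, and the SQL assumes Nessie's Spark extensions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes a Nessie catalog named "nessie" is configured

# Branch the whole catalog for this load; no data files are duplicated.
spark.sql("CREATE BRANCH IF NOT EXISTS ingest_job_42 IN nessie FROM main")
spark.sql("USE REFERENCE ingest_job_42 IN nessie")

try:
    # Hypothetical loads touching several tables.
    spark.sql("INSERT INTO nessie.sales.orders SELECT * FROM staging_orders")
    spark.sql("INSERT INTO nessie.sales.customers SELECT * FROM staging_customers")
    # Success: publish every table's changes back to main in one step.
    spark.sql("MERGE BRANCH ingest_job_42 INTO main IN nessie")
except Exception:
    # Failure: main was never touched, so the "rollback" is just dropping the
    # branch instead of rewinding a dozen tables one by one.
    spark.sql("DROP BRANCH ingest_job_42 IN nessie")
    raise
```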
[00:06:16] Unknown:
In terms of my experience of surveying the overall data ecosystem, in particular the data lake and data lakehouse environments, the closest thing that I've seen to Nessie as far as this branching and merging semantics is the ability to do that kind of zero-copy cloning. I guess there are two pieces to that. One is the zero-copy cloning and being able to do very low cost developer environments with copy-on-write semantics, which is with Snowflake. I know that they have the ability to do that kind of snapshot tables, create a copy of a table using the same existing underlying data. But from the lake perspective, the closest project I've seen is LakeFS, which has that same idea of Git semantics, but at the S3 abstraction layer. And I'm wondering if you can talk to some of the overlap and some of the divergence between Nessie and LakeFS and when you might decide to use one versus the other.
[00:07:11] Unknown:
Oh, yes. I actually find the differences quite interesting. And the funny thing is, like, I think they were both sort of coming into existence around the same time. I recently saw a talk where they talked about sort of the evolution of LakeFS, and I've seen a talk about the evolution of Nessie. And those initial questions were the same. Both of them started, like, basically asking the question: can we just use Git? And realizing, okay, like, the type of throughput, the amount of change that happens in data, Git's not really built for that. So, basically, you have to kinda find some other abstraction.
So LakeFS went the approach where you basically capture sort of deltas in the actual files. So you say, okay, add this file, subtract this file, while Nessie goes the approach of just capturing sort of that metadata change. So a couple ways to kind of think about it: imagine I updated an Iceberg table with an insert. That might create a thousand new files. So in the case of the LakeFS commit, it's not aware of the table. It's not aware that a table exists. It just sees that there's a thousand new files in my file system and then captures a commit that says, okay, hey, these files have been added. Well, in Nessie, the only thing that changes is one thing: there's a new metadata.json file. So instead of tracking a thousand new things and saying a thousand new things were added, it's just, hey, this table's snapshot has changed from pointing here to there.
So it's a much more lightweight change that can handle very high velocity throughput, like if you're making a lot of changes at a time, because you're not tracking as many different items. But a couple of other differences are that Nessie is sort of more table aware, because it is at the catalog level, which allows you to move all of those Git-like semantics into SQL. So I can create a branch using SQL. I can merge a branch using SQL. I can create a tag. While with LakeFS, it's mostly done through the file path. So, basically, what it does is take advantage of object storage and say, okay, hey, this is gonna be the dynamic part of the file path that represents which branch you're on. And then oftentimes, when you create these branches, all of the work has to be done with the CLI. So while probably, like, for a lot less technical users, SQL is gonna be a much more accessible approach to doing a lot of these things, and the CLI tool might be maybe a little less accessible. So there are also some ergonomic differences, I would say.
[00:09:22] Unknown:
Zeroing in on that catalog element, we've mentioned a few times that Nessie is a catalog, and it corresponds to various pointers into the Iceberg table format. And I'm wondering if we can dig a bit more into the context of what purpose the catalog serves in that data lakehouse environment, and what are some of the alternatives, or what are some of the pieces that Nessie might replace if somebody already has an existing lakehouse environment?
[00:09:49] Unknown:
So a couple things first. Right now, Nessie primarily works with Iceberg. The cool thing about Nessie's architecture is that it just tracks sort of these, like, little metadata objects. So, basically, it's really just an object. It has, like, a data type. And right now, the main data types you see are Iceberg tables and Iceberg views. Theoretically, other table formats could come into the picture pretty easily. But basically, it tracks that metadata. Now, the thing is that the way the Iceberg spec works is that generally that catalog reference is sort of like your source of truth when it comes to the current state of the table. So the problem is you generally don't want your Iceberg references in more than one catalog. So this is where, basically, hey, if I choose Nessie as my catalog, then that precludes me from using another catalog like an AWS Glue or a Tabular or something like that. So oftentimes, when you are adopting an Apache Iceberg lakehouse, you do have to take a look at sort of, like, what are the tools you're using and what are the different features of the different catalogs. Most of them are generally gonna provide you the main service of, basically, hey, I can identify my tables: I take this catalog to Spark, Spark sees all my tables. I take this catalog to Flink, it sees all my tables. I take it to Dremio, it sees all my tables. But, you know, not every catalog works with every tool currently. I think that story's gotten a lot better, so most catalogs are workable in most places nowadays, but that is essentially one of the big cost-benefit calculations you have to make when selecting a catalog.
And when it comes particularly to, like, Nessie, it works with pretty much all the typical open source tools, so it works with Trino, it works with Presto, it works with Dremio, it works with Apache Spark, it works with Apache Flink. So you get that branching and merging across all these tools. So if your workflows incorporate these tools, you can then add that branching and merging and tagging to them.
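For reference, pointing one of those engines at Nessie is mostly a catalog-configuration exercise. Here is a minimal PySpark sketch; the jar coordinates, versions, Nessie URI, and warehouse path are placeholders for whatever your environment actually uses.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Illustrative artifact coordinates and versions; pick the ones matching your Spark and Iceberg.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.3,"
            "org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.76.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    # Register an Iceberg catalog named "nessie" backed by the Nessie service.
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1")    # placeholder endpoint
    .config("spark.sql.catalog.nessie.ref", "main")                             # default branch
    .config("spark.sql.catalog.nessie.warehouse", "s3a://my-bucket/warehouse")  # engine also needs storage access
    .getOrCreate()
)

# Once the catalog is registered, the engine sees whatever Nessie tracks.
spark.sql("SHOW NAMESPACES IN nessie").show()
```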
[00:11:30] Unknown:
And now digging into the versioning capabilities specifically, you mentioned that at a high level, what Nessie does is it keeps a reference to all of the table metadata pointers so that within each set of transactions, or each commit, you can say, I am pointing at this set of metadata for all of these tables. And so you can have commit and rollback functionality across tables, across transaction levels. And in terms of the actual versioning of the data, I know that Iceberg has built-in support for being able to do optimistic concurrency control and being able to keep snapshots to different points in time of data based on the underlying files and the changes there. I also know that it requires a certain amount of maintenance to keep the tables kind of happy and performant, as far as doing things like vacuuming and pruning old references and old versions there. I'm curious if you can talk to some of the ways that Nessie handles the interoperability with the versioning in Iceberg, as well as any of the maintenance pieces that it can help with as far as, like, pruning old versions, running table maintenance, etcetera.
[00:12:44] Unknown:
Yes. Okay. So, basically, the architecture of Nessie is that, mainly, it's basically a running service that you would run. You can also get it as part of Dremio; it's actually integrated into Dremio as its integrated catalog. But essentially, it interacts through a REST API. And when it comes to the versioning aspects, right now, if I were to capture a commit, basically, it creates a sort of, like, JSON-like entry in the backing store. So it could be, like, RocksDB, a Postgres, whatever you choose as your backing store. That'll say, basically, I have a timestamp for that commit and, sort of, the parent commit to that, so that way it knows what the tree looks like, and then just a couple of the metadata pieces. So right now, it's a very small metadata footprint. Generally, the best practice is oftentimes, like, one branch at a time, and I'll give you a couple of examples of people who are actually doing that in production in that way. But when it comes to the maintenance side, this is where it gets a little bit tricky, because typically, when it comes to Iceberg, when you're doing, like, expire snapshots or something like that, the assumption is essentially that that table's metadata is aware of all of its own snapshots. With Nessie, you might have different branches where there are different versions of the metadata.json that have references to different snapshots, so when I expire a snapshot, how do I know which files I can safely delete? So what Nessie did is they created their own tool called the GC cleaner, which does that kind of garbage cleanup. So it'll actually take a look at the metadata.json at the head of each branch and be able to kinda safely identify, hey, which files are able to be deleted.
So when you run the GC cleaner independently, or if you're using Dremio and you use the vacuum command, it'll use that tool to then safely make sure it deletes the right data files without affecting other branches.
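Everything above is exposed over that REST API, so you can peek at the commit tree directly. The sketch below uses plain `requests`; the endpoint paths and response fields follow the Nessie v1 REST/OpenAPI spec as I understand it, and the URL and branch name are placeholders, so verify against the published spec before relying on it.

```python
import requests

NESSIE = "http://localhost:19120/api/v1"  # placeholder Nessie endpoint

# List the references (branches and tags) the catalog tracks.
refs = requests.get(f"{NESSIE}/trees").json()
for ref in refs.get("references", []):
    print(ref.get("type"), ref.get("name"), ref.get("hash"))

# Walk the commit log of a branch: each entry carries a hash, a parent, a
# timestamp, and a small commit-metadata payload, matching the description above.
log = requests.get(f"{NESSIE}/trees/tree/main/log").json()
for entry in log.get("logEntries", []):
    meta = entry.get("commitMeta", {})
    print(meta.get("hash"), meta.get("commitTime"), meta.get("message"))
```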
[00:14:31] Unknown:
Now, as far as the versioning pieces, anybody who's used Git for any length of time has dealt with the dreaded merge conflict. And when you're dealing with numerous tables, potentially dozens or hundreds, the last thing that you wanna think about is how do I deal with a merge conflict if I'm creating a branch and then I need to merge it back after somebody else has created their own branch and merged it ahead of mine. And I'm curious if you can talk to some of the ways that those versioning changes, branching and merging, are kind of sanitized so that we don't have to deal with these big, complex, messy merges in the event that underlying data has changed in a manner that is incompatible across branches.
[00:15:10] Unknown:
Yeah. I mean, right now, it is pretty shallow. So it's just tracking basically that metadata reference and essentially a timestamp and a parent. So right now, you can get a merge conflict pretty easily if you're, like, changing several branches at the same time. Typically, the pattern we've been seeing is that people will start a branch at the beginning of the day. So what they'll do is they'll create a branch for that day, and then they'll do all their ingestion for that day on that branch, run some validating logic at the end of the day, and then basically merge that branch at the end. So instead of creating, like, lots of branches, at least for ingestion purposes, usually you wanna stick to sort of, like, one branch per catalog at a time, or you create a new branch for each use case. So, basically, create a branch for today, validate at the end of the day, and then, at the end of the day, you're always merging that validated data back into production.
And then, if you wanna do more branches, usually the other use cases would be like, hey, I'm just creating a branch for experimentation purposes, or I'm creating a branch to isolate some particular changes that I don't plan on merging in, but I wanna kinda keep separated. But generally, as far as merging in, right now you probably would prefer to keep it sort of: make a branch, merge it in. Part of what's evolving in the project is adding more metadata to what the catalog tracks, so that way, later on, you can have more sophisticated sort of merge resolution. But right now, best practice would be, like, have a branch that is your ingestion branch and keep it that way, then merge it, and then create another branch for the next ingestion job after that ingestion job is complete.
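A rough sketch of that branch-per-day ingestion pattern, driven from Python with Spark SQL: create the day's branch, ingest onto it, run a validation check, and only merge into main when it passes. The table, the validation rule, and the catalog name are made up for illustration.

```python
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes a Nessie catalog named "nessie"

branch = "ingest_" + date.today().strftime("%Y_%m_%d")

spark.sql(f"CREATE BRANCH IF NOT EXISTS {branch} IN nessie FROM main")
spark.sql(f"USE REFERENCE {branch} IN nessie")

# ... run the day's ingestion jobs against this branch here ...

# End-of-day validation: e.g. no orders missing a customer key (made-up rule).
bad = spark.sql(
    "SELECT count(*) AS c FROM nessie.sales.orders WHERE customer_id IS NULL"
).first()["c"]

if bad == 0:
    # Publish the whole day's work to production in one step.
    spark.sql(f"MERGE BRANCH {branch} INTO main IN nessie")
else:
    # Leave main untouched; the questionable data stays isolated on the branch.
    print(f"Validation failed ({bad} bad rows); not merging {branch}")
```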
[00:16:33] Unknown:
Digging more into Nessie specifically, you mentioned a little bit about some of the specifics of running it. And I'm wondering if you can talk to the overall architecture and design of the Nessie project and some of the ways that it has changed and evolved in scope and purpose from when it was first started. Yeah. I mean, I think when it first started, it was,
[00:16:52] Unknown:
and I think it still wants to be, in this regard, sort of a lakehouse catalog. So while it mainly works with Apache Iceberg, it has the architecture so that it can expand, in a sense, because, basically, what happens is that there are all these different types of things that it can track, and it's essentially just deciding on an agreed schema for the built-in types. So right now, the types are, like, namespaces (so if you're creating, like, a subfolder or a database, however you wanna think of these namespaces), there's Iceberg views, there's Iceberg tables. There are also Delta Lake tables that are actually part of the spec right now. And I think there was a pull request made to the Delta Lake repository to kinda have that functionality, but that pull request never got merged in. So that is to be seen in the future, to see if we can eventually get that change made. But, I mean, you know, for a format like Delta Lake or Hudi, most of the time the table is just a particular directory. So it could be as easy as just having a schema that's basically a Hudi table or Delta Lake table type that just points to a directory, and then it could catalog those as well.
It doesn't now, but it wouldn't be hard to do, because, again, it's very flexible. It's just capturing: this is the type of metadata that this little object tracks, and then making sure that you have a metadata object attached to that that matches the schema for that type. So Iceberg has a particular set of information that you would keep with it, but the way you interact with the catalog is through a REST API. So you could always custom make these API calls, but there is a client in Java and in Python to directly interact with Nessie, on top of the integrations that already exist with a bunch of tools.
But basically, there is a standard OpenAPI spec in the Nessie documentation that holds the endpoints. I definitely spent a few days exploring that quite in depth, because I made, like, an unofficial client just to kinda get more acquainted with it. And that was a fun adventure. But,
[00:18:52] Unknown:
it's a pretty straightforward API. Are you sick and tired of salesy data conferences? You know, the ones run by large tech companies and cloud vendors? Well, so am I. And that's why I started Data Council, the best vendor neutral, no BS data conference around. I'm Pete Soderling, and I'd like to personally invite you to Austin this March 26th to 28th, where I'll play host to hundreds of attendees, 100-plus top speakers, and dozens of hot startups on the cutting edge of data science, engineering, and AI. The community that attends Data Council are some of the smartest founders, data scientists, lead engineers, CTOs, heads of data, investors, and community organizers who are all working together to build the future of data and AI.
And as a listener to the Data Engineering Podcast, you can join us. Get a special discount off tickets by using the promo code depod20. That's depod20. I guarantee that you'll be inspired by the folks at the event, and I can't wait to see you there.
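Picking back up on the catalog API: this is roughly what listing a branch's contents looks like against that OpenAPI-documented REST interface. Each entry comes back as a typed object (namespace, Iceberg table, Iceberg view, and so on), which is the content-type model described above. The endpoint path and field names are my reading of the v1 spec, and the URL is a placeholder, so double-check them against the published OpenAPI document.

```python
import requests

NESSIE = "http://localhost:19120/api/v1"  # placeholder Nessie endpoint

# List what the "main" branch tracks: each entry is a typed content object.
resp = requests.get(f"{NESSIE}/trees/tree/main/entries").json()
for entry in resp.get("entries", []):
    key = ".".join(entry.get("name", {}).get("elements", []))
    print(entry.get("type"), key)  # e.g. NAMESPACE sales, ICEBERG_TABLE sales.orders
```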
[00:19:55] Unknown:
Another interesting aspect of this project, going back to its nature as a catalog, is that the overall space of data catalogs for data lake environments has largely been a pretty static target: you have the Hive catalog, or you have the Hive catalog maybe in the form of AWS Glue, which is actually still just the Hive catalog. And I'm curious, in the work of building and evolving Nessie, using it as an alternative catalog to that Hive ecosystem, some of the ways that you have been constrained from innovating a lot in terms of what the catalog can offer and how to operate with it, and some of the ways that you're able to try to move the entire ecosystem along a bit to understanding some of the new ways that the catalog can and should be thought of in this data lakehouse ecosystem, and maybe some of the arbitrary limitations
[00:20:48] Unknown:
that the Hive catalog API has imposed upon us until now. Yeah. I mean, I think a lot of the solutions to that particular problem were more on the, like, table format side. So, essentially, Iceberg really kinda broke away from the constraints of Hive, where you have to kinda have folders and subfolders that define your table. And then Nessie is able to leverage that by being able to just refer to that table metadata and just focus on capturing the versions of that. So, basically, it almost takes a whole different paradigm of what the catalog does: instead of being the bearer of the metadata, it's sort of the gatekeeper of where the metadata is. So basically, where with Hive you have the Hive metastore that kinda acts as both your catalog and metastore,
Nessie basically acts as the catalog, and then Iceberg will be sort of really where the metadata is stored, on your S3, in those manifests and manifest lists. And in that case, you can much more easily incorporate future formats and new paradigms into the catalog. So I don't think it's initially been constrained. It's just a matter of, like, people choosing to adopt Nessie. That's become a lot easier in recent times, particularly because now that it's integrated into Dremio, a lot of people are just using it, because once you have a Dremio lakehouse, it just kinda is there. So it's there. So why not use it? And then you don't have to stand up the service. You don't have to maintain it. So it makes the whole process a lot easier. But there are still also a lot of people who just deploy Nessie on their own and are just using it that way, because they prefer to have that service that they manage on their own. They wanna use a different backing store. They just wanna have control over it.
So we have seen a lot of adoption on that side too. Again, this last year has definitely been a big year for us seeing growing adoption for Nessie.
[00:22:31] Unknown:
For that integration process, or running it yourself, what are some of the steps involved in actually getting it deployed, getting it integrated into a data stack, and maybe some of the complexities that people should be planning for,
[00:22:45] Unknown:
especially if they have an existing catalog that they wanna migrate away from? I guess the first step, as far as deployment goes, I mean, if you just wanna try it out, there's a Docker container that's pretty straightforward to use. If you wanna deploy it for production, there is a Helm chart; you can deploy it pretty easily using the Kubernetes Helm chart. And then very soon, there'll be an iteration of the Dremio Helm chart that also should incorporate a lot of those details, so that way you can simultaneously deploy them easily. But once you actually have it deployed, as far as migration goes, it just depends on sort of what your use case is. So, basically, the assumption would be, hey, you're probably using Apache Iceberg or going to Apache Iceberg. So if you're already using Apache Iceberg before you adopt Nessie, the question then becomes, what is your prior existing catalog? So regardless of which catalog it is, as part of the Nessie project, they came out with a CLI tool for catalog migration, which is not just for Nessie, but for any Iceberg catalog. So you would literally just put in the credentials for the source catalog, and then you put in the credentials for the destination catalog, and what it does is it'll move all the references over. So then that catalog will have all the metadata references, basically, in one fell swoop. The only challenge there always becomes, well, not really a challenge there, like, that should work fine. It's always, like, issues around the catalog, because the actual query engine has to have access to the catalog and then separately have access to the storage where the actual metadata is stored.
So where an accident can happen is that, you know, you decide you are using an engine that doesn't read, you know, your files are in Hadoop, and you just do a blanket migration of catalogs, but now you're using a tool that can't read Hadoop file storage. So now you still can't read those tables even though you can read the catalog. So you definitely have to keep in mind that you always have to think about, hey, does the tool have access to the catalog and the storage? But as long as you keep those two in check, usually you shouldn't really run into any problems, because essentially the query engine's path is: check with the catalog, then check the storage. As long as it can do both, you're gonna be able to read those tables just fine, assuming it adopts the Iceberg spec. Now, if you're not using Iceberg, then you're probably not using Nessie, so it's less of a consideration there.
[00:24:46] Unknown:
In terms of Iceberg itself, that also provides a moving target, because it's a very active project. A lot of different engines are adopting it. It has been growing in terms of its overall capabilities and usage, and I'm curious how that has influenced the direction and development of Nessie and some of the ways that Nessie has been able to capitalize on the newer features in Iceberg.
[00:25:10] Unknown:
Basically, Nessie operates as a way to discover the table. So in that case, it's independent of what's in the metadata. All it cares about is the location. Right now, it only cares about the location of that metadata.json, not what's inside the metadata.json or what's inside the other metadata files. So as we start adding things like delete files, Puffin files, whatnot to the Iceberg specification, and in the future other files (I think there are also other things that are sort of in discussion right now), all of that would not affect the way Nessie operates, since, basically, it's only versioning the references and not versioning the actual metadata itself right now. Again, in the future, it'll probably start holding more of the metadata so that way we can do those more sophisticated merges and be more context aware of the tables.
But the kind of data that it's probably going to need to track to do that is probably not the kind of stuff that's changing right now. Because, I mean, we're talking about, like, okay, what are the files that got added? What are the files that were subtracted? Basically, it doesn't necessarily have to track every single thing that the Iceberg metadata does, just what it needs to be aware of to not trip up when it's merging.
[00:26:09] Unknown:
And then another responsibility that can often get pushed into the catalog layer is the question of access control or permissioning, and I'm curious how Nessie handles that aspect of the problem space. Yeah. There are two ways you can handle that right now. Essentially,
[00:26:26] Unknown:
you can have different users that are accessing the Nessie catalog, and then essentially the access controls are applied to, like, the user. So, basically, if I access the catalog with a particular token, it'll be aware of, hey, this person using this particular access token can only access these branches, these objects, these kinds of things. So you can do that manually with Nessie, and there are ways of configuring a lot of that, but that still probably requires, like, a lot of manual configuration.
When you're using Nessie as it's integrated into Dremio, then it falls into Dremio's more point-and-click type of authorization, where you can basically have role-based access controls, row-based access controls, and column-based access controls at the query engine layer. So, basically, it'll leverage some of, like, Nessie's branch-level controls and then also leverage Dremio's query engine-level controls when you give different users tokens from different tools.
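In practice, the token-per-user approach usually comes down to a couple of extra catalog properties on the client side, with Nessie scoping what that principal may touch. The property names below follow the Nessie client documentation as I understand it, and the token value is obviously a placeholder.

```python
from pyspark.sql import SparkSession

# Assumed catalog properties for per-user bearer-token authentication against
# Nessie, layered on top of the catalog config shown earlier.
per_user_auth = {
    "spark.sql.catalog.nessie.authentication.type": "BEARER",
    "spark.sql.catalog.nessie.authentication.token": "<this-users-token>",  # placeholder
}

builder = SparkSession.builder
for key, value in per_user_auth.items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()
```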
[00:27:23] Unknown:
Once Nessie is part of a given data stack, the versioning and branching and merging capabilities are part of the core primitives of the system. How have you seen that influence the overall workflow and design approach that teams take as far as the development, deployment, and evolution of their data processing and data delivery flows?
[00:27:47] Unknown:
Basically, it was pretty simple, because oftentimes, again, the pattern I see most was the one I mentioned earlier where people just do a daily branch. So, basically, all you do is you just tweak all your jobs to hit that particular branch. And since you have, like, a naming scheme, basically it'll be, like, you know, branch plus whatever you wanna call the branch and then maybe a timestamp or a date. Basically, it's pretty easy to programmatically set up your pipelines to always make sure that they're targeting the right branch name. So then you can just kinda run the branch; it'll always hit that day's branch, and then basically everything becomes very turnkey. But, like, midday, you're not gonna see something like your production data getting tainted, because it goes through that daily process. And then systems that need to access that data, like, real time, they have access to that branch. So they'll query that branch directly, without the same sort of guarantees you would get with the production branch, and that'd be sort of clearly communicated.
But yes, usually, yeah. Because once you have it, basically, the actual creation and merging of branches is straightforward enough to do in SQL, and then automating that SQL with whatever, whether it's Spark, Flink, or Dremio, is pretty easy. So once they have it, it's basically just deciding what the frequency of their branching patterns is: do they wanna do hourly branches, daily branches, weekly branches, and what their merge cadence is gonna be. But once they kinda figure that out, once it's implemented, you don't really think about it anymore. It just kinda works.
[00:29:06] Unknown:
Another common target for operating with this data is something like dbt, and you mentioned the zero-copy clones, effectively, of being able to create per-user branches. I'm curious how you've seen folks incorporate Nessie's versioning and branching capabilities into the development workflow of dbt users and data analysts.
[00:29:28] Unknown:
I've seen it with dbt users because Dremio does work with, well, again, any tool that works with dbt. But basically, in the SQL, you can specify the branch in your query. So I've seen it personally; I've seen it with Dremio. And then, basically, you can just add the branch at the end of each of your queries, and then you get all the benefits of dbt and all the orchestration and using Git version control on your dbt models. But then you also get this other layer of versioning at the catalog level, so you get to leverage both and get the benefits of both.
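In Dremio-flavored SQL, that branch pinning is just a clause on the table reference, so the same pattern drops into a dbt model unchanged. A small sketch follows; `AT BRANCH` is Dremio's syntax as I understand it, and the table, branch, and column names are made up.

```python
# Hypothetical branch-pinned query text, e.g. the body of a dbt model compiled
# against Dremio. "AT BRANCH" pins the table read to a Nessie branch.
model_sql = """
SELECT customer_id,
       SUM(amount) AS total_spend
FROM nessie.sales.orders AT BRANCH etl_dev
GROUP BY customer_id
"""

# In a dbt project this SELECT would live in a model file (models/total_spend.sql)
# and be materialized by the dbt-Dremio adapter; printing it here just shows the shape.
print(model_sql)
```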
[00:29:59] Unknown:
In your experience of working with Nessie, exploring its ecosystem, diving deep into the Iceberg table format, and the ways that the two interoperate, what are some of the most interesting or innovative or unexpected ways that you've seen the Nessie project applied?
[00:30:14] Unknown:
There was one. I'm trying to remember what the exact details were, but I've seen some interesting applications of just creating a branch to kind of create wildly different versions of the data. Like, actually, you know, one example, still using that date pattern I mentioned before: what they'll also do is create experimental branches. These are generally, like, large financial institutions who we've seen this pattern with. And what they'll do is they'll create a branch that they use for doing, like, stress testing type stuff. Because what they can do is create a safe copy of their production data for that day, to then make the changes to that data that they don't wanna permanently make, to then run all their stress testing calculations on. And then they can just throw away the branch at the end of the day without having to really worry about rolling it back or undoing the data. So they create a branch at the beginning of the day that's for, like, stress testing.
They add in the, hey, bad scenario here, worst case scenario there,
[00:31:13] Unknown:
and then run their tests, and then they can dispose of it every day. In your experience of exploring this space, keeping up to date with the use cases and the technologies behind it, what are some of the most interesting or unexpected or challenging lessons that you've learned? I mean, oftentimes, I think where
[00:31:30] Unknown:
it's always gonna be sort of like: the great thing about the lakehouse is that everything's very modular, so you can kinda swap out the different pieces you want, but there are still, like, little gotchas. And it's particularly, as I mentioned earlier, when you're working with any catalog in the Iceberg space, there are sort of two layers. So you have to make sure that you have the authentication to access the catalog, and you have the authentication to access the storage. And different tools have different stories when it comes to both of those layers. And that's oftentimes where a lot of gotchas kinda come in. So I always just say, hey, do the legwork to make sure that, when you're working with the catalog, the tools you use can read the catalog and then also access the storage, because it can definitely bite them in the butt where they're working with something that they like, but then they move to X tool.
And now, well, they were using, let's say, X object storage, and now that particular storage layer isn't readable by that tool, so it kind of interferes with their plans, even though it could interact with Nessie or some other catalog.
[00:32:26] Unknown:
And for people who are interested in these versioning capabilities, what are the cases where Nessie is the wrong choice, and maybe you're better served by just using AWS Glue, or maybe not even using Iceberg at all? Yeah. I mean, well, basically, if you're using Iceberg, I think Nessie is a good option. Now, the reasons you might choose AWS Glue would oftentimes be because you're really inside
[00:32:49] Unknown:
the AWS ecosystem. So if you're connecting to Athena, you're connecting to Redshift, you're connecting to all these tools, then, you know, AWS Glue is gonna be a very easy sell, because it's gonna have that interactivity. But if you're operating multi-cloud or in a completely different cloud, that's not necessarily gonna have the same kind of saliency. But if you're not even using Iceberg at all, like you're using Delta Lake or Hudi, then oftentimes different solutions might work better. Generally, there, the only option would be, like, LakeFS; you only have file-level versioning available at the moment. Which, again, I always think of as another feather in the cap for Iceberg: not only does it have the rich ecosystem of things that can write to it, read from it, and manage tables, but you also have a rich set of options for how you can version control your tables, whether it's file versioning, table-level versioning, or catalog-level versioning. Iceberg really gives you a lot of options to really architect the lakehouse you need. Have you ever seen where people are using both LakeFS and Nessie in tandem?
I don't think I've seen it yet. I've seen one or the other. Theoretically, they can work together. I mean, basically, one of the issues LakeFS has with Iceberg in particular is that Iceberg really depends on absolute paths, and LakeFS depends on relative paths. So LakeFS had to create their own custom catalog. The problem with a custom catalog, though, is always engine support. So it works with, like, Spark or Flink, but then you get to many other engines and you have trouble connecting that catalog. So I could see a world where, basically, someone is working with multiple formats. They may be working with a Delta Lake and an Iceberg, and they might wanna use Nessie for Iceberg, but they wanna use, like, LakeFS for Delta Lake. I can see that. And, I mean, I can see different situations where you're working with data that's outside of a table that you're gonna want to roll back, whether it's, like, you know, a group of CSV files. In that case, LakeFS would be helpful. But, again, when it comes to your main lakehouse catalog, you might prefer a Nessie to provide those kinds of semantics. So I can see a world where all three levels have benefits, because even at the table level with Iceberg, a nice thing about being able to tag tables in Iceberg is that it prevents them from being cleaned up when you do cleanup operations. So if I tag, like, an end-of-month snapshot, then when I expire snapshots, it won't clean up those tagged snapshots. I mean, it's the same story when you tag commits at the catalog level. But, again, there are gonna be different situations where you might want each of these levers to be available to you. So I haven't seen it too much yet, because I feel like I'm just starting to see people adopting these kinds of patterns, at least at the lakehouse level, and also this sort of Git-style delivery of them. I'm starting to see it more and more adopted, but it's still sort of very early days.
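For the table-level tagging mentioned here, newer Iceberg releases expose branch and tag DDL plus snapshot-expiration procedures directly in Spark SQL. A hedged sketch, with made-up table name, tag, and retention window, assuming the Iceberg Spark extensions are enabled:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Iceberg Spark extensions + a catalog named "nessie"

# Tag the table's current snapshot so cleanup operations keep it around.
spark.sql("ALTER TABLE nessie.sales.orders CREATE TAG `eom_2024_03` RETAIN 365 DAYS")

# Expiring old snapshots later will skip snapshots still referenced by a tag or branch.
spark.sql("""
    CALL nessie.system.expire_snapshots(
        table => 'sales.orders',
        older_than => TIMESTAMP '2024-04-01 00:00:00'
    )
""")
```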
[00:35:21] Unknown:
For people who are interested in Nessie and want to keep abreast of its development and its future direction, what are some of the things that are planned for the near to medium term, or anything that you're keeping an eye on or you're excited to see come to fruition?
[00:35:36] Unknown:
I guess my wish list is probably gonna be, like, PyIceberg support for Nessie. That's definitely high on my wish list. I tried to make that contribution. I started, like, writing some of the pull request, but I ended up just not having the time that I would like. So if anybody wants to help contribute that, please go join over there. There's a lot of great work to do there, and there are a lot of really great devs working over there on Nessie that you can communicate with directly on the Nessie Zulip. So Nessie, instead of Slack, uses Zulip, which is like the open source Slack. So you can communicate there, and there you can, like, learn and participate in the conversation about the evolution of the format.
I mean, not the format, the catalog, and its future features. But I would say my short-term wish list would be PyIceberg support. One of the cool things that I keep hearing about for the long run is gonna be, again, that sort of more context awareness. And then what also would be really cool is if eventually, you know, the pull request gets accepted over there at Databricks, I mean, Delta Lake, to be able to support Delta Lake in Nessie or something like that, to just offer more options. So that way, ideally, you know, you have a catalog that can hold all the things. But it's a pretty cool tool, and the patterns I'm seeing with it are pretty fun. And I think what's most unique about it is just when you start doing the SQL for it and you're seeing how easy it is to do. Yeah. To me, that's when I was like, okay, this is nice. This is just easy and simple to use, and it really does make a lot of new patterns a lot easier to execute.
[00:37:12] Unknown:
Are there any other aspects of the Nessie project, the overall kind of use cases, or the capabilities of data versioning in the lakehouse that we didn't discuss yet that you'd like to cover before we close out the show? I guess a couple other use cases,
[00:37:26] Unknown:
that I think are implied, but just to make them explicit, are, like, multi-table transactions. Right now, they have introduced, like, multi-table transactions at the table level in Iceberg, but the way it's done is you have to use a catalog that supports it, and they have to kinda implement these multi-table transactions. And it's more like a traditional sort of begin-end transaction style, where, basically, you have to kinda do everything at one time. The nice thing about the Git style, which you get with, like, a Nessie or a LakeFS (they're both taking that sort of Git-like approach), is that I can create a branch and I can do multiple transactions, and none of those transactions are published until I do a merge. So I could be doing one transaction on one table in Spark, another transaction on another table in Flink, another transaction on another table from Trino or Dremio.
And then, once all those transactions are done, they can all be published simultaneously to all those tables through one merge. And that's sort of a unique capability that just doesn't actually currently exist in a data warehouse at all. So that's a really neat thought process, because I do think it opens up some new ways that you think about how you do those transactions across multiple tables and work with multi-table semantics.
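As a last sketch, the multi-table transaction pattern described here looks roughly like this: stage any number of writes on a branch, then publish them all with a single merge. The tables, the staging sources, and the branch name are hypothetical, and in practice the individual writes could come from different engines as long as they target the same branch.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes a Nessie catalog named "nessie"

# Stage several independent writes on one branch; nothing is visible on main yet.
spark.sql("CREATE BRANCH IF NOT EXISTS month_end_close IN nessie FROM main")
spark.sql("USE REFERENCE month_end_close IN nessie")

spark.sql("INSERT INTO nessie.finance.ledger SELECT * FROM staging_ledger")
spark.sql("UPDATE nessie.finance.balances SET closed = true WHERE period = '2024-03'")
# (These could just as well be run from Flink, Trino, or Dremio against the same branch.)

# One merge publishes every table's changes to main together.
spark.sql("MERGE BRANCH month_end_close INTO main IN nessie")
```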
[00:38:40] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:38:55] Unknown:
Basically, my opinion is it's gonna be something to tie it all together, and that's kind of why I find working at Dremio really exciting, because we do have a tool that's really trying to tie things like Iceberg, Nessie, and all these different data sources together in sort of one cohesive platform, where it feels like you're getting that modular system, but it comes with, like, the ease of use and nice sort of flavor that you get with a more integrated system like a Snowflake, but you get that ease of use in a more deconstructed system on the lakehouse. And I think that has been the thing that people have been really looking for, and I do feel like we are on the verge of really kind of providing the solution to that. So if that's a pain you're feeling, definitely come talk to me.
[00:39:38] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your perspective and your experiences working with Nessie and helping us understand the problems that it solves and how to incorporate it into a data lake environment. It's definitely a very cool project. It's great to see more investment in and evolution of this data versioning capability in the data processing ecosystem. So I appreciate the time and energy you're putting into that, and I hope you enjoy the rest of your day. Thank you very much. It was a
[00:40:10] Unknown:
pleasure. Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview of Data Lakes
Interview with Alex Merced: Introduction and Background
The Nessie Project: Overview and Core Features
Comparison with Other Tools: LakeFS and Snowflake
Catalogs in Data Lakehouse Environments
Nessie Architecture and Evolution
Deployment and Integration of Nessie
Impact of Iceberg on Nessie Development
Use Cases and Workflow Changes with Nessie
Innovative Applications and Lessons Learned
When to Use Nessie vs Other Solutions
Future Directions and Wishlist for Nessie