Summary
Data lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. With more real-time requirements and the increasing use of streaming data, there has been a struggle to merge fast, incremental updates with large, historical analysis. Vinoth Chandar helped create the Hudi project while at Uber to address this challenge. By adding support for small, incremental inserts into large table structures, and building support for arbitrary update and delete operations, the Hudi project brings the best of both worlds together. In this episode Vinoth shares the history of the project, how its architecture allows for building more frequently updated analytical queries, and the work being done to add a more polished experience to the data lake paradigm.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
- Your host is Tobias Macey and today I’m interviewing Vinoth Chandar about Apache Hudi, a data lake management layer for supporting fast and incremental updates to your tables.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Hudi is and the story behind it?
- What are the use cases that it is focused on supporting?
- There have been a number of alternative table formats introduced for data lakes recently. How does Hudi compare to projects like Iceberg, Delta Lake, Hive, etc.?
- Can you describe how Hudi is architected?
- How have the goals and design of Hudi changed or evolved since you first began working on it?
- If you were to start the whole project over today, what would you do differently?
- Can you talk through the lifecycle of a data record as it is ingested, compacted, and queried in a Hudi deployment?
- One of the capabilities that is interesting to explore is support for arbitrary record deletion. Can you talk through why this is a challenging operation in data lake architectures?
- How does Hudi make that a tractable problem?
- What are the data platform components that are needed to support an installation of Hudi?
- What is involved in migrating an existing data lake to use Hudi?
- How would someone approach supporting heterogeneous table formats in their lake?
- As someone who has invested a lot of time in technologies for supporting data lakes, what are your thoughts on the tradeoffs of data lake vs data warehouse and the current trajectory of the ecosystem?
- What are the most interesting, innovative, or unexpected ways that you have seen Hudi used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Hudi?
- When is Hudi the wrong choice?
- What do you have planned for the future of Hudi?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Hudi Docs
- Hudi Design & Architecture
- Incremental Processing
- CDC == Change Data Capture
- Oracle GoldenGate
- Voldemort
- Kafka
- Hadoop
- Spark
- HBase
- Parquet
- Iceberg Table Format
- Hive ACID
- Apache Kudu
- Vertica
- Delta Lake
- Optimistic Concurrency Control
- MVCC == Multi-Version Concurrency Control
- Presto
- Flink
- Trino
- Gobblin
- LakeFS
- Nessie
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
RudderStack is the smart customer data pipeline. Easily build pipelines connecting your whole customer data stack, then make them smarter by ingesting and activating enriched data from your warehouse, enabling identity stitching and advanced use cases like lead scoring and in app personalization. Start building a smarter customer data pipeline today. Sign up for free at dataengineeringpodcast.com/rudder.
[00:01:22] Unknown:
Your host is Tobias Macey, and today I'm interviewing Vinoth Chandar about Apache Hudi, a data lake management layer for supporting fast and incremental updates to your tables. So, Vinoth, can you start by introducing yourself? First of all, thanks for having me here. It's a great show, and it's a pleasure and an honor to be here. My name is Vinoth, and I am the PMC chair for the Apache Hudi project at the ASF. I started this project, born out of Uber in 2017, and I've been supporting the open source community for the past 4-plus years.
And, yeah, I really look forward to talking through the project's journey today. And do you remember how you first got involved in data management? Oh, yeah. Definitely. For me, my real interaction with this ecosystem was during my employment at Uber. I was a database guy. I'm still a database guy. I started my career at Oracle, working on the CDC products around GoldenGate, and then moved on to LinkedIn, where I led the Voldemort key-value store project for almost 3 years. Then there was a brief stint at Box, building some version of Firebase for them before that project was scrapped and I moved on to Uber.
Even after Uber, more recently, I was a principal engineer at Confluent, spending time turning ksql into ksqlDB and spending time on Kafka storage. So for me, I'm a database guy who really wanted to do big data analytics, and I did the whole Hadoop stack back in the day, if you will. At Uber, we built out the data team. This was the hypergrowth, hyperscale years of Uber, 4x-ing every year, launching cities every day and all of that. We built an open source stack. We brought in Kafka, we brought in Hadoop, we got Spark, HBase, Presto, everything. And after a year or so, we had a functioning data lake. Things were okay except for the fact that we were doing so much DIY data management in the lake. No transactions.
Right? And then, like, large batch jobs. And yeah. Essentially, we decided to do something about it, and I got sucked into that
[00:03:35] Unknown:
a lot. And I've been in that thing for almost 5 years now. That kind of leads us up to the motivation for building Hudi. So can you dig a bit more into what it is that you built there and some of the problems that you were dealing with at Uber that led you to building this new system?
[00:03:51] Unknown:
Yeah. So, honestly, I wasn't trying to innovate anything when we started writing Hudi. What we wanted to do was, you know, actually imagine Uber is a very real-time business. Right? So building data products within Uber is very different from, let's say, even LinkedIn. I came from LinkedIn, and LinkedIn is a brand in terms of data infra contributions to the industry. But at Uber, you have 10,000-plus people who operate the cities, and you almost had to build products like you're building for external customers. So the bar is pretty high. You have to get real-time data. And we had managed to get change logs from all the databases, collect event streams into Kafka, streamify everything. It was all streaming up to the point where it had to be ingested into the data lake. And then when you tried to do that, you had no good mechanisms to apply your database CDC logs incrementally onto the data lake.
And furthermore, you had to do these large bulk batch jobs. We stood up HBase and streamed all these CDC logs into HBase, and then we'd dump HBase into Parquet every X hours. It's pretty time consuming, extremely resource inefficient, and it doesn't give the latency that we want. Right? And it doesn't stop with just the ingestion. We noticed it was a more systemic problem where, let's say, you have the core trips table, which you can imagine the entire company wants to query. But there'd also be derived tables that you would build off of the core trips table, which are, again, all batch jobs, large recomputations.
And there is no intelligence in this model, in the batch processing model, to say, you know what? The upstream table does not have any new data, don't run it. Or, only reprocess these records; these are the only records that have changed upstream, you don't have to recompute the entire thing. So as a result, we could neither meet our data latency promises to the company, and we were burning a lot of hardware on the Hadoop jobs we were running to recompute these datasets. So we rethought this problem. We even considered systems like Hive ACID back in the day. We couldn't really do it because it's so tied to Hive, and, as we'll talk about later, there were some storage layout properties we didn't like. We also looked at things like Kudu, which was coming up at that point.
And it was promising, but the thing is you had to move this entire stack outside. It's a new storage system, which needs a new hardware stack, and it wasn't very cloud friendly if, tomorrow, you wanted to store all this data in the cloud. So it led us to a point where we said, let's think about how we solved this before, which was with Vertica. Right? Vertica had transactions and updates. And then we said, okay, all we need to do is build a transactional layer on top of HDFS or the Hadoop FileSystem, like a cloud storage kind of abstraction, and then provide updates and deletes.
And then, for the derived data problem, we asked: how do people generally solve it? People write stream processing pipelines, which process data incrementally. And how is that done? That's done by something like Kafka providing you a first-class ability to efficiently consume only the incremental changes from a topic. Right? So we brought that incremental change stream, the CDC from databases on Kafka, to this world, and that's how Hudi was born. It was created as an abstraction which provided updates, deletes, and incremental changes from tables.
And we used that first pass to make all of our ingestion and our core derived warehouse tables, if you will, completely incremental.
[00:07:51] Unknown:
That's the story. Yeah. It's definitely a very interesting project. And this is an interesting time in the overall ecosystem of data lakes because, as you mentioned, Hadoop sort of reigned supreme for the early to mid 2000s and then started to fade out as things like S3 gained more prominence and other engines built on top of the actual file system layer were built up to get more of a database-type abstraction, and the MapReduce paradigm went away. And so in the past 3 to 4 years, there have been a number of other table formats that have come to the fore, Hudi being 1 of them, other notable entrants being Iceberg in particular, and then Hive has been around for a while as kind of the table stakes for that space. And I'm just wondering what your perspective is on the current state of the ecosystem for table formats for data lake engines and the overall role that these different metadata and schema management layers play in the overall storage and processing ecosystem?
[00:08:50] Unknown:
Right. Right. This is a very big question, and also, I think, something that everybody wants to hear about as well. So if I may, I'll take some time to break this down and explain in layers, if you don't mind. Yes, please. So first of all, this whole table format term is kind of new. Essentially, it may be helpful to first define what that is and position Hudi relative to some of the other formats out there. A file format has a schema; it gives you the ability to write and read data. A table format you can think of as something that tracks the schema for a table, some partitioning information, and some statistics for the table. That's basically what its role is. It also provides a concurrency model around it so you can change this data with some guarantees.
Right? It's more about the metadata around the table. And when we look at Hudi, we never thought about Hudi as a table format; it wasn't designed as a table format. Hudi is designed more like a database that is built on top of these table formats, if you will. It comes with indexing, it comes with a redo log, and then, like databases have daemons that do various things to make the database functional and also optimized and performant, it comes with a set of daemons which all know about each other and are all working off of the redo log. And a format is 1 piece of that puzzle. Hudi started by simply adding transactions, updates, and change capture on top of the Hive format, if you will. Hive tracked schema and partitioning information in the Hive metastore, but it couldn't really track anything scalable around, you know, file-level statistics.
That was the biggest gap. Right? Or file listings, for example. So for us, contrary to what I've heard from other format designers, it's actually doable to build a transactional layer on top of the Hive format. That's what we did. We've shown that it can be done; there's a 200 terabyte data lake where it's been running like this for 5 years, and many, many companies have done this as well. And to the earlier point, we actually have a Jira open for almost 2 years now to say, hey, can we plug in something that's also open under the same Apache umbrella, like Iceberg, as an alternative? Just like how we sit on the Hive format, can we make the database parts of Hudi sit on top of Iceberg? So that is something that we remain open to; it's just been a resourcing thing. So, end of the day, to summarize, I feel table formats are removing decade-old bottlenecks in terms of metadata access around the tables.
It's still a means to an end. That's how we look at it. Now, going down into the actual issues around table formats: the Hive format has the issues you mentioned. It doesn't track anything at the file level, it has folder-based partitions, which are pretty rigid, and it has poor or no concurrency models. And I think when you operate in the cloud, especially with large tables, you face all these listing issues, or issues when you try to scan large tables; small tables are okay. So over time in Hudi, we also hit the same issues, and we extended the Hive format with our own native approach; Hudi implements its own table format internally, which, for example, stores file listings and some statistics, similar to what Iceberg and Delta have done. But we tried to design for our original goal of enabling a streaming model of writing and reading data on the data lake, which means Hudi's table format optimizes for smaller files, which means a lot of metadata. So we index the metadata in our table format.
And we've been building our table format for a different purpose, which is more in line with the overall goals of Hudi: to unlock incremental data processing, or change batch processing to be more general purpose across the lake. With this context, and thanks for being patient, the actual similarities: all 3 projects provide transactional inserts, versioning, and point-in-time queries. Delta Lake and Hudi have supported updates and deletes for almost 3 years now; it's been out there. I think Iceberg just got a v2 format for deletes, which is nearing completion as I understand it. For the differences, I'll try to touch upon a few major areas. If you look at the storage layout, how you actually version and store data, Hudi started out supporting 2 ways of laying out data. 1 is copy-on-write, where you just keep versioning Parquet files.
The other is merge-on-read, where you have delta logs as you update a table, and then you merge on the read side. We actually see these terms now used across, you know, Delta talks or the Iceberg community and such. So talking in those terms, Delta does not have the MOR capability yet; ironically, there are no delta logs inside Delta Lake. Iceberg's v2 models around the Hive ACID layout, where you keep log files which are more global to a partition or a table. Going back, since we want to optimize Hudi for very fast upserts, we bucket the delta logs. We have an indexing mechanism where we map every single incoming record to a single file, and then we break down these deltas and push them to individual files, which helps us reduce the overhead of merging; the amount of data that you scan during merge-on-read is lower in this kind of layout model. These are some things that are different at the storage layout level. You will get different performance for different workloads with these 3 formats. Right?
Concurrency: both Delta and Iceberg support what is called optimistic concurrency control, which essentially means I'm not going to expect concurrency; if there is concurrency, 1 of them will succeed and the other 1 will fail. The problem with this approach is when you mix it with long-running transactions, which is what batch jobs are, writing across Delta or Iceberg. The longer the batch job runs, the higher the probability of collision, just theoretically speaking. And these jobs run for a long amount of time, they waste a ton of CPU, and then finally 1 of them fails. So from early on, we've embraced more of an MVCC model. Just like a database, we differentiate between internal daemons and actual client writers to Hudi, where writers may get optimistic concurrency control between them; 1 of them can succeed, 1 of them can fail. The internal services, like clustering data and compaction, anything that is maintaining the table, can run in a completely non-blocking way. So you can keep writing changes to a file as we compact data in the background. And this is, I think, very key if we have to really deal with mutable data at large scale. I feel like the core designs of formats like Delta and Iceberg are more for append-only workloads, in which I think OCC may work fine. But when you have a lot of deletes or updates, when your core ingestion is updates, these problems will show up at scale. Most of the design choices that we made in Hudi are made for running a high-scale data lake, something like Uber's, where you cannot really violate the SLA of the writes. We want to replicate the trips database very quickly, while there will be deletes to the trips table because some user left the system, or GDPR, or something happened. So how do you accommodate these 2 things without locking and blocking and timing out and retrying? That's where Hudi really stands out, and these are very key database design differences which sometimes don't get bubbled up as much in the data lake ecosystem when you simply look at it as a format, if you will.
The other interesting thing is we add record-level metadata, and then there's the incremental query part of it. If you go back to stream processing concepts, there is event time and there is arrival time. We actually track both for data that is written in Hudi, so we are able to give you a Kafka-like experience when you say incremental query. With Delta, my understanding is you can get the files that changed, and if you diff Iceberg snapshots, you can get the files that changed between snapshots. But with Hudi, you get record-level CDC. And then we have a ton of metadata internally that we track to build, you know, watermarks that something like Flink can eventually understand. So we are building towards that future. Finally, looking at it as a project, the actual software in the project, looking at it just as a table format doesn't account for all the platform components that we have. Just like how Kafka doesn't just give you a pub/sub system, it gives you connectors, monitoring, a bunch of things around it that you can take and deploy and actually use Kafka for free in your organization, Hudi has built ingest tools and ETL tools. It can do out-of-the-box deduplication of event streams before ingest. A lot of upper-level functionality has been built in layers on top of this kind of database layer. That's what you get today in the project. And these are all thanks to the great work that has happened in the open source community. We didn't have much of this when we open sourced the project from Uber, but all of this has been built, and that's, again, something that highlights the power of open source and how much we can do together here.
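To make the record-level incremental query idea above concrete, here is a minimal sketch (Scala, spark-shell) of an incremental pull through the Hudi Spark datasource. The table path, column names, and begin instant are hypothetical, and the option keys follow the Hudi documentation, so check them against the release you are running.

```scala
// Hedged sketch of an incremental pull; the path, columns, and commit time
// below are made up for illustration.
val basePath = "s3a://my-bucket/lake/trips"   // hypothetical Hudi table

// Request only records committed after a given instant on the timeline,
// instead of re-scanning the whole table.
val changes = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", "20210615000000").
  load(basePath)

// Each row carries Hudi's record-level metadata (commit time, record key),
// which is what enables the Kafka-like change stream described above.
changes.select("_hoodie_commit_time", "_hoodie_record_key", "trip_id", "fare").show()
```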
[00:18:59] Unknown:
Given the level of functionality and the number of different capabilities that Hudi introduces to the data lake ecosystem, with these database-like capabilities such as MVCC, where you can have a snapshot view of the state of the data during the lifetime of your transaction, and the ability to manage arbitrary updates and deletes of records within the table, which is something that has historically been challenging, it starts to look more like the so-called lakehouse architecture that has been gaining some popularity, at least in terms of the nomenclature. And I'm wondering, with what Hudi is bringing to the data lake ecosystem, what do you see as the current dividing line and motivating factors between a data warehouse, particularly with cloud data warehouses, and data lakes, given things like Hudi and some of the overall processing engines that are out there? The lines are starting to blur a bit more. So as somebody who's working in the space, building this data lake capability, what do you see as the motivating factors to land on 1 side or the other? First, on the lakehouse, the idea of a lakehouse. Right? And I think
[00:20:14] Unknown:
it's a pretty neat way of putting this. When it first came out, I at least wasn't that surprised, because I've worked at 2 companies now, LinkedIn and Uber, large data companies with a lot of open source, and that's how we built it. You take your operational systems, you first store data in an open file format, that's what you use for analytics using something like Presto, and then the same data gets fed into machine learning. I think Databricks has done a really good job of elevating this as a term, and now we have a reference point to talk about it. So I think a good amount of credit should go to them for up-leveling the conversation in the industry around this.
Now, the vision for a lakehouse, when I think about it and compare it to warehousing, to the cloud warehouses: I do feel like cloud warehouses, and warehouses in general, are perceived as a much more usable product. And this is something that on the lake side, and I'm including myself in this, we keep focusing a lot on the technology, but warehouses make it very, very easy for users to onboard and use them, even with all their other gaps. So part of the lakehouse vision is that, with Delta Lake installed on Databricks or Hudi installed on EMR, you can go and build your own lakehouse.
Right? But with Snowflake and BigQuery, you don't have to build that warehouse. It's built for you. So these are the things that we've slowly tried to change with Hudi: to build more and more components and make it, as much as possible, a single stack that people can take and deploy and monitor and operate. But I think this level of usability and user experience has to be built around the lakehouse, even with the technology gaps, I feel. That is, I think, super important if we are to really unlock this thing at scale. All the large tech companies may be able to hire a lot of good engineering talent and build these kinds of things in house, but a company that's starting out and trying to grow is not going to be able to hire like that. They want something that's readily available.
A lakehouse that is readily usable, a turnkey solution: you can deploy it and then it works. Then we get into the actual technical differences, architecturally, between cloud warehousing and the lakehouse. I essentially see 1 major difference, 1 major limiting factor, which is running ML and data science workloads on top of cloud warehouses in a cost-efficient way. And let me double-click on that. A good chunk of AI/ML and data science is writing jobs for feature engineering, scanning large amounts of data. Correct?
Data warehouses are best in class, I feel, in terms of query performance, even as engines like Presto and Trino are constantly catching up to them. The problem of querying data is about reducing the amount of data that you scan so that you can return the results very quickly. But the job of data science and data engineering workloads is to scour through large amounts of data very quickly and extract new things. So if you look at, let's say, even Snowflake, if it's still the same, last I checked, the largest cluster that you can run is 128 nodes. And at that point, you're already paying a very high cost for that ginormous cluster. If you want to do data science using Spark, you need to pay for a Spark cluster and keep a Snowflake cluster running to access data through it. This is prohibitively expensive for anyone to do even at moderate scale.
So I think the technical gap remains: can cloud warehouses really support cost-effective scans for data engineering and data science workloads? That I see as the core technical advantage that you have in the lakehouse or data lakes today versus the warehouses. Of course, there is also openness. Even if the cloud warehouses solve this scaling problem, I still feel like most of the data should be kept in an open file format, like Parquet or ORC, managed by the open table formats that we discussed.
And that's the better future for most companies to be building towards. We should go back to the model where these are all external tables in a cloud warehouse. If you really want the price performance of Snowflake or BigQuery, you move your data into the native format; otherwise, you keep it in Parquet. And this, I think, is a better future for us to build towards, just for the openness. It gives everyone a lot of choices, and it generally fosters better innovation, I feel. You're not beholden to 1 company adding support for a particular framework to access your data efficiently. So that's how I generally think about this landscape. Data lakes need to fix a lot of things around usability and experience and do more out of the box.
Data warehouses have to get over some of the cost and scalability issues around accessing larger volumes of data.
[00:25:41] Unknown:
Still, I think we should have a model where we have open data, and that's what the industry should be building towards. Yeah. I definitely agree on all of those points, especially about the usability and access layer for the data lake, because that, I think, is the piece that's still missing, at least in the open ecosystem, where there are some companies that have built that layer on top of open formats to make it easier to manage data lakes in an organizational capacity. Whereas if you want to put together your own stack from pieces that you pull off the shelf, there's a lot of engineering to do, and so I'm definitely interested to see how that evolution progresses as data lakes move more towards the usability of cloud data warehouses and cloud data warehouses maybe start to add extensibility to support more open formats. That's definitely 1 of the things that always gives me pause as I start to look at things like Snowflake or BigQuery: okay, I can do this and I can use it, but then all of my data is beholden to this company until I pay the expensive cost of exfiltrating it to a different system down the road. I think we can spend a little bit more time on this topic. Right? I think, yeah, that is the part that I feel like, if the data lakes don't fix this problem in the next few years,
[00:26:56] Unknown:
people have the need, though. The data volumes are increasing, and they can't really afford to build a lot of this themselves. Those things are pretty clear. So this data has to go somewhere. If it's not the lake, then we'll be in that situation. And for me, working with the Hudi community for 4 years, spending time on a daily basis looking at what people are doing in the community, I can see people who are data engineers in their companies. They're really good at writing Spark jobs and writing business logic, a fraud detection pipeline; they can do an excellent job of writing that. A lot of those engineers are suddenly tasked with, hey, build out the data lake for this entire company. And then what happens is, it's not just a technical problem. You need to grow a lot of data engineers into platform engineers. That's a very rewarding journey for the individuals who are trying to do it; some of them come out with flying colors, but some of them don't. And for the company, it takes a long time to get ROI on these projects because a lot of the platform components aren't there.
And going back to Hudi, to take 1 example, we have this tool called the DeltaStreamer tool. We named it DeltaStreamer even before Delta Lake came out, so it has nothing to do with Delta Lake. But anyway, it makes it very easy for you to ingest, say, Kafka topics into an S3-backed Hudi table by issuing a spark-submit command. And if you run it in continuous mode, it can do clustering for you, it can do compaction; all of that is self-contained within that single Spark job that you submit. And people seem to really like it because you get consistent monitoring out of it. So you've tooled away at least 1 large use case.
And then you're able to hit the ground running and show progress to your organization much more quickly. I'm not saying this is the ultimate thing and we've solved all of it, but I definitely see that, even more than the tech, this is where the data lakes have to focus all their dollars. Now digging more into the Hudi project itself, you've mentioned a little bit about some of the architectural elements as far as supporting MVCC
[00:29:18] Unknown:
and having the compaction layers and copy-on-write versus merge-on-read. But I'm wondering if you can give a bit more of an overview of how the overall Hudi project is implemented, the different moving pieces, and the components that are necessary to deploy to get it actually set up and functional from an end-to-end perspective?
[00:29:41] Unknown:
At a very minimum, what you need to play with Hudi is a Spark shell and access to S3, so you can write some data, then update the data, delete the data, and query it, all from there. But a typical setup looks like this: you have some upstream data, let's say it's all sitting in some Kafka topics or some files that are getting uploaded to S3 buckets. Then you run the DeltaStreamer tool, or you can write your own custom Spark jobs or Flink jobs, to read that data and first ingest it into a bunch of raw Hudi tables and register them with a Hive metastore, the Glue catalog, or, you know, Google Cloud offers you a managed HMS; any metastore.
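As a rough illustration of that minimal "Spark shell plus S3" starting point, here is a hedged quickstart-style sketch in Scala. The bucket, table, and field names are invented, and the config keys mirror the Hudi Spark datasource quickstart, so verify them against your Hudi version.

```scala
// Minimal sketch: write a tiny Hudi table and read the latest snapshot back.
import org.apache.spark.sql.SaveMode
import spark.implicits._

val basePath = "s3a://my-bucket/lake/trips"   // hypothetical location

val inserts = Seq(
  ("t1", "sf",  21.0, 1623715200L),
  ("t2", "nyc", 13.5, 1623715260L)
).toDF("trip_id", "city_id", "fare", "ts")

inserts.write.format("hudi").
  option("hoodie.table.name", "trips").
  option("hoodie.datasource.write.recordkey.field", "trip_id").      // key used for upserts/deletes
  option("hoodie.datasource.write.partitionpath.field", "city_id").
  option("hoodie.datasource.write.precombine.field", "ts").          // pick the latest change per key
  mode(SaveMode.Overwrite).
  save(basePath)

// Snapshot query of the table we just wrote (recent Hudi versions accept the
// base path directly; older releases needed a path glob).
spark.read.format("hudi").load(basePath).show()
```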
From there, you're able to query this data in different ways from all the major lake engines: Presto, Spark, Trino, and also managed solutions like Athena from AWS, Redshift, and even Impala. I think we support Impala on copy-on-write as well. So that's how the overall layout looks. Then, going 1 level down, let's try to understand what happens on the writer side, how files are actually laid out in storage, and then how actions and queries actually access this data in a consistent way. So first, talking about storage: you can have an unpartitioned or a partitioned Hudi table. For simplicity, let's take a partitioned example.
So essentially, you write data. Let's say you're inserting data; it produces a bunch of Parquet files. Each file is assigned a unique UUID, which we call the file ID. From there on, for the merge-on-read storage type, any update to an individual Parquet file is stored as a delta log, which is associated with that file. For each file, we call this combination of base file and delta logs a file group. For copy-on-write, we don't write delta logs; we implicitly merge the changes into the Parquet file and produce a new Parquet file. So essentially, what you will see is data grouped into file groups, where each record, identified by a key, is at any given time mapped to a single file group. And within each group, you have these slices of data: there's a base file, a bunch of log files, then maybe a compaction runs and produces a new Parquet file, which is a new slice of data. So this is the MVCC kind of storage model within Hudi. And all actions, commits, deletes, updates, and the background actions we have, like compactions, remove older file slices as you write more and more data.
And to keep the storage growth bounded, there's a compactor, which can take the logs, merge them into the base file, and produce a new base file, and clustering, which will look at a bunch of file groups and say, I'm going to Z-order these things into a new set of file groups. All of these things work on a single event log, going back to our inspirations from stream processing, and they all materialize their plans, for compaction or for clustering, to a folder under each table, which we call the timeline. And that's the source-of-truth redo log for this entire database.
So all these background services and writers coordinate based on events that they see in this timeline. And what happens today, to get to all these extensions to file and table formats that we needed to build, is that this timeline is then replicated into an internal Hudi table, just like how, let's say, Kafka uses internal topics to store schemas and offsets and such, and we keep listings, statistics, tons of metadata around the table there. So that's the storage layout and how it works. When you're writing, the write pipeline can natively support something called pre-combining. Here's the use case for something like that: let's say you are obtaining change logs from an upstream database, and within a single run, just in a single batch, a record may be updated many times. Say, for example, an Uber trip ID is the primary key, and you get many updates to that same trip. People need a way to combine these into a single record before they apply it to storage. So the first step of the upsert and delete operations is to combine the change logs that you're getting into a final form.
Then we look up an indexing scheme, which gives us a mapping between a key and the file group. We support external indexes, we support a join-based index, we support indexes using bloom filters, and we have PRs out for record-level indexing. So this is, just like in a database, an example of functionality that Hudi builds on top of a table format. The indexing step tells you whether it's an update or an insert. If it's an insert, we create new file groups. If it's an update, we either log deltas or we produce new Parquet files. The final stage is actually writing the data out, and we do many optimizations here, like sorting the data so that it compresses better or provides more locality for queries. For example, in the Uber case, we would want to sort by city ID if data scientists are always working based on city IDs, always querying by city ID, and the dashboards are plotting stuff for a given city. Before we ingest, we want to sort by city ID so that you scan less data. So those kinds of storage layout optimizations happen, and the data is finally written out.
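A hedged sketch of the write path just described, pre-combine on a field, index lookup, then upsert, using the Hudi Spark datasource. The changelogDf input, field names, and path are assumptions for illustration; the config keys are from the Hudi docs but may differ slightly across versions.

```scala
// Hypothetical CDC input: possibly many updates per trip_id in one batch.
val changelogDf = spark.read.json("s3a://my-bucket/raw/trips_changelog/")

changelogDf.write.format("hudi").
  option("hoodie.table.name", "trips").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "trip_id").
  option("hoodie.datasource.write.precombine.field", "updated_at").  // step 1: keep latest change per key
  option("hoodie.index.type", "BLOOM").                              // step 2: bloom-filter key -> file group lookup
  option("hoodie.datasource.write.partitionpath.field", "city_id").  // step 3: layout/locality before writing out
  mode("append").
  save("s3a://my-bucket/lake/trips")
```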
For the queries, let's take a Presto query. A Presto query is going to see it as a Hive table, pretty much. And then within Presto, there's a point where it says, hey, I'm going to do everything how I normally do, except I know this is a Hudi table, so I'm going to go look at the timeline, figure out what the latest committed transaction is, and then filter the files, regardless of whether the listing comes from Hudi's metadata or from actually listing S3 or HDFS. It simply does that filtering and then exposes the latest snapshot of the table for you to query, for snapshot queries. If you look at something like Spark, we support a lot more flexibility, where you can either query it as a snapshot table or you can say, give me the records that changed after 8 AM, and we can give you the records after 8 AM. And you can also do a point-in-time query where you can say, hey,
give me the records in the past that are, say, a month old. Then we simply let you query the file slice that was written back then, as long as you retain it in history. So we expose a lot of flexible ways to query the data. And specifically talking about MOR, we support 1 additional view of the data, called the read-optimized view. There's a very practical scenario where you normally want to keep all of your data in Parquet: you have, say, date-partitioned event logs or something, which is everywhere, and then you get sporadic deletes that you want to log in the delta logs and compact later.
For those kinds of scenarios, we offer an additional view for the query engines which will only query the base Parquet data, so you get great columnar performance without any of the merging overheads. And as long as you design a compaction plan that works for a common use case like this, you actually get best-in-class columnar queries for recent data, while Hudi's indexing, the MOR logging, and asynchronous compaction keep things like data deletion chugging along without affecting queries or even affecting ingestion jobs. This is actually how we pulled off data deletion at hundreds of petabytes of scale at Uber, because we cannot take locks, going back to our previous conversation. So these are some common scenarios in which you use it. And all of these background services can be run on an external workflow scheduler like Airflow, or they can be run inline with every write, which will add latency to the write, but it's simpler.
Or they can also be run within the same process as the writer, but done in an asynchronous way. So, for example, you kick off a Spark streaming job that's writing data from Kafka, but it's also using Spark's internal job pools and such to intelligently schedule clustering and compaction and do all these background tasks. It's like how, say, RocksDB or Berkeley DB would behave if you embedded them within your application: they'd be running some background threads that are cleaning and compacting. Similarly, Hudi can self-manage all of this for you.
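To show the MOR query views described a moment ago side by side, here is a hedged Scala sketch; the table path is hypothetical and the query-type values follow the Hudi Spark datasource docs.

```scala
val basePath = "s3a://my-bucket/lake/events"   // hypothetical MOR table

// Snapshot view: merges base Parquet files with any pending delta logs.
val snapshot = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "snapshot").
  load(basePath)

// Read-optimized view: only the compacted base Parquet files, so you get plain
// columnar scan performance and skip the merge overhead for recent deltas.
val readOptimized = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "read_optimized").
  load(basePath)

println(s"snapshot=${snapshot.count()} readOptimized=${readOptimized.count()}")
```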
[00:39:21] Unknown:
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14 day trial.
In terms of alternate write paths: right now, Hudi is fairly closely tied to Spark. What are the options for people who might be using a different processing engine or a different input workflow for being able to populate files into the Hudi ecosystem?
[00:40:19] Unknown:
Right. So the 1 thing I probably missed earlier is that you can use Spark and Flink today, not just Spark. But, yeah, this is 1 area where, since we did not design it as a generic table format that you would want to write data to from every engine, we basically picked Spark as something that people write most data pipelines in, and that's how we did it. So we are still working on, for example, how we would write data from Hive; we need to write a storage handler. And for Presto and Trino as well, in the coming releases, we are planning to add write capabilities. For now, Spark and Flink are how you write, and then you can query from everywhere else.
The reason is, again, that Spark and Flink are more general purpose; you can write more general-purpose DAGs. All of the great functionality that I talked about, where I can keep a thread pool in a long-running Spark streaming job, I can do pools, I can do indexing, which needs shuffling of data, all of these things we are able to do with full-fledged data pipeline processing frameworks like Spark and Flink. Whereas with more SQL-based writes, for example in Hive through the output format, if you want to do updates, we are not able to do the indexing step there. We have to do this pre-combining, the indexing, these additional write pipeline optimizations,
and we are not able to do that there. So we initially picked the major data processing engines, and we just tried to optimize this runtime to work really well on those. We are looking into the rest, but that's how things are today. And then another thing worth digging into a bit more
[00:42:00] Unknown:
is the capability around arbitrary updates and deletes that we touched on earlier. I know that that's a challenging problem, particularly in data lake ecosystems, because of the write amplification where, you know, a delete to 1 record in 1 file might then propagate to arbitrary downstream files because of a data dependency in other table structures. And I'm wondering how the architecture of Hudi helps to make that a tractable problem without the deletion of a single user record causing you to have to rewrite a large portion of your overall data lake storage.
[00:42:39] Unknown:
You know, going back, data deletion is probably the single biggest use case that propelled this category mainstream, I would say. For Hudi, like I mentioned, things like MOR, merge-on-read, asynchronous compaction, and indexing: the motivation for building those things in Hudi was essentially the data deletion use case. So the main problem, if we can spend a little bit of time just describing the problem: the data arrives in a certain order, and for analytical queries it has to be laid out in a different way based on the query, but then often when you try to delete, it's by some other field. For example, you're collecting events; events arrive in some order, and then your dashboards are grouping events by some dimension, so you probably want to sort or cluster by that dimension. Let's say, again from Uber, you cluster data by city ID and timestamp or something.
But you want to delete data by user ID, which is what happens when a user leaves your service, and, with data privacy laws, you're required to go and delete all these things. So the indexing does help. The record-level indexes, the approaches that we've taken with external indexes and bloom filters, help you identify quickly all the files that a given record, like a given user's record, is present in. But still, going back to what you were mentioning with write amplification, companies cannot afford to rewrite entire Parquet files. So that's where having a delta log, like we do with MOR, lets you just queue up all the deletes over a period and then have a compactor fold them in later, a few times a day, amortizing the cost of all the rewrites. That, I think, is how far we've gotten so far, to be honest.
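For a concrete feel of the delete path just described, here is a hedged sketch of issuing record-level deletes through the Hudi Spark datasource. The table layout, the user_id filter, and the field names are invented for illustration; on an MOR table, these deletes land in the delta logs and get folded in by a later compaction.

```scala
val basePath = "s3a://my-bucket/lake/trips"   // hypothetical table

// Collect the keys (and partition paths) for one user's records, e.g. to
// honor a GDPR erasure request.
val toDelete = spark.read.format("hudi").load(basePath).
  where("user_id = 'user-123'").
  select("trip_id", "city_id", "ts")

toDelete.write.format("hudi").
  option("hoodie.table.name", "trips").
  option("hoodie.datasource.write.operation", "delete").             // record-level deletes by key
  option("hoodie.datasource.write.recordkey.field", "trip_id").
  option("hoodie.datasource.write.partitionpath.field", "city_id").
  option("hoodie.datasource.write.precombine.field", "ts").
  mode("append").
  save(basePath)
```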
In some sense, though, what we really need is a system where you're clustering recent data for query performance based on what the queries are accessing, but as the data ages, you are also clustering by user ID or something, in a way that is more conducive to deleting data. And these things are not well platformized today; people are building their own recipes to do this. The other thing that I've seen people do with Hudi is partial merges. Instead of logging full deltas, you can write code to customize Hudi and say, okay, I need to do data deletion or masking, but I only want to mask this 1 column out of 100. So instead of writing all 100 again, you can actually log just the 1 column.
Then it gets merged flexibly for queries on top. So these are things that Hudi helps you do. Now, zooming out a bit, this is all at the single-table level, correct? How do you generalize this across a data lake architecture where you have raw tables and a bunch of jobs deriving tables? This is where I feel that moving to a more incremental model, borrowing from stream processing, is going to be the better approach for us to solve this generically. For example, if you were to send delete records in every change stream, and your ETL jobs are all actually processing them, then at least for a good chunk of use cases you can actually delete automatically.
For example, for some use cases, we did soft deletes with nulls. So when you do an incremental query, you will get a null for that user. Then you just upsert the downstream table, and it gets deleted downstream automatically. Nobody has to do any manual intervention: you update the raw table, and the downstream table is deleted. Okay, that's still rudimentary. But if you do it with the help of a stream processing framework like Flink, which has watermarks, it helps you define a lot of reconciliation policies and all of that. You can actually write intelligent things where you say, this thing got deleted, and now I have a pipeline which is doing some aggregation.
What portion of the output should I recompute? This is what stream processing frameworks actually solve very well, and that's where I feel the future should be: bringing all these primitives to batch data. Hudi's role here is to simply provide streaming-optimized lake storage so that you can actually write these programs. That's, I think, how we solve it in a less gnarly fashion. And this is a very, very hard problem, even compared to data warehouses, because people typically keep large-scale event data in data lakes, not warehouses.
So maybe you can afford to delete data from all your cloud warehouse tables, but on the lake, it's just too much volume to rewrite all these things. It's still an open problem, is all I'm saying. So in terms of
[00:47:51] Unknown:
the migration path for somebody who has an existing data lake and they want to adopt Hudi because of some of the guarantees and use cases that it enables, what is the process for actually migrating into that format? And what is the level of support for being able to have a heterogeneous set of tables where some of your data is using Hudi, whereas other tables are using Delta Lake or Iceberg or Hive?
[00:48:20] Unknown:
There are really 2 questions here. The first is, I have a bunch of Parquet tables; how do I bootstrap them into Hudi? We have a feature called bootstrapping, where we don't really touch or copy the source data, so you can keep writing new data to that table as long as you don't do destructive things. Hudi essentially mirrors that logical structure into a Hudi table and just generates indexing information out of the initial table, which is fairly quick. As you write data, we use the source data as external base files, and as you touch portions of your older data, it all slowly moves into the Hudi table.
And this model lets you keep running 2 pipelines in parallel: a Hudi pipeline keeps writing to the Hudi table, while the old data is kept immutable, and you can add new files to it as well if you want. Then you can do count verification between the 2 tables. It's a very practical way for data engineers to actually perform data migrations, and all of this functionality is what Hudi supports. It's integrated into, for example, the Spark writers, like the DeltaStreamer tool and all of that. They all have an option where you can say bootstrap, and on the first run it does all this bootstrapping and then gives you a table. So that is functional already.
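A hedged sketch of what that bootstrap option can look like through the Spark datasource: the new table is created alongside the untouched legacy Parquet data, and only Hudi's metadata and index are generated on the first run. The paths are invented, and the bootstrap config keys here are taken from the Hudi bootstrap docs as I understand them, so treat them as assumptions and confirm against your release.

```scala
// Bootstrap an existing Parquet dataset into a Hudi table without rewriting it.
spark.emptyDataFrame.write.format("hudi").
  option("hoodie.table.name", "trips").
  option("hoodie.datasource.write.operation", "bootstrap").
  option("hoodie.bootstrap.base.path", "s3a://my-bucket/legacy/trips_parquet").  // untouched source data
  option("hoodie.datasource.write.recordkey.field", "trip_id").
  option("hoodie.datasource.write.partitionpath.field", "city_id").
  mode("overwrite").
  save("s3a://my-bucket/lake/trips")
```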
That's typically what you do. That said, I actually see a lot of people doing full migrations as well, for the following reasons. 1 is that Hudi does things like file size management for you, which can significantly reduce the metadata overhead or the listing problems. Maybe your old data isn't sized that way, and many companies now have more awareness around these problems and how they affect query performance. They feel like they're getting a lot more with these frameworks, so they're even willing to fully migrate their tables once. That actually surprises me compared to what I would have imagined people would be willing to do. The second part of the question was more around interoperability between different table formats: do we see a polyglot world? I do think there will be a polyglot world, for the following reasons. I mean, I can say we'll have every capability the other formats have, and the other formats can say they'll have every capability we have.
Yes. Right? So I'll leave that out. But as it stands today, we are best in class in terms of CDC and quickly ingesting into our format. I'm talking about even just the lower-level design, the table format layer as we talked about before, not even including all the rest of the platform functionality. Our format is the best suited for large-scale streaming of data from operational systems into a lake, very quickly, because we are indexing and doing all of these other things. So for a company, I would imagine they would have all of their raw data in something like Hudi. It gets data in quickly and exposes the streams quickly, so it removes a lot of the impedance mismatch and gives you a very well-managed raw data layer, which is typically much more voluminous.
And it exposes these things for batch jobs as well. Now, you may be a Databricks user. That's the other thing: all of the functionality I talked about here is tied to the Databricks runtime as a paid feature in Delta Lake. It's an open format, but it's not an open runtime. So maybe you would want Delta Lake for derived tables, because that's when the Delta Cache will work for you; in the Databricks runtime, the Spark jobs you're writing within Databricks are optimized to work on the Delta format. So maybe for derived data you end up picking Delta Lake. And for similar reasons, maybe people want to pick Iceberg for the derived data, for batch pipelines.
You can do the same batch processing with Hudi as well. So I think there are sets of users for these batch, derived-data pipelines who would pick any of these three formats for derived data. But tying together all the different things we discussed here, how stream processing and incremental streams fit in, data deletion, derived-table efficiency: as the incremental stream processing model grows and becomes fully mainstream even for data lake processing, we expect most of this derived data to sit in Hudi as well.
To support that, you need much more than just a table format: you need a runtime and all these other functionalities. That's where I see the future going. That said, I think there's some scope for consolidation here, where we could fit some of this functionality that exists in Hudi today on top of Delta or Iceberg, so that even for batch jobs people can make use of these services. In the case of Delta, they would be open and freely available; you wouldn't have to tie yourself to the Databricks runtime if you don't want to. In the case of Iceberg, you don't have these capabilities today, whereas in Hudi all these services are already built, well oiled, and running in production in many places.
So an Iceberg user may be able to get a lot of this usability and operability around their format for free. That's another angle we are thinking about. It's a tricky conversation, but I think that's a better model. Again, big picture: the lakes have to get better at this stuff if we are to build open data as the vision for our future. And I'm personally open to taking any contributions; I'm pretty sure the Hudi community will have similar opinions. I can't speak for the community at large, but personally I'm open to contributions that take us along these paths.

As somebody who has been involved in the Hudi project since its inception, what are some of the most interesting or innovative or unexpected ways that you've seen it used?

Recently there was a very interesting proposal slash use case. I actually spent some time researching it, and I think it makes sense; it completely came out of the field. We've been talking a lot here about stream processing and Hudi and mixing batch and stream processing, and this is a testament to that. Even streaming systems have, over time, done the reverse: they've built tiered storage into their models. Say you have a stream processing program and it needs to backfill now, or bootstrap.
Now you need to run it on the historical events as well, and it needs to consume those events in that same order. So, pretty much simultaneously, people in the Kafka and Pulsar communities asked: can we back the tiered storage in Kafka or Pulsar, or event streaming systems generally, with something like Hudi? Because you get transactionality. You get a Kafka-style offset that you can tail data by. You get columnar data, because streaming queries are often only selecting a subset of columns from the data. It may be fine to do row-oriented processing when you're going record by record; columnar would probably add latency there.
But when you're consuming historical events, by pushing projections and filters down to a system like Hudi you can gain a ton of efficiency in terms of the amount of data you bring down from cloud storage and tiered storage for your processing. That made me very happy, because it completed the circle for me. I've been saying stream processing belongs with data lakes, but for someone from the stream processing, event streaming world to recognize how we are filling the gap from that lens was very interesting and gratifying.
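As a rough illustration of the consumption pattern described here, this is what an incremental, column-pruned read from a Hudi table looks like through the Spark datasource. The commit timestamp, table path, and column names are placeholders; the point is that the begin instant acts like an offset and the projection keeps the scan columnar.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-incremental-read-sketch")
  .getOrCreate()

// Pull only the records committed after a given instant (an offset-like
// commit timestamp), and only the columns the consumer actually needs.
val changes = spark.read
  .format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20210601000000") // placeholder commit time
  .load("s3://warehouse/events_hudi")                                   // placeholder table path
  .select("event_id", "event_type", "event_ts")                         // projection: skip unused columns

changes.show(10, truncate = false)
```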
[00:56:26] Unknown:
In your experience of helping to create this project, iterating on it, and continuing to contribute to the overall ecosystem, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:56:39] Unknown:
The community process. We have been in open source for almost twelve years now, but this is my first time growing a community and sharing that responsibility of making sure everybody in the community is able to do what they want to do. There have been lots of lessons for me just in terms of how we interact with people and what people are looking for. When you're working at these large tech companies, you're architecting big systems; for me it was an interesting experience to spend these four years with the community, working very one on one, helping people who are often learning the fundamentals and trying to build all of this themselves.
I would say I learned a whole bunch of interesting lessons around community building and that aspect of open source, which was very fun to see firsthand. And I believe in open source more than I ever did before, after this. For people who are considering
[00:57:44] Unknown:
Hudi for building out their data lake, or who are trying to figure out what structures to use to organize the data they may already have, what are the cases where Hudi is the wrong choice and they might be better served with a cloud data warehouse, or Iceberg, or Delta Lake, or one of the many other options that are out there?
[00:58:03] Unknown:
Yeah. So starting from the analytical workload angle you mentioned: I've had cases where people want to use Hudi as an online, OLTP-style system, because we have the HBase index and we have indexed file formats there as well. That, I think, would be a wrong use case; OLTP scenarios are out right off the bat. When would you use cloud warehouses as opposed to something like Hudi? At the end of the day, Hudi provides a data plane that is very functional, but the value you get for your queries also depends largely on the query engine itself.
So I think that will dictate the choice more than the cloud warehousing itself. Right now, on the platform components, Hudi is missing a few things that cloud warehouses already have, like an encryption service and a couple of things like that. For now, if you really need those, you will probably find them more readily available in a cloud warehouse than in the Hudi project. In terms of the low-level format itself, and this is what actually happened as well, some very large companies have their own built-in systems where they just want to commit files. That's all they want to do; they don't want the indexing, the pre-combining, none of that. Hudi does make life easier for the regular data engineer far more than the other formats, in my opinion.
But if you're a large company, let's say LinkedIn, you already have Gobblin, which is an interesting system. You don't want to change Gobblin at all, you don't want to move from Gobblin to the DeltaStreamer tool; you just want to commit files. Iceberg has a nicer lower-level API for that, which we have not invested in before. We are thinking about opening these things up, but I feel the future is more about how we help all the other companies that cannot afford to build their own Gobblin do these things out of the gate, so we are focusing a lot more energy on that. You may see a lower-level format API from us, but we are prioritizing the former. So if you really, really want something that just commits files, then yes, Iceberg probably has a more flexible lower-level model.
And that's about it. In every other case, I feel you will get a lot more bang for your buck by picking Hudi. And we could also just commit files, like Hive does. It's not like the Parquet versus ORC battle; it's actually not about the data, it's about metadata. All of these systems store data in Parquet files. Hudi stores the metadata in an internal metadata table, Iceberg stores it in an Avro file, Delta stores it in a JSON file; you can go between these things. If we do that, maybe you can get the benefits of each format. But again, it's a means to an end: you continue to use the upper-layer functionality of Hudi to ingest data readily, and then interoperate with these formats. That's probably a vision we should build towards, and it does need some cross-format collaboration across these projects as well.
And I really don't want to approach this with the Parquet versus ORC kind of mindset that went on at the height of the Hadoop vendor wars; I think that was very detrimental to the entire space. I know it's an aside from your original question, but I just want to register that here. We are all in open source; we should collaborate and see what benefits the community at large. Absolutely. Definitely something worth calling out. And then in the Iceberg ecosystem,
[01:01:48] Unknown:
they added the Nessie project for getting some snapshotting capabilities and more Git-style branching and merging, and then there's LakeFS for doing that on arbitrary files in S3. I'm wondering what your thoughts are on the relative utility of things like that versus the MVCC support in Hudi, or the potential benefits of bringing some of those branching, merging, and snapshotting utilities into Hudi
[01:02:18] Unknown:
itself? So Hudi does have some snapshotting and snapshot export utilities, but they are designed more around database-backup-style concerns, things like disaster recovery and restore. We have savepoints and things like that, and we have utilities for those too. I believe LakeFS already supports Hudi, and in Nessie there's a ticket open to support Hudi as well. Those are things I think we want to do over time; we'd rather integrate than try to duplicate their functionality. I do think these are separate use cases, though. The branching kind of thing, if you go back to it, is mostly for immutable datasets; it's more of a distributed DVC kind of use case, where you want to keep versions of data to play with. We aren't really focused on that scenario. We focus a lot on lots of updates, deletes, and streaming data coming in, and once you have a lot of deletes and everything, I don't think anyone has solved the versioning problem in that context. So in short, we would like to integrate with Nessie as well; we have a ticket open. But more and more, I would like to understand, for example, how you build these things on top of data warehouses.
Do you really care about file versions anymore on the lake? Do you really need to deal with file versions, or can we just work with "okay, I took a backup or a snapshot of this table"? I worked on the Oracle codebase, and that's how Oracle was; that's how every database we built felt, and this is my fourth or fifth database project. So philosophically, I'm a little conflicted about what the right interface is for these kinds of use cases. But we will integrate and interoperate with these systems with open arms as we get more resources.
[01:04:04] Unknown:
Alright. And you've touched on this a little bit already, but in terms of the near to medium term future of the Hudi project, what are some of the things that you're particularly excited about, or that are worth calling out for people to pay attention to?
[01:04:18] Unknown:
In the near term: we've had a streaming write API for a while, true to how we started, and we're now introducing a SQL layer, starting with Spark SQL. We are also packaging a lot of functionality, like clustering and compaction, into the SQL language as well, which we hope makes life a lot easier for people. And it interoperates with tools like DeltaStreamer, so you can be using that to ingest data very quickly, and then if you want to do a delete you can drop into a SQL prompt and do it there, which makes life much easier; a sketch of this follows below. So that is something we look forward to. And we are doing a lot of work around record-level indexes; we are adding a lot more indexing capabilities inside Hudi.
There's a PR already out for record-level indexes, which lets us take random updates and apply them quickly to your tables without an external key-value store. And I think someone just opened a PR for Z-order indexing, so more clustering strategies that work out of the box on your datasets. All of these capabilities are coming in the short term, I would say. In the medium term, we still see some gaps when you compare what Hudi provides to something like a cloud warehouse: the way the metadata has been designed in these batch-oriented formats doesn't work well when you're trying to do a lot of frequent commits in Hudi. We already have a metaserver, which we call the timeline server, and we run it embedded within every writer to deal with the more voluminous metadata needs that we have. We are thinking about extending that and providing a recipe for running it standalone in a Kubernetes cluster, so that you have a more scalable metastore backed by a database like RocksDB, which is much more performant than reading files from a file system.
That's another project. We are also thinking about building out a mutable caching layer for Hudi. The other problem is that you can write a lot of small files to ingest data quickly, but then on the query side you need to open and close a lot of files, and you still pay some merge costs when merging data. So can you amortize all of that? Can you build the buffer pool for the data lake, if you will? That's another angle we are looking into. And we are investing a lot more in Flink, including, going back to your point about data deletions across the lake, passing through the operation flags, so that something like Flink dynamic tables can recognize those flags and apply all the deletes and everything.
We are evolving our incremental queries to be more and more like the binlog you get out of a database: full-blown CDC capabilities.
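As a sketch of the Spark SQL layer mentioned earlier in this answer: once a table is registered as a Hudi table, housekeeping like deletes and upserts collapses into plain SQL statements. The table and column names below are invented for illustration, the staged `incoming_orders` source is hypothetical, and the exact DDL properties may vary by Hudi version.

```scala
import org.apache.spark.sql.SparkSession

// The Hudi Spark SQL syntax assumes the Hudi session extension is enabled.
val spark = SparkSession.builder()
  .appName("hudi-spark-sql-sketch")
  .config("spark.sql.extensions", "org.apache.hudi.HoodieSparkSessionExtension")
  .getOrCreate()

// Create (or register) a Hudi-backed table. primaryKey and preCombineField
// map to the usual Hudi record key and ordering field.
spark.sql("""
  CREATE TABLE IF NOT EXISTS orders (
    order_id   STRING,
    amount     DOUBLE,
    updated_at TIMESTAMP
  ) USING hudi
  TBLPROPERTIES (primaryKey = 'order_id', preCombineField = 'updated_at')
""")

// A point delete (for example a GDPR request) becomes a one-liner
// instead of a custom Spark job.
spark.sql("DELETE FROM orders WHERE order_id = 'abc-123'")

// Upserts from a staged batch of changes, expressed as a MERGE.
spark.sql("""
  MERGE INTO orders AS t
  USING incoming_orders AS s
  ON t.order_id = s.order_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```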
[01:07:17] Unknown:
And these are all in line with our vision of: let's incrementalize all of the batch workloads. That's kind of how we approach it. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:07:38] Unknown:
Yeah. I think the common theme, even from our conversation today, is that we need more self-managing systems. We need more ways of platformizing data ops, the workloads, if you will. At least in the data lake ecosystem, we've designed more general-purpose things that let you read and write data, and we haven't really done a good job of platformizing classes of read and write workloads into a suite of tooling as much as we would like. So that's where we need to focus a lot more energy, and then make it easy to get up and running with this, whether you call it a data lake, or I call it a streaming data lake, or a lake, or whatever you want to call it.
[01:08:19] Unknown:
And I think for that, we have lots of things left to build. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on the Hudi project. It's definitely a very interesting set of technologies and an interesting architecture, so I'm very excited to see where it continues to go in the future. Thank you for all the time and effort that you've put into that, and I hope you enjoy the rest of your day. Alright, thanks for having me, and I greatly appreciate all the questions. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Vinoth Chandar: Introduction and Background
Motivation and Development of Apache Hudi
Current State of Data Lake Ecosystem
Data Warehouses vs Data Lakes
Challenges and Solutions in Data Lake Usability
Hudi Project Architecture and Implementation
Writing Data to Hudi: Options and Challenges
Migrating to Hudi and Interoperability
Interesting Use Cases and Lessons Learned
When Hudi Might Not Be the Right Choice
Future Directions for Hudi
Biggest Gaps in Data Management Tooling
Conclusion and Closing Remarks