Summary
Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because of their complete ownership of your data, they constrain what data you can store and how it can be used. Projects like Apache Iceberg offer a viable alternative in the form of data lakehouses that combine the scalability and flexibility of data lakes with the ease of use and performance of data warehouses. Ryan Blue helped create the Iceberg project, and in this episode he rejoins the show to discuss how it has evolved and what he is doing at his new business, Tabular, to make it even easier to implement and maintain.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to timextender.com/dataengineering where you can do two things: watch us build a data estate in 15 minutes and start for free today.
- Your host is Tobias Macey and today I'm interviewing Ryan Blue about the evolution and applications of the Iceberg table format and how he is making it more accessible at Tabular
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Iceberg is and its position in the data lake/lakehouse ecosystem?
- Since it is fundamentally a specification, how do you manage compatibility and consistency across implementations?
- What are the notable changes in the Iceberg project and its role in the ecosystem since our last conversation in October of 2018?
- Around the time that Iceberg was first created at Netflix, a number of alternative table formats were also being developed. What are the characteristics of Iceberg that lead teams to adopt it for their lakehouse projects?
- Given the constant evolution of the various table formats it can be difficult to determine an up-to-date comparison of their features, particularly earlier in their development. What are the aspects of this problem space that make it so challenging to establish unbiased and comprehensive comparisons?
- For someone who wants to manage their data in Iceberg tables, what does the implementation look like?
- How does that change based on the type of query/processing engine being used?
- Once a table has been created, what are the capabilities of Iceberg that help to support ongoing use and maintenance?
- What are the most interesting, innovative, or unexpected ways that you have seen Iceberg used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iceberg/Tabular?
- When is Iceberg/Tabular the wrong choice?
- What do you have planned for the future of Iceberg/Tabular?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Iceberg
- Hadoop
- Data Lakehouse
- ACID == Atomic, Consistent, Isolated, Durable
- Apache Hive
- Apache Impala
- Bodo
- StarRocks
- Dremio
- DDL == Data Definition Language
- Trino
- PrestoDB
- Apache Hudi
- dbt
- Apache Flink
- TileDB
- CDC == Change Data Capture
- Substrait
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Acryl: ![Acryl](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/2E3zCRd4.png) The modern data stack needs a reimagined metadata management platform. Acryl Data’s vision is to bring clarity to your data through its next generation multi-cloud metadata management platform. Founded by the leaders that created projects like LinkedIn DataHub and Airbnb Dataportal, Acryl Data enables delightful search and discovery, data observability, and federated governance across data ecosystems. Signup for the SaaS product today at [dataengineeringpodcast.com/acryl](https://www.dataengineeringpodcast.com/acryl)
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Are you tired of dealing with the headache that is the modern data stack? It's supposed to make building smarter, faster, and more flexible data infrastructure a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it, it's all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to work properly. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70 to 80% on costs. If you're fed up with the modern data stack, give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things.
Watch them build a data estate in 15 minutes and start for free today. Your host is Tobias Macey, and today, I'm interviewing Ryan Blue about the evolution and applications of the Iceberg table format and how he and his team are making it more accessible at Tabular. So, Ryan, can you start by introducing yourself for anybody who hasn't listened to our previous conversation?
[00:01:22] Unknown:
Oh, yeah. Absolutely. I'm Ryan Blue. I'm one of the creators of the Apache Iceberg project along with my cofounder, Dan Weeks, while we were both at Netflix.
[00:01:34] Unknown:
And do you remember how you first got started working in data?
[00:01:38] Unknown:
I was actually trying to think about this. I think I applied to a whole bunch of overseas jobs, and the one I got happened to be running a Hadoop platform.
[00:01:53] Unknown:
Yeah. That's definitely an interesting thing to get launched into the middle of, especially if you don't have prior experience.
[00:01:59] Unknown:
Yeah. It was really interesting, like, pretty fun to do that. I was always interested in distributed systems. So, you know, I was familiar with, like, the MapReduce background and stuff like that, but running Hadoop in the, like, 0.20 days was quite a chore.
[00:02:19] Unknown:
Absolutely. And so the main topic of conversation here is the Iceberg table format. And we did a conversation earlier on in its life cycle when you were still at Netflix back in, I think, 2018. And so for anybody who isn't already familiar with it or who hasn't listened to that interview, can you just give an overview of what Iceberg is and its current position in the data lake slash lakehouse ecosystem?
[00:02:47] Unknown:
Yeah. Absolutely. So the definition I go with the most these days is quite a bit shorter than that other conversation we had where I was just spewing technical details without stopping. These days, I go with: it's a format for managing a collection of files as a table that has SQL behavior and is an open standard. And I think the two really critical things are the SQL behavior and the open standard.
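To make "a collection of files as a table" concrete, here is a minimal sketch of creating and using an Iceberg table from PySpark. The catalog name, warehouse path, and schema are invented for illustration, and it assumes the Iceberg Spark runtime jar and SQL extensions are available to the session.

```python
from pyspark.sql import SparkSession

# Minimal sketch: a Spark session wired up to a hypothetical Iceberg catalog named "demo".
spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Declare the table once: partitioning and sort order live in table metadata,
# and every engine that writes to the table is expected to honor them.
spark.sql("""
    CREATE TABLE demo.db.events (
        event_id   INT,
        user_id    BIGINT,
        event_type STRING,
        ts         TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(ts))
""")
spark.sql("ALTER TABLE demo.db.events WRITE ORDERED BY user_id")

# After that, it behaves like any SQL table.
spark.sql("INSERT INTO demo.db.events VALUES (1, 42, 'play', current_timestamp())")
spark.sql("SELECT event_type, count(*) FROM demo.db.events GROUP BY event_type").show()
```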
[00:03:20] Unknown:
And in terms of the SQL behavior element of it, that is definitely a huge piece of its appeal, particularly given the recent attention to this paradigm of the data lakehouse where the cloud data warehouse has come to be kind of all consuming in terms of the conversations around how data gets done in the modern era. And as people started to fight against some of the challenges of these monolithic and centralized warehouse platforms, They wanted to be able to get some of those same ease of use patterns and semantics on top of a broader variety of datasets. And I'm wondering what you see as some of the motivating factors for the pendulum swinging back in the direction of we really want data lakes, but we also want this other thing. And if we could just push them together, that would be really great. And some of the forces that you were experiencing in your work at Netflix that helped to bring iceberg about and kind of prime it for this time in the ecosystem of data.
[00:04:22] Unknown:
Oh, yeah. No problem. You know, the first thing is really just the semantics and, you know, the capabilities in terms of correctness. Right? You want multiple writers to be able to touch the table at the same time without corrupting people that are reading the table or, you know, corrupting one another. It's really like you have to safely perform operations. And that's one thing that we just sort of accidentally gave up during the Hadoop days, because I think we thought we were building a, you know, programmatic way of working with data, and we didn't realize until way too late that we were actually building a distributed database. But a distributed database with some really nice properties. Right?
A distributed database that wasn't just tables, you know, to your comment about, like, this sort of lakehouse kind of architecture where we're thinking, how do we have both structured, really fast tables as well as unstructured and semi-structured data, you know, sort of all in the same space. So, you know, we wanted to go in that direction. You know, so getting back to your question. Right? Number one was just safety, ACID semantics, you know, being able to have transactions that could do anything on a table safely. That was huge.
The other was, I think, sort of this pattern of use that goes back to the SQL behavior. You know, after we built all these databases by accident, we didn't have all of the SQL guarantees. So, like, you had to know, oh, I shouldn't rename that column because it's this particular format, and it would corrupt things. And that's just not a good place to be for, you know, people coming from a data warehouse world and trying to get up and running in a Netflix data ecosystem. So, you know, we wanted those SQL semantics so you had predictable behavior.
You could update tables in place and basically work with them without having to turn your brain on and think, can I do this? And then, sort of building on that, we also wanted to figure out how to get every different engine to do what you want. Like, tell the table what you want and have those engines figure out how to do it.
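As a small illustration of the SQL behavior he is describing, here is what schema evolution and in-place updates might look like against the hypothetical table from the earlier sketch, assuming a Spark session with the Iceberg SQL extensions enabled:

```python
# Schema evolution is a metadata change: nothing is rewritten, and renaming a column
# no longer silently drops it and adds a new one. Names here are hypothetical.
spark.sql("ALTER TABLE demo.db.events RENAME COLUMN event_type TO action")
spark.sql("ALTER TABLE demo.db.events ADD COLUMN country STRING")
spark.sql("ALTER TABLE demo.db.events ALTER COLUMN event_id TYPE BIGINT")  # safe widening

# Row-level changes behave like an ordinary SQL table; the engine decides how to apply them.
spark.sql("UPDATE demo.db.events SET country = 'US' WHERE user_id = 42")
spark.sql("DELETE FROM demo.db.events WHERE action = 'bounce'")
```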
[00:06:45] Unknown:
And the nature of Iceberg being a table format and fundamentally a specification is interesting as well. These are all the things that we're trying to guarantee with it, but all we're really doing is saying, this is how you should do all of these things. So how do you actually bring all of the different technical components together to cooperate properly around that specification, to say, this is what we're trying to do, here's information about what you're supposed to do, and then make sure that everybody does it the right way? So, you know, how do you make sure that your Spark writer is treating the table the same way that your Trino cluster is querying it, the same way that your Flink, you know, streaming engine is pulling data in and out? It's a really hard challenge,
[00:07:30] Unknown:
and I think it's, you know, fairly new as well because data warehouses have never had to worry about integrating 10 different engines on top of their storage layer. It's completely new. And I think that's, you know, the appeal to the data warehouse world: we're adding that flexibility to touch the underlying data. But like you're saying, it's pretty tough. Luckily, laziness is on our side. So we have a reference implementation, and 95% of the time, that reference implementation is what gets used. So you can use it to test your implementation. Right? You spin up the reference implementation, write some data, make sure you can read it, and vice versa. A lot of engines actually have a JVM component precisely for this reason, and that actually dates back to, like, the Hive days when Hive and other Hive-like tables were, you know, pretty tied to the JVM and what the JVM did in certain edge cases. So, like, Impala and StarRocks and, I think, even Bodo, some of the newer Python stuff, they just spin up a JVM process, do the planning or, you know, table interaction there, and then they know that they're using the reference implementation, and it's all good. There are some other things, like column projection, that you have to do correctly, but that's why, when I learn about a new implementation, we always reach out and we're like, hey, can we help you? We'd love to see how you're doing column projection.
[00:09:01] Unknown:
And cycling back on the kind of history of the project, as I mentioned, we talked around October of 2018, and it's been almost 5 years since then. And the lakehouse paradigm has gained a lot of traction and momentum. There are a number of, kind of different approaches to it. And I'm wondering if you can just reflect back on some of the notable changes in the iceberg project and the kind of surrounding ecosystem that has been built up around it. Yeah.
[00:09:34] Unknown:
You know, it wasn't really lakehouse at the time. That term came much later. We always thought of it as just building real database capabilities into what we were using in the Hadoop ecosystem. And, you know, that was really how we started, and then we were talking to other people about how, hey, don't you have this problem where, you know, users come to you all the time having accidentally broken something? And, you know, oh, yeah, we do have that problem. How do you solve it? And, you know, we had this long period of just talking to other big tech companies and accumulating partners. So that was when, you know, like, Apple and Adobe and Airbnb and LinkedIn started using the project and starting their deployments and contributing back.
And so we had a lot of just development in the space to meet everyone's needs and also extend the project for a long period of time. But that era, I would say, was very much defined by big tech companies with, you know, teams that are already contributing to Spark or contributing to Trino or something. You know? And then the next era is really, like, now we're seeing vendor adoption. Right? We're seeing, like, Snowflake, for example, announcing support. You know, Starburst having great support for Iceberg in addition to Delta and Hive formats, other vendors like Dremio, even BigQuery, adding support. Dremio has been beating the drum for a while. I guess I should give them a lot of credit for that. You know? But the vendor adoption, I think, really brings us into this new era of not just data practitioners and people running data platforms interested in the project, but bringing it to a much broader array of people, like data engineers that don't wanna set up a data platform. They just wanna get their jobs done. Yeah. And that also brings up an interesting point about
[00:11:41] Unknown:
the factors that lead somebody to select Iceberg as their table format or adopt Iceberg in their platform, where, you know, particularly in the case of, like, Dremio, for instance, they're not actually selecting Iceberg. They're just selecting something that will help them get their job done, and it's just incidental that Iceberg is being used under the covers. Or similarly with, for instance, the Starburst Galaxy platform. I think that by default, if you create a table, then they'll just say, go ahead and create it in Iceberg format, you know, given certain configuration parameters. And so you don't have to care that that's what it's using. You just care that it works the way that you want it to work.
[00:12:16] Unknown:
That's exactly what I want. I have always gotten up in front of groups of people, when I used to do, like, Netflix new hire training, and I've always said, I don't want you to think about Iceberg. Right? That's why SQL behavior is so important. It's a table. You know how tables operate. Right? Like, don't think about it. We're gonna do the best we can automatically, and we're gonna tell you when you need to think about it. Like, when you create a table, you create the partitioning structure and the file layout and probably, like, the sort order for clustering. But you shouldn't have to worry about it after that, certainly not in every job or every time you query the table. And to your point of we all know how tables work,
[00:13:02] Unknown:
that also brings about the question of how do you ensure that Iceberg works the way that you want the tables to work? Because, you know, we all know how tables work is also a bit of a misnomer, because it depends on what engine you're using. It depends what database you're using. It depends what data format you're using. So is it a table in a graph database? Is it a table in a relational database? Is it a table in a data warehouse? Because those all operate a little differently, and you might have to tune different parameters. How do you think about some of those challenges of the kind of accessibility and ease of use around Iceberg, given the it's-just-a-table approach?
[00:13:39] Unknown:
Well, you make a good point there, and we're getting more and more into that with different engines adopting the format and making it first class. We try really hard to document what should be done in most cases. And, you know, not to lower the bar too much, but, you know, when we started, you didn't have schema evolution. You couldn't rename a column. And so, like, when I say something like everyone knows how tables work, like, you have some expectation when you run a DDL statement for what's gonna happen. You know, you say rename this column, and it doesn't drop a column and add a new one with the new name. Right? Like, that was the old behavior.
And so, you know, to a certain extent, we're just bringing the baseline of SQL behavior back. Then in terms of performance and, you know, job planning and, like, cost-based optimization, that's when we get into the nuts and bolts and, like, kind of hard things. To do that, we really try to do as much below the Iceberg API as possible. So it's very easy. You call into Iceberg. You say, hey, I'm trying to read data that looks like this, and we give you back splits that have, or, I'm using an old Hadoop term, we give you back tasks that have very strict definitions. Right? Go read these files, remove these deleted rows, and you're good to go. But there are still some things, like bucketed joins, or what we call storage partition joins because we've actually generalized them to any partitioning. You know, it is a bit more difficult there.
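For a rough sense of what "call into Iceberg and get back tasks" can look like below any particular engine, here is a sketch using the PyIceberg library. The catalog configuration, table name, and filter are assumptions, and the exact API surface varies by release.

```python
from pyiceberg.catalog import load_catalog

# Hypothetical REST catalog; any configured Iceberg catalog would work the same way.
catalog = load_catalog("demo", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("db.events")

# Describe the data you want; Iceberg plans the scan and hands back concrete tasks:
# data files to read plus the delete files that must be applied to them.
scan = table.scan(
    row_filter="ts >= '2023-01-01T00:00:00'",
    selected_fields=("event_id", "ts"),
)
for task in scan.plan_files():
    print(task.file.file_path, [d.file_path for d in task.delete_files])
```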
[00:15:22] Unknown:
Another interesting element of the evolution of Iceberg is that when we first spoke and as I was beginning to keep an eye on Iceberg as a project, the only way to really be able to actually write anything into an Iceberg table, quote, unquote, was if you were using Spark, because that was the engine that actually had some integration built in. Obviously, that has expanded beyond those bounds now, but I'm curious what that journey was like as somebody who was so deep in the weeds of Iceberg and trying to kind of bring that capability to a broader audience beyond just people who happen to already be running Spark or who cared enough about Iceberg to set up Spark for that purpose?
[00:16:02] Unknown:
You know, we were lucky that Spark was very widespread at the time. Right? It was the obvious choice for the engine to go with for writes. Basically, everyone was using either Hive or Spark for ETL at that point, and anyone using Hive was planning to move to Spark within the next 5 years. So that worked really well. And then, you know, we also had Presto at the time, which is now Trino. That was used as an ad hoc query engine, so we primarily needed read support over there. But it actually simplified the process or idea of adding write support to it, at least for Netflix. So luckily, the engines to attack next were definitely well defined.
[00:16:48] Unknown:
And so around the time that Iceberg was first being developed at Netflix to be able to bring some measure of sanity to the sprawling data lake that you were dealing with, there were a number of other projects that were also being implemented at places like Uber with the Hudi project and Databricks with their Delta Lake implementation. And so there are a number of different alternative table formats available for these lakehouse-style platforms. And as people are trying to evaluate which one they want to go with, I'm curious what you see as the characteristics of Iceberg that lead teams to choose it for their lakehouse projects and some of the ways that they think about how to establish a useful comparison across these different options to be able to come to an educated
[00:17:35] Unknown:
decision? That's a good question. So we see people coming at it from different angles. Certainly, like, one is performance, where performance is really important, and you wanna make sure that you're on a platform that's going to have that performance over time. I think the challenge there is that it's so hard to gauge. Right? It's just very hard to gauge performance unless you actually go and test your workloads and, you know, really try to get the best out of those formats. So that's probably the one that people go to most often, but the one that's the trickiest to actually, you know, get more signal than noise on. You know, like, we have people that evaluate Iceberg, and they're like, woah, it's way faster. And we're like, well, you know, you might not be taking advantage of the exact same features in this other thing. But, you know, we know that the flip side is true as well. Right? Sometimes you happen upon a feature that makes another one way faster. So it's incredibly hard to gauge. I like to think in terms of human productivity rather than just machine time, because I think that's what you should really be optimizing for.
Right? The performance is really just about how long are we making a human wait. Right? There's a cost angle to it certainly as well, but I really like to focus on the human factor. And that's where I think, you know, for a long time, it's been just so important to have the schema and layout evolution, you know, the approach that Iceberg takes of trying to have you declare what you want and tell engines, you know, here's how you're gonna do it. I think that's a really critical factor. But, yeah, it is definitely hard to gauge. I think there are a couple others that we see occasionally. So one is openness.
Right? The Iceberg community being open and, you know, broad, trying to accept anyone in there. And whether you're a vendor or a large tech company or, you know, just someone trying to use Iceberg, like, we really want that community there. And especially for vendor adoption, that has been really critical, because you don't wanna use a project that is completely owned by a vendor. So that one, I think, is probably not used as a deciding factor as much as I would recommend, but, you know, we certainly see that with the vendor adoption angle. They're looking at that and saying, well, I wanna be on equal footing with the format that we use for storing data.
[00:20:23] Unknown:
And another interesting challenge of this space is that each of these different implementations are constantly evolving and adding new capabilities and features. And that that also compounds the challenge of the performance question, but it also makes it difficult to get an up to date and accurate comparison of what are the features that are even available. And obviously, every feature you can say, okay, I support streaming inserts, but what are some of the specific edge cases and semantics around that? And how do you find out that piece of information? I'm curious what you see as some of the inherent aspects of this particular problem space that lead to it being so challenging to be able to actually create and maintain these up to date and accurate comparisons?
[00:21:05] Unknown:
Oh, yeah. Well, I mean, there's a lot of confusion here. You're right. You know, the fact that there are multiple versions of Delta with some open source features and some closed source features, and it's not real clear which ones are which. And, you know, periodically, Databricks will open source some of those features and add them to the Delta project. So, you know, that adds to, I think, some of the confusion. And then there's also this, like, well, you know, in terms of data patterns, you know, you mentioned streaming. That's a hard one because, you know, Databricks and Delta, they're gonna say you wanna use Spark Streaming in this micro-batch approach.
Then you've got Hudi saying, like, you know, row-level upsert is the model that you should go with. Iceberg, we mostly say, you know, you should, of course, I think of everything in, like, a nuanced-answer kind of thing, where it's like, well, by default, we're gonna use copy-on-write because that's gonna give you the best read performance with, you know, no overhead. Merge-on-read is gonna give you better write performance, but then you have to have something going and compacting that data regularly. What we also recommend is a third approach, which is writing a change log table and periodically ETL-ing that state to compact down and, you know, snapshot the state. And that last one actually gets you the semantics of, you know, being able to coordinate with transactions in the upstream CDC system. You know? So in terms of that, you've got these three, like, vastly different views of what you should use and why.
And so it's not even enough to say, like, we support streaming or we support this feature. Right? It's, well, are you even using the approach to the problem that makes the most sense?
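As a small illustration of how much the write pattern is a per-table choice rather than a format-level fact, Iceberg exposes copy-on-write versus merge-on-read as table properties that can be set per operation. A sketch using the hypothetical table from earlier (the change log approach Ryan mentions is a data modeling pattern rather than a property, so it isn't shown here):

```python
# Copy-on-write rewrites the affected data files at write time (no read-side overhead);
# merge-on-read writes delete files instead and defers the cleanup to a later compaction.
spark.sql("""
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'copy-on-write',
        'write.merge.mode'  = 'merge-on-read'
    )
""")
```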
[00:22:59] Unknown:
And to make it even muddier, it's also, are you using the engine that actually supports this particular implementation of that feature?
[00:23:08] Unknown:
Yes. Exactly.
[00:23:10] Unknown:
And so to that point, for people who are interested in the capabilities that Iceberg offers, they want to be able to manage their data in Iceberg tables, what does that path look like when they say, okay, we like Iceberg, we want to use it, but now I need to figure out which engine I want to use because I need to be able to do x with the data? So, in my particular case, I'm using the Trino engine and its Iceberg support for being able to manage data, you know, via dbt and just purely SQL native. Many people are already using Spark, and so they wanna be able to use the Spark Iceberg support for being able to write data in, you know, from a raw source into a sink. I'm just curious what the kind of decision matrix looks like for people who are saying, okay, I'm all in on Iceberg. Now what?
[00:23:58] Unknown:
Well, yeah. So I would say, like, you know, you're in the right position where you're choosing your engines. Right? And then you sort of need to say, okay, what else do I need in addition, or how do I use that engine to get the pattern or, you know, the best use? So Trino is a great engine for, you know, this sort of use case. Love dbt and Trino together. Right? Trino, though, by default, is gonna use merge-on-read to do all of its operations because it wants to be fast and return quickly. So that means you're gonna have a situation where you've basically put off work until later, and you have to pay that, you know, technical debt, or you have to pay that table debt at some point, which means you just need to know how you're optimizing your table.
Right? When to run optimize, when to run those compactions to get rid of the delete files that that's gonna generate. And as long as you do that, you'll be alright. You know, we want to work with, and we actually do work with, like, Brian Olson over on the Trino side to, like, try and talk about these patterns in the open source a lot. Like, if you're using this engine, this is what you would do. What is, for example, Flink really good at? Well, pumping data into an Iceberg table, giving you exactly-once semantics, and making it available for the next engine. Definitely use Flink for that. Right? So I think once you see sort of the patterns that these engines use and what other things you need to do,
[00:25:30] Unknown:
that's a really great step. And the other question beyond just, okay, I've got all my data in Iceberg tables, and we've touched on this a number of times already, is the question of ongoing maintenance and support of, okay, I wrote that table in, but now I've got a new column, or now that column has a different data type, or now the column name has changed, or, you know, I've got a whole bunch of deletes that I need to actually process and compact. And just what does the ongoing maintenance and upkeep of your Iceberg tables look like? And what are some of the challenges that you start to run into as you're scaling, both in terms of number of tables and size of tables, both in terms of volume of data, but also in terms of number of columns?
[00:26:12] Unknown:
Well, that's a big question. So maintenance largely looks like stored procedures, or if you're in Trino, like an optimize call or something like that. I would recommend trying to know what your engine is doing at a high level, whether that's a copy-on-write or a merge-on-read, and then know what downstream operations. You know, basically, when you have deferred work until later, then you're gonna need to do those tasks later. So in Spark, you can actually choose copy-on-write or merge-on-read for an operation. If you choose merge-on-read, know that you're gonna need to compact that later, and there are stored procedures to help you do that. You know, rewrite data files, look for anything with more than 3 delete files.
Boom. You're up and running. I should also plug my company Tabular here because, you know, our long-term goal from the start of the Iceberg project has been, like I said earlier, to reduce the amount of time that data engineers and people think about these problems and have to deal with it. So what we really wanna do, and why we built the Iceberg project to be able to, you know, the reason why we built ACID semantics was so that we didn't just have to write once and do it correctly. We could fix it after the fact. Right? We could defer work until later and do it later, or we could accidentally write something the wrong way and go and correct that. So that's the idea that connects Iceberg, as the table format that enables, you know, the building blocks for this, and Tabular, where we're actually building a platform that does those things automatically.
So if you, you know, commit a thousand small files, we're gonna say, hey, looks like you just wrote a ton of small files into this table. We're gonna go compact that automatically. Or we're gonna monitor for delete files that are coming off of Trino, and we're gonna say, okay, well, after this amount of time, those files aren't really in use anymore, and that's a good time to go compact and remove those deletes. That's the approach that I think data architecture is moving to, and we wanna sort of lead that change so that data engineers don't think about this anymore. Right? Where, again, this is SQL behavior.
How often in Postgres are you thinking, are things fragmented? Should I maybe run an optimize? Like, hopefully not daily or hourly.
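To make the maintenance story concrete, here is a sketch of the Spark stored procedures mentioned above; the catalog and table names are hypothetical, and the exact option names (such as the delete-file threshold) depend on the Iceberg release you're running.

```python
# Compact small files and merge in accumulated delete files. The delete-file-threshold
# option (rewrite any data file with more than N delete files) exists in recent
# Iceberg releases; treat the exact knob name and value as an assumption.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table   => 'db.events',
        options => map('delete-file-threshold', '3')
    )
""")

# Expire old snapshots so table history and storage don't grow without bound.
spark.sql("""
    CALL demo.system.expire_snapshots(
        table      => 'db.events',
        older_than => TIMESTAMP '2023-01-01 00:00:00'
    )
""")
```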
[00:28:45] Unknown:
Well, there there is the question of when do you run vacuum on your postgres tables?
[00:28:50] Unknown:
Right. I mean, there are some of those questions, and that's a good thing to have database administrators and, like, power users thinking of. Everyday data engineers don't need to be burdened with how large are the files on average in this table?
[00:29:11] Unknown:
That's a detail I want people to just stop thinking about and trust. Yeah. Definitely interested in digging a bit more too into some of the kind of business model and integration path for what you're building at Tabular. But before we get to that, another interesting element of Iceberg is the question of being able to do time travel for the data, where you can say, hey, I just committed this new change, but I actually wanna see what it looked like right before I made that set of inserts, and being able to have that kind of MVCC approach. I'm wondering if you could talk to some of the interesting patterns that that fundamental capability of Iceberg provides for people who are trying to do more sophisticated workflows with their data?
[00:29:52] Unknown:
Oh, yeah. Absolutely. That was a really cool one. So that fell out of this, like, you know, coordinating readers and writers. So your writers all, you know, optimistically build a new version and try to commit it. And what you end up with is this, you know, table form where you have each separate commit as a separate state that you can view and address and look at. And, you know, that's why you have time travel. In Iceberg in particular, because we use a Git-like model, where you basically have a whole bunch of data and metadata trees that are overlapping and you write the changes in the tree to commit, that Git-like model actually enables you to do some really cool things. So first of all, you can stage a commit without actually updating the current reference. So we use that at Netflix to do this pattern of sort of integrated audits, or write-audit-publish, we called it, where we would write and stage a commit, then an audit process would go time travel to that very commit and make sure that, you know, the nulls look right and, you know, the columns had these ranges.
And then it'd say, yep, this is good to go, and it would basically cherry-pick or fast-forward the main branch to that commit. And that was really handy, because people talk all the time about how important lineage is and things like that, and definitely it is. But lineage, it seems like 50% of the use cases for it is, like, how do I know, when bad data leaks, where it went? And this pattern sort of flips that and says, well, let's make sure the data is good before we release it. Right? So if you can produce the data without publishing the data, it's really powerful.
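A rough sketch of that write-audit-publish flow as it can be expressed through Iceberg's Spark integration, using the staged-commit (WAP) mechanism plus the cherrypick_snapshot procedure. The table, the wap.id value, and the audit check are all hypothetical, and the details vary across Iceberg releases.

```python
# Enable write-audit-publish on the table, then stage a commit without publishing it:
# with spark.wap.id set, the snapshot is written but the table's current state doesn't move.
spark.sql("ALTER TABLE demo.db.events SET TBLPROPERTIES ('write.wap.enabled' = 'true')")
spark.conf.set("spark.wap.id", "audit-2023-06-01")
spark.sql("INSERT INTO demo.db.events SELECT * FROM demo.db.events_staging")

# The audit process looks up the staged snapshot by its wap.id and inspects the data.
staged_id = spark.sql("""
    SELECT snapshot_id FROM demo.db.events.snapshots
    WHERE summary['wap.id'] = 'audit-2023-06-01'
""").first()[0]
null_count = (
    spark.read.option("snapshot-id", staged_id).table("demo.db.events")
    .filter("event_id IS NULL")
    .count()
)

# Publish only if the audit passes: cherry-pick the staged snapshot onto the main state.
if null_count == 0:
    spark.sql(f"CALL demo.system.cherrypick_snapshot('db.events', {staged_id})")
```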
So that led us to basically branching and tagging as well. Again, because we're using a Git-like model, we've recently added branches and tags to table metadata. So you can go say, oh, okay, I wanna create a branch, branch off of main, test code, you know, build up a whole bunch of commits and say, okay, yeah, that looks good. And then deploy that code to production, delete the branch. Or you could do a longer version of that write-audit-publish workflow where you create a branch for, you know, this hour's worth of data. You accumulate the data, make sure it's internally consistent, and then you fast-forward main to the state of that branch. You can also do really cool things like, you know, your quarterly reports are based on a certain version of your tables.
Well, you can go tag all of your table versions on the 1st of the month or 1st of the quarter and then keep that version around for 5 years or whatever your retention policy is. So there's a ton of goodness in this Git-like model and just, like, a lot of new use cases that we're excited about. And digging now more into what you're building at Tabular, you mentioned that you offer some of this kind of ongoing maintenance support and automating
[00:33:01] Unknown:
the, you know, very low level detail oriented aspects of working with iceberg tables. I'm curious what your process was for identifying what is a viable business model for this thing that I've already put so much of my time and energy into, and, how you think about the kind of the the sales pitch and the integration path for being able to actually work with companies who are using Iceberg, particularly if they don't even necessarily understand the fact that they're using icebergs, you know, for people who are using maybe Trino, for instance?
[00:33:32] Unknown:
Yeah. So what we want is for it just to be easy. And so, you know, originally, when we thought in terms of business model, we thought about all the things that we were gonna do to, you know, help you out. Like, in Iceberg, every commit is a new snapshot. Well, eventually, you have to clean out the old snapshots because you don't need infinite table history, and you wanna delete some data. So we thought, oh, we're gonna make a table expiration service. And it turns out that's not something that anyone wants, because if you come to someone and say, we have a, you know, version expiration service for your tables, they're like, what?
I didn't know that I needed that. You solved a problem I didn't know I had. And so while we do all of those background maintenance things for Iceberg tables, really what we're looking at is saving human time and making people more efficient, not providing Iceberg maintenance services. So our platform is really built around that idea that, essentially, you want the bottom half of a database. You know, it should figure out how to manage your data. It should compact automatically. It should have features for loading data from S3 automatically.
You shouldn't have to worry about this. It should be auto-optimizing. Right? That's how we think about all the maintenance and compaction and recommendations and things. It's simply auto-optimizing. Then this plugs in essentially as the data layer for any query engine. And so, you know, we're thinking in terms of what do you need if that's what you want. Well, first of all, you need security. Right? You don't want to secure every engine that people use. Right? You wanna secure the data, not the access. So a big part of our offering is offering security that actually works across all engines that people use.
And then just making it super easy to plug in those engines and, you know, use, like, an OAuth flow. You know, what we've been pushing for since the beginning is this OAuth flow where you go to, say, Starburst Galaxy and say, I wanna connect to Tabular, it takes you over to Tabular, and Tabular says, hey, Starburst wants to access these three warehouses. Are you cool with that? And if you're an administrator, obviously, you click yes, and it takes you back, and we're done here. You know, just like allowing access to your Google contacts or something. It's definitely a very
[00:36:16] Unknown:
interesting space to play. And it's it's always fun to see kind of what new business models are available as new technologies come about and kind of fragment the ways that people already got used to doing things and then realized, oh, hey. There's a better way to go do it. But now I have to invent new things to be able to make it work
[00:36:32] Unknown:
properly. It's really fascinating because we've never been able to separate the query engine from the data layer in databases. Right? I mean, we did it to a certain extent in the Hadoop space where we had all these different things that had no idea how to coordinate with each other. Right? And then, you know, projects like Iceberg came in and said, okay, we can handle that coordination. Well, now we're simply taking that to the next level and saying, like, what if we built a business around storing and managing your data and making it easy for you to plug in any engine that you wanna use? I think that's a really cool space, and it could be very transformative.
[00:37:10] Unknown:
Yeah. Definitely. It puts me in mind a little bit too of what the folks at TileDB have been building towards of, we want the data to be the thing that you care about and then let anything access it, but all of the intelligent things happen in the data layer. Yeah. That is very similar to what we're doing. And so in terms of your experience of building Iceberg and now founding Tabular and building the business around that and working with the community and ecosystem, what are some of the most interesting or innovative or unexpected ways that you have seen the Iceberg format used?
[00:37:43] Unknown:
There were a few surprising use cases back at Netflix. One was, and I think I've talked about this a lot, but we had this really huge Elasticsearch cluster that was only being used to look things up by one identifier. And we were able to replace that very expensive Elasticsearch cluster with Iceberg, because, you know, Iceberg indexing is, like, we index metadata files. We index data files. We use, you know, multiple filters to try and just narrow down as best we can on the data that you need. So I was really surprised that we had, like, you know, a Trino and S3 combination taking on the workloads of an Elasticsearch cluster that was really, really expensive.
Now, granted, it didn't search by any other dimension but that one, but that was, I think, when I realized that we're really onto something cool and special. Right? Like, being able to do that. Another one just happened recently, and I alluded to this when I was talking about different streaming models. So there's, you know, CDC, obviously. How do I take a transactional database and put those records into an analytic table so that when I run big analytic queries in parallel, I don't crush my transactional table running a website? You know?
A customer came to us and said, you know, hey, but what if we want to have transactional consistency? Right? So imagine you have a table of, like, bank accounts, and you have transactions going between, you know, take $500 from my account and put it in your account, and take $200 from your account and put it in someone else's account. Well, when you get to, you know, the record level, it looks like update Ryan's bank account minus 500, update Tobias's bank account plus 500, and those are two separate events in the system. And, you know, at the end of the day, you kinda want the bank's total assets to add up to the same.
Right? If you're just moving things around between accounts, it shouldn't look like the bank has more money. And so that's essentially what you want with this coordinated CDC. And so we said, oh, okay, well, if you take this approach where you just pump all the changes into a table, right, it's way easier to write, because all you're doing is taking a fact set and writing it, again, into a table. And then if you keep track of the transaction boundaries and, you know, the transaction IDs in that dataset, you can actually time travel. Now, this is time travel within the data and not within the metadata.
You can time travel to any transaction within that table, which is super cool. Right? And then you can also extend that if you have the same transaction ID space across tables in your upstream system. Well, then you can actually have transactional consistency across however many tables you want in your analytic copy. And that, I think, is really cool and powerful, that pattern for, like, what we're calling coordinated CDC. You know, I think that the current state of the art for CDC patterns is try and do either the copy-on-write or merge-on-read approach. But in that case, you've got, you know, probably a Kafka topic that's partitioned across, you know, say, 15 different partitions. And who knows what point in time all those records are in. So if you randomly batch them up through, like, a Flink or Kafka Connect process, the downstream table actually has, like, a completely random table state that doesn't coordinate or doesn't correspond with basically any time in the upstream table.
Right? So I've been thinking about this a lot lately. I'm like, how is anyone having success with these CDC tables that are basically, like, a random table state that is not actually a point in time? So that is one that really fascinates me, and I think where,
[00:42:06] Unknown:
where we need to lead the industry to go, you know, thinking about, like, transactional consistency, because it's very rare that in those systems it's just like, oh, we do everything through upserts. Yeah. It's funny too because you always hear people saying, oh, well, you just need to run CDC, and then you can keep your analytic database up to date with everything that's happening in your transactions. And, well, maybe not quite.
[00:42:27] Unknown:
Yeah. I mean, if you're trying to align the commits on your analytic table with, like, 15 different states of Kafka partitions with data from this table, like, that coordination problem alone sounds insane. Like, this customer that came to us with the problem, they were trying to figure out how do we essentially process each partition in a topic up to a known point and then, like, get Flink to checkpoint there so that the table state is consistent when we go downstream. And we were like, we should just pump data into the fact table and then select facts to get us to that point rather than, you know, trying to build all of this into the write layer. And I think that's a really, you know, powerful realization, that you shouldn't always just try to put your downstream table into the exact state that you want. Sometimes it's better to just defer that work until later and, you know, use an ETL process to say, okay, well, I know that we've processed through transaction 10,011, and we're gonna materialize those table states in the analytic tables and, you know, do it that way. As a two-step process, it's much, much easier than trying to put all of that workload on your Flink writer. Yeah. And that also
[00:43:50] Unknown:
brings in the interesting challenge of multi-table ACID transactions in a data lakehouse environment, but also from the data integration perspective of, even if you're just doing batch, if I just wanna dump everything out of this database right now, if you're doing that on a table-by-table basis, then there's no way to be able to ensure transactional integrity of those records. Because if you have a transaction in the upstream that's writing to 5 different tables as a single process, but you're only pulling one of those tables at a time, well, then
[00:44:23] Unknown:
good luck. That's actually a really good way to use branches. So, you know, I was saying that I'm excited about all these use cases that branching and tagging is gonna have. Well, so we've defined an endpoint in the new Iceberg REST catalog, which we haven't even covered, but that's basically, hey, let's standardize how catalogs talk to each other so that you don't have to get your catalog jar into, say, AWS Athena, which is never gonna happen. Right? Like, let's standardize communication so we all talk the same protocol. Well, we've added, or we've proposed, an endpoint there that can do multi-table transactions.
So multi-table transactions are coming to Iceberg shortly. And the first thing that you can do is that fast-forward operation on multiple branches. Right? Because you've already written all of the data out into branches. So, you know, in this coordinated CDC pattern, you say, okay, we're going up through transaction t. I'm gonna materialize the state here, I'm gonna materialize the state here, etcetera, etcetera. Do it for as many tables as you need to, and then you can swap that table state in, in a single transaction. So that sort of thing, I think, is going to be really powerful and cool.
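A sketch of what that coordinated CDC pattern might look like in Spark SQL, assuming a hypothetical change log table that carries the upstream transaction IDs. The schemas, the transaction cut-off, and the staging branch are all invented for illustration, and the single multi-table commit Ryan describes is still a proposal, so the publish step below is shown as a per-table fast-forward.

```python
# Hypothetical bookkeeping: the range of upstream transactions to apply in this run.
LAST_APPLIED_TXN = 9_500
CUTOFF_TXN = 10_011

# The streaming writer only appends raw change events tagged with their upstream txn_id,
# so the write is cheap and the transaction boundaries are preserved in the data itself.
# A periodic ETL job then rolls the analytic table forward to a known transaction.
spark.sql(f"""
    MERGE INTO demo.db.accounts AS t
    USING (
        SELECT account_id, SUM(balance_delta) AS delta
        FROM demo.db.accounts_changelog
        WHERE txn_id > {LAST_APPLIED_TXN} AND txn_id <= {CUTOFF_TXN}
        GROUP BY account_id
    ) AS c
    ON t.account_id = c.account_id
    WHEN MATCHED THEN UPDATE SET t.balance = t.balance + c.delta
    WHEN NOT MATCHED THEN INSERT (account_id, balance) VALUES (c.account_id, c.delta)
""")

# If, instead, each downstream table were materialized onto its own staging branch
# (not shown), publishing today would be one fast-forward per table; the proposed REST
# catalog endpoint would make the swap across all of the tables a single transaction.
spark.sql("CALL demo.system.fast_forward('db.accounts', 'main', 'cdc_staging')")
```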
[00:45:41] Unknown:
And in your experience of helping to build Iceberg, founding Tabular, working in this ecosystem, what are some of the most interesting or unexpected or challenging lessons that you've learned? I think the most interesting thing that I've learned building Iceberg,
[00:45:56] Unknown:
building Tabular, you know, trying to, you know, sort of change the way we do data engineering for the better, is that those aren't always technical problems. Sometimes a technical solution can unlock something. Right? Like, schema evolution is a great example of, like, a very simple technical solution unlocking a ton of value, where it's like, oh, hey, you know, you don't have to worry about making an accidental mistake that costs you a week's worth of time. But not all solutions are technical. And a lot of times, you just have to think about how people are going to interact with this and design differently for that interaction.
I think that's one of the insights that Iceberg has, is that we put so much attention on how people are going to behave and what they're going to need. We try not to steal attention. You know, we try to just be a background thing that works, that you don't ever have to care about, which is why it's so ironic that pretty much as a, you know, a job, I go on podcasts or write blog posts or give talks about something that I really don't want people to focus on at all. You know, it should get out of the way, generally.
And I really think that it's a good insight to think about, like, what challenges can be solved by technology and when do you just need to say, like, this is a human problem. Maybe if we have people approach it a different way, it'll be much easier.
[00:47:37] Unknown:
Yeah. It's funny that you make the point of you spend a lot of time talking about and promoting something that you don't want anybody to have to care about. Because as somebody who has worked in operations and data engineering for a long time, if I'm doing my job right, nobody knows I exist.
[00:47:54] Unknown:
Exactly. I want you to think about it when you're making a platform-level decision. Right? Like, we talked about earlier, when you're thinking, you know, what's the format people are gonna be using in 10 years? I think that that's Tabular, because of the adoption and everything. I want you to think less about, like, how it actually works. After that point, don't worry about it. We got you. Absolutely. And so for people who are in the space of making that decision, what are the cases where Iceberg and/or Tabular are the wrong choice? Well, one of the easy ones is transactional.
Right? And this is actually a mistake I made from the start, because we jokingly called Iceberg, you know, a format for slow-changing data, which isn't quite true. You can do very large changes, but you don't wanna do a thousand per second. Sort of, it's the analytic space. So that's the number one mistake, where you get people doing, like, a Trino insert of a single row, or modify a single row, and that doesn't quite work out well, because if you're doing that, you tend not to know that you're gonna accumulate a lot of files and need to run optimize or something like that. And as you continue to iterate on Iceberg and Tabular, what are some of the things you have planned for the near to medium term, or any particular projects or problem areas that you're excited to explore? Yeah. Absolutely. So where we've been going with Tabular is basically into this automatic optimization and maintenance.
And I think that that is a fascinating space, you know, where we're getting reports back like, okay, this table was scanned, this was the data that was read. Then we also have, you know, tests running in the background to see, hey, if we tweak this setting or do something a little differently, how much smaller is the data? And we're trying to blend all of that together into this performance recommender that basically allows you to create a table and then not worry about tuning. Right? We'll make sure that we're updating your sort order and your Parquet settings and all of that to make your data small, but also make it so that your actual workloads are performing. I think that that is just ripe for optimization, and it's a fascinating space that I really have enjoyed. On the Iceberg side, there's a ton of good stuff coming. Right? I mentioned tags and branches.
We're doing the multi-table transactions and commits, starting with, you know, branch swapping and getting more and more into proper SQL transactions. We're also working in the view space. So, you know, we've got this sort of data layer and data space that we want engines to connect to. Well, views are a very important part of that. So Iceberg is working on a view specification so that you'll be able to share views across those databases. And to do that, that's all probably going to be powered by Substrait, which is a very fascinating project that gives you essentially a way to store and express SQL plans and views. So we're looking at that. And then the last one is we're also working on encryption in the format. So, you know, there are quite a few exciting things coming out. Yeah. Definitely a lot of interesting aspects.
[00:51:21] Unknown:
Definitely looking forward to seeing how it continues to grow and evolve. Are there any other aspects of the work you're doing at Tabular or on the Iceberg project that we didn't discuss yet that you would like to cover before we close out the show? I don't think so. I think we're doing pretty good. Yeah. The Tabular stuff, I'd love to have you try. That's a lot of fun. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:51:58] Unknown:
By far, I think the biggest gap is automation and the things that we put on data engineers' shoulders. Right? Forever in the data lake space, we have made data engineers care about minutiae, like how big the files that they write out are, because we couldn't correct those things. And now, like, even with table formats taking a step forward here, right, table formats like Delta and Iceberg at least can actually take action and fix those things in the background. So then the next thing is, oh, now you have to care about running optimize. Like, it's still on you. You have to know that you had that problem and then take care of the problem somehow. I think that we really need to take the longer-term approach of building systems that see what you're doing and can take action to correct it automatically. You know, that is basically the crux of what we want to build at Tabular. You know, this bottom half of a database actually needs to function like a database that makes things happen on its own, because it knows that it's going to be worth it to make that change.
So that automation piece is by far, I think, the most important thing moving forward. And it's taking those things back off of the shoulders of data engineers who, you know, they might have time to care about the size of files that they write out or not writing a thousand files per partition or whatever, but they never care about Parquet tuning settings. Right? Like, once you get a couple steps below the size of files, people are just like, I do not have time to even mess with that. I want people to be able to focus on the fun parts of their jobs. Right? Not the minutiae
[00:53:48] Unknown:
of what's going on underneath the covers. Yeah. It also plays into the interesting developments happening with kind of autonomous and machine-learning-powered database engines. So definitely a lot of exciting developments happening there and exciting things to come. So, definitely appreciate you taking the time today to join me and share the work that you and your team are doing on Iceberg and Tabular. Definitely a great project. Appreciate all of the work that's gone into that. So thank you again for your time, and I hope you enjoy the rest of your day. Yeah. Likewise. Thanks for having me on the show.
[00:54:25] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Host Welcome
Guest Introduction: Ryan Blue
Overview of Iceberg Table Format
Motivations Behind Iceberg
Evolution of Iceberg and Vendor Adoption
Expanding Beyond Spark: Multi-Engine Support
Comparison with Other Table Formats
Choosing the Right Engine for Iceberg
Ongoing Maintenance and Challenges
Time Travel and Advanced Use Cases
Building Tabular: Business Model and Integration
Innovative Uses of Iceberg
Multi-Table Transactions and Future Plans
Lessons Learned and Human-Centric Design
Future Directions for Iceberg and Tabular
Biggest Gaps in Data Management Tools