Summary
A data lakehouse is intended to combine the benefits of data lakes (cost-effective, scalable storage and compute) and data warehouses (a user-friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offers the ease of use and execution speed of data warehouses with the infinite storage and scalability of data lakes.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Join in with the event for the global data community, Data Council Austin. From March 26th-28th 2024, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council today.
- Your host is Tobias Macey and today I'm interviewing Dain Sundstrom about building a data lakehouse with Trino and Iceberg
Interview
- Introduction
- How did you get involved in the area of data management?
- To start, can you share your definition of what constitutes a "Data Lakehouse"?
- What are the technical/architectural/UX challenges that have hindered the progression of lakehouses?
- What are the notable advancements in recent months/years that make them a more viable platform choice?
- There are multiple tools and vendors that have adopted the "data lakehouse" terminology. What are the benefits offered by the combination of Trino and Iceberg?
- What are the key points of comparison for that combination in relation to other possible selections?
- What are the pain points that are still prevalent in lakehouse architectures as compared to warehouse or vertically integrated systems?
- What progress is being made (within or across the ecosystem) to address those sharp edges?
- For someone who is interested in building a data lakehouse with Trino and Iceberg, how does that influence their selection of other platform elements?
- What are the differences in terms of pipeline design/access and usage patterns when using a Trino/Iceberg lakehouse as compared to other popular warehouse/lakehouse structures?
- What are the most interesting, innovative, or unexpected ways that you have seen Trino lakehouses used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data lakehouse ecosystem?
- When is a lakehouse the wrong choice?
- What do you have planned for the future of Trino/Starburst?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- Trino
- Starburst
- Presto
- JBoss
- Java EE
- HDFS
- S3
- GCS == Google Cloud Storage
- Hive
- Hive ACID
- Apache Ranger
- OPA == Open Policy Agent
- Oso
- AWS Lake Formation
- Tabular
- Iceberg
- Delta Lake
- Debezium
- Materialized View
- Clickhouse
- Druid
- Hudi
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Data Council: ![Data Council Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/3WD2in1j.png) Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit [dataengineeringpodcast.com/data-council](https://www.dataengineeringpodcast.com/data-council) and use code **dataengpod20** to register today! Promo Code: dataengpod20
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png) Data teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. Dagster is an open-source orchestration solution that helps data teams reign in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to [dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast) today to get your first 30 days free!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open source, cloud native orchestrator for the whole development life cycle with integrated lineage and observability, a declarative programming model, and best in class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise class hosted solution that offers serverless and hybrid deployments, enhanced security, and on demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started, and your first 30 days are free. Data lakes are notoriously complex.
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey, and today I'm interviewing Dain Sundstrom about building a data lakehouse with Trino and Iceberg. So, Dain, can you start by introducing yourself?
[00:01:48] Unknown:
Well, I am Dain Sundstrom. I am one of the founders of Trino, and Presto before that. I am CTO at Starburst. I've been working in the data lake space for about ten years now. Before that, I worked at some other startups. And before that, I was one of the original people at JBoss and spent a lot of time in Java EE and that sort of space. And do you remember how you first got started working in data? My background mostly was distributed computing. So out of college, I started working at UnitedHealthcare on distributed computing using Intera DCE in the nineties, and then switched to Java EE back when it was called something else. And as part of that, I wrote the object relational mapping tools for JBoss.
Then, eventually, fast-forwarding a long time, I started working at Facebook, and one of the original projects from the head of infrastructure was to come up with a faster, better way of interacting with their large data warehouse at the time. So this is, like, ten years ago, and it was, I don't know, 300 or 400 petabytes or something. It's dramatically bigger now, and they didn't have a team to do it. And myself and David Phillips and Martin have extensive backgrounds in Java and databases and stuff like that. So we were available, and we started working on it. But I'm mostly a distributed computing person, so I wrote most of the distributed computing parts of Trino. Whereas Martin's a deep language person, so he did a lot of the language optimizations, and David is deeply into databases, has been forever, and so built a lot of the database parts and the tooling and things like that.
[00:03:43] Unknown:
As an outgrowth of that effort, along with a number of other contributions to the ecosystem, we have landed in this space where we have a new architectural paradigm for analytical systems that is largely phrased as the data lakehouse, a midway point between data lakes and data warehouses. And for the purposes of this conversation, I'm wondering if you can give your definition of what constitutes a data lakehouse.
[00:04:07] Unknown:
It's a really good question because I think people play fast and loose with it. So historically, I would say a data lake is you have traditional external storage, so HDFS is generally what people are talking about. But nowadays HDFS is so rarely used. It's almost always some cloud object storage: S3, GCS, Azure stuff. So definitely, all the data is stored in that. And then I think the important part comes with a lakehouse of talking about standard data representations. So, like, you can be a vendor and store all your data in, you know, S3, but it's proprietary stuff. And proprietary, I'm just gonna define as you're the only one who really implements it. I don't care if you have an open spec or whatever. Like, it doesn't matter. If you're the only serious player in it, it's effectively proprietary.
So where I think about it now, it's object storage, and it's doing it in the lake. So it isn't like, oh, I take the files and then I import them into my special proprietary format, and then I process them and then I dump the data back out. That's the lake as a sidecar to you. It's when you're doing transformations, when you're doing data maintenance, the data is operated on directly, with the lake being your native form. Everything else is, you know, a bolt on, which is not to say it's terrible. It's just a different thing.
[00:05:33] Unknown:
Absolutely. And another interesting aspect of the idea of the data lakehouse is that the reason for framing it as such is that it intends to add a lot of the user experience benefits that you get from a fully vertically integrated database system, such as data warehouses, whether that is an actual vertically integrated system as of the days of yore, or a cloud native system where compute and storage are disaggregated but still presented as a single unified experience. And I'm wondering if you can talk to some of the ways that we have actually, as a community, hit that mark, and what are some of the areas where we're still falling short of that cohesive platform experience, the parts where the gaps still show through and you can see that it's actually five different pieces that are trying to work together.
[00:06:25] Unknown:
Yeah. I think we've done an okay job. I think we've got a long ways to go, though. If you had asked me this question three years ago, I would have just gone on and on about the litany of broken, weird tools that exist in the lakehouse. I think things are starting to get better as people realize, and this isn't so much the community of users, it's the community of the people implementing and maintaining the systems, where I think we've now started to figure out that this paradox of choice is not a good thing. So before, we had Hive, and there were five competing data formats, and then that narrowed down to two. And then everyone realized that what Hive was doing was really bad and not sustainable.
And you could have two different tables next to each other that are maintained in completely different ways and have different type systems and different schema evolution and so on. Like, I can go on and on about the edges of it. So I think Iceberg came along and said, hey, we're just gonna come up with a format for tables. It includes how tables move, how they're evolved, how they're managed, and covers a whole plethora of things, including data types and how partitioning works and stats now and views and so on, as a written down standard. Before, it was just the wild west. Like, literally, someone would check something into Hive and invent an entire new system. Spark does this all the time. Like, okay, let's implement Spark bucketing v2, which is different than everything else. And if you wanna know how it works, go read the Spark code, because some person just showed up and everyone's like, yeah, that's cool. So I think we've gotten a lot better: data in tables, the type system, that sort of thing is now fairly standardized and well understood. That said, Iceberg did it, and then immediately Databricks came along and dropped a competing product, which is kind of half finished. And so now I get to implement two. And now there's more of these coming along, and I'm hoping that this time around we consolidate onto one very quickly, because it's really kind of a mess. What happens is people like us in the Trino community have to implement all of these. And we only have so many people, so it's like we implement one really well and the rest suffer, or we implement all of them kind of okay. So it's difficult. Like, right now, there are enough people that I think we're maintaining three of them. Hive ACID died, and that's one of n tools. And we can have the same conversation about security. We can have the same conversation about, I don't know, there's lots of these areas.
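To make the "written down standard" point concrete, here is a minimal sketch of what table creation and schema evolution look like through the Trino Iceberg connector, using the `trino` Python client. The coordinator address, catalog, schema, and table names are illustrative assumptions, not anything from the episode.

```python
import trino

# Hypothetical connection details; assumes a running Trino coordinator
# with an Iceberg catalog named "iceberg" and a schema named "analytics".
conn = trino.dbapi.connect(host="localhost", port=8080, user="demo",
                           catalog="iceberg", schema="analytics")
cur = conn.cursor()

# Partitioning is declared in table metadata per the Iceberg spec,
# not inferred from a directory layout the way Hive did it.
cur.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        user_id BIGINT,
        url VARCHAR,
        viewed_at TIMESTAMP(6)
    )
    WITH (partitioning = ARRAY['day(viewed_at)'])
""")

# Schema evolution is a metadata-only operation defined by the spec, so
# every engine that implements Iceberg agrees on what the table looks like.
cur.execute("ALTER TABLE page_views ADD COLUMN referrer VARCHAR")
```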
[00:09:22] Unknown:
Absolutely. So I personally am actually using the lakehouse architecture for my platform. For the sake of transparency, I am using Trino. I'm using the Starburst managed Galaxy, so I'll get that out of the way. I'm using the Iceberg table format, which is largely transparent. I don't have to do a lot on the actual table format piece because Trino handles that piece of it for the most part. And so as somebody who's using the lakehouse paradigm, there are definitely a lot of niceties. I agree it's gotten a lot easier over the past couple of years than it was prior to that. A lot of the conversation seems to have cohered along a roughly standardized conception of what constitutes the lakehouse.
I do think that one of the areas that is still unfinished, or at least not as cohesive across the board, is that question of security and access control. That seems to be one of the areas where the overall data ecosystem has not yet figured things out. Everybody has their own thoughts on how it can and should be done. Everybody wants to own that experience. There aren't a lot of methods for being able to communicate roles and access across the layer boundaries, and I'm wondering if you can talk to some of the ways that that manifests in terms of that overall experience as a juxtaposition to the warehouse, where everything is presented as one system.
[00:10:44] Unknown:
Yeah. So as one of the people who's written, I don't know, a huge portion of the security systems in Trino and in Galaxy, it's actually a really hard space to be in. So, like, if you look into the open ecosystem, and throughout this whole thing we're kind of talking about the open ecosystem, the open ecosystem for security, historically, you had the Hive metastore with its security. Well, the most popular metastore out there is Glue, and it doesn't have the Hive security model. And the Hive security model was always weird and only applies to Hive. Trino is a federated system, so that doesn't make much sense.
Ranger pretty much died. I haven't seen it around in a while. Like, there are people still kinda looking at it, but I get a sense for how popular things are by when people ask about things, and, like, two or three years ago it just kinda fell off a cliff. And the only other thing I've seen out recently is OPA, which the Bloomberg folks have been working on and really like. OPA is really complicated. You write security rule policies in a security rule server in a custom language. I literally looked at the language, and I was like, if I did this, I would write a tool to write the language policy files for me. It's very complicated.
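For a sense of what the OPA model looks like from the engine's side: the engine ships an "input" document describing the user and the attempted action to an OPA server over HTTP and gets back an allow/deny decision. This is a rough sketch of that exchange in Python; the policy path and input fields are invented for illustration, not taken from Trino's actual OPA plugin.

```python
import requests

# Hypothetical policy path on a local OPA server; real deployments configure
# the engine with the endpoint of the policy they want consulted.
decision = requests.post(
    "http://localhost:8181/v1/data/trino/allow",
    json={
        "input": {
            "action": "SelectFromColumns",
            "resource": {"catalog": "iceberg", "schema": "analytics",
                         "table": "page_views"},
            "identity": {"user": "alice"},
        }
    },
    # Access decisions sit on the hot path of every query, hence the
    # "millisecond level" requirement mentioned in this conversation.
    timeout=0.05,
).json()

allowed = decision.get("result", False)
```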
So I think that's got a long ways to go. Hopefully, someone builds a UI and tooling and stuff for it. So that's really all you have in the open space. In proprietary, you have AWS's Lake Formation, which, like, I seriously have yet to meet someone who's rolled it out. It just looks weird. We'll see what happens. Again, I'm hoping it dies. Like, every one of these things that's successful, we have to build and maintain. So I'd like one, and I'd like it to be open. Databricks has their own proprietary thing. At Starburst, we have our own proprietary thing. I think Tabular has their own proprietary thing.
You end up with proprietary things because of the complexity of the security system. So, like, in Galaxy, we built the security system into the core of Galaxy itself. Galaxy is the Starburst hosted version of Trino. So every screen you're looking at in Galaxy is viewer aware, and we're applying your policy on what you're allowed to see, and it's really core to the whole application. It touches, like, every single bit. So how do you put that in? And then you're like, oh, I'm gonna make this call out to a third party system, and I need to know when it changes, but this is something I need to be able to do on, like, a millisecond level. And so security is a super hard problem. Also, everyone has different viewpoints about how security should work. In Galaxy, we followed a very traditional database security system with roles and access controls, etcetera.
In other systems, there's different viewpoints. It's very interesting. Like, OPA is this different universe of policy rule systems. So I don't think we have a good answer for this right now in terms of, like, a community. And I think this is one of the things that actually is a reason why you would choose a vendor: their security implementation aligns with what you wanna do. Yeah. The security and policy space is definitely still very much in flux, in particular in the lakehouse ecosystem, but even beyond that. So OPA is a tool that came out of largely the Kubernetes ecosystem
[00:14:16] Unknown:
and is being applied to a number of different areas because it is a generalized policy language. There's another project called Oso, which is an open source policy engine that has its own policy language again, so that you can have the policy agent embedded in process in various language runtimes, and then you can define those policies out of band and apply them to the runtime dynamically. So I think that is an interesting approach, and maybe something, whether it's Oso or OPA or one of the other tools in that ecosystem, might start to make inroads into the data platform ecosystem as well. And then you also have identity systems like Keycloak or Okta or Auth0, etcetera, that also factor into all of that. So it's a big, complicated space. I think part of the problem here is what are we optimizing for? So, like, OPA
[00:15:05] Unknown:
and Ranger, which is just another policy system, were great if you're an admin and you wanna lay down the rules broadly for lots of tables by using table matching. But SQL security was really built around: I create a table. I type commands to grant access to other folks in the platform. I may create views or, you know, filter rules or something like that. And I'm just typing commands to do that in the SQL language. And that SQL language is the language of the system I'm in. So that's a system that's optimized for end user experience, not admin experience. And the admin experience is great if you're a bank. OPA in Trino came from Bloomberg, and it's like, they have a lot of data, and they have data policies they need to apply broadly. But if you're a small group and you want to have a security system, like, do you even have people that can write this complicated thing? Can you run an OPA system that's gonna return responses in milliseconds, because it's part of, like, every query?
No. And, really, you wanted the system to work in a simple, understandable way for end users. So a lot of the stuff in data lakes is provided by big companies with big company solutions to big company problems, and it does not align with, hey, I wanna grant access to this table to some other person.
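For contrast, the end-user SQL security model being described here is just commands typed in the system you're already in. A minimal sketch, assuming a connector that supports SQL grants, an existing "analysts" role, and the hypothetical table from the earlier sketch:

```python
import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="owner",
                           catalog="iceberg", schema="analytics")
cur = conn.cursor()

# Table-level sharing, typed by the table owner in the engine itself.
cur.execute("GRANT SELECT ON page_views TO ROLE analysts")

# Finer-grained sharing by exposing a filtered view instead of the raw table.
cur.execute("""
    CREATE VIEW page_views_recent AS
    SELECT user_id, url, viewed_at
    FROM page_views
    WHERE viewed_at > current_timestamp - INTERVAL '30' DAY
""")
cur.execute("GRANT SELECT ON page_views_recent TO ROLE analysts")
```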
[00:16:35] Unknown:
Absolutely. And in the data lake and lakehouse ecosystem as well, there's the added complexity that by virtue of the storage and the compute being disaggregated, you maybe want to bring a different compute to that same storage. And so then there's the question of, okay, well, do I need to route all of my requests through the other compute engine that has my policy information? Do I have to have different policy sets and different rule sets across those different compute systems?
[00:17:02] Unknown:
It's actually worse than that, too, because outside of Trino, the most popular compute engines are MapReduce-y things like Spark and Hive. And the problem is that those engines almost always allow users to upload their own untrusted third party code into the same process. And that means that you can't rely on the process to be secure, to protect against data access and stuff like that. So the Spark and Hive communities are pushing for things like column level encryption and physical security based on file permissions, which is anathema to the way SQL works. This would be the equivalent of, oh, I'm gonna manage my MySQL permissions by setting file permissions in the UNIX file system. It's insane. Right? And this is, like, state of the art, and it's because the entire industry went down this MapReduce path for fifteen years, and it's not a good idea. Like, you see every single vendor who's working in the data space has moved away from MapReduce. Yes, Spark still uses it, but when you get into high performance stuff, everyone has moved away from MapReduce.
It's just not a thing you do anymore. And we're still building our security systems to, like, the lowest common denominator.
[00:18:32] Unknown:
And so taking a step back now from ragging on the complexities of security, bringing it back around to Trino and Iceberg, and maybe keeping it in the context of security, what are the benefits that that particular pairing provides, maybe in juxtaposition to other technology stacks or vendors that purport to provide a data lakehouse experience?
[00:18:56] Unknown:
Today, I think the folks talking about the "data lake experience", and I'm using that in quotes, kinda break down into two camps. You have folks who have a traditional data warehouse that can pretend like it's in the data lake. That's almost always done by: you run a query, it loads the data into Snowflake format, they run their query, and then they throw the data away or they cache it or something like that. But they don't actually execute directly on the lakehouse data. So that's one camp. And then the other camp would be, obviously, you have the Iceberg camp, and then you have the Delta Lake camp, which is similar.
I have my bias. My bias is absolutely towards Iceberg. I was pretty unhappy when Delta Lake actually came out. It's unfortunate. Like, we had this brief moment where it looked like the entire ecosystem was gonna move onto Iceberg, and we would only have one thing to implement, not, like, five. And then Databricks dropped their format. And in my experience, the only people using it are Databricks customers, but they have a lot of customers. And so everyone is having to implement it because Databricks made it the default format for their customers, when, honestly, their customers would be just as happy with Iceberg. So now we all get to build twice. And, yeah, it's got a community, but it's not the same thing as it being an Apache community. But even then, if there were two Apache projects, I'd be annoyed also. And then there's other groups that are trying to build stuff. So Trino and Iceberg, I think, combine, in my opinion, the best analytics query engine we have available with the current best storage format.
Because Iceberg without Trino is, like, great, I have a storage format, but how do I query it? How do I interact with and change and produce these files? It's nice, but you're still suffering the problems of some of the other engines. And Trino, on the other hand, provides this great query engine that's adaptable. Like, Trino has the ability to add in custom data types. We have direct readers for everything. So we can actually build an engine that's really, really tightly set up for what Iceberg can do.
And we can do that in a way where you get really, really great performance. So what Trino was suffering from until Iceberg came along was that the data formats weren't particularly good. And so you would have performance problems. You would be missing stats. You know, most of the data formats and the way Hive worked were actually designed for HDFS, which has a very specific performance profile that S3 does not have. Like, listing files is great in HDFS and insanely slow in S3, and Iceberg doesn't require listing files. There's a whole bunch of things like that where Iceberg was designed to deal with the performance characteristics of object storage, as opposed to HDFS's design, and, I mean, hardly anyone uses HDFS anymore. So Iceberg gave us a really stable format with a well run community that likes specs, that understands the performance of modern things.
And we were able to work really closely with them and build a query engine that's really tuned. The integration we're doing with Iceberg is fundamentally designed for Iceberg. It isn't a bolt on. It's not like we took Hive and swapped out a little bit. We wrote a custom plugin just for Iceberg that does exactly what Iceberg wants.
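One concrete payoff of that design: because planning reads Iceberg's manifest metadata rather than listing object storage prefixes, Trino can also expose that metadata as ordinary queryable tables. A sketch against the hypothetical table from the earlier example:

```python
import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="demo",
                           catalog="iceberg", schema="analytics")
cur = conn.cursor()

# Every data file, with row counts and sizes, comes from manifests,
# not from an S3 LIST call.
cur.execute(
    'SELECT file_path, record_count, file_size_in_bytes FROM "page_views$files"'
)
for path, rows, size in cur.fetchall():
    print(path, rows, size)

# The snapshot log that makes time travel (and snapshot retention) possible.
cur.execute(
    'SELECT snapshot_id, committed_at, operation FROM "page_views$snapshots"'
)
print(cur.fetchall())
```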
[00:23:14] Unknown:
Are you sick and tired of salesy data conferences? You know, the ones run by large tech companies and cloud vendors? Well, so am I. And that's why I started Data Council, the best vendor neutral, no BS data conference around. I'm Pete Soderling, and I'd like to personally invite you to Austin this March 26th to 28th, where I'll play host to hundreds of attendees, 100 plus top speakers, and dozens of hot startups on the cutting edge of data science, engineering, and AI. The community that attends Data Council are some of the smartest founders, data scientists, lead engineers, CTOs, heads of data, investors, and community organizers who are all working together to build the future of data and AI.
And as a listener to the Data Engineering Podcast, you can join us. Get a special discount off tickets by using the promo code dataengpod20. That's dataengpod20. I guarantee that you'll be inspired by the folks at the event, and I can't wait to see you there.
[00:24:14] Unknown:
And when somebody is building a data platform or building their warehouse implementation, they decide, okay, this combination of Trino and Iceberg does what I want. I have the benefits of a performant query engine. I have the flexibility and scalability of object storage. I can scale those two things independently. How does that influence the other upstream and downstream choices that they might make for the other components of their data platform?
[00:24:42] Unknown:
So once you decide you're gonna go with Iceberg and Trino, you have the complexities of, like, how do I actually get my data into these platforms? The bootstrap problem is a really big problem in data warehousing in general. It's, like, how do I get my data in? In general, since Iceberg has become so popular, a lot of tools are adopting it, so actually getting your data in is less of a problem. But you definitely wanna go and look at the vendors you're gonna use for landing the data into your S3 bucket and make sure they support Parquet at the very least, and Iceberg, hopefully. And if they're not supporting it, when are they gonna support it? Because most of them have it on their roadmap, unless they're Databricks. Actually, even Databricks is starting to add Iceberg support. So make sure your vendors actually support landing data in Iceberg format. Then in terms of other choices, you obviously have things like, how is security gonna work? How is data maintenance gonna work? Iceberg tables require maintenance on them. And depending on how you're importing data, they may require compaction, and you wanna keep only so much snapshot data, because they have the ability to query historic data, but that means you're holding historic data, which could be expensive. So there's a bunch of maintenance things, and you're gonna have to choose a tool that supports the maintenance. Many of the platforms, like Starburst, are integrating all of this stuff into the platform because we wanna create the simplest experience for people. Like, we don't want them to have to go and integrate with a third party tool to run some compaction jobs. Then I think there's additional things around, like, you're probably gonna use some sort of data transformation pipeline tool. It's almost always dbt.
I don't even know if they have competitors, honestly. Yeah. And then, obviously, you're gonna want some sort of BI tools. Most of them are supporting Trino, or Starburst, or both today. So there isn't much of a choice reduction there. But I think the big things are data ingest, getting it into Iceberg, and maintaining those files; those are currently a big part of the platform choices.
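The maintenance chores mentioned here map onto specific commands in the Trino Iceberg connector. A sketch of what a scheduled maintenance job might run against the hypothetical table from earlier; the thresholds are arbitrary illustrative values:

```python
import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="maintenance",
                           catalog="iceberg", schema="analytics")
cur = conn.cursor()

# Compact the many small files produced by frequent ingestion.
cur.execute(
    "ALTER TABLE page_views EXECUTE optimize(file_size_threshold => '128MB')"
)

# Expire old snapshots so time travel doesn't retain storage forever.
cur.execute(
    "ALTER TABLE page_views EXECUTE expire_snapshots(retention_threshold => '7d')"
)

# Delete data files that no live snapshot references anymore.
cur.execute(
    "ALTER TABLE page_views EXECUTE remove_orphan_files(retention_threshold => '7d')"
)
```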
[00:26:57] Unknown:
Absolutely. And I started my data lakehouse journey, I think, maybe going on two years ago now. And in that two years, it has gotten better. Initially, there wasn't really any out of the box support for being able to write into a lakehouse. You could write data into S3, but then you would have to perform a different step to actually tell whatever metastore you were using: hey, these files exist, this is the schema, these are the tables, etcetera. So my team is actually using Airbyte, and we actually had to write a custom output plugin that sat on top of their S3 plugin to be able to automate generation of those AWS Glue tables for the data that was just written out, rather than having it be an out of band process of, oh, hey, I wrote all this data to S3, and now I've gotta wait for the crawler to run to tell me what those tables are, and it's probably gonna be wrong anyway, etcetera.
[00:27:50] Unknown:
Absolutely. Airbyte, actually, all of them, they either have it now or it's coming, and when they do have it, it's not always the best. But, like, every single one of those vendors, I think, has realized that Iceberg is an important part of the data lake future, and they just need to be able to ingest directly into Iceberg. And Airbyte does have that out of the box now. There are a couple of implementations.
[00:28:13] Unknown:
The level of support is not quite where I would like it to be. And then going back to one of your earlier comments, as far as the data type specifications being a bit all over the place, one of the things that is my personal pet peeve, at least in the Airbyte tool chain, I don't know if it exists elsewhere, is that anything that has a decimal value is automatically a float, which, if anybody knows anything about data types, is an awful choice.
[00:28:39] Unknown:
Yes. That is an absolutely awful choice. Funny enough, the first versions of Trino, we didn't have decimal. We only had doubles. And the actual migration away from them was quite an undertaking.
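A quick illustration of why mapping decimals to floats is such a bad default: binary floats cannot represent most decimal fractions exactly, which is poison for anything money-shaped.

```python
from decimal import Decimal

print(0.10 + 0.20)                        # 0.30000000000000004
print(Decimal("0.10") + Decimal("0.20"))  # 0.30

# The error compounds when aggregated across many rows, which is exactly
# what analytics queries do.
print(sum([0.10] * 1_000_000))             # 100000.00000133288
print(sum([Decimal("0.10")] * 1_000_000))  # 100000.00
```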
[00:28:52] Unknown:
We had backwards compatible flags for a long while, where you'd be like, oh, if you see a literal, it's actually, you know, a double, not a decimal like it should have been in the spec. So the version of the plugin that my team uses, we actually implemented the logic that says, if it is a numeric type that has a decimal place, treat it as a decimal value, not as a float. And so for people who are looking at the data lakehouse ecosystem, going from where we are today and looking into the near to medium term, what are some of the areas of progress that you see as far as overall improvement in the capabilities and user experience for the tooling that's available?
[00:29:32] Unknown:
So I think we are finally at the point, as of, I don't know, this year, where the rest of the vendor space has realized that Iceberg is a critical component. And they aren't even just starting; they figured this out, like, six months ago, and their products are starting to land. And that's a big change. Whereas before, as you said, the history of the data lake is you end up having to build a bunch of this stuff yourself while the vendors figure out what's important. So there's a bunch of interesting parts to this. There's, obviously, things like landing data and data maintenance.
It's gonna be interesting to see how this shakes out in the next year or two, as what happened before is happening again. Everyone realizes it's important, so everyone's gonna build products around this. So now we're gonna have competing products that all have slightly different features, which is a good thing, but it's also a bad thing, because it's the paradox of choice for the end users. You're gonna have a lot of stuff to look at, and you have to consider that the data lake is about how things integrate together. So it's, like, if I choose this product from this vendor, how does that work with my other products that I might be interested in from other vendors? Can I use Airbyte to land my data and then use a separate data maintenance tool that plays well with that landed data? And it's going to be an interesting next set of things around, now that we're moving onto Iceberg and we have Trino, how do we get these different products to play well with it? And everyone's got kind of a different viewpoint on that.
[00:31:18] Unknown:
And as a vendor supporting Trino, building a product powered by Trino, what are some of the areas of investment that you see as being most critical to easing that adoption curve and improving the effectiveness and user experience for people who are using Starburst specifically, and Trino indirectly, to make their lives easier and help them get their jobs done?
[00:31:43] Unknown:
Well, I should have mentioned this earlier. The most challenging thing that people have is actually, like, how they query their data. So we set up Trino, and the first thing you see in Starburst is a way of actually entering queries right in our UI and being able to run some queries. And then you're like, great, I wanna put this in my BI tool. How do I get this to my BI tool? So that is a big area we actually think about: how do we empower users to get this into the tools they wanna use? Then the other part is generally the admin part. How do I manage my security? We spend a lot of time around that. And I think the big areas that we look for are how do we make it easier and easier for people to set up their data lake. So one of the first things we focused on in the Galaxy development was what I call time to first query.
So you go, you sign up, and you can be running queries on your data warehouse in a minute, a couple minutes. That's great. How do you get your data in? So we spend a bunch of time around data discovery, integrations, etcetera, and we're continuing to do more and more work around how you actually build up your initial lake and get your data into your lake. So I still think that's one of the big problems: how do you get data in? And then just a lot of the data lake stuff. It's geeky stuff. It's stuff I love, but it's really detailed, and there's a lot of choice in the space. And, really, what I want as a nontechnical end user, or even, honestly, my other friends that are insanely technical, they're like, that's great, but I don't wanna learn how the low level file system stuff works. I just wanna run some queries.
So we spent a lot of time on just, like, let's get it all working. And then if you wanna integrate with some additional stuff because that's important to you, we can talk about how we do that. But, really, it's: get up, get queries going, get excited about what we're doing. And then we can talk. Sometimes people are very opinionated about wanting a certain specific integration the way they wanna do it, but it's pretty rare. We hear it because we're in the community. But outside of, like, data heads, people don't know what Ranger is, or Parquet. They don't know what any of this is. They're like, I just wanna run some queries.
[00:34:19] Unknown:
Yeah. As somebody who's been running this podcast for, I guess, seven years now, whenever I talk to somebody who isn't deeply embedded in this space, I'm always struck by the fact that the things that I'm talking about, they have no clue and they don't care. I'm like, wait a minute. Alright. Reset. I'm gonna remember that I'm talking to somebody who doesn't do this every day.
[00:34:39] Unknown:
Yeah. I often find myself, outside of the data space, saying: so, you know in Excel when you do X? We kinda do that, but the table's infinite.
[00:34:49] Unknown:
Right. And going back to that question of landing data and the transformation, as you mentioned, most people these days are using dbt. There are some competitors, but not a lot of them, and not on the same scale. But one of the benefits that Trino provides is, as you mentioned, it's a federated query engine. So rather than being constrained to, oh, I can only work on the data that's in my Iceberg tables, you can say, oh, I actually just want to directly query against my Postgres or my MySQL database or some of the other numerous data connectors that are out there. And I'm wondering what you see as the general pattern of people who are adopting Trino, whether they are still using Airbyte or Fivetran as the only means of landing data into their lakehouse, or if they're largely using that federated query capability to be able to do more real time data updates from source systems into their lakehouse via those transformation routes?
[00:35:45] Unknown:
Very, very interesting question. So you're gonna get the database answer, which is: it depends. It's interesting. Federation is awesome. Generally, you're not keeping your main data... well, actually, let me back up. So, normally, when we're talking about federation: Trino, at its heart, is a federated query engine. That is, we don't own the data. We're interacting with data and the descriptions of the tables, which are all external. That said, the connectors that read data from, like, object store and Glue and that sort of thing, those are effectively native formats to Trino. Like, we implement all the raw file reading logic. We talk directly to Glue. We're not talking to another query engine. Whereas when we talk to MySQL, we send a query in MySQL's language to MySQL.
So, normally, when we're talking about federation, we're talking about the stuff that's not the normal data lake queries. A lot of companies and users, etcetera, will have what I'll call dimensional data sitting in a production store that's, like, a MySQL or a Postgres. This could be as simple as demographics for users, etcetera. So they'll have their main feed of data. Say it's an ads feed. And it's like, okay, user so and so saw this ad. You join it with their demographics, and then you can do analytics of, you know, the amount of ad clicks by age range or something like that. And you don't have age range in your normal ad feed. So that's really powerful.
And it's easy to do because you just connect them together. You don't have to set anything up. The downside is you're now accessing, from your query engine, a production data store that keeps your website running. That can be fine if you're using, like, MySQL and you have a bunch of read replicas for your database. It can also be expensive, because a transaction processing database is more expensive to run than an analytics database for the amount of data. So sometimes you'll want to instead copy that data into your data warehouse. The other reason that you wanna copy data in is you sometimes want historic data. So you need the demographics for that user when they saw the ad, especially when you're doing stuff where there's money involved and people are paying for certain ad impressions, or, you know, you're doing product stuff, you're selling things, and you wanna record the state of the system at that point. So a lot of times you'll either be dumping in data daily, or you can, with a lot of work, try and set up something like Debezium and get a feed into a data warehouse. It's very complicated today.
So a lot of times, you'll want to mirror the data in, because you actually want a non moving snapshot or you wanna reduce the pressure. So a lot of people start with the live connection and then move to the other one when they realize the cost or the pressure on their database. Moving can be really, really complicated, though. Like, the tools there are not good. Even the best, state of the art tools are very challenging to use.
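The pattern being described, sketched as Trino SQL: facts live in an Iceberg catalog on object storage, dimensions live in a production MySQL, and one query joins across both catalogs. The catalog, schema, and column names here are hypothetical:

```python
import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="demo")
cur = conn.cursor()

# Live federation: the MySQL side of the query is pushed down to MySQL
# in its own dialect, while the fact table is read from the lake.
cur.execute("""
    SELECT d.age_range, count(*) AS ad_clicks
    FROM iceberg.analytics.ad_clicks AS c
    JOIN mysql.app.user_demographics AS d ON c.user_id = d.user_id
    GROUP BY d.age_range
""")
print(cur.fetchall())

# Mirroring instead: a frozen snapshot in the lake, which also takes load
# off the production database. Re-run on whatever schedule you need.
cur.execute("""
    CREATE TABLE iceberg.analytics.user_demographics_snapshot AS
    SELECT * FROM mysql.app.user_demographics
""")
```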
[00:39:14] Unknown:
Absolutely. Digging a little bit deeper in there, I'm wondering if there are any other differences that you see in terms of the overall pipeline design, access, and usage patterns that folks are building around their usage of Trino and Iceberg as compared to maybe a warehouse or some of the other lakehouse compositions that you've seen?
[00:39:35] Unknown:
So the data warehousing space, I think, in general, is developing in two different directions, especially in the open data lakes. There's a large swath of people that are using something like dbt to do step by step transformations. And there is a movement towards materialized views, where you just say, I want a materialization of this query, and here's the policy for keeping it up to date. A lot of people think they're equivalent; they are not. Materialized views are about: when you're querying one, it's supposed to be the equivalent of just running the underlying query, and so the data changes underneath you. Whereas pipeline data has the advantage and disadvantage that, typically, you're processing on, let's say, a daily or an hourly basis. If the query changes, or something changes in the pipeline, it's only future data that's affected, which is good and bad depending on what you're trying to accomplish. So I think that's an important split that's happening in the open community, and I'm curious to see which one's gonna win.
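The two directions, side by side as Trino SQL (all names hypothetical). The pipeline style materializes a step explicitly, which is roughly what a dbt model compiles down to; the materialized view declares the query plus a freshness policy and is meant to be query-equivalent to its definition. Newer Trino releases support a GRACE PERIOD clause on connectors such as Iceberg, so treat this as a sketch against a recent version:

```python
import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="demo")
cur = conn.cursor()

# Pipeline style: an explicit, scheduled transformation step.
cur.execute("""
    CREATE TABLE iceberg.analytics.daily_clicks AS
    SELECT date(clicked_at) AS day, count(*) AS clicks
    FROM iceberg.analytics.ad_clicks
    GROUP BY date(clicked_at)
""")

# Materialized-view style: the engine owns freshness; the grace period
# bounds how stale a served result is allowed to be.
cur.execute("""
    CREATE MATERIALIZED VIEW iceberg.analytics.daily_clicks_mv
    GRACE PERIOD INTERVAL '1' HOUR
    AS SELECT date(clicked_at) AS day, count(*) AS clicks
       FROM iceberg.analytics.ad_clicks
       GROUP BY date(clicked_at)
""")
cur.execute("REFRESH MATERIALIZED VIEW iceberg.analytics.daily_clicks_mv")
```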
In terms of open data lakes versus proprietary ones, the biggest difference is that people don't keep all their data in the proprietary systems. It's just too expensive, or it's too complicated to move it all in. Whereas, normally, people are storing all their data in S3, whether it's a data lake or not, because it's cheap and they can have a backup. But you don't keep all your data in Snowflake, because it's either too expensive or it's too much of a burden to keep all the feeds to load it into their format. You see the same thing with Redshift and basically everything else out there. Even if it were free, it's still just annoying.
[00:41:21] Unknown:
And then another consideration that folks have when they're deciding whether or not they wanna use a lakehouse approach is that sometimes they have queries that need to operate very quickly, and so that's where they'll typically bring in something like a ClickHouse or a Druid when they're dealing with, you know, fast moving data that needs to be updated quickly. And I'm wondering what you see as some of the decision points around going wholesale into one of those systems versus using those as a supplement to a Trino and Iceberg setup?
[00:41:53] Unknown:
Yeah. So my experience with those systems is that they're limited in their capabilities, so they're almost always used with a custom application, especially in the case of Druid, where it's not standard SQL at all. It's very powerful, but your application is basically custom written to it, so you're not typically using it for general analytics. And if you're in that space, you end up having a lot of choices of different things you can do. So in terms of fast moving data, I think the open data lake is getting better at this very fast.
I think that's a thing everyone's focusing on. So with Iceberg, you now have the appending stuff that came in, what, two or three years ago. You see more and more people using tools to take data off of event streams like Kafka and land it into tables at high resolution, and then having background compaction jobs to deal with the insane number of files you create. And then downstream of that, there are a bunch of vendors and open source projects working on, okay, now we have this new data, how do we integrate that into the computations? I would guess within a couple years, you're gonna see everyone building something around this. You know, it'll be like everything else: a lot of them will be bad, but I think the overall community is going to move more and more toward bringing in data in near real time and being able to have it manipulated in a near real time feed. That said, that is near real time. Getting down to milliseconds, anything under, like, thirty seconds, typically means you have a custom engine where, as you're bringing the feeds in, they're going into main memory and being held in memory. You can't even get them to a distributed disk; it's not fast enough. Those, I think, will continue to be fairly proprietary systems. They're kinda complicated to write. So that's where you'll see the few vendors in that space. My experience is that most people don't need anything short of a minute. It's very rare to see that. The reason you see, like, Hudi came out of Uber is because they were using their real time system to adjust pricing on the fly. Well, how many organizations have that problem?
None. Outside of, like, delivery services, those are the only people I know who use those systems.
[00:44:30] Unknown:
Absolutely. And as somebody who has been working in this space for a number of years, as somebody who is building and investing in the lakehouse architecture paradigm, you're very deeply entrenched in that ecosystem. What are some of the most interesting or innovative or unexpected ways that you have seen Trino lakehouses applied?
[00:44:51] Unknown:
So the most interesting cases almost always are custom applications. I've seen so many standard warehouse setups that they all kind of blend together and are less interesting. Where it becomes really interesting is when someone builds a custom application, especially in Trino, if they're building a custom data store to match. So you have things like companies that run big CDNs and stuff like that, building a custom data store that hooks directly into their CDN and can show the live data feeds, and, like, security systems where you're hooked into the live security views, or ad systems. Like, we built a bunch of these at Facebook for hooking into the live ad system, live A/B testing, where you have a custom data store that's specifically tuned to a problem, with indexes that are for petabyte scale data.
You can do really, really powerful things with Trino because of the way the query engine is extensible. You can add new types and functions and all sorts of stuff into it, and end up with extremely responsive systems that do really custom things at big scale. That said, you need a team of highly skilled engineers to build something like that, which is worthwhile if this is your entire business. I think the more common interesting thing is ingesting data and setting it up and getting a bunch of people running their queries, which is pretty mundane. But the power of giving your people access to data, and their ability to make better decisions, is just, like, night and day.
[00:46:36] Unknown:
And in your experience of building these systems, working with customers, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process of working in this data lakehouse ecosystem?
[00:46:48] Unknown:
I think the most frustrating thing is you run into different requirement viewpoints on things. So it's like, you think you understand what people are interested in, and you start building that. And then someone comes along, and they're like, no, I actually am very interested in the opposite direction. So we had a bunch of people that were interested in, like, I don't care what the file formats are, I just want this stuff to go really fast. You have this advantage of your ability to move faster and build really custom things if you can change anything you want at any time. It's actually a huge advantage that the big proprietary vendors have. Well, once you get to scale, you can't really do that. But in the early days, it's very fun. You can move very fast. But at the same time, in our space, the reality is we are in this open data space. So if I extend stuff and no one uses it, I'm no longer in that space.
So it's often challenging to figure out how we thread the needle of actually making things a lot better without stepping outside that bound. So, like, we're doing a lot of work around Iceberg and Iceberg maintenance, and we spent a lot of time thinking about, hey, in Starburst, should we be just pulling this into our separate space? And then maybe we're not even using Iceberg manifest files. Maybe we're using something else, like a transactional database, and then I can do indexing in ways that are impossible right now. And we decided that, no, we're in the open data lake space, so we gotta figure out how to do it in the open format. Sometimes we have augmented data in special fields or sidecar files or that sort of thing to be able to give us the additional information that we need to make things go faster. Sometimes you get on Slack and you hit up Ryan Blue, and you're like, yeah, how about we just add some stuff into the spec to be able to handle this? Like, I'm sure everyone has this problem. So that's the: I wanna move faster, but I can't. It drives me nuts when I know there's a better solution, and I can't do it without breaking things and making the thing proprietary.
And then, you know, even then, I have to, like, wait for others to catch up.
[00:49:18] Unknown:
Absolutely. For people who are in the process of designing their data systems, or who are looking to build a new set of capabilities in their data platform, what are the cases where a lakehouse architecture is the wrong choice?
[00:49:33] Unknown:
So I would say a few years ago, this answer was a lot easier. I think nowadays, the open data lakes are very good. Where I think it's helpful with some of the vertically integrated players is you don't have to understand a whole lot. Again, you show up and you just use the tool. And I think that's where data lakes suffered, back to the original Cloudera stuff. If you ever tried to install it, they had, like, 10,000 choices of different tools to install. It's like, I just wanna work with my data. Their entire idea was choice, and that was the worst part about their product. It was too much choice. I think we've done a great job at Starburst around simplifying getting started on your lake and getting going in your lake. You also had this problem in the past where, I would say, there's a lot of people who feel like they need to use a lake, or a data warehouse in general, because they heard about it, and they don't actually have a data warehousing problem. Like, they could just use Postgres; they don't have a lot of data. Also, we see a lot of people that wanna do federation, and they don't understand that federation is: we just send queries to the other system. And they're like, well, it'll make my stuff faster. So I don't think we've done a great job of describing when you would even choose to move to a data warehouse. And then in terms of proprietary versus not, it's a tough choice. They can get you going, but they can be very expensive, complex to manage, and you're bolted into that thing. Like, I don't know if you've ever seen someone try to move from a traditional warehouse to an open one. It's not super easy. I don't wanna say it's hard. Like, we do a lot of business moving people onto the lake, but it would have been a lot easier if they had started on the lake.
[00:51:23] Unknown:
Absolutely. And as you continue to build and iterate on the Trino platform and the Starburst product, what are some of the things you have planned for the near to medium term, or any particular projects or problem areas you're excited to explore?
[00:51:36] Unknown:
So on the open source side, there's a bunch of stuff I'm very interested in around how we can spin people up on Trino in a faster and easier way. We're doing more around simplifying the setup, simplifying the installation process, making it work in smaller environments, and building better integrations with the different ecosystems. I wanna see much more work done on integrations with the Python ecosystem in particular. One of the big areas I've been focusing on recently is how you actually set up Trino. Historically, Trino was designed and operated assuming you had a data lake with Hive in it, and nowadays Spark in it, and you were adding Trino because both of those query engines are really slow and not particularly good to use. Now we're at the point where a lot of people just run Trino, and they don't have Hive or Spark. So there were things we assumed would already exist because you had those other tools, and now we're going back and adding them: things where, normally, you would have just fired up the Hive console and run some commands, and you just don't have that anymore.
So, another big area: you set up Trino, and you wanna set up a new catalog. In the old days, you knew what you wanted to connect to because you already had a data lake, so you'd create this little catalog file, modify it, and restart your server until things worked. Well, that's just not how people do it anymore. Now they fire up Trino, and it's like, okay, I wanna connect to my S3. Why do I have to go edit a file? Why can't I run a SQL command? So we recently added a bunch of stuff around CREATE CATALOG and DROP CATALOG, and there's still more to be done, like ALTER CATALOG. Right now, it's still just modifying local files under the covers, but we have some work underway on putting it into a real database.
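To illustrate the shift being described, here is roughly what the old and new workflows look like. This is a sketch rather than copy-paste configuration: the property names depend on the connector, and the SQL form requires Trino's dynamic catalog management to be enabled.

    -- Old way: write etc/catalog/lake.properties by hand, e.g.
    --   connector.name=iceberg
    --   hive.metastore.uri=thrift://metastore:9083
    -- ...then restart every node in the cluster.

    -- New way: manage catalogs at runtime with plain SQL.
    CREATE CATALOG lake USING iceberg
    WITH ("hive.metastore.uri" = 'thrift://metastore:9083');

    DROP CATALOG lake;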
So it's funny. You think about this in the Trino ecosystem, and it's like, what do you mean you're not storing your catalogs in a normal catalog system? Well, we never needed to. And at Starburst, with Galaxy, we've had this from the beginning: you go to the UI, you just modify your catalogs, and everything's kind of live-ish. It's getting even more live with these changes we're putting into Trino, so you'll be able to add and remove catalogs a lot more easily and maintain your system, plus a bunch more stuff around things like data evolution. So I'm excited about this. How do we bring more people into this community? Because I think we're very much at the point where the gap between what I can do in a traditional data warehouse and what I can do in Trino is much, much smaller. When we started Trino, we said, we're gonna be able to take out traditional data warehouses with this; we're going to build something that's as good as that. We're 10 years in, and for the vast majority of cases, we've been able to take them out for years. But this new-user case, I think, is the one remaining spot. When we started this project, we said it was gonna take 10 years. I think we're there. We just need a little bit more, and I think we'll have covered pretty much everything, all the way down to a new user with a couple of files they wanna process.
[00:55:13] Unknown:
It's funny how persistent that 10 year time horizon is. Pretty much every time I talk to somebody who has built or is building a database engine, they say it takes 10 years before you get it right.
[00:55:27] Unknown:
Yeah. The other thing they don't say is that it kinda takes 5 years before, you know, it kinda doesn't suck. It was pretty good, but, like, we didn't have the ability to write tables for the first year. Whatever. We've got data, we've got Hive, it's writing data for us, we'll just run queries that select the data out. So the amount of work to go from, oh, this is actually interesting, it kind of works, to, I can use it everywhere,
[00:55:55] Unknown:
is, like, something people have no idea about. Absolutely. It's amazing how many products have been built because the person building it didn't realize how hard it was going to be.
[00:56:05] Unknown:
Yeah. Honestly, I think that's almost every project I work on. If I knew how hard it was going to be, I probably wouldn't have
[00:56:14] Unknown:
started. And are there any other aspects of the work that you're doing on Trino, and this overall space of the data lakehouse ecosystem and the combination of Trino and Iceberg, that we didn't discuss yet that you'd like to cover before we close out the show?

I think we actually covered all of it.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

I really, really think we need a big improvement in the security space. And I don't really care what it is, other than it needs to work well with things like
[00:56:56] Unknown:
Trino. And the maintenance: the amount of complexity you have to go through to set those policies, where you have to learn a new language, is way too complicated. And frankly, even if you do learn the language, you're gonna get the policies wrong because you're not an expert at it. The models are just too complex. The other space is getting data into the lakes; I still think it's too hard. It just needs to work: data lands and gets maintained, and you shouldn't have to think about it. It should always work and be low cost, and data just shows up. Why do I have to worry about, you know, all the feeds?
[00:57:35] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you and your team have been doing on bringing the data lakehouse ecosystem into a better place, and all the work that you're doing to build the Starburst product. It definitely makes the onboarding a lot easier for folks, so I definitely like the work that you and your team are doing there. So thanks again for taking the time, and I hope you enjoy the rest of your day.
[00:58:02] Unknown:
Thank you. This was great.
[00:58:05] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview
Guest Introduction: Dain Sundstrom
Defining the Data Lakehouse
User Experience and Challenges in Data Lakehouses
Trino and Iceberg: Benefits and Comparisons
Building a Data Platform with Trino and Iceberg
Vendor Integration and User Experience
Pipeline Design and Usage Patterns
Innovative Applications of Trino Lakehouses
Lessons Learned in the Data Lakehouse Ecosystem
When a Lakehouse Architecture is the Wrong Choice
Future Plans and Areas of Investment
Closing Remarks and Contact Information