Summary
A data lakehouse is intended to combine the benefits of data lakes (cost-effective, scalable storage and compute) and data warehouses (a user-friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offers the ease of use and execution speed of data warehouses with the infinite storage and scalability of data lakes.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Join in with the event for the global data community, Data Council Austin. From March 26th-28th 2024, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council today.
- Your host is Tobias Macey and today I'm interviewing Dain Sundstrom about building a data lakehouse with Trino and Iceberg
Interview
- Introduction
- How did you get involved in the area of data management?
- To start, can you share your definition of what constitutes a "Data Lakehouse"?
- What are the technical/architectural/UX challenges that have hindered the progression of lakehouses?
- What are the notable advancements in recent months/years that make them a more viable platform choice?
- There are multiple tools and vendors that have adopted the "data lakehouse" terminology. What are the benefits offered by the combination of Trino and Iceberg?
- What are the key points of comparison for that combination in relation to other possible selections?
- What are the pain points that are still prevalent in lakehouse architectures as compared to warehouse or vertically integrated systems?
- What progress is being made (within or across the ecosystem) to address those sharp edges?
- For someone who is interested in building a data lakehouse with Trino and Iceberg, how does that influence their selection of other platform elements?
- What are the differences in terms of pipeline design/access and usage patterns when using a Trino/Iceberg lakehouse as compared to other popular warehouse/lakehouse structures?
- What are the most interesting, innovative, or unexpected ways that you have seen Trino lakehouses used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data lakehouse ecosystem?
- When is a lakehouse the wrong choice?
- What do you have planned for the future of Trino/Starburst?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- Trino
- Starburst
- Presto
- JBoss
- Java EE
- HDFS
- S3
- GCS == Google Cloud Storage
- Hive
- Hive ACID
- Apache Ranger
- OPA == Open Policy Agent
- Oso
- AWS Lake Formation
- Tabular
- Iceberg
- Delta Lake
- Debezium
- Materialized View
- Clickhouse
- Druid
- Hudi
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Data Council: ![Data Council Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/3WD2in1j.png) Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit [dataengineeringpodcast.com/data-council](https://www.dataengineeringpodcast.com/data-council) and use code **dataengpod20** to register today! Promo Code: dataengpod20
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png) Data teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. Dagster is an open-source orchestration solution that helps data teams reign in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to [dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast) today to get your first 30 days free!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open source, cloud native orchestrator for the whole development life cycle with integrated lineage and observability, a declarative programming model, and best in class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise class hosted solution that offers serverless and hybrid deployments, enhanced security, and on demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started, and your first 30 days are free. Data lakes are notoriously complex.
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey, and today I'm interviewing Dain Sundstrom about building a data lakehouse with Trino and Iceberg. So, Dain, can you start by introducing yourself?
[00:01:48] Unknown:
Well, I am Dain Sundstrom. I am one of the founders of Trino, and Presto before that. I am CTO at Starburst. I've been working in the data lake space for about ten years now. Before that, I worked at some other startups. And before that, I was one of the original people at JBoss and spent a lot of time in Java EE and that sort of space. And do you remember how you first got started working in data? My background mostly was distributed computing. So out of college, I started working at UnitedHealthcare on distributed computing using Intera DCE in the nineties, and then switched to Java EE back when it was called something else. And as part of that, I wrote the object relational mapping tools for JBoss.
Then, eventually, fast-forwarding a long time, I started working at Facebook, and one of the original projects from the head of infrastructure was to come up with a faster, better way of interacting with their large data warehouse at the time. So this is, like, ten years ago, and it was, I don't know, 300 or 400 petabytes or something. It's dramatically bigger now, and they didn't have a team to do it. And myself and David Phillips and Martin have extensive backgrounds in Java and databases and stuff like that. So we were available, and we started working on it. But I'm mostly a distributed computing person, so I wrote most of the distributed computing parts of Trino. Whereas Martin's a deep language person, so he did a lot of the language optimizations, and David is deeply into databases, has been forever, and so built a lot of the database parts and the tooling and things like that.
[00:03:43] Unknown:
As an outgrowth of that effort, along with a number of other contributions to the ecosystem, we have landed in this space where we have a new architectural paradigm for analytical systems that is largely phrased as the data lakehouse, a midway point between data lakes and data warehouses. And for the purposes of this conversation, I'm wondering if you can give your definition of what constitutes a data lakehouse.
[00:04:07] Unknown:
It's a really good question because I think people play fast and loose with it. So historically, I would say a data lake is you have traditional external storage, so HDFS is generally what people are talking about. But nowadays HDFS is so rarely used. It's almost always some cloud object storage: S3, GCS, Azure stuff. So definitely, all the data is stored in that. And then I think the important part comes with a lakehouse of talking about standard data representations. So, like, you can be a vendor and store all your data in, you know, S3, but it's proprietary stuff. And proprietary, I'm just gonna define as you're the only one who really implements it. I don't care if you have an open spec or whatever. Like, it doesn't matter. If you're the only serious player in it, it's effectively proprietary.
So where I think about it now, it's object storage, and it's doing it in the lake. So it isn't like, oh, I take the files and then I import them into my special proprietary format, and then I process them and then I dump the data back out. That's the lake as a sidecar to you. It's when you're doing transformations, when you're doing data maintenance, the data is operated on directly, with the lake being your native form. Everything else is, you know, a bolt on, which is not to say it's terrible. It's just a different thing.
[00:05:33] Unknown:
Absolutely. And another interesting aspect of the idea of the data lakehouse is that the reason for framing it as such is that it intends to add a lot of the user experience benefits that you get from a fully vertically integrated database system, such as data warehouses, whether that is an actual vertically integrated system as of the days of yore, or a cloud native system where compute and storage are disaggregated but still presented as a single unified experience. And I'm wondering if you can talk to some of the ways that we have actually, as a community, hit that mark, and what are some of the areas where we're still falling short of that cohesive platform experience, the parts where the gaps still show through and you can see that it's actually five different pieces that are trying to work together.
[00:06:25] Unknown:
Yeah. I think we've done an okay job. I think we've got a long ways to go, though. If you had asked me this question three years ago, I would have just gone on and on about the litany of broken, weird tools that exist in the lakehouse. I think things are starting to get better as people realize, and this isn't so much the community of users, it's the community of the people implementing and maintaining the systems, where I think we've now started to figure out that this paradox of choice is not a good thing. So before, we had Hive, and there were five competing data formats, and then that narrowed down to two. And then everyone realized that what Hive was doing was really bad and not sustainable.
And you could have two different tables next to each other that are maintained in completely different ways and have different type systems and different schema evolution and so on. Like, I can go on and on about the edges of it. So I think Iceberg came along and said, hey, we're just gonna come up with a format for tables. It includes how tables move, how they're evolved, how they're managed, and covers a whole plethora of things, including data types and how partitioning works and stats now and views and so on, as a written down standard. Before, it was just the wild west. Like, literally, someone would check something into Hive and invent an entire new system. Spark does this all the time. Like, okay, let's implement Spark bucketing v2, which is different than everything else. And if you wanna know how it works, go read the Spark code, because some person just showed up and everyone's like, yeah, that's cool. So I think we've gotten a lot better: data in tables, the type system, that sort of thing is now fairly standardized and well understood. That said, Iceberg did it, and then immediately Databricks came along and dropped a competing product, which is kind of half finished. And so now I get to implement two. And now there's more of these coming along, and I'm hoping that this time around we consolidate onto one very quickly, because it's really kind of a mess. What happens is people like us in the Trino community have to implement all of these. And we only have so many people, so it's like we implement one really well and the rest suffer, or we implement all of them kind of okay. So it's difficult. Like, right now, there are enough people that I think we're maintaining three of them. Hive ACID died, and that's one of n tools. And we can have the same conversation about security. We can have the same conversation about, I don't know, there's lots of these areas.
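To make the "written down standard" point concrete, here is a minimal sketch of what table creation and schema evolution look like through the Trino Iceberg connector, using the `trino` Python client. The coordinator address, catalog, schema, and table names are illustrative assumptions, not anything from the episode.

```python
import trino

# Hypothetical connection details; assumes a running Trino coordinator
# with an Iceberg catalog named "iceberg" and a schema named "analytics".
conn = trino.dbapi.connect(host="localhost", port=8080, user="demo",
                           catalog="iceberg", schema="analytics")
cur = conn.cursor()

# Partitioning is declared in table metadata per the Iceberg spec,
# not inferred from a directory layout the way Hive did it.
cur.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        user_id BIGINT,
        url VARCHAR,
        viewed_at TIMESTAMP(6)
    )
    WITH (partitioning = ARRAY['day(viewed_at)'])
""")

# Schema evolution is a metadata-only operation defined by the spec, so
# every engine that implements Iceberg agrees on what the table looks like.
cur.execute("ALTER TABLE page_views ADD COLUMN referrer VARCHAR")
```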
[00:09:22] Unknown:
Absolutely. So I personally am actually using the lakehouse architecture for my platform. For the sake of transparency, I am using Trino. I'm using the Starburst managed Galaxy, so I'll get that out of the way. I'm using the Iceberg table format, which is largely transparent. I don't have to do a lot on the actual table format piece because Trino handles that piece of it for the most part. And so as somebody who's using the lakehouse paradigm, there are definitely a lot of niceties. I agree it's gotten a lot easier over the past couple of years than it was prior to that. A lot of the conversation seems to have cohered along a roughly standardized conception of what constitutes the lakehouse.
I do think that one of the areas that is still unfinished, or at least not as cohesive across the board, is that question of security and access control. That seems to be one of the areas where the overall data ecosystem has not yet figured things out. Everybody has their own thoughts on how it can and should be done. Everybody wants to own that experience. There aren't a lot of methods for being able to communicate roles and access across the layer boundaries, and I'm wondering if you can talk to some of the ways that that manifests in terms of that overall experience as a juxtaposition to the warehouse, where everything is presented as one system.
[00:10:44] Unknown:
Yeah. So as one of the people who's written, I don't know, a huge portion of the security systems in Trino and in Galaxy, it's actually a really hard space to be in. So, like, if you look into the open ecosystem, and throughout this whole thing we're kind of talking about the open ecosystem, the open ecosystem for security, historically, you had the Hive metastore with its security. Well, the most popular metastore out there is Glue, and it doesn't have the Hive security model. And the Hive security model was always weird and only applies to Hive. Trino is a federated system, so that doesn't make much sense.
Ranger pretty much died. I haven't seen it around in a while. Like, there are people still kinda looking at it, but I get a sense for how popular things are by when people ask about things, and, like, two or three years ago it just kinda fell off a cliff. And the only other thing I've seen out recently is OPA, which the Bloomberg folks have been working on and really like. OPA is really complicated. You write security rule policies in a security rule server in a custom language. I literally looked at the language, and I was like, if I did this, I would write a tool to write the language policy files for me. It's very complicated.
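For a sense of what the OPA model looks like from the engine's side: the engine ships an "input" document describing the user and the attempted action to an OPA server over HTTP and gets back an allow/deny decision. This is a rough sketch of that exchange in Python; the policy path and input fields are invented for illustration, not taken from Trino's actual OPA plugin.

```python
import requests

# Hypothetical policy path on a local OPA server; real deployments configure
# the engine with the endpoint of the policy they want consulted.
decision = requests.post(
    "http://localhost:8181/v1/data/trino/allow",
    json={
        "input": {
            "action": "SelectFromColumns",
            "resource": {"catalog": "iceberg", "schema": "analytics",
                         "table": "page_views"},
            "identity": {"user": "alice"},
        }
    },
    # Access decisions sit on the hot path of every query, hence the
    # "millisecond level" requirement mentioned in this conversation.
    timeout=0.05,
).json()

allowed = decision.get("result", False)
```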
So I think that's got a long ways to go. Hopefully, someone builds a UI and tooling and stuff for it. So that's really all you have in the open space. In proprietary, you have AWS's Lake Formation, which, like, I seriously have yet to meet someone who's rolled it out. It just looks weird. We'll see what happens. Again, I'm hoping it dies. Like, every one of these things that's successful, we have to build and maintain. So I'd like one, and I'd like it to be open. Databricks has their own proprietary thing. At Starburst, we have our own proprietary thing. I think Tabular has their own proprietary thing.
You end up with proprietary things because of the complexity of the security system. So, like, in Galaxy, we built the security system into the core of Galaxy itself. Galaxy is the Starburst hosted version of Trino. So every screen you're looking at in Galaxy is viewer aware, and we're applying your policy on what you're allowed to see, and it's really core to the whole application. It touches, like, every single bit. So how do you put that in? And then you're like, oh, I'm gonna make this call out to a third party system, and I need to know when it changes, but this is something I need to be able to do on, like, a millisecond level. And so security is a super hard problem. Also, everyone has different viewpoints about how security should work. In Galaxy, we followed a very traditional database security system with roles and access controls, etcetera.
In other systems, there's different viewpoints. It's very interesting. Like, OPA is this different universe of policy rule systems. So I don't think we have a good answer for this right now in terms of, like, a community. And I think this is one of the things that actually is a reason why you would choose a vendor: their security implementation aligns with what you wanna do. Yeah. The security and policy space is definitely still very much in flux, in particular in the lakehouse ecosystem, but even beyond that. So OPA is a tool that came out of largely the Kubernetes ecosystem
[00:14:16] Unknown:
and is being applied to a number of different areas because it is a generalized policy language. There's another project called Oso, which is an open source policy engine that has its own policy language again, so that you can have the policy agent embedded in process in various language runtimes, and then you can define those policies out of band and apply them to the runtime dynamically. So I think that is an interesting approach, and maybe something, whether it's Oso or OPA or one of the other tools in that ecosystem, might start to make inroads into the data platform ecosystem as well. And then you also have identity systems like Keycloak or Okta or Auth0, etcetera, that also factor into all of that. So it's a big, complicated space. I think part of the problem here is what are we optimizing for? So, like, OPA
[00:15:05] Unknown:
and Ranger, which is just another policy system, were great if you're an admin and you wanna lay down the rules broadly for lots of tables by using table matching. But SQL security was really built around: I create a table. I type commands to grant access to other folks in the platform. I may create views or, you know, filter rules or something like that. And I'm just typing commands to do that in the SQL language. And that SQL language is the language of the system I'm in. So that's a system that's optimized for end user experience, not admin experience. And the admin experience is great if you're a bank. OPA in Trino came from Bloomberg, and it's like, they have a lot of data, and they have data policies they need to apply broadly. But if you're a small group and you want to have a security system, like, do you even have people that can write this complicated thing? Can you run an OPA system that's gonna return responses in milliseconds, because it's part of, like, every query?
No. And, really, you wanted the system to work in a simple, understandable way for end users. So a lot of the stuff in data lakes is provided by big companies with big company solutions to big company problems, and it does not align with, hey, I wanna grant access to this table to some other person.
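For contrast, the end-user SQL security model being described here is just commands typed in the system you're already in. A minimal sketch, assuming a connector that supports SQL grants, an existing "analysts" role, and the hypothetical table from the earlier sketch:

```python
import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="owner",
                           catalog="iceberg", schema="analytics")
cur = conn.cursor()

# Table-level sharing, typed by the table owner in the engine itself.
cur.execute("GRANT SELECT ON page_views TO ROLE analysts")

# Finer-grained sharing by exposing a filtered view instead of the raw table.
cur.execute("""
    CREATE VIEW page_views_recent AS
    SELECT user_id, url, viewed_at
    FROM page_views
    WHERE viewed_at > current_timestamp - INTERVAL '30' DAY
""")
cur.execute("GRANT SELECT ON page_views_recent TO ROLE analysts")
```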
[00:16:35] Unknown:
Absolutely. And in the data lake and lakehouse ecosystem as well, there's the added complexity that by virtue of the storage and the compute being disaggregated, you maybe want to bring a different compute to that same storage. And so then there's the question of, okay, well, do I need to route all of my requests through the other compute engine that has my policy information? Do I have to have different policy sets and different rule sets across those different compute systems?
[00:17:02] Unknown:
It's actually worse than that, too, because outside of Trino, the most popular compute engines are MapReduce-y things like Spark and Hive. And the problem is that those engines almost always allow users to upload their own untrusted third party code into the same process. And that means that you can't rely on the process to be secure, to protect against data access and stuff like that. So the Spark and Hive communities are pushing for things like column level encryption and physical security based on file permissions, which is anathema to the way SQL works. This would be the equivalent of, oh, I'm gonna manage my MySQL permissions by setting file permissions in the UNIX file system. It's insane. Right? And this is, like, state of the art, and it's because the entire industry went down this MapReduce path for fifteen years, and it's not a good idea. Like, you see every single vendor who's working in the data space has moved away from MapReduce. Yes, Spark still uses it, but when you get into high performance stuff, everyone has moved away from MapReduce.
It's just not a thing you do anymore. And we're still building our security systems to, like, the lowest common denominator.
[00:18:32] Unknown:
And so taking a step back now from ragging on the complexities of security, bringing it back around to Trino and Iceberg, and maybe keeping it in the context of security, what are the benefits that that particular pairing provides, maybe in juxtaposition to other technology stacks or vendors that purport to provide a data lakehouse experience?
[00:18:56] Unknown:
Today, I think the folks talking about the "data lake experience", and I'm using that in quotes, kinda break down into two camps. You have folks who have a traditional data warehouse that can pretend like it's in the data lake. That's almost always done by: you run a query, it loads the data into Snowflake format, they run their query, and then they throw the data away or they cache it or something like that. But they don't actually execute directly on the lakehouse data. So that's one camp. And then the other camp would be, obviously, you have the Iceberg camp, and then you have the Delta Lake camp, which is similar.
I have my bias. My bias is absolutely towards Iceberg. I was pretty unhappy when Delta Lake actually came out. It's unfortunate. Like, we had this brief moment where it looked like the entire ecosystem was gonna move onto Iceberg, and we would only have one thing to implement, not, like, five. And then Databricks dropped their format. And in my experience, the only people using it are Databricks customers, but they have a lot of customers. And so everyone is having to implement it because Databricks made it the default format for their customers, when, honestly, their customers would be just as happy with Iceberg. So now we all get to build twice. And, yeah, it's got a community, but it's not the same thing as it being an Apache community. But even then, if there were two Apache projects, I'd be annoyed also. And then there's other groups that are trying to build stuff. So Trino and Iceberg, I think, combine, in my opinion, the best analytics query engine we have available with the current best storage format.
Because Iceberg without Trino is, like, great, I have a storage format, but how do I query it? How do I interact with and change and produce these files? It's nice, but you're still suffering the problems of some of the other engines. And Trino, on the other hand, provides this great query engine that's adaptable. Like, Trino has the ability to add in custom data types. We have direct readers for everything. So we can actually build an engine that's really, really tightly set up for what Iceberg can do.
And we can do that in a way where you get really, really great performance. So what Trino was suffering from until Iceberg came along was that the data formats weren't particularly good. And so you would have performance problems. You would be missing stats. You know, most of the data formats and the way Hive worked were actually designed for HDFS, which has a very specific performance profile that S3 does not have. Like, listing files is great in HDFS and insanely slow in S3, and Iceberg doesn't require listing files. There's a whole bunch of things like that where Iceberg was designed to deal with the performance characteristics of object storage, as opposed to HDFS's design, and, I mean, hardly anyone uses HDFS anymore. So Iceberg gave us a really stable format with a well run community that likes specs, that understands the performance of modern things.
And we were able to work really closely with them and build a query engine that's really tuned. The integration we're doing with Iceberg is fundamentally designed for Iceberg. It isn't a bolt on. It's not like we took Hive and swapped out a little bit. We wrote a custom plugin just for Iceberg that does exactly what Iceberg wants.
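One concrete payoff of that design: because planning reads Iceberg's manifest metadata rather than listing object storage prefixes, Trino can also expose that metadata as ordinary queryable tables. A sketch against the hypothetical table from the earlier example:

```python
import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="demo",
                           catalog="iceberg", schema="analytics")
cur = conn.cursor()

# Every data file, with row counts and sizes, comes from manifests,
# not from an S3 LIST call.
cur.execute(
    'SELECT file_path, record_count, file_size_in_bytes FROM "page_views$files"'
)
for path, rows, size in cur.fetchall():
    print(path, rows, size)

# The snapshot log that makes time travel (and snapshot retention) possible.
cur.execute(
    'SELECT snapshot_id, committed_at, operation FROM "page_views$snapshots"'
)
print(cur.fetchall())
```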
[00:23:14] Unknown:
Are you sick and tired of salesy data conferences? You know, the ones run by large tech companies and cloud vendors? Well, so am I. And that's why I started Data Council, the best vendor neutral, no BS data conference around. I'm Pete Soderling, and I'd like to personally invite you to Austin this March 26th to 28th, where I'll play host to hundreds of attendees, 100 plus top speakers, and dozens of hot startups on the cutting edge of data science, engineering, and AI. The community that attends Data Council are some of the smartest founders, data scientists, lead engineers, CTOs, heads of data, investors, and community organizers who are all working together to build the future of data and AI.
And as a listener to the Data Engineering Podcast, you can join us. Get a special discount off tickets by using the promo code dataengpod20. That's dataengpod20. I guarantee that you'll be inspired by the folks at the event, and I can't wait to see you there.
[00:24:14] Unknown:
And when somebody is building a data platform or building their warehouse implementation, they decide, okay, this combination of Trino and Iceberg does what I want. I have the benefits of a performant query engine. I have the flexibility and scalability of object storage. I can scale those two things independently. How does that influence the other upstream and downstream choices that they might make for the other components of their data platform?
[00:24:42] Unknown:
So once you decide you're gonna go with Iceberg and Trino, you have the complexities of, like, how do I actually get my data into these platforms? The bootstrap problem is a really big problem in data warehousing in general. It's, like, how do I get my data in? In general, since Iceberg has become so popular, a lot of tools are adopting it, so actually getting your data in is less of a problem. But you definitely wanna go and look at the vendors you're gonna use for landing the data into your S3 bucket and make sure they support Parquet at the very least, and Iceberg, hopefully. And if they're not supporting it, when are they gonna support it? Because most of them have it on their roadmap, unless they're Databricks. Actually, even Databricks is starting to add Iceberg support. So make sure your vendors actually support landing data in Iceberg format. Then in terms of other choices, you obviously have things like, how is security gonna work? How is data maintenance gonna work? Iceberg tables require maintenance on them. And depending on how you're importing data, they may require compaction, and you wanna keep only so much snapshot data, because they have the ability to query historic data, but that means you're holding historic data, which could be expensive. So there's a bunch of maintenance things, and you're gonna have to choose a tool that supports the maintenance. Many of the platforms, like Starburst, are integrating all of this stuff into the platform because we wanna create the simplest experience for people. Like, we don't want them to have to go and integrate with a third party tool to run some compaction jobs. Then I think there's additional things around, like, you're probably gonna use some sort of data transformation pipeline tool. It's almost always dbt.
I don't even know if they have competitors, honestly. Yeah. And then, obviously, you're gonna want some sort of BI tools. Most of them are supporting Trino, or Starburst, or both today. So there isn't much of a choice reduction there. But I think the big things are data ingest, getting it into Iceberg, and maintaining those files; those are currently a big part of the platform choices.
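The maintenance chores mentioned here map onto specific commands in the Trino Iceberg connector. A sketch of what a scheduled maintenance job might run against the hypothetical table from earlier; the thresholds are arbitrary illustrative values:

```python
import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="maintenance",
                           catalog="iceberg", schema="analytics")
cur = conn.cursor()

# Compact the many small files produced by frequent ingestion.
cur.execute(
    "ALTER TABLE page_views EXECUTE optimize(file_size_threshold => '128MB')"
)

# Expire old snapshots so time travel doesn't retain storage forever.
cur.execute(
    "ALTER TABLE page_views EXECUTE expire_snapshots(retention_threshold => '7d')"
)

# Delete data files that no live snapshot references anymore.
cur.execute(
    "ALTER TABLE page_views EXECUTE remove_orphan_files(retention_threshold => '7d')"
)
```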
[00:26:57] Unknown:
Absolutely. And I started my data lakehouse journey, I think, maybe going on two years ago now. And in that two years, it has gotten better. Initially, there wasn't really any out of the box support for being able to write into a lakehouse. You could write data into S3, but then you would have to perform a different step to actually tell whatever metastore you were using: hey, these files exist, this is the schema, these are the tables, etcetera. So my team is actually using Airbyte, and we actually had to write a custom output plugin that sat on top of their S3 plugin to be able to automate generation of those AWS Glue tables for the data that was just written out, rather than having it be an out of band process of, oh, hey, I wrote all this data to S3, and now I've gotta wait for the crawler to run to tell me what those tables are, and it's probably gonna be wrong anyway, etcetera.
[00:27:50] Unknown:
Absolutely. Airbyte, actually, all of them, they either have it now or it's coming, and when they do have it, it's not always the best. But, like, every single one of those vendors, I think, has realized that Iceberg is an important part of the data lake future, and they just need to be able to ingest directly into Iceberg. And Airbyte does have that out of the box now. There are a couple of implementations.
[00:28:13] Unknown:
The level of support is not quite where I would like it to be. And then going back to one of your earlier comments, as far as the data type specifications being a bit all over the place, one of the things that is my personal pet peeve, at least in the Airbyte tool chain, I don't know if it exists elsewhere, is that anything that has a decimal value is automatically a float, which, if anybody knows anything about data types, is an awful choice.
[00:28:39] Unknown:
Yes. That is an absolutely awful choice. Funny enough, the first versions of Trino, we didn't have decimal. We only had doubles. And the actual migration away from them was quite an undertaking.
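A quick illustration of why mapping decimals to floats is such a bad default: binary floats cannot represent most decimal fractions exactly, which is poison for anything money-shaped.

```python
from decimal import Decimal

print(0.10 + 0.20)                        # 0.30000000000000004
print(Decimal("0.10") + Decimal("0.20"))  # 0.30

# The error compounds when aggregated across many rows, which is exactly
# what analytics queries do.
print(sum([0.10] * 1_000_000))             # 100000.00000133288
print(sum([Decimal("0.10")] * 1_000_000))  # 100000.00
```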
[00:28:52] Unknown:
We had backwards compatible flags for a long while, where you'd be like, oh, if you see a literal, it's actually, you know, a double, not a decimal like it should have been in the spec. So the version of the plugin that my team uses, we actually implemented the logic that says, if it is a numeric type that has a decimal place, treat it as a decimal value, not as a float. And so for people who are looking at the data lakehouse ecosystem, going from where we are today and looking into the near to medium term, what are some of the areas of progress that you see as far as overall improvement in the capabilities and user experience for the tooling that's available?
[00:29:32] Unknown:
So I think we are finally at the point, as of, I don't know, this year, where the rest of the vendor space has realized that Iceberg is a critical component. And they aren't even just starting; they figured this out, like, six months ago, and their products are starting to land. And that's a big change. Whereas before, as you said, the history of the data lake is you end up having to build a bunch of this stuff yourself while the vendors figure out what's important. So there's a bunch of interesting parts to this. There's, obviously, things like landing data and data maintenance.
It's gonna be interesting to see how this shakes out in the next year or two, as what happened before is happening again. Everyone realizes it's important, so everyone's gonna build products around this. So now we're gonna have competing products that all have slightly different features, which is a good thing, but it's also a bad thing, because it's the paradox of choice for the end users. You're gonna have a lot of stuff to look at, and you have to consider that the data lake is about how things integrate together. So it's, like, if I choose this product from this vendor, how does that work with my other products that I might be interested in from other vendors? Can I use Airbyte to land my data and then use a separate data maintenance tool that plays well with that landed data? And it's going to be an interesting next set of things around, now that we're moving onto Iceberg and we have Trino, how do we get these different products to play well with it? And everyone's got kind of a different viewpoint on that.
[00:31:18] Unknown:
And as a vendor supporting Trino, building a product powered by Trino, what are some of the areas of investment that you see as being most critical to easing that adoption curve and improving the effectiveness and user experience for people who are using Starburst specifically, and Trino indirectly, to make their lives easier and help them get their jobs done?
[00:31:43] Unknown:
Well, I should have mentioned this earlier. The most challenging thing that people have is actually, like, how they query their data. So we set up Trino, and the first thing you see in Starburst is a way of actually entering queries right in our UI and being able to run some queries. And then you're like, great, I wanna put this in my BI tool. How do I get this to my BI tool? So that is a big area we actually think about: how do we empower users to get this into the tools they wanna use? Then the other part is generally the admin part. How do I manage my security? We spend a lot of time around that. And I think the big areas that we look for are how do we make it easier and easier for people to set up their data lake. So one of the first things we focused on in the Galaxy development was what I call time to first query.
So you go, you sign up, and you can be running queries on your data warehouse in a minute, a couple minutes. That's great. How do you get your data in? So we spend a bunch of time around data discovery, integrations, etcetera, and we're continuing to do more and more work around how you actually build up your initial lake and get your data into your lake. So I still think that's one of the big problems: how do you get data in? And then just a lot of the data lake stuff. It's geeky stuff. It's stuff I love, but it's really detailed, and there's a lot of choice in the space. And, really, what I want as a nontechnical end user, or even, honestly, my other friends that are insanely technical, they're like, that's great, but I don't wanna learn how the low level file system stuff works. I just wanna run some queries.
So we spent a lot of time on just, like, let's get it all working. And then if you wanna integrate with some additional stuff because that's important to you, we can talk about how we do that. But, really, it's: get up, get queries going, get excited about what we're doing. And then we can talk. Sometimes people are very opinionated about wanting a certain specific integration the way they wanna do it, but it's pretty rare. We hear it because we're in the community. But outside of, like, data heads, people don't know what Ranger is, or Parquet. They don't know what any of this is. They're like, I just wanna run some queries.
[00:34:19] Unknown:
Yeah. As somebody who's been running this podcast for, I guess, seven years now, whenever I talk to somebody who isn't deeply embedded in this space, I'm always struck by the fact that the things that I'm talking about, they have no clue and they don't care. I'm like, wait a minute. Alright. Reset. I'm gonna remember that I'm talking to somebody who doesn't do this every day.
[00:34:39] Unknown:
Yeah. I often find myself, outside of the data space, saying: so, you know in Excel when you do X? We kinda do that, but the table's infinite.
[00:34:49] Unknown:
Right. And going back to that question of landing data and the transformation, as you mentioned, most people these days are using dbt. There are some competitors, but not a lot of them, and not on the same scale. But one of the benefits that Trino provides is, as you mentioned, it's a federated query engine. So rather than being constrained to, oh, I can only work on the data that's in my Iceberg tables, you can say, oh, I actually just want to directly query against my Postgres or my MySQL database or some of the other numerous data connectors that are out there. And I'm wondering what you see as the general pattern of people who are adopting Trino, whether they are still using Airbyte or Fivetran as the only means of landing data into their lakehouse, or if they're largely using that federated query capability to be able to do more real time data updates from source systems into their lakehouse via those transformation routes?
[00:35:45] Unknown:
Very, very interesting question. So you're gonna get the database answer, which is: it depends. It's interesting. Federation is awesome. Generally, you're not keeping your main data... well, actually, let me back up. So, normally, when we're talking about federation: Trino, at its heart, is a federated query engine. That is, we don't own the data. We're interacting with data and the descriptions of the tables, which are all external. That said, the connectors that read data from, like, object store and Glue and that sort of thing, those are effectively native formats to Trino. Like, we implement all the raw file reading logic. We talk directly to Glue. We're not talking to another query engine. Whereas when we talk to MySQL, we send a query in MySQL's language to MySQL.
So, normally, when we're talking about federation, we're talking about the stuff that's not the normal data lake queries. A lot of companies and users, etcetera, will have what I'll call dimensional data sitting in a production store that's, like, a MySQL or a Postgres. This could be as simple as demographics for users, etcetera. So they'll have their main feed of data. Say it's an ads feed. And it's like, okay, user so and so saw this ad. You join it with their demographics, and then you can do analytics of, you know, the amount of ad clicks by age range or something like that. And you don't have age range in your normal ad feed. So that's really powerful.
And it's easy to do because you just connect them together. You don't have to set anything up. The downside is you're now accessing, from your query engine, a production data store that keeps your website running. That can be fine if you're using, like, MySQL and you have a bunch of read replicas for your database. It can also be expensive, because a transaction processing database is more expensive to run than an analytics database for the amount of data. So sometimes you'll want to instead copy that data into your data warehouse. The other reason that you wanna copy data in is you sometimes want historic data. So you need the demographics for that user when they saw the ad, especially when you're doing stuff where there's money involved and people are paying for certain ad impressions, or, you know, you're doing product stuff, you're selling things, and you wanna record the state of the system at that point. So a lot of times you'll either be dumping in data daily, or you can, with a lot of work, try and set up something like Debezium and get a feed into a data warehouse. It's very complicated today.
So a lot of times, you'll want to mirror the data in, because you actually want a non moving snapshot or you wanna reduce the pressure. So a lot of people start with the live connection and then move to the other one when they realize the cost or the pressure on their database. Moving can be really, really complicated, though. Like, the tools there are not good. Even the best, state of the art tools are very challenging to use.
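The pattern being described, sketched as Trino SQL: facts live in an Iceberg catalog on object storage, dimensions live in a production MySQL, and one query joins across both catalogs. The catalog, schema, and column names here are hypothetical:

```python
import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="demo")
cur = conn.cursor()

# Live federation: the MySQL side of the query is pushed down to MySQL
# in its own dialect, while the fact table is read from the lake.
cur.execute("""
    SELECT d.age_range, count(*) AS ad_clicks
    FROM iceberg.analytics.ad_clicks AS c
    JOIN mysql.app.user_demographics AS d ON c.user_id = d.user_id
    GROUP BY d.age_range
""")
print(cur.fetchall())

# Mirroring instead: a frozen snapshot in the lake, which also takes load
# off the production database. Re-run on whatever schedule you need.
cur.execute("""
    CREATE TABLE iceberg.analytics.user_demographics_snapshot AS
    SELECT * FROM mysql.app.user_demographics
""")
```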
[00:39:14] Unknown:
Absolutely. Digging a little bit deeper in there, I'm wondering if there are any other differences that you see in terms of the overall pipeline design, access, and usage patterns that folks are building around their usage of Trino and Iceberg as compared to maybe a warehouse or some of the other lakehouse compositions that you've seen?
[00:39:35] Unknown:
So the data warehousing space, I think, in general, is developing in two different directions, especially in the open data lakes. There's a large swath of people that are using something like dbt to do step by step transformations. And there is a movement towards materialized views, where you just say, I want a materialization of this query, and here's the policy for keeping it up to date. A lot of people think they're equivalent; they are not. Materialized views are about: when you're querying one, it's supposed to be the equivalent of just running the underlying query, and so the data changes underneath you. Whereas pipeline data has the advantage and disadvantage that, typically, you're processing on, let's say, a daily or an hourly basis. If the query changes, or something changes in the pipeline, it's only future data that's affected, which is good and bad depending on what you're trying to accomplish. So I think that's an important split that's happening in the open community, and I'm curious to see which one's gonna win.
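The two directions, side by side as Trino SQL (all names hypothetical). The pipeline style materializes a step explicitly, which is roughly what a dbt model compiles down to; the materialized view declares the query plus a freshness policy and is meant to be query-equivalent to its definition. Newer Trino releases support a GRACE PERIOD clause on connectors such as Iceberg, so treat this as a sketch against a recent version:

```python
import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="demo")
cur = conn.cursor()

# Pipeline style: an explicit, scheduled transformation step.
cur.execute("""
    CREATE TABLE iceberg.analytics.daily_clicks AS
    SELECT date(clicked_at) AS day, count(*) AS clicks
    FROM iceberg.analytics.ad_clicks
    GROUP BY date(clicked_at)
""")

# Materialized-view style: the engine owns freshness; the grace period
# bounds how stale a served result is allowed to be.
cur.execute("""
    CREATE MATERIALIZED VIEW iceberg.analytics.daily_clicks_mv
    GRACE PERIOD INTERVAL '1' HOUR
    AS SELECT date(clicked_at) AS day, count(*) AS clicks
       FROM iceberg.analytics.ad_clicks
       GROUP BY date(clicked_at)
""")
cur.execute("REFRESH MATERIALIZED VIEW iceberg.analytics.daily_clicks_mv")
```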
In terms of open data lakes versus proprietary ones, the biggest difference is that people don't keep all their data in the proprietary systems. It's just too expensive, or it's too complicated to move it all in. Whereas, normally, people are storing all their data in S3, whether it's a data lake or not, because it's cheap and they can have a backup. But you don't keep all your data in Snowflake, because it's either too expensive or it's too much of a burden to keep all the feeds to load it into their format. You see the same thing with Redshift and basically everything else out there. Even if it were free, it's still just annoying.
[00:41:21] Unknown:
And then another consideration that folks have when they're deciding whether or not they wanna use a lakehouse approach is that sometimes they have queries that need to operate very quickly, and so that's where they'll typically bring in something like a ClickHouse or a Druid when they're dealing with, you know, fast moving data that needs to be updated quickly. And I'm wondering what you see as some of the decision points around going wholesale into one of those systems versus using those as a supplement to a Trino and Iceberg setup?
[00:41:53] Unknown:
Yeah. So my experience with those systems is that they're limited in their capabilities, so they're almost always used with a custom application, especially in the case of Druid, where it's not standard SQL at all. It's very powerful, but your application is basically custom written to it, so you're not typically using it for general analytics. And if you're in that space, you end up having a lot of choices of different things you can do. So in terms of fast moving data, I think the open data lake is getting better at this very fast.
I think that's a thing everyone's focusing on. So with Iceberg, you now have the appending stuff that came in, what, two or three years ago. You see more and more people using tools to take data off of event streams like Kafka and land it into tables at high resolution, and then having background compaction jobs to deal with the insane number of files you create. And then downstream of that, there are a bunch of vendors and open source projects working on, okay, now we have this new data, how do we integrate that into the computations? I would guess within a couple years, you're gonna see everyone building something around this. You know, it'll be like everything else: a lot of them will be bad, but I think the overall community is going to move more and more toward bringing in data in near real time and being able to have it manipulated in a near real time feed. That said, that is near real time. Getting down to milliseconds, anything under, like, thirty seconds, typically means you have a custom engine where, as you're bringing the feeds in, they're going into main memory and being held in memory. You can't even get them to a distributed disk; it's not fast enough. Those, I think, will continue to be fairly proprietary systems. They're kinda complicated to write. So that's where you'll see the few vendors in that space. My experience is that most people don't need anything short of a minute. It's very rare to see that. The reason you see, like, Hudi came out of Uber is because they were using their real time system to adjust pricing on the fly. Well, how many organizations have that problem?
None. Outside of, like, delivery services, those are the only people I know who use those systems.
[00:44:30] Unknown:
Absolutely. And as somebody who has been working in this space for a number of years, as somebody who is building and investing in the lakehouse architecture paradigm, you're very deeply entrenched in that ecosystem. What are some of the most interesting or innovative or unexpected ways that you have seen Trino lakehouses applied?
[00:44:51] Unknown:
So the most interesting cases almost always are custom applications. I've seen so many standard warehouse setups that they all kind of blend together and are less interesting. Where it becomes really interesting is when someone builds a custom application, especially in Trino, if they're building a custom data store to match. So you have things like companies that run big CDNs and stuff like that, building a custom data store that hooks directly into their CDN and can show the live data feeds, and, like, security systems where you're hooked into the live security views, or ad systems. Like, we built a bunch of these at Facebook for hooking into the live ad system, live A/B testing, where you have a custom data store that's specifically tuned to a problem, with indexes that are for petabyte scale data.
You can do really, really powerful things with Trino because of the way the query engine is extensible. You can add new types and functions and all sorts of stuff into it, and end up with extremely responsive systems that do really custom things at big scale. That said, you need a team of highly skilled engineers to build something like that, which is worthwhile if this is your entire business. I think the more common interesting thing is ingesting data and setting it up and getting a bunch of people running their queries, which is pretty mundane. But the power of giving your people access to data, and their ability to make better decisions, is just, like, night and day.
[00:46:36] Unknown:
And in your experience of building these systems, working with customers, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process of working in this data lakehouse ecosystem?
[00:46:48] Unknown:
I think the most frustrating thing is you run into different requirement viewpoints on things. So it's like, you think you understand what people are interested in, and you start building that. And then someone comes along, and they're like, no, I actually am very interested in the opposite direction. So we had a bunch of people that were interested in, like, I don't care what the file formats are, I just want this stuff to go really fast. You have this advantage of your ability to move faster and build really custom things if you can change anything you want at any time. It's actually a huge advantage that the big proprietary vendors have. Well, once you get to scale, you can't really do that. But in the early days, it's very fun. You can move very fast. But at the same time, in our space, the reality is we are in this open data space. So if I extend stuff and no one uses it, I'm no longer in that space.
So it's often challenging to figure out how we thread the needle of actually making things a lot better without stepping outside that bound. So, like, we're doing a lot of work around Iceberg and Iceberg maintenance, and we spent a lot of time thinking about, hey, in Starburst, should we be just pulling this into our separate space? And then maybe we're not even using Iceberg manifest files. Maybe we're using something else, like a transactional database, and then I can do indexing in ways that are impossible right now. And we decided that, no, we're in the open data lake space, so we gotta figure out how to do it in the open format. Sometimes we have augmented data in special fields or sidecar files or that sort of thing to be able to give us the additional information that we need to make things go faster. Sometimes you get on Slack and you hit up Ryan Blue, and you're like, yeah, how about we just add some stuff into the spec to be able to handle this? Like, I'm sure everyone has this problem. So that's the: I wanna move faster, but I can't. It drives me nuts when I know there's a better solution, and I can't do it without breaking things and making the thing proprietary.
And then, you know, even then, I have to, like, wait for others to catch up.
[00:49:18] Unknown:
Absolutely. For people who are in the process of designing their data systems, or who are looking to build a new set of capabilities in their data platform, what are the cases where a lakehouse architecture is the wrong choice?
[00:49:33] Unknown:
So I would say a few years ago, this answer was a lot easier. I think nowadays, the open data lakes are very good. Where I think it's helpful with some of the vertically integrated players is you don't have to understand a whole lot. Again, you show up and you just use the tool. And I think that's where data lakes suffered, back to the original Cloudera stuff. If you ever tried to install it, they had, like, 10,000 choices of different tools to install. It's like, I just wanna work with my data. Their entire idea was choice, and that was the worst part about their product. It was too much choice. I think we've done a great job at Starburst around simplifying getting started on your lake and getting going in your lake. You also had this problem in the past where, I would say, there's a lot of people who feel like they need to use a lake, or a data warehouse in general, because they heard about it, and they don't actually have a data warehousing problem. Like, they could just use Postgres; they don't have a lot of data. Also, we see a lot of people that wanna do federation, and they don't understand that federation is: we just send queries to the other system. And they're like, well, it'll make my stuff faster. So I don't think we've done a great job of describing when you would even choose to move to a data warehouse. And then in terms of proprietary versus not, it's a tough choice. They can get you going, but they can be very expensive, complex to manage, and you're bolted into that thing. Like, I don't know if you've ever seen someone try to move from a traditional warehouse to an open one. It's not super easy. I don't wanna say it's hard. Like, we do a lot of business moving people onto the lake, but it would have been a lot easier if they had started on the lake.
[00:51:23] Unknown:
Absolutely. And as you continue to build and iterate on the Trino platform and the Starburst product, what are some of the things you have planned for the near to medium term, or any particular projects or problem areas you're excited to explore?
[00:51:36] Unknown:
So on the open source side, there's a bunch of stuff I'm very interested in around how we can spin people up on Trino in a faster and easier way. We're doing more around simplifying the setup, simplifying the installation process, making it work in smaller environments, and building better integrations with the different ecosystems. I wanna see much more work done on integrations with the Python ecosystem in particular. One of the big areas I've been focusing on recently is how you actually set up Trino. Historically, Trino was designed and operated assuming you had a data lake with Hive in it, and nowadays Spark in it, and you were adding Trino because both of those query engines are really slow and not particularly good to use. Now we're at the point where a lot of people just run Trino, and they don't have Hive or Spark. So there were things we assumed would already exist because you had those other tools, and now we're going back and adding them: things where, normally, you would have just fired up the Hive console and run some commands, and you just don't have that anymore.
So, another big area: you set up Trino, and you wanna set up a new catalog. In the old days, you knew what you wanted to connect to because you already had a data lake, so you'd create this little catalog file, modify it, and restart your server until things worked. Well, that's just not how people do it anymore. Now they fire up Trino, and it's like, okay, I wanna connect to my S3. Why do I have to go edit a file? Why can't I run a SQL command? So we recently added a bunch of stuff around CREATE CATALOG and DROP CATALOG, and there's still more to be done, like ALTER CATALOG. Right now, it's still just modifying local files under the covers, but we have some work underway on putting it into a real database.
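To illustrate the shift being described, here is roughly what the old and new workflows look like. This is a sketch rather than copy-paste configuration: the property names depend on the connector, and the SQL form requires Trino's dynamic catalog management to be enabled.

    -- Old way: write etc/catalog/lake.properties by hand, e.g.
    --   connector.name=iceberg
    --   hive.metastore.uri=thrift://metastore:9083
    -- ...then restart every node in the cluster.

    -- New way: manage catalogs at runtime with plain SQL.
    CREATE CATALOG lake USING iceberg
    WITH ("hive.metastore.uri" = 'thrift://metastore:9083');

    DROP CATALOG lake;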
So it's funny. You think about this in the Trino ecosystem, and it's like, what do you mean you're not storing your catalogs in a normal catalog system? Well, we never needed to. And at Starburst, with Galaxy, we've had this from the beginning: you go to the UI, you just modify your catalogs, and everything's kind of live-ish. It's getting even more live with these changes we're putting into Trino, so you'll be able to add and remove catalogs a lot more easily and maintain your system, plus a bunch more stuff around things like data evolution. So I'm excited about this. How do we bring more people into this community? Because I think we're very much at the point where the gap between what I can do in a traditional data warehouse and what I can do in Trino is much, much smaller. When we started Trino, we said, we're gonna be able to take out traditional data warehouses with this; we're going to build something that's as good as that. We're 10 years in, and for the vast majority of cases, we've been able to take them out for years. But this new-user case, I think, is the one remaining spot. When we started this project, we said it was gonna take 10 years. I think we're there. We just need a little bit more, and I think we'll have covered pretty much everything, all the way down to a new user with a couple of files they wanna process.
[00:55:13] Unknown:
It's funny how persistent that 10 year time horizon is. Pretty much every time I talk to somebody who has built or is building a database engine, they say it takes 10 years before you get it right.
[00:55:27] Unknown:
Yeah. The other thing they don't say is that it kinda takes 5 years before, you know, it kinda doesn't suck. It was pretty good, but, like, we didn't have the ability to write tables for the first year. Whatever. We've got data, we've got Hive, it's writing data for us, we'll just run queries that select the data out. So the amount of work to go from, oh, this is actually interesting, it kind of works, to, I can use it everywhere,
[00:55:55] Unknown:
is, like, something people have no idea about. Absolutely. It's amazing how many products have been built because the person building it didn't realize how hard it was going to be.
[00:56:05] Unknown:
Yeah. Honestly, I think that's almost every project I work on. If I knew how hard it was going to be, I probably wouldn't have
[00:56:14] Unknown:
started. And are there any other aspects of the work that you're doing on Trino, and this overall space of the data lakehouse ecosystem and the combination of Trino and Iceberg, that we didn't discuss yet that you'd like to cover before we close out the show?

I think we actually covered all of it.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

I really, really think we need a big improvement in the security space. And I don't really care what it is, other than it needs to work well with things like
[00:56:56] Unknown:
Trino. And the maintenance: the amount of complexity you have to go through to set those policies, where you have to learn a new language, is way too complicated. And frankly, even if you do learn the language, you're gonna get the policies wrong because you're not an expert at it. The models are just too complex. The other space is getting data into the lakes; I still think it's too hard. It just needs to work: data lands and gets maintained, and you shouldn't have to think about it. It should always work and be low cost, and data just shows up. Why do I have to worry about, you know, all the feeds?
[00:57:35] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you and your team have been doing on bringing the data lakehouse ecosystem into a better place, and all the work that you're doing to build the Starburst product. It definitely makes the onboarding a lot easier for folks, so I definitely like the work that you and your team are doing there. So thanks again for taking the time, and I hope you enjoy the rest of your day.
[00:58:02] Unknown:
Thank you. This was great.
[00:58:05] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview
Guest Introduction: Dain Sundstrom
Defining the Data Lakehouse
User Experience and Challenges in Data Lakehouses
Trino and Iceberg: Benefits and Comparisons
Building a Data Platform with Trino and Iceberg
Vendor Integration and User Experience
Pipeline Design and Usage Patterns
Innovative Applications of Trino Lakehouses
Lessons Learned in the Data Lakehouse Ecosystem
When a Lakehouse Architecture is the Wrong Choice
Future Plans and Areas of Investment
Closing Remarks and Contact Information