Summary
Databases are limited in scope to the information that they directly contain. For analytical use cases you often want to combine data across multiple sources and storage locations. This frequently requires cumbersome and time-consuming data integration. To address this problem Martin Traverso and his colleagues at Facebook built the Presto distributed query engine. In this episode he explains how it is designed to allow for querying and combining data where it resides, the use cases that such an architecture unlocks, and the innovative ways that it is being employed at companies across the world. If you need to work with data in your cloud data lake, your on-premise database, or a collection of flat files, then give this episode a listen and then try out Presto today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Martin Traverso about PrestoSQL, a distributed SQL engine that queries data in place
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what Presto is and its origin story?
- What was the motivation for releasing Presto as open source?
- For someone who is responsible for architecting their organization’s data platform, what are some of the signals that Presto will be a good fit for them?
- What are the primary ways that Presto is being used?
- I interviewed your colleague at Starburst, Kamil, 2 years ago. How has Presto changed or evolved in that time, both technically and in terms of community and ecosystem growth?
- What are some of the deployment and scaling considerations that operators of Presto should be aware of?
- What are the best practices that have been established for working with data through Presto in terms of centralizing in a data lake vs. federating across disparate storage locations?
- What are the tradeoffs of using Presto on top of a data lake vs a vertically integrated warehouse solution?
- When designing the layout of a data lake that will be interacted with via Presto, what are some of the data modeling considerations that can improve the odds of success?
- What are some of the most interesting, unexpected, or innovative ways that you have seen Presto used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building, growing, and supporting the Presto project?
- When is Presto the wrong choice?
- What is in store for the future of the Presto project and community?
Contact Info
- @mtraverso on Twitter
- martint on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Presto
- Starburst Data
- Hadoop
- Hive
- Glue Metastore
- BigQuery
- Kinesis
- Apache Pinot
- Elasticsearch
- ORC
- Parquet
- AWS Redshift
- Avro
- LZ4
- Zstandard
- KafkaSQL
- Flink
- PyTorch
- Tensorflow
- Spark
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. And when you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today.
Your host is Tobias Macey, and today I'm interviewing Martin Traverso about PrestoSQL, a distributed SQL engine that queries data in place. So, Martin, can you start by introducing yourself?
[00:01:50] Unknown:
Yeah. Hi, Tobias. Thanks for having me. So my name is Martin. I am one of the original creators of Presto, back in 2012. I'm currently a CTO, one of the CTOs at Starburst. I've been working on Presto for the last 8 years. Before that, I worked at a couple of different companies: enterprise companies, consumer companies, social networking, and so on. So I have a bunch of experience in different areas of distributed systems and large-scale systems.
[00:02:16] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:19] Unknown:
I've been exposed to databases from pretty early on; my first job involved databases in some form or another. But for many years it was always transactional data, like the data behind a website or some application. It wasn't until around 2007, when I was at Ning, that I started looking at Hadoop and the Hadoop ecosystem and brought it into the company to start experimenting with analytics there. Then over the next few years I got to use things like Hive and Hadoop even more, until I ended up at Facebook in 2012. And that's when we started writing Presto, and I've been working on that system since then.
[00:03:03] Unknown:
And so in terms of the Presto project itself, I know that it is, as I mentioned, a distributed SQL engine, and that it doesn't do any data storage on its own, instead relying on different backing stores for handling the actual durability of the information. So I'm wondering if you can give a bit more of an overview of what Presto is, some of the origin story, and why it was architected in that way.
[00:03:27] Unknown:
Yeah, absolutely. And that is absolutely correct. Let me start by describing why we created Presto, and that will give you a better picture of how we ended up where we did with it. So back in 2012, Dain Sundstrom, David Phillips, Eric Hwang, and myself formed a team at Facebook. We joined Facebook around that time, and we formed a team to try to solve one of the big problems they had in their infrastructure at that moment. People were using Hive almost exclusively. Hive came from Facebook, I don't know if you remember. And they were using it almost exclusively for any analytics they wanted to do: batch analytics, ad hoc, interactive.
But the problem with Hive is that it's very slow. There was a famous quote from a data scientist at Facebook who said, it is a good day when I can run six Hive queries. So we looked at that. That was an awful experience; people were not able to work effectively. So we said, okay, there's something we can do about this. We set out to build Presto with the initial goal of tackling that interactive, ad hoc user experience on top of the Hadoop warehouse, so we built Presto to be efficient at that. Actually, from the beginning we decided we wanted to make it open source, and that posed a couple of challenges, because Facebook had their own version of Hive at the time. They had forked the open source version internally; they had extensions, they had changes. And we needed to be able to support both the internal version and the external, open source version. So we said, okay, how do we go about doing that? And that's when we came up with the idea of separating the computation layer from the storage APIs. We put a very simple and clean interface between the two, and abstracted it out into what we call connectors. And we made the connectors pluggable, so you can drop in a new connector when you deploy Presto and talk to a different data source. So that was the forcing function. Then after that, other people back in the company, and then, after we open sourced it, people outside, ended up using that as a way to extend Presto and connect it with a bunch of other data sources. So we started with the goal of making analytics over Hadoop faster, but we ended up with a system that can query Hadoop, integrate with other data sources like Elasticsearch, MySQL, Oracle, and Postgres, and run queries across all of them. In that sense, Presto is the query engine, and it delegates to all these other systems for the storage needs through the connectors and plugins I mentioned.
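The separation Martin describes, a computation layer talking to pluggable connectors through a narrow interface, can be sketched in a few lines of Python. This is a toy illustration only; Presto's real connector SPI is a much richer Java interface, and all the names here (`Connector`, `Engine`, `scan`) are invented for the example:

```python
from abc import ABC, abstractmethod

class Connector(ABC):
    """Toy stand-in for Presto's connector idea: the engine only
    knows this interface, never the storage system behind it."""

    @abstractmethod
    def scan(self, table):
        """Yield rows (as dicts) from the named table."""

class InMemoryConnector(Connector):
    """One possible backing store; Hive, MySQL, etc. would be others."""
    def __init__(self, tables):
        self.tables = tables

    def scan(self, table):
        yield from self.tables[table]

class Engine:
    """The computation layer: catalogs map names to pluggable connectors,
    and all query logic lives here rather than in the storage system."""
    def __init__(self):
        self.catalogs = {}

    def register(self, name, connector):
        self.catalogs[name] = connector

    def query(self, catalog, table, predicate):
        # The connector only scans; filtering happens in the engine.
        return [row for row in self.catalogs[catalog].scan(table) if predicate(row)]

engine = Engine()
engine.register("hive", InMemoryConnector(
    {"users": [{"id": 1, "active": True}, {"id": 2, "active": False}]}))
active_users = engine.query("hive", "users", lambda r: r["active"])
```

Dropping in a second `register` call with a different `Connector` subclass is all it takes to query another source, which is the property that made the connector design a forcing function for everything that followed.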
[00:05:58] Unknown:
In terms of the release of it as an open source project, what was the motivation behind that? And how do you think that influenced the original ideas about how to approach the problem?
[00:06:10] Unknown:
Dain, David, and myself have always been big believers in open source. When we started the project, we said we want to do something that will be a multi-decade project. We wanted it to be as successful as other projects out there, like Postgres or MySQL. So we said we want to build something that stands the test of time, and the only way to do that is to make it open source. If you keep it inside Facebook, this is the kind of thing that might be useful for a few years, and then something else comes along and it gets abandoned; people move to a different team, and it stagnates, and so on. We wanted to make it open source so that we could get other people involved and it could grow beyond a single company and a single use case. That's kind of the motivation. And we wanted to build something that we didn't want to have to build again over and over, so making it open source was the right approach there.
Now, of course, that places a bunch of constraints on the way we architect things and the way we develop things. There are a number of things that you would normally be able to do if the system were specialized for the single use case that you're trying to solve. But by making it open source and accessible to other people, we were basically saying, okay, we want to make the system more flexible; we want to cater to the needs of other people. So there's a delicate balance we have to strike there. It required creating these abstractions of connectors, and it sometimes requires making trade-offs about what we develop at a given point in time so that we can satisfy what other people need, not just what Facebook needs.
Sometimes there was some complex sequencing of features and the way we approached them, because we might have someone external to Facebook come and say, hey, I need this feature, while we might be working on something else, and if you don't do it carefully there might be a conflict in terms of how the two changes fit together. So it's a little more work. You have to interact with people in the community a lot more; we have to coordinate things a lot more. And it can get challenging when those people are all over the world, as opposed to sitting next to you in your office. So it has changed the way we approach the development and the architecture of the system, and also how we work with people as we advance the system. For somebody who is responsible for architecting the data platform for their organization,
[00:08:37] Unknown:
or considering how they want to handle data access to different sources or a single source, what are some of the signals that they should be looking at to determine whether or not Presto will be a good fit for their environment? There
[00:08:50] Unknown:
are two different places where people start from. You have people that are starting from scratch, trying to build their infrastructure, so they have a greenfield; they can choose whatever technology they want. If you're considering doing that, and especially if you're going to be in the cloud, you're probably going to be looking at S3, or Azure Data Lake Storage, or Google Cloud Storage for storing your data, and then you're probably going to be using something like the Glue metastore in AWS for your catalog, and so on. Once you start consuming those things, Presto is a natural fit, because Presto was designed to be able to work in that ecosystem. So if you're going to be picking anything from the Hadoop and Hive ecosystem, Presto is just a fit there. And of course, if you need interactive queries, and you need those to be fast, efficient, etcetera, Presto is a perfect fit. Now, if you're dealing with other data sources, that's a case where Presto can be useful.
It really depends on how you're using those data sources. One of the things we've seen is people trying to migrate out of those systems, and Presto can be a good bridge for that, because you can connect Presto to those data sources, and you can connect it to a data lake, and then you can migrate from one to the other over time and do it somewhat transparently. You can move workloads and continue to use the same simple interface to access both data sources, which is very convenient for users; it lowers the barrier of learning, and so on. But if what you're thinking is that you're going to keep all your data in one place (let's say you have an Oracle instance and you want to use Presto for querying that data instead of going directly to Oracle), that might not be a good fit. Oracle can do a much, much better job at optimizing queries than Presto pulling the data out of Oracle to execute the queries on the Presto engine side. So it's a question of: are you trying to integrate different data sources, or are you trying to front a single data source with Presto for some reason? Where Presto shines is, like I said, querying data lake data; it's very fast and very efficient. So if you're going that route, then Presto is a natural fit. If you need to integrate data across multiple data sources, that can also be a good use case for it. For the cloud data lake scenario,
[00:11:10] Unknown:
one of the things that I've often seen as a challenge is the access controls and authorization for the different data sources. Is that something that Presto can manage itself? Or is that pushed down into the underlying storage engines to determine what access is provided to the end user?
[00:11:28] Unknown:
There are a couple of different modes Presto can work in. First, it implements the standard SQL access control model, so for every table there are permissions that the engine checks. When a user tries to query a table, it can check whether the user has access to read the data, or different columns of the table, or whether they can write to the table; there's also a concept of column masks and filters that can be applied. So Presto provides the framework, but ultimately it's up to the connector to decide how to store and check those actual permissions. For example, if you're dealing with Hive, with a Hive metastore, you might have a Ranger integration somewhere, so Presto can delegate to Ranger to see whether a given user has permission to access a given table. If you're talking to Oracle, or to a SQL database, Presto delegates to the connector, and it's up to the connector to go and check in MySQL or Oracle whether the user has permissions.
There's another scheme, what we call credential pass-through, that allows the user to provide credentials for the backend database when they run the query. Presto will not look at those credentials; it won't do anything with them. It will just pass them on to the underlying data source, and then the underlying data source is going to authenticate against the data store and use those credentials to check whether the user has permission. So it's flexible; it depends on your constraints. There are some organizations that want to constrain access to the underlying data sources at the data source level. In other cases, they want to handle that inside Presto, and that's perfectly doable.
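The two modes Martin contrasts, engine-checked grants versus credential pass-through, can be modeled with a small sketch. Everything here (`PERMISSIONS`, `Backend`, the function names) is invented for illustration and is not a Presto API:

```python
# Mode 1: the engine (via the connector) checks SQL-style grants itself.
PERMISSIONS = {("alice", "orders"): {"read"}}  # (user, table) -> granted actions

def engine_checked_query(user, table):
    """Reject the query before touching storage if the grant is missing."""
    if "read" not in PERMISSIONS.get((user, table), set()):
        raise PermissionError(f"{user} may not read {table}")
    return f"rows of {table}"

class Backend:
    """Stand-in for an underlying store (MySQL, Oracle) doing its own auth."""
    def __init__(self, accounts):
        self.accounts = accounts  # user -> password

    def query(self, table, user, password):
        if self.accounts.get(user) != password:
            raise PermissionError("backend rejected credentials")
        return f"rows of {table}"

def passthrough_query(backend, table, user_credentials):
    """Mode 2: credential pass-through. The engine forwards the user's own
    credentials untouched and lets the backend make the access decision."""
    user, password = user_credentials
    return backend.query(table, user, password)
```

In the first mode the decision lives with the engine and connector; in the second, the engine is deliberately blind and the data source remains the single authority, which is what organizations that constrain access at the source level want.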
[00:13:12] Unknown:
In terms of the types of use cases, you mentioned data lakes and being able to integrate across multiple different data sources. But what are some of the other ways that Presto is often used, and the ways that it fits into the overarching ecosystem of big data tools?
[00:13:28] Unknown:
Yeah. So let me describe one of the things that we did at Facebook very soon after we made Presto able to have connectors. One of our friends at Facebook was working on a team building one of the backends for one of the user-facing systems. It was an in-memory data store, and they were thinking of implementing their own SQL engine, or their own query engine, because they needed to be able to analyze that data to see whether people were gaming the system, whether the algorithms were correct, and so on. So we told him about Presto and its plugins.
He said, okay, I'm going to try that. He wrote a plugin in a couple of weeks, and in two more weeks he got the system running in production. So he was able to integrate Presto with that in-memory store and provide analytics for the team running that system within a month. And that has nothing to do with big data, Hadoop, or Hive; it was a completely different use case. It was super powerful. After that, we ended up building a bunch of different systems on top of Presto, or using Presto as the backend. For example, the A/B testing system at Facebook uses Presto as the analytics backend; that's a completely different stack, separate from Hadoop and Hive. There's another system where Presto acts as the analytics backend for a user-facing product for ads. These are users that are running ad campaigns; they want to go and analyze their campaigns, so they go to a website, they click a bunch of buttons, and that runs Presto queries under the covers.
So this is a completely different scenario where it's end-user facing. It has high availability requirements; it has to be up 24/7; it runs across multiple geographic locations. You have to return results within 1 or 2 seconds, because there's a person waiting at the other end of the web browser for the results.
[00:15:27] Unknown:
And so I interviewed your colleague from Starburst, Kamil, a couple of years ago at this point, and talked a bit about the underlying functionality of Presto, how it's built under the covers, and some of the ways that you're supporting it at Starburst. But I'm wondering, what are some of the most notable changes or evolutions in the technical, community, and ecosystem growth in that time?
[00:15:52] Unknown:
One of the biggest changes is community-wise: the community has grown dramatically over the last couple of years. After we left Facebook, Dain, David, and I spent 9 or 10 months working on Presto full time, growing and consolidating the community. And it paid off; we have a lot more people all over the world involved in the project. Today we have more than 2,600 people on the Slack channel for Presto. That's a lot of people interacting every day, whether it's asking questions about how to do certain things, or when they're running into issues, or people saying, I need this feature, or, I'm interested in implementing this, how do I start, all the way to more involved and complex discussions about architecture and design, and so on. And last year we ran 5 conferences.
These were organized not just by us, but in coordination with companies all over the world. We did conferences in India, in Israel, in California, in New York, and in Japan. This is a testament to how global Presto is; it was actually very surprising to us how it grew and how much adoption it has across the world. In those 2 years, we had more than 5,000 code changes, 5,000 commits into the code repository, which is pretty significant. That's about 25% of all the commits in the entire history of the project over the last 8 years. In terms of technology improvements, one of the most notable things is the sheer number of connectors that we've added recently. Over the past year and a half or so, we added connectors for Elasticsearch, Oracle, BigQuery, Apache Pinot, Druid, and Amazon Kinesis.
And there are a few more, like MemSQL and so on. These were mostly contributed by community members; some of them came from specific companies, but most were from people who were interested in working on something and contributed connectors for those systems. On the engine side, there have been a lot of improvements: some performance improvements, some architecture improvements to the way Presto runs queries. One notable feature is what we call dynamic filtering, which allows the engine to optimize queries that perform joins by learning what data comes from one side of the join and then using that to dynamically prune the data source on the other side of the join. This can be a significant boost in performance, especially if your data is partitioned in Hive, or if you're using one of the formats like ORC or Parquet that have internal stats that can be used to prune datasets.
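The dynamic filtering idea can be illustrated with a toy join over partitioned data: collect the keys from the small build side first, then use them to skip whole probe-side partitions. This is a simplified model of the behavior described above, not the engine's implementation:

```python
# Toy illustration of dynamic filtering on a partitioned probe side.
build_side = [{"region_id": 3}, {"region_id": 7}]   # small dimension table
probe_partitions = {                                # fact table, partitioned by region_id
    1: [{"region_id": 1, "sales": 10}],
    3: [{"region_id": 3, "sales": 20}],
    7: [{"region_id": 7, "sales": 30}],
    9: [{"region_id": 9, "sales": 40}],
}

def join_with_dynamic_filter(build, partitions):
    """Learn the join keys at runtime, then prune partitions that
    cannot possibly match before reading them."""
    keys = {row["region_id"] for row in build}
    partitions_scanned, joined = 0, []
    for part_key, rows in partitions.items():
        if part_key not in keys:          # skip the partition entirely
            continue
        partitions_scanned += 1
        joined.extend(r for r in rows if r["region_id"] in keys)
    return partitions_scanned, joined

scanned, joined = join_with_dynamic_filter(build_side, probe_partitions)
# Only the 2 matching partitions are read instead of all 4.
```

The same pruning applies when ORC or Parquet min/max statistics, rather than Hive partitions, let the reader skip row groups that cannot match the learned keys.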
Also related to performance, a feature called late materialization got added, which allows the engine to evaluate certain things more lazily, so it can avoid reading data from the underlying data source if it decides it doesn't need the data at runtime. This is another dynamic optimization; it's not something that can be determined upfront. As a query executes, say you have some data that's going to be scanned and filtered, then aggregated; if the filter eliminates the rows, the engine won't bother reading the rest of the data. Traditionally, Presto would read the data, then filter it, and then throw it away before aggregating, for instance. There are a number of other areas among those 5,000 code changes; there are hundreds of small changes and then a few big changes, and there are some other areas where we have seen a lot of improvement.
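The late materialization behavior can be sketched by counting reads: evaluate the cheap filter column first and fetch the expensive payload only for surviving rows. All names here are invented for the illustration:

```python
# Toy sketch of late materialization: read the cheap filter column first,
# and fetch the expensive payload column only for rows that survive.
reads = {"count": 0}

def load_payload(row_id):
    """Stands in for actual I/O against the data source."""
    reads["count"] += 1
    return row_id * 100

filter_column = {0: False, 1: True, 2: False, 3: True}  # row_id -> predicate result

def eager_scan():
    """Traditional plan: materialize every row, then filter."""
    rows = [(rid, load_payload(rid)) for rid in filter_column]
    return [payload for rid, payload in rows if filter_column[rid]]

def lazy_scan():
    """Late materialization: evaluate the filter first, read survivors only."""
    survivors = [rid for rid, keep in filter_column.items() if keep]
    return [load_payload(rid) for rid in survivors]

eager = eager_scan()
eager_reads = reads["count"]   # every payload was read, even discarded ones

reads["count"] = 0
lazy = lazy_scan()
lazy_reads = reads["count"]    # only surviving payloads were read
```

Both plans produce the same result; the lazy one simply performs half the payload reads here, and the win grows with the selectivity of the filter.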
For example, there's better support for cloud environments: support for Google Cloud Storage and Azure Data Lake Storage is new in the past couple of years, and we've added support for AWS Glue, for the Glue catalog. There have been a bunch of features that make it easier to run Presto in these various cloud environments. And then there's been a big push around security. We added support for column-level security, column masks, and row filters, but also better support for different modes of authentication for clients and for securing the server and the Presto cluster end to end. For example, all the endpoints in the cluster are now secured behind credentials, and the UI is secure.
The communication between the coordinator and the workers is also secured, which is something that many companies don't care about, but as Presto is being adopted by industries like banks and financial institutions, which are very sensitive to these aspects, it matters. Those are some of the things we've been improving over the past couple of years. And one last one I'd like to mention: we added support for variable-precision timestamps in the last few releases. This is something we've been wanting to do forever. It's what most databases support; go use Oracle, or Teradata, or Netezza, and they support variable-precision timestamps, which allow you to have more granularity than milliseconds, which is what Presto has traditionally supported. And this is important, for example, for the financial industry; oftentimes they need to keep data with nanosecond granularity.
Now that is possible with Presto, so that opens up the door for a lot more use cases and adoption.
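A small sketch shows why sub-millisecond precision matters for workloads like trade data. The epoch values are arbitrary; the point is only what truncation does:

```python
# Two events 500 nanoseconds apart collapse into the same value
# once truncated to millisecond precision.
t1_ns = 1_600_000_000_000_000_000   # a timestamp in epoch nanoseconds
t2_ns = t1_ns + 500                 # 500 ns later

def truncate_to_millis(ns):
    """Drop the sub-millisecond digits, as a millisecond column would."""
    return ns // 1_000_000

distinct_at_ns = t1_ns != t2_ns                                       # True
distinct_at_ms = truncate_to_millis(t1_ns) != truncate_to_millis(t2_ns)  # False
```

With only millisecond columns, the two events become indistinguishable, so ordering and deduplication of nanosecond-spaced events silently break; variable-precision timestamps avoid that loss.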
[00:21:13] Unknown:
And in terms of being able to deploy and scale a Presto cluster, as you mentioned, there's been a lot of work done recently to support different cloud environments, and I know that there's also been support added for things like cloud native infrastructure using containers and Kubernetes. I'm wondering for people who are planning an installation of Presto, what are some of the considerations that they should be aware of in terms of the deployment and scaling and some of the integrations that might simplify their work to maintain a cluster with high uptime and reliability?
[00:21:47] Unknown:
The first thing is that there is some basic support in the open source project, but most of the support for high availability and more native integration with cloud environments is done by vendors. For example, if you look at Starburst, there's an offering in the Amazon Marketplace and in the Google Marketplace. That's probably going to be your best bet if you want an out-of-the-box experience that's easy and straightforward. If you're trying to do it yourself, it will be more work; you're going to have to do all the things those integrations do by hand, like how you do high availability for the coordinator. It can be done, but you'll have to orchestrate that yourself. All the integration with monitoring and CloudWatch in Amazon and so on, you have to do yourself. So if you want an out-of-the-box experience, one of the vendors is probably going to be your best bet. Then in terms of how you deploy and scale, there are a couple of things that are interesting to know. One of those is that Presto has been run at scale at many companies over the past 8 years.
For example, at Facebook we ran clusters of about 1,000 nodes each. That gives you a sense of how big you can make a cluster, and that's probably bigger than almost any company needs; you just need to know it can go that high, but you probably won't need it. There are a couple of considerations you have to take into account. Presto can utilize machines to their full extent; it's parallel and multithreaded internally, so if you have a machine with multiple cores, it will use them all. So if you're thinking of colocating Presto with other systems, that's probably not a good idea; you want to dedicate entire machines to Presto itself. This is not an issue in cloud environments, because you can just provision new machines and the provider takes care of that for you, but if you're doing an on-premise deployment, you have to be careful with that. And similarly, Presto can use the network very efficiently; it can drive the CPU and the network to their limits.
So you need to make sure you have fast network interconnects between the workers, which again is typically not an issue in cloud environments like AWS; as long as you're in the same region, you'll be fine. Azure and Google Cloud are the same. But if you're on premise, you have to be a lot more careful about how you distribute your machines across racks; you don't want to run into top-of-rack limits and so on. So cloud environments make it easier, and you have to take those factors into account in on-premise deployments. And then there's something that we learned at Facebook.
It's generally useful, if you have different use cases, to separate them. For example, you may have some people at your company who need interactive analytics, like BI tools, or dashboards, or sitting in front of a console typing queries, but you also need to support batch workloads; then you may want to separate those into separate clusters. The configurations for those are going to be different, because you're probably going to have different SLAs in terms of response times, queuing, and how long you're willing to let people wait for a query to start, and so on. So one of the best practices is to separate your batch and interactive workloads; that makes it easier to manage them, easier to manage user expectations, and to make sure that everyone's happy.
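The batch-versus-interactive split can be sketched as a simple router. The cluster names, limits, and source labels below are all made up for illustration; a real deployment would encode its SLAs in the engine's own resource configuration rather than in application code:

```python
# Toy router for the batch-vs-interactive split described above.
CLUSTERS = {
    "interactive": {"max_queued": 20,  "query_timeout_s": 60},        # low latency
    "batch":       {"max_queued": 500, "query_timeout_s": 6 * 3600},  # high throughput
}

INTERACTIVE_SOURCES = {"bi-tool", "dashboard", "cli"}

def route(query_source):
    """Send BI tools and ad hoc sessions to the low-latency cluster,
    and anything scheduled to the batch cluster."""
    return "interactive" if query_source in INTERACTIVE_SOURCES else "batch"
```

The point of the separation is visible in the two configs: a dashboard user gets a short queue and a tight timeout, while a nightly ETL job is allowed to queue deeply and run for hours, and neither workload can starve the other.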
[00:25:19] Unknown:
As far as the best practices for working with data within Presto: because it can federate across these multiple sources, there is the possibility of using it for some ETL capabilities, reading from one table in one data source and then writing it out to another. But what are some of the ways that people can potentially shoot themselves in the foot if they're not careful about how they structure their queries, or how they work across these different data sources?
[00:25:50] Unknown:
Presto is a very complicated system, so the short answer is: it depends. It depends a lot on the use case and the requirements. It depends on what your performance requirements are, how many users are going to be running queries, whether those queries are going to be big or small, whether they are going to use a lot of memory, and so on. Ideally, if you can run Presto over a data lake and have all your data sitting there, that's going to be the most efficient way to use it, because Presto's connector for the Hive data layouts, running over S3 or HDFS,
is the most optimized one. It's the one that has received the most attention over the past eight years, and, of course, it's the one that can run the most parallel and utilize the resources most efficiently and so on. So you're going to get the best experience if you do that. But of course, not everyone has all their data in a single place like a data lake. Oftentimes they have data in different systems. Presto makes it possible to read from those other systems, but one thing you have to take into account is that you're going to be limited in many cases by the capabilities of the connector and how the connector can interact with those systems. For example, the MySQL connector goes through JDBC.
JDBC is effectively a single connection per query against MySQL. That means that when you run a query in Presto that needs to fetch data from MySQL, it's going to come through one connection. It's not going to be parallel, so if you're pulling a lot of data, you may be bottlenecked by that, and you have to take that into account. Now, there are vendors like Starburst that have optimized versions of some connectors. For example, there's an Oracle connector that is fully parallel, so you don't suffer from the potential bottleneck of being fed through a single connection. But there are still some limitations. If you run a complex query, the engine cannot push certain query shapes into the underlying data source, so you may end up transferring a lot of data from the remote system, which may be taxing to that remote system.
So those are some of the considerations you have to take into account. One thing we've seen is people use the ability to connect to these data sources to query them occasionally. Maybe they have fast-changing data in these other systems, because those systems are better suited for that than Hadoop, which doesn't really allow modifying data very efficiently. In those cases, that's a good fit. Sometimes what they do is use the ability for Presto to read from those data sources to mirror them into a data lake, and then they can query them efficiently from the data lake.
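That mirroring pattern can be sketched as a single CREATE TABLE AS statement. The catalog, schema, and table names here (`mysql`, `hive`, `shop.orders`, and so on) are hypothetical; the sketch assumes a MySQL catalog and a Hive catalog are already configured on the cluster.

```sql
-- Copy a fast-changing operational table out of MySQL (reached over a
-- single JDBC connection) into ORC files in the data lake, where later
-- queries can run in parallel against S3/HDFS instead of taxing MySQL.
CREATE TABLE hive.analytics.orders_mirror
WITH (format = 'ORC')
AS SELECT * FROM mysql.shop.orders;
```

A scheduled job can refresh the mirror periodically; most queries then hit the lake copy, and only queries that truly need fresh rows go through the live MySQL catalog.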
[00:28:36] Unknown:
Another situation: people who are looking at Presto, either because they're building out a greenfield project or because they already have an existing data warehouse, might be weighing their options. What are the trade-offs of using Presto in its optimal configuration on top of a data lake versus a vertically integrated data warehouse solution and some of the capabilities that it might provide out of the box?
[00:29:00] Unknown:
I think the main one is one of flexibility. If you're using a fully integrated warehouse, for example, let's say you look at Redshift. Because Redshift controls everything from the query engine all the way to storage, and even the data formats and so on, it can be optimized for the specific workload that you're going to perform. The disadvantage is that in order to interact with Redshift, you have to load your data into Redshift before you can query it. So that limits your flexibility; it forces you to go through an extra step that you wouldn't otherwise require.
And in terms of productivity, if you want to interact with that data through other tools, for example if you have Spark in your organization because you're doing a lot of machine learning or more complex processing that is not suitable for SQL, then if you have your data locked in Redshift, you can't just take it out. You have to extract it into a different system before you can use it. So being locked into that specific solution means that once you have a lot of data, getting it out or moving it to another system can become expensive.
Of course, you need ETL in those scenarios. You can run a query in those systems and it can be fast, but with ETL, you have an initial delay of getting all your data into that system in the first place. When you have a data lake, you're going to have systems that land your data directly in the data lake, and as soon as it is there, you can query it. You don't have to do any further transformations; you don't have to migrate it, you don't have to copy it, and so on. I think those are the trade-offs. But of course, it is going to be more feasible for a fully vertically integrated system to do things more efficiently than if you have Presto sitting on top of a data lake.
[00:30:52] Unknown:
In designing a data lake, there are also considerations to be made as to what format you're storing it in, whether it's just JSON data in a gzipped file or Parquet or ORC files, what levels of compression to apply on the various aspects, and then how to lay out the partitioning scheme and what the directory structure might look like in S3, for example. I'm curious what you have found to be some of the useful pointers or references for determining how you want to land your data in the lake so that you can query it effectively with Presto, as well as use it with these other tools?
[00:31:33] Unknown:
Yeah, one of the things that you first need to understand is how you are going to manage your data. What is the life cycle going to be for your data? For example, at Facebook, we had this requirement that data over a certain number of days had to be removed from the warehouse. In order to do that efficiently, you need to organize the data in a way that makes dropping old data efficient. So one thing that is pretty common is to partition your data by date. As you're ingesting data every day, you create a new partition, and then when the data is past the retention window, you just delete the partition. You don't have to go and rewrite anything; you just drop some files from the data lake. So that's one consideration.
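That retention pattern can be sketched with the Hive connector's table properties; the catalog, schema, and table names are hypothetical.

```sql
-- Partition the table by date so that expiring old data means dropping
-- partitions, never rewriting files. In Hive-style layouts the
-- partition columns must be listed last.
CREATE TABLE hive.analytics.events (
    user_id BIGINT,
    payload VARCHAR,
    ds      VARCHAR   -- date partition key, e.g. '2020-06-01'
)
WITH (format = 'ORC', partitioned_by = ARRAY['ds']);

-- In the Hive connector, a DELETE whose predicate touches only
-- partition columns drops whole partitions as a metadata operation:
DELETE FROM hive.analytics.events WHERE ds < '2020-05-01';
```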
But then another consideration is how you are going to be querying that data, and sometimes those things can be at odds with each other. There are some things that can help make that a bit more efficient. For example, Presto, when you're dealing with Hive and the Hive metastore and so on, supports the partitioning schemes that Hive supports, and you can use that for pruning data efficiently. For example, if you're going to be querying data by date and always filtering by date ranges, then, of course, if you partition by date, Presto will be able to skip all the partitions that don't participate in the query. If you have other criteria you're going to be querying over, and the field you're querying over is not high cardinality, then you might be able to use partitioning for that too. For example, you may partition by day and by something else, by geography, for instance, if you're querying by geography.
And of course, if you have a high cardinality dimension or value, partitioning on that could be a problem, because you're going to put a tax on the metastore. You don't want to partition, for example, by user; that would be super expensive. On the other hand, there's another feature in the Hadoop/Hive layout called bucketing that allows you to distribute the data across the files in a partition in a deterministic manner according to a set of values. That can also be useful for filtering data upfront. For example, let's say you partition your data by date and then you bucket it by user into 1,000 buckets. When you run a query, if you're searching for a specific user, Presto can take advantage of that knowledge and narrow down the query to only the files that are candidates for containing that user. So that can reduce the amount of data transfer, the IO, and so on significantly.
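A sketch of combining date partitioning with bucketing by user, using the Hive connector's `bucketed_by` and `bucket_count` table properties (all names hypothetical):

```sql
CREATE TABLE hive.analytics.clicks (
    user_id BIGINT,
    url     VARCHAR,
    ds      VARCHAR
)
WITH (
    format         = 'ORC',
    partitioned_by = ARRAY['ds'],
    bucketed_by    = ARRAY['user_id'],
    bucket_count   = 1000
);

-- Partition pruning on ds skips whole directories; bucket pruning on
-- user_id narrows the scan to the candidate bucket file per partition.
SELECT url
FROM hive.analytics.clicks
WHERE ds = '2020-06-01'
  AND user_id = 12345;
```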
And then, of course, if you're using columnar formats like Parquet or ORC, you can further improve the IO patterns and the network transfers, because the engine can avoid reading columns that you're not going to use. Those files also contain stats about the data inside. Parquet and ORC are organized into what they call row groups or stripes, which are groups of rows. They basically have a footer associated with each of those row groups that contains stats about the minimum and maximum value for a given column in that row group, and the Presto engine can use that to dynamically skip the sections that are uninteresting.
So the recommendation is to use Parquet or ORC. ORC is the most optimized format right now for Presto; we're working on making Parquet better and hopefully as efficient as ORC. If you want to compress your data, and you definitely want to compress your data, the two recommended algorithms are LZ4 and Zstandard. Those are the best in their class of algorithms. LZ4 is good for fast compression; it's very efficient, but it won't compress as compactly as Zstandard. Zstandard is an efficient algorithm for high compression, so it's better than gzip. If you can use Zstandard, that's the way to go. And then finally, one thing that's important: if you're running queries that have a lot of joins, you want to be collecting stats. If you're using the Hive metastore and writing data through Presto, Presto will collect them automatically.
If you're writing data through some other system like Spark, or ingesting it through a streaming system, you want to run the commands in Presto that will analyze the data and record the stats in the metadata, so that the cost-based optimizer can use them to optimize the join order and the join type selection when you run your queries and run them more efficiently.
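The commands he's referring to are ANALYZE and SHOW STATS; a minimal sketch against a hypothetical table (the `partitions` property follows the Hive connector's documented form, which may vary by version):

```sql
-- Collect stats for data written outside Presto (e.g. by Spark), so
-- the cost-based optimizer can pick join order and join distribution.
ANALYZE hive.analytics.events;

-- Or restrict collection to a single date partition:
ANALYZE hive.analytics.events
WITH (partitions = ARRAY[ARRAY['2020-06-01']]);

-- Inspect what the optimizer will see:
SHOW STATS FOR hive.analytics.events;
```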
[00:36:10] Unknown:
As you have worked to grow the project and worked with people both in the capacity as a consultant as well as helping to grow the community, what are some of the most interesting or unexpected or innovative ways that you've seen Presto being used?
[00:36:25] Unknown:
I mean, at this point, there's almost nothing that surprises me anymore, but over the years there were a number of surprising use cases. Of course, there are the ones I mentioned about Facebook. We touched on them a little bit ago, where Presto was being used for things that were not what we envisioned in the beginning. Initially, as I said, we were targeting the analytic workloads over the Hive/Hadoop warehouse, and over time Presto was used for a number of other use cases that had nothing to do with that. For example, the backend analytics engine for a user-facing product, or the analytics backend for the A/B testing system.
So those are things that we didn't foresee, we didn't plan for, and they were a bit surprising. There was actually an interesting one a few years ago. I don't know if they're still using it, but I imagine they are. There was a company, Akamai. They used to have a system for doing analytics over the live events across their entire fleet of machines, so they can monitor the machines and see what's going on and troubleshoot problems and all that. They had a proprietary system that did some kind of distributed SQL on a few machines.
And at a conference, I think it was 2015, I was talking to one of the engineers there, and they were telling me how they were replacing that entire system with Presto. I don't know how far they went with that, but that's what they were doing. And it's like, okay, this is something I never thought people would be doing: Presto being used for analyzing what's effectively the telemetry of systems, and being able to do live analysis and analytics over those systems. So it's a completely different application.
[00:38:20] Unknown:
And as far as your experiences of working on and with the project, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:38:30] Unknown:
Well, I think there are a couple of things there. One is that when we started the project, I had never built a database engine, and neither had Dain, David, or Eric. So there was a lot of learning by trial and error. Of course, we looked at the literature, a lot of literature over the last 30 years that talks about different aspects of building a database system. The problem is that those research papers tend to be very narrowly focused. So it's like trying to discern the shape of an elephant by touching the different parts of the elephant, right? It's the proverbial example. It's hard to get a mental picture of how everything fits together; there's no one place where you can go and say, oh, this is how you build a database system. So it was a lot of learning through trial and error. And looking at it in hindsight, after eight years of experience, there are so many things I would have done differently and so many things I would have approached differently in terms of architecture, design, and all that. Now, of course, we ended up in a good place.
Sometimes we did things in maybe not the most optimal way or the best way, but one of the important things is being able to iterate and change the different parts of the system to keep evolving. And we're still doing that. There are parts of the system that are still legacy from back in 2012, 2013 that we want to eventually change, and we're doing that slowly. The other thing, and this is a challenge, is related to what I mentioned before: how you deal with building a system that you're trying to optimize for your company while also making it open source and catering to what other people need. The tension there is that there's internal pressure to do something that will solve a problem for Facebook in the short term, but it might not be the best thing in the long term for the project, or doing it in the way that would solve the problem in the short term may mean taking a detour that will be harder to back out of in the future. Right? So there's this interesting balance between solving the problems that Facebook needed solved and still being able to build something general enough to satisfy what everyone needs in the community.
And, of course, sometimes there were changes that came from the community that were in conflict with things that we were building. So you need a way to talk about those things and figure out what the right order is for them to go in; especially when there are changes in the same areas, you need to make sure that they are sequenced right and so on. With distributed groups of people working on the same project, it can be challenging. So that's one of the main things that I think Presto has had as a challenge that maybe other internal projects at companies haven't had.
[00:41:32] Unknown:
Going back to the case where somebody is considering Presto for their data infrastructure, what are the cases where it's the wrong choice that they would be better suited either going with a vertically integrated data warehouse or some other SQL engine or just using a dedicated database system?
[00:41:49] Unknown:
Well, I think there are a couple of use cases where Presto is not the right fit, at least today. For example, if you're doing streaming, and you want to do live queries over streaming data, or even standing queries over streaming data, Presto won't be the right fit for that right now. It's not capable of doing that. Eventually, it's one area we might look into, but right now there are other, better systems for that. For example, Kafka, Flink, or what Confluent is doing around Kafka may be more appropriate.
Another area would be if you're doing machine learning. Presto has a couple of functions for machine learning, but they are nowhere near the state of the art, and there are probably much better systems for that. If you're using PyTorch or TensorFlow, or even Spark, and you're trying to access all the raw data directly, going through Presto is probably not the best approach for now. At least there's no good integration right now with those systems to be able to do that. So those are the main use cases where you probably wouldn't use Presto. One of the things that we occasionally get asked is, oh, I have a Postgres instance, I want to use Presto to query it. And it's like, okay, if you're just using one instance of Postgres, just go directly to Postgres. It doesn't buy you anything to go through Presto, aside from getting access to the SQL language features that Presto exposes. You're going to be reading all your data from Postgres, and you're not going to be able to run your queries as efficiently as if you went directly to Postgres.
So that's also not a good use case. If you want to front a single database with Presto, you're still better off going directly to that database. Now, of course, if you're integrating different databases, then that's a better fit for Presto.
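The integration case where Presto does earn its keep is a single query spanning catalogs; a minimal sketch, with hypothetical catalog, schema, and table names:

```sql
-- Join a dimension table that lives in Postgres against fact data in
-- the lake, in one SQL statement, without an ETL step first.
SELECT c.region, count(*) AS order_count
FROM hive.analytics.orders AS o
JOIN postgresql.public.customers AS c
  ON o.customer_id = c.id
GROUP BY c.region;
```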
[00:43:43] Unknown:
One of the other questions that I have is about the capabilities of Presto in terms of transactionality, particularly across objects in a cloud data store.
[00:43:53] Unknown:
So Presto has some basic support for transactions. The challenge there is that it's really dependent on the capabilities of the connector and the underlying data source. For example, one limitation is that if you're running a query across multiple connectors, you won't get transactionality, because there's no way to distribute transactions across them. Let's say you're reading data from Postgres and writing it to S3 using the Glue metastore; there's no way to have a transaction across those systems. Now, within a single connector, the story is a bit different. For example, with the recent improvements to the Hive connector, we support Hive transactional tables, which allow you to modify and insert into those tables with some transactional guarantees. For example, if you are in the middle of a write, other people are not going to be able to read that data; until the transaction is committed, they won't be able to read it. And if the query fails in the middle, you're not going to have partially written data in that table. That can be important for many use cases. There are still some limitations; for example, in the Hive model, you still can't have transactions across tables.
But again, it's limited to what the connector and the underlying system do. Presto can't do more than what those systems support, unfortunately.
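A sketch of the single-connector transactional case he describes, assuming a Hive 3 metastore and a connector version with ACID support (the `transactional` table property and its exact name may vary between Presto versions):

```sql
-- Hive ACID table: readers never observe a half-finished write.
CREATE TABLE hive.analytics.user_state (
    user_id BIGINT,
    state   VARCHAR
)
WITH (format = 'ORC', transactional = true);

-- Rows become visible to other queries only once this commits; if it
-- fails midway, no partially written data is left behind.
INSERT INTO hive.analytics.user_state VALUES (1, 'active');
```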
[00:45:17] Unknown:
And as you continue to work on Presto and help support it and its community, what are the things that you have in store for it and that you're most looking forward to releasing both technically and in terms of the overall ecosystem?
[00:45:31] Unknown:
I've spent eight years already on the project, and when we created it, our aspiration was for this to be a long-term, multi-decade project. So we want the community and the project to be healthy over the long term. We've seen a lot more people get involved and a lot more companies get involved over the past few years, and we want that to continue. If you think of Postgres as a successful open source project, we want Presto to be like that. We want people to think, oh, if we're talking about SQL analytics, oh, that's Presto.
And that's where we're trying to go with the community and the project. In terms of technical things, in the short term, there are a couple of features that we and the rest of the community are working on. One of them is what we call complex operation pushdown into connectors. This is about extending the capabilities of the connector interface to be able to push down more than just simple filters into connectors. For example, if you have a query that involves aggregations and joins and an order by, we want to be able, when the connector is capable of doing that, to push those operations into the connector and, of course, into the remote system. For example, if you're querying data from Oracle and you're doing, say, a join between two Oracle tables, it's pointless to bring all the data into Presto, join the tables, and then filter the results and return a small set. Right? You're probably better off being able to formulate that as a join query inside Oracle, and then let Oracle do all the work with its indexes and optimization capabilities. So that's one of the things we're working on: being able to delegate more of these complex operations to the storage systems that can support them.
Another feature that we're working on is actually a revamp of the way functions work in Presto. Right now, functions are pluggable in the system as well, but they have to be defined upfront when the server starts. You can deploy a plugin, the server starts, and the functions that are exposed by the plugin are available. But as it is, it's static. What we're changing is the model, in a way that allows functions to be resolved dynamically. Just like tables and views and other entities in a SQL catalog are resolved, we're extending the function system to be able to do that. This is super powerful. It allows connectors, at runtime when a query is run, to resolve the function and decide at that point whether the function is supported or not and how it's implemented.
This can be useful, for example, if you have integration with other systems that provide functions and where users can go and define functions. There's a system that LinkedIn uses where they bring in functions that are portable across Hive and Spark and Presto, and they are defined externally. You want to be able to evolve that catalog of functions externally and make it available to Presto without having to restart the Presto server. So that's powerful from that point of view. And then, of course, it opens up the door to allowing users to define their own functions using the SQL function language, or potentially any other language binding, like JavaScript or Python or something like that. So this is the baseline to be able to do that. That's going on right now, and hopefully it will land in the next few months.
Then in the longer term, there is, of course, a desire to make Presto an even better fit for cloud environments. For example, we want to take advantage of spot instances across different cloud environments, which can reduce cost. There are some challenges in terms of architecture, and we don't have any concrete plans right now, but these are some things that we've been discussing over the past couple of years. What can we do to make the engine operate better in those environments where machines may disappear, but they're cheaper?
Right? And potentially, in the future, even other use cases. I don't know, can we do something with machine learning? Can we do something with streaming? Can we better support that? Can we support more granular fault tolerance, which would enable certain use cases like very long-running queries or even streaming use cases and so on? But that's kind of pie in the sky, longer-term ideas and vision.
[00:50:06] Unknown:
And also, for anybody else who is looking to learn more about Presto, I know that you recently published a book with O'Reilly along with some of your co-conspirators on the project, so I'll add a link to that in the show notes as well. And for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. As a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:33] Unknown:
Yeah, I think there are a couple of aspects to this. If you look at the data management ecosystem today, all the pieces are there. You have excellent query engines, you have streaming systems, you have catalogs, you have visualization and BI tools. But one of the things that I think is missing is something that brings everything together into a coherent platform. You can cobble these things together, you can kind of connect them, but it's a lot of work. It's cumbersome, and for someone who doesn't really understand that space, it can be a lot of effort. So I think providing that fully integrated experience is something that would be super powerful if anyone were to do it. And then within that, I think the weakest part right now is how the data life cycle is managed. For example, how do you go all the way from ingestion to how you store and optimize the data to how you expire it, especially when you're considering workloads? How do you organize, and potentially reorganize, and optimize the data to satisfy those workloads? As far as I know, there's nothing that is capable of doing that right now, at least in the open ecosystem; of course, proprietary systems may do that, but I don't know of anything on the open source side. And then there's one big thing, and this is something that we had at Facebook: as your data warehouse grows, if you have a lot of users, and we have thousands of users, you end up with hundreds of thousands of datasets.
It's hard to know what the relationships between those datasets are. What's derived from what? What's duplicated? There's a lot of redundancy. So being able to make more sense of that can improve efficiency and reduce cost, and also reduce the cognitive barriers for people when they're faced with all those datasets and have no idea where to start
[00:52:40] Unknown:
when they need to look at some. Well, thank you very much for taking the time today to join me and discuss the work that you've done with Presto. It's definitely a very interesting project, one that has been very transformative for a large portion of the data ecosystem, and it's very important within the overall space. It's something that I'm looking to take advantage of in the near future as well. So I appreciate all the time and effort you've put in on that, and I hope you enjoy the rest of your day. Thank you. Thanks for having me. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Episode Overview
Interview with Martin Traverso: Presto SQL
The Origins and Architecture of Presto
Open Source Philosophy and Community Growth
Use Cases and Fit for Presto
Recent Developments and Community Expansion
Deployment and Scaling Considerations
Best Practices for Data Management with Presto
Innovative Use Cases and Lessons Learned
When Presto is Not the Right Choice
Future Directions and Technical Roadmap
Closing Remarks and Contact Information