Summary
Databases are limited in scope to the information that they directly contain. For analytical use cases you often want to combine data across multiple sources and storage locations. This frequently requires cumbersome and time-consuming data integration. To address this problem Martin Traverso and his colleagues at Facebook built the Presto distributed query engine. In this episode he explains how it is designed to allow for querying and combining data where it resides, the use cases that such an architecture unlocks, and the innovative ways that it is being employed at companies across the world. If you need to work with data in your cloud data lake, your on-premise database, or a collection of flat files, then give this episode a listen and then try out Presto today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Martin Traverso about PrestoSQL, a distributed SQL engine that queries data in place
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what Presto is and its origin story?
- What was the motivation for releasing Presto as open source?
- For someone who is responsible for architecting their organization’s data platform, what are some of the signals that Presto will be a good fit for them?
- What are the primary ways that Presto is being used?
- I interviewed your colleague at Starburst, Kamil, 2 years ago. How has Presto changed or evolved in that time, both technically and in terms of community and ecosystem growth?
- What are some of the deployment and scaling considerations that operators of Presto should be aware of?
- What are the best practices that have been established for working with data through Presto in terms of centralizing in a data lake vs. federating across disparate storage locations?
- What are the tradeoffs of using Presto on top of a data lake vs a vertically integrated warehouse solution?
- When designing the layout of a data lake that will be interacted with via Presto, what are some of the data modeling considerations that can improve the odds of success?
- What are some of the most interesting, unexpected, or innovative ways that you have seen Presto used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building, growing, and supporting the Presto project?
- When is Presto the wrong choice?
- What is in store for the future of the Presto project and community?
Contact Info
- @mtraverso on Twitter
- martint on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Presto
- Starburst Data
- Hadoop
- Hive
- Glue Metastore
- BigQuery
- Kinesis
- Apache Pinot
- Elasticsearch
- ORC
- Parquet
- AWS Redshift
- Avro
- LZ4
- Zstandard
- KafkaSQL
- Flink
- PyTorch
- Tensorflow
- Spark
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. And when you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today.
Your host is Tobias Macey, and today I'm interviewing Martin Traverso about PrestoSQL, a distributed SQL engine that queries data in place. So, Martin, can you start by introducing yourself?
[00:01:50] Unknown:
Yeah. Hi, Tobias. Thanks for having me. So my name is Martin. I am one of the original creators of Presto, back in 2012. I'm currently a CTO, one of the CTOs at Starburst. I've been working on Presto for the last 8 years. Before that, I worked at a couple of different companies: enterprise companies, consumer companies, social networking, and so on. So I have a bunch of experience in different areas of distributed systems and large-scale systems.
[00:02:16] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:19] Unknown:
I've been exposed to databases from pretty early on; my first job involved databases in some form or another. But for many years it was always transactional data, like the data behind a website or some application. It wasn't until around 2007, when I was at Ning, that I started looking at Hadoop and the Hadoop ecosystem and brought it into the company to start experimenting with analytics there. Then over the next few years I got to use things like Hive and Hadoop even more, until I ended up at Facebook in 2012. And that's when we started writing Presto, and I've been working on that system since then.
[00:03:03] Unknown:
And so in terms of the Presto project itself, I know that it is, as I mentioned, a distributed SQL engine, and that it doesn't do any data storage on its own, instead relying on different backing stores for handling the actual durability of the information. So I'm wondering if you can give a bit more of an overview of what Presto is, some of the origin story, and why it was architected in that way.
[00:03:27] Unknown:
Yeah, absolutely. And that is absolutely correct. Let me start by describing why we created Presto, and that will give you a better picture of how we ended up where we did with it. So back in 2012, Dain Sundstrom, David Phillips, Eric Hwang, and myself formed a team at Facebook. We joined Facebook around that time, and we formed a team to try to solve one of the big problems they had in their infrastructure at that moment. People were using Hive almost exclusively. Hive came from Facebook, I don't know if you remember. And they were using it almost exclusively for any analytics they wanted to do: batch analytics, ad hoc, interactive.
But the problem with Hive is that it's very slow. There was a famous quote from a data scientist at Facebook who said, it is a good day when I can run six Hive queries. So we looked at that. That was an awful experience; people were not able to work effectively. So we said, okay, there's something we can do about this. We set out to build Presto with the initial goal of tackling that interactive, ad hoc user experience on top of the Hadoop warehouse, so we built Presto to be efficient at that. Actually, from the beginning we decided we wanted to make it open source, and that posed a couple of challenges, because Facebook had their own version of Hive at the time. They had forked the open source version internally; they had extensions, they had changes. And we needed to be able to support both the internal version and the external, open source version. So we said, okay, how do we go about doing that? And that's when we came up with the idea of separating the computation layer from the storage APIs. We put a very simple and clean interface between the two, and abstracted it out into what we call connectors. And we made the connectors pluggable, so you can drop in a new connector when you deploy Presto and talk to a different data source. So that was the forcing function. Then after that, other people back in the company, and then, after we open sourced it, people outside, ended up using that as a way to extend Presto and connect it with a bunch of other data sources. So we started with the goal of making analytics over Hadoop faster, but we ended up with a system that can query Hadoop, integrate with other data sources like Elasticsearch, MySQL, Oracle, and Postgres, and run queries across all of them. In that sense, Presto is the query engine, and it delegates to all these other systems for the storage needs through the connectors and plugins I mentioned.
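The separation Martin describes, a computation layer talking to pluggable connectors through a narrow interface, can be sketched in a few lines of Python. This is a toy illustration only; Presto's real connector SPI is a much richer Java interface, and all the names here (`Connector`, `Engine`, `scan`) are invented for the example:

```python
from abc import ABC, abstractmethod

class Connector(ABC):
    """Toy stand-in for Presto's connector idea: the engine only
    knows this interface, never the storage system behind it."""

    @abstractmethod
    def scan(self, table):
        """Yield rows (as dicts) from the named table."""

class InMemoryConnector(Connector):
    """One possible backing store; Hive, MySQL, etc. would be others."""
    def __init__(self, tables):
        self.tables = tables

    def scan(self, table):
        yield from self.tables[table]

class Engine:
    """The computation layer: catalogs map names to pluggable connectors,
    and all query logic lives here rather than in the storage system."""
    def __init__(self):
        self.catalogs = {}

    def register(self, name, connector):
        self.catalogs[name] = connector

    def query(self, catalog, table, predicate):
        # The connector only scans; filtering happens in the engine.
        return [row for row in self.catalogs[catalog].scan(table) if predicate(row)]

engine = Engine()
engine.register("hive", InMemoryConnector(
    {"users": [{"id": 1, "active": True}, {"id": 2, "active": False}]}))
active_users = engine.query("hive", "users", lambda r: r["active"])
```

Dropping in a second `register` call with a different `Connector` subclass is all it takes to query another source, which is the property that made the connector design a forcing function for everything that followed.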
[00:05:58] Unknown:
In terms of the release of it as an open source project, what was the motivation behind that? And how do you think that influenced the original ideas about how to approach the problem?
[00:06:10] Unknown:
Dain, David, and myself have always been big believers in open source. When we started the project, we said we want to do something that will be a multi-decade project. We wanted it to be as successful as other projects out there, like Postgres or MySQL. So we said we want to build something that stands the test of time, and the only way to do that is to make it open source. If you keep it inside Facebook, this is the kind of thing that might be useful for a few years, and then something else comes along and it gets abandoned; people move to a different team, and it stagnates, and so on. We wanted to make it open source so that we could get other people involved and it could grow beyond a single company and a single use case. That's kind of the motivation. And we wanted to build something that we didn't want to have to build again over and over, so making it open source was the right approach there.
Now, of course, that places a bunch of constraints on the way we architect things and the way we develop things. There are a number of things that you would normally be able to do if the system were specialized for the single use case that you're trying to solve. But by making it open source and accessible to other people, we were basically saying, okay, we want to make the system more flexible; we want to cater to the needs of other people. So there's a delicate balance we have to strike there. It required creating these abstractions of connectors, and it sometimes requires making trade-offs about what we develop at a given point in time so that we can satisfy what other people need, not just what Facebook needs.
Sometimes there was some complex sequencing of features and the way we approached them, because we might have someone external to Facebook come and say, hey, I need this feature, while we might be working on something else, and if you don't do it carefully there might be a conflict in terms of how the two changes fit together. So it's a little more work. You have to interact with people in the community a lot more; we have to coordinate things a lot more. And it can get challenging when those people are all over the world, as opposed to sitting next to you in your office. So it has changed the way we approach the development and the architecture of the system, and also how we work with people as we advance the system. For somebody who is responsible for architecting the data platform for their organization,
[00:08:37] Unknown:
or considering how they want to handle data access to different sources or a single source, what are some of the signals that they should be looking at to determine whether or not Presto will be a good fit for their environment? There
[00:08:50] Unknown:
are two different places where people start from. You have people that are starting from scratch, trying to build their infrastructure, so they have a greenfield; they can choose whatever technology they want. If you're considering doing that, and especially if you're going to be in the cloud, you're probably going to be looking at S3, or Azure Data Lake Storage, or Google Cloud Storage for storing your data, and then you're probably going to be using something like the Glue metastore in AWS for your catalog, and so on. Once you start consuming those things, Presto is a natural fit, because Presto was designed to be able to work in that ecosystem. So if you're going to be picking anything from the Hadoop and Hive ecosystem, Presto is just a fit there. And of course, if you need interactive queries, and you need those to be fast, efficient, etcetera, Presto is a perfect fit. Now, if you're dealing with other data sources, that's a case where Presto can be useful.
It really depends on how you're using those data sources. One of the things we've seen is people trying to migrate out of those systems, and Presto can be a good bridge for that, because you can connect Presto to those data sources, and you can connect it to a data lake, and then you can migrate from one to the other over time and do it somewhat transparently. You can move workloads and continue to use the same simple interface to access both data sources, which is very convenient for users; it lowers the barrier of learning, and so on. But if what you're thinking is that you're going to keep all your data in one place (let's say you have an Oracle instance and you want to use Presto for querying that data instead of going directly to Oracle), that might not be a good fit. Oracle can do a much, much better job at optimizing queries than Presto pulling the data out of Oracle to execute the queries on the Presto engine side. So it's a question of: are you trying to integrate different data sources, or are you trying to front a single data source with Presto for some reason? Where Presto shines is, like I said, querying data lake data; it's very fast and very efficient. So if you're going that route, then Presto is a natural fit. If you need to integrate data across multiple data sources, that can also be a good use case for it. For the cloud data lake scenario,
[00:11:10] Unknown:
one of the things that I've often seen as a challenge is the access controls and authorization for the different data sources. Is that something that Presto can manage itself? Or is that pushed down into the underlying storage engines to determine what access is provided to the end user?
[00:11:28] Unknown:
There are a couple of different modes Presto can work in. First, it implements the standard SQL access control model, so for every table there are permissions that the engine checks. When a user tries to query a table, it can check whether the user has access to read the data, or different columns of the table, or whether they can write to the table; there's also a concept of column masks and filters that can be applied. So Presto provides the framework, but ultimately it's up to the connector to decide how to store and check those actual permissions. For example, if you're dealing with Hive, with a Hive metastore, you might have a Ranger integration somewhere, so Presto can delegate to Ranger to see whether a given user has permission to access a given table. If you're talking to Oracle, or to a SQL database, Presto delegates to the connector, and it's up to the connector to go and check in MySQL or Oracle whether the user has permissions.
There's another scheme, what we call credential pass-through, that allows the user to provide credentials for the backend database when they run the query. Presto will not look at those credentials; it won't do anything with them. It will just pass them on to the underlying data source, and then the underlying data source is going to authenticate against the data store and use those credentials to check whether the user has permission. So it's flexible; it depends on your constraints. There are some organizations that want to constrain access to the underlying data sources at the data source level. In other cases, they want to handle that inside Presto, and that's perfectly doable.
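The two modes Martin contrasts, engine-checked grants versus credential pass-through, can be modeled with a small sketch. Everything here (`PERMISSIONS`, `Backend`, the function names) is invented for illustration and is not a Presto API:

```python
# Mode 1: the engine (via the connector) checks SQL-style grants itself.
PERMISSIONS = {("alice", "orders"): {"read"}}  # (user, table) -> granted actions

def engine_checked_query(user, table):
    """Reject the query before touching storage if the grant is missing."""
    if "read" not in PERMISSIONS.get((user, table), set()):
        raise PermissionError(f"{user} may not read {table}")
    return f"rows of {table}"

class Backend:
    """Stand-in for an underlying store (MySQL, Oracle) doing its own auth."""
    def __init__(self, accounts):
        self.accounts = accounts  # user -> password

    def query(self, table, user, password):
        if self.accounts.get(user) != password:
            raise PermissionError("backend rejected credentials")
        return f"rows of {table}"

def passthrough_query(backend, table, user_credentials):
    """Mode 2: credential pass-through. The engine forwards the user's own
    credentials untouched and lets the backend make the access decision."""
    user, password = user_credentials
    return backend.query(table, user, password)
```

In the first mode the decision lives with the engine and connector; in the second, the engine is deliberately blind and the data source remains the single authority, which is what organizations that constrain access at the source level want.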
[00:13:12] Unknown:
In terms of the types of use cases, you mentioned data lakes and being able to integrate across multiple different data sources. But what are some of the other ways that Presto is often used, and the ways that it fits into the overarching ecosystem of big data tools?
[00:13:28] Unknown:
Yeah. So let me describe one of the things that we did at Facebook very soon after we made Presto able to have connectors. One of our friends at Facebook was working on a team building one of the backends for one of the user-facing systems. It was an in-memory data store, and they were thinking of implementing their own SQL engine, or their own query engine, because they needed to be able to analyze that data to see whether people were gaming the system, whether the algorithms were correct, and so on. So we told him about Presto and its plugins.
He said, okay, I'm going to try that. He wrote a plugin in a couple of weeks, and in two more weeks he got the system running in production. So he was able to integrate Presto with that in-memory store and provide analytics for the team running that system within a month. And that has nothing to do with big data, Hadoop, or Hive; it was a completely different use case. It was super powerful. After that, we ended up building a bunch of different systems on top of Presto, or using Presto as the backend. For example, the A/B testing system at Facebook uses Presto as the analytics backend; that's a completely different stack, separate from Hadoop and Hive. There's another system where Presto acts as the analytics backend for a user-facing product for ads. These are users that are running ad campaigns; they want to go and analyze their campaigns, so they go to a website, they click a bunch of buttons, and that runs Presto queries under the covers.
So this is a completely different scenario where it's end-user facing. It has high availability requirements; it has to be up 24/7; it runs across multiple geographic locations. You have to return results within 1 or 2 seconds, because there's a person waiting at the other end of the web browser for the results.
[00:15:27] Unknown:
And so I interviewed your colleague from Starburst, Kamil, a couple of years ago at this point, and talked a bit about the underlying functionality of Presto, how it's built under the covers, and some of the ways that you're supporting it at Starburst. But I'm wondering, what are some of the most notable changes or evolutions in the technical, community, and ecosystem growth in that time?
[00:15:52] Unknown:
One of the biggest changes is community-wise: the community has grown dramatically over the last couple of years. After we left Facebook, Dain, David, and I spent 9 or 10 months working on Presto full time, growing and consolidating the community. And it paid off; we have a lot more people all over the world involved in the project. Today we have more than 2,600 people on the Slack channel for Presto. That's a lot of people interacting every day, whether it's asking questions about how to do certain things, or when they're running into issues, or people saying, I need this feature, or, I'm interested in implementing this, how do I start, all the way to more involved and complex discussions about architecture and design, and so on. And last year we ran 5 conferences.
These were organized not just by us, but in coordination with companies all over the world. We did conferences in India, in Israel, in California, in New York, and in Japan. This is a testament to how global Presto is; it was actually very surprising to us how it grew and how much adoption it has across the world. In those 2 years, we had more than 5,000 code changes, 5,000 commits into the code repository, which is pretty significant. That's about 25% of all the commits in the entire history of the project over the last 8 years. In terms of technology improvements, one of the most notable things is the sheer number of connectors that we've added recently. Over the past year and a half or so, we added connectors for Elasticsearch, Oracle, BigQuery, Apache Pinot, Druid, and Amazon Kinesis.
And there are a few more, like MemSQL and so on. These were mostly contributed by community members; some of them came from specific companies, but most were from people who were interested in working on something and contributed connectors for those systems. On the engine side, there have been a lot of improvements: some performance improvements, some architecture improvements to the way Presto runs queries. One notable feature is what we call dynamic filtering, which allows the engine to optimize queries that perform joins by learning what data comes from one side of the join and then using that to dynamically prune the data source on the other side of the join. This can be a significant boost in performance, especially if your data is partitioned in Hive, or if you're using one of the formats like ORC or Parquet that have internal stats that can be used to prune datasets.
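The dynamic filtering idea can be illustrated with a toy join over partitioned data: collect the keys from the small build side first, then use them to skip whole probe-side partitions. This is a simplified model of the behavior described above, not the engine's implementation:

```python
# Toy illustration of dynamic filtering on a partitioned probe side.
build_side = [{"region_id": 3}, {"region_id": 7}]   # small dimension table
probe_partitions = {                                # fact table, partitioned by region_id
    1: [{"region_id": 1, "sales": 10}],
    3: [{"region_id": 3, "sales": 20}],
    7: [{"region_id": 7, "sales": 30}],
    9: [{"region_id": 9, "sales": 40}],
}

def join_with_dynamic_filter(build, partitions):
    """Learn the join keys at runtime, then prune partitions that
    cannot possibly match before reading them."""
    keys = {row["region_id"] for row in build}
    partitions_scanned, joined = 0, []
    for part_key, rows in partitions.items():
        if part_key not in keys:          # skip the partition entirely
            continue
        partitions_scanned += 1
        joined.extend(r for r in rows if r["region_id"] in keys)
    return partitions_scanned, joined

scanned, joined = join_with_dynamic_filter(build_side, probe_partitions)
# Only the 2 matching partitions are read instead of all 4.
```

The same pruning applies when ORC or Parquet min/max statistics, rather than Hive partitions, let the reader skip row groups that cannot match the learned keys.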
Also related to performance, a feature called late materialization got added, which allows the engine to evaluate certain things more lazily, so it can avoid reading data from the underlying data source if it decides it doesn't need the data at runtime. This is another dynamic optimization; it's not something that can be determined upfront. As a query executes, say you have some data that's going to be scanned and filtered, then aggregated; if the filter eliminates the rows, the engine won't bother reading the rest of the data. Traditionally, Presto would read the data, then filter it, and then throw it away before aggregating, for instance. There are a number of other areas among those 5,000 code changes; there are hundreds of small changes and then a few big changes, and there are some other areas where we have seen a lot of improvement.
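The late materialization behavior can be sketched by counting reads: evaluate the cheap filter column first and fetch the expensive payload only for surviving rows. All names here are invented for the illustration:

```python
# Toy sketch of late materialization: read the cheap filter column first,
# and fetch the expensive payload column only for rows that survive.
reads = {"count": 0}

def load_payload(row_id):
    """Stands in for actual I/O against the data source."""
    reads["count"] += 1
    return row_id * 100

filter_column = {0: False, 1: True, 2: False, 3: True}  # row_id -> predicate result

def eager_scan():
    """Traditional plan: materialize every row, then filter."""
    rows = [(rid, load_payload(rid)) for rid in filter_column]
    return [payload for rid, payload in rows if filter_column[rid]]

def lazy_scan():
    """Late materialization: evaluate the filter first, read survivors only."""
    survivors = [rid for rid, keep in filter_column.items() if keep]
    return [load_payload(rid) for rid in survivors]

eager = eager_scan()
eager_reads = reads["count"]   # every payload was read, even discarded ones

reads["count"] = 0
lazy = lazy_scan()
lazy_reads = reads["count"]    # only surviving payloads were read
```

Both plans produce the same result; the lazy one simply performs half the payload reads here, and the win grows with the selectivity of the filter.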
For example, there's better support for cloud environments: support for Google Cloud Storage and Azure Data Lake Storage is new in the past couple of years, and we've added support for AWS Glue, for the Glue catalog. There have been a bunch of features that make it easier to run Presto in these various cloud environments. And then there's been a big push around security. We added support for column-level security, column masks, and row filters, but also better support for different modes of authentication for clients and for securing the server and the Presto cluster end to end. For example, all the endpoints in the cluster are now secured behind credentials, and the UI is secure.
The communication between the coordinator and the workers is also secured, which is something that many companies don't care about, but as Presto is being adopted by industries like banks and financial institutions, which are very sensitive to these aspects, it matters. Those are some of the things we've been improving over the past couple of years. And one last one I'd like to mention: we added support for variable-precision timestamps in the last few releases. This is something we've been wanting to do forever. It's what most databases support; go use Oracle, or Teradata, or Netezza, and they support variable-precision timestamps, which allow you to have more granularity than milliseconds, which is what Presto has traditionally supported. And this is important, for example, for the financial industry; oftentimes they need to keep data with nanosecond granularity.
Now that is possible with Presto, so that opens up the door for a lot more use cases and adoption.
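A small sketch shows why sub-millisecond precision matters for workloads like trade data. The epoch values are arbitrary; the point is only what truncation does:

```python
# Two events 500 nanoseconds apart collapse into the same value
# once truncated to millisecond precision.
t1_ns = 1_600_000_000_000_000_000   # a timestamp in epoch nanoseconds
t2_ns = t1_ns + 500                 # 500 ns later

def truncate_to_millis(ns):
    """Drop the sub-millisecond digits, as a millisecond column would."""
    return ns // 1_000_000

distinct_at_ns = t1_ns != t2_ns                                       # True
distinct_at_ms = truncate_to_millis(t1_ns) != truncate_to_millis(t2_ns)  # False
```

With only millisecond columns, the two events become indistinguishable, so ordering and deduplication of nanosecond-spaced events silently break; variable-precision timestamps avoid that loss.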
[00:21:13] Unknown:
And in terms of being able to deploy and scale a Presto cluster, as you mentioned, there's been a lot of work done recently to support different cloud environments, and I know that there's also been support added for things like cloud native infrastructure using containers and Kubernetes. I'm wondering for people who are planning an installation of Presto, what are some of the considerations that they should be aware of in terms of the deployment and scaling and some of the integrations that might simplify their work to maintain a cluster with high uptime and reliability?
[00:21:47] Unknown:
The first thing is that there is some basic support in the open source project, but most of the support for high availability and more native integration with cloud environments is done by vendors. For example, if you look at Starburst, there's an offering in the Amazon Marketplace and in the Google Marketplace. That's probably going to be your best bet if you want an out-of-the-box experience that's easy and straightforward. If you're trying to do it yourself, it will be more work; you're going to have to do all the things those integrations do by hand, like how you do high availability for the coordinator. It can be done, but you'll have to orchestrate that yourself. All the integration with monitoring and CloudWatch in Amazon and so on, you have to do yourself. So if you want an out-of-the-box experience, one of the vendors is probably going to be your best bet. Then in terms of how you deploy and scale, there are a couple of things that are interesting to know. One of those is that Presto has been run at scale at many companies over the past 8 years.
For example, at Facebook we ran clusters of about 1,000 nodes each. That gives you a sense of how big you can make a cluster, and that's probably bigger than almost any company needs; you just need to know it can go that high, but you probably won't need it. There are a couple of considerations you have to take into account. Presto can utilize machines to their full extent; it's parallel and multithreaded internally, so if you have a machine with multiple cores, it will use them all. So if you're thinking of colocating Presto with other systems, that's probably not a good idea; you want to dedicate entire machines to Presto itself. This is not an issue in cloud environments, because you can just provision new machines and the provider takes care of that for you, but if you're doing an on-premise deployment, you have to be careful with that. And similarly, Presto can use the network very efficiently; it can drive the CPU and the network to their limits.
So you need to make sure you have fast network interconnects between the workers, which again is typically not an issue in cloud environments like AWS; as long as you're in the same region, you'll be fine. Azure and Google Cloud are the same. But if you're on premise, you have to be a lot more careful about how you distribute your machines across racks; you don't want to run into top-of-rack limits and so on. So cloud environments make it easier, and you have to take those factors into account in on-premise deployments. And then there's something that we learned at Facebook.
It's generally useful, if you have different use cases, to separate them. For example, you may have some people at your company who need interactive analytics, like BI tools, or dashboards, or sitting in front of a console typing queries, but you also need to support batch workloads; then you may want to separate those into separate clusters. The configurations for those are going to be different, because you're probably going to have different SLAs in terms of response times, queuing, and how long you're willing to let people wait for a query to start, and so on. So one of the best practices is to separate your batch and interactive workloads; that makes it easier to manage them, easier to manage user expectations, and to make sure that everyone's happy.
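The batch-versus-interactive split can be sketched as a simple router. The cluster names, limits, and source labels below are all made up for illustration; a real deployment would encode its SLAs in the engine's own resource configuration rather than in application code:

```python
# Toy router for the batch-vs-interactive split described above.
CLUSTERS = {
    "interactive": {"max_queued": 20,  "query_timeout_s": 60},        # low latency
    "batch":       {"max_queued": 500, "query_timeout_s": 6 * 3600},  # high throughput
}

INTERACTIVE_SOURCES = {"bi-tool", "dashboard", "cli"}

def route(query_source):
    """Send BI tools and ad hoc sessions to the low-latency cluster,
    and anything scheduled to the batch cluster."""
    return "interactive" if query_source in INTERACTIVE_SOURCES else "batch"
```

The point of the separation is visible in the two configs: a dashboard user gets a short queue and a tight timeout, while a nightly ETL job is allowed to queue deeply and run for hours, and neither workload can starve the other.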
[00:25:19] Unknown:
As far as the best practices for working with data within Presto: because it can federate across these multiple sources, there is the possibility of using it for some ETL capabilities, reading from one table in one data source and then writing it out to another. But what are some of the ways that people can potentially shoot themselves in the foot if they're not careful about how they structure their queries, or how they work across these different data sources?
[00:25:50] Unknown:
Presto is a very complicated system, so the short answer is: it depends. It depends a lot on the use case and the requirements. It depends on what your performance requirements are, how many users are going to be running queries, whether those queries are going to be big or small, whether they are going to use a lot of memory, and so on. Ideally, if you can run Presto over a data lake and have all your data sitting there, that's going to be the most efficient way to use it, because Presto's connector for the Hive data layouts, running over S3 or HDFS,
is the most optimized one. It's the one that has received the most attention over the past eight years, and, of course, it's the one that can run the most parallel and utilize the resources most efficiently and so on. So you're going to get the best experience if you do that. But of course, not everyone has all their data in a single place like a data lake. Oftentimes they have data in different systems. Presto makes it possible to read from those other systems, but one thing you have to take into account is that you're going to be limited in many cases by the capabilities of the connector and how the connector can interact with those systems. For example, the MySQL connector goes through JDBC.
JDBC is effectively a single connection per query against MySQL. That means that when you run a query in Presto that needs to fetch data from MySQL, it's going to come through one connection. It's not going to be parallel, so if you're pulling a lot of data, you may be bottlenecked by that, and you have to take that into account. Now, there are vendors like Starburst that have optimized versions of some connectors. For example, there's an Oracle connector that is fully parallel, so you don't suffer from the potential bottleneck of being fed through a single connection. But there are still some limitations. If you run a complex query, the engine cannot push certain query shapes into the underlying data source, so you may end up transferring a lot of data from the remote system, which may be taxing to that remote system.
So those are some of the considerations you have to take into account. One thing we've seen is people use the ability to connect to these data sources to query them occasionally. Maybe they have fast-changing data in these other systems, because those systems are better suited for that than Hadoop, which doesn't really allow modifying data very efficiently. In those cases, that's a good fit. Sometimes what they do is use the ability for Presto to read from those data sources to mirror them into a data lake, and then they can query them efficiently from the data lake.
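That mirroring pattern can be sketched as a single CREATE TABLE AS statement. The catalog, schema, and table names here (`mysql`, `hive`, `shop.orders`, and so on) are hypothetical; the sketch assumes a MySQL catalog and a Hive catalog are already configured on the cluster.

```sql
-- Copy a fast-changing operational table out of MySQL (reached over a
-- single JDBC connection) into ORC files in the data lake, where later
-- queries can run in parallel against S3/HDFS instead of taxing MySQL.
CREATE TABLE hive.analytics.orders_mirror
WITH (format = 'ORC')
AS SELECT * FROM mysql.shop.orders;
```

A scheduled job can refresh the mirror periodically; most queries then hit the lake copy, and only queries that truly need fresh rows go through the live MySQL catalog.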
[00:28:36] Unknown:
Another situation: people who are looking at Presto, either because they're building out a greenfield project or because they already have an existing data warehouse, might be weighing their options. What are the trade-offs of using Presto in its optimal configuration on top of a data lake versus a vertically integrated data warehouse solution and some of the capabilities that it might provide out of the box?
[00:29:00] Unknown:
I think the main one is one of flexibility. If you're using a fully integrated warehouse, for example, let's say you look at Redshift. Because Redshift controls everything from the query engine all the way to storage, and even the data formats and so on, it can be optimized for the specific workload that you're going to perform. The disadvantage is that in order to interact with Redshift, you have to load your data into Redshift before you can query it. So that limits your flexibility; it forces you to go through an extra step that you wouldn't otherwise require.
And in terms of productivity, if you want to interact with that data through other tools, for example if you have Spark in your organization because you're doing a lot of machine learning or more complex processing that is not suitable for SQL, then if you have your data locked in Redshift, you can't just take it out. You have to extract it into a different system before you can use it. So being locked into that specific solution means that once you have a lot of data, getting it out or moving it to another system can become expensive.
Of course, you need ETL in those scenarios. You can run a query in those systems and it can be fast, but with ETL, you have an initial delay of getting all your data into that system in the first place. When you have a data lake, you're going to have systems that land your data directly in the data lake, and as soon as it is there, you can query it. You don't have to do any further transformations; you don't have to migrate it, you don't have to copy it, and so on. I think those are the trade-offs. But of course, it is going to be more feasible for a fully vertically integrated system to do things more efficiently than if you have Presto sitting on top of a data lake.
[00:30:52] Unknown:
In designing a data lake, there are also considerations to be made as to what format you're storing it in, whether it's just JSON data in a gzipped file or Parquet or ORC files, what levels of compression to apply on the various aspects, and then how to lay out the partitioning scheme and what the directory structure might look like in S3, for example. I'm curious what you have found to be some of the useful pointers or references for determining how you want to land your data in the lake so that you can query it effectively with Presto, as well as use it with these other tools?
[00:31:33] Unknown:
Yeah, one of the things that you first need to understand is how you are going to manage your data. What is the life cycle going to be for your data? For example, at Facebook, we had this requirement that data over a certain number of days had to be removed from the warehouse. In order to do that efficiently, you need to organize the data in a way that makes dropping old data efficient. So one thing that is pretty common is to partition your data by date. As you're ingesting data every day, you create a new partition, and then when the data is past the retention window, you just delete the partition. You don't have to go and rewrite anything; you just drop some files from the data lake. So that's one consideration.
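That retention pattern can be sketched with the Hive connector's table properties; the catalog, schema, and table names are hypothetical.

```sql
-- Partition the table by date so that expiring old data means dropping
-- partitions, never rewriting files. In Hive-style layouts the
-- partition columns must be listed last.
CREATE TABLE hive.analytics.events (
    user_id BIGINT,
    payload VARCHAR,
    ds      VARCHAR   -- date partition key, e.g. '2020-06-01'
)
WITH (format = 'ORC', partitioned_by = ARRAY['ds']);

-- In the Hive connector, a DELETE whose predicate touches only
-- partition columns drops whole partitions as a metadata operation:
DELETE FROM hive.analytics.events WHERE ds < '2020-05-01';
```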
But then another consideration is how you are going to be querying that data, and sometimes those things can be at odds with each other. There are some things that can help make that a bit more efficient. For example, Presto, when you're dealing with Hive and the Hive metastore and so on, supports the partitioning schemes that Hive supports, and you can use that for pruning data efficiently. For example, if you're going to be querying data by date and always filtering by date ranges, then, of course, if you partition by date, Presto will be able to skip all the partitions that don't participate in the query. If you have other criteria you're going to be querying over, and the field you're querying over is not high cardinality, then you might be able to use partitioning for that too. For example, you may partition by day and by something else, by geography, for instance, if you're querying by geography.
And of course, if you have a high cardinality dimension or value, partitioning on that could be a problem, because you're going to put a tax on the metastore. You don't want to partition, for example, by user; that would be super expensive. On the other hand, there's another feature in the Hadoop/Hive layout called bucketing that allows you to distribute the data across the files in a partition in a deterministic manner according to a set of values. That can also be useful for filtering data upfront. For example, let's say you partition your data by date and then you bucket it by user into 1,000 buckets. When you run a query, if you're searching for a specific user, Presto can take advantage of that knowledge and narrow down the query to only the files that are candidates for containing that user. So that can reduce the amount of data transfer, the IO, and so on significantly.
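A sketch of combining date partitioning with bucketing by user, using the Hive connector's `bucketed_by` and `bucket_count` table properties (all names hypothetical):

```sql
CREATE TABLE hive.analytics.clicks (
    user_id BIGINT,
    url     VARCHAR,
    ds      VARCHAR
)
WITH (
    format         = 'ORC',
    partitioned_by = ARRAY['ds'],
    bucketed_by    = ARRAY['user_id'],
    bucket_count   = 1000
);

-- Partition pruning on ds skips whole directories; bucket pruning on
-- user_id narrows the scan to the candidate bucket file per partition.
SELECT url
FROM hive.analytics.clicks
WHERE ds = '2020-06-01'
  AND user_id = 12345;
```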
And then, of course, if you're using columnar formats like Parquet or ORC, you can further improve the IO patterns and the network transfers, because the engine can avoid reading columns that you're not going to use. Those files also contain stats about the data inside. Parquet and ORC are organized into what they call row groups or stripes, which are groups of rows. They basically have a footer associated with each of those row groups that contains stats about the minimum and maximum value for a given column in that row group, and the Presto engine can use that to dynamically skip the sections that are uninteresting.
So the recommendation is to use Parquet or ORC. ORC is the most optimized format right now for Presto; we're working on making Parquet better and hopefully as efficient as ORC. If you want to compress your data, and you definitely want to compress your data, the two recommended algorithms are LZ4 and Zstandard. Those are the best in their class of algorithms. LZ4 is good for fast compression; it's very efficient, but it won't compress as compactly as Zstandard. Zstandard is an efficient algorithm for high compression, so it's better than gzip. If you can use Zstandard, that's the way to go. And then finally, one thing that's important: if you're running queries that have a lot of joins, you want to be collecting stats. If you're using the Hive metastore and writing data through Presto, Presto will collect them automatically.
If you're writing data through some other system like Spark, or ingesting it through a streaming system, you want to run the commands in Presto that will analyze the data and record the stats in the metadata, so that the cost-based optimizer can use them to optimize the join order and the join type selection when you run your queries and run them more efficiently.
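The commands he's referring to are ANALYZE and SHOW STATS; a minimal sketch against a hypothetical table (the `partitions` property follows the Hive connector's documented form, which may vary by version):

```sql
-- Collect stats for data written outside Presto (e.g. by Spark), so
-- the cost-based optimizer can pick join order and join distribution.
ANALYZE hive.analytics.events;

-- Or restrict collection to a single date partition:
ANALYZE hive.analytics.events
WITH (partitions = ARRAY[ARRAY['2020-06-01']]);

-- Inspect what the optimizer will see:
SHOW STATS FOR hive.analytics.events;
```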
[00:36:10] Unknown:
As you have worked to grow the project and worked with people both in the capacity as a consultant as well as helping to grow the community, what are some of the most interesting or unexpected or innovative ways that you've seen Presto being used?
[00:36:25] Unknown:
I mean, at this point, there's almost nothing that surprises me anymore, but over the years there were a number of surprising use cases. Of course, there are the ones I mentioned about Facebook. We touched on them a little bit ago, where Presto was being used for things that were not what we envisioned in the beginning. Initially, as I said, we were targeting the analytic workloads over the Hive/Hadoop warehouse, and over time Presto was used for a number of other use cases that had nothing to do with that. For example, the backend analytics engine for a user-facing product, or the analytics backend for the A/B testing system.
So those are things that we didn't foresee, we didn't plan for, and they were a bit surprising. There was actually an interesting one a few years ago. I don't know if they're still using it, but I imagine they are. There was a company, Akamai. They used to have a system for doing analytics over the live events across their entire fleet of machines, so they can monitor the machines and see what's going on and troubleshoot problems and all that. They had a proprietary system that did some kind of distributed SQL on a few machines.
And at a conference, I think it was 2015, I was talking to one of the engineers there, and they were telling me how they were replacing that entire system with Presto. I don't know how far they went with that, but that's what they were doing. And it's like, okay, this is something I never thought people would be doing: Presto being used for analyzing what's effectively the telemetry of systems, and being able to do live analysis and analytics over those systems. So it's a completely different application.
[00:38:20] Unknown:
And as far as your experiences of working on and with the project, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:38:30] Unknown:
Well, I think there are a couple of things there. One is that when we started the project, I had never built a database engine, and neither had Dain, David, or Eric. So there was a lot of learning by trial and error. Of course, we looked at the literature, a lot of literature over the last 30 years that talks about different aspects of building a database system. The problem is that those research papers tend to be very narrowly focused. So it's like trying to discern the shape of an elephant by touching the different parts of the elephant, right? It's the proverbial example. It's hard to get a mental picture of how everything fits together; there's no one place where you can go and say, oh, this is how you build a database system. So it was a lot of learning through trial and error. And looking at it in hindsight, after eight years of experience, there are so many things I would have done differently and so many things I would have approached differently in terms of architecture, design, and all that. Now, of course, we ended up in a good place.
Sometimes we did things in maybe not the most optimal way or the best way, but one of the important things is being able to iterate and change the different parts of the system to keep evolving. And we're still doing that. There are parts of the system that are still legacy from back in 2012, 2013 that we want to eventually change, and we're doing that slowly. The other thing, and this is a challenge, is related to what I mentioned before: how you deal with building a system that you're trying to optimize for your company while also making it open source and catering to what other people need. The tension there is that there's internal pressure to do something that will solve a problem for Facebook in the short term, but it might not be the best thing in the long term for the project, or doing it in the way that would solve the problem in the short term may mean taking a detour that will be harder to back out of in the future. Right? So there's this interesting balance between solving the problems that Facebook needed solved and still being able to build something general enough to satisfy what everyone needs in the community.
And, of course, sometimes there were changes that came from the community that were in conflict with things that we were building. So you need a way to talk about those things and figure out what the right order is for them to go in; especially when there are changes in the same areas, you need to make sure that they are sequenced right and so on. With distributed groups of people working on the same project, it can be challenging. So that's one of the main things that I think Presto has had as a challenge that maybe other internal projects at companies haven't had.
[00:41:32] Unknown:
Going back to the case where somebody is considering Presto for their data infrastructure, what are the cases where it's the wrong choice that they would be better suited either going with a vertically integrated data warehouse or some other SQL engine or just using a dedicated database system?
[00:41:49] Unknown:
Well, I think there are a couple of use cases where Presto is not the right fit, at least today. For example, if you're doing streaming, and you want to do live queries over streaming data, or even standing queries over streaming data, Presto won't be the right fit for that right now. It's not capable of doing that. Eventually, it's one area we might look into, but right now there are other, better systems for that. For example, Kafka, Flink, or what Confluent is doing around Kafka may be more appropriate.
Another area would be if you're doing machine learning. Presto has a couple of functions for machine learning, but they are nowhere near the state of the art, and there are probably much better systems for that. If you're using PyTorch or TensorFlow, or even Spark, and you're trying to access all the raw data directly, going through Presto is probably not the best approach for now. At least there's no good integration right now with those systems to be able to do that. So those are the main use cases where you probably wouldn't use Presto. One of the things that we occasionally get asked is, oh, I have a Postgres instance, I want to use Presto to query it. And it's like, okay, if you're just using one instance of Postgres, just go directly to Postgres. It doesn't buy you anything to go through Presto, aside from getting access to the SQL language features that Presto exposes. You're going to be reading all your data from Postgres, and you're not going to be able to run your queries as efficiently as if you went directly to Postgres.
So that's also not a good use case. If you want to front a single database with Presto, you're still better off going directly to that database. Now, of course, if you're integrating different databases, then that's a better fit for Presto.
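The integration case where Presto does earn its keep is a single query spanning catalogs; a minimal sketch, with hypothetical catalog, schema, and table names:

```sql
-- Join a dimension table that lives in Postgres against fact data in
-- the lake, in one SQL statement, without an ETL step first.
SELECT c.region, count(*) AS order_count
FROM hive.analytics.orders AS o
JOIN postgresql.public.customers AS c
  ON o.customer_id = c.id
GROUP BY c.region;
```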
[00:43:43] Unknown:
One of the other questions that I have is about the capabilities of Presto in terms of transactionality, particularly across objects in a cloud data store.
[00:43:53] Unknown:
So Presto has some basic support for transactions. The challenge there is that it's really dependent on the capabilities of the connector and the underlying data source. For example, one limitation is that if you're running a query across multiple connectors, you won't get transactionality, because there's no way to distribute transactions across them. Let's say you're reading data from Postgres and writing it to S3 using the Glue metastore; there's no way to have a transaction across those systems. Now, within a single connector, the story is a bit different. For example, with the recent improvements to the Hive connector, we support Hive transactional tables, which allow you to modify and insert into those tables with some transactional guarantees. For example, if you are in the middle of a write, other people are not going to be able to read that data; until the transaction is committed, they won't be able to read it. And if the query fails in the middle, you're not going to have partially written data in that table. That can be important for many use cases. There are still some limitations; for example, in the Hive model, you still can't have transactions across tables.
But again, it's limited to what the connector and the underlying system do. Presto can't do more than what those systems support, unfortunately.
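A sketch of the single-connector transactional case he describes, assuming a Hive 3 metastore and a connector version with ACID support (the `transactional` table property and its exact name may vary between Presto versions):

```sql
-- Hive ACID table: readers never observe a half-finished write.
CREATE TABLE hive.analytics.user_state (
    user_id BIGINT,
    state   VARCHAR
)
WITH (format = 'ORC', transactional = true);

-- Rows become visible to other queries only once this commits; if it
-- fails midway, no partially written data is left behind.
INSERT INTO hive.analytics.user_state VALUES (1, 'active');
```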
[00:45:17] Unknown:
And as you continue to work on Presto and help support it and its community, what are the things that you have in store for it and that you're most looking forward to releasing both technically and in terms of the overall ecosystem?
[00:45:31] Unknown:
I've spent eight years already on the project, and when we created it, our aspiration was for this to be a long-term, multi-decade project. So we want the community and the project to be healthy over the long term. We've seen a lot more people get involved and a lot more companies get involved over the past few years, and we want that to continue. If you think of Postgres as a successful open source project, we want Presto to be like that. We want people to think, oh, if we're talking about SQL analytics, oh, that's Presto.
And that's where we're trying to go with the community and the project. In terms of technical things, in the short term, there are a couple of features that we and the rest of the community are working on. One of them is what we call complex operation pushdown into connectors. This is about extending the capabilities of the connector interface to be able to push down more than just simple filters into connectors. For example, if you have a query that involves aggregations and joins and an order by, we want to be able, when the connector is capable of doing that, to push those operations into the connector and, of course, into the remote system. For example, if you're querying data from Oracle and you're doing, say, a join between two Oracle tables, it's pointless to bring all the data into Presto, join the tables, and then filter the results and return a small set. Right? You're probably better off being able to formulate that as a join query inside Oracle, and then let Oracle do all the work with its indexes and optimization capabilities. So that's one of the things we're working on: being able to delegate more of these complex operations to the storage systems that can support them.
Another feature that we're working on is actually a revamp of the way functions work in Presto. Right now, functions are pluggable in the system as well, but they have to be defined upfront when the server starts. You can deploy a plugin, the server starts, and the functions that are exposed by the plugin are available. But as it is, it's static. What we're changing is the model, in a way that allows functions to be resolved dynamically. Just like tables and views and other entities in a SQL catalog are resolved, we're extending the function system to be able to do that. This is super powerful. It allows connectors, at runtime when a query is run, to resolve the function and decide at that point whether the function is supported or not and how it's implemented.
This can be useful, for example, if you have integration with other systems that provide functions and where users can go and define functions. There's a system that LinkedIn uses where they bring in functions that are portable across Hive and Spark and Presto, and they are defined externally. You want to be able to evolve that catalog of functions externally and make it available to Presto without having to restart the Presto server. So that's powerful from that point of view. And then, of course, it opens up the door to allowing users to define their own functions using the SQL function language, or potentially any other language binding, like JavaScript or Python or something like that. So this is the baseline to be able to do that. That's going on right now, and hopefully it will land in the next few months.
Then in the longer term, there is, of course, a desire to make Presto an even better fit for cloud environments. For example, we want to take advantage of spot instances across different cloud environments, which can reduce cost. There are some challenges in terms of architecture, and we don't have any concrete plans right now, but these are some things that we've been discussing over the past couple of years. What can we do to make the engine operate better in those environments where machines may disappear, but they're cheaper?
Right? And potentially, in the future, even other use cases. I don't know, can we do something with machine learning? Can we do something with streaming? Can we better support that? Can we support more granular fault tolerance, which would enable certain use cases like very long-running queries or even streaming use cases and so on? But that's kind of pie in the sky, longer-term ideas and vision.
[00:50:06] Unknown:
And also, for anybody else who is looking to learn more about Presto, I know that you recently published a book with O'Reilly along with some of your co-conspirators on the project, so I'll add a link to that in the show notes as well. And for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. As a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:33] Unknown:
Yeah, I think there are a couple of aspects to this. If you look at the data management ecosystem today, all the pieces are there. You have excellent query engines, you have streaming systems, you have catalogs, you have visualization and BI tools. But one of the things that I think is missing is something that brings everything together into a coherent platform. You can cobble these things together, you can kind of connect them, but it's a lot of work. It's cumbersome, and for someone who doesn't really understand that space, it can be a lot of effort. So I think providing that fully integrated experience is something that would be super powerful if anyone were to do it. And then within that, I think the weakest part right now is how the data life cycle is managed. For example, how do you go all the way from ingestion to how you store and optimize the data to how you expire it, especially when you're considering workloads? How do you organize, and potentially reorganize, and optimize the data to satisfy those workloads? As far as I know, there's nothing that is capable of doing that right now, at least in the open ecosystem; of course, proprietary systems may do that, but I don't know of anything on the open source side. And then there's one big thing, and this is something that we had at Facebook: as your data warehouse grows, if you have a lot of users, and we have thousands of users, you end up with hundreds of thousands of datasets.
It's hard to know what the relationships between those datasets are. What's derived from what? What's duplicated? There's a lot of redundancy. So being able to make more sense of that can improve efficiency and reduce cost, and also reduce the cognitive barriers for people when they're faced with all those datasets and have no idea where to start
[00:52:40] Unknown:
when they need to look at some. Well, thank you very much for taking the time today to join me and discuss the work that you've done with Presto. It's definitely a very interesting project, one that has been very transformative for a large portion of the data ecosystem, and it's very important within the overall space. It's something that I'm looking to take advantage of in the near future as well. So I appreciate all the time and effort you've put in on that, and I hope you enjoy the rest of your day. Thank you. Thanks for having me. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Episode Overview
Interview with Martin Traverso: Presto SQL
The Origins and Architecture of Presto
Open Source Philosophy and Community Growth
Use Cases and Fit for Presto
Recent Developments and Community Expansion
Deployment and Scaling Considerations
Best Practices for Data Management with Presto
Innovative Use Cases and Lessons Learned
When Presto is Not the Right Choice
Future Directions and Technical Roadmap
Closing Remarks and Contact Information