Summary
Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Kamil Bajda-Pawlikowski co-founded Starburst Data to provide support and tooling for Presto, as well as contributing advanced features back to the project. In this episode he describes how Presto is architected, how you can use it for your analytics, and the work that he is doing at Starburst Data.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- Your host is Tobias Macey and today I’m interviewing Kamil Bajda-Pawlikowski about Presto and his experiences with supporting it at Starburst Data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Presto is?
- What are some of the common use cases and deployment patterns for Presto?
- How does Presto compare to Drill or Impala?
- What is it about Presto that led you to building a business around it?
- What are some of the most challenging aspects of running and scaling Presto?
- For someone who is using the Presto SQL interface, what are some of the considerations that they should keep in mind to avoid writing poorly performing queries?
- How does Presto represent data for translating between its SQL dialect and the API of the data stores that it interfaces with?
- What are some cases in which Presto is not the right solution?
- What types of support have you found to be the most commonly requested?
- What are some of the types of tooling or improvements that you have made to Presto in your distribution?
- What are some of the notable changes that your team has contributed upstream to Presto?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Starburst Data
- Presto
- Hadapt
- Hadoop
- Hive
- Teradata
- PrestoCare
- Cost Based Optimizer
- ANSI SQL
- Spill To Disk
- Tempto
- Benchto
- Geospatial Functions
- Cassandra
- Accumulo
- Kafka
- Redis
- PostgreSQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / [CC BY-SA](http://creativecommons.org/licenses/by-sa/3.0/)
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host is Tobias Macey, and today I'm interviewing Kamil Bajda-Pawlikowski about Presto and his experiences with supporting it at Starburst Data. So, Kamil, could you start by introducing yourself?
[00:01:01] Unknown:
Sure. Hi, Tobias. Pleasure to be here. So I'm a cofounder and CEO of Starburst Data, and Starburst Data is all about Presto, which I will obviously introduce later. It's generally in the SQL-on-Hadoop and big data space, and I've been involved in this area for over 10 years, previously as chief architect at the Teradata Center for Hadoop, and also as a cofounder and chief architect of a company called Hadapt, which was a SQL-on-Hadoop
[00:01:32] Unknown:
company. And do you remember how you first got started in the area of data management?
[00:01:36] Unknown:
Yeah, sure. So obviously I was interested in databases and SQL pretty early on in my career as a software engineer, but back then I just leveraged databases as a way to store information needed by my applications. What really got me seriously thinking about database systems was when I joined Yale University as a grad student and started my PhD program. That was 10 years ago, and I was very impressed with the great advancements in that field back then. Things like leveraging clusters of machines, columnar techniques, MapReduce, Hadoop — all of that was really picking up steam at that point. That was when I got hooked into this area very seriously, not just using the systems, but also building them and designing them. I thought that was a great challenge and opportunity, and I really wanted to focus on that in my research.
So, actually, some of the results came from the first research paper that I wrote with my team at Yale, in the first year of my PhD program. We developed a system called HadoopDB, which was a novel architecture that combined the benefits of Hadoop and databases in one system. We actually open sourced the system back then, and got really good feedback, lots of citations, conference talks. And then I followed further into this area, into the topic of split execution engine optimizations, and achieved great performance results. And that work that I'd done during my research,
I used as a foundation to start a new company called Hadapt. It was started together with some folks from Yale: Justin Borgman, who was back then a Yale MBA student, and one of my advisors, Daniel Abadi.
[00:03:50] Unknown:
And so you briefly touched on the high level of what Presto is, but can you give a bit more detailed explanation of how Presto is used?
[00:04:00] Unknown:
Yeah, sure. So Presto is an open source distributed SQL engine. It delivers fast analytical queries over various data sources, which can range from gigabytes to petabytes. The project itself was started at Facebook in 2012, and they made it open source a year later. Their original goal was to offer much better performance than Hive, which was at that point pretty much the standard for running SQL on big data. They wanted something that's truly interactive and highly concurrent, and Hive wasn't really offering that.
And the other thing that they really wanted, something that they saw at Facebook internally, was that it's not only Hadoop. Right? There are multiple different data sources. So they really wanted a SQL-on-anything engine and a separation of compute and storage, and that's something that makes Presto uniquely positioned to deploy basically anywhere, whether that's on premises, in virtual environments, or in the cloud. And I think that's really a pretty powerful
[00:05:11] Unknown:
architecture. And some of the other tools that exist in that similar space of SQL-on-anything are things like the Drill project and Impala. So I don't know if you can do a quick compare and contrast between the benefits that Presto provides versus the use cases that those other tools address. Mhmm. Yeah. Sure. So,
[00:05:31] Unknown:
actually, I think all of those tools started around the same time, 2012, 2013, and all of them were a response to the needs of the moment. On a high level the architecture seems kind of similar. Right? You have a bunch of machines in a cluster, a bunch of worker nodes that together compute the query, and they often offer flexibility to reach out to other data sources. Drill especially, I think, has a bunch of connectors to other platforms.
So what distinguishes Presto from them, I guess, is that it was actually built towards real needs, high concurrency and scale, at Facebook, by someone who's not a vendor like MapR, who built Drill in the first place, or Cloudera, who built Impala. Right? It was built by someone with really strict production requirements and real use cases. And I think that focus made Presto slightly different. It gained lots more adoption from other big companies that are dealing with scale and concurrency. And it wasn't really tied to any specific product on the market, wasn't pushed by any vendor.
To some extent it enjoyed the benefits of being developed towards, and being more ready for, real production, high scale use cases, while not being pushed as a product to be sold to companies. And I think that helped a lot. So what you see as an effect of that is that, unlike with Presto, you don't really see lots of spontaneous users sharing their great experiences with Drill or Impala at conferences or on their blogs, right? But you see a lot of that for Presto. And to some extent, I think that speaks to Presto really being a great thing that delivers what it promised.
And it's actually used extensively in the use cases that those other products are not really seeing.
[00:07:46] Unknown:
And so as you mentioned, you built the Starburst Data company as a means of supporting Presto and providing a more enterprise grade distribution of it. So what is it about Presto that was interesting enough and compelling enough to build a business around it?
[00:08:06] Unknown:
Mhmm. Yeah, that's a great question. So during the last couple of years of my involvement in the project, back during my time at Teradata, Presto experienced unprecedented growth in popularity and user adoption at enterprises of all shapes and sizes, from fast growing Internet companies to the Fortune 500. Besides Facebook, some early adopters included Airbnb, Dropbox, Groupon, and Netflix, among others. And then the acceleration of the road map, which we also contributed to, and successful proofs of concept really led to production deployments at companies like Bloomberg, Comcast, FINRA, LinkedIn, Lyft, as well as Slack, Twitter, Uber, and Yahoo Japan.
So we served some of those customers. My team at Teradata was really both contributing and supporting those users. And given the number of adopters across many industries, including telco, health care, retail, and financial, we were successful capturing some of that market while being at Teradata. However, we felt there was an even bigger opportunity to serve the broader market, beyond just the top 500 companies that Teradata focuses on. So, basically, late last year, part of my team felt that we were ready to do it on our own, and that's basically how Starburst was born.
[00:09:58] Unknown:
And what are some of the most challenging aspects of running and scaling Presto or some of the most common problems that your clients are faced with that you're helping them with?
[00:10:09] Unknown:
Presto, as I explained earlier, typically runs on tens or hundreds of machines, with lots of concurrency, potentially pretty complex queries, and distributed data sources. So all of this, to us engineers, really speaks to a technical challenge. Right? Facebook and other big Presto users, especially in Silicon Valley, have teams of engineers that basically take care of the stability, performance, and scalability of their operations. And as your workload changes over time, you obviously need to monitor your system, manage resources, plan ahead, and plan for unexpected situations. Right?
So there is obviously complexity in all of this. And one way Starburst helps enterprise customers is by offering our subscription. Right? So in addition to our free distribution, you can subscribe to our services that include 24/7 support, troubleshooting, and tuning, and that really helps a bunch of companies. In addition, recently, we also introduced a managed services offering called PrestoCare. With that, we can actually fully take care of your Presto environment and you don't have to worry about it. Right? And given our experience of over 3 years in this project, we really are best positioned to get customers the most out of Presto.
[00:11:49] Unknown:
And when people are using the SQL interface, given the fact that there are potentially so many different data sources with different ways of structuring the data and different APIs, are there particular edge cases or pitfalls that people should keep in mind as they're working on building those queries?
[00:12:08] Unknown:
That obviously depends on the data source. Right? So until very recently, there was no cost based optimizer in Presto, and essentially any complex queries that joined a bunch of tables were a challenge. Right? One would need to be careful and avoid joining two large tables at the beginning, because that would essentially impact all subsequent operations in that query and make the overall experience not so great. In the recent release of our distribution, we actually included some of our early work on the cost based optimizer, which basically takes advantage of statistics about the data that you may have collected.
So for example, if you know the size of your tables and some information about your columns, such as the number of distinct values and min/max values for each column in a table or partition. Given this information, the cost based optimizer is pretty good at figuring out the join order and the type of the join distribution automatically. And there's actually a bunch of information about this topic on our website. If you want to read more about this, and see the benchmarks and some technical explanation, I encourage you to take a look at our blog posts there. But, really, that obviously helps for some of those cases.
I think we still have further work to do to make sure that statistics can be exposed for all the connectors that are available in Presto, not just the most popular HDFS and S3 connector. So in the meantime, if you are trying to join some less popular data sources, you need to be careful how you structure your queries, essentially. And some things to help you with that are basically the classic EXPLAIN and EXPLAIN ANALYZE commands that you can run, and Presto will respond with the query plan it's planning to run for a given query. And in the case of EXPLAIN ANALYZE, it will actually run the query and collect a bunch of metrics, and you'll be able to realize where most of the time goes, really. Right? So, for example, pretty often you realize that you forgot to include a predicate on the partitioning key, and you're suddenly running over a petabyte of data. Right? And there's no way that query can be fast, right, unless you have lots and lots of machines computing that query and nothing else is using the system.
Or maybe you forgot one of the join keys, and, for example, you have an unnecessary explosion of intermediate data during your query. So what I'm trying to say is, depending on the data source and your situation, I think there are tools to help you. There's also a pretty good Presto UI that helps you get an idea of what's going on in the Presto cluster and how your query is performing: what CPU time is used at every stage, how many bytes are moving between the nodes, and so on. So all this tooling will help you in cases when the optimizer is not yet fully covering your case. But for the most popular data sources such as HDFS and S3, we can actually do a lot to help you with that.
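As a rough sketch of the workflow being described here — all catalog, table, and column names below are hypothetical, not from the episode — you can inspect the statistics the cost based optimizer relies on, and then ask Presto for the plan and runtime metrics of a query, checking that a predicate on the partitioning key actually limits the scan:

```sql
-- Statistics the cost based optimizer uses (row counts, distinct values, min/max):
SHOW STATS FOR hive.web.page_views;

-- Plan only, without executing the query:
EXPLAIN
SELECT COUNT(*) FROM hive.web.page_views;

-- Runs the query and reports per-stage CPU time, rows, and bytes;
-- the predicate on the (assumed) partitioning key prunes the scan:
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM hive.web.page_views
WHERE event_date = DATE '2018-06-01';
```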
[00:15:40] Unknown:
And so the cost based optimizer is definitely one very significant additional feature for Presto.
[00:15:48] Unknown:
I'm curious, what are some of the other tooling or enhancements that you have been working on at Starburst for Presto, and which of those get contributed back upstream to the mainline project? Let me actually speak about the contributions made over the last 3 years, because they span the time when our team was back at Teradata and now at Starburst; there's a continuum here, really. So in the beginning, we actually worked a lot on the ease of installation and managing the cluster, and we built a tool called presto-admin, and we contributed that back to the Presto project. That basically allows you to rather easily install, configure, and restart Presto as needed. Before that, there wasn't really anything to help you with that; you would have to essentially develop your own scripts or press Presto into your overall cluster deployment mechanism. And after overcoming that barrier, we focused on some of the enterprise requirements such as integrations with security mechanisms. So, for example, Presto now fully supports Kerberos as well as LDAP, and that's obviously a must in any large enterprise organization.
And increasingly also in those Internet companies, believe it or not. And then we switched gears and started focusing more on the core of Presto. So we implemented a bunch of enhancements to ANSI SQL language compatibility, such as correlated subqueries, the decimal data type, and many more little SQL features that basically now allow us to claim really good coverage of ANSI SQL. And then we also contributed to building new connectors, such as SQL Server, for example, and we also enhanced existing ones, Cassandra for instance, based on requirements heard from our user base and community. And then a bunch of work on performance. So we mentioned the cost based optimizer.
But we also worked on improving various elements of the execution engine so that your joins or aggregations will basically run as fast as possible. And that often means just moving fewer bytes, right, across the network or within the memory of the Presto workers. And then one really important feature that we implemented is spill to disk. Spill to disk essentially allows queries to no longer be limited to executing in memory, which was for a long time a serious limitation in Presto; one would need to basically have enough machines in a cluster to fit all the queries in memory while they are executing. Yeah. And in addition, obviously, behind the scenes, we build various tools to help us develop Presto, and we have a testing framework called Tempto, as well as a benchmarking tool called Benchto.
And both of those are also contributed back as satellite projects, to help anyone who's serious about not only using Presto, but also contributing back. And I would also stress that actually all of the contributions that we've made, while at Teradata and now at Starburst, we're making open source and contributing back to core Presto. So the community, as well as Facebook, Netflix, and others, can also leverage that. And this is awesome, because in exchange, we also get contributions back from those users to Presto. So, for example, quite recently, Uber contributed geospatial functions, and now Facebook is making them even faster.
And we get that too. Plus, our functionality is now used at all those companies, and that basically contributes further to verifying that it performs well, it scales well, and it's stable.
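As a sketch of the spill to disk feature mentioned above: in Presto releases of that era it was controlled by a handful of configuration properties, which at the time lived under an "experimental" namespace. The paths and sizes below are illustrative placeholders only:

```properties
# config.properties (illustrative values; property names reflect the
# "experimental" namespace used in Presto releases of this period)
experimental.spill-enabled=true
experimental.spiller-spill-path=/var/presto/spill
experimental.max-spill-per-node=100GB
experimental.query-max-spill-per-node=50GB
```

With spill enabled, a join or aggregation whose intermediate state exceeds the memory limit writes pages to the local spill path instead of failing the query, trading some speed for the ability to complete.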
[00:19:51] Unknown:
So it's really good collaboration with the community here. And so it sounds like the differentiating factor for your distribution is just that it has all of your contributions before they get merged back up into mainline. Is that accurate? Yeah, I would say that is true.
[00:20:07] Unknown:
We obviously have some advanced work that we are maybe finishing this month, and it's already in our distro because we are confident in those features. We tested them, so we can include them in our distribution, and then we merge them back upstream. Sometimes it's a quick thing; sometimes it takes a few months, because it's a large, controversial feature. The optimizer is one of those: it's a massive change across the entire Presto codebase. So it needs more eyes from the community to look at it and really get approved, and go through all the code review and testing, making sure there are no regressions that we didn't think about, and so on. So that's one value of the distribution. The other is really that, because we release that distribution every 3 months or so, we have time allocated to make it really well tested and stable, because we are offering our support services for that distribution. Right? So we really want to make sure it's actually stable for our customers. And that's unlike the releases in the community, which actually happen every 2 weeks or so sometimes.
So you may be on the cutting edge, but sometimes you obviously experience some of the instability and bugs that just appear because of the fast paced development
[00:21:38] Unknown:
in the open source. And digging a bit deeper into the architecture, there are certain aspects of the way that Presto is designed that are core to the execution engine, and then there are pluggable components that allow you to add support for additional data sources or additional processing. So I don't know if you can just talk through that briefly.
[00:22:00] Unknown:
You know, I think as we mentioned up front, Presto was built as purely a compute engine, just running SQL. It's not tied to any specific storage engine, and the benefit of that is that you can query many different things. But it's also a trade off. Right? Because you are not tightly coupled with the storage, you may not, for example, support things that a classic database would do. A database would manage transactions, referential integrity, and lots of different things that Presto, which is really a read-only or append-only system, is not really trying to solve at all, and it won't be good for any transactional workload. It's purely an analytical engine that looks at your historical data. But because Presto was architected to separate compute from storage, and because Facebook in the early days already saw multiple different data sources, even at Facebook,
So they actually did this so that you can implement connectors for more than one data source, and Presto today already offers a bunch of those. So, for example, in addition to the file systems and object stores, you can query NoSQL engines, for example Cassandra or Accumulo, as well as databases such as Postgres, MySQL, and SQL Server, but also things like Kafka or Redis. Right? So that speaks to the expressiveness of that API. As long as you can make the mapping between the data that you have and the relational model, and expose this data as a set of tables with rows and columns, Presto will be able to serve as a SQL engine on top of that. And, actually, I'm aware of additional connectors being built by the community right now for things like Elasticsearch, Apache Kudu, or Apache Phoenix.
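As an illustration of what this connector model enables, a single Presto query can reference tables from several catalogs at once. The catalog, schema, table, and column names below are hypothetical placeholders:

```sql
-- Hypothetical catalogs: "hive" (HDFS/S3), "kafka", and "postgres".
-- Combine archival events on HDFS with the freshest events still in a
-- Kafka topic, then enrich with user attributes stored in Postgres.
SELECT u.country, COUNT(*) AS events
FROM (
    SELECT user_id FROM hive.archive.events    -- historical data
    UNION ALL
    SELECT user_id FROM kafka.live.events      -- most recent data
) e
JOIN postgres.public.users u ON u.id = e.user_id
GROUP BY u.country
ORDER BY events DESC;
```

Each catalog maps to one configured connector, so the same `catalog.schema.table` naming works regardless of where the underlying bytes live.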
And because the SPI allows you to do it yourself, there are companies building a SQL layer for their own proprietary data sources that may, for example, expose data via a REST API or a Thrift API or things like that. So it's a pretty powerful architecture. And then what helps is obviously that on the front end it speaks SQL, so you can plug in any BI tool or any SQL query tool, whether that's crafting your own SQL or using some custom application that leverages the JDBC driver or a Python
[00:24:44] Unknown:
driver and powers your own dashboard, for example. Right? So it's really powerful that way. And is the typical way that the Presto engine gets used as a translation layer between multiple sources for being able to feed into another tool, whether it's a presentation layer or a reporting tool? Or is it also commonly used for just doing exploratory analytics, using the SQL interface to do interactive queries? Yeah, I think we see both,
[00:25:17] Unknown:
actually. Right? So, just judging by volume and the number of people using it, I would say Presto as a SQL engine over S3, like in the Amazon cloud, or SQL for data stored in Hadoop. Right? Those are the most common cases, but we actually see growing need and popularity for running queries across multiple different data sources in one Presto installation. And some examples of that are when you have your archival, historical data in Hadoop, for example in HDFS, but you may also have the most recent online data in a Cassandra system or actually in, you know, Kafka topics.
And you want to run a query that will combine those two data sources and present you with what's the latest versus what was there in the past. Right? So you see those use cases growing as well. So I would say both are interesting. What distinguishes Presto from some of the other engines that can also do that, for example Spark, is that you can actually drive a lot of queries concurrently, and they will be really fast, like low latency. And that makes Presto really popular for both ad hoc analytics as well as reporting,
[00:26:49] Unknown:
and driving dashboards and things like that. And in terms of things like the security model and the schema introspection, are those the responsibility of the source data stores, where you may, for example, have to create a Postgres user that has specific permissions to given tables? Or is it something that can be enforced at the Presto layer, for ensuring that certain people have access to certain aspects of the schemas?
[00:27:19] Unknown:
Mhmm. Okay, good question. So it's really both. Right? There's an intersection of those. I would say that Presto itself will not be able to protect your data at rest. Right? That's the responsibility of your data source — say, in your case, Postgres. In the database, you need to have the right permissions for the certain users, for given tables, and that needs to be guaranteed. And then Presto, when accessing a table that happens to be stored in your Postgres instance, will basically leverage a configured user, read from that table, and bring the relevant rows and columns from Postgres into Presto for further analysis, maybe joining with other data sources or just aggregating, and returning to the user. And the connectors actually vary between themselves in how much of the security integration between Presto and the data source is available. So in the case of the Postgres connector, there's basically a service user that you configure for that connector, and Presto will basically be using that user to fetch the data off Postgres. So, essentially, you need to make sure that that specific user, which will be authenticated via JDBC, can access all the tables that you want exposed to Presto users. In the case of other connectors, for example the Hive HDFS connector, Presto can do much more: it impersonates the user that's accessing Presto. Right? So you may have multiple different users accessing Presto, and Presto will impersonate them when talking to your data source. So, for example, it will be impersonating HDFS users when reading the data off of HDFS. Right? And that basically ensures that the specific user that you authorize and authenticate, via Kerberos, when talking to Presto, is passed through all those layers, and the same user will be used to fetch data off HDFS.
And if that user has no right to see given information, it will issue an error, or you'll not even be able to see the table at all. Right? So I think that integration chain is available in the Presto connector API. You can leverage it, and various connectors implement it to various extents. And I think there's probably more that can be done to make it even more smooth and transparent.
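As a sketch of the service-user model described above, a Postgres catalog in Presto is configured with a properties file. The host, database, and credential values here are placeholders, not from the episode:

```properties
# etc/catalog/postgres.properties (illustrative values only)
connector.name=postgresql
connection-url=jdbc:postgresql://db.example.net:5432/analytics
# A single service user: it must already hold SELECT rights in Postgres
# on every table you want exposed to Presto users.
connection-user=presto_service
connection-password=changeme
```

Row-level protection therefore has to be enforced in Postgres itself for this connector, whereas connectors that support impersonation can carry the end user's identity all the way to the storage layer.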
[00:30:06] Unknown:
And in particular, for sources such as a collection of files in S3 or various other flat files, are there ways for Presto to be able to introspect those schemas? I'm assuming that it relies, for instance, on the embedded schemas in Avro or Parquet files, but is it possible to define a schema for things like JSON flat files as well? Right. So,
[00:30:31] Unknown:
Presto actually assumes that you have to define a schema. Say, in the case of an object store, that would be in the Hive metastore. So you have to define a schema saying, okay, the data in this S3 bucket or HDFS folder is following this schema. Right? And it's a simple operation, because you are not actually reading the data; you're just declaring the schema for the data. A simple DDL command, and you can access the data that's already there, just by creating, like, an external table, essentially. Right? And in that mapping, you will say, okay, I interpret the first column as an integer and the second column as a string, a date, whatever. Right?
And when reading, Presto will use that schema, will expect that kind of data, and will read your file, a JSON file for example. And, basically, it will map that information right off of your storage onto the schema that it expects. If it happens that you have a file that there's no way we can interpret according to the schema that you provided, then we'll say, well, there's probably corrupted data there. Right? But Presto itself will not try to discover the schema that you may have in your Parquet files; you need to provide it in a CREATE TABLE statement. But Presto will do the validation of that data versus the schema that you offered.
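A minimal sketch of the kind of DDL being described, using the Hive connector's table properties; the schema, columns, and S3 location are hypothetical. The statement only declares how to interpret files that are already in place, so no data is read or moved:

```sql
-- Illustrative only: declare a schema over existing JSON files in S3.
CREATE TABLE hive.web.requests (
    request_time TIMESTAMP,
    user_id      BIGINT,
    url          VARCHAR
)
WITH (
    format = 'JSON',
    external_location = 's3://example-bucket/logs/requests/'
);
```

Queries against `hive.web.requests` then validate each file against this declared schema at read time, rather than discovering the schema from the files themselves.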
[00:32:10] Unknown:
And in terms of scale, it's obviously able to handle very large sets of data and very large distributed node counts. But does it also work equally well in a small scale deployment, where you're only using it, for instance, on a single server with a dataset that's maybe only a few gigabytes in size?
[00:32:30] Unknown:
Yeah, I would say yes. That's obviously not a common deployment, but, well, I do have a Docker container with Presto even on my laptop for demo purposes, or if I want to check something very quickly, I can do it on my laptop. You can deploy it on one server, and it will work just fine. The good thing about Presto is that, because of its architecture, if your data grows, or the number of users, concurrent queries, or complexity of the queries grows over time, you basically can just add more nodes to the cluster. Right? And you can start with one, and then you have 10 and 20 and 100.
And, you know, I'm hearing that Facebook is running clusters that are approaching 1,000 nodes in a single Presto installation, even. So this thing will scale beyond whatever you'll probably ever need. Right? And I think that brings a sort of peace of mind to many customers, knowing that there are users running this on hundreds of machines, at really high scale and high concurrency.
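For reference, the single-server deployment described above can be sketched as one node acting as both coordinator and worker. A minimal configuration might look like this (the file path and port are assumptions, not taken from the conversation):

```properties
# etc/config.properties -- single-node Presto sketch:
# the coordinator also schedules query work on itself.
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
```

Growing the cluster later means flipping `node-scheduler.include-coordinator` off and pointing additional worker nodes at the same `discovery.uri`.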
[00:33:39] Unknown:
And are there limitations in terms of the ways in which you can scale, as far as the network architecture? So, for instance, does it support geographically distributed clusters, where you have subsets of the machines within multiple different data centers, and being able to aggregate and merge the data across those nodes for a single query?
[00:33:59] Unknown:
Mhmm. So that is not a common use case for Presto. In all the deployments I've seen, the machines are always within a single data center. And for really high performance cases, you actually want a pretty fast network between the Presto workers, and also between the Presto cluster and your data source. Right? Because if you happen to be limited by, say, a 1 gig network somewhere along the line, you know, then that may mean you cannot read data fast enough, or you cannot exchange data between Presto workers fast enough, and your queries will suffer. So Presto, because it doesn't keep the data, is really a compute engine, and during the compute, you expect interactive speeds.
You really want to deploy Presto in an environment where you have at least a 10 gig network, and, you know, ideally, the nodes will be close to each other, so they can exchange the data really fast, because most of the operations in a SQL engine are fully parallelizable, and it's actually very simple to parallelize them, and Presto leverages that a lot. And, you know, if I cannot exchange data between the nodes fast enough, that interactive experience will suffer. So it's not really a common case for Presto to be deployed across different data centers as a single installation. You may have multiple separate installations, and maybe there's a Presto installation that tries to leverage those local installations somehow. I can imagine an architecture like this; I don't think it's a common case right now.
[00:35:52] Unknown:
And what do you view as being the long term future of Presto? And are there any particular enhancements or new features that you or others in the community have planned?
[00:35:54] Unknown:
Performance is really a never ending topic. Right? So there are many, many things we would like to enhance in our optimizer, to be even smarter for more and more complex queries. There's a constant fight to increase stability and scalability and concurrency, so a ton of work goes into that. There are a number of security integrations that we would like to do as well, especially for enterprise customers. I think there are a number of things we can do even better. You know, if you just imagine, each data source typically comes with its own security model, and making sure Presto can encompass all of those and present a coherent view is definitely still an open question, how to do it really well. You know, there are definitely things that we can invest in in terms of integrating more advanced analytical functions into Presto. I mentioned that geospatial is becoming a really popular use case, but there are more things like this that we can invest in.
So, yeah, more connectors to more data sources. Right? Data sources that Presto cannot connect to yet; I think we can improve that. And then for us at Starburst, I think investing even further into making Presto really, really easy to use, and also easy to inspect, like, for a DBA, for example, to just get a holistic view of how my cluster is doing, how my users are doing, you know, who's running the heaviest queries, and how to help optimize the system towards the use cases that I see. I think this is something we would like to continue to work on as well.
[00:38:02] Unknown:
1 other thing that I was discussing with someone recently: I'm curious about the support for nested data within a given column, for instance, being able to query into an arbitrary JSON structure or nested schema, to be able to pull out some of that data?
[00:38:23] Unknown:
Well, Presto is actually handling that case pretty well. It obviously works with Parquet and ORC files and JSON, so you can actually have complex column types. And I think at this point we have pretty good support for arrays, for maps, and for rows, which are essentially structs, which can be arbitrarily nested. So all of those are very well supported in Presto. I do recall there are some optimizations the community is still working on. If you just want 1 little bit of information out of your whole nested data structure, you know, the naive way to do it would be to read this entire nested data structure, which might be pretty large and pretty complex, and then, after reading that off your storage, just extract in memory that little thing that the user needs. I think this obviously can be improved by pushing down to the scan level that nested predicate that someone included in a query, and pushing that to storage.
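To illustrate the nested-type support mentioned here, queries over arrays, maps, and rows in Presto might look like the following (the table and its columns are hypothetical):

```sql
-- Sketch against a hypothetical table with nested column types:
--   profile  ROW(name VARCHAR, age INTEGER)
--   tags     ARRAY(VARCHAR)
--   attrs    MAP(VARCHAR, VARCHAR)
SELECT
    profile.name,                             -- dot access into a ROW (struct)
    tags[1]                 AS first_tag,     -- arrays are 1-indexed in Presto
    element_at(attrs, 'country') AS country   -- map lookup, NULL if key absent
FROM hive.default.users
WHERE cardinality(tags) > 0;
```

Without scan-level pushdown, a query like this may still read the whole nested structure off storage before extracting the one field it needs, which is the optimization being discussed.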
And I think there's some community work to address that, for Parquet, I believe, at this point. So there are things we can optimize even further, but nested fields are definitely fully supported right now in Presto.
[00:39:51] Unknown:
And so for anybody who wants to follow the work that you're up to and the work you're doing at Starburst, I'll have you add your preferred contact information to the show notes. And as 1 final question, from your perspective, what do you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:40:13] Unknown:
Okay. I think it's an awesome question. Data sources and data management systems are obviously evolving, and just watching that space very closely over the last decade, you know, it has brought so much to the table. And some of the things that we thought would absolutely be the future 10 years ago are actually no longer the future; they're the past. Right? An example is fault tolerance, which was really important 10 years ago. I think, well, it found its use cases, but it's not taking over the world, actually. There are still systems, such as Presto, actually, that make a trade off: they trade fault tolerance for performance, and prefer performance over fault tolerance.
And that wasn't a common belief even 10 years ago. And in the future, I think what we'll see is further specialization of those data stores and data management systems. They will become even more sophisticated and tailored to specific use cases, rather than 1 that will fit all use cases. Even though everyone would prefer just 1 SQL engine, because they each have such unique capabilities and are good at different things, I don't think we'll see 1 to rule them all. So more fun for us.
[00:41:46] Unknown:
Well, thank you very much for taking the time today to talk to me about the work you're doing with Presto and Starburst Data. And thank you. Thanks so much. It was a pleasure to join here. Sure. And I hope you enjoy the rest of your day. Yeah, you too.
[00:42:05] Unknown:
Thanks a lot.
Introduction and Guest Overview
Kamil's Background and Journey in Data Management
Understanding Presto: An Overview
Building Starburst Data Around Presto
Challenges in Running and Scaling Presto
Optimizing SQL Queries in Presto
Contributions and Enhancements to Presto
Presto's Architecture and Flexibility
Use Cases and Applications of Presto
Security and Schema Management in Presto
Scaling Presto from Small to Large Deployments
Future Enhancements and Community Contributions
Support for Nested Data Structures
Biggest Gaps in Data Management Technology
Closing Remarks