Summary
The Presto project has become the de facto option for building scalable open source analytics in SQL for the data lake. In recent months the community has focused its efforts on making it the fastest possible option for running your analytics in the cloud. In this episode Dipti Borkar discusses the work that she and her team are doing at Ahana to simplify the work of running your own PrestoDB environment in the cloud. She explains how they are optimizing the runtime to reduce latency and increase query throughput, the ways that they are contributing back to the open source community, and the exciting improvements that are in the works to make Presto an even more powerful option for all of your analytics.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advanced notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription.
- Your host is Tobias Macey and today I'm interviewing Dipti Borkar, cofounder of Ahana, about Presto and Ahana's SaaS managed service for Presto
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Ahana is and the story behind it?
- There has been a lot of recent activity in the Presto community. Can you give an overview of the options that are available for someone wanting to use its SQL engine for querying their data?
- What is Ahana’s role in the community/ecosystem?
- (happy to skip this question if it’s too contentious) What are some of the notable differences that have emerged over the past couple of years between the Trino (formerly PrestoSQL) and PrestoDB projects?
- Another area that has been seeing a lot of activity is data lakes and projects to make them more manageable and feature complete (e.g. Hudi, Delta Lake, Iceberg, Nessie, LakeFS, etc.). How has that influenced your product focus and capabilities?
- How does this activity change the calculus for organizations who are deciding on a lake or warehouse for their data architecture?
- Can you describe how the Ahana Cloud platform is architected?
- What are the additional systems that you have built to manage deployment, scaling, and multi-tenancy?
- Beyond the storage and processing, what are the other notable tools and projects that have become part of the overall stack for supporting open analytics?
- What are some areas of ongoing activity that you are keeping an eye on as you build out the Ahana offerings?
- What are the most interesting, innovative, or unexpected ways that you have seen Ahana/Presto used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Ahana?
- When is Ahana the wrong choice?
- What do you have planned for the future of Ahana?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Ahana
- Alluxio
- Couchbase
- Kinetica
- Tensorflow
- PyTorch
- AWS Athena
- AWS Glue
- Hive Metastore
- Clickhouse
- Dremio
- Apache Drill
- Teradata
- Snowflake
- BigQuery
- RaptorX
- Aria Optimizations for Presto
- Apache Ranger
- Trino
- Starburst
- Hive
- Iceberg
- Hudi
- Delta Lake
- Superset
- Nessie
- LakeFS
- Amundsen
- DataHub
- OtterTune
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Have you ever woken up to a crisis because a number on a dashboard broke and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means?
Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more. Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial.
If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. Your host is Tobias Macey. And today, I'm interviewing Dipti Borkar, cofounder of Ahana, about Presto and the work she's doing at Ahana, which is a SaaS managed service for Presto. So, Dipti, can you start by introducing yourself? Hi, Tobias. Thanks so much for having me here today. I'm cofounder and chief product officer at Ahana.
[00:02:13] Unknown:
I'm also the chair of the Presto Foundation community team, and great to be here talking with you today about all things data.
[00:02:20] Unknown:
So for people who've been listening to the show for a while, they've heard you before when you were in your role at Alluxio. But for people who haven't listened to that yet, can you tell us a bit about how you ended up in the area of data management?
[00:02:31] Unknown:
Yeah. Absolutely. I forget if I used this joke the last time, but I was born into it is what I say. If you notice, my initials are DB, and so I've been trying to live up to my initials. Data is something that always fascinated me, actually, in engineering. The database class was my favorite class, and that's how I ended up in the database lab at UC San Diego working on a variety of semi-structured data projects. From there, the natural transition was to end up at a couple of these relational database companies and research labs. IBM was one of them, and I started at DB2, in the core storage and indexing kernel, building distributed databases for around 5 years. So I worked on the warehousing side, the semi-structured side. I have a patent on the indexing side for semi-structured data. And then over time, I transitioned myself from relational to nonrelational.
Hadoop and NoSQL were starting to become big. There were new use cases. The world of data was changing. And through some transitions, I ended up at Couchbase. Couchbase was a NoSQL database; I was an early product person there and led the teams and the product. And we built SQL for JSON. Right? Kind of an interesting area there. Couchbase recently went public, actually. And then I transitioned back into the analytics side a little bit, to a GPU database company called Kinetica, for a little bit. Interesting space, but a very, very niche use case. But also what was happening is the deconstruction of the database was starting to happen, with disaggregation of storage and compute.
And I got into Alluxio, and that's where I got introduced to a lot of these distributed SQL engines, Spark, Presto, Hive, and others. That's kind of my journey into databases.
[00:04:18] Unknown:
Presto is probably my sixth distributed database engine that I've worked on over the last 15 years. In terms of what you're building at Ahana, can you give a bit of an overview about what it is that you're working on and some of the story behind how you ended up focusing on Presto and what it is about this particular problem space that makes you want to focus your time and energy on it? Yeah. Absolutely. So, you know, as I was mentioning, at Kinetica, you have tightly integrated databases. You have data warehouses and so on. I started to see users
[00:04:49] Unknown:
trying to solve a different problem. There was a lot more data, different types of data. The transition to cloud was starting to happen, a serious transition. Mostly cloud was being used for web applications and other things, but for data, a lot of it was on prem. And I started to see this transition to cloud data lakes, S3 in particular. It was very hard to run SQL on S3, or to have computation on S3, for smaller data platform teams. And so I got into the Presto community. I was working with Facebook. Over time, what ended up happening is I joined the foundation. We were very early members of the foundation working with Facebook, Uber. The foundation was created in September 2019.
We joined a few months later, and so we've been part of that entire process. And that's how we got into, you know, the foundation and Presto. Now specifically, you asked about the problem. Right? So as I was mentioning, with all this data in S3, what do you do with it? There are different types of processing that you could perform. There are obviously the SQL workloads. There are the general purpose computational workloads, transformation ETL, pipelining. Spark did a great job at that, and Databricks was the company that created a good experience around Spark. There was machine learning, with TensorFlow and PyTorch and others, and there were companies around that area. But there really wasn't a good alternative for SQL on S3, a very simple way to run SQL on S3. And the alternatives were kind of Hadoop, Hive. Presto was a good alternative, but there really wasn't a good experience around it.
There's Athena, which is AWS, but it's serverless. It's like a Lambda function for SQL. Right? And that only goes so far. It's great for test and dev, but if you really wanna perform kind of classic data warehouse reporting and dashboarding workloads, you need a managed service. And so that's where I felt: why does it take 6 to 9 months to install these distributed systems, Presto for that matter, and get value from it? It is way too much work for a platform team of maybe 3, 4, 5 people to do this, and the value is seen a year later or never, because you give up, which is kind of what happened with Hadoop. I intentionally stayed out of the Hadoop space; Cloudera, Hortonworks, they could have created a great experience to turn big data into something very, very valuable. Right? And so I started thinking about this as: why is this so hard?
Why can I not do SQL on S3 in 30 minutes? And that's what I built with Ahana. So today, we onboard our customers. They bring their own data lake. They bring their own catalog. They might have Glue. They might have a Hive metastore. And in 30 minutes, they're querying their data in their environment without ingesting it anywhere else, and that's the experience. There's a long way to go, but that was the problem that I was solving: having Presto, which is a phenomenal engine for the data lake, in the hands of every data platform engineer, so that they can query their data really fast, set it up, spend less time on ops, spend less time on tuning, and really give value to their data analysts and their data scientists. So that's a little bit of background.
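To make that zero-to-SQL-on-S3 idea concrete, here is a minimal sketch using the open source presto-python-client against a Presto coordinator whose Hive catalog already points at a Glue or Hive metastore. The host, catalog, and table names are placeholders, not Ahana specifics.

```python
# Minimal sketch: running SQL on Parquet data in S3 through a Presto
# coordinator with the open source presto-python-client
# (pip install presto-python-client). All names below are placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # hypothetical endpoint
    port=8080,
    user="analyst",
    catalog="hive",     # catalog backed by Glue or a Hive metastore
    schema="default",
)
cur = conn.cursor()
cur.execute("""
    SELECT ds, count(*) AS events
    FROM web_events     -- hypothetical table over Parquet files on S3
    GROUP BY ds
    ORDER BY ds DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```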
[00:08:04] Unknown:
Definitely useful insights as well, because I've been working on building out the data platform for my work at MIT. And that's one of the things that I keep running up against: okay, do I go with the data lake approach? If so, how do I manage the queries on S3? Currently, I'm using Athena because it's easy to set up, and I don't have time to invest in actually doing the operations to get a more full fledged solution up and running. Or, you know, do I go down the road of a Snowflake or a Firebolt? Or, if I wanna be able to fully own the entire platform and I want a warehouse like experience, maybe I go with ClickHouse and use some of its capabilities for being able to ingest data from S3, but then you run into the issues of cost because of the data living on EBS versus in S3. So it's definitely interesting that this continues to be a problem despite the number of tools and systems that are available for doing advanced analytics.
[00:08:57] Unknown:
That's right. That's right. That's every day that I live. You just summarized it.
[00:09:03] Unknown:
And so in terms of Presto in particular, it's definitely become sort of the de facto engine for being able to run analytics, particularly on S3, but also across multiple different data back ends. You know, some of the notable competition are things like Dremio or, as we already mentioned, some of the cloud data warehouses that will let you take your data from S3 into their system, but then it doesn't actually reside in S3 anymore. So you have this issue of multiple copies of data for different use cases. I'm wondering if you can just give a bit of an overview about some of the recent activity that has been happening in the Presto community over the past couple of years. Yeah. Absolutely. There are a lot of options. And so I classify them into,
[00:09:47] Unknown:
you know, data warehouse, which is Snowflake, Redshift, you could put BigQuery in that category. And there are on prem versions of it, which we won't talk about; the world is moving to the cloud. We are 100% focused on the cloud, no on prem, and so that's the experience that we're working on. And then you have the data lake. And the data lake is emerging as a very serious alternative now, which did not exist before. There are a few different options. So there's Presto; Ahana is kind of looking at this as a Presto company. That's the foundation of it, and we are building on top of it. There's Dremio, which came out of Drill, and they're a little bit more closed source and enterprise focused, also very much a little bit on prem. Most of their users are on prem. There are the AWS options, like Athena, as you mentioned. That serverless model is a completely different cost model: you're charged on a per query basis, where it can get expensive quite fast depending on what your workload is. So these are the alternatives.
And then there are dimensions that users think about. From a user perspective, what I'm hearing as I talk to many different platform teams is cost. Cost, with cloud data warehouses, is starting to become really expensive. So that's one element, and we'll talk about a few of these. The second one is openness. Users increasingly do not want to get locked in into yet another Teradata. And with Snowflake, it's yet another Teradata in some ways. It's new. It's great. It's a good experience, but it's expensive and it's proprietary formats. We're seeing more and more users want to stick with open formats so that they can choose when to switch a query engine; you leave it to the user and not the vendor. The third is flexibility of types of data processing.
So when you have open formats, then you can actually run many different things on top of those formats. So you could run a SQL workload. You could run a general purpose computational workload. You could run TensorFlow; they added support for that, from what I hear. And so those are some of the key aspects, I would say: structured and unstructured data, support for different types of data, and increasingly large amounts of data. So those are the 5, probably, key dimensions that users think about. And with the open data lake approach, and I call it open data lakes because the open is really important, you get open formats like Parquet, a very, very popular, high performing columnar format. Presto is highly optimized for it, and we are increasingly adding optimizations. We're adding Aria optimizations on top of what Facebook built. Then there's open source, which is a big part of it. I've been in open source for 10 plus years, an open source purist, and believe that we have to give back. Vendors need to give back, be a part of the community, and that helps users.
Having a community driven project is very important because you get the best of what Facebook is building, Uber is building, Ahana is building, and this benefits the users at the end of the day. And as a vendor, we can't just go in and change a license on someone and say, hey, sorry, from next quarter we are going to change our license; we're not Apache 2.0 anymore. Well, guess what? You can't do that with Linux Foundation or Apache projects. But if it's a company driven project, you can. And this matters to users because they do not want to be at the vendor's beck and call, and they don't want to get locked in. And then the other two are the amount of data, the scale of data.
S3 solved that problem. S3 commoditized storage for us. It's 15 years in the making. Trillions of objects. And it's cheap. It's cheap and it's ubiquitous. So why not use something that's so foundational and build on top of it, versus ingesting it in yet another system and yet another location? And then finally, different types of data. When it's open, you can have JSON. You can have structured data. You can obviously have unstructured; Presto doesn't fall in that realm, but you can do other things on that same platform, the same storage. You can have different types of processing. So these are the dimensions that users think about. In terms of these comparisons, they look at what's possible with data warehouses. They look at what's possible with the data lake today, and they decide: okay, do I want to skip the warehouse or do I want to augment it? Do I want to have both? And both are valid options. I see both of these as I talk with users and customers, because you will always have 10, 15% of workloads that might run on the data warehouse. And you're okay with the cost associated with it because you want really, really low latency.
And it's physics. If it's a tightly coupled system, it is going to perform better than a deconstructed database. And if good enough performance at very low cost for large amounts of analysis, historical analysis, interactive, ad hoc, is what you're looking for, then the open data lake becomes a really good option. That wasn't the case 2, 3 years ago, but now there is a valid option for users. And so that's kind of their thought process. And I'm happy to talk more about Presto, but let me pause there and see if that made sense, kind of walking you through how a user thinks about this. No. That makes perfect sense. And it's definitely
[00:15:00] Unknown:
a big challenge for people who are building out a new system, and they're saying, okay, I have data. I need to be able to do something with it. Where am I going to put it so that I can actually make use of it, versus just running a whole bunch of Python scripts or Java workloads and having to deal with orchestrating all of that across multiple places? I just wanna be able to say: here's the data, hand it over to my analysts, do what you do best. You know? I'm gonna go and build out some infrastructure. So it's definitely great to see the emergence of a lot of good options for the data lake as a way of being able to process data at low cost and scalably, and the fact that there are these different data warehouse options that are coming around. You know, it is a little unfortunate to see that there isn't a viable open source option for a cloud data warehouse architecture, but there are a number of open source data warehouse options that will allow you to augment the data lake so that you can have that trade off of: I need high performance here and high scalability over here. Right. Right. I mean, in some ways, you can think of the data lake as the open source data warehouse.
[00:16:06] Unknown:
Right? In fact, Facebook calls it their data warehouse. Presto is their data warehouse. Right? And it's open source. It's the open source data warehouse. Performance is very, very important, and, actually, the second part of the question that you asked is what's latest and greatest in Presto given this context. Performance is very, very important on that list. There are a few different areas that we are working on by ourselves as Ahana. We're also working jointly with the foundation and members. One that dropped recently, RaptorX, for example, is caching on the data lake at every single level. So it includes IO caching, fragment result set caching, metadata caching, header and footer caching, and Facebook is seeing, you know, 10x performance improvement from that for Presto. This is only available in Presto. Right? This is the new Presto. The new Presto has RaptorX. It has Aria optimizations, which are essentially table scan optimizations. How do you essentially optimize table scans? Right? Repartitioning, how do you optimize that, or the unnest?
Depending on the structure. Right? There are so many variants of the different types of data, the different types of queries that can run. And so these are some of the ones that already exist. We're extending that. Facebook uses the ORC format; that's the one that they optimize for. So we're taking those principles, and we're optimizing for Parquet, because that's a pretty popular format. And we're building Aria for Parquet for Presto. That's work in progress that we will open source to benefit the community. And so there are many other projects like this in the works that are focused around performance. Even within Ahana, that's one of the dimensions.
Performance is a very key dimension that I focus on, where, you know, Presto, Spark, all these systems that came out of Hadoop literally have thousands of configuration parameters. Half of them are in the docs. For some of them, you have to actually go to the Java code and see, okay, what does it actually do? And so why do you need a PhD in configuration management for Presto to get it up and running, and months to tune the system to get something out of it? So with Ahana, we have more than 200 configuration parameters that come pre-tuned with every cluster, or with the data source that you attach, so that you're not sitting there figuring this thing out.
And it comes with caching built in. So you have one-click caching where it's enabled. It'll figure out what your instance type is. It'll say, okay, three times the amount of SSD volume, and automatically attach those volumes. And so you don't have to sit and manage all of these different systems. And you can, like you said, focus on creating value for the data analysts, and maybe on governance, cleansing; you can focus on the other aspects of data. So performance is a dimension that we'll continue to work on, both from an open source Presto perspective and from a managed service perspective.
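For a sense of what "pre-tuned" hides, here is a sketch of just the baseline coordinator settings from the standard open source Presto deployment docs; the values are illustrative, and a real cluster layers hundreds more connector, memory, and caching knobs on top of these.

```python
# A taste of the configuration surface: the baseline coordinator settings
# from the standard Presto deployment docs, written out as
# etc/config.properties. Values are illustrative; a real deployment tunes
# many more knobs per instance type, which is the toil a managed service absorbs.
config = {
    "coordinator": "true",
    "node-scheduler.include-coordinator": "false",
    "http-server.http.port": "8080",
    "discovery-server.enabled": "true",
    "discovery.uri": "http://presto-coordinator.example.com:8080",
    # Memory limits usually hand-tuned to the instance type:
    "query.max-memory": "50GB",
    "query.max-memory-per-node": "8GB",
}
with open("etc/config.properties", "w") as f:
    for key, value in config.items():
        f.write(f"{key}={value}\n")
```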
And I would say the other important one is security. With open source, if you're using open source, that's great; you tend to build it on your own. But if you need a vendor or some support, and you need enterprise class support, you want it to be figured out for you. One of my cofounders wrote a plugin for Presto for Apache Ranger. We've open sourced it. There wasn't an open source option up to now. Trino, for example, or Starburst, has a closed source version of it. But we do wanna open source it. We do wanna give back. We don't want to hold back on areas that actually help with community adoption. Ranger is an open source project. Presto is an open source project, and so why should that integration be closed source? There are other ways that I can create value and monetize from a business perspective, but adoption and community are very important for us. So just some thoughts on what's happening in the Presto space and the areas that we're thinking about. And to end this section, we actually worked on a second half plan for Presto as a part of the foundation. There's a technical steering committee. Tim Meehan from Facebook is the chair of that committee, my counterpart on the technical side, and he will actually be sharing out the second half road map for Presto, where Facebook, Uber, Twitter are all collaborating.
Right? It's openly published what everyone's working on. So I'm very excited about that. Looking forward to that. Another thing worth touching on while we're discussing Presto and its community
[00:20:25] Unknown:
is the recent developments that have happened with Presto and the fork of it in the form of Trino. And I'm curious if you can just talk a little bit to some of the ways that the divergence of those code bases has manifested in terms of the particular areas of focus and capabilities that are developing, and some of the ways that you foresee
[00:20:44] Unknown:
that continuing to evolve over the next few years? Yeah. Absolutely. Happy to. This question is important because it helps clarify things for the community, both communities. Perhaps my opinion is biased, but I will try to give you an unbiased opinion. So Presto was created at Facebook. It was open sourced a while back, and Facebook donated it to the Linux Foundation in 2019. We talked a little bit about that. So they formed the foundation. Uber, Alibaba, Twitter were the other members. Around that time in 2019, there was also a hard fork of Presto. It was slightly confusingly named PrestoSQL. Recently, it was renamed to Trino. So the hard fork is now called Trino.
I think that helps with removing some of this confusion, because now there are two projects. There's Presto, which is part of the Presto Foundation, with the Linux Foundation as the hosting foundation. And then there's Trino, which is a separate project. It's not actually Apache or Linux Foundation; it's the Trino Software Foundation, and that's created by the cofounders of Starburst. So that's a separate project. Both are good projects, but they are diverging. The way I see it, Presto is increasingly focused on the data lake with some focus on federation, and that means connectors, adding plug ins, and so on. But the core reason Presto was built was as a replacement for Hive. And so it was built for the data lake, and that's kind of the core focus. All the performance elements we talked about, that's the primary path. The primary code path is the Hive connector, and we're constantly improving on the Hive connector, which can then access HDFS on prem, or S3, GCS, and others in the cloud. And, also, it connects with other catalogs, which are Hive metastore or AWS Glue. Presto still continues to have a lot of connectors.
The most popular ones tend to be MySQL, Postgres, and others; a little bit of the data warehouse side, and Elastic and Kafka. There is a proliferation of databases and data sources, polyglot persistence; it's been in the making for many, many years. And they are very different problems to solve. So you can solve a problem where, hey, put most of your data in the data lake and query there. That's the primary path. You might have 5 or 10% of workloads where I'm not able to move my data into a data lake, it's there, and so I need to query it. And if I can query it with the same engine, great. That's a federation use case. In some cases, you wanna do correlation across the two, but not all data sources are equal. So if you do a 5-way join on a MySQL, it's not going to stand up for long. And so from a Presto perspective, the primary core path is the data lake, I would say, with some focus on federation and the connectors.
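As a concrete illustration of that federation use case, assuming a cluster that already has both a Hive/S3 catalog and a MySQL catalog configured, a single statement can join across them; the catalog, schema, and table names here are hypothetical.

```python
# Illustrative federated query: joining lake data (hive catalog) with an
# operational table (mysql catalog) in one statement. Assumes both
# connectors are configured; all names are made up for the sketch.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com", port=8080,
    user="analyst", catalog="hive", schema="default",
)
cur = conn.cursor()
cur.execute("""
    SELECT c.customer_name, sum(e.amount) AS total
    FROM hive.sales.events AS e      -- Parquet on S3, the optimized path
    JOIN mysql.crm.customers AS c    -- live operational database
      ON e.customer_id = c.id
    GROUP BY c.customer_name
    ORDER BY total DESC
    LIMIT 20
""")
print(cur.fetchall())
```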
We continue to build that ecosystem, but the optimized path is the data lake path. From a Trino perspective, I see that it is largely focused on the federation element, and there is a proliferation of connectors. A slightly different problem to solve. That's how I see it. There are different use cases and different tools for different problems. So in that way, given that now there are different names and different use cases, users can decide what is the best tool for the job. Engineering decisions should be made. A little bit about those differences: since the hard fork, things have been added on both sides. So a lot of connectors on the Trino side. On the Presto side, RaptorX, Aria, these are only in Presto; they're not in Trino, as an example. Multiple coordinators: Facebook actually, you know, one of the, I would say, not great design decisions was to have a single coordinator for Presto.
And now there's an alpha available with multiple coordinators for Presto. And so that's really great to see, because you have not only, to some extent, HA for the coordinator, but it allows for scale. It allows for a lot more scaling, so that your scheduling and all of these things at the query level are not bottlenecked on the coordinator. And so I'm excited to see that progress and get to production level quality. And so there are all of these new capabilities that are getting added. The biggest one, which is forward looking, is a native worker. Presto is written in Java.
A lot of other engines are written in Java. But for the worker, which is the workhorse of the entire architecture, it would be great if it were written in native C++. And that's a project that's ongoing as well, the native worker, which is going to be in Presto, and the entire community is very excited about it. So it is a little bit more forward looking, but this is kind of game changing stuff, right, where it's like orders of magnitude difference when you can natively manage memory
[00:25:22] Unknown:
within the engine itself. So those are some of the things that are going on in Presto land. Yeah. It's definitely, as you said, important and useful to be able to identify those distinctions and focus. And they're definitely very different problems to solve for, so it's great to see that they're both being tackled and that they have their respective communities and supporting organizations. So I'm excited to see where the future takes the overall cloud and federated query capabilities. And digging a bit more into the open cloud aspect, some other areas that have been seeing a lot of development and activity are in some of the surrounding components and supporting infrastructure for the data lake and the cloud, thinking in terms of things like the Iceberg project, Hudi, Delta Lake, the Nessie project for working with Iceberg and adding some versioning and snapshotting capabilities, and LakeFS.
And then on the analytics side, there's been a lot of development, most notably with the Superset project and other engines that are sitting on top of the query layer. I'm wondering if you can just talk a bit about some of the developments that have been happening there and how that has influenced your particular focus and product direction for Ahana?
[00:26:34] Unknown:
Yeah. Absolutely. And this is an emerging area. I've been following it for a long time. I've written a lot of code in relational databases. As I call it, it's the deconstructed database. The great deconstruction has begun, but now there are actually legit options for each of these layers. So if you look at the top: at PrestoCon, I'm a program chair for PrestoCon, and we have these panels. We had Max from Superset. We had Vinoth from Hudi. We had Facebook and me representing Presto, and we talked about the stack. And that's why I said it's kind of the open source data warehouse, because you have an open source Tableau, if you will: Superset.
Great product. We actually are big fans. We have embedded it into Ahana, and so every Ahana user actually has a free Superset that's bundled and running in the compute plane. We call it Superset. We haven't rebranded it as Ahana; it's important to give credit where it's due. It's really a great kind of SQL editor that also does many more things, right, dashboarding and so on. So we're big fans of that. Then you have the query engine, which is Presto, let's take that as an example, which does the core SQL, everything from parser, compiler, optimizer, to execution.
Then you have the layer on top of the data lake, which is now a transaction manager. It can be Hudi. It can be Delta. Iceberg is, I think, still evolving; it's still more of a table format than it is a full fledged transaction manager, but my guess is that will emerge as well. And then you have storage, which is S3. This is the stack that is emerging, and we have multiple customers running the stack. So we have Superset, Presto, Hudi, S3. And Presto and Hudi, for example, have a very strong integration.
That's the Uber stack as well; Uber runs that stack. And from a Presto perspective, it's important to integrate with multiple different transaction managers. So there's a very strong integration with Hudi. There is an approach that integrates with Delta as well, and we have other customers who are using Delta, but there's more work to be done there. And similarly, the other projects will start coming in as well. The challenge is, and this is an area where things need to get a little bit more well defined, a little bit tighter over time, from a user perspective; otherwise, it gets complicated. You also have an element of governance. You have authorization.
And now, from a transaction manager perspective, you have versions. You have multiple versions of the schema. You have multiple versions of a table at a point in time. And you have security and governance. How do all of these things come together? That's something that we will need to look at jointly across all these projects and see how they integrate. But today, this stack mimics the data warehouse: the Presto, Hudi, Glue, S3 stack. It is an emerging stack. I see that in the next 2 to 5 years, these stacks will be the defining stacks for analytics.
What will happen is, once your data lands in S3, or even before it lands, users will pick, for the upstream ingestion path, a transaction manager. It will either be Hudi or Delta Lake. Those formats will be supported in S3, so those files will live in S3. The query engines on top will then define perhaps a virtual database and say: oh, there are files; tell me what format it is, tell me where it sits in the folder, tell me the partitioning scheme, and we'll figure out the rest. And it will create a virtual table, a virtual database, and then query it and allow for time travel, allow for the latest version of data, allow for many different things. Today, it's a bit limited. Over time, it will increase and give users even more of a true warehouse capability.
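A hedged sketch of what "defining a virtual table over files" looks like with the open source Presto Hive connector: an external table registered over an existing S3 prefix, followed by a partition sync. The bucket, schema, and column names are hypothetical, and the sync procedure's availability depends on the Presto version.

```python
# Sketch: registering a "virtual table" over Parquet files that already
# live in S3, using the Presto Hive connector's external table support.
# Bucket, schema, and column names are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com", port=8080,
    user="analyst", catalog="hive", schema="logs",
)
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_time timestamp,
        user_id    varchar,
        amount     double,
        ds         varchar
    )
    WITH (
        format = 'PARQUET',
        external_location = 's3a://example-bucket/events/',
        partitioned_by = ARRAY['ds']
    )
""")
cur.fetchall()  # drive the DDL to completion
# Register partitions already present under the prefix (availability of
# this procedure depends on the Presto version).
cur.execute("CALL system.sync_partition_metadata('logs', 'events', 'ADD')")
cur.fetchall()
```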
[00:30:28] Unknown:
Have you ever had to develop ad hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps platform that streamlines data access and security. Satori DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift, and SQL Server, and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access.
Go to dataengineeringpodcast.com/satori, that's S-A-T-O-R-I, today and get a $5,000 credit for your next Satori subscription. In terms of what you're building at Ahana now, I'm wondering if you can discuss some of the technical and architectural elements of actually building and running a managed system for Presto, because as you mentioned, it is a distributed system. Distributed systems are notoriously hard, particularly when it comes to maintaining overall uptime and reliability, and then adding in multitenancy on top of that. I'm wondering if you can just talk through some of the ways that you've approached that challenge.
[00:31:43] Unknown:
Absolutely. There's many different options. Right? And I've learned from my mistakes in the past and so built it right this time around. There are a few key concepts that I had to make decisions on early on. So for example, do you ingest the data to your cloud or do you leave it where it is? Right? Very, very important. Increasingly from a security perspective, users don't want to move their data anywhere. They want it in their own VPC. They want it in their own environment. And so because of this, we chose a path of control plane and a compute plane approach where the Ahana SaaS console or the SaaS itself is in Ahana's VPC or Ahana's environment.
But the compute plane, anything that touches data, is in the user's environment. And that way, we bring compute to where the data is, versus moving the data to where the compute is. In the case of a warehouse, you would have to take the data and move it to where the compute is, which is ingesting it into a Snowflake or a Redshift, and that also leads to, you know, lock in, costs, all of these things. And so this is a choice that we made early on. It's also now an AWS best practice; they call it the in-VPC approach. And so you have clear separation of what touches the data. So the Presto clusters, the Hive metastore: on a per cluster basis, if you'd like, you can have one checkbox and you get a Hive metastore that's managed for you. And so all of these things that touch data, including Superset, live on the compute plane, which is in the user's account.
Now, how the compute plane is created: this is the 30 minutes, zero to Presto in 30 minutes. The first time we onboard a user, the user starts, and we automate the entire process. So we create the VPC. It's cloud native, running on containers, which is another key design decision we made so that it can be multi cloud over time. So it's running on EKS. It creates a Kubernetes cluster, and then it manages all the networking for you. So the endpoints are created. Everything from network to OS is taken care of for the developer. Nothing needs to be done. And then you bring in your own data. So you can say, I have Glue, and you just give it the IAM role for your Glue, perhaps the IAM role for S3 if it's different. And you open your Superset console and you query your data in Glue. And that's as simple as it is. Two key design considerations that I made: take the compute to the data, and so we have the compute plane that's in-VPC, containerized.
Now, we don't pack the instances with containers, because these are data workloads. This isn't an app server where you can have 4 app servers running on a container; we have a 1-to-1 mapping. We have system containers, obviously, that check for uptime and track metrics. We track everything. Presto metrics are all integrated into CloudWatch, so that within CloudWatch, you have access to everything. Some information is pulled back; metadata is pulled back into the control plane. How many clusters do you have? Is auto scaling enabled? So when the HPA runs and auto scales the cluster, what happened? That's tracked back into the system. And all of this is pay as you go, so there's a billing element where you only pay for what you use. You know, we don't charge for managing your compute plane. That's free, though you're obviously paying for the resources.
We only price for the Presto clusters, or the Hive metastore. It's as low as, you know, 15 cents per hour for an extra large. That's a pretty easy way of getting started, as opposed to a 1 year mega deal for hundreds of thousands of dollars; that becomes very cost prohibitive. So we really want this to be running in every data platform team. We've enabled it in that way: marketplace first, pay as you go, very friendly to engineers, just removing the complexity away. That's the way I've designed it. Beyond just
[00:35:32] Unknown:
the sort of management of the Presto cluster, and by virtue of being in-VPC, you kind of remove some of the challenges that exist in multitenancy when they're all on the same system and the same hardware. But what are some of the additional tooling and supporting services and systems that you have built up to be able to simplify the work of actually managing those deployments, managing upgrades for customers, managing the uptime and alerting, and helping customers manage their Presto infrastructure so that they can focus on building the analytics and not have to be worried about: is this instance up? Is my scaling happening at the appropriate points?
[00:36:09] Unknown:
So there are many, many different things there. You know, just simple things like everything is API driven. So start, stop, restart, resize clusters: you can actually just programmatically include Ahana in your Python scripts. You can start a cluster, and once that's up and running, run your workload, then stop the cluster. All of this is programmatically done. On top of that, we have things like cost management. So for example, if your cluster is idling for a period of time, and it's configurable, it automatically shuts down and goes down to one node in a static use case, or it can auto size within a range that the user provides.
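To illustrate that "include Ahana in your Python scripts" workflow, here is a purely hypothetical sketch; the base URL, endpoint paths, and auth scheme are invented for illustration and are not taken from Ahana's actual API.

```python
# Hypothetical sketch of API-driven cluster lifecycle management: start a
# cluster, run a workload, stop it. Every endpoint, path, and header here
# is invented for illustration; consult the real API documentation.
import requests

API = "https://app.example-ahana.com/api/v1"   # hypothetical base URL
HEADERS = {"Authorization": "Bearer <token>"}  # hypothetical auth scheme

def cluster_action(cluster_id: str, action: str) -> None:
    # action: "start", "stop", "restart", or "resize" in this sketch
    resp = requests.post(f"{API}/clusters/{cluster_id}/{action}", headers=HEADERS)
    resp.raise_for_status()

cluster_action("nightly-etl", "start")
# ... point your Presto client at the cluster endpoint and run the workload ...
cluster_action("nightly-etl", "stop")
```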
And so all of these things are built into the management, into the SaaS console. On top of that, there's security. So we're adding more security and authorization to make it very easy to integrate with other systems. For example, you can configure Apache Ranger and add governance to your Presto clusters very, very easily. Data sources can be attached very easily. It's a complex process to attach a data source, something as simple as that: you have to create a catalog file and, if you're running open source Presto, figure out all the configurations that are needed for it, then restart the entire cluster and make sure that it gets picked up. So all of this is simplified, where we take care of the restarts. We take care of all the management of these data sources.
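For reference, this is roughly what "create a catalog file" means in open source Presto: a properties file per data source under etc/catalog/, followed by a cluster restart. The keys below are the standard MySQL connector properties; the connection details are placeholders.

```python
# What attaching a data source looks like in open source Presto: write a
# properties file into etc/catalog/ on the coordinator and every worker,
# then restart the cluster so it gets picked up. Keys are the standard
# MySQL connector properties; connection details are placeholders.
mysql_catalog = """\
connector.name=mysql
connection-url=jdbc:mysql://mysql.example.internal:3306
connection-user=presto
connection-password=<secret>
"""
with open("etc/catalog/mysql.properties", "w") as f:
    f.write(mysql_catalog)
# A managed service hides exactly this: file distribution, restarts, and
# verifying the new catalog is visible (e.g. via SHOW CATALOGS).
```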
From a monitoring perspective, alerting: there's some alerting built in. There's the ability to send logs automatically; you just click a button and logs come over, if you choose to. And it's a SaaS, right? So we can track alerts and errors ahead of time and proactively, as opposed to reactively, help users, which is the difference between an installed product and a SaaS managed service. So we're just getting started. It's just been 15 months since we founded the company, and we just announced a pretty jumbo Series A raise a couple weeks ago. Excited about that. And that's where a lot more observability will come in. Performance enhancements will come in, both from a Presto perspective and from a managed service perspective, and now we're looking forward to building out all these cool features.
[00:38:15] Unknown:
And so in terms of the overall space of analytics and open data lakes, what are some of the other interesting elements of the stack that you've been seeing come up or interesting additional systems that people have been building to reside next to or in conjunction with your Ahana services to be able to provide either more availability of data to multiple sources or be able to do more interesting analytics or build machine learning pipelines that feed into or feed off of the work that they're doing within Ahana.
[00:38:48] Unknown:
So there are some adjacent areas that we get questions about. There's the data catalog space that's emerging as well, which sits kind of next to Presto. It's very confusing, because you have the operational catalog, like an HMS or a Glue, which is actually just system to system. Right? It's more system to system than human to system. But there's a human element of the catalog and schema, and really kind of the business side. Who owns the data? Who owns the schemas? How are they managed? The lineage of it. And that's the data catalog space. There's Amundsen. There's DataHub. There are a couple of other projects that sit next to it. For example, Amundsen integrates with Presto.
Right? In some cases, it might be able to pull all the catalogs through Presto, because Presto itself is federated and can reach out to many different data sources. And in some cases, it'll go and directly connect with the other systems. So I would say that is an entire area that's emerging. Still early. There are proprietary, closed source options there, but we're in the open source is eating the world phase; there are a lot of open source options in the data catalog space, which is kind of right next to the query engine in the stack. We'll see more integration there and so on. The way I think of it: if there's an analyst who's trying to look for a table, look at which columns to query, when he or she finds the table that they're looking for and they wanna run a query, it should be fully integrated. There should be a direct interface with a Presto or something like that where you go and run it, and it's ad hoc and interactive, truly ad hoc analytics. That doesn't exist today. You know, you find the table and the schema here, then you go to your Tableau or some other tool and you actually run it, or for data science, you write a SQL notebook with Jupyter and then you run it. And this can be simplified more, but we're talking about 5, 10 years from here on out. That's an area where there's more joint collaboration.
The other area is governance and authorization and how that all fits together. Why does this become even more important with the data lake? The reason is that, given that it's an open data lake, this data is being used not just by one system. It's not just one database, so it's not security for just that one database; it's security for all the data in the lake that's being used by many different systems. And so defining that: there's Ranger, obviously, in that space, and AWS is adding its own governance layer called Lake Formation. And that's kind of an interesting space where there is a lot more integration that needs to be done with the data lake itself. What's happening is, typically in a database, you would have in-database authorization, and the authorization would happen at the top, with the query engine, right at the top of the database. But now that security layer is being moved down, right on top of the storage layer.
And it's another change in paradigm. So you're not doing in-database authorization or RBAC anymore. Now the query engine needs to deal with storage and say: hey, does this user, does Tobias have access to this podcast table or not? And that check is happening further down, versus in the database. And that's changing the way things work as well. So I would say those are two areas where I'm seeing a little bit more questions coming up. The paradigm is changing, and we're working on integrations for all of these as well. Yeah. The aspect of the authorization needing to live with the data, and not just in the system that manages the data, is an interesting one. And I had an interesting conversation yesterday with some of the folks from Cinchy talking about
[00:42:17] Unknown:
their concept of dataware, where the data is the kind of sovereign entity and the application is just something that interacts with it, versus the application owning the data and then you having to either extract it and integrate it elsewhere or ask the application for access to the data. So the fact that this is happening more in the data lake space as well is definitely interesting, along with the attribute based access control as the next evolution beyond RBAC and some of the security elements there. And then another interesting point to touch on is the idea of the catalogs.
You mentioned the Hive metastore and AWS Glue as the operational catalogs. And for a long time, it was just the Hive metastore. So I'm wondering if you're seeing any activity in alternative implementations of that or alternative approaches to building this operational metadata catalog and being able to do a more direct and seamless integration with some of these sort of organizational data catalogs where maybe organizational catalog is able to push table information down into the operational catalog or vice versa, just having them be more of a cohesive unit versus 2 separate entities?
[00:43:24] Unknown:
I really hope so, but it's a hard problem to solve. And the reason is the catalog is where everything about the database lives. It's the connector. Otherwise, S3 is completely useless. It's objects; it's a bunch of objects that nobody knows anything about. Without the operational catalog, you don't know that it's a table, what it's part of, what format it is. You don't know anything. It's a very key piece, a very, very key element of the entire stack. In fact, my guess is that's why AWS calls it Glue. It's the glue that connects all of these different pieces together.
Now, the purposes of these two systems are very different. The reason you need an operational system is that the query engine needs it. It's not a human that needs the operational catalog. And so the way you build that, and the principles that you build it for: high throughput, low latency, high concurrency. Very different from a human interaction data catalog, where the experience matters, where the interactivity is important. Concurrency: at most, you're gonna have maybe 5,000 analysts, 10,000 analysts. That's not the case with some of these other workloads; you can hit the HMS, I mean, hundreds of thousands of times. And so the way you build these is different. My take: they won't come together, because it would be very difficult to build something that works for both of these use cases. Just like an OLTP system, a MySQL or a DB2, may not be usable for analytics; you use a warehouse.
Right? And so I think that's what's happening here, where, unfortunately, the naming is confusing everybody; they're both called catalogs. But it's a human catalog and an operational catalog, is the way I simplify it for engineers who try to say, oh, they're the same thing. It's like, no, they're not. These are the differences. There is an emerging element, though, on the operational catalog side, because of the transactional layers. The transactional layers are maintaining versions of when updates happen. So whenever there's an update that happens, and there are two kinds of updates: there are schema updates and then there are actual data updates, where you insert a row into a table. For schema updates, you can actually track those as well now. And so you can see where a column is added. You can see where a column is deleted, and these can be tracked. So you actually have versions of the schema itself.
And the Hive metastore, or Glue for that matter, which is HMS compatible, are not yet supporting versions of schemas. And that's the evolution on the HMS side; I see that we'll need to add that in: going back in time and saying, okay, what changed? Or even performing a query on a previous schema. Because not only do you have the schema, you also have the data along with it. So that's what makes it interesting. It creates a new dimension to analytics where you can do historic analysis. It's great for threat detection or security kinds of elements. It would be great for health care, life sciences.
And so that's the evolution I see on the operational catalog side of things.
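As a rough illustration of the version tracking she describes, transactional table formats such as Iceberg expose snapshot history through metadata tables that the query engine's connector can inspect; the exact syntax and column set vary by connector and version, and the table name here is hypothetical.

```python
# Illustrative only: transactional layers keep a snapshot history that the
# query engine can inspect, which is roughly the "going back in time"
# described above. Syntax and columns vary by connector and version; the
# table name is hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com", port=8080,
    user="analyst", catalog="iceberg", schema="logs",
)
cur = conn.cursor()
# Metadata table listing the snapshots (versions) the table has gone through.
cur.execute('SELECT snapshot_id, committed_at FROM "events$snapshots"')
for snapshot_id, committed_at in cur.fetchall():
    print(snapshot_id, committed_at)
```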
[00:46:27] Unknown:
In terms of the work that you're doing at Ahana and your vision for the overall product and the direction that you're taking it, I'm wondering what are some of the other industry trends or technical developments or community activities that you're keeping an eye on that are helping to inform the ways that you're thinking about this problem space?
[00:46:47] Unknown:
So it's all of the things that we've talked about, which is a lot. We've been talking for a while, Tobias. Hopefully, people are still listening and haven't dropped off, but this is a very, very, very interesting space. There's the data lake angle. There are transactions. There's security. There's the catalog element. There is also the cloud element, and that's evolving. We will see new ways of that emerging on the cloud side. A lot of it so far has been adoption of the cloud; now it will be cost savings. There's a big element of cost savings, and so we'll need to bundle that into the SaaS, observability.
There is an element of, my cofounder and CTO, Dave Simmen, an incredible database leader, he talks about machine learning and the database kind of coming together. Up until now, we've had an optimizer that was either hint driven, you have to give it hints, or stats driven, which is a cost based optimizer, which came out of DB2. But there is an element of, well, it could be real workload driven. We are already capturing workloads. We're capturing trends: what tables are popular, what queries are popular, what's the selectivity of these predicates. And there's an element where we can feed this back into perhaps an adaptive optimizer, where we can actually see what indexes are needed, perhaps, or what are the other kinds of data structures that might be useful for this workload, and create them on the fly. This intersection of cloud and data and open source, I would say, allows us to be able to do these things in the future. That is something that we look at over time, bringing these together so that it is not a manually driven, user driven creation of materialized views or stored procedures or indexes. It's all automated. And with the cloud, you have the elasticity of the cloud. You can bring up instances as needed, and you have the cheap storage that allows us to do some of these things. So that's really kind of more forward looking. There are a few good projects right now, like OtterTune, which is ML for databases.
There are a few good projects in ML for databases right now, like OtterTune out of CMU. We're early in that space, but that is a divergent space that I hope to see Ahana evolve into.

As you have been engaging with the Presto community and building out the Ahana product, what are some of the most interesting or innovative or unexpected ways that you've seen either or both of them used?

Yeah. I think the good news is that the use cases for Presto are pretty well understood, right, in some ways. So it's interactive queries, so think reporting and dashboarding, and then data science and SQL notebooks. Those are pretty standard use cases.
1 very interesting use case, and this is more perhaps on the business side, comes back to polyglot persistence, which is an interesting problem to solve. With COVID, there was a government agency that had to speed up a lot of its tracking. 3 months in, they were focused on different things, and they ended up using 4 different databases. And now, 12 months into the pandemic, they're like, oh, there are 4 databases. Yes, we eventually wanna move to the data lake, but how do we query these? And so that was a very interesting use case through this pandemic, and, you know, Ahana was born in the pandemic; we founded it in April last year. For them, it's a journey now of moving to the lake. And as they're moving to the lake, they can still query these different data sources with Presto and Ahana. That's an interesting use case given the times that we're in. That's 1.
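For a sense of what that querying looks like, a single Presto statement can join tables that live in different systems through different connectors; this is a hedged sketch with hypothetical catalog, schema, and table names, not the agency's actual setup:

    -- One federated query spanning two operational databases
    SELECT c.case_id, c.region, t.result
    FROM postgresql.tracking.cases AS c
    JOIN mysql.labs.test_results AS t
        ON c.case_id = t.case_id;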
And then, in terms of what was not expected: Spark is a great engine for transformation and ETL because it executes stage by stage, which is actually more fault tolerant than Presto, which is really more in memory, a pipelined, streaming architecture. So Presto is really great for interactive versus long running queries. But we're starting to see more folks use it for transformation, SQL based transformation. We have a customer that runs an exchange. They were running their pipelines on Presto on the cloud, and they've tried different things. A pipeline that was taking 72 hours with their previous option runs in 20 hours with Ahana. They can run more; they wanna run more workloads. And frankly, I wasn't expecting to see transformation ETL workloads this soon in this process. But it speaks to how important SQL is and how widely adopted it is, because not everyone wants to write Python pipelines or these complex data engineering jobs. They just wanna use SQL.
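A SQL based transformation on Presto often amounts to a CREATE TABLE AS SELECT that writes back to the lake through the Hive connector; the names and properties below are illustrative, not the customer's actual pipeline:

    -- Materialize a daily aggregate as a partitioned Parquet table on the lake
    CREATE TABLE hive.analytics.daily_trades
    WITH (format = 'PARQUET', partitioned_by = ARRAY['trade_date'])
    AS
    SELECT
        symbol,
        SUM(quantity * price) AS notional,
        trade_date
    FROM hive.raw.trades
    GROUP BY symbol, trade_date;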
And Presto is an incredible SQL engine, so, you know, they start trying it out and it does the job. That was an unexpected use case this early on at Ahana. So not only do we have interactive and customer facing applications, like Securonix, a large security company in the SIEM space that is doing threat detection and threat hunting with Ahana and Presto, but you also have these transformation workloads coming up. And Facebook actually has built Presto on Spark, so we'll see how that fits in. It's very interesting to see these 2 workloads coming together, because it's like a steering wheel. Whatever the engine might be, an electric engine or a gas engine, it helps if you have 1 steering wheel. And so not only is Presto driving Presto, it's also starting to drive Spark. That's interesting to see.
[00:52:17] Unknown:
And as you have been building out the technical and organizational aspects of Ahana and engaging with customers and the community, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:52:25] Unknown:
It's been quite the journey. These 15 months have been, I would say, perhaps the most interesting 15 months of my career so far, over 15 years in data. We started in a pandemic, and recruiting and scaling in a pandemic, I would say, is nontrivial. We have great investors, you know, Google Ventures, Lux Capital, Leslie Ventures, and Third Point, all great VCs and hedge funds. But convincing someone to leave a good job and start at a startup, believing in this vision of open data lakes and Presto, is nontrivial. This is my 6th startup, actually.
And, you know, you've gotta have a very clear strategy. Strategy trumps execution. If you have a strong strategy and you believe in it and execute on it, and you have a team that can be brought together to execute on it, the sky is the limit. So far, like every company, there are challenges. For us now, it's how fast we can grow and scale, and how fast we can execute and deliver an amazing product to our customers. We're in the next phase, where we have great validation of our existing product; we have tens of customers already using it. So the next phase is building even more value into the product, from a Presto perspective, from an open source community perspective, as well as an Ahana perspective, and continuing to grow the open source user base on Presto, expand the community, drive the community, and expand customer adoption in that process as well. So I would say scaling and growing are some of the key challenges, but that's also a good problem to have.
We have a phenomenal team. My cofounder Steven and I have known each other for many years. For me, this is like 15 years in the making, across many different projects, many different products. And, yeah, I'm looking forward to where it's headed.
[00:54:25] Unknown:
For people who are interested in being able to manage their data at scale, and who are maybe interested in the open data lake aspects, what are the cases where Ahana and/or Presto might be the wrong choice?
[00:54:30] Unknown:
Yeah, good question. I think in general, you know, we talked a little bit about polyglot persistence. There are different types of databases, different types of systems, for different areas. If you wanna use it for OLTP, it's not a good option; it really doesn't support transactionality. It's best to use more traditional databases there. If you're very new to this big data space, it is way easier today than it was 2, 3 years ago. But as a platform team, if you're only familiar with relational databases, then maybe a cloud data warehouse might be where you start off, and then you augment it with a data lake. You do have to future proof your platform, though, and that's where the data lake comes in, because over time, there is no doubt that that is the emerging trend for analytics. That's where it's headed.
And so that's the other case where I would say, if you're a 1 person or really, really small team, maybe start off with a cloud data warehouse and then add to it, augment it. There's a slight education, a little amount of knowledge of how these things fit together, that's needed. Now, Ahana obviously simplifies that greatly. You can get started very, very easily, but there is an element of understanding the storage tier itself, which is slightly different. And then the 3rd case, I would say, is transformation workloads. If you are doing heavy, long running batch workloads, Presto on Spark is an option now, but pure Presto and Ahana might not be the best choice. Spark would be much better; that's what Spark was built for. Databricks for transformation workloads, ETL pipelining workloads; that's what they're great for. So Presto is for interactive, ad hoc, data science, and SQL notebook workloads, and federation to some extent, obviously.
That's what it should be used for. Other use cases will emerge over time; every system is going to try to do more, consume more, and get more workloads running on it. But that's the sweet spot for Presto and Ahana, so that's what I would recommend.
[00:56:35] Unknown:
And as you continue to build out the product and the business, what are some of the things that you have planned for the near to medium term, or any projects that you're particularly excited for?
[00:56:44] Unknown:
Yeah. I think we talked a little bit about it. There are performance aspects that we're working on, more integrations, and providing some of these great features, like RaptorX, for example, within Ahana itself. We merge with Presto fairly often; every 2, 3 months, we'll rebase, so the latest and greatest is available within Ahana as well. And when we develop something, we'll push it into Ahana and then open source it. So, obviously, our users get it first, but it's also open source and others can benefit from it. Specifically, on our second half Presto road map, we're looking at tighter integration with Delta. That's something that is interesting.
RaptorX, I already mentioned. Aria for Parquet is an optimization where we're seeing good early results, and so we wanna push that out. These are some of them. The Ahana road map is a little bit more closed source, but the Ahana Presto road map is all open source and actually publicly available in terms of what we're building. And so that's the beauty of a community driven project: you'll see what all the members are working on and what's coming.
[00:57:34] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:58:04] Unknown:
We touched on some of these gaps. I think that while it's great to see this data lake emerge, with the transaction management and all of these areas, there's still a long way to go. Presto needs to evolve to support some of the more core database capabilities. Dave just pushed out a PR for supporting primary key constraints, which don't exist in Presto yet. So there are some big gaps that need to be addressed on the Presto side from a language perspective, so that more and more workloads, data warehouse workloads, truly can run on the lake. The other area is transaction management, which is very, very early. Transactionality is at a very low granularity today; it needs to cover many more elements, the multi statement level and so on, and that needs to emerge. So in the data space, I would say, in each layer of the stack there's a lot of work to do. There are a lot of gaps to truly get to the state of the art that databases are at, because that's like 50 years in the making. A lot of those concepts still apply, and they're very relevant, but it takes time, and it takes work to build the right architecture.
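As a purely illustrative sketch of what a declared primary key constraint could look like, with a hypothetical table and without claiming this matches the syntax of the PR mentioned above:

    -- An informational (unenforced) constraint that the optimizer could
    -- exploit for uniqueness based rewrites; syntax is illustrative only
    ALTER TABLE hive.analytics.daily_trades
        ADD CONSTRAINT pk_daily_trades PRIMARY KEY (symbol, trade_date);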
[00:59:13] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Ahana and the ways that you're engaging with the Presto community. It's definitely a very interesting platform and an interesting business, and I look forward to seeing where you take it. So thank you for all the time and energy you're putting into it, and I hope you enjoy the rest of your day. Absolutely. Thanks so much, Tobias. Always a pleasure.
[00:59:34] Unknown:
Good luck as well with your growing podcast. Very exciting to see. So thanks again for having me on.
[00:59:44] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Dipti Borkar: Introduction and Background
Building Ahana and the Presto Community
Challenges in Data Management and Presto's Role
Presto vs. Trino: Differences and Focus
Developments in Data Lake and Cloud Analytics
Technical and Architectural Elements of Ahana
Supporting Systems and Services
Industry Trends and Future Directions
Scaling and Growing Ahana
Future Plans and Exciting Projects
Biggest Gaps in Data Management Tools
Closing Remarks