Summary
The database is the core of any system because it holds the data that drives your entire experience. We spend countless hours designing the data model, updating engine versions, and tuning performance. But how confident are you that you have configured it to be as performant as possible, given the dozens of parameters and how they interact with each other? Andy Pavlo researches autonomous database systems, and out of that research he created OtterTune to find the optimal set of parameters to use for your specific workload. In this episode he explains how the system works, the challenge of scaling it to work across different database engines, and his hopes for the future of database systems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
- Your host is Tobias Macey and today I’m interviewing Andy Pavlo about OtterTune, a system to continuously monitor and improve database performance via machine learning
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what OtterTune is and the story behind it?
- How does it relate to your work with NoisePage?
- What are the challenges that database administrators, operators, and users run into when working with, configuring, and tuning transactional systems?
- What are some of the contributing factors to the sprawling complexity of the configurable parameters for these databases?
- Can you describe how OtterTune is implemented?
- What are some of the aggregate benefits that OtterTune can gain by running as a centralized service and learning from all of the systems that it connects to?
- What are some of the assumptions that you made when starting the commercialization of this technology that have been challenged or invalidated as you began working with initial customers?
- How have the design and goals of the system changed or evolved since you first began working on it?
- What is involved in adding support for a new database engine?
- How applicable are the OtterTune capabilities to analytical database engines?
- How do you handle tuning for variable or evolving workloads?
- What are some of the most interesting or esoteric configuration options that you have come across while working on OtterTune?
- What are some that made you facepalm?
- What are the most interesting, innovative, or unexpected ways that you have seen OtterTune used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on OtterTune?
- When is OtterTune the wrong choice?
- What do you have planned for the future of OtterTune?
Contact Info
- CMU Page
- apavlo on GitHub
- @andy_pavlo on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- OtterTune
- CMU (Carnegie Mellon University)
- Brown University
- Michael Stonebraker
- H-Store
- Learned Indexes
- NoisePage
- Oracle DB
- PostgreSQL
- MySQL
- RDS
- Gaussian Process Model
- Reinforcement Learning
- AWS Aurora
- MVCC (Multi-Version Concurrency Control)
- Puppet
- VectorWise
- GreenPlum
- Snowflake
- PGTune
- MySQL Tuner
- SIGMOD
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey. And today, I'm interviewing Andy Pavlo about OtterTune, a system to continuously monitor and improve database performance via machine learning. So, Andy, can you start by introducing yourself? Hi. So my name is Andy. I am the cofounder and CEO of the OtterTune startup. I'm also an associate professor in the computer science department at Carnegie Mellon University. And do you remember how you first got involved in the area of data management?
[00:01:59] Unknown:
So my PhD is from Brown University in databases. I didn't plan on doing a PhD in databases at the very beginning. I was originally interested in distributed systems, but with databases, it sort of turned out to be this thing where I seemed to understand it better than other people. What was challenging for other people, like working with databases and using databases, sort of came naturally to me. So that is sort of how I ended up in the space. And then when I started at Brown, my advisers were Stan Zdonik and Mike Stonebraker, and, you know, they were building a brand new database system from scratch called H-Store, and I was recruited to work on that project. And so I've been building database systems since then. Yeah. It must have been a privilege to be able to learn about databases under the tutelage of Mike Stonebraker because he's kind of legendary in the space.
Yeah. I did a pre-doc at the University of Wisconsin, you know, before I sort of started my PhD. And I just remember that was the first time I started hearing the name Stonebraker. He just came up in various conversations about all the great things Mike did, like, he was this amazing researcher. And then when I finally got to meet him, it was a bit of a surprise because he's very quiet. He's very tall. I think 6 foot 6 is the official measurement. He was a larger-than-life figure, and I'm very grateful for the time I got to spend with him.
[00:03:16] Unknown:
And so you've been working in databases for a while now, and, you know, OtterTune is something that is aiming to help with tuning and maintenance of databases. I'm wondering if you can just start by giving a bit of a description about what the OtterTune project is and some of the story behind how it got started and the goals that you have for it.
[00:03:36] Unknown:
So OtterTune is an automated database tuning service that targets database configuration knobs. So by knob, I mean a control parameter or configuration parameter that the database system exposes to you as an administrator or a user to allow you to tweak the behavior of the system. And the reason why these knobs exist is because when the database developer was building the database system, at some point they had to make a decision about, say, how much memory to allocate for a hash table or something about the feature they were building. And rather than using a #define that was meant to be for all potential users, they instead expose it as a configuration knob, because they assume that someone else is gonna come along who knows exactly how the application wants to use its database system and will be in a better position to set it correctly. These knobs cover things like buffer pool sizes, caching policies, log file sizes, various aspects of the database system's runtime behavior.
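For a concrete sense of what these knobs look like, here is a small illustrative set of real PostgreSQL parameters of the kind being described; the values are made-up placeholders for the sketch, not recommendations from the episode.

```python
# Illustrative only: a few real PostgreSQL knobs of the kind OtterTune targets.
# The values are placeholder examples, not tuned recommendations.
example_knobs = {
    "shared_buffers": "8GB",               # buffer pool size
    "wal_buffers": "16MB",                 # write-ahead log buffer
    "effective_cache_size": "24GB",        # planner's estimate of available cache
    "work_mem": "64MB",                    # per-sort/per-hash memory budget
    "checkpoint_completion_target": 0.9,   # spread checkpoint I/O over the interval
}
```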
And so if you can set these knobs correctly for how the application is gonna be using the database, like, what kind of queries it's gonna run and what the environment looks like, what the hardware looks like, then you enable the database system to run more efficiently to achieve either better performance or a reduced cost of the deployment. So I got down this path when the project started in 2013, my first year at Carnegie Mellon University, and I was looking at how we could use machine learning to automatically improve the performance of a database system. There had been a ton of work going back to the 1970s on automatic database design, picking indexes and partitioning keys. I had done some of that work when I was in graduate school, but then I felt that there really wasn't a lot of stuff looking at how to automatically configure knobs. There was some work done in the 2000s.
Commercial systems had these tools that were rule based that could set these things, but no one had really looked at whether you can sort of holistically look at the database system through these knobs and tune them correctly, and then be able to reuse the training data you've collected from one tuning session for one database and then apply it to another database. So all the existing tools that were doing some amount of machine learning were always learning everything from scratch. So we thought, okay, what can we do to expedite the process and learn from one database and apply it to another database? In particular, I was also interested in how far we could get without having to get access to real world queries and datasets.
So the way OtterTune works is that we only use the metrics that the database system generates and exposes to you. We don't need to see any user data or query traces, which was a big problem when I was in grad school, trying to get access to those things. So we looked at a way to not have to, you know, ask for this kind of data from people outside of the university.
[00:06:21] Unknown:
And I know that, I think it was a few years ago now, there was a research project that was building custom indexes based on machine learning, based on the usage of the data, that gave better performance than things like a generic B-tree or, you know, a hash index. But that obviously requires access to the data that's in the table. And so I can see where, you know, that's definitely something that's challenging and would have to live within the database and not be able to have any external access to it. And so I'm wondering what you've seen as far as the potential performance capabilities of changing these configuration parameters based on the actual database workload as compared to something like generating customized indexes based on the data that's in the tables?
[00:07:09] Unknown:
Yes. So the learned indexes, that's a separate topic. We can come back to that. It's tough to compare, right? They are 2 different approaches. With OtterTune, it's a black box approach. So, again, we don't control the internals of the database system. We don't require the user to install any kind of special extensions. It only looks at the database using the APIs that the system already exposes and the information that it already exposes. And so the benefit you can get from automated tuning with something like OtterTune will obviously depend on many factors, as in all cases in databases: it depends on what the queries are, how big the database is, depends on whether you've already tuned it or not. We see people still using the default configuration that Amazon gives you on RDS, which Amazon has tuned a little bit, but it's still not as good as what, you know, you could achieve if you tailor it explicitly for the application.
So it's tough to say, like, oh, OtterTune is gonna give everyone a 2x benefit. You know, if you've already tuned some things a little bit, then, you know, the benefit we can provide is potentially lower. Now I will say that we've done real world deployments of OtterTune where we have compared it against configurations that have been tuned by human experts. And the algorithms can still find some room for improvement. So we've been pretty happy with that outcome: even after a human expert has gone at it, we can do a little bit better.
[00:08:32] Unknown:
And another project that you've been involved with in your time at CMU is the NoisePage database, which is an entire engine that's built around the idea of, you know, being maintenance free and automatically tuning itself. And I'm wondering if you can provide some details as to any lessons that you've learned from your work on NoisePage that have been applicable to OtterTune and vice versa, and any sort of relationship that might exist between the 2 projects.
[00:08:58] Unknown:
So my research is divided into 2 tracks. There's the OtterTune project, which is taking a black box approach to automatically managing database systems, and then NoisePage is a white box approach, where you're designing the database system from scratch with the assumption that it will be entirely controlled by machine learning algorithms. So with NoisePage, we make certain decisions as we build the system that are not going to potentially achieve the best performance, but it makes it easier to control and observe its behavior while it's running, which is what you need to have if you wanna be completely autonomous.
So with NoisePage, we've been dubbing this a self-driving database system, and I realize Oracle uses that kind of language in their marketing. I disagree with their characterization of what they're doing. So in NoisePage, what I'm defining as a self-driving system is one that can forecast the workload in the future, proactively make changes to prepare for it, and then observe those changes and learn from them. Things like OtterTune and other sort of autonomous tools that are out there, they only solve half the problem. They only tell you what to change to make things better, but they don't tell you how and when to apply those changes. So with NoisePage, we wanna complete the circle and have the system automatically figure out, okay, these are the changes I need to make, picking indexes, tuning knobs, changing hardware configurations, and then apply the changes, observe what happens when you make them, and then repeat the process over and over again and learn from it. So the lessons that we've learned from our research have mostly gone from the OtterTune side to the NoisePage side. I've seen the struggles that we've had in trying to get OtterTune to tune Oracle and Postgres and MySQL and other things. And then we go and say, okay, well, now when we build our system, let's not make those same mistakes. The clearest example would be requiring a restart for some knobs to take effect. So in every single database system that's out there, there are some knobs where you can make the change, but the database system actually doesn't apply them until you restart the system and come back online.
The open source guys have a bunch of knobs that are like this. The commercial guys have them as well. There are fewer of them, but nobody is immune to this issue. So in the case of NoisePage, we tried to build the system so that any change like that doesn't require us to restart the system, because if you have to restart the system, now you need to know whether you're even allowed to restart, and that's an external cost that the algorithms can't easily reason about. Like, I don't know whether, at, you know, 1 o'clock in the afternoon, it is okay for me to take your database offline for an hour. Probably not, but I don't know what the implications of that are. So in OtterTune, you have to manually tell us when you're allowed to restart. In the case of NoisePage, we said let's avoid that problem entirely
[00:11:38] Unknown:
and let's build the system so that we don't have to restart at all. Going back to the knobs that are available to operators of the databases, I'm wondering if you can just talk through some of the kinds of challenges that database administrators and operators and end users of the system run into when they're trying to figure out how to optimize the system, tune it for their workload, or just get some, you know, performance improvements if they see that they're starting to hit some limitations with the way that they're using the platform, and just some of the sources of that complexity and confusion that arise?
[00:12:14] Unknown:
So the way people can tune database systems now, tune the knobs now, there are essentially 4 ways to do this. You can try to do it yourself, and it usually falls into, like, the DevOps role. And this is where you have somebody who maybe knows a little bit about how a database system works, but maybe doesn't know entirely all the internals. And what they usually end up doing is just googling, you know, how to tune MySQL, how to tune Postgres. And there's a bunch of these blog articles written by some of the database consulting companies and the vendors that will give you some rough guidelines on maybe 5 knobs you could try to tune, and they tell you, you know, a formula for how to tune them, maybe based on how much RAM you have, how much CPU you have. In our experiments, we actually found some of these recommendations in the blog articles for Postgres and MySQL are actually incorrect. You know, it varies widely by the application.
If you have a little bit more money and time, you can hire a DBA to come in and try to tune these things, and these are people that have had experience in tuning, but typically they're only gonna tune maybe 10 to 12 knobs, if that. And we found that for things like Postgres and MySQL, there could be, you know, a couple dozen knobs that actually could affect performance. And there are gonna be implicit dependencies between these knobs that are not readily understood or discernible by humans, you know, where you maybe tune one knob and that changes how you wanna tune another knob, because it changes the behavior of that other knob. And so it's very difficult for people to reason about these things. The next approach would be these tuning scripts that are available, where you punch in some basic information like how many cores you have, how much RAM you have, maybe what the disk and the workload look like. And then there are these hard-coded formulas in these programs where they just spit out some recommendations for some knobs. So all the major database vendors have their own things. And then for Postgres and MySQL, there are open source tuning scripts that are available.
And, again, these are, like, the lowest common denominator. They're not gonna be explicitly for your application. You know, it's gonna be sort of general purpose. And we have found in our own experiments that we can beat those things pretty easily. So the challenge is a combination of not knowing what to tune, not knowing how to tune them, not understanding that if you tune one knob that could affect the behavior of other knobs, and then, probably more importantly above all of this, it's just a matter of time. Like, it takes time to try these knobs out and see whether they're actually, you know, making a difference or not, and then you have to manually keep track over time, like, okay, what knobs did I change? Did that help me or not? And, you know, DBAs and developers and DevOps people don't have time to do these things. They have all this other stuff they have to do at their jobs, and database tuning is almost always relegated to something that, you know, I'll get to when I can. And then eventually a performance issue comes up, but by then it might be too late, and, you know, you still have to do this manual process of trying to figure things out. So OtterTune takes a lot of this complexity and a lot of this hardship away, because we can automate as much as possible, you know, this process of figuring out how to tune your database system. And it can reason about the complexity of all these different knobs in a way that humans just aren't gonna be able to do.
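To make the contrast concrete, here is a minimal sketch of the kind of hard-coded, hardware-only formula those tuning scripts rely on. The ratios are common PostgreSQL rules of thumb, not what any particular script actually emits, and the point is exactly the limitation described above: the workload never enters the calculation.

```python
def rule_of_thumb_config(ram_gb: float, max_connections: int = 100) -> dict:
    """Hard-coded, hardware-only heuristics in the style of generic tuning scripts."""
    ram_mb = int(ram_gb * 1024)
    return {
        "shared_buffers": f"{ram_mb // 4}MB",            # ~25% of RAM
        "effective_cache_size": f"{ram_mb * 3 // 4}MB",  # ~75% of RAM
        "maintenance_work_mem": f"{ram_mb // 16}MB",
        "work_mem": f"{ram_mb // (4 * max_connections)}MB",
    }

# Same answer for every workload on a 32 GB box, whether it runs OLTP or analytics.
print(rule_of_thumb_config(ram_gb=32))
```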
[00:15:16] Unknown:
Looking at some of the configuration options for the different databases, there are, you know, potentially dozens, if not hundreds, of configuration parameters. And I'm wondering what you have seen as some of the contributing factors to adding in all of these different options that lead to this potential complexity and all these different sort of implicit dependencies in terms of the impact that setting, you know, a handful of different values together will have and not being able to clearly describe what the relationships are, and just sort of the evolutionary and incidental complexity that comes out of that. Database systems are complex pieces of software, and so it's not surprising that they have a lot of knobs
[00:15:59] Unknown:
that you can tune, because there are a lot of features in them. And when you look at very popular database systems, Oracle, Postgres, MySQL, SQL Server, all these systems have over time added new features, and it's gotten even more complex. So it's not surprising that the number of knobs that they expose has been increasing. My PhD student, as part of her research, did a survey where she looked at the last 20 years of release history of Postgres and MySQL, and for every release she just counted the number of knobs that they listed in their documentation. Over a 20 year period, you know, MySQL increased by 7x and Postgres increased by 5x in the number of knobs that they have. Now, not all of these knobs are gonna affect performance. They're gonna be things like file paths and port numbers, things that you're not gonna want machine learning algorithms to set for you, but there's still enough of them that can affect performance.
And so figuring out which ones you actually wanna tune is challenging. And also, depending on the workload, some knobs actually don't do anything. If you're not inserting any data, if it's just read-only queries, then the knobs related to inserts, you know, potentially won't make a difference. So just knowing what you should be tuning is another challenge as well. So I think these systems are very complex. They're very popular. People are adding new things to them all the time. So that's why the number of knobs they've exposed has grown so much. Now with commercial systems like Oracle, it's even more difficult because there are some hidden knobs that you're not supposed to tune that can affect performance.
And in Oracle, they start with an underscore, and then the super secret ones that you're never supposed to touch start with a double underscore. You can tweak the single underscore ones. You're not supposed to in some cases, but you could. And so that just makes it even harder to figure out what's going on, because now there's this whole other class of configuration knobs that I could be touching, and I don't know whether I should be or not. So the problem has certainly grown, and it's not something that's gonna be easily solved in the future. Like, the database vendors are not gonna remove these knobs because they're there for a reason. With the number of parameters that are out there and the fact that not all of them are going to have an impact on performance or scalability, I'm curious what your approach has been in terms of identifying
[00:18:05] Unknown:
the parameters that you want to look at and building the machine learning model to be able to understand the impacts that it has, and just the overall approach that you've had to actually building out the OtterTune system for being able to learn and intelligently
[00:18:21] Unknown:
determine which configuration parameters to set and what values to use for them? The first thing we do when we bring on a database system and wanna start including its knobs, so let's say there's a new release of Postgres and we wanna incorporate it into OtterTune, is we go through the documentation and manually look at the knobs and identify whether there's a potential to tune them. You know, if it's a host name or file path, we obviously don't touch those things. We put those in our denied list of what we don't let OtterTune touch. But then the trickier problem is, for the knobs that we do think we potentially wanna tune, we have to go through and determine whether, if we let the algorithms tune them a certain way, that could cause an external cost. You know, is there an external factor that we'd have to take into consideration? So for example, most databases let you turn off whether you wanna flush the log to disk when a transaction commits.
And so, obviously, if you don't write to disk, your database runs a lot faster. So the machine learning algorithms will find that turning off disk flushes makes things go faster, but there's an external cost to that, because if you now crash and lose the last 10 milliseconds of data, the algorithms don't know how that's gonna affect your company or your organization. You can't measure that. We don't know whether that's the right thing to do for a customer, for a database. So we also prevent OtterTune from tuning those kinds of things. So we have to manually go through the documentation, and since we know database internals, we can decipher and understand what's going on. In some cases, we've looked at the source code for Postgres and MySQL to understand, like, what is this knob actually doing, to determine whether it's safe for us to tune it or not. So that's sort of the first step: go through and figure out what should be allowed to be tuned and what shouldn't be tuned. You know, you go from a couple hundred to a couple dozen, less than a hundred knobs that you could potentially tune for Postgres, MySQL, and Oracle
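A minimal sketch of that manual classification step might look like the following; the knob names are real PostgreSQL parameters, but the lists and the helper are purely illustrative, not OtterTune's actual denied list.

```python
# Hypothetical sketch of the knob-classification step described above.
DENIED = {"data_directory", "listen_addresses", "port"}  # paths, hosts, ports: never touched
EXTERNAL_COST = {"fsync", "synchronous_commit", "full_page_writes"}  # faster, but risks durability on a crash

def tunable(knob: str) -> bool:
    """Only knobs that are neither denied nor durability-sensitive get handed to the optimizer."""
    return knob not in DENIED and knob not in EXTERNAL_COST

print([k for k in ["fsync", "shared_buffers", "port", "work_mem"] if tunable(k)])
```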
[00:20:12] Unknown:
based on this. The other interesting challenge to that, and I'm sure it just ends up turning into code spelunking, is for the cases where a parameter is maybe a string or an enum, where it only accepts a certain limited number of values that you need to make sure you get correct, or else it's just gonna, you know, either fail to start or just not have any effect. And, you know, I'm wondering how you handle the variability in data types, and in particular things like enums where it's, you know, this stringly typed value that has to match exactly what's in the code. Those aren't an issue. The enums aren't an issue because they're just categorical variables in our machine learning models. So you just convert them to integers
[00:20:48] Unknown:
and the algorithm can deal with that. And on the machine learning side, the algorithms don't know what they're actually tuning, right? They don't know that this knob is gonna affect something to do with the disk and that one is gonna affect, you know, the write-ahead log. They just know that if I change them a certain way for a particular database, it makes my objective function better, like improving the throughput or improving the latency. So long as we convert the options that are available for the knobs, either a range of values or a discrete set of values, into numbers that we can then build models on, it's not that difficult to add new things or incorporate, you know, as you said, strings or enums.
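As a rough illustration of that encoding, here is a minimal sketch that turns a mix of numeric and enum-valued knobs into a numeric feature vector; the knob names are real PostgreSQL parameters, but the encoding scheme is an assumption for the example, not OtterTune's actual feature pipeline.

```python
# Enum/string knobs become categorical integers; numeric knobs pass through.
CATEGORICAL = {"wal_sync_method": ["fsync", "fdatasync", "open_sync"]}

def encode(config: dict) -> list:
    features = []
    for knob, value in sorted(config.items()):
        if knob in CATEGORICAL:
            features.append(CATEGORICAL[knob].index(value))  # e.g. "fdatasync" -> 1
        else:
            features.append(float(value))  # unit normalization omitted for brevity
    return features

print(encode({"work_mem": 65536, "wal_sync_method": "fdatasync"}))  # [1, 65536.0]
```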
[00:21:25] Unknown:
Can you describe a bit about how the OtterTune system itself is actually implemented and some of the ways that you think about building the reward functions for optimizing
[00:21:34] Unknown:
for, you know, whether it's cost or throughput or latency and things like that? So the way it works is that you provide OtterTune with some basic login credentials to access the database system. If you're running on Amazon and using something like RDS, you provide us an IAM role to access the parameters, the knobs, from Amazon's platform, as well as severely limited login credentials to access the database directly. So when we access the database, the only things we need to retrieve are the runtime metrics, like the internal performance counters of the database, pages read, pages written, things like that, and then the current configuration. And then we have a driver running in Kubernetes that can connect to the database and pull this information.
We then store the metrics and the configuration for the database in our internal repository with all the other training data we've collected from all the other databases. And then we have a set of Kubernetes jobs that run to train the models based on this training data. We support different machine learning algorithms. We've done Gaussian process models, deep nets, and then reinforcement learning. We haven't really seen a major difference just yet in the efficacy of those algorithms, although they can have different convergence rates. That's not the main issue we're hitting right now. So then you have to specify what you want the reward function to be, or what the objective function should be. Do I wanna, again, maximize throughput, minimize latency? You can also provide an alternative metric that maybe isn't coming directly from the database system. It could be, like, a monitoring service, an APM service, where you're keeping track of something that's outside the database system.
Again, that's just another metric we can record and try to optimize for. So then, based on that, we run our recommendation algorithm to generate new values for the knobs that we wanna tune. We send them back over to the driver, which then applies them to the database. If that new configuration requires a restart and we're allowed to restart at the time, we'll go ahead and restart, and then we observe the effects of this. And then we keep repeating the process, applying new configurations and observing whether it's helping us. And over time the models will converge, and when we think we have the best configuration for your workload right now, we can switch into this monitoring mode where we're not actively tuning, but we're making sure that the configuration you currently have is still the best thing for you.
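A minimal sketch of that observe, model, recommend loop, using a Gaussian process surrogate like the one mentioned above; the candidate sampling and the optimistic acquisition rule are assumptions for illustration, not OtterTune's actual implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def recommend(observed_configs, observed_objective, knob_bounds, n_candidates=1000):
    """Fit a surrogate of (knob settings -> objective) and pick the most promising candidate."""
    gp = GaussianProcessRegressor(normalize_y=True).fit(observed_configs, observed_objective)
    lo, hi = knob_bounds
    candidates = np.random.uniform(lo, hi, size=(n_candidates, len(lo)))
    mean, std = gp.predict(candidates, return_std=True)
    # Optimistic acquisition: favor high predicted objective plus uncertainty (exploration).
    return candidates[np.argmax(mean + std)]

# Toy usage: two numeric knobs, three past observations of throughput (txn/sec).
X = np.array([[128.0, 4.0], [512.0, 16.0], [1024.0, 64.0]])
y = np.array([900.0, 1400.0, 1250.0])
next_config = recommend(X, y, knob_bounds=(np.array([64.0, 1.0]), np.array([4096.0, 128.0])))
print(next_config)  # apply, observe, append to X/y, repeat until the model converges
```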
[00:23:51] Unknown:
As you started building the project, I'm wondering what are some of the ideas or assumptions that you had about either how you would approach it or the sort of capabilities that it would have that have either been challenged or invalidated as you progressed through building the initial prototypes and into real world systems and then started getting it into the hands of end users?
[00:24:14] Unknown:
It's not so much the service itself. It's how we would set up the environment for the database to tune it. So for the deployments we've done, there have been sort of 2 variations. You either make a clone of the database and capture the workload, and then you run it on a backup machine or a spare machine and run all your tuning experiments there. And then once you think you have the best configuration, you apply that change to the production database. We've also done another deployment where we've run directly on the production database, but it's a read replica for back end services. And in that environment, we didn't have to capture the workload. We didn't have to make a snapshot of the database. We just run directly on the production database as it is running queries. But in that environment, since you don't have a baseline to know whether you're making things better, because the workload is changing over time, we had to connect to another database as the control that is running roughly the same workload. And then we observe whether, as we make changes, we are doing better than the control database.
So those are sort of the 2 setups that OtterTune can work in. The challenge has been that people just don't have the ability or don't have the time to do a snapshot and capture the workload trace. Like, getting something set up where you can run on backups and tune on spare machines, that's been harder than I thought it was gonna be. And, again, that has nothing to do with machine learning, nothing to do with the algorithms and the optimization you could provide. It's just the infrastructure that people would need to, you know, tune a database without touching the production database. Like, that part has been harder than I thought it was gonna be. One of the deployments we've done has been with Oracle. Oracle has really great tools to, you know, do workload capture and snapshots. The open source ones for Postgres and MySQL aren't nearly as sophisticated. So that part's been challenging.
[00:26:02] Unknown:
In terms of the actual design and goals of the system, I'm wondering if there have been any changes to the kind of priorities that you've had in building OtterTune
[00:26:12] Unknown:
or the architectural aspects of actually building and managing it. So the one thing that surprised me the most is that we started off with supporting Postgres and MySQL on RDS. But for our inbound requests, Amazon Aurora has been the thing that everyone's asking for. In some ways it doesn't surprise me, because, you know, Amazon's crushing it with Aurora. But that's the one thing that I was not expecting. Like, everyone has been asking us, can you optimize Aurora for both Postgres and MySQL, but more Postgres than MySQL actually. That's the one thing that surprised me the most. The other thing that surprised me too is that a large number of people are running with the default configuration from Amazon. And Amazon has already tuned this a little bit, but it's still not optimal.
[00:26:56] Unknown:
People either don't know they should be tuning it yet or they don't know how to tune it. And in terms of actually adding support for a new database engine or a new deployment target, I'm curious what's actually involved in being able to identify all of the configuration parameters and add them into your model and be able to build the information into the model itself to be able to make those intelligent decisions. And sort of tangential to that, I'm also curious if you actually have to have separate models for each of the different database engines, or if there is enough overlap between them that you're able to take some sort of core element of it and just have additional parameters that go into it based on which engine you're working with? To the last question first: you're not able to reuse the training data from different database systems. So I can't take any training data I collect from MySQL and reuse it on Postgres.
[00:27:46] Unknown:
These systems are just so different. The knobs are different. The behaviors that get changed by the knobs are different. There's no easy way to map any of the data from one database and apply it to another. It's even challenging to reuse the training data across different versions of the same database system. So, you know, taking Postgres 9.6 data and using it on Postgres 10, 11, 12, 13. We have done some experiments on this. The challenge is that some knobs get dropped, some knobs get added, some metrics get added. The effect of knobs can also change.
And so we've done some experiments on this, but we found that it was just better to train the model from scratch using the data from the same major version of the database system. You can take, like, you know, Postgres 10.1 data and use it on Postgres 10.2. That seems to work, but going from 10 to 11, or MySQL 5.6 to 8, I think that's just too difficult. And then the other question is, like, what does it take to bring on a new database system? As I said, the first step is just looking in the documentation, figuring out what are all the knobs that are available, understanding what they actually do at a high level, and determining whether this is something we actually want an automated system to tune. Because, again, we don't wanna cause issues for customers or users of OtterTune where we're making the database go super fast because we're turning off disk writes or we're turning off the write-ahead log. You know, all those things are bad, because when there's a crash, when bad things happen, which will happen in databases, we don't want the safeguards that the database system provides to be turned off.
So once we figure out what knobs we actually wanna tune and understand the possible range of values for those, the next step is to figure out what interface the database system exposes to retrieve the metrics and the current configuration and how you make changes to the knob values. Then another big part of this is how to make the change to the database, like, apply the new values, but then observe or check whether there's a problem with those configurations such that the database system will refuse to start or refuse to apply them. Obviously, if I set, you know, the size of a memory pool to negative 1 and I restart the system, most databases will restart, throw an error, and say that I can't have a negative 1 value for, you know, the amount of memory I wanna use. And so we end up having to scrape logs and then determine, you know, whether things are not restarting correctly because of something we're doing or some other issue. So there's a bunch of those kinds of bespoke safety checks we have to apply for each database system.
So it's the combination of those 3 things that takes up most of the time. Again, notice I didn't say anything about the machine learning stuff, because the algorithms don't know, don't care what database system they're tuning. The same algorithms can be reused, you know, across the different system types. It's just the data you're feeding into them that can't be shared. So from a machine learning perspective, it's not a lot of work for us to onboard a new system. It's just the infrastructure to accommodate it and make sure that we're doing things correctly and safely. That's what takes most of the time.
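A rough sketch of the apply, restart, verify, roll back safety check described here; the helper functions are stand-ins for the per-engine plumbing (service manager calls, log scraping), not a real OtterTune or database API.

```python
# Hypothetical per-engine plumbing; a real implementation would shell out to the
# engine's service manager and read its actual error log.
def apply_config(db, config): db["config"] = dict(config)
def restart_database(db): return all(v != -1 for v in db["config"].values())
def tail_error_log(db): return "invalid value for memory parameter"

def apply_safely(db, new_config, previous_config):
    """Apply a recommended configuration, and roll back if the engine rejects it."""
    apply_config(db, new_config)
    if not restart_database(db):
        errors = tail_error_log(db)        # scrape the log to see why startup failed
        apply_config(db, previous_config)  # roll back to the last known-good config
        restart_database(db)
        raise RuntimeError(f"Configuration rejected, rolled back: {errors}")

db = {"config": {}}
apply_safely(db, {"buffer_pool_mb": 4096}, {"buffer_pool_mb": 1024})
```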
[00:30:46] Unknown:
So it's a combination of knowing something about databases and machine learning that makes this all work, which is hard to find these days. And it sounds like, outside of the machine learning aspect, where you have the algorithms, they do what they're supposed to do, you've figured out what the appropriate reward functions are. But in terms of actually onboarding a new database each time, it's as if it's the first time that you're doing it, because every database is just so different in terms of the parameters that they're providing.
[00:31:10] Unknown:
Absolutely. Yes. The knobs are different. The metrics are different. There are other aspects of them that are unexpected too. Like, sometimes there are global metrics, like the amount of data read across all my tables in my database, across all my indexes. Sometimes there are sort of database object level metrics, like here's all the data that's been read and written for this single table. Every database is different, and it just takes time to sort of unify what they're generating to feed into our algorithms.
[00:31:37] Unknown:
I'm wondering, as you've gone through these exercises with these different databases, if that's given you any thoughts on what the sort of core logical aspects of database engines coalesce to and if there are any learnings from that that could be used to improve the overall sort of structure or approach of database engine design. I mean, obviously, you're not going to retrofit some of these existing systems. But as you work on new research projects or, you know, work with new database engines, if that has given you any thoughts on how to make these more sort of uniform in the core elements while still being customizable enough and special cased enough to be useful for the domain that they're targeting.
[00:32:20] Unknown:
It's what I was saying before: what are some of the things we learned from OtterTune that we've then used to guide some of our research directions in the NoisePage project? I mentioned the no-restart one. That's a big issue. It's not so much the core engine, like, is it a column store? Is it a row store? Is it vectorized or is it doing query compilation? There are some nuances to that. At a high level, observability and control are the main things if you want the database system to be controlled by an autonomous agent. Right? So, for example, in MySQL, the knobs that you can set that don't require a restart, not all of them get changed instantaneously.
MySQL will apply the change incrementally, but it doesn't alert you. There's no programmatic way to know from the outside, other than maybe scraping the debug log, when that change has been fully applied. And the reason why that matters, again thinking big picture, is, like, can I have a system that's completely autonomous? I need to know whether I'm seeing a degradation in performance or a change in my objective function while the change is being applied because the workload has changed, the hardware has changed, or some other aspect of the environment has changed, or because I'm making that change. Think of, like, if I'm gonna build an index, I have to take away some CPU cores and use some disk while I'm building that index, and therefore I'm gonna see a reduction in performance. So I need to know, for my autonomous planning agent, whether that reduction in performance is because of some action I applied or because the workload has changed or the hardware has changed.
So I think that's the thing we found is actually gonna matter the most. It doesn't matter whether, you know, it's Postgres-style MVCC or Oracle's or MySQL's MVCC; from the perspective of, like, you know, having your machine learning algorithms control these things, that stuff doesn't matter.
[00:34:09] Unknown:
RudderStack's smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
In terms of actually setting the parameters, you mentioned that for RDS environments, at least, you get IAM access to the parameters for the specific engine, so you can obviously set those programmatically. But I'm curious what your approach has been for being able to handle things like the database engine running on, you know, a physical box or an EC2 instance, where you actually have to make a change to the file on disk to set this parameter, or, you know, what the breakdown is of values that can be tuned from within the database engine by, you know, issuing a SQL statement to modify the runtime configuration versus having to make these changes on disk, and just how you approach the continuum of how to actually make the change.
[00:35:30] Unknown:
Every environment is different. Every database system is different. So you're absolutely right. Sometimes you need SSH access to log into the machine where the database is running and make changes to the configuration file on disk. You certainly have to do that in Postgres or MySQL 5.7. But then in MySQL version 8, you now have the ability to make changes and persist them entirely through SQL commands. So you don't need access to the file system. In other environments, sometimes they're using, like, Puppet scripts to install configurations, so we have to go somehow modify that file so Puppet applies the change.
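The two mechanisms just described, sketched with an assumed DB-API connection object (for example from a MySQL client library); SET PERSIST is real MySQL 8 syntax, while the config-file path and the rewrite logic here are simplified illustrations.

```python
def set_knob(conn, server_major_version: int, name: str, value: str):
    """Persist a knob change either through SQL (MySQL 8+) or by editing the config file."""
    if server_major_version >= 8:
        cur = conn.cursor()
        cur.execute(f"SET PERSIST {name} = {value}")  # change survives restarts, no file access needed
        cur.close()
    else:
        # Older versions: append to my.cnf on disk; the change only takes effect after a
        # restart, which is exactly the external cost discussed earlier.
        with open("/etc/mysql/my.cnf", "a") as f:
            f.write(f"\n{name} = {value}\n")
```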
Every database system, every environment is different. The nice thing about RDS, and I suspect Azure and GCP are the same way, is that, you know, they provide a standard interface that we can use to make these changes. If you're running on prem or you're running self-managed inside of EC2, everybody's doing something different. There hasn't been an easy way to just unify all these things. The database engines that we've been talking about so far, and that you've targeted for the current system that you're offering with OtterTune, are all transactional
[00:36:30] Unknown:
systems primarily. And I'm wondering what your thoughts are on the utility of something like OtterTune for analytical databases or data warehouse engines.
[00:36:39] Unknown:
On the research side, we have done experiments on tuning analytical databases. I think we tuned Vectorwise, or Actian Vector as it's now called. We looked at tuning Greenplum, but there was a bug in Greenplum that was not something OtterTune could control. So, again, the algorithms don't know, don't care whether it is a transactional workload, an operational workload, versus analytical. If there are knobs that can be changed that will affect performance, OtterTune can learn that. We can apply these changes. So we have done tuning for analytics in the past. For MySQL and Postgres, most people are running these on operational workloads. They do have some analytical stuff that they wanna run on MySQL and Postgres.
You do have to be careful about, like, what the objective function is if you have a mixed workload: like, if the p99 latency of the analytical queries is, like, through the roof, how do you know you're actually making the shorter queries run faster? There's some actual work we have to do on our side to handle those sort of 2 objective functions at the same time. But, again, the algorithms at a high level are still the same. There's nothing preventing OtterTune from tuning analytical workloads. It's just that nobody's really asked us to do this too much yet. Now some systems, or some databases that are like a database as a service, like Snowflake,
[00:37:53] Unknown:
they don't expose any knobs to you as the end user. There isn't actually anything that OtterTune can tune through, like, a single interface. We'd have to be on the inside and be able to tune things for them, which we think we could do. We haven't done that yet. As you have been working with these database engines and figuring out how to optimize their performance and latency and twiddle all the different knobs, I'm wondering what are some of the most interesting or esoteric configuration options that you've come across, or, you know, something that just made you really either scratch your head or facepalm, like, why is this even available as a configuration parameter?
[00:38:26] Unknown:
The really bizarre, esoteric things like the hidden knobs in Oracle, we don't wanna touch. OtterTune currently doesn't tune those things, just because we're trying to be very conservative about what we're executing, because, again, it's the database. You know, people are very paranoid about losing data, rightfully so. So we don't want OtterTune to cause problems. We haven't come across any knobs that I would say are too esoteric or bizarre, like, you know, why would you ever wanna tune this? I can't think of any offhand. I mean, one of the surprising things has been, as I mentioned, that a bunch of the blog articles, a bunch of the recommendations that are out there that people have written about how to tune your database, things that they recommend actually turn out to be incorrect.
One example would be in MySQL. There's a feature called the adaptive hash index. It basically recognizes that if you have a B+ tree, you could build an in-memory hash table to make key lookups go faster. By default, it's on in MySQL, and most of the blogs out there say to leave it on. We found workloads where, you know, it's clearly a 30% reduction in performance if you use that feature. That's a good example of how the default, and what people recommend you do, is actually not always what you wanna do. And it varies per workload. And how do you know whether your workload is one of the ones that shouldn't be using it? You'd have to find out manually or use something like OtterTune to figure it out for you. That's probably been the most surprising thing; I can't think of anything else. I mean, the database developers at the major systems, they're also in the database business too. So, like, they wanna be very conservative as well. They're not gonna do something that would be super bizarre. Like, turning off disk writes is probably the most aggressive optimization they can expose to you. And in some cases, that's actually the right thing to do. It'll make things go faster, but it's probably what most people don't want. And in terms of actually being able to
[00:40:17] Unknown:
work with OtterTune and determine what are the priorities for optimization that I'm working on, I'm wondering what the actual user experience is of saying either I want to optimize for latency or I wanna optimize for speed. Or, to your point of increasing the number of disk writes, I wanna optimize for the durability and safety of my data, you know, at the expense of things like latency or speed. So, again, for durability and safety of data, like, that's the default. OtterTune
[00:40:44] Unknown:
can override it and say, you know, turn off fsync for log writes, but we don't expose that to the user in the interface now. Because, again, we don't want people shooting themselves in the foot and causing problems. So the way it basically works is you set up the credentials, allow OtterTune to connect to the database, and you specify whether you want OtterTune to monitor the database or actually tune the database. If it's just monitoring, then we're just collecting the metrics and using that to build our models and understand your workload a bit more. If it's tuning, then you specify whether you wanna allow restarts or not. And again, we don't wanna restart when you're running in production and cause problems, so we ask you whether you wanna allow restarts. And then, based on that selection, we expose a list of recommended knobs that you could potentially tune, along with some default ranges of values that you could potentially set them to. And then that's it. That's all you really need to specify, other than, again, whether to optimize for throughput or latency.
Cost is a weird one. It requires you to move to a different instance type, at least on AWS. Since that's downtime, that's something we don't wanna do. So typically you wanna maximize throughput, maximize cost, I'm sorry, improve throughput, improve cost. And then that's enough to figure out, like, you know, am I getting better performance. We're not there yet, but we think our models can then be used to suggest to the user that this is what you're getting now on this instance size and this provisioned IOPS on Amazon, but if you downgrade or upgrade and let OtterTune manage things, then this is the performance you'll get, and you'll pay this much more or this much less. So we've done our own experiments with RDS where we can get the same performance on an instance type on a machine that costs half as much, right, you know, compared to using the Amazon default config. So we can show that we can get you the same performance for half the cost. But that's not something we wanna do automatically, because, again, that transition, that movement is downtime, and we don't let OtterTune do that. For the case where you say, I don't want you to automatically apply the changes, I just want you to show me these are the parameters to set, I'm curious if you have the ability to kind of priority rank them, of, you know, these are the ones that don't require you to restart, this is the expected sort of range of performance improvement that we think you're going to get, or if you have any capabilities or plans to offer anything like that in the user experience? We don't have that yet.
That's something that we could add. There is an internal algorithm that ranks the impact of knobs. So one of the things we do is we run a Lasso regression to estimate what knobs will have the most impact on performance. And that's how we determine, like, here are the, you know, top 10 recommended knobs we should start tuning with. You can expand that over time. So we could do something like what you're suggesting, where we take the expected gain from the machine learning algorithms and expose that to the human. Like, hey, if you let us tune these knobs, you'll get this percentage improvement; if you tune these other knobs, you'll get this other percentage improvement. Of course, there has to be a confidence interval on this, because if the model hasn't converged yet, the accuracy of the recommendation you're making can vary a lot. And humans also have a hard time reasoning about these things. That's maybe why we haven't done that just yet, but that's something we could consider.
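As a rough illustration of the Lasso-based ranking described here, the sketch below fits a sparse linear model of the objective as a function of knob settings and ranks knobs by coefficient magnitude; it shows the idea, not OtterTune's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def rank_knobs(configs: np.ndarray, objective: np.ndarray, knob_names: list, top_k: int = 10):
    """Return the knobs with the largest estimated impact on the objective."""
    X = StandardScaler().fit_transform(configs)   # put knobs on a comparable scale
    model = Lasso(alpha=0.1).fit(X, objective)
    order = np.argsort(-np.abs(model.coef_))      # larger |coefficient| = higher estimated impact
    return [knob_names[i] for i in order[:top_k]]

# Toy usage with synthetic data for three hypothetical knobs; the first one dominates.
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 3))
y = 5 * X[:, 0] + 0.1 * X[:, 2] + rng.normal(scale=0.01, size=50)
print(rank_knobs(X, y, ["buffer_pool_mb", "log_file_size_mb", "max_connections"], top_k=2))
```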
[00:43:58] Unknown:
Another thing that you touched on earlier is the fact that there are some systems that offer some automated tuning capabilities that might be built into the database or that run inside the engine, but that they're inherently limited by the fact that they have a very sort of small view of the world where they only have experience of that 1 database, and they're not able to learn and train across a variety of different workloads. And I'm wondering what you have seen as some of the compound benefits of being able to run OtterTune as a centralized service and be able to work with and learn from a large variety of database instances with variability across their workloads.
[00:44:37] Unknown:
So being able to use training data across different databases allows you to generate a better configuration, a more optimal configuration, more quickly. It basically means there's less learning time. Again, this assumes that the application and workload you're trying to tune on a particular database is something that we've seen in the past. If you come out of left field with something we've never seen before, where we have no prior knowledge of anything that looks like this, it's just gonna take the same amount of time. You know, you have to learn from scratch in that regard. But most workloads are, you know, not something dramatically different, so it reduces the convergence time of the model, how long it takes before it produces a configuration recommendation that it's confident is gonna do well. So that's sort of the main benefit.
And where this matters is in really large fleets. Like, if you have hundreds or thousands of databases, again, no human could go and tune each one individually, but maybe you also don't want to go through the full process of setting up a clone, capturing the workload trace, getting spare hardware, and doing all that sidecar training that we talked about before. You could potentially run directly on the production database with minimal downtime, and that allows you to tune it more quickly. That's where we think one of the big benefits we're going to see with OtterTune is: in that environment where you have this large fleet.
Most of the workloads are mostly the same, maybe just a little bit different, but you can still tune them individually in a way that you couldn't if a human was doing it or if you were just using a rule-based tool. So I think reducing the amount of time it takes, and potentially the resources you need, to tune a database
[00:46:11] Unknown:
is the big benefit you have from reusing training data. As you've been working on doing the research for OtterTune and turning it into a product and working with customers and adding compatibility with different database engines, I'm wondering what are some of the most interesting or unexpected or challenging lessons that you've learned in that whole process?
[00:46:29] Unknown:
This is not exciting, but lawyers take a long time. Because, you know, when I was at the university, any kind of legal arrangement we had was handled by the university lawyers, and everybody wants to work with people from Carnegie Mellon, so the terms were very favorable for us to do the research. Now you're a startup, and that just makes everything more challenging. Things that I didn't have to worry about as a professor, as a researcher at the university, I now had to worry about, like liability insurance, all those things. That's been the thing that's been most surprising: how much longer things take when you try to do stuff on your own, again, from a legal perspective. But the problems are the same. Right? The problems don't go away.
Everybody is struggling to manage these databases. And so we think we're solving an important problem. OtterTune doesn't solve all the problems, but we think this is an important problem that has not really been addressed in the meaningful way that we're trying to achieve.
[00:47:21] Unknown:
OtterTune is definitely a very impressive system, and it offers a lot of potential upside for people who are using it. But what are the cases where it's the wrong choice and you would be better served either by tuning things manually or by, you know, running with the defaults or, you know, maybe using a different database engine entirely?
[00:47:39] Unknown:
You never want to use the defaults. I'll just throw that out right now. I mean, certainly don't use the default configuration you get from the OS package maintainers. Like, you know, if you run apt-get install MySQL, they assume you're running on a machine that has 160 megabytes of RAM. That's not realistic. So the default is always going to be bad. The Amazon default is better, but it's never going to be optimal. Tuning tools like PGTune or MySQLTuner, or the various ones from the commercial vendors, are better, but still not as optimal as what you can achieve with either, you know, a very expensive human expert coming in and doing this, or something like OtterTune that can automate it. So the default is always bad. You know, there are other issues you have to deal with in managing a database system. The physical design is another big issue, like picking indexes, picking sharding keys. That problem doesn't go away with OtterTune. You know, OtterTune can come up with the best configuration you could possibly have for your database, but if you have no indexes and all your queries are doing complete sequential scans on a billion rows, there's nothing we can do to make that better. So physical design still matters. There are some open source tools and some commercial tools that can help you pick indexes.
That's something that OtterTune doesn't address right now. Whether or not you should switch to a different database system or to a different storage engine, that's also outside of the purview of what OtterTune is trying to solve, at least for now. Because, again, that actually requires access to the database, it requires access to the queries to understand how you're using the database, and that's not something that we want to touch right now because, again, of all the privacy concerns. Like, you know, people are paranoid about data leaks. And so right now, OtterTune doesn't want access to those things. We only need the metrics, we only need the configuration, so we avoid those problems. But that means we can't solve problems like picking indexes or telling you what database system to switch to.
Switching database systems is also not something that you can easily automate, because the SQL dialects are always going to be different. There are other aspects, like the ecosystem, how it's hooked into whatever build environment or CI environment you have. So that's not something you'd want to automate at all right now either; there are a bunch of humans that need to be involved in that decision. So I think knob tuning can still get you a pretty significant improvement, you know, without having to make drastic changes like changing to a different database vendor.
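As a side note on the point above about stock package defaults: here is a hypothetical way (not an OtterTune feature) to sanity-check whether a Postgres instance is still running with package-default memory settings. The connection string is a placeholder and the quoted "stock" values are assumptions based on recent Postgres releases.

```python
import psycopg2

# Settings whose stock defaults are sized for tiny machines, rarely right in production.
STOCK_DEFAULTS = {
    "shared_buffers": "128MB",
    "work_mem": "4MB",
    "effective_cache_size": "4GB",
}

conn = psycopg2.connect("dbname=mydb user=me host=localhost")  # placeholder connection string
with conn.cursor() as cur:
    for name, stock in STOCK_DEFAULTS.items():
        cur.execute(
            "SELECT setting, COALESCE(unit, '') FROM pg_settings WHERE name = %s",
            (name,),
        )
        setting, unit = cur.fetchone()
        print(f"{name}: current={setting} {unit} (stock default is roughly {stock})")
conn.close()
```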
[00:49:51] Unknown:
And as you continue to work on OtterTune and work with customers and add new engines to it, I'm curious what are some of the things that you have planned for the near to medium term, or interesting areas of research that you're excited to dig into? Well, so the OtterTune research project is dead. My PhD student, Dana Van Aken, she defended
[00:50:10] Unknown:
this year, and she's now the CTO and cofounder with me at OtterTune. From a research perspective, the black box optimization work at the Carnegie Mellon database group, we're not pursuing that any further. So now on the startup side, I think I mentioned that we're adding support for Aurora. It's basically going to be very customer driven, so whatever people are asking for, we'll start adding. There are some other things we have in the works that we're not ready to talk about yet, because as a scientist, I want to make sure that we have results, that we can prove things actually work. So we have some additional things in the works that are based on the same models and the same kind of high level ideas of what OtterTune could achieve, but it's sort of packaging them up or presenting them in a different way and doing different things. So we have a bunch of things in the works in that regard that we hope to announce later this year. Now at the university, on the research side, we have a paper in SIGMOD this year that solves the second problem of a self-driving database system, which is behavior modeling. So the term self-driving for a database is not just a marketing term. We take heavy inspiration from self-driving cars, which partly were invented here at Carnegie Mellon.
And we're using their basic high level design for how we're going to build our self-driving database system. So the three problems you need to tackle are the forecasting, predicting what the queries are going to look like in the future; the behavior modeling, to predict how the system or its performance will change as you tweak different aspects of it; and then the third piece is the planning phase, deciding what actions to apply, what changes to make going forward based on the forecast. So self-driving cars at a high level work the same way. You do forecasting down the road to predict where cars or objects will be around you in the future. Then you have models that can predict how the car will behave as you turn the steering wheel left and right or, you know, accelerate or decelerate.
And then there's the planning portion where you say, what actions do I need to apply to get me to my destination? What we're trying to build works the same way. So we have the forecasting piece, we now have the behavior modeling piece in SIGMOD this year, and then we're working on the last one, the planning component. So there's a bunch of work in this space where we're trying to get better observation of the runtime behavior of the database system to feed into our models. We're looking at how to reuse replicas to sort of solve that problem of cloning the database and capturing the workload trace: can we do that entirely online using high availability replicas without having to provision additional machines?
So there's a bunch of these kinds of problems we're looking at. We think we almost have the circle complete where we can do planning, apply actions, and observe whether they help us, but there's still a lot of things to do on that problem.
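For readers who want a mental model of the forecast, behavior-model, and planning loop described above, here is a purely conceptual sketch; every name and function in it is a placeholder, not NoisePage or OtterTune code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Action:
    name: str          # e.g. "build index on orders(customer_id)" (hypothetical)
    apply_cost: float  # estimated cost of applying the action itself


def plan_next_action(
    forecast_workload: Callable[[], Dict[str, float]],             # 1) predict the future query mix
    predict_benefit: Callable[[Dict[str, float], Action], float],  # 2) behavior model: benefit of an action
    candidates: List[Action],
) -> Action:
    """3) Planning: pick the candidate with the best predicted net benefit."""
    future = forecast_workload()
    return max(candidates, key=lambda a: predict_benefit(future, a) - a.apply_cost)
```

In a real system each of those callables would be a learned model rather than a stub, and the loop would repeat: apply the chosen action, observe the result, and feed it back into the models.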
[00:52:40] Unknown:
Are there any other aspects of the work that you're doing on OtterTune or the overall space of database performance automation that we didn't discuss yet that you'd like to cover before we close out the show?
[00:52:50] Unknown:
I think that observability is going to be a big issue, and I think, as I mentioned, it's going to have two aspects. One is observing metrics and data that you can then feed into machine learning models to control things. But then at the end of the day, you're also going to need human decipherable, human understandable explanations of why the algorithms are making certain changes. People that are really interested in machine learning, people working in the space, they're like, oh, yeah, sure, we should totally automate everything, we can do this. But the people that actually deploy the databases, that actually make sure things are running smoothly, they are going to be rightfully hesitant to turn controls over entirely to machine learning algorithms. So I think we're going to need capabilities to assuage their fears and tell them why the machine learning algorithms are making changes to the database a certain way. That's something, you know, we potentially want to add to OtterTune: to say why we think these are the right knob configurations, because your workload looks a certain way. And then certainly on the NoisePage side, it's sort of the same problem, but now it's more complicated because it includes forecasting and behavior models and a whole bunch of other components in the decision making process, where OtterTune sort of has a more straightforward,
[00:54:01] Unknown:
single model approach. So I think that observability and, you know, explanations for humans of why things are occurring is going to be a tough problem. And that's a whole other research area in machine learning and AI right now as well. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing or try out OtterTune, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:54:28] Unknown:
So that's a pretty big question to end with. My current thesis, the one I'm sort of playing around with in my head, is that I think we're in, like, the golden age of databases, meaning that there are so many choices, so many open source projects, so many commercial products, so many things out there, and it's really a fascinating time. Unlike in the 1990s, where there was only a small number of databases, and maybe some things being built in academia, but everything was trying to be very general purpose and solve everyone's problems, we're in this really nice era where there are these niche databases that are solving specific problems or specific workload challenges, designed based on just that environment, just that workload. I think that's really fascinating.
And of course, now the problem is there are so many choices. How does somebody actually navigate this environment, or navigate, you know, what the possibilities are out there? That's a hard one to solve as well. But I think something that would sort of guide people, to say, okay, your workload looks like this, your environment looks like this, your queries look like this, your data looks like this, here's what you should be doing, I think that would make a lot of people's lives easier. I don't know what that is, and I don't know how, like, you know, is it just a decision tree? Is it a flowchart? Is it a website? Or is it something automated that can decipher it for you?
[00:55:39] Unknown:
It's tough to say. Yeah. It's definitely an interesting problem statement. As you said, it's not easy to envision how that would manifest, because it could be a decision tree, but then what if your assumptions about your workload are wrong? Or maybe it's something that lives in your database that says, you know, you're using a relational engine, you should be using a key-value store or a document store for these access patterns, and, you know, now you have to go and re-architect your entire data model. I mean, database systems don't die. Right? I mean, companies go under, and sometimes the products will fold, but
[00:56:13] Unknown:
there's a surprising number of people that are running on, you know, legacy systems long after the companies have been shuttered, or the company has been sold off to a holding company that's milking the maintenance fees. Like, once you make that choice of, I'm going to use database system X, it's very, very hard to get off of it. And so that's a really hard decision for a lot of people, and helping them say, hey, this is what you should be doing before they make that big plunge, I think, would be well received.
[00:56:38] Unknown:
Absolutely. Yeah. And definitely a lot of avenues where that would be useful, you know, including for people who are deciding which data warehouse they're going to use. Yeah.
[00:56:47] Unknown:
I was like, Postgres is so good. Right? In the last 10 years, the development has accelerated. It's a phenomenal database system. And to be honest, for 90% of the people out there, Postgres probably is the right choice.
[00:57:03] Unknown:
We'll just make a website, you know, "Which database should I use?", that just says Postgres.
[00:57:07] Unknown:
Yes. Yes. I think, actually, that might exist. Probably. Yeah.
[00:57:13] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on OtterTune. It's a fascinating system and one that I definitely plan to at least test out for my own use cases. So I appreciate all of the time and energy you've put into the research and commercialization of that, and I definitely look forward to seeing what else you put out. So thank you for all of your efforts in this space, and I hope you enjoy the rest of your day. Alright. Thank you so much. Take care. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Message
Interview with Andy Pavlo
Overview of OtterTune
Challenges in Database Tuning
Complexity of Database Configuration
Real-World Deployments and Lessons Learned
Future of Database Engine Design
Analytical Databases and OtterTune
Customer-Driven Development and Future Plans
Observability and Human-Readable Explanations
Conclusion and Contact Information