Summary
Databases are the core of most applications, but they are often treated as inscrutable black boxes. When an application is slow, there is a good probability that the database needs some attention. In this episode Lukas Fittl shares some hard-won wisdom about the causes and solutions of many performance bottlenecks and the work that he is doing to shine some light on PostgreSQL to make it easier to understand how to keep it running smoothly.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
- Your host is Tobias Macey and today I'm interviewing Lukas Fittl about optimizing your database performance and tips for tuning Postgres
Interview
- Introduction
- How did you get involved in the area of data management?
- What are the different ways that database performance problems impact the business?
- What are the most common contributors to performance issues?
- What are the useful signals that indicate performance challenges in the database?
- For a given symptom, what are the steps that you recommend for determining the proximate cause?
- What are the potential negative impacts to be aware of when tuning the configuration of your database?
- How does the database engine influence the methods used to identify and resolve performance challenges?
- Most of the database engines that are in common use today have been around for decades. How have the lessons learned from running these systems over the years influenced the ways to think about designing new engines or evolving the ones we have today?
- What are the most interesting, innovative, or unexpected ways that you have seen to address database performance?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on databases?
- What are your goals for the future of database engines?
Contact Info
- @LukasFittl on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- pganalyze
- Citus Data
- ORM == Object Relational Mapper
- N+1 Query
- Autovacuum
- Write-ahead Log
- pg_stat_io
- random_page_cost
- pgvector
- Vector Database
- OtterTune
- Citus Extension
- Hydra
- ClickHouse
- MyISAM
- MyRocks
- InnoDB
- Great Expectations
- OpenTelemetry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack.
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up to date. With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free.
Your host is Tobias Macey, and today I'm interviewing Lukas Fittl about optimizing your database performance and tips for tuning Postgres. So, Lukas, can you start by introducing yourself?
[00:01:32] Unknown:
Of course. Yeah. And thank you for having me. So I'm Lukas Fittl. I am the founder and CEO of pganalyze. We help companies optimize their Postgres performance. Over the years, you know, I've spent a lot of time working with Postgres. I started out probably 15 years ago now, probably more, I kind of lose track of time, 17 years I think, using Postgres just as an application engineer myself. So my background is writing code, and I still like writing code these days. And back then, you know, I really struggled with how do I just make Postgres work better, essentially. How do I scale Postgres as an application engineer? And so over the years I've had different takes on how to approach that problem. For example, I was part of a company called Citus Data that essentially did distributed Postgres, now part of Microsoft. And I was also using Postgres as an end user in many cases, in companies like Product Hunt, for example.
Now over the last couple of years, I've been working on pganalyze, which essentially helps people optimize their query workload better. I'd be happy to jump into some of these things and talk more about Postgres in this episode today.
[00:02:30] Unknown:
And you mentioned a little bit how you first started fighting with Postgres. I'm wondering if you can dig a bit more into how and why you decided to focus more into this area of data management and the database as a core concern of your career.
[00:02:46] Unknown:
Yeah. It's a good question. I think initially I definitely wasn't an expert. I'm still not an expert these days in many ways, you know, you always learn. But I was struggling with even just getting a good mental model of how the database works. Right? Relational databases are often quite similar, like Postgres has a lot of similarities with MySQL, but then of course also divergences in specific implementation details. But even just understanding basics like how the database takes your query, plans it, actually finds a good query plan and executes it, and how the data kind of flows from indexes to a result and gets joined together, I think initially I just didn't have a good mental model for that. And so what fascinated me back then, when we started the first version of pganalyze, and what still fascinates me today, is essentially providing better visualizations of performance information. For me it was always performance, because that's what makes it actionable. Right? It's interesting, of course, to visualize internal data structures and things like that, but that doesn't really have that urgency to it. Performance work is interesting because you have an actual issue you're trying to solve. And so what excites me really is the combination of something complex like performance optimization for a database combined with thinking about design and user experience and how you can visualize it. As a side note, one of the things that we did early in pganalyze is we always wrote our own visualizations. So we don't use a third-party charting library, I mean, we use D3, but we essentially have very purpose-built visualization and graphing logic, because we felt that was important.
[00:04:20] Unknown:
Digging now into this aspect of database performance and performance tuning, I'm wondering what are some of the primary ways that performance problems can impact the business and ways that they might manifest in the runtime of an application that's using the database.
[00:04:38] Unknown:
I think ultimately it comes down to this: if you think of it from the application perspective, it's a really slow function call most of the time. Right? That function, you know, might be an ORM that you're calling. It might be a query that you wrote yourself and you're just executing directly. But it's kind of like you're doing connection dot execute or dot select or something. Right? And then it's just really slow. And so the way it manifests for the business is that that ultimately is part of a bigger customer experience. Right? Somebody accesses a website and the bottleneck is a SQL query that runs slow. And the reason that everybody should care about it, essentially, is because databases sit so central in that whole overall experience. Basically anything that works with data has a database somewhere. Right? And Postgres oftentimes is a system of record, so even if you have a cache in front of it, there will be cases where the cache is missed. And then you do have to think about those tail latencies where things are either not in your own cache, like your Redis, for example, or maybe they're not even in the Postgres buffer cache, the in-memory cache that Postgres has. Instead, Postgres has to go to disk. And so it's really about these tail latency type situations where I think it becomes so important to think about
[00:05:48] Unknown:
how do I go from a slow query to understanding why it's slow. For people who are relying on the database and starting to see these challenges of slowness and performance, what are some of the issues that they might have, or what are some of the challenges that they might experience trying to address performance problems, where maybe they just try to say, oh, well, how do I make my application work better with the database? What are some of the core problems of performance engineering in a database context that application teams or even data teams are typically experiencing?
[00:06:23] Unknown:
I think to start with, people have, you know, this respect of the database. Right? They don't even wanna go into the database because it's this thing that they don't understand, it's just doing things, it's not familiar, essentially. So a lot of times, when we do performance work for a database, we need to start by actually taking a leap of faith that we can do it, that we can actually go into it. And I like to think of the database as software as well. Right? It's software that, in the case of Postgres, is fully open source. If you really want to, you can figure out: this query gets turned into this plan, gets turned into this executor function, so you can follow along, essentially, to understand what exactly is happening. So I think it's really about getting across that hurdle to start with. But then, obviously, most of the time performance work will involve query plan optimization. Right? And this is a skill you can learn. The most core command you use in Postgres is EXPLAIN, and EXPLAIN in Postgres is actually, I think, pretty good compared to other databases. Every now and then, like, I saw a post yesterday where somebody was showing execution plans for SQLite and, I think, DuckDB and things like that, and they actually don't have as good and as detailed a format as the Postgres EXPLAIN output, because Postgres is pretty good at telling you: here is this plan node. And if you do EXPLAIN ANALYZE, ideally with the BUFFERS option, it tells you how long each part of the execution took. And so that really helps you zoom in on your problem. Now, what I would do then is try to understand how the data flows. Right? So imagine you're joining two tables, and from one table you have a thousand records, and from the other one you have a million records. And actually, let me add something to that: you have a million records, but you're only using one of them, for example. Right? The most opportune, low-hanging fruits are essentially the cases where you're fetching a lot of data from, ultimately, the file on disk, but you're not using it, you're discarding it. In the worst case the database does a sequential scan, but oftentimes it would do an index scan. The index scan maybe covers some of the conditions that you have in your WHERE clause, but it doesn't cover all of them, which maybe it doesn't always need to. Right? But essentially the problem is that the database then does a filter, filters through the data, and does a lot of work to ultimately just give you one row. And so I think that's where the execution plan is so important, because it tells you that story.
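As a minimal sketch of the workflow described above (the tables, columns, and the sample plan output are hypothetical, shown only to illustrate what to look for):

```sql
-- Hypothetical two-table join: many rows are read from "orders",
-- but almost all of them are discarded by the filter.
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.name, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.status = 'refunded';

-- In the output, look for plan nodes along these lines:
--   Index Scan using orders_customer_id_idx on orders o
--     Filter: (status = 'refunded'::text)
--     Rows Removed by Filter: 999871
--     Buffers: shared hit=1203 read=8410
-- A large "Rows Removed by Filter" together with many buffer reads suggests
-- the index does not cover all of the conditions; a composite index such as
-- (customer_id, status) may let Postgres skip the discarded rows entirely.
```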
[00:08:44] Unknown:
You mentioned that one of the challenges with performance might be related to the size of the tables, the number of records in a table versus the number of records that you actually want. I'm curious, as you have worked as an application engineer and as you've worked on trying to optimize database performance, what are some of the common patterns that you see that typically lead to these performance problems that are maybe avoidable in terms of the application architecture and design, and how much of it is just a factor of this is how databases work, and you just have to figure your way around it? Yeah.
[00:09:20] Unknown:
I like that latter notion. You know, it's just how it is, you just gotta accept it. Sometimes that's true. I think in the simplest case, understanding indexes is probably the most important thing, mostly because a lot of the easy problems are indexing problems. Right? I mean, not easy as in trivial, but it's something where if you're missing an index, then that's gonna be very obvious in your execution plan. Now, there's one very specific aspect I'll mention, since it was something that I didn't understand for a long time. I only understood this, like, two years ago. So imagine you have two tables and you're joining them, and you want both sides to use an index, but one of the indexes is on your join condition. And that works. You can definitely have indexes on join conditions in Postgres and use that index, and that's what we call a parameterized index scan. Because what Postgres essentially does is a nested loop, and it inputs the outer values into each inner index scan. But that's actually really hard to reason about, because you need to look at your plan to understand: is it actually doing that? And one thing I've learned over the years is that sometimes you want that, sometimes that parameterized index scan with the nested loop is a very desirable behavior, and sometimes you really don't want it. Sometimes what you should do instead is add additional conditions to your queries. It sounds a bit counterintuitive, but imagine you have a schema where, let's say, you have a customer ID on all your tables, and you have a multi-tenant, shared-tenancy type system. In that situation, you may query in a way where you're saying: let me get all the records from table A for this customer, and then for each of those records, let me find the associated records in the other table. But you're not adding the customer ID to that second table, you're just adding it to the first one. And sometimes that's a good choice, but sometimes it actually helps a lot to do it on both tables. Right? So you essentially have WHERE table A dot customer is something AND table B dot customer is something. It's the exact same customer ID in both conditions, but what that helps the database do is use a different join method, because it can then use a merge join or a hash join, where it gets the data for both of these and then joins it together, versus having to get one of them first and then join into the other one. And so it's things like that. Understanding that kind of stuff is hard at first, but if you do it a lot, you see it all the time.
So, like, now when I look at a query plan, one of the first things I'm trying to understand is: is this the right join method? Should this be using a different join method? Because most of the time in relational databases, you'll see joins happening.
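A hedged sketch of that multi-tenant pattern, with hypothetical table names: repeating the customer condition on both sides of the join gives the planner the option of a hash or merge join instead of a parameterized nested loop.

```sql
-- Only the outer table is constrained by customer: the planner will often
-- pick a nested loop with a parameterized index scan on order_items.
SELECT o.id, i.sku
FROM orders o
JOIN order_items i ON i.order_id = o.id
WHERE o.customer_id = 42;

-- Repeating the (logically redundant) customer condition on the inner table
-- lets the planner fetch both sides independently and hash/merge join them.
SELECT o.id, i.sku
FROM orders o
JOIN order_items i ON i.order_id = o.id
WHERE o.customer_id = 42
  AND i.customer_id = 42;
```

Whether the second form actually wins depends on the data distribution, so comparing the two EXPLAIN plans is the way to decide.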
[00:11:46] Unknown:
Another problematic pattern that I've seen happen a lot is things like n plus 1 queries, particularly when you're dealing with a web framework that has a built in ORM, and you're just naively saying, oh, I need this attribute, so I'll just go ahead and fetch it. And not thinking through the fact that that actually means you're adding another query to an inner loop somewhere. And then all of a sudden, you deploy it and wonder, oh my goodness. Why did I suddenly go from 500 requests per second to 5? And I'm curious how much you've seen things like ORMs and application frameworks being used as a crutch, as a means of not having to understand how the database operates and some of the best practices around query patterns and some of the ways that maybe the industry is trending towards, you know, the pendulum swinging back towards actually living closer to the database and having the database be a first class consideration in the design and development of your application.
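A minimal illustration of the N+1 pattern described in the question, with hypothetical tables: the ORM issues one query for the parent rows and then one query per row, where a single join would fetch everything in one round trip.

```sql
-- What a naive ORM loop effectively sends: one query for the posts...
SELECT id, title FROM posts WHERE author_id = 7;
-- ...then one query per post for its comments (N additional round trips).
SELECT * FROM comments WHERE post_id = 101;
SELECT * FROM comments WHERE post_id = 102;
SELECT * FROM comments WHERE post_id = 103;
-- ...and so on for every post.

-- The same data in a single round trip, via a join. Most ORMs expose this
-- as eager loading (e.g. "includes" or "select_related" style options).
SELECT p.id, p.title, c.*
FROM posts p
LEFT JOIN comments c ON c.post_id = p.id
WHERE p.author_id = 7;
```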
[00:12:46] Unknown:
Right. Yeah. And I think, I mean, it's fascinating also if you look at this from a community perspective. When you talk to folks in the Postgres community, oftentimes they're very anti-ORM, I would say. Like, a lot of people working on Postgres don't really want to understand why people use ORMs. And I empathize with that. But I do think there is another side of the coin, which is that as an application engineer it is very convenient. Right? I don't wanna write SQL all the time. Even my own code would use a combination of ORMs and raw SQL. And I think it comes back to taking that leap of faith and actually just doing some SQL. It's not that hard, essentially. I mean, it is not trivial, but once you start looking at what SQL your ORM is generating, and then feeling confident that you could just copy that and maybe modify it, right, then I think that's a good starting point for many folks. I do think many ORMs generate suboptimal SQL, but it has improved over the years. So I would say that it depends a bit on which ORM we're talking about, but there are definitely cases still where you will get bad query plans because the ORM essentially does something nonsensical.
I remember there was actually a patch recently that got into Postgres 17, which if I recall correctly has to do with self joins, where essentially a table is joined against itself again. These are the kinds of patterns where sometimes they just make sense, right, but ORMs tend to do some of these things unnecessarily, and so there can be situations where the Postgres planner, for example, gets confused because you're telling it something very odd. You're saying join with this table, but don't really do it. There's this kind of illogical query. And so I think that's the situation where ORMs can be a problem: when they essentially cause a bad query plan because they're causing something to happen that you didn't actually mean to happen. Now, you touched a little bit on how that's changing, and I think it's really fascinating. For example, one of the tools in the Go space is called sqlc. And what sqlc does is, essentially, it does compile-time or run-time, I'm sorry, build-time checking would be the right term here, build-time checking of your SQL queries. So the idea is you write SQL, you don't write ORM calls. And what sqlc does is a type check for you, essentially. And so, if I'm remembering correctly, it generates the type definitions for you as well. So you author SQL, but the SQL essentially becomes more than just text in your application code. And I think that's an interesting trend. There's similar tooling for Rust as well, where you are writing SQL, but then the SQL integrates more deeply into your application by virtue of having the right types for querying and such. And so I think that's actually really exciting, because it makes people more familiar with SQL. Right? It makes it easier for them: if the query is slow, they can just put EXPLAIN in front of it, as mentioned, or EXPLAIN (ANALYZE, BUFFERS), and then understand why the query is slow. So it removes this kind of invisible barrier that ORMs sometimes add to performance work.
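For context, here's a rough sketch of what the sqlc workflow mentioned above looks like. The query name and schema are made up, and the exact annotation syntax should be checked against the sqlc documentation; the point is that you author plain SQL, and typed functions are generated from it at build time.

```sql
-- Hypothetical sqlc-annotated query file: the comment tells the generator
-- the function name and that it returns a single row.
-- name: GetCustomerByEmail :one
SELECT id, email, created_at
FROM customers
WHERE email = $1;

-- Because the query is plain SQL, it can be profiled directly, e.g.:
-- EXPLAIN (ANALYZE, BUFFERS)
--   SELECT id, email, created_at FROM customers WHERE email = 'a@example.com';
```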
[00:15:52] Unknown:
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data.
Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. And now, digging into the manifestation of performance problems and the debug path, you mentioned the EXPLAIN plan for figuring out how the database is actually going to execute a query. What are some of the useful debug methods and signals for understanding how best to focus your efforts on addressing an underlying performance problem: do I need to change my code? Do I just need to deploy a bigger database instance? Do I need to restructure my tables, add indexes? I'm curious if you can talk through what that overall process looks like, maybe generalized for any given database, since most SQL databases have some of the same core problems or core methods for this approach, but also maybe specifics of where the Postgres database and ecosystem are adding better capabilities or tooling that simplify this process.
[00:17:41] Unknown:
Yeah. I would say there are really two things to look at. Right? There's the individual latency of whatever you're trying to do, and then there's the overall system utilization or system performance. And they often interact with each other. Right? So if you have a lot of slow queries, that will bog down the whole system, and so unrelated queries might get slow. And if the system is slow for some reason, it might not even be a query-related issue, it might be some maintenance task the database is running that can also slow down your whole database, essentially. I think a lot of this actually comes back to understanding the DevOps side, or the Linux systems administration side, of this. Because it's really a question of which resources you have and how the database is using them. Right? And oftentimes, in the case of databases, that's IO. So one of the most important things to understand about database performance at the high level would be how your IO stack is set up and how fast your disks are. What's the disk latency? Right? From your database server to the actual disk, which these days, in the cloud, oftentimes involves network communication. And so that oftentimes drives the big picture. Right? Because if you have insufficient IO bandwidth or insufficient IOPS configured on your disk devices, then that will lead to problems. So I would oftentimes actually say: just look at whatever metrics your cloud provider has, or if you run your own virtual machine, use the built-in Linux tooling just to understand what's actually happening on the OS side, essentially.
Going from that to maybe a little bit of what Postgres itself can offer: Postgres has many different subsystems. Right? Postgres has autovacuum, for example. Postgres has the WAL, where it writes out the write-ahead log for crash recovery and such. And so sometimes it's important to understand what these other systems do. I think it's less critical for read performance, because for read performance you can actually just look at an EXPLAIN plan and it will give you a good sense for what's going on. But especially for write performance, it becomes challenging to follow what's going on, because Postgres, like other databases that have the same issue, won't just write your data right away to disk. Right? That wouldn't be performant. And so there are a lot of ways that data is kept in temporary structures, and then it gets written through the WAL, and it gets written to the data directory. And so when you have issues with write performance especially, it becomes really important to understand: am I bottlenecked on certain parts of the system? Like, am I bottlenecked on how fast the WAL in Postgres can be written out? One of the things that I am really excited about: Postgres 16 came out a month ago, roughly, and what Postgres 16 has added is a new statistics view called pg_stat_io.
And pg_stat_io is really interesting because it gives you, essentially, the database's view on its IO activities. So it tells you: I'm the autovacuum daemon, and I issued 500 IOs, and I issued them in this way because this stuff was not in memory, this was in memory, things like that. Right? And so it becomes really helpful for understanding: is my issue in the WAL writing activity, in autovacuum, in the queries themselves? And it's just, you know, something like 20 rows of data. It's actually not a lot of data, but when you see which of the rows essentially gets flagged, and then you look at it over time, that, I think, will make Postgres performance work a lot easier once people upgrade to that new release, because you know where to start.
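A quick sketch of reading pg_stat_io on Postgres 16 or newer; the exact column set should be verified against the documentation for your version, but a query along these lines breaks I/O down by backend type and context:

```sql
-- Which part of the system (client backends, autovacuum workers, the
-- checkpointer, ...) is doing the reads and writes, and how often the
-- shared buffer cache was hit instead.
SELECT backend_type, object, context, reads, writes, hits
FROM pg_stat_io
WHERE reads > 0 OR writes > 0
ORDER BY coalesce(reads, 0) + coalesce(writes, 0) DESC;
```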
[00:21:05] Unknown:
Another interesting aspect of performance, and of improving performance at the database engine level, is that in Postgres' case I know there are dozens, if not hundreds, of configuration parameters that you can change for different aspects. Some of them require modifying multiple different parameters in concert to have the desired effect. For anybody who is not deeply familiar with Postgres and all of its internals and lots of its history, it can be quite intimidating to even start looking at what those configuration parameters are and how they're supposed to work. I'm wondering if you can talk to some of the ways that people can start to gain confidence in and understanding of those parameters and know where to start looking when tuning for performance, to say: oh, actually, I have a write-heavy workload, so I wanna change the way that Postgres does this specific sub-operation, because I don't necessarily need to be able to read as fast, or I'm willing to accept some latency in the writes to make sure that it's durable. I'm curious if you can just talk through some of that process of gaining familiarity with the various tuning parameters.
[00:22:19] Unknown:
Yeah. And I'll start by saying, I think a lot of times people overestimate how bad their parameters are. Like, people think that they have really inefficient parameters, but it turns out most of the time, especially when you're on a cloud provider, they do a little bit of the tuning for you. Right? So on Amazon RDS, for example, your shared_buffers setting is going to be contingent on the RAM. And of course, sometimes you wanna tune that shared_buffers setting, but most of the time you don't necessarily need to start with that. And so a lot of times people think that they need to tune a lot of parameters, but I would question that. I would question if that's really where the biggest problem is. I do think that there are some things the database can't know, and you have to tell the database what's okay with you, essentially, to happen to your data. I was talking to somebody recently about a very important setting in that regard, which is synchronous_commit in Postgres. I'll mention this because it's one that I would actually think about changing myself if you have a write-heavy workload in particular. What synchronous_commit controls is whether, when you make a write, Postgres waits until the write actually has landed on disk before returning the result back to you. Right? To say: yes, this was successfully written to disk. Now, if you turn that off, which I actually like to do in my systems, the effect is that in the worst case you might lose a transaction in the case of a crash. Right? So the worst case essentially is that you might lose something, but your database won't be corrupt. Your database will still be consistent, it will just be missing one of the data points. So if you're operating in the financial services field, for example, you definitely shouldn't be turning that off, because you would wanna make sure that the system writing the data into the database actually has confirmation that it got there successfully.
But in most cases, in most other systems, it's actually not as big of a problem to turn it off, and it does have a huge performance implication. Like, a lot of internal systems can work more efficiently if they don't have to get that guarantee back to the client. And so I think that's one of the things I would look at. Now, you asked a broader question, which is: how can I even think about where to start? synchronous_commit is special because it's on the write side of things. But again, read queries are a little bit easier to reason about, because they're more isolated in terms of how they work in the system. And so I would say start by understanding that there are some parameters that influence costs in the Postgres planner. The Postgres planner is a cost-based system. Right? It makes certain calculations and says: this is the best plan according to this cost. And one of the things that's oftentimes not tuned these days, and unfortunately Postgres still has a bad default there, is random_page_cost. Postgres does random IO, for example, when it does an index lookup, and Postgres has a way of estimating how much random IO it's going to do for these index lookups. And it turns out that by default, random_page_cost is set to 4 cost units, and the sequential scan cost is set to 1 cost unit. That stark difference between the 1 and the 4 defaults means that Postgres will oftentimes prefer not to do an index scan, which on modern hardware, like NVMe SSD type hardware, is actually not that expensive. And so tuning down random_page_cost to 1.1 or so, not 1, but just slightly above 1, is usually a good choice. So that's the kind of stuff I would look at: anything that's planner related, around costing, I think is good. work_mem is also good, it influences how much memory Postgres can use for execution, things like that.
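A hedged sketch of applying the settings discussed above cluster-wide. The values are illustrative starting points rather than recommendations, and on managed services such as RDS they are usually set through parameter groups instead of ALTER SYSTEM. In particular, leave synchronous_commit on if losing a recently acknowledged transaction after a crash is unacceptable.

```sql
-- Inspect the current values first.
SHOW random_page_cost;
SHOW synchronous_commit;
SHOW work_mem;

-- Bring the random I/O cost estimate closer to sequential cost for SSD/NVMe storage.
ALTER SYSTEM SET random_page_cost = 1.1;

-- Trade a small risk of losing the last few transactions on a crash
-- (never corruption) for better write throughput.
ALTER SYSTEM SET synchronous_commit = off;

-- Give sorts and hashes more memory per operation (size this to your workload).
ALTER SYSTEM SET work_mem = '64MB';

-- These particular settings take effect on reload; no restart is needed.
SELECT pg_reload_conf();
```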
[00:25:46] Unknown:
In the process of tuning these parameters, obviously you don't wanna go straight to your production instance and start changing things willy-nilly. But as you are iterating through tuning these configuration settings and validating their effectiveness, what is a useful workflow that you have found for making sure that the change that you made is actually having the desired effect, so that you don't end up in the space of: I think I'm doing the right thing, but I'm just getting frustrated, so I'm gonna change 5 different things and hope that it all works. And then when it does function, I actually don't know which one did the thing that I wanted.
[00:26:20] Unknown:
Yeah. That sounds familiar. So some of the parameters you can actually change on a per-connection basis. For example, random_page_cost, which I just mentioned, you can set on a per-connection basis if I recall correctly. And so if you have a bad query plan, you could say SET random_page_cost = 1.1, for example, in your connection, and it just changes for that connection. And then you can see: does that actually make the intended change? And then, of course, at some point you will wanna roll it out to your whole database, but at that point at least you've done some testing with maybe your most important queries. So if you can do that kind of testing, that's definitely recommended. The other thing I would note is that the most frustrating parameters to change are the ones that require a restart. There are some parameters in Postgres where you have to restart the database for them to take effect. One example would be shared_buffers. shared_buffers is a fixed-size allocation in Postgres. It says how big Postgres' internal cache of sorts is, kind of its working area for a lot of things in Postgres. And so changing that requires a restart. What I would probably try to do there, if I really tried to get to the right setting, is set up a benchmark and work with a copy of your production database. Ideally record some of the queries you run. I mean, this is where it gets tricky, right, because there's no tooling for that, unfortunately. But record some of the queries you're running on your production system, and then, against the clone, ideally issue the same workload. Right?
And then you know what the best setting is. Now, most of the time that's a lot of effort and it's just not worth it. And so that's where people then go with a gut instinct, they make a change, and maybe it's a good change, maybe it's a bad change. It's essentially that question of: is it worth the effort? Like, do you see a problem where you think that contention on shared buffers is an issue? Then I would actually do the benchmarking for those big settings that can be hard to change.
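A sketch of the per-connection experiment described above, using a hypothetical query; nothing here affects other sessions until the ALTER SYSTEM at the end.

```sql
-- Baseline plan with the current settings.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42 AND status = 'open';

-- Change the planner cost only for this connection and re-check the plan.
SET random_page_cost = 1.1;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42 AND status = 'open';

-- Undo the session-level change once you've seen the effect.
RESET random_page_cost;

-- If the new plan is clearly better for your important queries,
-- roll it out globally (takes effect on reload).
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();
```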
[00:28:03] Unknown:
You mentioned at the beginning that you have been building pganalyze as a way to make it easier to understand the performance characteristics of your queries, where you might need to make some of these changes. Wondering if you can talk through some of the journey that led you to the development of this project and ways that it can help in this process of identifying, this is what's going wrong with my database performance. This is how I'm actually going to address it.
[00:28:30] Unknown:
Sure. And I would start by saying, I think the best software is written by people who have the problems themselves, and that's how we wrote pganalyze. It's just solving our own issues, essentially, with optimizing database performance. And so it started really around query performance, showing which queries have executed on the database, just giving you context on: hey, this is everything that has happened. We then expanded that into showing you the execution plans as well. In Postgres there's an extension called auto_explain, which will automatically log slow execution plans. And so you can say: if I have these outlier queries, then log them to the Postgres log. And so we have a pretty well-working system these days that picks up that data from the logs and links it together with the queries. And so I can just open up a query page and see the execution plans, which is huge, because that's always a lot of work, and oftentimes you don't get the right plan because you're looking at it after the fact versus seeing the actual runtime behavior. But really what we've been getting into more recently is other ways to help you optimize the workload, by essentially giving you what we call advisors that sit on top of all the data that we're getting. Right? One of the challenges that I found is, when I look at my queries, and I have probably a couple of thousand unique queries in our system, it becomes hard. You can look at one query and understand what to optimize, but you can't look at a hundred queries and still keep that in your head. It just gets too complicated. And so the first advisor we launched, about two years ago, is the index advisor for Postgres. What the index advisor does is look at the workload, look at the WHERE clauses and the JOIN clauses, and then say which of these is not covered by an index, or which of these has an inefficient index scan. And the idea there is that it's not necessarily replacing a human thinking about this. I still think that there's a human element in this, but it is essentially helping you work with a larger set of data much more easily, because it gives you an abstraction that we call scans, where each table has multiple scans based on what the queries are doing. And then for each of those scans, we run through different permutations of possible indexes and give you recommendations on what we think is missing. And so the goal here is to make that work easier for either the data engineer or the application engineer working with the database, so they can get a good first assessment of where they might need to spend time.
[00:30:51] Unknown:
This episode is brought to you by DataFold, a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. DataFold leverages data diffing to compare production and development environments and column level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. DataFold integrates with DBT, the modern data stack, and seamlessly plugs in your DataCI for team wide and automated testing. If you are migrating to a modern data stack, DataFold can also help you automate data and code validation to speed up the migration.
Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today. So one of the things that you just pointed out is this setting that will allow you to log queries that are deviations from the norm. And if you don't already know about this, or you're not familiar with Postgres, you don't know to make that change to log those queries. And so when you do run into performance errors and you're trying to find some common thread to be able to address them, you're not going to have that information at hand. So as you are deploying a new Postgres instance, what are some of the things that you always make sure to change, because you want to have that information when you need it, rather than saying: everything's not working, but now, in order to get the things that I need to know, I first have to make a bunch of changes and then wait?
[00:32:18] Unknown:
Yeah. No, it's a good question. So, coming back to the fact that some settings are harder to change (you should change those early so you don't have to restart the database later), there are two extensions that are bundled with Postgres, part of what Postgres calls contrib, that I would always enable. The first one is pg_stat_statements. pg_stat_statements collects the query statistics, and that's something you can always turn on in production systems. I mean, there are some edge cases, but let's just say, for the most part, I've not seen any performance issues at all with that. And it really gives you a lot of details around which queries have executed. So pg_stat_statements, I would turn that on. I would add it to shared_preload_libraries, which is the setting that requires a restart, so I would make sure to make that change. For most production systems, I would also enable auto_explain, which we just referenced, which logs the slow query execution plans. Now auto_explain, the same way as pg_stat_statements, has settings that control whether it's actually running. So just that shared_preload_libraries setting is the one you need to change at start time, but there are other settings that control what threshold it uses, or whether it should run at all. And so that's why I would just make sure you enable it, even if you turn it off again, just so that you can turn it on later if needed without the restart. Apart from that, the good news is, from a diagnostics perspective, most things that I can think of right now are changeable without a restart.
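As a rough sketch of that setup (the setting names are real, the thresholds are illustrative): shared_preload_libraries needs a restart, everything else can be adjusted afterwards. On managed services this is typically done through the provider's parameter settings rather than ALTER SYSTEM, and you should append to any existing shared_preload_libraries value instead of overwriting it.

```sql
-- Both extensions ship with Postgres (contrib). Preloading requires a restart.
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements, auto_explain';
-- ...restart the server here...

-- Once auto_explain is loaded, its behavior can be changed without a restart;
-- here: log plans for statements slower than 1s, including buffer counts.
ALTER SYSTEM SET auto_explain.log_min_duration = '1s';
ALTER SYSTEM SET auto_explain.log_analyze = on;
ALTER SYSTEM SET auto_explain.log_buffers = on;
SELECT pg_reload_conf();

-- Expose the pg_stat_statements view and look at the heaviest queries
-- (column names shown are for Postgres 13+; older versions use total_time).
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```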
So there are some things that I would enable that are more around logging, just logging more things. For example, Postgres has autovacuum logging. And autovacuum, for those of you not familiar, is that process in Postgres where it cleans up dead rows. Right? As you're updating and deleting things, they leave behind these dead rows in tables (that's just how Postgres' MVCC model works), and then Postgres cleans those up using autovacuum. Understanding when autovacuum is not working is very helpful, and there's a setting called log_autovacuum_min_duration, which you can just set to 0, which means every autovacuum run gets logged. And I find that really helpful, because then you know what's going on; you can just look at the logs to see what happened. Apart from that, the one other thing that comes to mind right now is around temporary files.
There's the log_temp_files setting, which essentially tells you when a query has to spill to disk. So let's say a query tries to do a sort or hash operation, and the in-memory space that it's allowed to use is not big enough, so it has to write to disk to temporarily save that somewhere. That can be quite expensive in terms of IO, and so it's important to know when that happens, and log_temp_files tells you that.
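A small sketch of the two logging settings just mentioned; both can be changed without a restart.

```sql
-- Log every autovacuum run (the value is a duration threshold in ms;
-- 0 = log all runs, -1 = disabled). Useful for spotting tables that
-- autovacuum struggles to keep up with.
ALTER SYSTEM SET log_autovacuum_min_duration = 0;

-- Log whenever a query spills a sort or hash to a temporary file on disk
-- (the value is a size threshold in kB; 0 = log all temp files, -1 = disabled).
ALTER SYSTEM SET log_temp_files = 0;

SELECT pg_reload_conf();
```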
[00:34:48] Unknown:
A lot of these changes, for people who aren't familiar with databases, their internal workings, and the ways that they came to be how they are, can seem very esoteric and arbitrary. And most of the database engines that are in common use, particularly for transactional workloads, have been around for decades. They've gone through many iterations. They've gone through significant evolution in their design and their usage. I'm wondering what are some of the lessons that we have learned communally as an industry from running these systems over the years, and some of the ways that that has influenced the ways that we think about some of the performance characteristics, some of the tuning parameters, and some of the ways that the engine itself should be responsible for its own well-being, that we're carrying forward either into successive versions of the same engines or into new database systems that are being designed and deployed?
[00:35:46] Unknown:
I think it's a good question. I would say that Postgres is probably one of the more conservative ones out there in terms of adopting new technology, and there are a couple of reasons for that. If you look at what's new and exciting in databases, you're not going to find it in Postgres for the most part. Maybe you'll find some of it. The one thing that Postgres has which allows some of that new work to happen is extensions. Right? Extensions in Postgres let you modify a lot of things so the core system doesn't have to. A good example of this, just as a quick side note, would be pgvector. These days there are a lot of specialized vector databases, which are based on a lot of recent research around embeddings and such. Postgres does not have any built-in logic for that, but there's an extension called pgvector, which essentially makes Postgres a vector database. And when that was released, and ML and AI was all the craze around last year, everybody started adopting it. All the big cloud providers offer pgvector as a built-in extension now because people want to use it. Right? But Postgres itself still doesn't have anything special in the core engine to support that. I think what's interesting to note there, just in terms of how Postgres works, is that it's conservative. If you look at other databases, let's just say Oracle and MySQL, there are sometimes things where Oracle makes a strong push for certain features. They have MySQL HeatWave, for example, which is their fancy auto-tuning kind of logic. If you use Oracle Cloud, you can use their HeatWave auto-tuning logic, and I'm sure it's great. I've never used it, but I'm sure it's great. And it's that kind of stuff that in the Postgres world doesn't really happen with Postgres itself, because there is no single entity behind Postgres that drives those kinds of non-core experiences. I mean, in a sense it is a core experience, right, but it's not the core engine. It's not the executor or the planner that's changing, it's, in this case, just some logic around it. And so really what we see in the Postgres world, I would say, is that instead of the core team adopting these changes, it's providers like Amazon, like Google, like Azure, building their own variants. Right? AWS has Aurora, Google has AlloyDB.
And so I think what's fascinating there to me is just how much scale Postgres has gotten. Right? Because it's just such a good base to build on, essentially. It's very hackable, it's fully open source, and you can influence the community direction if you spend time on it. If you pay people to work on the project, there's a good chance that you can at least contribute some of the things back. And so I think that has really given it its longevity. But I do think that if I look at database communities and research conferences, and somebody says, here's the next best algorithm that really changes how executors should work, how planners should work, it will take some time for Postgres to adopt that, I would say.
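As a side illustration of the pgvector extension mentioned above (the table and the tiny 3-dimensional vectors are purely for demonstration; real embeddings have hundreds or thousands of dimensions, and the index parameters should be tuned per workload):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- A table with an embedding column of fixed dimensionality.
CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    body      text,
    embedding vector(3)
);

INSERT INTO documents (body, embedding)
VALUES ('hello', '[1, 0, 0]'),
       ('world', '[0.9, 0.1, 0]');

-- Approximate nearest-neighbor index (IVFFlat; newer pgvector versions also support HNSW).
CREATE INDEX ON documents USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

-- Five closest documents by Euclidean distance (the <-> operator).
SELECT id, body
FROM documents
ORDER BY embedding <-> '[1, 0, 0]'
LIMIT 5;
```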
[00:38:33] Unknown:
There are also interesting developments that are maybe external to the database engine. One that comes to mind is the OtterTune project from Andy Pavlo and some of his research at Carnegie Mellon, where it will do some analysis of the runtime and of the logs that are available in the database engine. I believe it works on things like Postgres, MySQL, and a few other engines. And then it will either just recommend configuration parameters that you should tune or, if you're feeling adventurous, even do some of the tuning for you. I'm wondering what you have seen as the general adoption of tools like that, or of any other tools in the ecosystem that are trying to alleviate the burden on the end user of the database of having to know all of the details of how the database works to be able to run it most efficiently and effectively for their workloads.
[00:39:27] Unknown:
Yeah. I think it's quite interesting. I'll caveat this with saying I'm not an expert in what they're doing; I think it's interesting, but I haven't used it myself, so I'll preface with that. What I've seen and heard from people is that they're interested in it. People want that level of automatic tuning, but they're also worried that these systems make changes that they don't understand. One of the things I did see, just as an outsider, from what OtterTune did is that over the time they made it into a commercial project, they added more guardrails around changes not being applied automatically; instead you get a way to approve them. And so to me, it comes back to this: we want systems that reduce the overhead, the work, the cognitive load, because working with database technology is hard. I think we all want systems that make that better, but I don't think people just want things that change automatically. For the most part, people want some level of control, because databases are so important. You often just have one of them, so if your database suddenly does something unexpected, that's often a really big problem. I think that's something that hasn't been overcome yet, but I do see a lot of promise in it. Our own take on this is that we're a little bit competing, but not directly. We don't do much parameter tuning today in pganalyze, but index recommendations, as mentioned, that's really where we see a big opportunity to do better than the default, which is somebody taking the back of a napkin and trying to sketch out which query uses which index and how they should optimize it.
And I think those types of systems will really save a lot of time, let people work better, focus on the actual problems they want to solve, and not worry too much about these minute details.
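As a rough, manual version of that back-of-a-napkin exercise, a query like the following can surface tables that are read mostly via sequential scans and might be candidates for an index; the thresholds here are illustrative, not a rule.

```sql
-- Tables with many sequential scans relative to index scans can hint at
-- missing indexes; thresholds are illustrative only.
SELECT relname,
       seq_scan,
       COALESCE(idx_scan, 0) AS idx_scan,
       n_live_tup
FROM pg_stat_user_tables
WHERE seq_scan > COALESCE(idx_scan, 0)
  AND n_live_tup > 10000
ORDER BY seq_scan DESC
LIMIT 20;
```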
[00:41:16] Unknown:
And then a lot of the primary use cases for Postgres are as a transactional engine, whereas analytical workloads are often handled by systems that are architecturally separate. I'm curious what you see as some of the challenges or confusion that people run into when they are maybe trying to use their transactional database in an analytical context or vice versa?
[00:41:46] Unknown:
Yeah. I would say if you have a workload that's not a good fit for Postgres, it's probably not a good idea to use Postgres. I do think there are cases where one can go too far in trying to use one technology for everything. Postgres is pretty good at a lot of things, but there are definitely cases where it's not good in that sense. The clearest example I would reference is columnar-based versus row-based storage. For context, Postgres does have ways to do columnar storage: the Citus extension can do columnar storage, Hydra, which is based on the Citus columnar extension, can also do it, and there are other things like Greenplum and such. So Postgres could do column storage for you, but it will most likely never be as efficient at that as, for example, ClickHouse. And so the question I would ask myself is: if I have a system that is really best suited for columnar, and it's just columnar-type data, then I would probably use a specialized database for that, because then you don't have all the overhead that Postgres will ultimately still have from all the other subsystems it's running. But if I have a workload where I need a combination, say I have an app and a portion of it is more analytics-type functionality, then I think it's really interesting to use Postgres as the one database versus having to manage two. Running Postgres and ClickHouse, for example, means you now have to think about probably two different providers, two different ways of optimizing things, two different ways of understanding your performance. So I think it's a good idea to try to use one database if you can, but do recognize when you have specialized workloads and do change which database you're using in those situations.
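For illustration, here is a minimal sketch of columnar storage inside Postgres, assuming the Citus columnar extension is installed (packaged as citus_columnar in recent Citus releases); the page_views table is hypothetical.

```sql
-- Columnar storage inside Postgres via the Citus columnar extension.
CREATE EXTENSION IF NOT EXISTS citus_columnar;

-- Hypothetical analytics table stored column-oriented instead of row-oriented.
CREATE TABLE page_views (
    viewed_at timestamptz,
    page_id   bigint,
    user_id   bigint
) USING columnar;

-- Aggregations over a few columns are where columnar storage tends to help.
SELECT page_id, count(*)
FROM page_views
WHERE viewed_at >= now() - interval '7 days'
GROUP BY page_id;
```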
[00:43:24] Unknown:
And in your work as an application developer, as an engineer, as somebody who is working in the Postgres ecosystem to help bring to light some of these performance problems in the engine, what are some of the most interesting or innovative or unexpected ways that you've seen people trying to address this problem of database performance?
[00:43:43] Unknown:
I mean, the one thing I will say is that, more recently, there are some efforts in the Postgres space around finally getting something MySQL has had over the years already. If you remember, MySQL had MyISAM and then InnoDB, so different storage engines. Ultimately, I think most people settled on InnoDB, and then MyRocks, I think, is also one thing you could use; I'm not a MySQL person myself. Postgres, for a long time, was able to do custom index access methods, so custom index types, essentially. But as of Postgres 12, I believe, it can also do custom table access methods, so custom storage. Coming back to column stores, that's one way of doing custom storage.
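As a small illustration of that access-method route, assuming a reasonably recent Postgres, you can list what is registered on a server and see that the default row store is itself just one table access method.

```sql
-- List the index ('i') and table ('t') access methods registered on this
-- server; extensions can register additional table access methods.
SELECT amname, amtype
FROM pg_am
ORDER BY amtype, amname;

-- The built-in row store is the 'heap' table access method; USING heap
-- is what CREATE TABLE uses implicitly by default.
CREATE TABLE example_rows (id bigint) USING heap;
```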
But another way that's, I think, much more fundamental: there's a project called OrioleDB, which tries to do a lot of things better at the same time, which unfortunately also makes it hard to get to the point where it's production ready. What OrioleDB addresses, amongst other things, relates to transaction IDs; Postgres, if you're familiar with it, has transaction IDs. One of the biggest problems for people operating Postgres at scale can be a high rate of transaction ID consumption, which means they need to run a lot of autovacuum processes, because otherwise Postgres will run out of transaction IDs. If you look back over the years, especially around 2010 and the years that followed, you'll see all these production outages where Postgres stopped and people had to do multiple hours of maintenance. The good news is most of that is gone these days because of better defaults in Postgres, but there are still situations where it can be a problem. And so that's one of the things OrioleDB tries to address; currently, it's a fork of Postgres.
It tries to essentially replace the core storage method, how things are stored in tables, to solve a lot of these problems at once. I think it's really promising, but the diff between OrioleDB and actual core Postgres is a pretty big diff. And so that's really where the struggle often is: how do you break that into smaller pieces that can actually be merged back into the main Postgres tree? Because they're big ideas, because they're big changes, they essentially have to live separately for a long time, which is a challenge.
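As a side note on the wraparound issue mentioned above, a common way to keep an eye on it is a query like this sketch, which reports how old each database's oldest unfrozen transaction ID is; autovacuum needs to freeze rows well before the roughly two-billion limit.

```sql
-- How close is each database to transaction ID wraparound?
SELECT datname,
       age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```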
[00:45:55] Unknown:
And in your own work of operating in this space, digging deep into Postgres and understanding its performance characteristics, what are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:46:08] Unknown:
Let's see. I'll start by saying sometimes it is the simple things. Coming back to indexes, this was a problem just yesterday: I deployed some new code. You should maybe question whether I should be writing code sometimes, because I have a team working for me now, so sometimes I write that code and it's maybe not a good idea. But I deployed some code, and there was a missing index. Surprise, surprise. Sometimes the things you run into are as simple as a missing index, just not having thought about it, because you wrote a SQL query but didn't think about covering some part of it with an index. So oftentimes it does come back to that for me personally. One of the things I often come back to, and I referenced this earlier, is this idea of how the data is flowing and what you are essentially asking the database to do. Oftentimes, when I see bad performance, like a slow query, I look at the explain plan and I see, why are we fetching a million records here? Sometimes it comes down to the fact that in our own system, with pganalyze itself, the internal database, which is also Postgres, some customers just have lots of records for the things we're tracking, like queries in our case. They have a lot of unique queries that they're tracking, and these outliers are often the situations where they have a lot of queries. And then the question really becomes: do we need to store that data? Is somebody actually gonna be able to look at a million queries? They're not gonna be able to work with that. So oftentimes these problems that start as a slow query then become, for me, a question of what's happening with the data model, why do we have this data, why do we have a million records. And then what really helps, and this comes back to thinking about who is working on this: if I were just the Postgres expert on the team, just the data engineer, I maybe wouldn't know what the application needs to do. In these surprising situations where you suddenly have a million records where you expected a thousand, for example, I wouldn't know if that's expected or not. So if you are in a larger organization, if you don't have that knowledge within a single head like mine in this case, what you should do is really try to work across these departments. If there is a problem where it is a slow query, I think it's very important that the application engineer, who knows what's expected, and the data engineer or data platform engineer sit together and ask: is it expected that there are a million records here, or do we think that needs to be fixed on the application side?
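To make that missing-index workflow concrete, here is a minimal sketch; the queries table and its columns are hypothetical, not pganalyze's actual schema.

```sql
-- Inspect where the time goes for a suspicious query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM queries
WHERE database_id = 42
ORDER BY last_seen_at DESC
LIMIT 50;

-- If the plan shows a sequential scan over millions of rows, a targeted
-- index often fixes it (CONCURRENTLY avoids blocking writes).
CREATE INDEX CONCURRENTLY idx_queries_database_last_seen
    ON queries (database_id, last_seen_at DESC);
```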
[00:48:30] Unknown:
And in those contexts too, it's also useful to bring in tools like Great Expectations or other frameworks that allow you to collaborate in a shared space without necessarily having to always be in tight communication, so that everybody can share the responsibility of understanding the contextual expectations of this database, of this table, of these pieces of data, and so that you can have some of those assertions
[00:48:58] Unknown:
as part of your runtime without always having to manually say, oh, hey, this isn't working, now I need to go dig into why. It could just be something that says, oh, hey, something's weird, you should look at it. Exactly. Yeah. And one thing we added recently with pganalyze is that we integrated with OpenTelemetry for essentially exporting some of the data that we have into APM systems. Oftentimes, as an application engineer, you wouldn't work with a specialized tool like pganalyze; you would work in your New Relic or your Datadog or your Honeycomb. So what we've added recently is a way for pganalyze, when it has a slow execution plan, to actually send that into your tracing system, like into your Honeycomb, for example. I can look at my trace in Honeycomb and actually see the slow span from the database, and there's a back-link into pganalyze to then dig deeper and see index recommendations and whatnot. But it becomes a way for application engineers to see that performance issue from their world, from the application side, versus having to start somewhere else. That's definitely
[00:49:56] Unknown:
a very useful aspect of it too, just bringing the information into the tools and the context where the person is already working.
[00:50:10] Unknown:
Right. Exactly. Yeah.
[00:50:12] Unknown:
And as you continue to build and iterate on pganalyze and engage with the Postgres and broader database community, what are some of the things that you personally have as wishes or goals for the future of databases and the ways that they are engineered and deployed and maintained?
[00:50:31] Unknown:
So I think there are many aspects to this. One aspect that I personally truly care about is that I think we need fewer databases that are run as startups or by single companies. In a way it's exciting, because that's how you can fund people to work on the problem, so in a sense that's good. But I do think there's a big risk where new startups come out, they make a new database technology, they become big, they try to make money with it, they try to restrict licenses. It's the same spiel all over again; it just keeps happening, this whole license issue in particular. And so that's what I really appreciate about Postgres, and what I wish we could adopt for more database technology: there is a community that's not a single-company community. In the Postgres case, there are people from AWS, people from Microsoft, people from GCP, people from smaller companies and consulting shops that all work on that database together. That gives me confidence that the database is gonna be around in 10 years, and there aren't gonna be any weird license terms attached to it. What I would really love to see is that approach adopted for more of the other database technologies. ClickHouse is a good example. ClickHouse is open source, but they have started to commercialize their offering more, and so now some of what ClickHouse offers is only in the cloud version. We're starting to see discussions on the open source side where people ask, are you gonna be adding this? And if I make a PR that adds this feature to the open source version, would you accept the PR? The problem is that the governance, with the project being owned by a corporation, actually limits what the database can become.
[00:52:02] Unknown:
Absolutely. Are there any other aspects of the work that you're doing on pganalyze, or this overall question of database performance analysis and resolution, that we didn't discuss yet that you'd like to cover before we close out the show? No, I don't think so. I think I will maybe echo this again:
[00:52:18] Unknown:
It's just software. You can think of your database as part of your application. Yes, there's a network boundary, and yes, it does this weird SQL and then it does this planning stuff, but ultimately it's causing function calls to happen and data to be loaded, files to be read. So I think demystifying that, not thinking about it as a separate entity but really thinking about it as part of your application logic, is how I would approach database technology, in particular Postgres, in the context of a typical web application, for example. Yeah. It's definitely
[00:52:49] Unknown:
useful to call that out and get rid of some of the fear, uncertainty, and doubt that exists around this black box where the data goes, and hopefully it doesn't disappear.
[00:53:00] Unknown:
Exactly. Yes.
[00:53:01] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think it's a user experience gap still. Coming back to what I said earlier, I think we need more people
[00:53:22] Unknown:
with a design or a user research background working on database technology. And it's really hard, right, because databases are not easy to understand, so the overlap of designers and database people is very small. But that's really what I would like to see more of, and that will ultimately result in better tooling. I don't know which exact tooling, but I think the fact that we need more people thinking about user experience in the context of database tooling is the most important thing. Alright. Well, thank you very much for taking the time today to join me and share your experiences
[00:53:52] Unknown:
working in this problem space of database performance and the work that you're doing on pganalyze to make it more scrutable. I definitely appreciate the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Yeah. Thank you for hosting.
[00:54:10] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and The Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Lukas Fittl: Optimizing Database Performance
Lukas Fittl's Journey with Postgres
Impact of Database Performance on Business
Challenges in Addressing Database Performance Problems
Common Patterns Leading to Performance Issues
ORMs and Database Performance
Debugging Database Performance Problems
Tuning Postgres Configuration Parameters
Building pganalyze
Essential Postgres Settings for Performance Monitoring
Lessons Learned from Decades of Database Usage
Tools for Automatic Database Tuning
Innovative Approaches to Database Performance
Challenging Lessons in Database Performance
Future Goals for Databases and pganalyze
Closing Remarks