Summary
One of the longest-running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability, it has stayed relevant as development environments and data requirements have changed and evolved over its lifetime. It is difficult to capture even a single facet of this database in one conversation, let alone the entire surface area, but in this episode Jonathan Katz does an admirable job of it. He explains how Postgres started and how it has grown over the years, highlights the fundamental features that make it such a popular choice for application developers, and covers the ongoing efforts to add the complex features needed by the demanding workloads of today’s data layer. To cap it off he reviews some of the exciting features that the community is working on building into future releases.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Jonathan Katz about a high level view of PostgreSQL and the unique capabilities that it offers
Interview
- Introduction
- How did you get involved in the area of data management?
- How did you get involved in the Postgres project?
- For anyone who hasn’t used it, can you describe what PostgreSQL is?
- Where did Postgres get started and how has it evolved over the intervening years?
- What are some of the primary characteristics of Postgres that would lead someone to choose it for a given project?
- What are some cases where Postgres is the wrong choice?
- What are some of the common points of confusion for new users of PostgreSQL? (particularly if they have prior database experience)
- The recent releases of Postgres have had some fairly substantial improvements and new features. How does the community manage to balance stability and reliability against the need to add new capabilities?
- What are the aspects of Postgres that allow it to remain relevant in the current landscape of rapid evolution at the data layer?
- Are there any plans to incorporate a distributed transaction layer into the core of the project along the lines of what has been done with Citus or CockroachDB?
- What is in store for the future of Postgres?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- PostgreSQL
- Crunchy Data
- VenueBook
- Paperless Post
- LAMP Stack
- MySQL
- PHP
- SQL
- ORDBMS
- Edgar Codd
- A Relational Model of Data for Large Shared Data Banks
- Relational Algebra
- Oracle DB
- UC Berkeley
- Dr. Michael Stonebraker
- Ingres
- Informix
- QUEL
- ANSI C
- CVS
- BSD License
- UUID
- JSON
- XML
- HStore
- PostGIS
- BTree Index
- GIN Index
- GIST Index
- KNN GIST
- SPGIST
- Full Text Search
- BRIN Index
- WAL (Write-Ahead Log)
- SQLite
- PGAdmin
- Vim
- Emacs
- Linux
- OLAP (Online Analytical Processing)
- Postgres IRC
- Postgres Slack
- Postgres Conferences
- UPSERT
- Postgres Roadmap
- CockroachDB
- Citus Data
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And are you struggling to keep up with customer requests and letting errors slip into production? Wanna try some of the innovative ideas in this podcast but don't have time? DataKitchen's DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and datasets while improving quality.
Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement today and sign up for the newsletter at datakitchen.io/de. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
[00:01:24] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Jonathan Katz about a high level view of PostgreSQL
[00:01:29] Unknown:
and the unique capabilities that it offers. So, Jonathan, could you start by introducing yourself? Sure. Thank you, Tobias, and thank you again for having me. So, my name is Jonathan Katz. Currently, my role is I'm the director of communications at Crunchy Data, which is an enterprise Postgres support company. But I'm here today talking about my role in the Postgres community as a contributor. I guess, you know, just a little bit more about my background. You know, before I got into this marketing position, I was actually on the engineering side, and I've basically been programming for about 20 years. Mostly, I was active in the New York City tech start up scene in various leadership positions. Previously, I was the CTO at a company called VenueBook. And a couple years before that, the VP of engineering at a company called Paperless Post. And, of course, at both companies, I was using Postgres. And before we get too far along,
[00:02:13] Unknown:
can you help me settle the pronunciation debate of how how is it that you're actually supposed to pronounce the name of the database?
[00:02:20] Unknown:
Yeah. So that's a, that is a good question because it's not only the pronunciation, but also the spelling. So the commonly accepted ways to pronounce Postgres by the community are Postgres, as I've been saying, or PostgreSQL. I've heard things like PostgreSQL, PostGray, PostgreSQL, and, you know, some weird intonation on that, but it's Postgres and PostgreSQL. And do you remember how you first got involved in the area of data management? Yeah. So I think, you know, it actually started when I was a teenager, when I first began developing websites. I was doing, you know, what I learned was a very common mistake where I'd be repeating my code and basically doing the same thing over and over when trying to get sort of lines of data to render. And, you know, even being a naive, you know, 13 year old, I kinda thought there was a better way of doing that. So I started researching databases, and, you know, this was back around 2000, 2001, and, you know, 1 of the common stacks back then was the LAMP stack: you know, Linux, Apache, MySQL, and PHP.
And, you know, basically what I got started with was first a MySQL database, and I learned from, you know, some friends about this other database called Postgres, which is how I began using that. But what really fascinated me about, you know, data management and data engineering was how much you could do with data, that, you know, once you had it in the structured database, all the different kinds of analyses that you can run. And I also became very interested in machine learning, and this was more when machine learning was in the research phases as opposed to how widely commercially deployed it is today. Of course, what you need to do for machine learning, you need to, you know, not only store a lot of information but be able to analyze it as well. And I began to see, you know, how it was to use databases to do so. And I think, you know, the other thing is I just love programming. I love programming languages, and learning SQL and just all the different things you can do with SQL, particularly within a Postgres database, just made me, you know, very excited to work with it, create reports with it, create generalized reports that anyone can run, and really see how much, you know, you could do with that kind of information.
So I think, you know, the way I usually summarize it is that I've just always been fascinated with data and everything you can do with it, and it's just amazing to see, you know, since I started, you know, 20 years ago, how much just the field of data management has evolved to what it is today. And do you remember how you first got involved in the Postgres project itself and became
[00:04:50] Unknown:
1 of the core contributors and actively involved as 1 of the maintainers of the project?
[00:04:57] Unknown:
Sure. So, first I would say, you know, about my role: there's actually something called the core team, and I'm not a part of that. But the way I became involved in Postgres was, you know, first starting out as a user of Postgres, and I like to think of the Postgres community as everyone who's either using Postgres, contributing to Postgres, developing Postgres, whatever it may be. And when I first started using Postgres, I just really liked all the functionality in the database, and I think that's something I definitely would like to talk about a little bit later, and I wanted to figure out a way to be more active. I mean, I think 1 of the wonderful things about open source is, you know, the culture of giving back, that, you know, if you're using something, you should try to find a way to contribute.
And when I looked at the people who were active Postgres contributors, they were just brilliant engineers. They were adding very complex features that were very stable and very performant, and I felt that I could be more useful in being involved in helping to organize events and organize ways to help promote and educate people about Postgres. I initially started by helping to organize and grow the local New York City Postgres user group, which today is the largest local user group in the world. Back when I started, it was, you know, maybe about 60 people.
From there, I became active in conference organization. I'd organized entrepreneurship events when I was in college, so I had some familiarity with it, you know, once I was out in the professional world, and I helped to grow some very large Postgres events. You know, and of course, the nice thing about Postgres is that it really is a team and everyone helps and supports each other. So, you know, just having that great support network, working with all these wonderful people, has been great in terms of helping to grow that. And, you know, since doing the conference work, I'm now helping more with the general project advocacy. So if you read any of the press releases coming from the Postgres project, I've probably been involved in writing those. I also helped manage the website. I was involved in a project that just redesigned the Postgres website, which had not been updated since, I think, 2004. So having that more modern look and feel, I think, should certainly help the project. And also, you know, I do enjoy speaking. You know, I mentioned to you before we started, this is my first podcast, but, now, frequently, I'm doing public speaking, you know, at various local user groups, conferences, events, etcetera.
[00:07:08] Unknown:
And for anybody who hasn't used or directly interacted with a Postgres database and the project itself, can you briefly describe what PostgreSQL is and
[00:07:21] Unknown:
what differentiates it a bit from some of the other relational databases that are out there? Yeah. Sure. Well, I think actually a good way to start with that is to give a little history of the project and how it came to be. And, you know, I mean, first of all, to say Postgres is what's called an object relational database management system, also known as an ORDBMS. This derives from the relational database management system, which is the term most people are familiar with, and, you know, basically, that's what Postgres is as well. But it all goes back to a paper written by Edgar Codd called A Relational Model of Data for Large Shared Data Banks, which was the definitive guide for how to build a relational database management system. Basically what it did was it defined what, you know, what data was and various attributes around how you could classify and store data, as well as something called relational algebra, which is, well, the mathematical language in terms of how to manipulate data, and you can basically trace every single relational database back to that paper.
You know, so for example, Oracle is implemented based upon, you know, the information given in that paper, and Postgres is no different. Postgres came out of a project from, you know, UC Berkeley, from a gentleman named Dr. Michael Stonebraker, who's been prolific in the database world. Postgres simply means post-Ingres. That's where the name originates from. Ingres was a database Dr. Stonebraker had worked on prior. It was Ingres, you know, and I don't know as much about it as I would like, but a lot of it had to do with image processing. And, you know, Postgres wanted to extend upon the relational database management system by implementing these things called objects or types, which is where a lot of Postgres' power in terms of manipulating data types comes from. So basically Postgres was first implemented in 1986, and ultimately what happened was several iterations of it were made.
At this point it was closed source. It was just, you know, part of a university project, and eventually it was commercialized and picked up by a company called Illustra, and, just, you know, a little bit of history of that was that Illustra eventually was merged into Informix, which was then later acquired by IBM. And several versions of Postgres were created, but again, it was still, you know, either in a few commercial companies or part of this academic project. The big breakthrough for Postgres in terms of becoming open source happened around 1995 where 2 students at UC Berkeley, Andrew Yu and Jolly Chen, added the SQL language to Postgres.
This was actually a very big deal because if you looked at database management systems up until that point, all of them implemented their own unique languages. Postgres' language was called QUEL, q u e l. And, you know, there were different languages for the different systems. By implementing SQL, which is what the data management community was trying to standardize around, it suddenly made it easier to access Postgres, and it made it easier to have a career in data management where instead of needing to know 10 different languages, you only needed to know 1. This was released to the web as Postgres95, you know, basically to commemorate the year it was released, and of course, you know, looking back, that's when Windows 95 was out and a lot of things were given that 95 moniker.
And also the other big deal was it was released to the web in 1995. And, you know, recalling the web back in 1995, it was a very different landscape than it is today. And, you know, you typically did not release software to the web, like free software for people to download. You know, this was relatively novel. You know, Linux had only been released, I think, a couple years back. So what happened was, sorry, the other thing I should mention is that, you know, in addition to standardizing Postgres around the SQL language for interfacing with it, it was rewritten in ANSI C, so a standardized version of C, which would make it much easier for other people to contribute and develop it.
So some others became very interested in this project, and they wanted to open source it. And on July 8, 1996, though actually, I looked it up, it was July 9th UTC, 1996, several individuals committed Postgres to the CVS repository. It was renamed PostgreSQL at that point, and I know I'm still gonna refer to it as Postgres because it's a little bit easier to pronounce, but from 1996 onward, it was known as PostgreSQL. The license it was open sourced under is its own license, the PostgreSQL license, but it's based on the BSD license if you're keeping score at home in terms of that. And, like, the other thing that really fascinates me is that when Postgres was open sourced, it already had 10 years of development behind it. And, you know, 22 years later, it's still being developed. So there's already 30 plus years of development on PostgreSQL, which, you know, is really much of a testament to the community behind it developing it. And also, you know, the power of good software architecture, that if you do write something, you know, very well, it's easy to continue building on top of it, you know, for many years to come. So in essence, you know, just to circle back, if I describe what Postgres is, Postgres is a database at its heart. It's a database that has a lot of functionality. 1 comment out of there right off the bat is, as you said, it's a
[00:12:36] Unknown:
30 plus year database project or software project in general, which for anybody who has worked at any length in the software industry, that's quite a commendable accomplishment, because I know I've written software 2 years ago that I don't even wanna look at anymore, let alone something that's 30, you know, 30 plus years old. And I think that your point of the object aspect of the Postgres relational system, of being able to have custom types and support more than just the standard primitives, is 1 of the things that has really granted it staying power, because it does have so much adaptability for various use cases without the core team having to be the ones to maintain that functionality out of the box, you know, and that is 1 of the things that really provides a lot of the power and a lot of the innovative uses of Postgres that I've seen in various contexts.
And so, yeah, digging a bit more into some of the primary characteristics of the Postgres project, what are some of those that would particularly lead someone to choose it over,
[00:13:53] Unknown:
another relational or even non relational database system for a given project that they're starting to architect? You know what? When I listen to this question, you know, I think a lot about my experience as an application developer where, you know, while maybe the application I was crafting might be written in Python or Ruby or, you know, what have you, it all came back to what can the database offer, as, you know, all these applications were database driven. And this is with my computer scientist hat on. You know? For me, it always starts with the data types. Like, you know, how can I store my data? You know, what can I store my data as? And you talked about the primitives that Postgres has. And even if you just look at the basic primitive support, I have an introduction to Postgres training for data types where I spend just an hour on the primitives, and, you know, these are the numeric types, the string types, you know, and things like that, because there are, you know, a lot of different indexing capabilities around it, which, you know, I do wanna touch on in a little bit, but just a lot of functions that really let you harness the power of, you know, just, you know, basic numbers.
For me, it really starts getting interesting when you get to some more structured types, and 1 of the things that's really helped me throughout my career is Postgres' datetime support. A lot of the core functionality for Postgres's datetime support was completed about 15 years ago. It truly, you know, is a testament to, you know, the engineers behind it. But, you know, for 1 thing, it's very easy to do date and time calculations. It's very easy to manipulate intervals, you know, do relative date calculations based on that, and, also, the time zone support. You know, it's very easy to convert between the different time zones, which is the bane of every schedule manager's existence. You're trying to figure out, you know, when to organize things.
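To make this concrete, here is a small SQL sketch of the kind of date/time arithmetic being described; the values and column aliases are purely illustrative, not from the episode.

```sql
-- Interval arithmetic, relative dates, and time zone conversion in Postgres.
SELECT now() + interval '2 weeks'            AS follow_up_date,   -- relative date math
       age(timestamp '2001-01-01')           AS elapsed_interval, -- interval since a date
       date_trunc('month', now())            AS start_of_month,
       now() AT TIME ZONE 'America/New_York' AS nyc_local_time;   -- time zone conversion
```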
Additionally, the indexing around the datetime support is incredible too. Now I've had to do a lot of work with scheduling throughout my career, and, you know, I've always been able to fall back on Postgres to ensure that I can, you know, both accurately and quickly, store and manage the data around dates and times. Another structured data type I love is the array type. Postgres can store arrays and you kinda think about that. Well, you know, isn't that against the relational database model? And, well, yeah, it is, but also, you know, we need to be practical too. The way that Postgres handles data type compression, there are often cases where it makes more sense to store information in an array rather than as a series of rows.
Additionally, you know, based on Postgres' object typing system, the array type works for essentially any type that's in Postgres. So you can have an array of integers, array of dates and times, array of more complex objects like XML documents. You can have an array of arrays. You can have an array of array of arrays. So, you know, it's very flexible in that. 1 of my other favorite structured types is the range type, and that's definitely helped me, you know, in scheduling type applications. So Postgres allows you to store ranges, you know, ranges of times, ranges of numbers, you know, which have applications not just in scheduling, but in science, you know, storing things like error bounds, probability, and, you know, other things where range data is very important.
What makes it useful is not the storage around it, but the indexing, which I keep alluding to as, you know, something I'm gonna discuss, which I will. I just wanna be able to tie all the different data types together. And, you know, things like UUIDs. And I know there's a big debate in the data management world, like, do I wanna use, you know, UUIDs? But the fact is Postgres supports them. So if it makes sense for your use case, you can do that. Document types. You know, Postgres has gained a lot of buzz for supporting JSON and its binary formatted JSON type called JSONB.
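As a hedged illustration of the array, range, and UUID types mentioned above, here is a minimal sketch; the table, columns, and data are made up for this example.

```sql
-- Arrays and ranges are first-class column types; uuid is built in as well.
CREATE TABLE meeting (
    id        uuid PRIMARY KEY DEFAULT gen_random_uuid(),  -- from pgcrypto, or built in on newer versions
    attendees integer[],                                   -- array of any Postgres type
    held_at   tstzrange                                    -- range of timestamps with time zone
);

INSERT INTO meeting (attendees, held_at)
VALUES (ARRAY[1, 2, 3],
        tstzrange('2018-08-01 09:00+00', '2018-08-01 10:00+00'));

-- Containment (@>) and overlap (&&) operators work on arrays and ranges.
SELECT id
FROM meeting
WHERE attendees @> ARRAY[2]
  AND held_at && tstzrange('2018-08-01 09:30+00', '2018-08-01 11:00+00');
```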
But, you know, before that, Postgres actually was able to store XML documents and you could do a lot of work around those, as well as generating them, as well as it had its own key value based data type called hstore, which actually predated JSON when you look at the history of key value storage. Postgres has a lot of built in geometry types as well. You can store points, lines, circles, polygons. And extended upon that is a geospatial extender called PostGIS, another tool whose pronunciation you can debate, but what PostGIS does is it can store geographic types and geometric types that are used in mapping. And it's actually a tool that's widely deployed and widely used in commercial spaces.
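A minimal sketch of the document-style types discussed here; the table, keys, and values are hypothetical.

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE event (
    id      bigserial PRIMARY KEY,
    payload jsonb,    -- binary-formatted JSON
    tags    hstore    -- simple key/value pairs, predating JSON support
);

INSERT INTO event (payload, tags)
VALUES ('{"type": "signup", "user": {"id": 42}}',
        'source => web, campaign => fall');

-- Introspect the JSONB document with operators; containment queries can use a GIN index.
SELECT payload -> 'user' ->> 'id' AS user_id
FROM event
WHERE payload @> '{"type": "signup"}';
```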
And, you know, as you mentioned earlier, Tobias, there's also custom data typing. You can write your own data type. You can write it right within Postgres. You can write a C extension to do it, you know, whatever you feel most comfortable with, as well as building out composite types. So I think, you know, what I'm getting at is that there's a lot of data type support out of the box, and it's very easy to store structured information in a way that's easy for an application developer to access. Of course, you know, as I was going through that, it's 1 thing to store information, it's another thing to retrieve it, and in order for these data types to be useful, you need to be able to search over them, you know, particularly quickly.
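For example, a sketch of a composite type and an enum-style custom type defined entirely in SQL; the names are invented for illustration.

```sql
CREATE TYPE dimensions AS (          -- composite type built from existing types
    width  numeric,
    height numeric
);

CREATE TYPE status AS ENUM ('draft', 'active', 'retired');  -- simple custom type

CREATE TABLE widget (
    id    serial PRIMARY KEY,
    size  dimensions,
    state status
);

INSERT INTO widget (size, state) VALUES (ROW(10.5, 4.2), 'active');
SELECT (size).width FROM widget WHERE state = 'active';
```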
And, you know, Postgres offers many ways of doing that with indexing. You know, most relational databases implement something called the B tree index, which doesn't actually, well, everyone thinks, I always ask, you know, what does B tree stand for? And people say, oh, balanced tree. And I'm like, no, it actually stands for B tree. You know, it's not abbreviated for anything. But what a B tree is, it is a balanced tree, and, you know, it makes it very easy to do equality and, you know, inequality operations to be able to search through data, you know, within an index. Postgres goes beyond that with its standard B tree index where it's able to handle, you know, multi column indexes, expression indexes. So if I want to index over an immutable function, I can do that, which, you know, you can see is very useful in looking up things like all lowercase email addresses and the like.
Postgres also supports something called a partial index where you can index only over a subset of the rows by adding a where clause to the index. You know, for instance, it's very useful if, let's say, I'm partitioning my information by active or inactive, and I wanna make sure I can quickly look up all the active information in my table. I would probably add a clause at the end of the index saying, you know, index this ID where active is true. Now the B tree index is great for the primitive data types, but when you have things like, let's say, geometric shapes, it might not work. Why is that? Well, if I wanna say find me all the squares that are equal to this square, what does that actually mean? I mean, I could probably look up squares that are equal in area, in which case I would put an expression index on my column containing squares, but, you know, that's not necessarily useful in a geometric query. Geometric queries, particularly in mapping, it's more about am I overlapping with a space? Am I within a space? And things like that. So the Postgres community added an index to handle this called the generalized search tree, or GiST, and this was set up to handle different arbitrary indexing schemes.
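As a hedged sketch of the expression and partial indexes described above (the users table and its columns are hypothetical):

```sql
-- Expression index: supports fast case-insensitive email lookups.
CREATE INDEX users_email_lower_idx ON users (lower(email));

-- Partial index: only rows where active is true are indexed.
CREATE INDEX users_active_id_idx ON users (id) WHERE active IS TRUE;

-- Queries written against the same expressions and predicates can use these indexes.
SELECT * FROM users WHERE lower(email) = 'person@example.com';
SELECT * FROM users WHERE active IS TRUE AND id = 123;
```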
1 of my favorite features of the GiST index is something called a KNN GiST query, or a k nearest neighbor lookup. These are useful for finding things that are close to you, like, you know, find the 5 closest coffee shops to where I'm standing. Those indexes, you know, 1 interesting thing about those is they take a little bit more time to set up because they do have to store a little bit of information, but it makes these kinds of geospatial and geolocation queries very fast. The GiST index you can also see being used for arrays and ranges, and, you know, particularly it's used very heavily by PostGIS. It's definitely a very powerful, flexible index. GiST, you know, is still a tree based index, so you would like it to be relatively balanced, but there's an indexing type called SP-GiST, the space partitioned GiST index, which is better for unbalanced data.
SP-GiST was introduced, I believe, in Postgres 9.3 for dealing with unbalanced range types. And it's something that I've used that works very well if you have data where it's not gonna fit your normal balance of, you know, what you see in a B tree or a regular GiST type index. Another 1 that's very helpful is the GIN index, which is the generalized inverted index. GIN was introduced to help primarily with full text search within documents, and actually, I should mention, full text search documents have their own data type as well.
It's also used for introspecting JSON documents. This is where we get the very fast JSONB queries, and it has some support for arrays as well, being able to introspect elements within, say, integer arrays. Another useful index is the BRIN index, the block range index, which was developed for looking over information in very large tables. So if you have, let's say, a table that has a billion rows and you're storing integer data and you wanna be able to narrow down fields quickly, the BRIN index is designed to, you know, quickly get you to the set of rows that you want. It's actually very space efficient as well. Yeah. So maybe to tie it all together, what makes Postgres work so well just fundamentally from an application developer standpoint is this, you know, data type support and this indexing support combined with all the functionality that you can get around manipulating and introspecting data types.
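Tying those index types together, here is an illustrative sketch; all table and column names are made up, and the KNN lookup uses the built-in point type rather than PostGIS geometries.

```sql
-- GiST with a KNN-style ordering: five closest shops to a point.
CREATE INDEX coffee_shop_location_gist ON coffee_shop USING gist (location);
SELECT name FROM coffee_shop ORDER BY location <-> point '(40.7, -73.9)' LIMIT 5;

-- GIN for JSONB containment and for full text search over a document body.
CREATE INDEX event_payload_gin ON event USING gin (payload);
CREATE INDEX document_body_fts_gin ON document USING gin (to_tsvector('english', body));

-- BRIN for very large, naturally ordered tables (e.g. append-only timestamps).
CREATE INDEX sensor_reading_time_brin ON sensor_reading USING brin (recorded_at);
```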
But, you know, this is just the tip of the iceberg. Now a lot of the other features that are very important in Postgres, you know, are around what you would say is keeping your data reliable and safe. Postgres implements a concept called write ahead logging, which basically records every single change that's made to your database. And this is important because it gives you a playbook for what is occurring in your database, and, you know, in the case of disaster, you can restore your database back to a particular point in time, also known as point in time recovery. You know, Postgres 9.0 is when native streaming replication was added.
There's been a lot of different functionality added around improving, you know, different ways you can copy your data from 1 Postgres server to another. It could be done asynchronously, which is, you know, a very fast operation saying, alright, I've written this data on my primary, you know, send it off to all the standbys, and I'm done with it. It could also be synchronous, where I could say, alright, I've written this data to my primary. Now write it to the standby, and don't say the transaction's committed until it's been marked as committed on the standby. Since Postgres version 10, there is logical replication, which allows applications to connect to Postgres and read in all the different transactions and decide, you know, how those applications wish to handle them.
Logical replication also could be very important in doing online upgrades between versions of Postgres after version 10.
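A hedged sketch of the logical replication workflow added in Postgres 10; the table, connection string, and names are placeholders.

```sql
-- On the publishing (source) server:
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- On the subscribing (target) server, e.g. a newer Postgres version during an upgrade:
CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=source.example.com dbname=shop user=replicator'
    PUBLICATION orders_pub;
```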
[00:24:48] Unknown:
Yeah. And 1 of the approaches that a lot of application developers take is to try and treat the database as just dumb storage and not leverage all of the features of it, which helps with portability of the application, but you end up hamstringing yourself in terms of a lot of the performance gains and additional utility that you can get from your storage layer. So it's definitely worth investing the time figuring out what is available, and taking advantage of it, and potentially tying yourself to a given storage engine, particularly if it's something like Postgres that has been around for so long, and has such a proven track record of stability and performance and functionality.
And so given all of those reasons why Postgres is so popular, and the many besides that we haven't even begun to touch on yet, what are some of the cases where Postgres would be the wrong choice, where somebody should look at different storage options for their data? Yeah. No. I think it becomes very application specific at that point.
[00:26:11] Unknown:
And, you know, also making sure that you're, you know, basically using the right tool for the right job, which is a common engineering mantra. There's some cases that I've heard of, well, I should say 1 area I've seen Postgres used less at present is embedded devices. You know, I've seen some talks, I've heard some cases where people have run Postgres on embedded devices, but quite often, it might be, you know, good enough just to use SQLite, which is, you know, really optimized for some of those small devices and, you know, the use cases that are around those. There's also been some public, you know, quote, unquote performance issues or other, you know, cases where, you know, so and so company has stopped using Postgres and used another solution.
Of course, you know, you've seen just as many in the other direction as well. But the reason I do wanna touch upon that is, frequently, or often, when you see a report like this come up, the community really wants to understand, like, why this happened. Was it a configuration issue? Was it, you know, just lack of understanding of using Postgres? Or was it something that was wrong with the database that needs to be corrected? And, really, in all 3 buckets, the community tries to make improvements. You know, if we realize it's an awareness issue, like, what can we do to better talk about what the capabilities are of Postgres. If it's an architectural or development issue, the community typically will work on, you know, fixing that issue, and you often see that these issues that do get talked about that are actually engineering issues end up being fixed, you know, often in the next version. There are some, you know, I would say, you know, things that Postgres could definitely do better. I mean, write scaling is still something that, you know, it's been improving in subsequent versions of Postgres, and you can see some improvements for that coming in Postgres 11 with the improved partitioning work. And there are also some tools and extensions to help with write scaling, but, you know, natively, it's still you know, I mean, you don't have a parallel write mechanism yet in Postgres.
The other thing is people will often look at who can support Postgres or how many people are working on it. And, you know, this is a problem that has been gradually going away because we have more people contributing to the Postgres community in addition to more people just using Postgres. So you have this general, you know, community expertise just amongst, you know, all the different users. So, you know, it becomes less and less of a reason to not use Postgres. Then, you know, there are certain things where, you know, Postgres is probably not the right solution. If you're looking at, you know, high speed messaging or, you know, being able to, you know, communicate very rapidly between applications, you know, you might wanna use a tool like Kafka, or, you know, other asynchronous message brokers and tools like that, which is, you know, definitely something that I have done.
So, you know, again, you know, like I said, it really becomes application specific. I think, you know, when you wanna make sure that you you know your data is stored and there's 1, you know, there's essentially like a centralized data repository or even a distributed data repository where, you know, my data is stored here. I can always get it out. That's typically where, you know, Postgres, you know, you end up using Postgres and that ends up being a very large swath of applications.
[00:29:10] Unknown:
And 1 of the points that you made there of why people have historically chosen to move away from Postgres and use other platforms is the potential for confusion or misunderstanding, whether it's in terms of how to deploy and manage the database or configure it or how to use it at the application level. So what are some of the most common points of confusion for people who are new to Postgres, particularly
[00:29:40] Unknown:
if they do have prior experience working with other relational database systems? Yep. And, you know, that is a very good question. So 1 of the things that has been improving in the Postgres community but can still get better is, I'd say, ease of use and having user interfaces that are friendly, that make it easier to set up Postgres databases, interact with Postgres databases, and basically do the functionality of a lot of tools that you see in, let's say, the proprietary software world. There's a very good project called pgAdmin, which has, you know, been around for a while and that's continuing to improve. That helps with a lot of the database management functionality, and it is a very popular project in that regard. But, you know, it does focus, you know, primarily on Postgres itself. And, you know, as we mentioned earlier, you know, Postgres is very extensible. Things like backing up: Postgres does have backup capabilities, but there are more powerful tools that are extensions to Postgres that are better for large scale type backup solutions.
So once you come into the Postgres community and you see that there's all these extensions, it could be very overwhelming, and sometimes the right tool is not bundled with the core Postgres software. I find, you know, especially in talking to, let's call them, you know, ex-Oracle DBAs, that's where they'll trip up the most because, you know, they're able to navigate around a command line or, you know, be able to write, you know, very complex SQL. But not having everything right out of the box can be a little bit overwhelming. And I think that's, you know, that's an area where I do see us continuing to improve. And you typically see that, you know, if there's a tool that the community coalesces around, it might be able to make it back into the core software. You know, the other thing is the Postgres community tends to be very command line driven.
And by that I mean, you know, we don't use user interfaces as much. You know, especially the people who've been around a very long time. It's, you know, writing code in Vim or Emacs or, you know, even just, you know, I know some community members still read their email in text based editors, you know, from the command line. So, you know, there is a lot of comfort with just using the command line and Unix based tools, while, you know, coming from some other database platforms, there might not be as much comfort with doing that. That's typically where we see a lot of confusion. And maybe the last point, because, you know, we only just started touching on the functionality of Postgres: there's so much within Postgres. I mean, I can tell you, you know, I've been using Postgres for 15 years, and while doing some research for our discussion today, I learned some new things that I did not realize Postgres did. So, you know, even for veterans of Postgres, you know, you're constantly learning more and more about the functionality of the software, and that makes sense too. I mean, the code base is over 1,100,000 lines of C code. There's a lot going on there.
And, you know, sometimes it's hard to keep track of everything.
[00:32:28] Unknown:
Fortunately, you know, Postgres has phenomenal documentation, so you can look it up, but you still also need to be able to know what you need to look up as well. Yeah. That's definitely 1 of the things that I've bumped up against occasionally is not even knowing what to search for. But then when you do figure it out, as you said, the documentation is very extensive and descriptive. And, also, the overwhelming number of potential configuration options can turn somebody off who's coming from maybe a more straightforward database engine where they don't have to do as much tuning, or it's, you know, optimized for the default use case that everybody uses it for. Whereas Postgres has a good set of default options, but there are so many different ways that it can be leveraged. And there are a multitude of parameters that can interact in not necessarily very obvious ways. So once you go beyond the default of just reading and writing to the database and trying to tune it for a given workload, it could become very time consuming to research all of the ways that you can tune the database. So somebody might decide to just go back to their default choice of a different engine rather than actually figuring out all of the ways that they can
[00:33:47] Unknown:
optimize postgres for their application and then be able to leverage it to a greater extent than they would their prior engine. Yep. And and you hit the nail right on the head there. I mean, let's take let's just take an application where I might write once and read many times, but, you know, it's a relatively small database or a database where I write once and read many times, but it's an old lab data warehouse style database. I would tune those both of those databases very differently. And, I mean, of course, I would architect those databases very differently as well. And then take a write, you know, write many times, read seldom, you know, read seldomly. I would architect and tune that database very differently than, you know, those other 2. So 3 very common use cases right there and, you know, 3, you know, very different configuration sets.
[00:34:29] Unknown:
And just very quickly for somebody who is interested in digging deeper on how to tune postgres for these various workloads, what are some useful resources that you would recommend they check out? Yeah. So,
[00:34:42] Unknown:
so on the Postgres documentation, there is some guidance in terms of how these parameters work. Also on the Postgres website, there are some support resources that will link to either channels where you can talk to people who have these kinds of workloads and, you know, can provide advice, or some other online resources and presentations that offer these up as well. Attending local user groups is another great thing too because, I would actually say, at the NYC PUG last month, we had a speaker, Payal Singh, who talked about exactly this, like, how do you tune Postgres, how do you optimize it within a Linux environment, and talked about a lot of these different settings. The thing that makes data management so challenging is that it really can vary based upon your workload and the environment in which you deploy. Like, is your database within a cloud environment? Is it on bare metal? Are you in a private cloud where you're controlling all the resources around you? Are you in a public cloud where you have no idea what's running next to you? Because that will affect, you know, how you might wanna tune some of these parameters or handle some of this architecture. And then, you know, beyond that, it's, you know, how am I storing the data myself? Is there a particular table that I am reading and writing to much more than, you know, another 1? Is there a table that it's easier for me to federate across multiple Postgres databases because of, you know, whatever architecture I'm planning or, you know, how I'm actually writing this data to the database?
So it becomes so subjective, and this is kind of the art of being a Postgres DBA. The good news, you know, is that there are resources to do it, and you find the Postgres community is very helpful. You know, 2 popular ways to get some advice on this, you know, near instantaneously are the Postgres IRC channel, #postgresql on Freenode, and the Postgres Slack channel, which is also, well, I forget if it's a hashtag or not on Slack, and I'm actually on that channel. But there is a community PostgreSQL Slack channel where people are happy to help. But, yeah, probably the best way, though, I find to get particularly performance advice or get caught up to speed on it is at the Postgres Conferences, the Postgres Community Conferences, because you're guaranteed to have someone who will be discussing that topic, and you're also guaranteed to have people there who are very familiar with, you know, performance tuning as well. And
[00:36:52] Unknown:
in some of the most recent releases, there have been some very substantial improvements and new features added. And given the fact that Postgres is used in so many very performance intensive and very critical environments where you need to ensure a higher level of stability and reliability in the data layer, what are some of the strategies that the community uses to manage that balance as they update and improve the database for new types of workloads and environments?
[00:37:26] Unknown:
Well, first, I'm glad it appears that we're making substantial improvements. You know, that's always good to hear, and that's very good feedback for the community. I think the first thing is the community ethos in terms of maintaining that stability and security of the product. I mean, the community understands that, you know, data is 1 of, if not the, fundamental part of any application out there. I mean, it probably sits just above the operating system in terms of the part of the system that you need to be the most stable. You know, if your data gets corrupted and is unrecoverable, I mean, that could be the end of your business. So the community does adopt that mindset and does its best to ensure that everything that's written and released is gonna be as stable as possible.
You will see, you know, in the community as well, if you follow the pgsql-hackers list, you know, the debates that will go on, sometimes even over 1 line patches. But ensuring that whether it's 1 line or 10,000 lines, it's not going to affect the overall stability of the system. Features will not make it into Postgres because people think that it's going to affect the overall performance and stability of Postgres. You'll see some features, sometimes it takes years for them to get in, because the community really wants to ensure that that code is going to match Postgres' reputation for being stable. I remember following the saga of the upsert capabilities in Postgres and waiting while the debate raged on as to how best to achieve that. Yep. And, you know, again, it ultimately did make it in. It just took some time to make sure that we had the solution that the community felt comfortable with putting in. Another example was the binary JSON format, JSONB, where the release of that was held up, or the major release of Postgres 9.4 was held up, to ensure that we had the on disk representation correct. Because anything to change the on disk representation of a data type, you know, means significant pain in terms of upgrading. So the community wanted to make sure that we got it correct, it was optimized for how it handled compression, and everything was good to go for that release. And I think, you know, again, having that ethos is very important. I'll say that there's 1 feature that I really wish made it into Postgres 11 that did get held up because the community wanted to go through the code more and make sure that it was, you know, clean and stable and would not introduce issues into it. And, you know, just seeing that dedication, making sure that, you know, we have it correct, it also, you know, translates back to the user side, knowing that, you know, the Postgres community has its eyes out for not, you know, breaking the functionality or not creating a bad experience for, you know, for the community. Would it be great to have some of those features, like, pushed out? Yes. Of course. But I think people feel much better knowing that they're pushed out correctly and accurately and that their data is, you know, safe and secure on disk. The other thing is, you know, and this has, you know, changed a little bit over time, the release process is a little bit more formalized now in the community. Now the community will say we're gonna release things when they're absolutely ready, but, you know, it's the fact that we try to make a major release, you know, once a year, and that's usually targeted in Q3, roughly end of September, early October. Of course, I will give the disclaimer that don't quote me on that.
The other thing that's changed in recent years is having regular minor releases. You can go to the website, it's the roadmap page on the Postgres website, and you see that we try to do minor releases once a quarter. We actually have 1 coming up on August 9th. And, of course, if this airs after August 9th, then I'll also say we just had 1 on August 9th. And the minor releases basically are bug fixes and, if needed, security fixes for the software. And by having that regular schedule, it allows, you know, DBAs everywhere to plan their updates, you know, plan if they need anything beyond, you know, basically restarting the Postgres server. They can plan for appropriate downtime. The last thing is just the dedication of the community too, that it's not uncommon to talk to someone in the community who's been active for close to 20 years. You know, people who really love Postgres, people who really love contributing to it, and you see this dedication of, you know, people continuing to improve it. And sometimes, you know, people leave the community, and sometimes people might take a brief hiatus and then come back, you know, whatever it is. But the community has just always been very supportive of it and has always tried to put things in place to help ensure, you know, people can either, you know, pick up whatever work they were doing or be able to start contributing anew, you know, whatever makes sense for them.
[00:41:54] Unknown:
And 1 of the things that I think helps contribute to the overall staying power of Postgres as well is the combination of it being open source for so long and the very extensible plug in capabilities that sort of obviate the need to create any long running fork of the project to add new functionality, where some databases have had forks early in their history that never got merged back in because the added feature set was too incompatible with either the vision of the original project or the code base, where with Postgres, you can have some very drastic changes to the overall interface and capabilities of the database while still remaining code compatible and only requiring
[00:42:44] Unknown:
a plug in extension rather than a complete rewrite of the underlying code. Yeah. And I think, you know, you talk about this extensibility as well. You know, 1 thing that you touched upon earlier was how much of the business logic should an application developer put into an application away from the database versus in the database itself in terms of portability, and the nice thing about Postgres is you kind of have that choice. Postgres has this thing called the procedural language loader, which was originally written to allow you to write Tcl within Postgres. It later was extended to support PL/pgSQL, which is a language similar to PL/SQL, out of the box, as well as Python. But people have written other language extensions to Postgres such as R or JavaScript or Java. So what's interesting is suddenly, you know, you think about it, like, I can write Python within my Postgres database. I can put Python within a function, and then call that function that's executing, you know, maybe even NumPy or SciPy, and that gives you an interesting idea for how you want to architect your application. Now, you have to pay a penalty whenever you bring data out of a database over a network connection into an application. You can save a lot on bandwidth and, you know, arguably performance by doing some of this work, you know, within the database itself. Databases tend to be very IO heavy. You know, they're reading and writing to disk.
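As a hedged sketch of running Python inside the database (the plpython3u extension must be installed, and the function and table below are hypothetical):

```sql
CREATE EXTENSION IF NOT EXISTS plpython3u;

-- Plain Python executing server-side, next to the data.
CREATE FUNCTION word_count(doc text) RETURNS integer AS $$
    return len(doc.split())
$$ LANGUAGE plpython3u;

SELECT id, word_count(body) FROM document LIMIT 10;
```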
You know, they tend to not use as much CPU. So if you want to offload some of your calculations into your database, chances are you do have a lot of CPU available to do that. Yeah. I've definitely
[00:44:24] Unknown:
taken advantage of that, for doing some ad hoc parsing of XML without having to resort to XPath. I just wrote a Python function and loaded it into the database, and that made my report writing a lot easier. Awesome. And 1 of the biggest pushes in the data layer right now is the desire to enable largely distributed clusters of data stores. So there are projects such as CockroachDB that speak the wire protocol of Postgres but are a complete reimplementation of the engine itself. And then there are projects such as Citus, which is an extension to Postgres that adds the ability to scale out horizontally for write heavy workloads.
And 1 of the questions that comes up a lot is what plans there might be in the Postgres community to try and incorporate a native ability to be able to scale the write workload more horizontally versus vertically scaling a master node and then distributing the read transactions?
[00:45:34] Unknown:
Yep. So I think what you've seen, and I would say this probably started with Postgres 9.0, is this push for Postgres to be more of a horizontally distributed database. In Postgres 9.3, something called the Postgres foreign data wrapper was added; foreign data wrappers themselves were added in Postgres 9.2. Essentially, what a foreign data wrapper is is that it allows you to communicate with another, whatever it may be, another database, another file system, and you'll pull that information into Postgres and have it look like a Postgres type query. There's support for, you know, Postgres, MySQL, Oracle.
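A minimal sketch of the Postgres foreign data wrapper; the server, schema, and credentials are placeholders, not from the episode.

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER remote_shop
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', dbname 'shop');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER remote_shop
    OPTIONS (user 'app_user', password 'secret');

-- Make the remote tables queryable as if they were local.
CREATE SCHEMA remote;
IMPORT FOREIGN SCHEMA public FROM SERVER remote_shop INTO remote;

SELECT count(*) FROM remote.orders;  -- executes against the remote server
```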
You can post things to Twitter from Postgres. You know, and, again, like, everything in Postgres is extensible, so you can write your own foreign data wrappers. But in Postgres 9.3, the Postgres foreign data wrapper essentially introduced the ability to do distributed transactions in Postgres. If you open up a transaction on a local server and you're trying to do something on a remote server, it will also open up a transaction on that remote server. Now I had to do a little bit of research to figure out, you know, where Postgres was in terms of, you know, full support for distributed transactions, and 1 of the big things that's missing is error handling for certain failure cases where, based upon the failure and how things fail, you know, a database administrator might need to clean up transactions on another system.
That said, if it's a fundamental failure, where something goes down on the local side, it should be able to safely roll back the transaction on the remote server. And I did hear there is a project in the works for that kind of distributed transaction handling, to better handle those error cases, and hopefully it's able to make it into version 12. Additionally, you've seen all this work in Postgres on making it easier to partition. Postgres has been able to support partitioning for over 20 years, but a big advancement came last year during the Postgres 10 release with declarative partitioning, which is a partitioning syntax that people coming from other databases are very familiar with.
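For reference, a minimal declarative partitioning example in the Postgres 10 syntax might look like the following; the table and date ranges are made up for illustration.

```sql
-- A minimal declarative-partitioning sketch (Postgres 10+).
CREATE TABLE measurements (
    logdate date NOT NULL,
    reading numeric
) PARTITION BY RANGE (logdate);

CREATE TABLE measurements_2018_q1 PARTITION OF measurements
    FOR VALUES FROM ('2018-01-01') TO ('2018-04-01');

CREATE TABLE measurements_2018_q2 PARTITION OF measurements
    FOR VALUES FROM ('2018-04-01') TO ('2018-07-01');

-- Rows inserted into the parent are routed to the matching partition.
INSERT INTO measurements VALUES ('2018-02-14', 42.0);
```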
Postgres 11 significantly expands on that by making partitions easier to manage: more automatic management of how data can move between partitions or be removed from them, and the ability to push indexes down to the partitions. So partitioning is going to be much easier to use once Postgres 11 is released, which is, again, anticipated in Q3 of this year. Basically, you see a lot of work by the community to make it easier to do all of these things in Postgres. And I think that has been part of the community's mindset shift: we do try to make Postgres as extensible as possible, and we welcome people to write extensions for it.
But once we see that certain functionality should exist within Postgres itself, such as replication, which initially was handled only by extensions, the community will push to get it in, because I think you see this mindset of making things as easy for users as possible while maintaining that robustness and that flexibility of choice.
[00:48:37] Unknown:
And are there any other aspects of the roadmap of Postgres that you think are worth calling out?
[00:48:45] Unknown:
Yeah, there's a lot I'm very excited for. I'll try to touch on a few projects I know people are working on for version 12, and, again, I don't know if they will make it into version 12; the version 12 development cycle just started. But I'll touch on some that I've heard about and find very interesting. Postgres 11 is going to introduce just-in-time (JIT) compilation of various types of expressions. JIT compilation helps optimize various parts of a query, such as tuple deforming, which is essentially converting tuples from the way they are laid out on disk into the form you need in order to process them within the query itself, and certain places where you're evaluating expressions, such as in the WHERE clause of a query or parts of the SELECT list. We're introducing the JIT system in Postgres 11, and I'm sure there will be improvements made to it in 12.
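As a rough illustration, the JIT behavior in Postgres 11 can be toggled with configuration settings and observed through EXPLAIN; this reuses the hypothetical measurements table from the partitioning sketch above, and the cost threshold is lowered only to force JIT on a small data set.

```sql
-- Force JIT on for this session so even a small query gets compiled.
SET jit = on;
SET jit_above_cost = 0;

EXPLAIN (ANALYZE)
SELECT sum(reading) FROM measurements WHERE reading > 10;
-- With JIT active, the plan output ends with a "JIT:" section showing how many
-- functions were compiled and the time spent generating and emitting them.
```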
Another project that's been proposed for 12 is rewriting the executor, which is the part of Postgres that actually carries out the query plan, writing all of that information to disk or pulling it back out. It hasn't been rewritten in a while, and the proposal would help optimize it and make it even faster. It's certainly very interesting, and I'm curious to see what happens with that project. Partitioning will continue to be improved; I definitely see partitioning being part of the next several releases. And additionally, parallelism, or I should say at least read parallelism.
I know we're gradually working towards write parallelism, and I understand there are some challenges associated with that. But in terms of read parallelism, it was introduced in Postgres 9.6, Postgres 10 made some huge advancements in it, and we have some more coming in Postgres 11, so you'll continue to see this work presenting itself in Postgres 12. One other project that's really interesting is a pluggable storage layer. Right now Postgres only has one way of storing information on disk. It's a storage engine that's been used since, well, I don't even know how old it is; it's a little bit before my time being involved in the project, but it's been around, I want to say, since the beginning. There's a proposal for a new storage engine called zheap, and the idea is that it should be an improvement upon the current storage layer, but maybe not necessarily for all applications. We don't know yet; it's very new.
And in order to make this really successful, the Postgres community is evaluating introducing a pluggable storage API so people can write their own storage engines and attach them to Postgres. You see this already with MySQL; that's how MySQL was architected from day one. So there are some very large projects in the pipeline. One that I'm also interested in is the implementation of the SQL/JSON standard. Postgres does have a lot of functionality for interacting with JSON documents and running queries over them.
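For anyone who hasn't seen Postgres' existing JSON support, a small hypothetical example of storing and querying jsonb data looks roughly like this; the table, column, and payload are made up.

```sql
-- jsonb columns can be queried with operators like -> and @>, and indexed with GIN.
CREATE TABLE events (
    id      serial PRIMARY KEY,
    payload jsonb
);

INSERT INTO events (payload)
    VALUES ('{"type": "signup", "user": {"name": "ada", "plan": "pro"}}');

-- Extract a nested field as text.
SELECT payload -> 'user' ->> 'name' AS username FROM events;

-- Containment query; a GIN index keeps this fast on large tables.
CREATE INDEX events_payload_idx ON events USING GIN (payload);
SELECT * FROM events WHERE payload @> '{"type": "signup"}';
```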
The SQL/JSON standard provides the way the SQL committee recommends all of this be implemented: how to search over documents, introspect them, and the standard functions around it. That project was proposed late in the Postgres 11 development cycle, so people are iterating on it for version 12, and we'll see if it does make it in. And are there
[00:52:09] Unknown:
any other aspects of Postgres that we didn't touch on, of which I know there are many, that you think we should cover before we close out the show today?
[00:52:18] Unknown:
Well, let's see what I can do there. So, security. I'll touch on security real briefly. Security has always been an important part of the Postgres project: making sure that you can securely connect to the database, securely store your data, and ensure that only the people who have access to do reads and writes, or various administrative-level functions, can do so. I could talk about the access control system, which you can commonly think of as GRANT and REVOKE; that is probably an 8 hour talk in itself.
Authentication is something I will touch on, though. Postgres has a lot of different authentication methods, from trust, which you should only ever use in your development environment, because as soon as you set trust, people can access your system entirely, up to more advanced, enterprise-type authentication methods such as GSSAPI, SSPI, which is the Windows-based one, LDAP, and certificate-based authentication. There's even a verify-full mode where you have to ensure the certificates being used are validated by the certificate authority. In Postgres version 10, we introduced SCRAM-SHA-256 authentication as a step up from the MD5 authentication system that we had in place.
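A brief sketch of moving to SCRAM-SHA-256 on Postgres 10 or later might look like the following; the role name, database, and pg_hba.conf entry in the comments are hypothetical.

```sql
-- New passwords set after this will be stored as SCRAM-SHA-256 verifiers
-- rather than MD5 hashes.
SET password_encryption = 'scram-sha-256';
ALTER ROLE appuser PASSWORD 'a-strong-password';

-- Client access is then enforced in pg_hba.conf, for example:
--   host  appdb  appuser  10.0.0.0/24  scram-sha-256
-- rather than the older "md5" method (and never "trust" outside development).
```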
Postgres 9.5 also introduced something called row-level security. Postgres had column-level security for a while, but being able to control access to individual rows based upon whatever your policy says about who or what can see them was added then, and it continues to be improved upon as well. The last thing I'll touch upon briefly: I could talk about full text search, and Postgres does have full text search out of the box, but in Postgres version 11 we're going to support stored procedures out of the box. Postgres has long supported stored functions, but stored procedures give users the power to commit and roll back transactions within the procedure itself.
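A minimal sketch of what a Postgres 11 stored procedure with transaction control could look like is below; the table and the batching logic are made up for illustration.

```sql
-- Hypothetical procedure that deletes rows in batches, committing between batches,
-- which a plain function cannot do.
CREATE PROCEDURE purge_events_in_batches()
LANGUAGE plpgsql
AS $$
BEGIN
    LOOP
        DELETE FROM events
        WHERE id IN (SELECT id FROM events ORDER BY id LIMIT 1000);
        EXIT WHEN NOT FOUND;   -- stop once no rows remain
        COMMIT;                -- procedures may end and start transactions
    END LOOP;
END;
$$;

CALL purge_events_in_batches();
```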
That's a feature that people have wanted for a very long time, and we're excited that it's finally going to be rolling out. So, for anybody who wants to follow you and get in touch or,
[00:54:22] Unknown:
you know, follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And as a final question,
[00:54:30] Unknown:
from your perspective, what do you see as being the biggest gap in the tooling or technology that's available for data management today? So, of course, the tooling and technology are going to continue to improve. I would almost flip it: I still think the gap is awareness and education. I was fortunate; I decided to study computer science right after the dot-com crash, so it was a little bit of a taboo field to go into then. Now it's a very popular field, and I'm happy to see it, but you still see that a lot of these code academies, which are helping people get jump-started in the world of software engineering, tend to focus on particular topics in web engineering, quite often front-end web engineering, or data science.
But you don't really see many that cover data management as a whole. Part of it is that it's probably hard to study data management in a 12 week boot camp, or even in a 4 year major, because there's just so much involved in it. It's a bit of a harder skill to learn, but I think a very necessary one, and I would encourage people to study more around it because, as you might have heard in this podcast, there are many aspects you can learn and get involved in. So, to bring it back around, I do think it's an awareness problem and an education problem, and one of my goals as part of the Postgres community is to help make working with databases more accessible.
[00:55:49] Unknown:
Alright. Well, thank you very much for taking the time to join me today and talk about the work that you and the community are doing on the Postgres project. It's definitely a very useful and valuable database that I have used many times, so thank you all for that, and I hope you enjoy the rest of your day. Sure. Thank you, Tobias. I'll definitely pass on your kind words to the community. Thank you so much again for having me on.
Introduction to Jonathan Katz and PostgreSQL
Jonathan's Journey into Data Management and PostgreSQL
Overview and History of PostgreSQL
Key Features and Advantages of PostgreSQL
When Not to Use PostgreSQL
Common Points of Confusion for New Users
Balancing Stability and Innovation in PostgreSQL
Extensibility and Plug-in Capabilities
Future Roadmap and Horizontal Scaling
Security and Additional Features
Closing Thoughts and Contact Information