Summary
There is a lot of attention on the database market and cloud data warehouses. While they provide a measure of convenience, they also require you to sacrifice a certain amount of control over your data. If you want a warehouse that gives you both control and flexibility, then you might consider building on top of the venerable PostgreSQL project. In this episode Thomas Richter and Joshua Drake share their advice on how to build a production-ready data warehouse with Postgres.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started. The first 25 visitors will receive a Firebolt t-shirt.
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Your host is Tobias Macey and today I’m interviewing Thomas Richter and Joshua Drake about using Postgres as your data warehouse
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by establishing a working definition of what constitutes a data warehouse for the purpose of this discussion?
- What are the limitations for out-of-the-box Postgres when trying to use it for these workloads?
- There are a large and growing number of options for data warehouse style workloads. How would you categorize the different systems and what is PostgreSQL’s position in that ecosystem?
- What do you see as the motivating factors for a team or organization to select from among those categories?
- Why would someone want to use Postgres as their data warehouse platform rather than using a purpose-built engine?
- What is the cost/performance equation for Postgres as compared to other data warehouse solutions?
- For someone who wants to turn Postgres into a data warehouse engine, what are their options?
- What are the relative tradeoffs of the different open source and commercial offerings? (e.g. Citus, cstore_fdw, zedstore, Swarm64, Greenplum, etc.)
- One of the biggest areas of growth right now is in the "cloud data warehouse" market where storage and compute are decoupled. What are the options for making that possible with Postgres? (e.g. using foreign data wrappers for interacting with data lake storage (S3, HDFS, Alluxio, etc.))
- What areas of work are happening in the Postgres community for upcoming releases to make it more easily suited to data warehouse/analytical workloads?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Postgres used in analytical contexts?
- What are the most interesting, unexpected, or challenging lessons that you have learned from your own experiences of building analytical systems with Postgres?
- When is Postgres the wrong choice for a data warehouse?
- What are you most excited for/what are you keeping an eye on in upcoming releases of Postgres and its ecosystem?
Contact Info
- Thomas
- JD
- @linuxhiker on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- PostgreSQL
- Swarm64
- Command Prompt Inc.
- IBM
- Cognos
- OLAP Cube
- MariaDB
- MySQL
- Powell’s Books
- DBase
- Practical PostgreSQL
- Netezza
- Presto
- Trino
- Apache Drill
- Parquet
- Parquet Foreign Data Wrapper
- Snowflake
- Amazon RDS
- Amazon Aurora
- Hyperscale
- Citus
- TimescaleDB
- Greenplum
- zedstore
- Redshift
- Microsoft SQL Server
- Postgres Tablespaces
- Debezium
- EDI == Electronic Data Interchange
- Change Data Capture
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because the number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all of this collaboration chaos firsthand, and they started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Thomas Richter and Joshua Drake, heretofore known as JD, about using Postgres as your data warehouse. And, Thomas, can you start by introducing yourself?
[00:02:09] Unknown:
Hi. I'm Thomas, CEO and cofounder of Swarm64. We are an extension for Postgres that accelerates complex queries and data-heavy queries. So, basically, the classical query types you would see in a data warehousing
[00:02:23] Unknown:
environment. And, JD, how about yourself?
[00:02:25] Unknown:
I am the founder of United States PostgreSQL, as well as the founder of Command Prompt Incorporated, which is a full stack PostgreSQL consultancy and support provider.
[00:02:35] Unknown:
And we've been around for over 20 years. And going back to you, Thomas, do you remember how you first got involved in the area of data management?
[00:02:42] Unknown:
Yeah. My first exposure to that was really in what you would probably today call sales analytics or in-house analytics, and back then it was called sales steering. That was just around the turn of 2000. And, there was an internship. And, basically, as the intern, I would do the daily data preparation and so on. So back then, it was IBM with Cognos on a big data warehousing system. Everything was, back then, still OLAP cubes, a very prescriptive way of how data is queried, and that was, in a way, a lot of fun.
And as you can see, it left a lasting impression. So that was my first endeavor into data science, and then, basically, I've always been close to data processing throughout my entire career. Worked in IT systems for airlines for a while, then got into the wonderful world of start-ups, worked on projects around information access and mobile technology, and then eventually decided to create my own start-up. So that's really my process of going into data processing, and I've always been very close to MySQL, MariaDB, and Postgres
[00:03:53] Unknown:
throughout my career so far. And, JD, how about yourself?
[00:03:56] Unknown:
I have a slightly different background. I actually fell into databases. I worked for a company called Powell's Books, which is the largest freestanding bookstore in the world, actually. It's in Portland, Oregon. And there was a technical bookstore. This is back in the days when Linux was just coming about. C and C++ were the primary platform for everything. Java hadn't even been put into production. And I wrote a special order database system in dBase of all things. And that kinda landed me in the direction of going into web development and those types of things. And I ended up picking up Postgres before it was PostgreSQL. It was called Postgres95.
And that was the actual first exposure to Postgres that I had had. And because of that, Command Prompt was created, and I wrote a book for O'Reilly called Practical PostgreSQL all the way back in 2001. And I've continued to use PostgreSQL and support the community since then.
[00:05:05] Unknown:
For today, we're talking about Postgres in the context of data warehousing. But before we dig too deep into that, can we just establish a working definition of what constitutes a data warehouse? Because I know that there are sort of a few different ways to look at it and a few different approaches. So for the purpose of this context, if you can each just kind of give your thoughts on the framing for this conversation.
[00:05:29] Unknown:
Well, Thomas, what do you think about that? Because the traditional data warehouses that you and I, you know, would consider, they're not like that anymore. We're getting a lot of hybrid workloads, things like that.
[00:05:41] Unknown:
Yeah. So I think data warehouse is actually a changing definition, so to speak. In my view, it is basically a system that holds your data for analysis, and that could actually take various sizes. And I think in the old days, it used to be very specific. It used to be very offline, as in you would put data in, and it would then precompute things and make them available for analysis. I think today, the definition is shifting much more towards you can store a large quantity of data, often transactionally consistent, and then retrieve that. And there's no more prescription around, okay, how recent does your data need to be? Is there any offline activity? So there's more of a real-time aspect to it. So, really, a large, heavily parallel database that allows you to ask questions and get the answers back in a reasonable time, or to provide batch jobs that can then be answered, for example, in the nightly run and provide that information by the next day.
That's how I would define a data warehouse. JD, maybe you can help with the Yeah. I would agree.
[00:06:49] Unknown:
I mean, traditionally, people might have something like a Netezza, which is actually a fork of PostgreSQL. It's actually a fork of PostgreSQL 7.2. But nowadays, what we've started to see, you know, in our support and consulting, is that our clients don't really want to make the distinction. They may use something like tablespaces to push their data for warehousing off to different storage than their transactional loads. But we are definitely finding that a lot of our new business, a lot of our migratory business, is how can we get off of these, you know, monolithic, large, expensive platforms and just run Postgres?
And how do we make Postgres do what we want when it comes to data warehousing? And it's usually, as you said, right, it's about analyzing whatever data that you put in there. It's not necessarily transactional in nature. It's more reporting in nature, machine learning in nature, that type of thing. In terms of the out-of-the-box experience with Postgres,
[00:08:00] Unknown:
what are some of the shortcomings or limitations when trying to use it for these scalable analytical purposes versus the transactional, row-oriented workloads that it's built to handle?
[00:08:14] Unknown:
And, Thomas, I'd like to get your feedback on this. But from my perspective, there is no inherent limitation, except for either a lack of knowledge or skill set around how PostgreSQL does things, or the fact that you usually want value-added tools. Postgres on its own is a perfectly viable data warehousing platform. It's also a perfectly viable transactional platform, which is what it was originally designed for. But when you're talking about a data warehouse, you're talking about usually a volume of data that is at a scale that requires more architecture than just let's install it on a Linux box and see what happens.
[00:08:57] Unknown:
Thomas, what's your experience with that? Yeah. I would agree. I think Postgres is definitely an underrated tool in the data warehousing space, and I think we'll see a lot more growth of using Postgres as a data warehousing tool. And probably later on, I'll cover a little more of some really unique features that Postgres has to bring to the table. What generally comes up, for purely performance reasons, is, of course, the column store versus row store argument. So transactional data is where data is usually changing within the same row, as in, for example, if you update a customer address, you would typically update the street name, the ZIP code, and the house number, for example, at the same time.
And that would all be in the same row, versus analytical workloads where maybe all you care about is finding out which customers are in which ZIP codes, and then you would not care about the street name or the house number. So a classical example is that most pure-play data warehousing products would, of course, have a columnar storage format, and Postgres has a row store. However, as JD was saying, with the right level of skill, you can work around these things. And, in general, it really then comes down to performance. So the level of parallelism Postgres can deploy, for example, onto a single query is more restrictive than, for example, other solutions, also other solutions from the Postgres ecosystem.
The way data is accessed is slower given it's a row store versus a column store, at least for that specific purpose. But in general, there's no inherent technical limitation. And when it comes to extending Postgres into high-performance data warehousing, there are many ways you can do this with extensions, with community projects, and I think we'll cover some of them later on in this podcast.
[00:10:48] Unknown:
As you mentioned, the definition of data warehousing has evolved a bit from what it was a couple of decades ago to where we are now with things like cloud data warehouses or systems such as Presto and Trino where it's just a query engine that runs across, you know, various different storage locations. And I'm wondering if you can just briefly kind of categorize the different types of data warehouse systems and where you see Postgres fitting within that, given the fact that there are these different extensions for being able to stretch it in different directions?
[00:11:22] Unknown:
Well, I think you kinda answered your own question there. You know, it fits where you want it to fit. I mean, that's the nice thing about PostgreSQL and that it's open source. Right? Although it's certainly not necessarily designed to fit into, let's say, a federated data warehouse, like, you know, a Snowflake or something like that. You're gonna end up in a much better situation by deploying Postgres for your data warehouse, because you can make it do whatever it is that you need it to do. Out of the box, it can be a data warehouse as it sits.
But if you need more performance, you can add something like Swarm64, which makes your B-tree indexes, and Thomas can speak more to this, much more performant, and saves quite a bit of memory resources, which means more resources and speed are made available. And you can federate it if you want to. Right? So I don't know that there is an actual spot that Postgres doesn't fit. You know, it goes back to the original statement of, you know, if you have the skills and the knowledge, unlike, say, MySQL. Right? MySQL is good at what it does, but it's very specific to what it does unless you, you know, buy a bunch of different clustering software and things like that. Whereas with Postgres, it's fantastic at what it does. And then when you hit a ceiling with it, you can extend it using either your own knowledge or the knowledge of other vendors and projects that have already extended it for you. You just have to install it and configure it. Yeah. Maybe one way to answer this question is also to look at what it does exceptionally well. Because I think it's no secret
[00:13:08] Unknown:
that there is a very, very large amount of data analytics solutions out there, quite simply because analyzing your data is such an important part of what you should do as an enterprise. So, clearly, it's a fast moving market. It's an attractive market. There's many, many solutions out there. And so what we've been asking ourselves recently is really, like, where does Postgres play strongest? And, of course, it has all the features that JD has mentioned. And I think the open source aspect and the versatility is extremely strong. But I think another area that is maybe, I would say, underrated, so I would just like to talk about this a little bit, is to have the ability to enhance your analytics with all the transactional features of Postgres.
We call this internally TEA, as in transaction-enhanced analytics. It is in this hybrid space where you're basically no longer distinguishing clearly between analytics and transactions, but your workload is a purely analytical one. You're using transactions as a means to augment it. So let me give you a few examples. First one, clearly, transactional consistency, as in very easy to debug because, you know, if something is there, it's there. And if something is not there, it's not there. So there's no guessing, no eventual consistency. Then, of course, the high availability that Postgres has. There are many, many different ways you can integrate Postgres into these zero-downtime or six-sigma or five-nines type environments, and that has been done all over the world. So you've got something that is very, very powerful. So if you apply those two concepts, you will find that very, very few databases in the world, all up, actually satisfy that. But then you can add a third thing, and I think that really makes it then even stronger.
You have the ability to prevent and heal human errors. These are different things we've seen at clients. For example, you can enforce data consistency or unique constraints, which a lot of your classical data warehousing tools will not do. They will not allow you to say, for example, I want a certain customer ID to be consistent with the customer IDs in my other table. Those kinds of things are usually limited to the kind of databases like Postgres that can truly do that and enforce these constraints. Or, on the other hand, things like the ability to roll back even major changes. So unlike a lot of other databases, in Postgres you can even wrap creating or dropping an entire table in a transaction. So let's say you're operating this database. You're making a mistake.
There's the ability to roll back the transaction and basically undo your mistakes. So these are just three examples. And then the fourth one I would add is that there are actually a lot of nice things in how the transactional features of Postgres can expand how you can do, basically, analytical data handling. So just one example we've seen in the field that we found super interesting. And I have to abstract that a little bit because the exact case was under NDA, but I'll give you an example that's very similar. So instead of taking this classical append-only design where you would, for example, record every update of the shipment status of an item.
You would put that all into a table, and you would only append. Instead, what you can do is you can, for example, have a JSON field where you're concatenating all the shipment updates with an update statement. So as opposed to just adding to a table that grows and grows and grows, you can actually just update the individual field. And the benefit is then that, for example, asking for a shipment that has transitioned from loaded-to-delivery-van directly to lost suddenly becomes a much easier query to ask, because you do not have to reconstruct your table from all your append-only structure and find out in what order updates actually happened. It's all there. It's all within a single field that will provide you, for every single shipment, the complete history of what has happened in terms of shipment status updates.
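To make that pattern concrete, here is a minimal sketch under assumed names; the shipments table, its columns, and the status values are hypothetical illustrations rather than details from the actual customer case:

```sql
-- Hypothetical shipments table: one row per shipment, with the full status
-- history kept in a single JSONB array instead of an append-only log table.
CREATE TABLE shipments (
    shipment_id    bigint PRIMARY KEY,
    customer_id    bigint NOT NULL,
    status_history jsonb  NOT NULL DEFAULT '[]'
);

-- Each status change updates the existing row by appending to the array.
UPDATE shipments
SET status_history = status_history
    || jsonb_build_object('status', 'loaded_on_van', 'at', now())
WHERE shipment_id = 12345;

-- Questions about a shipment's life cycle read one field instead of
-- reconstructing the ordering from an ever-growing append-only table.
SELECT shipment_id, status_history
FROM shipments
WHERE status_history @> '[{"status": "lost"}]';
```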
So it's those kinds of things that really let Postgres shine even against a massive backdrop of products that are out there. And that's why we think more people should know about it, and one of the reasons we're having this podcast today. I would actually like to add a fifth point to that. Those first four points
[00:17:46] Unknown:
are absolutely on target. And I think that it really brings about the true versatility of Postgres as a data warehouse. But the one thing that we're forgetting is that Postgres can be the data warehouse for all the other platforms. You can directly replicate and ingest, say, from Oracle. So if you have an Oracle environment that, obviously, for whatever reason, you can't get rid of, you can't migrate off of, that's fine. But you don't have to deploy any more Oracle. You can just deploy Postgres and replicate from Oracle to Postgres to allow Postgres to be the data warehouse. And you can do it with Microsoft SQL Server.
You can do it with any major platform that's out there. So you end up in an environment where Postgres becomes, you know, the database of choice over time, because it satisfies the need of not having to deploy legacy platforms.
[00:18:45] Unknown:
And that brings to mind things like the Presto/Trino engine or things like Drill for being able to be this distributed query engine across various underlying storage layers. And I know that Postgres has the concept of the foreign data wrapper for being able to do something similar. I'm just curious if you can dig in a bit to some of the trade-offs or relative strengths of Postgres being used as this focal point for all of the data and being able to federate those queries across different storage engines, versus something like Presto or Drill that is designed to be just a query engine with no native storage layer? Well, Presto actually does work with Postgres, so we should throw that out there.
[00:19:26] Unknown:
For me, the main limitation that you're going to have with using, say, foreign data wrappers or something like that is, well, there are two. One is that if you wanna go transactional, you are going to be limited by the quality of the FDW itself. So for example, if you're federating Postgres, so you're using FDW across multiple Postgres instances, the quality of performance is going to be much higher than, say, federating Postgres to Oracle. Because there are just certain things we can't see in Oracle, and therefore, when you're federating the data out, you're not gonna have optimum query plans and things like that.
[00:20:09] Unknown:
Thomas, what do you think? Yeah. I think you're also highlighting here one of the key points when it comes to the relative strength of these solutions. So if you're looking at something that is essentially a query engine across a range of platforms, you're losing a lot of the points we made earlier around the absolute transactional consistency, the ability to have very predictable high availability setups, this ability, for example, to roll back changes, because you're just an engine querying whatever data is there. It's basically taking things into a lot more of an offline logic. And, of course, for certain cases, that's what the user may want. So then a tool like Presto is the perfect tool. But as you're moving more into kind of this reactive real time, as we said in the beginning of the call, the ability to kind of, you know, have that fluidity between is it now very recent data or is it offline data and you don't have to distinguish anymore, I think then you will very quickly hit the limits of those query engines that are just running over a host of different data storages.
However, there are very valid reasons in which you may actually want Postgres to just access data stored outside. So we've just had a customer case that we won based on that. And in the end, we ended up using the foreign data wrapper for the Parquet storage format. So that's a foreign data wrapper developed by the community. We have basically contributed a little bit to it, and that allows us to just read Parquet files that are stored anyway, like, outside of the database, because that's how the customer's architecture works, and you can just bring that into Postgres. So you do have these federating capabilities or these foreign access capabilities in Postgres as well, but I think it does play strongest if you want that predictability, transactional consistency, and high availability concept you can set up without actually having to develop a PhD in computer architecture. So I think that would be my take on it. In those areas, Postgres is really playing very strong.
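For listeners who want to experiment with that approach, a rough sketch with the community parquet_fdw extension might look like the following; the file path, column list, and table names are assumptions for illustration, not details from the customer engagement:

```sql
-- Assumes parquet_fdw is already installed on the server.
CREATE EXTENSION IF NOT EXISTS parquet_fdw;

CREATE SERVER parquet_srv FOREIGN DATA WRAPPER parquet_fdw;

-- Expose an existing Parquet file as a read-only foreign table.
CREATE FOREIGN TABLE sales_archive (
    sale_id  bigint,
    sold_at  timestamp,
    amount   numeric
)
SERVER parquet_srv
OPTIONS (filename '/data/lake/sales_2019.parquet');

-- The Parquet data can now be queried like any other Postgres table,
-- including joins against regular (heap) tables.
SELECT date_trunc('month', sold_at) AS month, sum(amount) AS revenue
FROM sales_archive
GROUP BY 1
ORDER BY 1;
```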
[00:22:17] Unknown:
Actually, that's a good point, Thomas. One of the things that has happened, as the new generation of developers comes about and items are moving, for example, to the cloud, is that a service such as S3 is very popular. And you may not want a certain class of developers or certain code that's being executed to have access to the database itself. And so what would happen is that you can use the FDW technology to actually link to an S3 bucket. And people can upload their S3 data, you know, JSON objects or whatever files that they're putting in there. And because of how versatile Postgres is with the FDWs, you can use that as an ingestion point that's always live.
So you just have some database job that says, okay, grab the latest data from S3 out of this table, which appears to be a table, but it's actually S3, and manipulate it any way you need to ingest it into the actual data warehouse storage that Postgres is
[00:23:22] Unknown:
providing. And that gets into things like the relative cost per byte and cost per performance of Postgres. We are aware that if you're storing the data in S3, you might have a cheaper cost per byte, but it's going to be slower in terms of the latency to actually query across that data. In the other direction, there are these cloud data warehouses that are aiming to decouple the constraints of storage and compute so that they can scale independently of each other. And I'm wondering how you see Postgres contending in the sort of overall landscape of cloud data warehouses and object storage as the kind of canonical storage layer for the growing big data ecosystem?
[00:24:04] Unknown:
It's kind of an interesting question, because if you look at what the major cloud providers are doing, the big three, they have all, especially the top two, being Amazon and Microsoft, put an enormous amount of not only engineering, but infrastructure into deploying Postgres. You know, you can't buy something from Amazon without using Postgres. And when you start getting into, you know, the data warehousing scenarios, I think what's, you know, really gonna come about is that we're just gonna see a continual expansion of that. When you look at the market, I think that there's always going to be, say, you know, well, like, your Snowflakes, for example. I think they're all going to always exist because everyone likes to reinvent the wheel. That's kind of the nature of humanity.
What the market is doing and what the developers are pushing and where the growth is, is that at some point somewhere, you're going to be running Postgres. You know, whether it's RDS or it's Hyperscale, which is, you know, Citus running on top of Postgres. And you're right, no matter what, somewhere, you're going to end up running Postgres, because it's becoming, well, it's becoming the Linux of databases. Right? Everyone, whether they're running Apple or they're running Windows, somewhere is running Linux. And a lot of times they don't even realize it. And Postgres is doing the same thing, and it's very much becoming pervasive throughout the cloud because it provides a uniform, standardized, highly functional, transactional, you know, environment, just as Thomas was stating, for your data warehouse or your transactional load or both.
And I'm not sure how we're gonna get away from that at this point. I mean, I would imagine that within 5 years, it's not a question of whether or not you're running Postgres. It's a question of, you know, which version. Right? Whether it's, you know, pgcloud or whether it's Nitrous, or whether it's, you know, even say like a Timescale for time series data, it's all Postgres at that point.
[00:26:08] Unknown:
Digging more into how to actually take your out-of-the-box Postgres, unadulterated, no extra capabilities added in, and turn it into a data warehouse, what are the things that you would do to actually move it from, you know, I just apt-get installed Postgres, to I'm now using it as the data warehouse for my entire enterprise?
[00:26:35] Unknown:
So first, kudos for saying apt-get. I'm a big Ubuntu fan, Debian fan. That was really kinda great to hear. You know, out of the box, you know, you're talking about configuration. Right? Making sure that you have the right storage, making sure that you have enough cores to run parallel queries, and making sure you have enough memory to manage the data as it's being, you know, manipulated and delivered. There's no native change. Right? It's not like you're flipping major switches or installing major extensions just to get to the point where you can start reporting. In fact, I would say the barrier to entry for a DBA to provide a data warehouse using Postgres is exceedingly low.
It's when you start reaching ceilings that you have to get more creative. Thomas can speak to this better, but something like Swarm64 would come in. Yes. It's all a question of scale, in our experience.
[00:27:32] Unknown:
So, you know, it depends on the size of the enterprise or the size of your data warehouse as a result of it. And, yeah, I would say it's a question of the quantity of data. It's a question of also what kind of response times your analysts expect. Generally, as JD is saying, the first thing I would do is I would adjust the configuration, and I would make sure that a certain amount of hardware is in place. So for those listeners that are actually actively considering a data warehouse, if you're starting with plain Postgres as your engine, I would recommend 16 cores at least. So 16 virtual cores or visible cores for your system.
So as in 16 hardware threads that Postgres can run on. Generally, we find that this works quite well if you have, let's say, one or two concurrent analysts accessing your data. That's the kind of quantity that can very nicely be used. Of course, Postgres really likes storage that responds quickly. So SSDs, which are commoditized enough in this day and age, are, of course, a recommendation. You should have an adequate amount of RAM to go with it. Let's say, as a rule of thumb, maybe a fifth or a tenth of your entire database size should at least be available in RAM. And I think those kinds of starting points will get you quite far.
However, as everybody knows, people want fast responses. They don't want answers in minutes. They want answers in seconds and so on and so forth. And then you have a range of choices.
[00:29:00] Unknown:
I think we're gonna talk about some of them in a moment. Yeah. I would like to add to that, Thomas, just to kinda geek out a little bit. The reason the number of cores is important is because of the architecture of PostgreSQL, which is process-based. In order to do more than one thing on a particular process, so if you're running a query that is calling out, you know, multiple sorts kind of thing, you need more cores, because it can only do one thing at one time per core. So your statement of 16 cores, I would agree with you. That's a very good round number. One, it doesn't cost all that much.
And two, it will allow PostgreSQL to parallelize these longer report queries across the number of cores to retrieve the data faster. Further, with the RAM, because of the way PostgreSQL handles RAM, there are two key parts, right? There's your working memory, and then there is your buffer pool, which is cache. And for data warehousing, the buffer pool isn't as important, but the working memory is. So when you're doing many sorts, you wanna be able to make sure that you have enough memory available to, say, for example, process a hundred megabytes worth of data before you start spilling that data to disk to start processing it. Now luckily, again, to your point with SSDs, that's not as expensive as it used to be, but it's still not as fast as RAM.
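As a rough illustration of the knobs Thomas and JD are describing, the settings below sketch one plausible starting point for a hypothetical 16-core, 64 GB machine; the exact values are assumptions to be tuned per workload, not recommendations from the guests:

```sql
-- Allow a single analytical query to fan out across many of the 16 cores.
ALTER SYSTEM SET max_worker_processes = 16;
ALTER SYSTEM SET max_parallel_workers = 16;
ALTER SYSTEM SET max_parallel_workers_per_gather = 8;

-- Buffer cache: one plausible split on a 64 GB box (tune to your data size).
ALTER SYSTEM SET shared_buffers = '16GB';

-- Per-sort/per-hash working memory, so large report queries sort in RAM
-- before spilling to disk, as JD describes.
ALTER SYSTEM SET work_mem = '256MB';

-- work_mem and the parallel settings apply on reload; shared_buffers and
-- max_worker_processes only take effect after a server restart.
SELECT pg_reload_conf();
```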
[00:30:32] Unknown:
Patrick is a diligent data engineer, probably the best in his team. Yesterday, when trying to optimize the performance of a query running over 20 billion rows, he was so eager to succeed that he read the entire database documentation. He changed the syntax. He changed the schema. He gave it his everything and reduced the response time from 20 minutes down to 5. Today is not a good day. Sarah from business intelligence says 5 minutes is way too long. John, the CFO, is constantly Slacking every living being trying to figure out what caused the business intelligence expenses to grow so high yesterday. Want to become the liberator of data?
Firebolt's cloud data warehouse can run complex queries over terabytes and petabytes of data in sub-seconds with minimum resources. No more waiting, no more huge expenses, and your hard work finally pays off. Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt today to get started, and the first 25 visitors will receive a free Firebolt t-shirt. Digging more into the actual extensions and plugins or forks of Postgres that you can use to make it more purpose-built for data warehousing, you know, maybe leaving out things like Redshift, which originated as Postgres but is not necessarily considered in the same camp anymore. But, you know, I'm aware of things like cstore_fdw, or the column store foreign data wrapper. As I was preparing for this podcast, I was reading about a project called ZedStore that's trying to add native column storage capabilities to Postgres as part of the core offering, and obviously Swarm64. Yeah. I know Greenplum has been around for a while. So what are sort of the relative trade-offs of those different open source and commercial offerings, and what are your considerations as you decide which ones to bring in for a given use case?
[00:32:21] Unknown:
One thing on that, like you brought up ZedStore, which is actually a great example of one of the great features of Postgres. And I know I'm beating this drum, but because it's open source, it also has a storage API. I mean, that's how Aurora works. Right? It's Postgres, and then underneath it, it's using the storage API to actually do a different type of storage than what you would natively do. ZedStore is also going to provide a different type of storage specifically for a certain type of workload and dataset. I think I would actually like to pass the baton to Thomas on this one, because of everything you mentioned, you know, Greenplum is fantastic. It is based on Postgres. It's open source.
You know, ZedStore, which is being driven by the same folks as Greenplum to try and get it into Postgres natively, is also very cool. The Citus people are wonderful. It's not really for data warehousing. They're recognizing the hybrid workloads, so they are adding some features for that, but it's not really designed for that. But the most interesting to me, which is why we're on this podcast, is Swarm64, because of what they're doing with their software to make Postgres be able to perform at a level where, as far as my experience in the field, for what Swarm64 is good at, there is no open source solution, period. So, Thomas, can you explain a little bit more about that?
[00:33:47] Unknown:
Yeah. Basically, what we decided to do is we decided to really fit exactly into the framework that Postgres provides, and that includes a few things. So Postgres in itself, with its extensibility that we've already mentioned, allows you to also augment the database without actually forking the project. So you don't have to recompile Postgres into a new project and, you know, turn it into something else. You can actually just provide additional features into the pluggable type of architecture that Postgres has. So, for example, under the bonnet, we are offering Postgres better ways to do large joins, which is something you'll have very frequently in data warehousing type workloads.
We allow it to execute better. We allow it to execute with more parallelism. We allow it to modify how data is moved and data is replicated during queries in order to basically spend less time moving data around cores and make the processing more efficient. So all these things we do behind the scenes. But then we also have implemented a column-store-based index. So it's called the columnstore index, and it is basically keeping a heavily compressed copy of your data in an index. You can define if you want your entire table to be kept additionally or if you want only certain columns of your data to be included.
Typically, we do see that the raw storage size on disk is anywhere between a fifth to, for example, where people do not choose all columns or the columns are heavily compressible, we've also seen examples of a twentieth of your actual data. And what happens is that this columnstore index is actually updated in the background. So it doesn't take away anything from your Postgres processing as it's happening. It updates itself in the background. But the moment it sees that you need to analyze large quantities of data, it will offer that access method. It will offer the ability to use the columnstore index.
And as a result, it will read an incredibly reduced quantity of data. So you can look at a factor of 10 to a factor of 100 less data that has to be read, and then also process it much quicker to provide it to Postgres. So in other words, as you're moving into these kinds of multi-terabyte, tens of terabytes, hundreds of terabytes type problem scenarios, this makes the difference between hours and minutes, or between minutes and seconds, in your query time. And we generally find that other examples, like in the community, there's a column store foreign data wrapper that, Tobias, you mentioned a moment ago, generally don't reach that level of maturity when it comes to access. So if you compare them side by side, and there's a free developer license available on swarm64.com.
So you can try that yourself. If you compare the column store foreign data wrapper versus Swarm64's columnstore index, you will see that it's a night and day difference in terms of query execution speed. Yeah. So long story short, if you want to have a Postgres that is still 100% a Postgres, behaves like a Postgres, needs minimal additional configuration, if any at all, and then just works also as a data warehousing powerhouse, then adding the Swarm64 extension is the easiest way to get there. And it is basically completely noninvasive. So we're not locking you in. We're not taking you anywhere outside of this Postgres usage paradigm of the 100% Postgres compatibility.
That was a very important design goal we had when we basically upgraded our product version last year from version 4.x. We had a few version 4 releases. And then going to version 5, we basically gained that 100% Postgres compatibility. Yeah. So that's how you can turn Postgres into a high-performance data warehouse very quickly. And then you'll be right up there with some of the best products, like, for example, Microsoft SQL Server with its columnstore index or Oracle's in-memory column store features. So you'll basically elevate Postgres to that kind of level.
[00:38:10] Unknown:
One of the things that people don't really think about either is that because of the way that Swarm64 is architected, one, your ceiling becomes much taller. But also, you save money as people continue to move to the cloud, where you're basically paying for every metric of resource, whether it be an IOP or memory or core or whatever. The high compressibility creates an environment where you need less RAM, for example, to process the same data. And further with Swarm64, Thomas has said that, you know, it's 100% compatibility, but I think another way to put this is that it's backwards compatible. So if we have upscaled Postgres, so now we're running Postgres with Swarm64, and then we decide, we'll never decide this because Swarm64 is awesome, but let's, you know, assume that we have to uninstall Swarm64.
Your data warehouse will still work. Right? And those are important features when you're talking about environments that can change dynamically based on the whim of a CTO decision. Right? So there's more than just a technical argument for Swarm64. There's also a business argument, I guess, is what I'm saying. Yeah. And to be honest, this was entirely by design from our side. We actually consciously chose that
[00:39:34] Unknown:
because, like, lock-in is something that is like a hidden cost. And I think everyone who is now maybe considering things like Snowflake and, JD, as you said, reinventing the wheel. I'm not sure if I could quote you on that, but yeah. So what people aren't seeing is they're basically creating a massive dependency, and they're creating a big debt for the future, for their future selves or for their successors to solve. And we decided that there are so many excellent database operations paradigms, and Postgres is so successful. It doesn't need new usage patterns. It doesn't need new SQL language. It doesn't need all these things. So we are actively keeping all our extra configuration parameters, for example, index definitions and so on, absolutely in line with the standard way of operating it. So if people decide that Swarm64 is not what they need anymore, then it's part of our feature set that they can actually do that and not be locked in. There's no migration.
And that, I think, is a very important feature that people should pay attention to because, I mean, as we all know, pricing models change. I think quite a few organizations getting into big Oracle licenses weren't expecting them to move commercially to a direction that they have since moved. So, you know, I think that's a very important design criterion and decision criteria.
[00:41:02] Unknown:
Well, that and, I mean, let's remember that in a decade, everyone's gonna move back off the cloud because it will no longer be the neat thing. Yeah. That could very well be. Right. All of a sudden, people are like, I can't believe I'm spending, you know, $50,000 a month when I could actually only spend $10,000 a month if I was willing to host it within my data center. And that's not to knock the cloud. I mean, it's been around for much longer than the term cloud. Right? Back in the nineties we were doing what's called virtual hosting. That was the cloud too. It's just everything continues to evolve.
[00:41:33] Unknown:
The way gaming laptops evolve, who knows? Maybe you'll be running your data warehouse on a gaming laptop in 10 years because it is all you need. No kidding.
[00:41:44] Unknown:
Digging a bit more into that cloud aspect of things, for somebody who maybe starts off with a Postgres database, and they're building out their data warehouse, and then they start loading in more and more data from more different sources. And then maybe they start to hit some performance limits because they have so much data on disk. What are the best practices or what are the recommendations that you have for being able to maybe life cycle some of that data out into, quote, unquote, cold or warm storage on something like S3, while still being able to query it? And just some of the strategies around data modeling and life cycle modeling for Postgres as your primary interface to all of your data.
[00:42:22] Unknown:
That's actually an interesting point. I was surprised you brought up S3 for that. That certainly wouldn't be a best practice, through S3. I think that S3 obviously has its place, but that's not it. The most common way, and what I would consider the standard way, is that you would roll data off onto a non-SSD volume. Right? Good old-fashioned spinning metal, and use tablespaces to do so. So you would have an environment where Postgres is just running like normal. But if you go into this schema with these tables that are partitioned, that schema and those tables and those partitions are all sitting on rotational media versus SSDs.
And there are a number of advantages to this. One is that rotational media is much cheaper and also much larger. So you can have an environment, whether you're hosting it yourself within a data center, or if you're in the cloud where you're mounting, you know, network drives, essentially. You're just able to add that volume at whatever size that you see or need, and then move the data over there, and you still have your standard interface. And what's great about this is that if the rotational media isn't fast enough, you still have the opportunity to do something like install Swarm64, which will, as already mentioned, greatly compress that data, or at least the indexes within that data, and allow much higher performance query capability, even though it's rotational drives.
That's kind of the standard way. Obviously, if you're talking about ancient data that you just don't wanna get rid of, you had mentioned cold storage, you know, it's not as simple. We can certainly roll it off using something like logical replication into a secondary node that's sitting on whatever you want it to sit on, and then you can just turn that node off. That is absolutely an opportunity. And actually doing that functionally gives you batch updates as well. Because if you were to turn that secondary node off, you know, if you're adding data, say, once a month, if you turn that node back on in a month and it reconnects, all that data will come down from your primary.
It'll replicate to that secondary node, and then you can shut that secondary node off again. So there are a couple of ways to do it, but the standard way of the warm storage or the rotational storage would be the most obvious with tablespaces.
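A minimal sketch of that tablespace approach, with hypothetical paths, table names, and date ranges, could look like this:

```sql
-- Tablespace on a cheaper rotational (HDD) volume; the path is illustrative.
CREATE TABLESPACE warm_hdd LOCATION '/mnt/hdd/pg_warm';

-- A range-partitioned fact table: recent partitions stay on the default
-- (SSD-backed) tablespace, older ones live on spinning disks.
CREATE TABLE events (
    event_time timestamptz NOT NULL,
    user_id    bigint      NOT NULL,
    event_type text        NOT NULL
) PARTITION BY RANGE (event_time);

CREATE TABLE events_2024 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

CREATE TABLE events_2022 PARTITION OF events
    FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');

-- As a partition ages out of the "hot" window, roll it onto the HDD volume.
ALTER TABLE events_2022 SET TABLESPACE warm_hdd;
```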
[00:44:56] Unknown:
Just to add to this, if you're happy to do a little more work, what you can do is you can use a foreign data wrapper. Again, I'm going to use the Parquet one as an example because, I mean, Parquet is a standardized format, and, of course, it is always important to use standardized things. It's been maintained by Adjust, a company from Berlin, and they're now part of AppLovin, a US company. And the nice thing is that Parquet itself is a very heavily compressed format. And what you can do, basically, is, for example, with a script or something similar, you could actively offload data from your Postgres data warehouse into Parquet files and keep those files, as JD was saying, possibly not on S3, but keep them on a comparatively cheap storage medium, like, for example, spinning disk based Elastic Block Store volumes. Yeah.
And you basically put it onto an EBS volume that's based on spinning disks, so that's quite cheap. And then it's heavily compressed, and then you can access it with a foreign data wrapper. So that's another opportunity to basically age your data away while still having the full ability to query it for analytical workloads. So I would say that's maybe a middle-ground solution. I think the tablespace solution mentioned is probably the easiest one for a quick fix. Also, Postgres has a history of working well with rotational media, with spinning disks. It is quite nicely optimized to work quite well with that. Then if you wanna be a little more involved and get greater savings, greater space savings, there's the way to basically copy data away to Parquet files and then have those pulled back in via, basically, a foreign data wrapper, still making it read-only queryable.
And then if you wanna go to extremes, you could basically make replications onto nodes that you switch off, and have a very kind of actively controlled and orchestrated mechanism where you basically have something being switched off most of the time and only being switched on once a month. So, yeah, there's a variety of ways in which you can do it. I think right now, it's not 100% automatic, as you could tell from our answers, but it is quite easy to set up. So if you're interested, I mean, a company like Command Prompt could probably set that up for you very quickly.
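One hedged way to wire up the middle-ground option Thomas describes is to attach an archived Parquet file back into a partitioned table as a read-only foreign partition (foreign tables can serve as partitions in modern releases); the file name and date range below are hypothetical, it reuses the parquet_srv server and events table sketched earlier, and whether this works in practice depends on the specific foreign data wrapper:

```sql
-- Suppose 2019 data was exported to a Parquet file by an external job.
CREATE FOREIGN TABLE events_2019 (
    event_time timestamptz NOT NULL,
    user_id    bigint      NOT NULL,
    event_type text        NOT NULL
)
SERVER parquet_srv
OPTIONS (filename '/mnt/hdd/archive/events_2019.parquet');

-- Attach it as a partition: queries over the whole events table still see
-- the archived year, but it no longer occupies expensive primary storage.
ALTER TABLE events
    ATTACH PARTITION events_2019
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');
```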
[00:47:20] Unknown:
We definitely could. Another area that has been seeing a lot of activity and variations in terms of its popularity is the data lake where there was the Hadoop era where everybody was all on the sort of commodity storage and MapReduce, and then now with object storage and these different query engines and cloud data lakes, and now there's the hybrid lakehouse. And, you know, we've talked a bit about sort of being able to move some of your hot data into warm storage on spinning disks. But for the case where you already have data in your data lake, you know, maybe it's already in parquet files, and there's maybe a Hive metastore that has all the table definitions. What are the capabilities for Postgres to be able to integrate with that and either, you know, migrate it all into Postgres or just have them live separately but be able to interop easily?
[00:48:08] Unknown:
I would definitely opt for the second, but I think JD can comment in a moment. But I would definitely think more about interoperability here, and Postgres being a consumer of your architecture there, as opposed to Postgres being the central manager of it all. That would be my take on it. So I think it's a fantastic strength of Postgres to be able to interoperate with these things. I think there's about, I don't know, 30, 40, 50 different foreign data wrappers, so there's a lot of formats that are supported. As JD has also correctly mentioned, they are of varying quality.
So I think there's a wide range where you can really rely on them and use them for production. There are others where you may wanna enlist the help of Command Prompt or Swarm64 to get them to a level where you can use them for production. But the strength is there for Postgres to interoperate with it. I'm not sure I would use Postgres as a data lake orchestration or data lake management engine. JD, what do you think? No, I would absolutely agree with that. I mean,
[00:49:10] Unknown:
at that point, Postgres becomes your canonical store. Right? And you allow all your data within these lakes to do what you do with it. And you can either ingest into Postgres or just interop, you know, making it work is not a problem. And it's not just FDWs, right? There are plenty of what I would call brokers out there. A good example is Debezium, which is open source, which allows you to interface with a lot of different data sources, and you can shove it right into Postgres if you need to. The data lake technology out there is, you know, I remember the days when, you know, you'd be doing EDI.
And it's interesting to me to see all this new stuff and yet more marketing terms come up for what's already been around. But definitely from a technology standpoint, Thomas is absolutely correct. I mean, Postgres is really your consumer at that point.
[00:50:06] Unknown:
As you continue to work with Postgres and its community and work with your customers to help them be able to put Postgres into effect for their various data needs, what are some of the areas of work that are happening in the Postgres community, both in terms of core and the surrounding ecosystem, that you're most excited about and that will help to make it more easily suited to these data warehouse and analytical workloads?
[00:50:31] Unknown:
From my perspective, on the main things going on in core, we've spent the last few releases, we're on release 13 now, adding features. But for the most part, it's been cleanup and optimization. For example, you know, in version 9.6, we had partitioning, table partitioning, which is fundamental to a successful data warehouse. It was okay. It did what it was supposed to do. But now, our data partitioning is, well, I'll give you an example. It used to be that having, say, 250 partitions, planning a query around that number of partitions could take a considerable and demonstrable amount of time. Whereas now in the newer versions, the more modern versions of PostgreSQL, planning around even thousands of partitions is minuscule. You don't even notice it. So it's those kinds of optimizations to deal with larger datasets.
And, you know, a lot of people are like, oh, it's just a tweak. No, it really isn't because you're talking about a code base that goes all the way back to the eighties. There's quite a bit of, oh, if we change this, we have to change that, and then we have to change this. But at the end of it, it becomes highly optimized. And that type of work is happening now. And there's a lot of really good stuff. I'm not gonna go into specific patches, but I think that PostgreSQL used to be a very good departmental database. And then it became a very good Internet slash web database.
And now, well, it used to be that to have a data warehouse meant a terabyte of data. Having a terabyte of data is nothing now, because we're keeping everything. And so now we're modernizing the code base again to deal with the current demands for data volume and querying that data.
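A quick way to see the planner behavior JD is pointing at, reusing the hypothetical events table from the tablespace sketch above, is to ask for a plan on a modern release; with pruning on, only the matching partition should appear even if thousands exist:

```sql
-- Partition pruning is on by default in modern releases.
SHOW enable_partition_pruning;

EXPLAIN (COSTS OFF)
SELECT count(*)
FROM events
WHERE event_time >= '2024-03-01' AND event_time < '2024-04-01';
-- Expected: a scan of events_2024 only, with the other partitions pruned
-- at plan time rather than scanned and filtered.
```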
[00:52:23] Unknown:
I would say in terms of excitement, I do think that the ZedStore initiative by Greenplum is an interesting one, if it makes it into a later version of Postgres, because it will offer a compressed table storage that is more optimized towards analytic type workloads, and that will work seamlessly with Swarm64 then as well, because, as I said, we are just adding to Postgres. We're not taking away. And we think that could be a very nice complement to what we have done with the columnstore index. So then it offers a wider variety of what users can use, and they would, in some cases, keep data in the standard Postgres format, which has a ton of advantages, of course, but they would sometimes choose to use ZedStore. So I find that very exciting. So, you know, we're doing what we can to support that project. It will probably, unfortunately, take a bit of time until it's there, maybe two or three years. I wouldn't hold my breath yet, but that's one thing that really excites me. Actually, it's interesting that you just brought up a timeline.
[00:53:24] Unknown:
Although I know that currently the plan is to get it into core, within two to three years, I would expect that the value of having it in core would be rather reduced. And instead, like Swarm64, it would be essentially an extension. And there's a lot of benefit to keeping it out of core in that fashion, because it will be able to move at its own development life cycle, while maintaining compatibility, versus PostgreSQL's development life cycle. That would be huge for a project like that as, you know, new methods and new ideas come about, and being able to test and, you know, have, you know, long-term releases versus short-term releases for testing in early adopters and all that type of thing. But I would agree with you that ZedStore is definitely one of the more interesting technological improvements that are coming about.
[00:54:21] Unknown:
And in your experience of using Postgres and working with other people who are trying to deploy Postgres for these analytical use cases, what are some of the most interesting or innovative or unexpected ways that you've seen it used? We just recently had a project
[00:54:36] Unknown:
in the health care industry. The client demand was that all changes to a particular table be exported to JSON and simultaneously imported into an S3 bucket. And this was literally to be consumed by some external analytical platform to run reporting, financial reporting. And of course, Command Prompt happily obliged: you know, we built a system using Postgres, using logical replication slots and change data capture, to export to JSON and insert it into S3. And in fact, we were able to build it out in about 10 days, after an extensive project by another company had failed trying to use the Data Migration Service.
To me, the reason that's interesting is because of how wrong it was that we had to do it in the first place. But it shows the functionality, or the extensibility, of Postgres. Right? The adaptability. We had a demand for canonical data to be transformed into a different data format live as it changed and imported into S3 for export into another reporting system. And we were able to build that out in basically a week and a half. It's been in production for, you know, 4 months without a blip. I attribute that not to Command Prompt, but I attribute that to the stability and extensibility of PostgreSQL itself.
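As an illustration of the pattern being described here, not Command Prompt's actual implementation, the following rough sketch drains change data capture events from a logical replication slot and pushes them to S3 as JSON. It assumes wal_level is set to logical, the wal2json output plugin is installed, and psycopg2 and boto3 are available; the slot, database, and bucket names are hypothetical.

import json
import boto3
import psycopg2

conn = psycopg2.connect("dbname=clinical")
s3 = boto3.client("s3")

with conn.cursor() as cur:
    # Create the slot once; it persists and tracks how far we have consumed.
    cur.execute("""
        SELECT pg_create_logical_replication_slot('audit_slot', 'wal2json')
        WHERE NOT EXISTS (
            SELECT 1 FROM pg_replication_slots WHERE slot_name = 'audit_slot'
        );
    """)
    conn.commit()

    # Drain the pending changes; each row's data column is a JSON document
    # describing the committed inserts, updates, and deletes.
    cur.execute(
        "SELECT lsn, data FROM pg_logical_slot_get_changes('audit_slot', NULL, NULL);"
    )
    for lsn, data in cur.fetchall():
        # One object per consumed WAL position, keyed by LSN.
        s3.put_object(
            Bucket="reporting-exports",
            Key=f"changes/{lsn}.json",
            Body=json.dumps(json.loads(data)).encode(),
        )
conn.close()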
[00:56:08] Unknown:
Fantastic example of it. I would actually go back to what I described earlier. It's been a while since we came across this. But, basically, the idea to use updates instead of an append only workflow makes your queries tremendously easier on the reading side, as in when you're trying to understand the data. It's tremendously easier because otherwise you have to reconstruct, for example, a certain flow of status updates, like the example I gave with the shipping status, where you basically then have to backtrack and kind of rebuild, for every individual product, what has happened in terms of shipping updates, like what kind of life cycle it has gone through.
Instead, just having an update and basically keeping 1 of those modern array style data types or structured data types like JSON and doing it this way, which is a very, very unusual paradigm by data warehousing standards. But it's such a brilliant concept because it just makes your queries so much easier, also from an understanding point of view. Instead of some database analyst having to conceptualize in their head 3 different joins and multiple sorts in order to stitch the exact shipment tracking history of every 1 of your shipments back together, that's just a single JSON field that contains all the updates.
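A minimal sketch of that "update instead of append only" pattern, with a hypothetical shipments schema: each shipment's full status history lives in a single JSONB array that is appended to in place, so reading the life cycle back requires no joins or sorts.

import json
import psycopg2

conn = psycopg2.connect("dbname=warehouse")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS shipments (
            shipment_id bigint PRIMARY KEY,
            history     jsonb NOT NULL DEFAULT '[]'
        );
    """)
    # Append the latest status event to the existing array in one statement.
    new_event = {"status": "out_for_delivery", "at": "2021-05-20T09:15:00Z"}
    cur.execute(
        """
        INSERT INTO shipments (shipment_id, history)
        VALUES (%(id)s, jsonb_build_array(%(event)s::jsonb))
        ON CONFLICT (shipment_id)
        DO UPDATE SET history = shipments.history || %(event)s::jsonb;
        """,
        {"id": 42, "event": json.dumps(new_event)},
    )
    # Reading the whole tracking history back is a single-row lookup.
    cur.execute("SELECT history FROM shipments WHERE shipment_id = 42;")
    print(cur.fetchone()[0])
conn.close()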
And I think those kinds of things are great learnings. Like, coming across that and then realizing, wow, here, someone did something really genius. That's always the fun part of the job.
[00:57:41] Unknown:
As you have built out your own businesses and worked with your customers and, you know, dug into the guts of Postgres itself, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:57:56] Unknown:
Oh. So you put the term business in there, and I think that's the most interesting and challenging aspect: as the market has changed, a lot of clients don't understand those changes. And I'll give you an example. RDS Postgres is a fantastic platform if you want an almost 0 barrier entry into developing or deploying PostgreSQL. Okay? However, just because it's managed Postgres does not eliminate the need for a DBA. It eliminates the need for an SRE, but you still need someone who understands PostgreSQL, understands its access patterns, understands its configuration, understands how to optimize it, how to extend it, how to scale it, how to back it up, how to make it highly available, all those types of things.
And a lot of people in the market, what they do is they will hire a developer and say, build me this magical widget. And the developer's great at building magical widgets. But the developer has no idea of the chemical composition that is required to make that magical widget. Right? He's all about, I've got this API, I've got this framework, I've got this code with, you know, objects and methods and actions that I'm putting together. And I'm working with a user interface guy to make sure that it's usable, but they don't understand Postgres underneath. To them, it's just a data store.
And that's a really unfortunate thing that has come about because it's not just a data store. This actually applies to any database. Right? It doesn't matter if it's MySQL or Snowflake or Greenplum or whatever. You do still have to understand the technology that is hosting and managing your data. And I think that's the most in fact, I would say it's the most complicated part of our business experience with Postgres is that we will take a request from a potential client or even existing client, and they'll say, you know, our Rockstar developer says it's Postgres' fault. And you go in and you look at what the Rockstar developer is doing, and what it is is clearly the Rockstar developer knows nothing about databases.
They only know how to put or get or post or, you know, insert through a method within their ORM. And that's challenging because then we have to prove the point. Right? You would almost think that's mundane, but it's actually, I would say that the majority of the business that Command Prompt does is not necessarily support in terms of things are broken. It's support in terms of education to the client on how the platform actually works.
[01:00:46] Unknown:
And I would add to that the emergence of Postgres solutions, all having kind of small nuances in their flavor: the parts they automate, the parts they take away from the user in the good sense, as in the user doesn't have to spend any time on them anymore, and the parts they take away from the user in the bad sense, as in, for example, RDS excluding a certain number of extensions and no longer having the standard extension interfaces that regular Postgres has. I think that is something that really creates a larger level of complexity, and it can go as far as branding even. So if you look at Amazon's Aurora offering, which you can have in 2 flavors: it's called Aurora, but there's an Aurora MySQL and an Aurora Postgres.
So it's basically muddying the water a little bit, which is a shame, because it makes things a little more complicated from an operational point of view. So I would definitely stick on that point: operating it, especially as you're getting further and further removed from the database, is starting to become the bigger challenge for the users.
[01:01:57] Unknown:
No. I think that that's actually a very good point, Thomas, because if you look at all the managed providers, right, the extension problem is obviously a very important 1. RDS is great as long as you live within its box, until the moment you wanna do something that's outside of its box. And keep in mind that that's not the Postgres box, it's the RDS box. Because you can make Postgres do anything you want, because it's open source; RDS is not. And then you take something like Aurora PostgreSQL, which is actually different.
It has different planner semantics, it's got different storage semantics, you know, a pricing structure that's different. And then you take that and go over to Microsoft and their version, which also has its own storage engine. And then all of a sudden, you're ending up in this environment where you say you're running Postgres, but now we have to qualify our Postgres. Are you running Timescale? Are you running RDS Postgres? Are you running Aurora? Are you running, you know, Azure? Are you running GCP? Because they all have their own extensions to Postgres to make it work the way they think it should work.
And so now what's happened, like, for example, with Command Prompt: every good company has a, you know, client intake process. Well, 1 of the things that we do is we say, okay, can you tell us about your Postgres installation? And we ask them half a dozen, you know, actually it's more like a dozen questions now, but it used to be, what version of Linux are you running and what version of Postgres? That was the intake process for knowing how to take care of the client. Well, now it's what operating system? Are you on Linux, or are you on Windows, or other? Because occasionally you run into FreeBSD or something like that.
What version are you running? And then, you know, is this a managed platform? If so, which managed platform? Those types of things. And so Postgres is no longer just this piece of software that we have downloaded or installed with apt-get. It is any number of flavors. And it's also making it challenging for someone like, you know, a Swarm64, who is integrated so deeply with the known open APIs of PostgreSQL. Even though they work just fine with community Postgres, they may not work with something like RDS because of the limitations that RDS puts on its users.
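One quick, hedged way to see which flavor of Postgres you are actually dealing with is to ask the server itself: the set of available extensions and the reported version differ between a stock community build and managed offerings like RDS, Aurora, or Azure. A small sketch, assuming psycopg2 and whatever connection string your environment uses:

import psycopg2

conn = psycopg2.connect("dbname=postgres")
with conn.cursor() as cur:
    # Extensions this particular installation is allowed to offer.
    cur.execute(
        "SELECT name, default_version FROM pg_available_extensions ORDER BY name;"
    )
    for name, version in cur.fetchall():
        print(f"{name} {version}")
    # The reported server version string often hints at the hosting flavor.
    cur.execute("SHOW server_version;")
    print("server:", cur.fetchone()[0])
conn.close()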
[01:04:19] Unknown:
Yeah. As somebody who uses RDS for my day job, I can definitely agree with the constraints and occasionally bemoaning them, but still being able to benefit from the fact that I don't have to worry about making sure that the actual underlying box is up all the time and managing the point in time snapshots for the backups.
[01:04:36] Unknown:
Right. And there are definitely things at which RDS is great. There's no argument there. It's 1 of those things, like, it's great until it isn't. And once it isn't, your life gets really complicated really fast. Exactly.
[01:04:50] Unknown:
And so in terms of people who are looking at Postgres and considering it for their data warehouse or analytical workloads, what are the cases where it's the wrong choice and they might be better served with either, you know, a different database architecture or a different managed platform or just using a distributed query engine, or, you know, throwing everything at Spark and saying to heck with the database.
[01:05:12] Unknown:
In my mind, there's 2 parts to that question. Number 1 is not having the expertise, by far. It's funny, the market thinks that anybody can be a developer or anybody can be a DBA, but you wouldn't expect your DBA to be able to build a car. Right? So why are you expecting your web developer to be a DBA? So you need to have actual expertise, especially with something like a data warehouse, where you have unique access patterns and volumes and volumes of data beyond what you would normally encounter. I mean, your transactional database could run, say, a $1,000,000,000 company, and the transactional database might only be 100 gig or less.
But the data warehouse for that company is a petabyte. It's a totally different understanding of how to deal with the technology. But once you have the expertise, it kinda answers your other question. Right? Is that, you know, what is the problem you're trying to solve? Well, I've got a data warehouse and I need tools to manage this data. And like you said, maybe it's Spark, maybe it's, you know, any number of other platforms. But that question answers itself once you have the expertise. So I don't think that there's any real hard line there, possibly if you're using a platform that is not relational.
You know, I could see a situation where if, for whatever reason, you were using something like, you know, object storage or something like that, trying to push that back into Postgres could be a challenge and maybe the wrong answer to the question. But it really does make itself apparent once you have the expertise to understand what your architecture is doing.
[01:06:58] Unknown:
Are there any other aspects of Postgres in the data warehouse and analytics ecosystem or the work that you're each doing with it and just the underlying technologies or capabilities that we didn't discuss yet that you'd like to cover before we close out the show? Nothing we haven't discussed, no.
[01:07:14] Unknown:
Obviously, I'd like to affirm the part about it being open source. I mean, this is why it has become so successful. Right? It's why Amazon runs on Postgres: because it was open source. And I think that for those who are trying to deploy a new system, whether it's transactional or data warehouse, Postgres should be on the list for consideration because it puts you in an environment where you are forever supported. You can either be supported internally through your own staffing or through companies like Command Prompt. And you are also supporting the ecosystem that is driving forward the new models of how business is being done, like a Swarm64.
That that's kinda where I'm at on it. To add to that,
[01:08:07] Unknown:
it's easy to look at Postgres from a value point of view, free, open source, but we should also not forget that, next to very few other databases, it is definitely up there, I would say, among the top 3 most fully featured databases in the world, full stop. And that allows you to do all these things that we've been talking about today. And that is the value in itself, because who knows, at the beginning of a project, where the project may take you. And to swap out the database 2 years into an operational project is a nightmare. You don't wanna do this. So having the ability to start with something that is just so incredibly fully featured and extensible, I think, is always a very powerful starting point. And it saves you from trying to predefine all your requirements upfront, which we all know from experience doesn't work. You'll always forget something. It saves you from that burden, and you will find a way to solve it later, as we've seen many times.
[01:09:07] Unknown:
Well, let's not forget that requirements are not only forgotten, they change. You may have requirements a, b, c, but, you know, the moment you start deployment or development of whatever you're doing, a, b, c becomes A through Z, and there's a whole lot of A.1.2.3.4s that just appear. And to your point, Thomas, the great thing about PostgreSQL in that environment is that you have started with a platform that can adapt as those requirements change.
[01:09:37] Unknown:
One thing that could probably be a whole podcast by itself is the capability of Postgres to be able to push user defined functions into it using a large variety of different programming languages and approaches, and, you know, the viability of then being able to maybe push down some of your machine learning into the database rather than having to, you know, bring the data to it. And as I said, I'm sure that that'll probably spark up a whole other conversation that's worthy of its own hour plus discussion. So
[01:10:10] Unknown:
Well, definitely. In fact, I was surprised you brought it up. User defined functions and stored procedures are hugely powerful. And specifically, in regards to this topic, you know, you have the ability to write them in, for example, R or Python, both of which are, you know, very well suited and widely deployed for data analytics, data science, statistics, that type of thing.
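A hedged sketch of that in-database UDF idea: a PL/Python function that scores a row without pulling the data out of Postgres. It assumes the plpython3u extension is installable (many managed platforms do not offer it), and the function name, arguments, and "model" are purely illustrative stand-ins.

import psycopg2

conn = psycopg2.connect("dbname=warehouse")
with conn, conn.cursor() as cur:
    # Requires superuser-level privileges and the PL/Python package installed.
    cur.execute("CREATE EXTENSION IF NOT EXISTS plpython3u;")
    cur.execute("""
        CREATE OR REPLACE FUNCTION churn_score(num_orders int, days_since_last int)
        RETURNS double precision AS $$
            # Placeholder "model": in practice you might load a trained
            # estimator rather than hand-coding a heuristic like this.
            return min(1.0, days_since_last / (30.0 * (num_orders + 1)))
        $$ LANGUAGE plpython3u;
    """)
    # The scoring now runs next to the data, inside the database.
    cur.execute("SELECT churn_score(3, 45);")
    print(cur.fetchone()[0])
conn.close()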
[01:10:38] Unknown:
Well, we'll leave that 1 on the table for now, and maybe we'll come back to it another time. But for anybody who wants to get in touch with either or both of you, I'll have you each add your preferred contact information to the show notes so people can follow along with the work that you're doing. And as a final question, I would like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:10:59] Unknown:
Yeah. Last time, I answered with what I consider a certain need in software tools. But to be honest, I think in this day and age I have a very different answer for you today, and I'm really looking at the tool as in the tool you can touch. I do think we're running into a dangerous shortage of chips in general and, specifically, processing chips. I mean, GPUs are the very obvious case, but CPUs are also starting to show a degree of squeeze. And I do actually think that there have been absolutely amazing iterations, especially in the last few years, when it comes to multicore. And what I'm seeing at the moment is that the global demand is currently not met. So what I actually think the number 1 gap at the moment is, is really to get all these amazing chips that have been recently developed out in quantity and to kind of keep up this amazing speed of innovation that the likes of Intel, AMD, NVIDIA, and so on have all shown recently. Because, for me, data warehousing is strongly connected to this massively parallel processing.
And I think the ability to just have these amazing core counts on a single CPU, and then having easily 4 CPUs in the system, is just completely changing the landscape. So I want these chips to deliver. I want them to be easily available. And, unfortunately, at the moment, I see this being slightly under threat. However, as we have this free enterprising world, you know, as long as there's money to be made, people will really scramble to meet the demand. So I'm hopeful that this will pass, but at the moment, I'm a bit concerned about it. In my mind, although I think you're absolutely right with the chip thing,
[01:12:42] Unknown:
it's security tooling. Security has become so complicated that every day you hear about a data breach. So what we really need is to rethink the security paradigms that we're using to protect our data, especially personal data, PII, so that we don't have as much of a concern that this data that's being collected, analyzed, deployed, used for tracking and all of this type of stuff will some way, at some point, become a $50,000,000 bounty, or be released to the world. That, to me, especially when we're talking about data warehouses and data in general, you know, outside of data privacy concerns, which we can get into in another topic, that's probably, in my mind, number 1: protecting that data from bad actors.
And we don't have good tools for that. That's a great point. I completely agree.
[01:13:48] Unknown:
Well, thank you both for taking the time today to join me and share your perspectives and experience of using Postgres for data warehousing and analytical data storage. It's definitely a very interesting and relevant topic and 1 that I've spent a decent amount of time thinking about and digging into. So I appreciate having both of your expertise on that matter and the time and energy that you dedicate to it. So thank you both again for your time, and I hope you each enjoy the rest of your day. Thank you. Thank you very much. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Postgres as a Data Warehouse
Defining a Data Warehouse
Postgres for Analytical Workloads
Postgres in the Cloud and Hybrid Workloads
Configuring Postgres for Data Warehousing
Extensions and Plugins for Postgres
Data Lifecycle Management in Postgres
Postgres and Data Lakes
Future Developments in Postgres
When Postgres Might Not Be the Right Choice
Final Thoughts and Closing Remarks