Summary
There is a lot of attention on the database market and cloud data warehouses. While they provide a measure of convenience, they also require you to sacrifice a certain amount of control over your data. If you want a warehouse that gives you both control and flexibility, then you might consider building on top of the venerable PostgreSQL project. In this episode Thomas Richter and Joshua Drake share their advice on how to build a production-ready data warehouse with Postgres.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started. The first 25 visitors will receive a Firebolt t-shirt.
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Your host is Tobias Macey and today I’m interviewing Thomas Richter and Joshua Drake about using Postgres as your data warehouse
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by establishing a working definition of what constitutes a data warehouse for the purpose of this discussion?
- What are the limitations for out-of-the-box Postgres when trying to use it for these workloads?
- There are a large and growing number of options for data warehouse style workloads. How would you categorize the different systems and what is PostgreSQL’s position in that ecosystem?
- What do you see as the motivating factors for a team or organization to select from among those categories?
- Why would someone want to use Postgres as their data warehouse platform rather than using a purpose-built engine?
- What is the cost/performance equation for Postgres as compared to other data warehouse solutions?
- For someone who wants to turn Postgres into a data warehouse engine, what are their options?
- What are the relative tradeoffs of the different open source and commercial offerings? (e.g. Citus, cstore_fdw, zedstore, Swarm64, Greenplum, etc.)
- One of the biggest areas of growth right now is in the "cloud data warehouse" market where storage and compute are decoupled. What are the options for making that possible with Postgres? (e.g. using foreign data wrappers for interacting with data lake storage (S3, HDFS, Alluxio, etc.))
- What areas of work are happening in the Postgres community for upcoming releases to make it more easily suited to data warehouse/analytical workloads?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Postgres used in analytical contexts?
- What are the most interesting, unexpected, or challenging lessons that you have learned from your own experiences of building analytical systems with Postgres?
- When is Postgres the wrong choice for a data warehouse?
- What are you most excited for/what are you keeping an eye on in upcoming releases of Postgres and its ecosystem?
Contact Info
- Thomas
- JD
- @linuxhiker on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- PostgreSQL
- Swarm64
- Command Prompt Inc.
- IBM
- Cognos
- OLAP Cube
- MariaDB
- MySQL
- Powell’s Books
- DBase
- Practical PostgreSQL
- Netezza
- Presto
- Trino
- Apache Drill
- Parquet
- Parquet Foreign Data Wrapper
- Snowflake
- Amazon RDS
- Amazon Aurora
- Hyperscale
- Citus
- TimescaleDB
- Greenplum
- zedstore
- Redshift
- Microsoft SQL Server
- Postgres Tablespaces
- Debezium
- EDI == Electronic Data Interchange
- Change Data Capture
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because the number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all of this collaboration chaos firsthand, and they started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Thomas Richter and Joshua Drake, heretofore known as JD, about using Postgres as your data warehouse. And, Thomas, can you start by introducing yourself?
[00:02:09] Unknown:
Hi. I'm Thomas, CEO and cofounder of Swarm64. We are an extension for Postgres that accelerates complex queries and data-heavy queries. So, basically, the classical query types you would see in a data warehousing
[00:02:23] Unknown:
environment. And, JD, how about yourself?
[00:02:25] Unknown:
I am the founder of United States PostgreSQL, as well as the founder of Command Prompt Incorporated, which is a full stack PostgreSQL consultancy and support provider.
[00:02:35] Unknown:
And we've been around for over 20 years. And going back to you, Thomas, do you remember how you first got involved in the area of data management?
[00:02:42] Unknown:
Yeah. My first exposure to that was really in what you would probably today call sales analytics or in-house analytics, and back then it was called sales steering. That was just around the turn of 2000. And, there was an internship. And, basically, as the intern, I would do the daily data preparation and so on. So back then, it was IBM with Cognos on a big data warehousing system. Everything was, back then, still OLAP cubes, a very prescriptive way of how data is queried, and that was, in a way, a lot of fun.
And as you can see, it left a lasting impression. So that was my first endeavor into data science, and then, basically, I've always been close to data processing throughout my entire career. Worked in IT systems for airlines for a while, then got into the wonderful world of start-ups, worked on projects around information access and mobile technology, and then eventually decided to create my own start-up. So that's really my process of going into data processing, and I've always been very close to MySQL, MariaDB, and Postgres
[00:03:53] Unknown:
throughout my career so far. And, JD, how about yourself?
[00:03:56] Unknown:
I have a slightly different background. I actually fell into databases. I worked for a company called Powell's Books, which is the largest freestanding bookstore in the world, actually. It's in Portland, Oregon. And there was a technical bookstore. This is back in the days when Linux was just coming about. C and C++ were the primary platform for everything. Java hadn't even been put into production. And I wrote a special order database system in dBase of all things. And that kinda landed me in the direction of going into web development and those types of things. And I ended up picking up Postgres before it was PostgreSQL. It was called Postgres95.
And that was the actual first exposure to Postgres that I had had. And because of that, Command Prompt was created, and I wrote a book for O'Reilly called Practical PostgreSQL all the way back in 2001. And I've continued to use PostgreSQL and support the community since then.
[00:05:05] Unknown:
For today, we're talking about Postgres in the context of data warehousing. But before we dig too deep into that, can we just establish a working definition of what constitutes a data warehouse? Because I know that there are sort of a few different ways to look at it and a few different approaches. So for the purpose of this context, if you can each just kind of give your thoughts on the framing for this conversation.
[00:05:29] Unknown:
Well, Thomas, what do you think about that? Because the traditional data warehouses that you and I, you know, would consider, they're not like that anymore. We're getting a lot of hybrid workloads, things like that.
[00:05:41] Unknown:
Yeah. So I think data warehouse is actually a changing definition, so to speak. In my view, it is basically a system that holds your data for analysis, and that could actually take various sizes. And I think in the old days, it used to be very specific. It used to be very offline, as in you would put data in, and it would then precompute things and make them available for analysis. I think today, the definition is shifting much more towards you can store a large quantity of data, often transactionally consistent, and then retrieve that. And there's no more prescription around, okay, how recent does your data need to be? Is there any offline activity? So there's more of a real-time aspect to it. So, really, a large, heavily parallel database that allows you to ask questions and get the answers back in a reasonable time, or to provide batch jobs that can then be answered, for example, in the nightly run and provide that information by the next day.
That's how I would define a data warehouse. JD, maybe you can help with the Yeah. I would agree.
[00:06:49] Unknown:
I mean, traditionally, people might have something like a Netezza, which is actually a fork of PostgreSQL. It's actually a fork of PostgreSQL 7.2. But nowadays, what we've started to see, you know, in our support and consulting, is that our clients don't really want to make the distinction. They may use something like tablespaces to push their data for warehousing off to different storage than their transactional loads. But we are definitely finding that a lot of our new business, a lot of our migratory business, is how can we get off of these, you know, monolithic, large, expensive platforms and just run Postgres?
And how do we make Postgres do what we want when it comes to data warehousing? And it's usually, as you said, right, it's about analyzing whatever data that you put in there. It's not necessarily transactional in nature. It's more reporting in nature, machine learning in nature, that type of thing. In terms of the out-of-the-box experience with Postgres,
[00:08:00] Unknown:
what are some of the shortcomings or limitations when trying to use it for these scalable analytical purposes versus the transactional, row-oriented workloads that it's built to handle?
[00:08:14] Unknown:
And, Thomas, I'd like to get your feedback on this. But from my perspective, there is no inherent limitation, except for either a lack of knowledge or skill set around how PostgreSQL does things, or the fact that you usually want value-added tools. Postgres on its own is a perfectly viable data warehousing platform. It's also a perfectly viable transactional platform, which is what it was originally designed for. But when you're talking about a data warehouse, you're talking about usually a volume of data that is at a scale that requires more architecture than just let's install it on a Linux box and see what happens.
[00:08:57] Unknown:
Thomas, what's your experience with that? Yeah. I would agree. I think Postgres is definitely an underrated tool in the data warehousing space, and I think we'll see a lot more growth of using Postgres as a data warehousing tool. And probably later on, I'll cover a little more of some really unique features that Postgres has to bring to the table. What generally comes up, for purely performance reasons, is, of course, the column store versus row store argument. So transactional data is where data is usually changing within the same row, as in, for example, if you update a customer address, you would typically update the street name, the ZIP code, and the house number, for example, at the same time.
And that would all be in the same row, versus analytical workloads where maybe all you care about is finding out which customers are in which ZIP codes, and then you would not care about the street name or the house number. So a classical example is that most pure-play data warehousing products would, of course, have a columnar storage format, and Postgres has a row store. However, as JD was saying, with the right level of skill, you can work around these things. And, in general, it really then comes down to performance. So the level of parallelism Postgres can deploy, for example, onto a single query is more restrictive than, for example, other solutions, also other solutions from the Postgres ecosystem.
The way data is accessed is slower given it's a row store versus a column store, at least for that specific purpose. But in general, there's no inherent technical limitation. And when it comes to extending Postgres into high-performance data warehousing, there are many ways you can do this with extensions, with community projects, and I think we'll cover some of them later on in this podcast.
[00:10:48] Unknown:
As you mentioned, the definition of data warehousing has evolved a bit from what it was a couple of decades ago to where we are now with things like cloud data warehouses or systems such as Presto and Trino where it's just a query engine that runs across, you know, various different storage locations. And I'm wondering if you can just briefly kind of categorize the different types of data warehouse systems and where you see Postgres fitting within that, given the fact that there are these different extensions for being able to stretch it in different directions?
[00:11:22] Unknown:
Well, I think you kinda answered your own question there. You know, it fits where you want it to fit. I mean, that's the nice thing about PostgreSQL and that it's open source. Right? Although it's certainly not necessarily designed to fit into, let's say, a federated data warehouse, like, you know, a Snowflake or something like that. You're gonna end up in a much better situation by deploying Postgres for your data warehouse, because you can make it do whatever it is that you need it to do. Out of the box, it can be a data warehouse as it sits.
But if you need more performance, you can add something like Swarm64, which makes your B-tree indexes, and Thomas can speak more to this, much more performant, and saves quite a bit of memory resources, which means more resources and speed are made available. And you can federate it if you want to. Right? So I don't know that there is an actual spot that Postgres doesn't fit. You know, it goes back to the original statement of, you know, if you have the skills and the knowledge, unlike, say, MySQL. Right? MySQL is good at what it does, but it's very specific to what it does unless you, you know, buy a bunch of different clustering software and things like that. Whereas with Postgres, it's fantastic at what it does. And then when you hit a ceiling with it, you can extend it using either your own knowledge or the knowledge of other vendors and projects that have already extended it for you. You just have to install it and configure it. Yeah. Maybe one way to answer this question is also to look at what it does exceptionally well. Because I think it's no secret
[00:13:08] Unknown:
that there is a very, very large amount of data analytics solutions out there, quite simply because analyzing your data is such an important part of what you should do as an enterprise. So, clearly, it's a fast moving market. It's an attractive market. There's many, many solutions out there. And so what we've been asking ourselves recently is really, like, where does Postgres play strongest? And, of course, it has all the features that JD has mentioned. And I think the open source aspect and the versatility is extremely strong. But I think another area that is maybe, I would say, underrated, so I would just like to talk about this a little bit, is to have the ability to enhance your analytics with all the transactional features of Postgres.
We call this internally TEA, as in transaction-enhanced analytics. It is in this hybrid space where you're basically no longer distinguishing clearly between analytics and transactions, but your workload is a purely analytical one. You're using transactions as a means to augment it. So let me give you a few examples. First one, clearly, transactional consistency, as in very easy to debug because, you know, if something is there, it's there. And if something is not there, it's not there. So there's no guessing, no eventual consistency. Then, of course, the high availability that Postgres has. There are many, many different ways you can integrate Postgres into these zero-downtime or six-sigma or five-nines type environments, and that has been done all over the world. So you've got something that is very, very powerful. So if you apply those two concepts, you will find that very, very few databases in the world, all up, actually satisfy that. But then you can add a third thing, and I think that really makes it then even stronger.
You have the ability to prevent and heal human errors. These are different things we've seen at clients. For example, you can enforce data consistency or unique constraints, which a lot of your classical data warehousing tools will not do. They will not allow you to say, for example, I want a certain customer ID to be consistent with the customer IDs in my other table. Those kinds of things are usually limited to the kind of databases like Postgres that can truly do that and enforce these constraints. Or, on the other hand, things like the ability to roll back even major changes. So unlike a lot of other databases, in Postgres you can even wrap creating or dropping an entire table in a transaction. So let's say you're operating this database. You're making a mistake.
There's the ability to roll back the transaction and basically undo your mistakes. So these are just three examples. And then the fourth one I would add is that there are actually a lot of nice things in how the transactional features of Postgres can expand how you can do, basically, analytical data handling. So just one example we've seen in the field that we found super interesting. And I have to abstract that a little bit because the exact case was under NDA, but I'll give you an example that's very similar. So instead of taking this classical append-only design where you would, for example, record every update of the shipment status of an item.
You would put that all into a table, and you would only append. Instead, what you can do is you can, for example, have a JSON field where you're concatenating all the shipment updates with an update statement. So as opposed to just adding to a table that grows and grows and grows, you can actually just update the individual field. And the benefit is then that, for example, asking for a shipment that has transitioned from loaded-to-delivery-van directly to lost suddenly becomes a much easier query to ask, because you do not have to reconstruct your table from all your append-only structure and find out in what order updates actually happened. It's all there. It's all within a single field that will provide you, for every single shipment, the complete history of what has happened in terms of shipment status updates.
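To make that pattern concrete, here is a minimal sketch under assumed names; the shipments table, its columns, and the status values are hypothetical illustrations rather than details from the actual customer case:

```sql
-- Hypothetical shipments table: one row per shipment, with the full status
-- history kept in a single JSONB array instead of an append-only log table.
CREATE TABLE shipments (
    shipment_id    bigint PRIMARY KEY,
    customer_id    bigint NOT NULL,
    status_history jsonb  NOT NULL DEFAULT '[]'
);

-- Each status change updates the existing row by appending to the array.
UPDATE shipments
SET status_history = status_history
    || jsonb_build_object('status', 'loaded_on_van', 'at', now())
WHERE shipment_id = 12345;

-- Questions about a shipment's life cycle read one field instead of
-- reconstructing the ordering from an ever-growing append-only table.
SELECT shipment_id, status_history
FROM shipments
WHERE status_history @> '[{"status": "lost"}]';
```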
So it's those kinds of things that really let Postgres shine even against a massive backdrop of products that are out there. And that's why we think more people should know about it, and one of the reasons we're having this podcast today. I would actually like to add a fifth point to that. Those first four points
[00:17:46] Unknown:
are absolutely on target. And I think that it really brings about the true versatility of Postgres as a data warehouse. But the one thing that we're forgetting is that Postgres can be the data warehouse for all the other platforms. You can directly replicate and ingest, say, from Oracle. So if you have an Oracle environment that, obviously, for whatever reason, you can't get rid of, you can't migrate off of, that's fine. But you don't have to deploy any more Oracle. You can just deploy Postgres and replicate from Oracle to Postgres to allow Postgres to be the data warehouse. And you can do it with Microsoft SQL Server.
You can do it with any major platform that's out there. So you end up in an environment where Postgres becomes, you know, the database of choice over time, because it satisfies the need of not having to deploy legacy platforms.
[00:18:45] Unknown:
And that brings to mind things like the Presto/Trino engine or things like Drill for being able to be this distributed query engine across various underlying storage layers. And I know that Postgres has the concept of the foreign data wrapper for being able to do something similar. I'm just curious if you can dig in a bit to some of the trade-offs or relative strengths of Postgres being used as this focal point for all of the data and being able to federate those queries across different storage engines, versus something like Presto or Drill that is designed to be just a query engine with no native storage layer? Well, Presto actually does work with Postgres, so we should throw that out there.
[00:19:26] Unknown:
For me, the main limitation that you're going to have with using, say, foreign data wrappers or something like that is, well, there are two. One is that if you wanna go transactional, you are going to be limited by the quality of the FDW itself. So for example, if you're federating Postgres, so you're using FDW across multiple Postgres instances, the quality of performance is going to be much higher than, say, federating Postgres to Oracle. Because there are just certain things we can't see in Oracle, and therefore, when you're federating the data out, you're not gonna have optimum query plans and things like that.
[00:20:09] Unknown:
Thomas, what do you think? Yeah. I think you're also highlighting here one of the key points when it comes to the relative strength of these solutions. So if you're looking at something that is essentially a query engine across a range of platforms, you're losing a lot of the points we made earlier around the absolute transactional consistency, the ability to have very predictable high availability setups, this ability, for example, to roll back changes, because you're just an engine querying whatever data is there. It's basically taking things into a lot more of an offline logic. And, of course, for certain cases, that's what the user may want. So then a tool like Presto is the perfect tool. But as you're moving more into kind of this reactive real time, as we said in the beginning of the call, the ability to kind of, you know, have that fluidity between is it now very recent data or is it offline data and you don't have to distinguish anymore, I think then you will very quickly hit the limits of those query engines that are just running over a host of different data storages.
However, there are very valid reasons in which you may actually want Postgres to just access data stored outside. So we've just had a customer case that we won based on that. And in the end, we ended up using the foreign data wrapper for the Parquet storage format. So that's a foreign data wrapper developed by the community. We have basically contributed a little bit to it, and that allows us to just read Parquet files that are stored anyway, like, outside of the database, because that's how the customer's architecture works, and you can just bring that into Postgres. So you do have these federating capabilities or these foreign access capabilities in Postgres as well, but I think it does play strongest if you want that predictability, transactional consistency, and high availability concept you can set up without actually having to develop a PhD in computer architecture. So I think that would be my take on it. In those areas, Postgres is really playing very strong.
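For listeners who want to experiment with that approach, a rough sketch with the community parquet_fdw extension might look like the following; the file path, column list, and table names are assumptions for illustration, not details from the customer engagement:

```sql
-- Assumes parquet_fdw is already installed on the server.
CREATE EXTENSION IF NOT EXISTS parquet_fdw;

CREATE SERVER parquet_srv FOREIGN DATA WRAPPER parquet_fdw;

-- Expose an existing Parquet file as a read-only foreign table.
CREATE FOREIGN TABLE sales_archive (
    sale_id  bigint,
    sold_at  timestamp,
    amount   numeric
)
SERVER parquet_srv
OPTIONS (filename '/data/lake/sales_2019.parquet');

-- The Parquet data can now be queried like any other Postgres table,
-- including joins against regular (heap) tables.
SELECT date_trunc('month', sold_at) AS month, sum(amount) AS revenue
FROM sales_archive
GROUP BY 1
ORDER BY 1;
```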
[00:22:17] Unknown:
Actually, that's a good point, Thomas. One of the things that has happened, as the new generation of developers comes about and items are moving, for example, to the cloud, is that a service such as S3 is very popular. And you may not want a certain class of developers or certain code that's being executed to have access to the database itself. And so what would happen is that you can use the FDW technology to actually link to an S3 bucket. And people can upload their S3 data, you know, JSON objects or whatever files that they're putting in there. And because of how versatile Postgres is with the FDWs, you can use that as an ingestion point that's always live.
So you just have some database job that says, okay, grab the latest data from S3 out of this table, which appears to be a table, but it's actually S3, and manipulate it any way you need to ingest it into the actual data warehouse storage that Postgres is
[00:23:22] Unknown:
providing. And that gets into things like the relative cost per byte and cost per performance of Postgres. We are aware that if you're storing the data in S3, you might have a cheaper cost per byte, but it's going to be slower in terms of the latency to actually query across that data. In the other direction, there are these cloud data warehouses that are aiming to decouple the constraints of storage and compute so that they can scale independently of each other. And I'm wondering how you see Postgres contending in the sort of overall landscape of cloud data warehouses and object storage as the kind of canonical storage layer for the growing big data ecosystem?
[00:24:04] Unknown:
It's kind of an interesting question, because if you look at what the major cloud providers are doing, the big three, they have all, especially the top two, being Amazon and Microsoft, put an enormous amount of not only engineering, but infrastructure into deploying Postgres. You know, you can't buy something from Amazon without using Postgres. And when you start getting into, you know, the data warehousing scenarios, I think what's, you know, really gonna come about is that we're just gonna see a continual expansion of that. When you look at the market, I think that there's always going to be, say, you know, well, like, your Snowflakes, for example. I think they're all going to always exist because everyone likes to reinvent the wheel. That's kind of the nature of humanity.
What the market is doing and what the developers are pushing and where the growth is, is that at some point somewhere, you're going to be running Postgres. You know, whether it's RDS or it's Hyperscale, which is, you know, Citus running on top of Postgres. And you're right, no matter what, somewhere, you're going to end up running Postgres, because it's becoming, well, it's becoming the Linux of databases. Right? Everyone, whether they're running Apple or they're running Windows, somewhere is running Linux. And a lot of times they don't even realize it. And Postgres is doing the same thing, and it's very much becoming pervasive throughout the cloud because it provides a uniform, standardized, highly functional, transactional, you know, environment, just as Thomas was stating, for your data warehouse or your transactional load or both.
And I'm not sure how we're gonna get away from that at this point. I mean, I would imagine that within 5 years, it's not a question of whether or not you're running Postgres. It's a question of, you know, which version. Right? Whether it's, you know, pgcloud or whether it's Nitrous, or whether it's, you know, even say like a Timescale for time series data, it's all Postgres at that point.
[00:26:08] Unknown:
Digging more into how to actually take your out-of-the-box Postgres, unadulterated, no extra capabilities added in, and turn it into a data warehouse, what are the things that you would do to actually move it from, you know, I just apt-get installed Postgres, to I'm now using it as the data warehouse for my entire enterprise?
[00:26:35] Unknown:
So first, kudos for saying apt-get. I'm a big Ubuntu fan, Debian fan. That was really kinda great to hear. You know, out of the box, you know, you're talking about configuration. Right? Making sure that you have the right storage, making sure that you have enough cores to run parallel queries, and making sure you have enough memory to manage the data as it's being, you know, manipulated and delivered. There's no native change. Right? It's not like you're flipping major switches or installing major extensions just to get to the point where you can start reporting. In fact, I would say the barrier to entry for a DBA to provide a data warehouse using Postgres is exceedingly low.
It's when you start reaching ceilings that you have to get more creative. Thomas can speak to this better, but something like Swarm64 would come in. Yes. It's all a question of scale, in our experience.
[00:27:32] Unknown:
So, you know, it depends on the size of the enterprise or the size of your data warehouse as a result of it. And, yeah, I would say it's a question of the quantity of data. It's a question of also what kind of response times your analysts expect. Generally, as JD is saying, the first thing I would do is I would adjust the configuration, and I would make sure that a certain amount of hardware is in place. So for those listeners that are actually actively considering a data warehouse, if you're starting with plain Postgres as your engine, I would recommend 16 cores at least. So 16 virtual cores or visible cores for your system.
So as in 16 hardware threads that Postgres can run on. Generally, we find that this works quite well if you have, let's say, one or two concurrent analysts accessing your data. That's the kind of quantity that can very nicely be used. Of course, Postgres really likes storage that responds quickly. So SSDs, which are commoditized enough in this day and age, are, of course, a recommendation. You should have an adequate amount of RAM to go with it. Let's say, as a rule of thumb, maybe a fifth or a tenth of your entire database size should at least be available in RAM. And I think those kinds of starting points will get you quite far.
However, as everybody knows, people want fast responses. They don't want answers in minutes. They want answers in seconds and so on and so forth. And then you have a range of choices.
[00:29:00] Unknown:
I think we're gonna talk about some of them in a moment. Yeah. I would like to add to that, Thomas, just to kinda geek out a little bit. The reason the number of cores is important is because of the architecture of PostgreSQL, which is process-based. In order to do more than one thing on a particular process, so if you're running a query that is calling out, you know, multiple sorts kind of thing, you need more cores, because it can only do one thing at one time per core. So your statement of 16 cores, I would agree with you. That's a very good round number. One, it doesn't cost all that much.
And two, it will allow PostgreSQL to parallelize these longer report queries across the number of cores to retrieve the data faster. Further, with the RAM, because of the way PostgreSQL handles RAM, there are two key parts, right? There's your working memory, and then there is your buffer pool, which is cache. And for data warehousing, the buffer pool isn't as important, but the working memory is. So when you're doing many sorts, you wanna be able to make sure that you have enough memory available to, say, for example, process a hundred megabytes worth of data before you start spilling that data to disk to start processing it. Now luckily, again, to your point with SSDs, that's not as expensive as it used to be, but it's still not as fast as RAM.
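As a rough illustration of the knobs Thomas and JD are describing, the settings below sketch one plausible starting point for a hypothetical 16-core, 64 GB machine; the exact values are assumptions to be tuned per workload, not recommendations from the guests:

```sql
-- Allow a single analytical query to fan out across many of the 16 cores.
ALTER SYSTEM SET max_worker_processes = 16;
ALTER SYSTEM SET max_parallel_workers = 16;
ALTER SYSTEM SET max_parallel_workers_per_gather = 8;

-- Buffer cache: one plausible split on a 64 GB box (tune to your data size).
ALTER SYSTEM SET shared_buffers = '16GB';

-- Per-sort/per-hash working memory, so large report queries sort in RAM
-- before spilling to disk, as JD describes.
ALTER SYSTEM SET work_mem = '256MB';

-- work_mem and the parallel settings apply on reload; shared_buffers and
-- max_worker_processes only take effect after a server restart.
SELECT pg_reload_conf();
```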
[00:30:32] Unknown:
Patrick is a diligent data engineer, probably the best in his team. Yesterday, when trying to optimize the performance of a query running over 20 billion rows, he was so eager to succeed that he read the entire database documentation. He changed the syntax. He changed the schema. He gave it his everything and reduced the response time from 20 minutes down to 5. Today is not a good day. Sarah from business intelligence says 5 minutes is way too long. John, the CFO, is constantly Slacking every living being trying to figure out what caused the business intelligence expenses to grow so high yesterday. Want to become the liberator of data?
Firebolt's cloud data warehouse can run complex queries over terabytes and petabytes of data in sub-seconds with minimum resources. No more waiting, no more huge expenses, and your hard work finally pays off. Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt today to get started, and the first 25 visitors will receive a free Firebolt t-shirt. Digging more into the actual extensions and plugins or forks of Postgres that you can use to make it more purpose-built for data warehousing, you know, maybe leaving out things like Redshift, which originated as Postgres but is not necessarily considered in the same camp anymore. But, you know, I'm aware of things like cstore_fdw, or the column store foreign data wrapper. As I was preparing for this podcast, I was reading about a project called ZedStore that's trying to add native column storage capabilities to Postgres as part of the core offering, and obviously Swarm64. Yeah. I know Greenplum has been around for a while. So what are sort of the relative trade-offs of those different open source and commercial offerings, and what are your considerations as you decide which ones to bring in for a given use case?
[00:32:21] Unknown:
One thing on that, like you brought up ZedStore, which is actually a great example of one of the great features of Postgres. And I know I'm beating this drum, but because it's open source, it also has a storage API. I mean, that's how Aurora works. Right? It's Postgres, and then underneath it, it's using the storage API to actually do a different type of storage than what you would natively do. ZedStore is also going to provide a different type of storage specifically for a certain type of workload and dataset. I think I would actually like to pass the baton to Thomas on this one, because of everything you mentioned, you know, Greenplum is fantastic. It is based on Postgres. It's open source.
You know, ZedStore, which is being driven by the same folks as Greenplum to try and get it into Postgres natively, is also very cool. The Citus people are wonderful. It's not really for data warehousing. They're recognizing the hybrid workloads, so they are adding some features for that, but it's not really designed for that. But the most interesting to me, which is why we're on this podcast, is Swarm64, because of what they're doing with their software to make Postgres be able to perform at a level where, as far as my experience in the field, for what Swarm64 is good at, there is no open source solution, period. So, Thomas, can you explain a little bit more about that?
[00:33:47] Unknown:
Yeah. Basically, what we decided to do is we decided to really fit exactly into the framework that Postgres provides, and that includes a few things. So Postgres in itself, with its extensibility that we've already mentioned, allows you to also augment the database without actually forking the project. So you don't have to recompile Postgres into a new project and, you know, turn it into something else. You can actually just provide additional features into the pluggable type of architecture that Postgres has. So, for example, under the bonnet, we are offering Postgres better ways to do large joins, which is something you'll have very frequently in data warehousing type workloads.
We allow it to execute better. We allow it to execute with more parallelism. We allow it to modify how data is moved and data is replicated during queries in order to basically spend less time moving data around cores and make the processing more efficient. So all these things we do behind the scenes. But then we also have implemented a column-store-based index. So it's called the columnstore index, and it is basically keeping a heavily compressed copy of your data in an index. You can define if you want your entire table to be kept additionally or if you want only certain columns of your data to be included.
Typically, we do see that the raw storage size on disk is anywhere between a fifth to, for example, where people do not choose all columns or the columns are heavily compressible, we've also seen examples of a twentieth of your actual data. And what happens is that this columnstore index is actually updated in the background. So it doesn't take away anything from your Postgres processing as it's happening. It updates itself in the background. But the moment it sees that you need to analyze large quantities of data, it will offer that access method. It will offer the ability to use the columnstore index.
And as a result, it will read an incredibly reduced quantity of data. So you can look at a factor of 10 to a factor of 100 less data that has to be read, and then also process it much quicker to provide it to Postgres. So in other words, as you're moving into these kinds of multi-terabyte, tens of terabytes, hundreds of terabytes type problem scenarios, this makes the difference between hours and minutes, or between minutes and seconds, in your query time. And we generally find that other examples, like in the community, there's a column store foreign data wrapper that, Tobias, you mentioned a moment ago, generally don't reach that level of maturity when it comes to access. So if you compare them side by side, and there's a free developer license available on swarm64.com.
So you can try that yourself. If you compare the column store foreign data wrapper versus Swarm64's columnstore index, you will see that it's a night and day difference in terms of query execution speed. Yeah. So long story short, if you want to have a Postgres that is still 100% a Postgres, behaves like a Postgres, needs minimal additional configuration, if any at all, and then just works also as a data warehousing powerhouse, then adding the Swarm64 extension is the easiest way to get there. And it is basically completely noninvasive. So we're not locking you in. We're not taking you anywhere outside of this Postgres usage paradigm of the 100% Postgres compatibility.
That was a very important design goal we had when we basically upgraded our product version last year from version 4.x. We had a few version 4 releases. And then going to version 5, we basically gained that 100% Postgres compatibility. Yeah. So that's how you can turn Postgres into a high-performance data warehouse very quickly. And then you'll be right up there with some of the best products, like, for example, Microsoft SQL Server with its columnstore index or Oracle's in-memory column store features. So you'll basically elevate Postgres to that kind of level.
[00:38:10] Unknown:
One of the things that people don't really think about either is that because of the way that Swarm64 is architected, one, your ceiling becomes much taller. But also, you save money as people continue to move to the cloud, where you're basically paying for every metric of resource, whether it be an IOP or memory or core or whatever. The high compressibility creates an environment where you need less RAM, for example, to process the same data. And further with Swarm64, Thomas has said that, you know, it's 100% compatibility, but I think another way to put this is that it's backwards compatible. So if we have upscaled Postgres, so now we're running Postgres with Swarm64, and then we decide, we'll never decide this because Swarm64 is awesome, but let's, you know, assume that we have to uninstall Swarm64.
Your data warehouse will still work. Right? And those are important features when you're talking about environments that can change dynamically based on the whim of a CTO decision. Right? So there's more than just a technical argument for Swarm64. There's also a business argument, I guess, is what I'm saying. Yeah. And to be honest, this was entirely by design from our side. We actually consciously chose that
[00:39:34] Unknown:
because, like, lock-in is something that is like a hidden cost. And I think everyone who is now maybe considering things like Snowflake and, JD, as you said, reinventing the wheel. I'm not sure if I could quote you on that, but yeah. So what people aren't seeing is they're basically creating a massive dependency, and they're creating a big debt for the future, for their future selves or for their successors to solve. And we decided that there are so many excellent database operations paradigms, and Postgres is so successful. It doesn't need new usage patterns. It doesn't need new SQL language. It doesn't need all these things. So we are actively keeping all our extra configuration parameters, for example, index definitions and so on, absolutely in line with the standard way of operating it. So if people decide that Swarm64 is not what they need anymore, then it's part of our feature set that they can actually do that and not be locked in. There's no migration.
And that, I think, is a very important feature that people should pay attention to because, I mean, as we all know, pricing models change. I think quite a few organizations getting into big Oracle licenses weren't expecting them to move commercially to a direction that they have since moved. So, you know, I think that's a very important design criterion and decision criteria.
[00:41:02] Unknown:
Well, that and, I mean, let's remember that in a decade, everyone's gonna move back off the cloud because it will no longer be the neat thing. Yeah. That could very well be. Right. All of a sudden, people are like, I can't believe I'm spending, you know, $50,000 a month when I could actually only spend $10,000 a month if I was willing to host it within my data center. And that's not to knock the cloud. I mean, it's been around for much longer than the term cloud. Right? Back in the nineties we were doing what's called virtual hosting. That was the cloud too. It's just everything continues to evolve.
[00:41:33] Unknown:
The way gaming laptops evolve, who knows? Maybe you'll be running your data warehouse on a gaming laptop in 10 years because it is all you need. No kidding.
[00:41:44] Unknown:
Digging a bit more into that cloud aspect of things, for somebody who maybe starts off with a Postgres database, and they're building out their data warehouse, and then they start loading in more and more data from more different sources. And then maybe they start to hit some performance limits because they have so much data on disk. What are the best practices or what are the recommendations that you have for being able to maybe life cycle some of that data out into, quote, unquote, cold or warm storage on something like S3, while still being able to query it? And just some of the strategies around data modeling and life cycle modeling for Postgres as your primary interface to all of your data.
[00:42:22] Unknown:
That's actually an interesting point. I was surprised you brought up S3 for that. That certainly wouldn't be a best practice, through S3. I think that S3 obviously has its place, but that's not it. The most common way, and what I would consider the standard way, is that you would roll data off onto a non-SSD volume. Right? Good old-fashioned spinning metal, and use tablespaces to do so. So you would have an environment where Postgres is just running like normal. But if you go into this schema with these tables that are partitioned, that schema and those tables and those partitions are all sitting on rotational media versus SSDs.
And there are a number of advantages to this. One is that rotational media is much cheaper and also much larger. So you can have an environment, whether you're hosting it yourself within a data center, or if you're in the cloud where you're mounting, you know, network drives, essentially. You're just able to add that volume at whatever size that you see or need, and then move the data over there, and you still have your standard interface. And what's great about this is that if the rotational media isn't fast enough, you still have the opportunity to do something like install Swarm64, which will, as already mentioned, greatly compress that data, or at least the indexes within that data, and allow much higher performance query capability, even though it's rotational drives.
That's kind of the standard way. Obviously, if you're talking about ancient data that you just don't wanna get rid of, you had mentioned cold storage, you know, it's not as simple. We can certainly roll it off using something like logical replication into a secondary node that's sitting on whatever you want it to sit on, and then you can just turn that node off. That is absolutely an opportunity. And actually doing that functionally gives you batch updates as well. Because if you were to turn that secondary node off, you know, if you're adding data, say, once a month, if you turn that node back on in a month and it reconnects, all that data will come down from your primary.
It'll replicate to that secondary node, and then you can shut that secondary node off again. So there are a couple of ways to do it, but the standard way of the warm storage or the rotational storage would be the most obvious with tablespaces.
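A minimal sketch of that tablespace approach, with hypothetical paths, table names, and date ranges, could look like this:

```sql
-- Tablespace on a cheaper rotational (HDD) volume; the path is illustrative.
CREATE TABLESPACE warm_hdd LOCATION '/mnt/hdd/pg_warm';

-- A range-partitioned fact table: recent partitions stay on the default
-- (SSD-backed) tablespace, older ones live on spinning disks.
CREATE TABLE events (
    event_time timestamptz NOT NULL,
    user_id    bigint      NOT NULL,
    event_type text        NOT NULL
) PARTITION BY RANGE (event_time);

CREATE TABLE events_2024 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

CREATE TABLE events_2022 PARTITION OF events
    FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');

-- As a partition ages out of the "hot" window, roll it onto the HDD volume.
ALTER TABLE events_2022 SET TABLESPACE warm_hdd;
```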
[00:44:56] Unknown:
Just to add to this, if you're happy to do a little more work, what you can do is you can use a foreign data wrapper. Again, I'm going to use the Parquet one as an example because, I mean, Parquet is a standardized format, and, of course, it is always important to use standardized things. It's been maintained by Adjust, a company from Berlin, and they're now part of AppLovin, a US company. And the nice thing is that Parquet itself is a very heavily compressed format. And what you can do, basically, is, for example, with a script or something similar, you could actively offload data from your Postgres data warehouse into Parquet files and keep those files, as JD was saying, possibly not on S3, but keep them on a comparatively cheap storage medium, like, for example, spinning disk based Elastic Block Store volumes. Yeah.
And you basically put it onto an EBS volume that's based on spinning disks, so that's quite cheap. And then it's heavily compressed, and then you can access it with a foreign data wrapper. So that's another opportunity to basically age your data away while still having the full ability to query it for analytical workloads. So I would say that's maybe a middle-ground solution. I think the tablespace solution mentioned is probably the easiest one for a quick fix. Also, Postgres has a history of working well with rotational media, with spinning disks. It is quite nicely optimized to work quite well with that. Then if you wanna be a little more involved and get greater savings, greater space savings, there's the way to basically copy data away to Parquet files and then have those pulled back in via, basically, a foreign data wrapper, still making it read-only queryable.
And then if you wanna go to extremes, you could basically make replications onto nodes that you switch off, and have a very kind of actively controlled and orchestrated mechanism where you basically have something being switched off most of the time and only being switched on once a month. So, yeah, there's a variety of ways in which you can do it. I think right now, it's not 100% automatic, as you could tell from our answers, but it is quite easy to set up. So if you're interested, I mean, a company like Command Prompt could probably set that up for you very quickly.
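One hedged way to wire up the middle-ground option Thomas describes is to attach an archived Parquet file back into a partitioned table as a read-only foreign partition (foreign tables can serve as partitions in modern releases); the file name and date range below are hypothetical, it reuses the parquet_srv server and events table sketched earlier, and whether this works in practice depends on the specific foreign data wrapper:

```sql
-- Suppose 2019 data was exported to a Parquet file by an external job.
CREATE FOREIGN TABLE events_2019 (
    event_time timestamptz NOT NULL,
    user_id    bigint      NOT NULL,
    event_type text        NOT NULL
)
SERVER parquet_srv
OPTIONS (filename '/mnt/hdd/archive/events_2019.parquet');

-- Attach it as a partition: queries over the whole events table still see
-- the archived year, but it no longer occupies expensive primary storage.
ALTER TABLE events
    ATTACH PARTITION events_2019
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');
```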
[00:47:20] Unknown:
We definitely could. Another area that has been seeing a lot of activity and variations in terms of its popularity is the data lake where there was the Hadoop era where everybody was all on the sort of commodity storage and MapReduce, and then now with object storage and these different query engines and cloud data lakes, and now there's the hybrid lakehouse. And, you know, we've talked a bit about sort of being able to move some of your hot data into warm storage on spinning disks. But for the case where you already have data in your data lake, you know, maybe it's already in parquet files, and there's maybe a Hive metastore that has all the table definitions. What are the capabilities for Postgres to be able to integrate with that and either, you know, migrate it all into Postgres or just have them live separately but be able to interop easily?
[00:48:08] Unknown:
I would definitely opt for the second, but I think JD can comment in a moment. But I would definitely think more about interoperability here, and Postgres being a consumer of your architecture there, as opposed to Postgres being the central manager of it all. That would be my take on it. So I think it's a fantastic strength of Postgres to be able to interoperate with these things. I think there's about, I don't know, 30, 40, 50 different foreign data wrappers, so there's a lot of formats that are supported. As JD has also correctly mentioned, they are of varying quality.
So I think there's a wide range where you can really rely on them and use them for production. There are others where you may wanna enlist the help of Command Prompt or Swarm64 to get them to a level where you can use them for production. But the strength is there for Postgres to interoperate with it. I'm not sure I would use Postgres as a data lake orchestration or data lake management engine. JD, what do you think? No, I would absolutely agree with that. I mean,
[00:49:10] Unknown:
at that point, Postgres becomes your canonical store. Right? And you allow all your data within these lakes to do what you do with it. And you can either ingest into Postgres or just interop, you know, making it work is not a problem. And it's not just FDWs, right? There are plenty of what I would call brokers out there. A good example is Debezium, which is open source, which allows you to interface with a lot of different data sources, and you can shove it right into Postgres if you need to. The data lake technology out there is, you know, I remember the days when, you know, you'd be doing EDI.
And it's interesting to me to see all this new stuff and yet more marketing terms come up for what's already been around. But definitely from a technology standpoint, Thomas is absolutely correct. I mean, Postgres is really your consumer at that point.
[00:50:06] Unknown:
As you continue to work with Postgres and its community and work with your customers to help them be able to put Postgres into effect for their various data needs, what are some of the areas of work that are happening in the Postgres community, both in terms of core and the surrounding ecosystem, that you're most excited about and that will help to make it more easily suited to these data warehouse and analytical workloads?
[00:50:31] Unknown:
From my perspective, on the main things going on in core, we've spent the last few releases, we're on release 13 now, adding features. But for the most part, it's been cleanup and optimization. For example, you know, in version 9.6, we had partitioning, table partitioning, which is fundamental to a successful data warehouse. It was okay. It did what it was supposed to do. But now, our data partitioning is, well, I'll give you an example. It used to be that having, say, 250 partitions, planning a query around that number of partitions could take a considerable and demonstrable amount of time. Whereas now in the newer versions, the more modern versions of PostgreSQL, planning around even thousands of partitions is minuscule. You don't even notice it. So it's those kinds of optimizations to deal with larger datasets.
And, you know, a lot of people are like, oh, it's just a tweak. No, it really isn't because you're talking about a code base that goes all the way back to the eighties. There's quite a bit of, oh, if we change this, we have to change that, and then we have to change this. But at the end of it, it becomes highly optimized. And that type of work is happening now. And there's a lot of really good stuff. I'm not gonna go into specific patches, but I think that PostgreSQL used to be a very good departmental database. And then it became a very good Internet slash web database.
And now, well, it used to be that to have a data warehouse meant a terabyte of data. Having a terabyte of data is nothing now, because we're keeping everything. And so now we're modernizing the code base again to deal with the current demands for data volume and querying that data.
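A quick way to see the planner behavior JD is pointing at, reusing the hypothetical events table from the tablespace sketch above, is to ask for a plan on a modern release; with pruning on, only the matching partition should appear even if thousands exist:

```sql
-- Partition pruning is on by default in modern releases.
SHOW enable_partition_pruning;

EXPLAIN (COSTS OFF)
SELECT count(*)
FROM events
WHERE event_time >= '2024-03-01' AND event_time < '2024-04-01';
-- Expected: a scan of events_2024 only, with the other partitions pruned
-- at plan time rather than scanned and filtered.
```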
[00:52:23] Unknown:
I would say in terms of excitement, I do think that the ZedStore initiative by Greenplum is an interesting one, if it makes it into a later version of Postgres, because it will offer a compressed table storage that is more optimized towards analytic type workloads, and that will work seamlessly with Swarm64 then as well, because, as I said, we are just adding to Postgres. We're not taking away. And we think that could be a very nice complement to what we have done with the columnstore index. So then it offers a wider variety of what users can use, and they would, in some cases, keep data in the standard Postgres format, which has a ton of advantages, of course, but they would sometimes choose to use ZedStore. So I find that very exciting. So, you know, we're doing what we can to support that project. It will probably, unfortunately, take a bit of time until it's there, maybe two or three years. I wouldn't hold my breath yet, but that's one thing that really excites me. Actually, it's interesting that you just brought up a timeline.
[00:53:24] Unknown:
Although I know that currently the plan is to get it into core, within two to three years, I would expect that the value of having it in core would be rather reduced. And instead, like Swarm64, it would be essentially an extension. And there's a lot of benefit to keeping it out of core in that fashion, because it will be able to move at its own development life cycle, while maintaining compatibility, versus PostgreSQL's development life cycle. That would be huge for a project like that as, you know, new methods and new ideas come about, and being able to test and, you know, have, you know, long-term releases versus short-term releases for testing in early adopters and all that type of thing. But I would agree with you that ZedStore is definitely one of the more interesting technological improvements that are coming about.
[00:54:21] Unknown:
And in your experience of using Postgres and working with other people who are trying to deploy Postgres for these analytical use cases, what are some of the most interesting or innovative or unexpected ways that you've seen it used? We just recently had a project
[00:54:36] Unknown:
in the health care industry. The client demand was that all changes to a particular table be exported to JSON and simultaneously imported into an S3 bucket. And this was literally to be consumed by some external analytical platform to run reporting, financial reporting. And of course, Command Prompt happily obliged: you know, we built a system using Postgres, using logical replication slots and change data capture, to export to JSON and insert it into S3. And in fact, we were able to build it out in about 10 days, after an extensive project by another company had failed trying to use the Data Migration Service.
To me, the reason that's interesting is because of how wrong it was that we had to do it in the first place. But it shows the functionality, or the extensibility, of Postgres. Right? The adaptability. We had a demand for canonical data to be transformed into a different data format live as it changed and imported into S3 for export into another reporting system. And we were able to build that out in basically a week and a half. It's been in production for, you know, 4 months without a blip. I attribute that not to Command Prompt, but I attribute that to the stability and extensibility of PostgreSQL itself.
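As an illustration of the pattern being described here, not Command Prompt's actual implementation, the following rough sketch drains change data capture events from a logical replication slot and pushes them to S3 as JSON. It assumes wal_level is set to logical, the wal2json output plugin is installed, and psycopg2 and boto3 are available; the slot, database, and bucket names are hypothetical.

import json
import boto3
import psycopg2

conn = psycopg2.connect("dbname=clinical")
s3 = boto3.client("s3")

with conn.cursor() as cur:
    # Create the slot once; it persists and tracks how far we have consumed.
    cur.execute("""
        SELECT pg_create_logical_replication_slot('audit_slot', 'wal2json')
        WHERE NOT EXISTS (
            SELECT 1 FROM pg_replication_slots WHERE slot_name = 'audit_slot'
        );
    """)
    conn.commit()

    # Drain the pending changes; each row's data column is a JSON document
    # describing the committed inserts, updates, and deletes.
    cur.execute(
        "SELECT lsn, data FROM pg_logical_slot_get_changes('audit_slot', NULL, NULL);"
    )
    for lsn, data in cur.fetchall():
        # One object per consumed WAL position, keyed by LSN.
        s3.put_object(
            Bucket="reporting-exports",
            Key=f"changes/{lsn}.json",
            Body=json.dumps(json.loads(data)).encode(),
        )
conn.close()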
[00:56:08] Unknown:
Fantastic example of it. I would actually go back to what I described earlier. It's been a while since we came across this. But, basically, the idea to use updates instead of an append only workflow makes your queries tremendously easier on the reading side, as in when you're trying to understand the data. It's tremendously easier because otherwise you have to reconstruct, for example, a certain flow of status updates, like the example I gave with the shipping status, where you basically then have to backtrack and kind of rebuild, for every individual product, what has happened in terms of shipping updates, like what kind of life cycle it has gone through.
Instead, just having an update and basically keeping 1 of those modern array style data types or structured data types like JSON and doing it this way, which is a very, very unusual paradigm by data warehousing standards. But it's such a brilliant concept because it just makes your queries so much easier, also from an understanding point of view. Instead of some database analyst having to conceptualize in their head 3 different joins and multiple sorts in order to stitch the exact shipment tracking history of every 1 of your shipments back together, that's just a single JSON field that contains all the updates.
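A minimal sketch of that "update instead of append only" pattern, with a hypothetical shipments schema: each shipment's full status history lives in a single JSONB array that is appended to in place, so reading the life cycle back requires no joins or sorts.

import json
import psycopg2

conn = psycopg2.connect("dbname=warehouse")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS shipments (
            shipment_id bigint PRIMARY KEY,
            history     jsonb NOT NULL DEFAULT '[]'
        );
    """)
    # Append the latest status event to the existing array in one statement.
    new_event = {"status": "out_for_delivery", "at": "2021-05-20T09:15:00Z"}
    cur.execute(
        """
        INSERT INTO shipments (shipment_id, history)
        VALUES (%(id)s, jsonb_build_array(%(event)s::jsonb))
        ON CONFLICT (shipment_id)
        DO UPDATE SET history = shipments.history || %(event)s::jsonb;
        """,
        {"id": 42, "event": json.dumps(new_event)},
    )
    # Reading the whole tracking history back is a single-row lookup.
    cur.execute("SELECT history FROM shipments WHERE shipment_id = 42;")
    print(cur.fetchone()[0])
conn.close()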
And I think those kinds of things are great learnings. Like, coming across that and then realizing, wow, here, someone did something really genius. That's always the fun part of the job.
[00:57:41] Unknown:
As you have built out your own businesses and worked with your customers and, you know, dug into the guts of Postgres itself, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:57:56] Unknown:
Oh. So you put the term business in there, and I think that's the most interesting and challenging aspect: as the market has changed, a lot of clients don't understand those changes. And I'll give you an example. RDS Postgres is a fantastic platform if you want an almost 0 barrier entry into developing or deploying PostgreSQL. Okay? However, just because it's managed Postgres does not eliminate the need for a DBA. It eliminates the need for an SRE, but you still need someone who understands PostgreSQL, understands its access patterns, understands its configuration, understands how to optimize it, how to extend it, how to scale it, how to back it up, how to make it highly available, all those types of things.
And a lot of people in the market, what they do is they will hire a developer and say, build me this magical widget. And the developer's great at building magical widgets. But the developer has no idea of the chemical composition that is required to make that magical widget. Right? He's all about, I've got this API, I've got this framework, I've got this code with, you know, objects and methods and actions that I'm putting together. And I'm working with a user interface guy to make sure that it's usable, but they don't understand Postgres underneath. To them, it's just a data store.
And that's a really unfortunate thing that has come about because it's not just a data store. This actually applies to any database. Right? It doesn't matter if it's MySQL or Snowflake or Greenplum or whatever. You do still have to understand the technology that is hosting and managing your data. And I think that's the most in fact, I would say it's the most complicated part of our business experience with Postgres is that we will take a request from a potential client or even existing client, and they'll say, you know, our Rockstar developer says it's Postgres' fault. And you go in and you look at what the Rockstar developer is doing, and what it is is clearly the Rockstar developer knows nothing about databases.
They only know how to put or get or post or, you know, insert through a method within their ORM. And that's challenging because then we have to prove the point. Right? You would almost think that's mundane, but it's actually, I would say that the majority of the business that Command Prompt does is not necessarily support in terms of things are broken. It's support in terms of education to the client on how the platform actually works.
[01:00:46] Unknown:
And I would add to that the emergence of Postgres solutions, all having kind of small nuances in their flavor: the parts they automate, the parts they take away from the user in the good sense, as in the user doesn't have to spend any time on them anymore, and the parts they take away from the user in the bad sense, as in, for example, RDS excluding a certain number of extensions and no longer having the standard extension interfaces that regular Postgres has. I think that is something that really creates a larger level of complexity, and it can go as far as branding even. So if you look at Amazon's Aurora offering, which you can have in 2 flavors: it's called Aurora, but there's an Aurora MySQL and an Aurora Postgres.
So it's basically muddying the water a little bit, which is a shame, because it makes things a little more complicated from an operational point of view. So I would definitely stick on that point: operating it, especially as you're getting further and further removed from the database, is starting to become the bigger challenge for the users.
[01:01:57] Unknown:
No. I think that that's actually a very good point, Thomas, because if you look at all the managed providers, right, the extension problem is obviously a very important 1. RDS is great as long as you live within its box, until the moment you wanna do something that's outside of its box. And keep in mind that that's not the Postgres box, it's the RDS box. Because you can make Postgres do anything you want, because it's open source; RDS is not. And then you take something like Aurora PostgreSQL, which is actually different.
It has different planner semantics, it's got different storage semantics, you know, a pricing structure that's different. And then you take that and go over to Microsoft and their version, which also has its own storage engine. And then all of a sudden, you're ending up in this environment where you say you're running Postgres, but now we have to qualify our Postgres. Are you running Timescale? Are you running RDS Postgres? Are you running Aurora? Are you running, you know, Azure? Are you running GCP? Because they all have their own extensions to Postgres to make it work the way they think it should work.
And so now what's happened, like, for example, with Command Prompt: every good company has a, you know, client intake process. Well, 1 of the things that we do is we say, okay, can you tell us about your Postgres installation? And we ask them half a dozen, you know, actually it's more like a dozen questions now, but it used to be, what version of Linux are you running and what version of Postgres? That was the intake process for knowing how to take care of the client. Well, now it's what operating system? Are you on Linux, or are you on Windows, or other? Because occasionally you run into FreeBSD or something like that.
What version are you running? And then, you know, is this a managed platform? If so, which managed platform? Those types of things. And so Postgres is no longer just this piece of software that we have downloaded or installed with apt-get. It is any number of flavors. And it's also making it challenging for someone like, you know, a Swarm64, who is integrated so deeply with the known open APIs of PostgreSQL. Even though they work just fine with community Postgres, they may not work with something like RDS because of the limitations that RDS puts on its users.
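One quick, hedged way to see which flavor of Postgres you are actually dealing with is to ask the server itself: the set of available extensions and the reported version differ between a stock community build and managed offerings like RDS, Aurora, or Azure. A small sketch, assuming psycopg2 and whatever connection string your environment uses:

import psycopg2

conn = psycopg2.connect("dbname=postgres")
with conn.cursor() as cur:
    # Extensions this particular installation is allowed to offer.
    cur.execute(
        "SELECT name, default_version FROM pg_available_extensions ORDER BY name;"
    )
    for name, version in cur.fetchall():
        print(f"{name} {version}")
    # The reported server version string often hints at the hosting flavor.
    cur.execute("SHOW server_version;")
    print("server:", cur.fetchone()[0])
conn.close()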
[01:04:19] Unknown:
Yeah. As somebody who uses RDS for my day job, I can definitely agree with the constraints and occasionally bemoaning them, but still being able to benefit from the fact that I don't have to worry about making sure that the actual underlying box is up all the time and managing the point in time snapshots for the backups.
[01:04:36] Unknown:
Right. And there are definitely things at which RDS is great. There's no argument there. It's 1 of those things, like, it's great until it isn't. And once it isn't, your life gets really complicated really fast. Exactly.
[01:04:50] Unknown:
And so in terms of people who are looking at Postgres and considering it for their data warehouse or analytical workloads, what are the cases where it's the wrong choice and they might be better served with either, you know, a different database architecture or a different managed platform or just using a distributed query engine, or, you know, throwing everything at Spark and saying to heck with the database.
[01:05:12] Unknown:
In my mind, there's 2 parts to that question. Number 1 is not having the expertise, by far. It's funny, the market thinks that anybody can be a developer or anybody can be a DBA, but you wouldn't expect your DBA to be able to build a car. Right? So why are you expecting your web developer to be a DBA? So you need to have actual expertise, especially with something like a data warehouse, where you have unique access patterns and volumes and volumes of data beyond what you would normally encounter. I mean, your transactional database could run, say, a $1,000,000,000 company, and the transactional database might only be 100 gig or less.
But the data warehouse for that company is a petabyte. It's a totally different understanding of how to deal with the technology. But once you have the expertise, it kinda answers your other question. Right? Is that, you know, what is the problem you're trying to solve? Well, I've got a data warehouse and I need tools to manage this data. And like you said, maybe it's Spark, maybe it's, you know, any number of other platforms. But that question answers itself once you have the expertise. So I don't think that there's any real hard line there, possibly if you're using a platform that is not relational.
You know, I could see a situation where if, for whatever reason, you were using something like, you know, object storage or something like that, trying to push that back into Postgres could be a challenge and maybe the wrong answer to the question. But it really does make itself apparent once you have the expertise to understand what your architecture is doing.
[01:06:58] Unknown:
Are there any other aspects of Postgres in the data warehouse and analytics ecosystem or the work that you're each doing with it and just the underlying technologies or capabilities that we didn't discuss yet that you'd like to cover before we close out the show? Nothing we haven't discussed, no.
[01:07:14] Unknown:
Obviously, I'd like to affirm the part about it being open source. I mean, this is why it has become so successful. Right? It's why Amazon runs on Postgres: because it was open source. And I think that for those who are trying to deploy a new system, whether it's transactional or data warehouse, Postgres should be on the list for consideration because it puts you in an environment where you are forever supported. You can either be supported internally through your own staffing or through companies like Command Prompt. And you are also supporting the ecosystem that is driving forward the new models of how business is being done, like a Swarm64.
That that's kinda where I'm at on it. To add to that,
[01:08:07] Unknown:
it's easy to look at Postgres from a value point of view, free, open source, but we should also not forget that, next to very few other databases, it is definitely up there, I would say, among the top 3 most fully featured databases in the world, full stop. And that allows you to do all these things that we've been talking about today. And that is the value in itself, because who knows, at the beginning of a project, where the project may take you. And to swap out the database 2 years into an operational project is a nightmare. You don't wanna do this. So having the ability to start with something that is just so incredibly fully featured and extensible, I think, is always a very powerful starting point. And it saves you from trying to predefine all your requirements upfront, which we all know from experience doesn't work. You'll always forget something. It saves you from that burden, and you will find a way to solve it later, as we've seen many times.
[01:09:07] Unknown:
Well, let's not forget that requirements are not only forgotten, they change. You may have requirements a, b, c, but, you know, the moment you start deployment or development of whatever you're doing, a, b, c becomes A through Z, and there's a whole lot of A.1.2.3.4s that just appear. And to your point, Thomas, the great thing about PostgreSQL in that environment is that you have started with a platform that can adapt as those requirements change.
[01:09:37] Unknown:
One thing that could probably be a whole podcast by itself is the capability of Postgres to be able to push user defined functions into it using a large variety of different programming languages and approaches, and, you know, the viability of then being able to maybe push down some of your machine learning into the database rather than having to, you know, bring the data to it. And as I said, I'm sure that that'll probably spark up a whole other conversation that's worthy of its own hour plus discussion. So
[01:10:10] Unknown:
Well, definitely. In fact, I was surprised you brought it up. User defined functions and stored procedures are hugely powerful. And specifically, in regards to this topic, you know, you have the ability to write them in, for example, R or Python, both of which are, you know, very well suited and widely deployed for data analytics, data science, statistics, that type of thing.
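A hedged sketch of that in-database UDF idea: a PL/Python function that scores a row without pulling the data out of Postgres. It assumes the plpython3u extension is installable (many managed platforms do not offer it), and the function name, arguments, and "model" are purely illustrative stand-ins.

import psycopg2

conn = psycopg2.connect("dbname=warehouse")
with conn, conn.cursor() as cur:
    # Requires superuser-level privileges and the PL/Python package installed.
    cur.execute("CREATE EXTENSION IF NOT EXISTS plpython3u;")
    cur.execute("""
        CREATE OR REPLACE FUNCTION churn_score(num_orders int, days_since_last int)
        RETURNS double precision AS $$
            # Placeholder "model": in practice you might load a trained
            # estimator rather than hand-coding a heuristic like this.
            return min(1.0, days_since_last / (30.0 * (num_orders + 1)))
        $$ LANGUAGE plpython3u;
    """)
    # The scoring now runs next to the data, inside the database.
    cur.execute("SELECT churn_score(3, 45);")
    print(cur.fetchone()[0])
conn.close()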
[01:10:38] Unknown:
Well, we'll leave that 1 on the table for now, and maybe we'll come back to it another time. But for anybody who wants to get in touch with either or both of you, I'll have you each add your preferred contact information to the show notes so people can follow along with the work that you're doing. And as a final question, I would like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:10:59] Unknown:
Yeah. Last time, I answered with what I consider a certain need in software tools. But to be honest, I think in this day and age I have a very different answer for you today, and I'm really looking at the tool as in the tool you can touch. I do think we're running into a dangerous shortage of chips in general and, specifically, processing chips. I mean, GPUs are the very obvious case, but CPUs are also starting to show a degree of squeeze. And I do actually think that there have been absolutely amazing iterations, especially in the last few years, when it comes to multicore. And what I'm seeing at the moment is that the global demand is currently not met. So what I actually think the number 1 gap at the moment is, is really to get all these amazing chips that have been recently developed out in quantity and to kind of keep up this amazing speed of innovation that the likes of Intel, AMD, NVIDIA, and so on have all shown recently. Because, for me, data warehousing is strongly connected to this massively parallel processing.
And I think the ability to just have these amazing core counts on a single CPU, and then having easily 4 CPUs in the system, is just completely changing the landscape. So I want these chips to deliver. I want them to be easily available. And, unfortunately, at the moment, I see this being slightly under threat. However, as we have this free enterprising world, you know, as long as there's money to be made, people will really scramble to meet the demand. So I'm hopeful that this will pass, but at the moment, I'm a bit concerned about it. In my mind, although I think you're absolutely right with the chip thing,
[01:12:42] Unknown:
it's security tooling. Security has become so complicated that every day you hear about a data breach. So what we really need is to rethink the security paradigms that we're using to protect our data, especially personal data, PII, so that we don't have as much of a concern that this data that's being collected, analyzed, deployed, used for tracking and all of this type of stuff will some way, at some point, become a $50,000,000 bounty, or be released to the world. That, to me, especially when we're talking about data warehouses and data in general, you know, outside of data privacy concerns, which we can get into in another topic, that's probably, in my mind, number 1: protecting that data from bad actors.
And we don't have good tools for that. That's a great point. I completely agree.
[01:13:48] Unknown:
Well, thank you both for taking the time today to join me and share your perspectives and experience of using Postgres for data warehousing and analytical data storage. It's definitely a very interesting and relevant topic and 1 that I've spent a decent amount of time thinking about and digging into. So I appreciate having both of your expertise on that matter and the time and energy that you dedicate to it. So thank you both again for your time, and I hope you each enjoy the rest of your day. Thank you. Thank you very much. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Postgres as a Data Warehouse
Defining a Data Warehouse
Postgres for Analytical Workloads
Postgres in the Cloud and Hybrid Workloads
Configuring Postgres for Data Warehousing
Extensions and Plugins for Postgres
Data Lifecycle Management in Postgres
Postgres and Data Lakes
Future Developments in Postgres
When Postgres Might Not Be the Right Choice
Final Thoughts and Closing Remarks