Summary
PostgreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has allowed it to be used in virtually every environment. At Citus Data they have built an extension to support running it in a distributed fashion across large volumes of data, with parallelized queries for improved performance. In this episode Ozgun Erdogan, the CTO of Citus, and Craig Kerstiens, Citus Product Manager, discuss how the company got started, the work that they are doing to scale out PostgreSQL, and how you can start using it in your environment.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
- Your host is Tobias Macey and today I’m interviewing Ozgun Erdogan and Craig Kerstiens about Citus, worry-free Postgres
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Citus is and how the project got started?
- Why did you start with Postgres vs. building something from the ground up?
- What was the reasoning behind converting Citus from a fork of Postgres to being an extension and releasing an open source version?
- How well does Citus work with other Postgres extensions, such as PostGIS, PipelineDB, or Timescale?
- How does Citus compare to options such as Postgres-XL or the Postgres-compatible Aurora service from Amazon?
- How does Citus operate under the covers to enable clustering and replication across multiple hosts?
- What are the failure modes of Citus and how does it handle loss of nodes in the cluster?
- For someone who is interested in migrating to Citus, what is involved in getting it deployed and moving the data out of an existing system?
- How do the different options for leveraging Citus compare to each other and how do you determine which features to release or withhold in the open source version?
- Are there any use cases that Citus enables which would be impractical to attempt in native Postgres?
- What have been some of the most challenging aspects of building the Citus extension?
- What are the situations where you would advise against using Citus?
- What are some of the most interesting or impressive uses of Citus that you have seen?
- What are some of the features that you have planned for future releases of Citus?
Contact Info
- Citus Data
- citusdata.com
- @citusdata on Twitter
- citusdata on GitHub
- Craig
- Website
- @craigkerstiens on Twitter
- Ozgun
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Citus Data
- PostgreSQL
- NoSQL
- Timescale SQL blog post
- PostGIS
- PostgreSQL Graph Database
- JSONB Data Type
- PipelineDB
- Timescale
- Postgres-XL
- Aurora Postgres
- Amazon RDS
- Streaming Replication
- Citus MX
- CTE (Common Table Expression)
- Hipmunk Citus Sharding Blog Post
- WAL-E
- WAL-G
- Heap Analytics
- HyperLogLog
- C-Store
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production, and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today.
Enterprise add-ons and professional support are available for added peace of mind. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host is Tobias Macey, and today I'm interviewing Ozgun Erdogan and Craig Kerstiens about Citus, worry-free Postgres.
So, Ozgun, could you start by introducing yourself?
[00:01:22] Unknown:
Sure. I'm Ozgun. I'm one of the founders and the CTO at Citus Data. And, Craig, how about you? Craig here. I came over to Citus to head up Citus Cloud and lead product efforts here at Citus.
[00:01:33] Unknown:
And could you each start by describing how you first got involved in the area of data management?
[00:01:40] Unknown:
I'll go first. Prior to Citus, I was at Amazon.com, and we had lots of data that we had to munch through. Amazon was using relational databases, of course, but when we needed to scale, we turned to NoSQL and Hadoop based solutions, because relational databases at the time didn't quite scale. And I was writing yet another service that relied on NoSQL at the time, and I realized that everything I had to do, all the application logic that I needed to build in as my interfaces, was already available as SQL. So I figured, okay, hey, you know, we have all this data.
The good way to go about this is through writing SQL.
[00:02:18] Unknown:
Why not build a solution that makes relational databases scale? And my background is probably a bit different. I think Ozgun is kind of underselling it, you know, being a very early engineer at Amazon and the difficulties they faced with scaling, you know, performance and the shopping cart, essentially, of Amazon.com. Interesting challenges. Mine was kind of more dumb luck. I was at a Fortune 100 company, wanted to try out this cool startup thing, and happened to end up at a company that at the time extended Postgres to essentially be a streaming database and essentially do MapReduce on data as it came in. So we were processing, you know, potentially hundreds of terabytes a day on, you know, a streaming database.
And heading into that, I knew about Postgres, kinda liked it, but I would say I was by no means an expert, and kinda just got, you know, thrown right into it. After the fact, looking back, you know, now 9 years later, I'd say I got pretty lucky getting on the Postgres bandwagon pretty early on, because it's a really, really great foundation to build on. But since that time, I've been, you know, at data focused companies, developer focused companies, ran Heroku Postgres for a number of years. So now I've been in the Postgres ecosystem for a good little bit. And can you start by describing
[00:03:28] Unknown:
what the Citus product is and how the project got started?
[00:03:33] Unknown:
Yeah. Sure. So I can give a little bit, probably, on what Citus is, and Ozgun can probably give a little bit more on the history, since he's been around longer on that side. So Citus comes into play. So first, we're a pure extension to Postgres. We don't fork away from Postgres. We hook into these lower level APIs and basically turn Postgres into a horizontally, elastically scalable database. And what that means is you come to Citus when, you know, you have a relational database that needs to scale out. So I think some of what Ozgun was talking about in that, you know, NoSQL world of Amazon, what they looked at and had to do when they were trying to scale at the time, Citus does that just for a relational database, in this case, Postgres.
So if you need better performance, for whatever reason,
[00:04:13] Unknown:
that's where Citus comes in. And I'll go with the history of it. The history really goes back to our days at Amazon. We basically had all these data services, and we were using different technologies for different things. And I am a developer, and developers have this notion, like, we think in terms of imperative programming paradigms. So we think in terms of, like, for loops, ifs, all of that stuff. And I was writing one service after the other that used these NoSQL solutions or Hadoop based solutions to make the data accessible to the application layer. And I realized SQL, for one, is more accessible, like, because it's a declarative programming paradigm, many more people have an easy time using SQL.
It addressed all of these things that I was writing at the application layer. The only thing it didn't do was scale.
[00:05:12] Unknown:
And I'll correct Ozgun, I was gonna add one thing there. Like, I think SQL is super powerful, useful. It's everywhere. I don't know that people enjoy writing it. There's still something horrible about the language. But overall, it's very powerful and probably the right tool for the job. Yeah. There's a great post from the folks at Timescale talking about
[00:05:31] Unknown:
the resurgence of SQL as the lingua franca of data solutions, in both the relational and non-relational worlds, because of the fact that it is so accessible and able to be read and understood by people who aren't necessarily developers themselves.
[00:05:48] Unknown:
Yeah. I think that's completely spot on. I think if you, you know, look at SQL, it's not perfect. It's not elegant, but it's everywhere. It works. People understand it. It really is kind of the lingua franca.
[00:06:00] Unknown:
And talking more about Citus in particular, I'm wondering why you chose Postgres as the foundation for building Citus on top of, as opposed to, for instance, starting with a brand new engine from scratch, when you were trying to enable this horizontal scaling capability?
[00:06:19] Unknown:
Yeah. So I think if you step back from a macro level and think about, you know, a mature database, how long does it take? Within the industry, it's pretty well regarded that it takes 10 years, sometimes longer, to get a mature database if you're building it from the ground up. Our goal with leveraging Postgres was actually to kinda cheat and shortcut that, right? By leveraging Postgres, we don't have to worry about all of these complex things under the covers, you know, MVCC, transaction management, isolation levels. Not to mention all the cool stuff, right? Not to mention full text search, GIS, rich data types, all those sort of things. Our goal really by extending Postgres was to shortcut that as much as possible. So Citus has been around a little over 5 years, and now we have people putting, you know, very robust production workloads on it. So we kind of cut that time in half by extending Postgres.
Ozgun, I don't know if I missed some of the key things about Postgres you got in there. I think that covers,
[00:07:13] Unknown:
almost all of it. Yeah.
[00:07:15] Unknown:
And were there particular use cases or feature sets of Postgres that led you to use that as the foundation, as opposed to something else such as MySQL or any of the other relational databases on the market?
[00:07:30] Unknown:
I think the fundamental reason is PostgreSQL is way more extensible as a relational database than any other relational database engine out there. When we started working on Citus, some of these extension APIs were not publicly available, but we could still look into the code base and see, hey, this database is fairly modular. Here is the planner. Here is the executor. Here is the storage engine. Here is the logging engine, and we can hook into them and extend them. Whereas in MySQL, I think the only thing we could extend or change or improve was the storage engine. So it was only possible to do this with PostgreSQL.
The things that we were doing with Postgres, we wouldn't be able to do with another relational database. They just didn't have the APIs, and they just didn't have the modular design. Yeah. And I think if you looked at, you know, when Citus started, we were one of the more advanced extensions that existed. Probably the other really, really advanced one would be PostGIS. Mhmm. I would say.
[00:08:26] Unknown:
At the time, it was the early days of those extension APIs, but you could very much see where Postgres was headed, in becoming more of a data platform. Right? It's not just a relational database, but it's a platform you can build on and extend without having to get the code directly into Postgres itself. So you could kind of see how it was shaping up in that sense.
[00:08:45] Unknown:
And particularly with the different rich data types, such as JSON and XML, and some of the other extension capabilities, as you mentioned, it's become much more of a multimodel database, as opposed to being purely relational. You see people writing document oriented layers on top of it just because the underlying storage capabilities are so robust that it saves a lot of time, as you've mentioned, for implementing something that isn't necessarily what was originally intended. But because of those capabilities, it's still a well functioning and production grade solution.
[00:09:20] Unknown:
Yeah. Exactly. And I think the only one of those kind of multimodel, you know, modes that it didn't have was a graph database. And I said that at a conference recently and got corrected: hey, have you seen this extension for Postgres, which turns it into a graph database? So I think now it probably checks all the boxes of what you could want a database to be. And if I understand
[00:09:40] Unknown:
the history of Citus correctly, I believe that when it was first started, it was actually a fork of the Postgres code base and then was converted into hooking into those extension APIs. So I'm wondering if you can discuss the reasoning for that transition as well as the transition from being a closed source to an open source offering.
[00:10:02] Unknown:
Yeah. Sure. Yes, your understanding is correct. When we first started, we didn't have all the extension APIs that we needed to access. The functions were there, but they weren't exposed as extension APIs. And in order to do what we do with Citus, which is basically sharding, replication, query parallelization, and query routing, we have to hook into almost every submodule within the database. This includes the planner, the executor, basically extending the syntax to support distribution, and shared memory initialization, the error logging subsystem, to clean up after ourselves.
And we did that in a way that was still extensible, so we didn't go in and mess with the Postgres internals or fork, but it wasn't an extension, because some of the APIs weren't there. Over time, the extension APIs improved, and our customers started asking about extensions. So we said, okay, now is the right time. Now these APIs are powerful enough. So we converted Citus into a full blown extension. And looking back, the benefit to us is twofold. One is that forking an industrial grade database means we had to support both our responsibilities, which is the distributed execution part of things, but also what comes with Postgres. So any major upgrades, changes, new features, security and data loss fixes, we had to reintegrate and then rerelease.
When we turned Citus into a full extension, we could say, okay, we're focusing on this functionality set, and we don't need to release anything else. Postgres fixes it, Postgres releases it, and it's available to our customers immediately. So it had that huge benefit, and also our development cycles got faster. Our release cycles got faster as a result. Yeah. I think the thing about it is, you know, being a fork
[00:12:03] Unknown:
today, if we were to fork off, people wouldn't miss anything. But it's really about 3, 4, 5 years down the line. If you think of something like JSONB, which came in, you know, 3 years ago now, I guess, I almost don't talk to a person now using Postgres that's not leveraging it. And if you could just imagine, hey, if Citus was a fork as of 4 years ago, we'd have JSON, but no JSONB data type. It becomes much, much less valuable. Whenever the next JSONB comes along, when we finally replace JSONB with some new, better data type, it would be really, really painful to miss out on that. And you had a question on the decision to go open source. I think even from the beginning,
[00:12:40] Unknown:
we were very developer friendly. We had good usage among the developer community, but the fact that it was a fork and also closed source was a barrier to entry. And I think there is a lot to be said about the open source movement, having access to these technologies that were proprietary and super expensive back in the day. So we said, this is the right thing to do. We should put Citus out there as open source so anyone can benefit from it, and it's available to developers and to the bigger community at large.
[00:13:10] Unknown:
And given that Citus is now an extension onto the Postgres database and is no longer a fork, I'm wondering how well it works together with some of the other sort of complex extensions, such as PostGIS, which you mentioned, or also PipelineDB for streaming SQL, or the Timescale extension for using Postgres for time series metrics and data
[00:13:36] Unknown:
types? Overall, it works quite well. In fact, our customers use extensions that I personally haven't heard about every now and then. So, because of the way we hook into the APIs, and because of the way in which we communicate in between the machines, it generally works quite well. We have customers using Postgres. I believe PipelineDB is becoming an extension, or has very recently. So I think the moment it becomes an extension, the two could work quite well. And I'd say the same thing for Timescale. At an architectural level, there is no reason not to use them together.
[00:14:12] Unknown:
We haven't tested it, but it should work. Yeah. It wouldn't surprise me if someone out in the wild has tested it, because, you know, both are open source and they're free to use. We have people show up like that all the time. We have lots of customers and users in production with PostGIS. And as Ozgun mentioned, there are crazy ones that show up every week that we hear about or get requests for. But in general, because these extension APIs can be layered on top of each other, they can work quite well. Mhmm.
[00:14:38] Unknown:
And for the distributed and clustering capabilities, I'm wondering if you can do some compare and contrast between how Citus operates and the functionality it provides as compared to things such as Postgres-XL or the Aurora service from Amazon?
[00:14:59] Unknown:
Yeah. Sure. So I think in some ways we're, you know, similar; in other ways we're very, very different. So, like, Postgres-XL attempts to stay very, very close to the Postgres feature set, including, kind of, like, ACID transactions across all the data, which Citus now has full support for. But Postgres-XL really kind of pays a price in terms of overhead and availability. So with Postgres-XL, there's a global transaction manager, which has to be involved in every single transaction. And essentially, in the real world, what that means is multiple network round trips for kind of every single query or transaction that happens. And the result is, you know, if you have a single node failure, it has the potential to bring down the entire cluster, because there is no built in replication around that. So it works in some ways for scaling out when you need much, much more data, but less so when, hey, I need much, much greater performance.
So most people come to Citus when they're outgrowing very much a single node relational database and they need better performance. Ozgun, you might be able to chime in a little bit more on Aurora. Aurora kind of hooks in and changes that underlying storage layer. Mhmm. You probably know a little bit more of the details there. Yeah. Yeah. I think there are a couple of differences between Citus and Aurora. One is, as Craig mentioned,
[00:16:10] Unknown:
Aurora ultimately follows a shared storage model, where you have one machine that does the writes, and then underneath the covers, you scale out the storage. But if you're running a query, that query still runs on that single machine. If you're doing a write, that write still needs to go through a single master node, and it is the one that actually does the writes. So in terms of scale out capabilities, it scales the storage engine, but it doesn't necessarily scale the CPU or memory part of it. And in addition to it, Aurora is a fork of Postgres, and it uses proprietary technologies underneath, at the storage layer, to give this functionality,
[00:16:49] Unknown:
whereas Citus is an extension and Citus is open source. Yeah. And when I started at Citus, I was like, you know, look at Citus when you're at 2 terabytes of data, or 1 terabyte of data. And I've been amazed that we've had customers come over much, much earlier. I'm like, I don't know if you need Citus; just scale up to a bigger instance, because that's the easiest thing to do. We've had customers move over at a 100 gigs and 50 gigs of data, which easily fits on a single node, easily fits in the memory of a single node, you know, RDS instance. But they saw much, much greater performance benefits on Citus because of the sharding, because of the parallelism. So we actually see people moving over earlier because of some of the other architectural benefits they get. And digging deeper into
[00:17:31] Unknown:
how Citus enables that replication and sharding, I'm wondering if you can just talk about the architectural approach that it uses and how that has evolved over the time that you've been building and maintaining it.
[00:17:42] Unknown:
Citus, at an architectural level, is a shared nothing architecture. So basically, you add new machines into the cluster, and then your performance increases linearly; you scale linearly. The way it is built under the covers is, when you enable the Citus extension, we hook into the query planner, query executor, and the transaction subsystem, and you tell Citus, as the user, like, hey, I have these 10 PostgreSQL machines. To you, it appears like a single PostgreSQL machine with a single PostgreSQL connection string. But the moment you say, hey, I'd like to distribute this table, you tell Citus, I want to make this PostgreSQL table a distributed table, and you give us a sharding key. Behind the covers, we say, okay, now this has to be a distributed table. We take that sharding key and we shard that table across the machines in the cluster. Later, when you run a query, and this could be a transactional query or a different query, we say, okay, does this query have distributed tables? If it does, we hook in at the planning stage.
We replan the query for distributed execution. We parallelize the query across the machines in the cluster, merge the results, and give them back to the user as if it was a single node PostgreSQL database, but obviously with all the work split across the machines and done in parallel.
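The flow Ozgun describes, shard by a key, run the query per shard, then merge the partial results, can be sketched in a few lines of Python. This is purely an illustration of the idea, not Citus code: the table, the tenant IDs, and the shard count are invented, and CRC32 simply stands in for Citus's real hash partitioning scheme.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # Stand-in for hash partitioning on the sharding key.
    return zlib.crc32(key.encode()) % NUM_SHARDS

# A "distributed table" of (customer_id, amount) rows, split into shards.
shards = defaultdict(list)
for customer_id, amount in [("acme", 10), ("globex", 20), ("acme", 5), ("initech", 7)]:
    shards[shard_for(customer_id)].append((customer_id, amount))

# SELECT sum(amount): computed per shard (in parallel on a real cluster),
# then the partial sums are merged on the coordinator.
partial_sums = [sum(a for _, a in rows) for rows in shards.values()]
total = sum(partial_sums)
print(total)  # 42
```

The point of the sketch is only that the merge step on the coordinator is cheap relative to the per-shard work, which is why the scale-out is close to linear.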
[00:18:59] Unknown:
And what are the failure modes that people who are running Citus should be aware of, and how does it handle the loss of a node? And particularly, if you were to discuss it in terms of the CAP theorem, where do you think it falls along that spectrum?
[00:19:13] Unknown:
That's a good question. In terms of the CAP theorem, let me think just for a sec. I think you would consider Citus a consistent and available system that, in the case of a network partition, would just wait for the network partition to go away. If you just think of it in those 3 terms, in practice, I think of the CAP theorem as an approximation, because there are many different dimensions to it. Like, what is the performance going to be like, and how do you think of the ACID properties? But those, I think, are the trade offs that we're making as a relational database, and I think other distributed relational databases make similar trade offs. In terms of the failure modes, consider the more standard ones, where you lose a machine or two, or you have network hiccups, or you lose a disk.
Citus comes with 2 replication models, but I'll describe the one that's more common, which is streaming replication. Every machine in the cluster has a streaming replica, and on Citus Cloud, when we lose one of those machines, we have an agent that fails over to the replica within seconds. So to the end users, the failure usually isn't visible, and it's handled automatically behind the covers.
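As a rough illustration of that failover behavior, here is a small Python sketch. Everything here is invented for illustration, the node names and the agent alike; the real mechanism is Postgres streaming replication plus Citus Cloud's monitoring agent, not application code like this.

```python
class WorkerPair:
    """A worker and its streaming replica; queries go to the primary."""
    def __init__(self, primary: str, replica: str):
        self.primary = primary
        self.replica = replica
        self.primary_healthy = True

    def active_node(self) -> str:
        return self.primary

def agent_check(pair: WorkerPair) -> None:
    # The monitoring agent promotes the replica when the primary is lost,
    # so routing flips without the end user having to do anything.
    if not pair.primary_healthy:
        pair.primary, pair.replica = pair.replica, pair.primary
        pair.primary_healthy = True

pair = WorkerPair("worker-1a", "worker-1b")
pair.primary_healthy = False   # simulate losing the primary
agent_check(pair)              # failover happens within seconds in practice
print(pair.active_node())      # worker-1b
```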
[00:20:34] Unknown:
And looking through the documentation and the sort of architectural diagrams, it looks like, for being able to distribute the queries across multiple hosts, there's a coordination point, the query planner, that determines where the query is allocated and how the data gets sharded. So I'm wondering, from the perspective
[00:20:56] Unknown:
of single point of failure, how does that get deployed? Does each node have an instance of that query planner that's able to use some sort of consistent hashing to determine the location of the data? I'm just curious how that functions. Yeah. So there are really a few options there. So, you know, generally, there is a single point of failure, but what we're running is that streaming replica with high availability. So if that node fails, you've got a standby immediately ready to fail over. What we also have is Citus MX, which is a masterless setup, so you can connect to any node. And the key there is just that we need that metadata of knowing which shards are on which nodes and how to route to them accordingly. So with MX, we basically do a two phase commit to keep all that metadata in sync when there are any changes to it, and you can essentially connect to any node. The coordinator, in a sense, doesn't actually do that much work, isn't that intelligent. It kind of keeps that metadata of, this resulting hash value lives on this node. Really, if you think about it, if you were to manually shard in an application, we're doing just that for you. You just don't have to write it all in your application logic. We just hook into the planner and rewrite things.
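The two phase commit over shard metadata that Craig mentions can be sketched roughly as below. This is a toy under invented names (the `Node` class, the worker names, the shard label); in reality Citus MX does this inside Postgres's own transaction machinery rather than in application code.

```python
class Node:
    """A worker that keeps its own copy of the shard-placement metadata."""
    def __init__(self, name: str):
        self.name = name
        self.metadata = {}
        self.staged = None

    def prepare(self, change: dict) -> bool:
        # Phase 1: stage the change and vote; a real node could vote to abort.
        self.staged = change
        return True

    def commit(self) -> None:
        # Phase 2: apply the staged change.
        self.metadata.update(self.staged)
        self.staged = None

def apply_metadata_change(nodes, change) -> bool:
    # Commit everywhere only if every node voted yes in phase 1, so all
    # nodes agree on which shard lives where.
    if all(node.prepare(change) for node in nodes):
        for node in nodes:
            node.commit()
        return True
    return False

nodes = [Node("worker-1"), Node("worker-2")]
ok = apply_metadata_change(nodes, {"shard_5": "worker-2"})
```

With the metadata identical on every node, any node can plan and route queries, which is what makes the masterless setup possible.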
[00:22:02] Unknown:
And looking again through the documentation, it appeared that for somebody who is migrating from an existing Postgres database, there are some differences in table structures and the ways that they construct their queries in order to be compatible with Citus, because of the distribution aspect of it. So for somebody who is considering that migration, what are the steps involved, and what are the things that they should be considering as they go through that effort?
[00:22:28] Unknown:
Yeah. So within Citus, there are really kind of 3 types of tables. And I think Ozgun hit on one, right? The sharded tables. It really comes down to a big question of data model. To give kind of the typical product manager answer, you know, how do you structure your data for Citus? It entirely depends on what you're doing with it. And so one is that sharded table. If you're, you know, a standard multi tenant app, think of something like salesforce.com. That customer's data is only that customer's data. And the easiest way to do that is just shard by customer ID, and all tables get sharded by that. Not just one table, but all of them by customer ID. And then they all get co-located together. Then we have a couple of other types of tables. We have reference tables, which are smaller lookup tables that we'll broadcast to every single node, so you can easily join against them. Think of something like, maybe, a states table or ZIP codes that lives on every single node that you commonly join to. And then we have tables that just stay on the coordinator itself. So these might be internal admin tables, that sort of thing. And in general, what we see people do is kind of start at the one or a few tables that they're really having trouble scaling. Those are most likely sharded tables.
Then they start to identify the relationship of other tables to those. Like, does this need to be on every node, or is this a sharded table, or is this just a local table that doesn't have any other relationship? Once you've kind of done that, there's just a simple function you call to specify a reference table or to specify a sharded table. You run all this on Citus, and then your schema is completely set up and ready to go. On the application side, it depends a little bit on how much work you need to do. If you're sharding by, like, a multi tenant app, if you have that, you know, customer ID on every query, you don't actually have to do any additional work. It just works.
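The co-location idea above, shard every table by the same tenant key so one tenant's rows across tables land on the same shard, can be illustrated with a short Python sketch. The table names, tenant IDs, and shard count are all invented here, and CRC32 stands in for the real hash function.

```python
import zlib

NUM_SHARDS = 8

def shard_for(tenant_id: str) -> int:
    # Both tables use the same function on the same key, so placements match.
    return zlib.crc32(tenant_id.encode()) % NUM_SHARDS

# Two "tables", both distributed by their tenant (customer) ID.
orders = [("acme", "order-1"), ("globex", "order-2"), ("acme", "order-3")]
invoices = [("acme", "invoice-1"), ("globex", "invoice-2")]

# For any tenant, order rows and invoice rows hash to the same shard,
# so a join on the tenant key never has to cross nodes.
order_shards = {t: shard_for(t) for t, _ in orders}
invoice_shards = {t: shard_for(t) for t, _ in invoices}
co_located = all(order_shards[t] == invoice_shards[t] for t in invoice_shards)
print(co_located)  # True
```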
If you're doing very, very complex aggregations, it depends a bit. And Ozgun could probably speak a bit more on the SQL coverage if you're doing complex things like CTEs, and how that works in a distributed system. There's more and more support coming in time for that. And then on the cutover side, we have some tools that actually live replicate from an existing Postgres instance to Citus. Just a couple weeks ago, we had a customer cut over a 4 terabyte database with, I think, 1 to 2 minutes of downtime. So we're able to do that actual cutover, which is a big concern for people. You know, when you've got a 4 terabyte database to dump and restore, it could take days, possibly, at a minimum several hours. So we've built a lot of tooling to actually do that cutover. But Ozgun, I don't know if you have anything extra you wanna add on the kinda SQL piece, on those complex SQLs and that kinda thing. I think the only thing that I'll add on this,
[00:25:01] Unknown:
is that it kicks in more when you are looking into the real time analytics use case, or maybe a bit into data warehousing. We have way more coming with Citus 7.2. So for that use case, when you're running very complex aggregations, or when you're running CTE operations that span across multiple shards and machines, right now you need to do a few workarounds, but that is going to change with the upcoming Citus 7.2 release.
[00:25:29] Unknown:
And looking at the feature comparison matrix between the different offerings for Citus with the open source versus hosted versus on premise, I'm wondering if you can just briefly explain how they compare to each other and how you determine which features to release to or withhold from the open source version as compared to the on premise?
[00:25:50] Unknown:
Yeah. So I think at the core, I'll start on the open source side. There, you get all of the sharding. You get all the scaling. You get all the performance of Citus. Like, we don't wanna say, hey, you need to pay more to get the performance out of Citus. Like, it's there. It's open source, and, you know, every week we have new people writing blog posts that we had no idea they were using Citus up and running in production, which is kinda cool. Like, recently saw, like, a company write a blog post of, you know, how a sharded Citus cluster helped them improve performance.
So it's great to see, you know, just kind of the cool part of open source. The bad part is, you know, we have no idea how many people are out there using it. Then what we have are kind of 2 other options. We have the enterprise license version, which has additional features really kind of around helping you run Citus in a production environment. So there's additional security controls in place. There is the shard rebalancer that allows you to essentially elastically scale nodes with 0 downtime. And, you know, if you're on your own, you can basically put in some engineering time and figure out things or take your application offline or do a number of things there. So we have people that run the open source community version in production. But we've really focused on building more tools into the enterprise to make it easier to run and maintain in a highly available production environment on the enterprise side. And the third option we have is a fully managed database as a service. And here, we actually had myself and the original engineering team from Heroku Postgres, where we ran somewhere around 1 to 2 million Postgres databases for customers, come over and build out a fully managed database as a service.
And there, we're running the enterprise edition for you, so you get all that benefit. Plus you get backups, monitoring, high availability. Basically, our goal is to be not all of your DBA, but I would say, like, 75% of your DBA, so you can focus on things like, you know, features, optimizing queries, indexing, that sort of thing. Basically, you don't have to worry about keeping the lights on. We do that for you. You know, which features go in which version? I would say any kind of core Citus performance, the basics of Citus, will continue to be in open source. Then on the enterprise and cloud side, we'll continue to add more value added features for running and maintaining, so that you can kind of do less and we're doing more of that for you. That's kind of our goal there. And we've actually seen customers that got to production on the open source version then come over and say, this is running and it's fine, but I've gotta have an engineer, kind of, you know, a DBA, that's maintaining this, taking care of security patches, upgrades, all that sort of thing, and then come over and cut over to the cloud fully managed service side. So our goal there is basically to make it so you don't have to worry or think about your database, which we've seen a lot of customers kind of appreciate.
[00:28:39] Unknown:
And on more of the operational side of things, 1 question that I have is around backups, where if you have a single node database, it's a fairly known quantity as to how to run the backups, where you can either replicate the write-ahead logs or take a snapshot of the disk and back up all of the actual database files or use pg_dump. But once you start scaling out horizontally across multiple databases and sharding that, it seems like it would add some complexity as to how to identify which hosts have the canonical sources of data, which ones are copies of each other, and make sure that you get a clean, accurate dump of the entire database for backup purposes.
[00:29:19] Unknown:
Yeah. I think that's mostly spot on. Right? Like, a lot of the same principles apply. So 1 of the most popular kind of tools for backups, WAL-E, we have the engineer that wrote it, and we recently released WAL-G, which is a Go rewrite. So it's up to 7x faster on restore. So we're using a lot of those same principles of the base backups, streaming the write-ahead log to S3 every 16 megabytes or 60 seconds, whichever comes first. But then we do have some additional bookkeeping around which nodes have had backups, at what time, what's the state of them. So it's the same, in a sense. It's just multiplied based on the number of nodes and that extra metadata coordination you've gotta kind of keep up with. And we recently merged in a change, or are about to merge in a change, where we basically do a 2PC distributed transaction across the cluster.
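The shipping policy Craig mentions, a segment every 16 megabytes or 60 seconds, whichever comes first, can be modeled with a tiny Python sketch. This is a toy illustration of the trigger logic only, not WAL-E/WAL-G's actual implementation:

```python
from dataclasses import dataclass

SEGMENT_BYTES = 16 * 1024 * 1024  # default Postgres WAL segment size
MAX_WAIT_SECONDS = 60

@dataclass
class Archiver:
    """Toy model of 'ship WAL every 16 MB or 60 s, whichever comes first'."""
    buffered: int = 0
    elapsed: float = 0.0

    def record(self, nbytes: int, dt: float) -> bool:
        """Account for nbytes of new WAL after dt seconds; return True
        when it is time to ship a segment to S3 and reset the counters."""
        self.buffered += nbytes
        self.elapsed += dt
        if self.buffered >= SEGMENT_BYTES or self.elapsed >= MAX_WAIT_SECONDS:
            self.buffered = 0
            self.elapsed = 0.0
            return True
        return False
```

The time-based trigger bounds how much data you can lose on a quiet database; the size-based trigger bounds it on a busy one.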
[00:30:07] Unknown:
That helps with backups too, in the way that once we do that transaction, and we're using PostgreSQL 2PC, we can basically get the log sequence numbers. These are the WAL log sequence numbers from every node, and say, hey, at this point in time, this is a consistent backup. Each node was at this specific point in their log files. Through leveraging that information, because that gives us a consistent view of the cluster, we can back up to a particular log sequence number on each node and then replay from that backup to have a fully consistent and distributed backup. So this is a very recent change, that either just merged or is about to merge, to establish backups that are fully consistent not only on a single node, but across a distributed cluster.
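The idea of a cluster-wide consistency point can be sketched as follows: each periodic 2PC records every node's log sequence number (LSN), and a restore picks one recorded point and replays each node's WAL up to exactly that point's LSNs. The data layout below is hypothetical, purely to illustrate the bookkeeping:

```python
def pick_restore_point(points, wanted_time):
    """Choose the newest recorded consistency point at or before
    wanted_time; every node then replays WAL up to that point's LSN,
    which yields a cluster-wide consistent state."""
    eligible = [p for p in points if p["time"] <= wanted_time]
    if not eligible:
        raise ValueError("no consistency point before requested time")
    return max(eligible, key=lambda p: p["time"])

# Each entry: the moment a 2PC committed, and the LSN every node had
# reached at that moment (values are made up for illustration).
points = [
    {"time": 100, "lsns": {"node1": "0/16B3740", "node2": "0/16A9A28"}},
    {"time": 200, "lsns": {"node1": "0/17C0030", "node2": "0/17B1F60"}},
]
```

Restoring nodes to LSNs taken from different points would mix states that never coexisted; anchoring all nodes to one point is what makes the distributed backup consistent.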
[00:30:52] Unknown:
And, again, in the operational sense, for somebody who is considering deploying the open source version, what are the sort of resource capacities that they should be thinking of in terms of number of machines and amount of memory, amount of disk, and the types of disk that they should be running on the different nodes or different types of nodes?
[00:31:11] Unknown:
Yeah. I wish I had a perfectly clean-cut answer. It entirely depends on your workload. We have customers that run with huge disks and really low memory because of their workload. And we have people that run, you know, more memory, lower storage, and try to keep their entire dataset in memory. It really depends on your access pattern, your workload, what you're trying to do. Generally, what we have found is there's not a huge difference between a bunch of smaller machines, like, say, 8 smaller machines, versus 2 really large machines when it's the same overall resources. Citus actually handles a lot of that really well, whether it's the number of nodes in the system or the number of cores in the system. And it scales pretty elastically that way, and same thing for memory. It gets down into the details of how many cores you have, which determines your shard count and other things. But a good resource there is actually to hop into our docs. And if anyone ever asks questions on sizing, whether it's even on the open source or the enterprise version, we're usually happy to help. It's use case specific, but there's some good guidelines that we've seen based on, hey, this is your use case, this is your access pattern, here's the right thing. And we can get to those usually in just a matter of minutes with a few questions based on, you know, tailoring it to the right use case.
[00:32:14] Unknown:
And is there a minimum number of instances that you should be using for being able to deploy a cluster? And then also in terms of the network layout, should you be aiming for locating the instances fairly close geographically, or is it something that is capable of spanning multiple data centers for a sort of global availability?
[00:32:35] Unknown:
You know, generally, you start with at least 2 worker nodes and 1 coordinator node. So this doesn't make a lot of sense when you're running it all on kind of a single node. So we see people start at 3 node setups and go as high as, do we have a customer with a 100 nodes, or is it just 80 in production? Like, we're right around that 80 to 100. Yeah. In production and very, very beefy machines, you know, and that's holding, I believe, roughly 890 terabytes of data. So it's just shy of a petabyte. So we're comfortable at the petabyte scale. In terms of, you know, setting up the kind of region locality, we generally do recommend that it's in the same kind of region or availability zone for latency. That said, you could run Citus in a geographically distributed setup. Most people that are coming to Citus, though, they want better performance, and that's why they're picking Citus: I'm running into issues on my single node database from a performance perspective.
So we do see, more so, that Citus is located in the same region, short millisecond hops between each other, so that you don't have that kind of performance slowdown there.
[00:33:38] Unknown:
And beyond just the performance capabilities and the data storage capabilities that Citus enables beyond the base Postgres, are there any use cases that Citus enables which would be impractical to attempt in just a standard Postgres distribution?
[00:33:57] Unknown:
I think any use case where your dataset size is fairly large, so that PostgreSQL wouldn't be able to handle it on a single machine, Citus would help. And as Craig alluded to, the most typical use cases that we see are the multi tenant applications or the B2B applications. These are operational workloads, real time analytics where you need to ingest lots of data and then look at that data in real time, and also if you're looking to combine the best aspects of NoSQL with the relational properties of, like, Postgres.
[00:34:31] Unknown:
Those are the most typical ones. I think there are some unique ones, though, specifically around, you know, parallelism. Yeah. Even without a large amount of data, 1 customer comes to mind. So Citus actually powers analytics for professional football. So if you, you know, watch Monday Night Football and there's some crazy esoteric stat that, you know, Tom Brady is 47% in his career when going against the Jaguars, when going for, you know, between 7 and 11 yards. Like, that just came out of a live Citus cluster. And the amount of data there, you know, it's a decent amount of data, but it's not petabyte scale. The difficulty and challenge there was the data needed to be sliced and diced in 10,000 different ways and needs to respond in less than 1 second, because otherwise you're sitting there with kind of dead air. Hey, what are we talking about? So I think a lot of times when you have an explosion of variables in your data and you need to have a bunch of ways to slice and dice it, that's kind of 1 of the more, I would say, unique and kind of out of the box ways where Citus uniquely excels.
[00:35:41] Unknown:
Yeah. When you're live on air and when you can't afford to wait 3 seconds, then that's a use case where, yeah, Citus excels. Yeah.
[00:35:52] Unknown:
And would it be fair to pit the capabilities that Citus provides against offerings such as Redshift or BigQuery that are focused at large scale data warehousing with fairly low latency for interactive SQL requests?
[00:36:10] Unknown:
Less so. So we do get that question a bunch, and I think maybe it's because, you know, Citus supports JSONB, and then some of the others are still waiting on it. But, less so. Citus is really focused on higher concurrency and much more kind of real time responsiveness, and less so on, as Ozgun was mentioning, some of the, you know, very, very advanced kind of SQL functionalities, window functions, and that sort of thing, which you often see in a data warehousing setup. So, like, as a rule of thumb, if you have an internal data analyst that's gonna spend, you know, 2 hours writing a query, run it, and it's gonna analyze hundreds of terabytes of data and come back in, you know, 1 minute, which is a good, impressive time, 1 of the other 2 are probably the right tool. Whereas if you're, you know, more predefined and you've got an end user. So I think the most common case we see is, like, when Citus comes in, like, there is someone that's hitting a website somewhere that needs a response in less than a second, and you've got thousands, tens of thousands, hundreds of thousands of potential users that need that kind of responsiveness, less internal analysts. So, in some cases, we have it come up, but generally not kind of head to head for the exact same use case. Does that make sense? Yes. And what have been some of the most challenging aspects of building and maintaining the Citus extension,
[00:37:29] Unknown:
whether from the technical or the sort of advertising or community aspects of it?
[00:37:36] Unknown:
I think from the technical standpoint, probably at first, and still, getting a deep understanding of the PostgreSQL code base and the different modules, how they interact with each other, and how they interact with Citus. Because what we're doing with Citus is fairly unique in that we're enabling all these distributed capabilities by hooking into all these places within the engine. And the Postgres code base has been evolving over the past 20 years. It still evolves. It's very lively. And understanding what every module does and how it does it, so that we can extend on that functionality, has been the most technically challenging part. That challenge came with its own rewards, though. A lot of the functionality that Citus provides today would take years.
Each feature would take years to build from scratch. By understanding it, we get them for free, and not only the features but also all the tools built around Postgres, they just work. All your processes, everything that you use, all your binaries, they just work. But, yeah, understanding all the different Postgres modules in detail and extending them has been the biggest technical challenge.
[00:38:47] Unknown:
Yeah. I think on the community side, it's simply that most people confuse us for Citrus Data rather than Citus Data. That's probably the number 1 challenge there.
[00:38:55] Unknown:
And what are some of the situations where you would advise against using Citus?
[00:39:00] Unknown:
I think the biggest 1 is kinda coming back to, you know, a minute or 2 ago, with that data warehousing use case. Like, if you truly have internal analysts, you don't have a need for sub second responsiveness, you don't have end users hitting the data. And there's always kind of exceptions where we uniquely fit in, just because we are very flexible. But if you're just looking for a cheaper, faster data warehouse, we're not a perfect fit. We tend to kinda be more end user facing
[00:39:30] Unknown:
than most data warehouses are. So that's the number 1, I would say, probably. Ozgun, I don't know if you have others or if that's the big 1. I have 1 other 1, which is, if your dataset is small and if it's going to remain small. Like, a lot of companies obviously grow. But if you're just going to have a gig of data and you're going to stay at a gig of data, you don't need to scale, use Postgres. Like, Postgres is an awesome relational database for datasets that size. If you don't need to scale, if you're not growing, go with Postgres.
[00:40:04] Unknown:
And you've touched a little bit on this, but what are some of the most interesting or impressive uses of Citus that you've seen?
[00:40:11] Unknown:
I think 1 of them is basically this use case customer that Craig mentioned. Heap Analytics runs about a petabyte Citus cluster across many machines. And then what they do is, obviously, all this events data comes into their cluster in real time, because they're capturing all events on mobile devices and over the web, and then they're enabling complex queries, both like funnel queries and behavioral queries, on that data in real time, in seconds, and all of that cluster is powered by Citus. It's a fairly big sharded database, about a petabyte in size. That's 1 of the impressive use cases,
[00:40:51] Unknown:
with Citus. Yeah. And are there any upcoming features that we didn't talk about yet that you'd like to call out that are coming in future releases?
[00:41:00] Unknown:
Huge improvements to SQL for the real time analytics use case, as I mentioned, is 1. We're also looking at open sourcing more extensions for that real time analytics use case. We have a lot of customers who use this HyperLogLog data type. It's an algorithm for count distinct approximation, and we find that our customers also need extensions such as TopN or percentiles, so we're going to be open sourcing a few other extensions that target that use case.
[00:41:28] Unknown:
And we have an exciting year ahead of us, but I don't wanna give too much of the road map looking into the next year. Yeah. I think the extensions are gonna be a really exciting area, especially when you look at distributed systems. Like, things like count(*). It's just slow and hard. And when you look at large scale data, you essentially jump into this world of sketch algorithms, which are really, really close to 100% accurate, but not quite. And they can work well in single node Postgres, but they're extremely powerful when you're able to, like, push things down and parallelize them in distributed systems. So HyperLogLog is, you know, an exciting 1, and it's just fun to say too.
But we've got a few others coming down the pipe that I think, you know, for large scale data, to be able to do things that otherwise were just infeasible, that's 1 area that's pretty exciting.
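A minimal HyperLogLog sketch makes the distributed win concrete: each worker builds a fixed-size register array over its own shard, and the coordinator merges arrays with an element-wise max before estimating, so only kilobytes cross the network instead of raw rows. This is a bare-bones Python illustration of the algorithm's shape, not the hll extension's implementation:

```python
import hashlib
import math

P = 10        # 2**10 = 1024 registers, ~3.25% standard error
M = 1 << P

def _hash(value) -> int:
    # Deterministic 160-bit hash for illustration
    return int(hashlib.sha1(str(value).encode()).hexdigest(), 16)

def add(registers, value):
    h = _hash(value)
    idx = h & (M - 1)          # low P bits pick a register
    rest = h >> P
    rank = 1                   # position of first set bit in the rest
    while rest & 1 == 0 and rank <= 160 - P:
        rank += 1
        rest >>= 1
    registers[idx] = max(registers[idx], rank)

def merge(a, b):
    # Union of two sketches is the element-wise max: each worker can
    # sketch its shard locally and ship only ~1 KB to the coordinator.
    return [max(x, y) for x, y in zip(a, b)]

def estimate(registers):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:   # small-range correction
        return M * math.log(M / zeros)
    return raw
```

Because merge is just max, the count-distinct work pushes down to the shards and parallelizes naturally, which is exactly the property being described.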
[00:42:12] Unknown:
And there is another extension that I came across while researching the show that it looks like you guys have released, which enables you to use Postgres as a column oriented data store so that you can take advantage of some of those efficiencies in querying, which would pit it up against offerings such as Cassandra for that type of data modeling?
[00:42:34] Unknown:
Yeah. Yeah. So we've seen a number of people kind of adopt and leverage cstore for that columnar compression. It's great on the compression. And then for time series data, you know, uniquely performant. We've also seen, with all the Postgres 10 stuff, people start to leverage time partitioning with Citus. So you basically got a sharded setup, and you've got time partitioning underneath, which is great for, you know, time based data as well. So I think there's a lot of exciting things coming together. Cstore is a pretty exciting 1, and we've got some interesting things potentially coming with it. But it's always a question of how can we help you scale and get better performance, which is gonna be a, you know, constant theme for us, I think, going forward. Yeah. I would agree with that. Like, for the product, looking into the next
[00:43:14] Unknown:
year, improving the performance across different use cases is a constant theme. Yeah.
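As a toy illustration of why column oriented layouts like cstore compress time series so well: values stored contiguously per column form long runs, which even naive run-length encoding collapses. (cstore's actual format uses ORC-style compression with several encodings; this sketch only shows the intuition.)

```python
from itertools import groupby

def rle(column):
    """Run-length encode one column as (value, run_length) pairs."""
    return [(v, sum(1 for _ in g)) for v, g in groupby(column)]

# In a row store, 10,000 status strings are scattered across rows; in a
# column store they sit together, and the column collapses to 2 runs.
status_column = ["ok"] * 9998 + ["error"] * 2
encoded = rle(status_column)   # → [("ok", 9998), ("error", 2)]
```

Time-ordered data amplifies the effect, since neighboring rows tend to share values in columns like device ID or status.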
[00:43:19] Unknown:
And are there any other topics which we didn't discuss yet that you think we should cover before we close out the show?
[00:43:24] Unknown:
Yeah. SQL scales. Like, that's what I would say. Yeah. I don't think it's a topic. Well, maybe it's a broad topic all by itself. But, SQL scales.
[00:43:34] Unknown:
Yeah. Actually, I would echo that. Like, it was the 1, and, honestly, it was the 1 valid knock against SQL for the longest time: hey, it's great, I use it, you know, my application framework, whether it's Rails or Django or Hibernate, all work fine with it, but it doesn't scale. And I actually think, you know, we're starting to change that narrative. And SQL, there's no fundamental reason why it can't scale. And it's nice to actually kinda have your cake and be able to eat it too.
[00:44:06] Unknown:
Alright. So for people who want to follow the work that you are up to at Citus and personally, I'll have you each add your preferred contact information to the show notes. And as a final parting question to give this listener something to think about, I would like to get your perspective on what the biggest gap is in the tooling or technology for data management today.
[00:44:28] Unknown:
Yeah. I think, I don't know if there's a huge gap in the tooling necessarily. It's as much that there's such a proliferation of tools that there's a lot of glue and duct tape chaining tools together. A long time ago, it was: you had the relational database, and that was all you needed. And now you have a pipeline of 5 different things that don't necessarily flow well together. And for a data engineer, that makes sense. Where we need to get it to is for that to be much more turnkey for someone that's not a data engineer but an app dev. I think for us, we see a lot of people come to us that don't wanna think about the database. They don't wanna think about a complex data pipeline. They just wanna focus on features. Right? And we need to keep making that, kind of within the industry and data products.
Keep making that easier and easier so that you don't have to have a PhD in distributed systems to understand how things work, so that someone that just wants to write an app and have it scale, and be able to send parts of their data to a downstream pipeline pretty easily, can do that. So I think the key is continuing to simplify more and more, versus building more and more advanced things. It's how do we kind of lower the bar so that more people can actually take advantage of these systems. Yeah. I agree.
[00:45:48] Unknown:
And simplifying
[00:45:49] Unknown:
involves building more of the advanced things. So it looks simple to you, but behind the covers, it's, like, the system does the heavy work. Yeah. As I tell the engineers, like, hey, I can write the, you know, the user interface, like, just fine. Like, the rest is just implementation details. He loves it when I say that. He's like, yes, technically, that is true. But it's a lot of implementation
[00:46:13] Unknown:
details.
[00:46:15] Unknown:
Alright. Well, with that, I would like to thank the both of you again for joining me today to talk about the work you're doing at Citus Data. It's definitely a very interesting tool and platform and 1 that I am interested to see how it evolves in the future. So thank you again. I hope you each enjoy the rest of your days. Thanks so much. Yeah. Great chatting with you. Yeah. Same here. Yeah.
Introduction to the Guests and Their Backgrounds
Overview of Citus and Its Origins
Why Postgres? Choosing the Foundation for Citus
Transition from Fork to Extension and Open Source
Compatibility with Other Postgres Extensions
Comparing Citus with Other Distributed Databases
Citus Architecture: Sharding and Replication
Handling Failures and CAP Theorem
Migrating to Citus: Steps and Considerations
Feature Comparison: Open Source vs. Enterprise vs. Cloud
Operational Considerations: Backups and Resource Planning
Unique Use Cases Enabled by Citus
Citus vs. Data Warehousing Solutions
Challenges in Building and Maintaining Citus
When Not to Use Citus
Impressive Use Cases of Citus
Upcoming Features and Extensions
Final Thoughts and Closing Remarks