Summary
Data persistence is one of the most challenging aspects of computer systems. In the era of the cloud most developers rely on hosted services to manage their databases, but what if you are a cloud service? In this episode Vignesh Ravichandran explains how his team at Cloudflare provides PostgreSQL as a service to their developers for low latency and high uptime services at global scale. This is an interesting and insightful look at pragmatic engineering for reliability and scale.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- Your host is Tobias Macey and today I'm interviewing Vignesh Ravichandran about building an internal database as a service platform at Cloudflare
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing the different database workloads that you have at Cloudflare?
- What are the different methods that you have used for managing database instances?
- What are the requirements and constraints that you had to account for in designing your current system?
- Why Postgres?
- optimizations for Postgres
- simplification from not supporting multiple engines
- limitations in postgres that make multi-tenancy challenging
- scale of operation (data volume, request rate)
- What are the most interesting, innovative, or unexpected ways that you have seen your DBaaS used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on your internal database platform?
- When is an internal database as a service the wrong choice?
- What do you have planned for the future of Postgres hosting at Cloudflare?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Cloudflare
- PostgreSQL
- IP Address Data Type in Postgres
- CockroachDB
- Citus
- Yugabyte
- Stolon
- pg_rewind
- PGBouncer
- HAProxy Presentation
- Etcd
- Patroni
- pg_upgrade
- Edge Computing
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up to date.
With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free. Your host is Tobias Macey, and today I'm interviewing Vignesh Ravichandran about building an internal database as a service platform at Cloudflare. So, Vignesh, can you start by introducing yourself?
[00:01:34] Unknown:
Thanks, Tobias. Thanks for having me. I'm Vignesh, currently working at Cloudflare as an engineering manager building the database platform for Cloudflare. I started at Cloudflare in 2019, so it's been close to 4 years now. I joined the team as a founding engineer and individual contributor, helping build the database as a service platform. Earlier, it used to be a generic SRE team which was managing the Postgres databases. Last year, I transitioned to a leadership role, and I'm currently managing a team of 6 database engineers who are located all over the world.
Our team uses the follow-the-sun model, so I have team members in Europe as well as Asia. For most of my career, I have been fiddling with Postgres. Before Cloudflare, I was with Ticketmaster, which is an online ticketing company running ticketing for the NFL, the NHL, and pretty much any high profile events. That's been my focus, mostly just databases, and recently it's been Postgres.
[00:02:43] Unknown:
And do you remember how you first got started working in data?
[00:02:46] Unknown:
How I first got here? Interesting. It all started probably 10 years ago. My very first job right out of school was as an Oracle DBA, a database administrator. It's a role that used to be common 10 years ago; I'm not sure how relevant it still is, but that's how I got into the database field. It was more of an accident, I would say, because almost all my schoolmates got into application development. They were all interested in Java, .NET, and other frameworks and programming languages, and I kind of stumbled into Oracle work, which was surprising because a lot of DBA roles require prior experience. So that's how I got started.
[00:03:35] Unknown:
And now talking about the work that you're doing at Cloudflare, can you start by just describing some of the different ways that database workloads manifest within the company and some of the types of challenges that you're faced with trying to maintain the health and reliability of those databases?
[00:03:55] Unknown:
Yeah. So Cloudflare's challenges are fairly unique, I would say. That's what makes it interesting and keeps me up at night. The very first is that Cloudflare doesn't run on any other public cloud. Cloudflare is a public cloud to a certain extent, and it has its own infrastructure. We have points of presence all over the world, and then we have core data centers, which are located in a couple of regions or a couple of continents. Postgres runs in the core data centers.
And it's all running on bare metal. The Postgres that runs at Cloudflare is on bare metal. There is no abstraction between the hardware, the server, and the Postgres software, which means no virtualization and no Kubernetes. There is nothing in between Postgres and the metal. So that's the first challenge, I would say. The second one is that there's no sleeping time for the Internet. There's no downtime that you can take when you are working on the Internet. There's always someone up, and there is always someone who's going to notice if your services go down. That's a very interesting challenge, because I didn't appreciate it to this extent before. In all my previous roles and companies, we were able to take downtime: hey, okay, we are going to shut down this weekend, this night. That's not the case here. So that forces us to think, for anything we want to do: we want to keep our services up and running, but we also need to do this.
How do we then go back and architect and engineer our processes? So those are the 2 unique challenges that I have found working in the Cloudflare environment.
[00:05:53] Unknown:
And so given the high uptime requirements and some of the data volume, I'm wondering if you can talk to some of the different setups that you've had to go through at Cloudflare to be able to manage that high uptime and the database throughput, and give a bit of an overview about where you are now. Sure. Yeah. I can give you a glimpse of the scale at which we operate. So right now, Cloudflare
[00:06:16] Unknown:
pretty much gets close to 40,000,000 HTTP requests per second, so in the few seconds we've been talking, that's already more than 100,000,000 requests. That's the scale at which Cloudflare is operating, and that's the data plane. On the control plane, which is where our Postgres databases are located, we basically provide the support for authentication, authorization, and anything related to control plane management, which includes user account management, profile creation, DNS records, and configuration for R2, D1, and all the other products, including Cloudflare Workers.
All these customer-facing products at Cloudflare use a control plane Postgres database, and that's what my team provides and supports.
[00:07:09] Unknown:
In terms of what you're building now, you've developed a database as a service platform within Cloudflare so that application engineers don't have to be well versed in the operational characteristics of the database. And I'm curious if you can talk to some of the requirements and design constraints that you had to account for in the process of designing the desired state, and give a rough overview of where you've landed in the development of that platform. Yeah. Makes sense. So,
[00:07:40] Unknown:
it starts with the SLAs and SLOs, right? That's what pretty much drives how we build our platforms. Our service level agreement for point in time recovery is zero data loss. There shouldn't be any data loss, even in a disaster. So that's the first requirement. The next one is the recovery time objective: how long it takes for our services to recover from a failure, which is currently under a minute. That includes Postgres crashing, Postgres running out of memory, the primary server going down completely, or even a complete data center going down. We have to recover our services, Postgres, within a minute.
These 2 things drive a lot of what we do, and we can talk more in depth on how we got there, but those are the fundamental requirements, and we keep pushing on them. And in terms of the database technology that you're using,
[00:08:56] Unknown:
obviously, you are a cloud provider, so you can't rely on other cloud providers to be able to fulfill this very critical core service. But in terms of the actual database technology, what was the motivation and reasoning behind selecting Postgres versus some of the other options that are out there, even some of the Postgres derivatives such as Citus or the work that they're doing at Cockroach, given the fact that you do have to manage these massive-scale global capabilities, etcetera?
[00:09:24] Unknown:
That's one thing that I'm very happy about. It was one of the decisions the founders made, not our team or the team that is currently at Cloudflare. Since 2009, when the company started, it's been Postgres. From day 1, it's Postgres. I'm so grateful that we are not migrating at some point from some other product or vendor to open source Postgres. It started with Postgres, and I still see some of the initial commits by the cofounders from 2009. To be honest, I don't know exactly why they picked it, but I know some of the ideas behind it. One is that it provides good data type support, like the INET type: IP addresses are first-class citizens in Postgres. Since we deal a lot with IP addresses, that made sense. It also speaks to how solid Postgres already was back then.
It made it apparent that, okay, we can rely on Postgres and build our business on top of it. So those are the two attributes, I think.
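For readers who want to see what that first-class support looks like, here is a minimal, hypothetical sketch (the table and network range are made up for illustration; this is not Cloudflare's schema):

```sql
-- Illustrative only: a hypothetical table showing INET as a native type.
CREATE TABLE blocked_clients (
    client_ip  inet NOT NULL,
    reason     text,
    blocked_at timestamptz NOT NULL DEFAULT now()
);

-- The "is contained within" operator (<<) works directly on IP ranges,
-- so no string parsing is needed to ask network-shaped questions.
SELECT client_ip, reason
FROM blocked_clients
WHERE client_ip << inet '203.0.113.0/24';
```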
[00:10:33] Unknown:
Given the fact that there are these other database engines, particularly things like Yugabyte that have Postgres compatibility, with the work that you're doing to manage the database as a service engine and the fact that you're running on physical infrastructure, what are the reasons against moving to one of these other scale-out Postgres-compatible database flavors?
[00:10:53] Unknown:
Yeah. First of all, credit to everyone who is stretching the limits and boundaries of what Postgres can do. Kudos to them. This topic comes up often: engineers ask us, hey, Vignesh, what's next in the database world? And I tell them: just Postgres. And they are disappointed, like, come on, we were hoping to hear something more exciting, and we are still sticking with plain old Postgres.
So the reason is, one, the autonomy, to be honest. We can change and do whatever we want with Postgres without involving anyone else: any other team, finance, legal, marketing, nobody. Because we just use plain old vanilla Postgres. The autonomy is, I would say, the biggest reason why we use Postgres. And to be honest, not just Postgres. A lot of the other core technologies that we use at Cloudflare follow the same pattern: open source, no restrictive licenses, a good community. Okay, we will pick that, because it provides autonomy.
I can give a couple of specific case studies where that really helped us. For example, it lets us patch Postgres. We don't have to wait for a vendor to come back and give us a validation. We go to the mailing list, talk about the problem, run it by some of the senior members of the Postgres community, and if we get a good thumbs up, we know that we are on the right track. Then we go fix the software, get that validated, and hopefully contribute it upstream. This is by far the single biggest reason why we picked Postgres.
[00:12:46] Unknown:
And so with Postgres as the core building block of your database service, what are some of the ways that that has simplified your work of only having to support 1 engine versus something like Amazon RDS where they have multiple engines that they need to be able to run for different customers, and some of the ways that you've been able to optimize for Postgres in the development of your database service?
[00:13:12] Unknown:
Yeah. The fewer pieces of software, the simpler things are, I guess. We can all agree that developing software is becoming more complex than it used to be; one has to know a lot more than just how to get something up and running. So that's a very good side effect of just sticking to Postgres: going back to basics and asking the fundamental question of what we're trying to achieve and why Postgres would not be a good fit. Most of the time we came back and found that either our requirement wasn't quite right, we thought our requirement was this and we had to do that, and then realized not necessarily, or, even when we did come across some really interesting use cases, Postgres was able to stretch itself and handle them. Those 2 things made it easy for us in terms of managing the number of components that we have.
[00:14:11] Unknown:
Another aspect of what you're building is that you have to be able to support multi-tenancy within the database workload. And I'm curious, what are some of the challenges or complexities that arise as a result of that, and some of the ways that Postgres has either helped or hindered that effort?
[00:14:29] Unknown:
Yeah, sure. That's interesting. To answer this question, I want to take a step back and give an architectural overview and then explain why this is a problem, because some of it is avoidable, and we have kind of inflicted this problem on ourselves. At the top layer of the architecture are 2 core data centers where our Postgres databases are located. Within each region, we have availability zones; so you have your US West and Europe East, and within US West we have 3 availability zones, and there is 1 instance of Postgres running in each of those availability zones. Inside that 1 instance of Postgres, we have multiple databases.
So we have 1 Postgres installation, 1 process running, and there are 20 to 30 databases collocated on that same instance. We can take a step back and ask: do we need to do this? Why can't we just put 1 database on each instance? Why does 1 instance have to host multiple databases? There are some good reasons. If 2 databases have related data, it's much easier to query across them if they both reside on the same instance. That's the biggest reason why someone would want to collocate. The second reason is more about bin packing: we're trying to be more efficient, running only 1 instance but putting more databases in it so that we can pack a lot more onto the hardware. I would counter-argue that you can get pretty far by sticking to 1 database per instance. There is some overhead, because each database now gets its own instance, which means you have to build all the monitoring and backups and everything for, say, thousands of instances: 1,000 databases means 1,000 instances, so you're now managing something like a service-oriented or microservice-based setup. There are pros and cons to both. But at least one of the cons of the bin-packing approach of putting multiple databases on a single instance is performance isolation, or what I'd call multi-tenancy problems, AKA noisy neighbor problems.
They all describe the same idea: one of your tenants unintentionally starts to impact another tenant. And we need to underscore the word unintentionally; I don't think any of the applications here are trying to intentionally sabotage their sister applications or neighboring applications. It's a lot harder to even identify who that noisy neighbor is in the first place. It's like a zoo. One of my colleagues says, hey, this is a zoo, and one of the animals just started acting weird, and the engineers first need to figure out which animal it is.
That takes time. Then controlling them is the next challenge. Let's say, okay, here is this one application, it's just not behaving properly; keeping every one of them a good citizen in a shared environment is a challenging task. And to a certain extent, I'm going to do a hot take here: Postgres is not built for multi-tenancy. Postgres was built in the 1990s, and in those days the architecture assumed you just install one database, and you are not going to put anything else on that machine. But again, it's not a promise that Postgres ever made or anything like that. We're just stretching it, and we are trying to overcome some of the challenges that come with it.
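As an illustration of how one might start hunting for a noisy neighbor on a shared instance, here is a sketch using Postgres's built-in per-database statistics; the actual diagnostics Cloudflare runs aren't described in the episode, so treat this as a generic starting point:

```sql
-- Per-database activity on a shared instance: heavy disk reads or large
-- temp-file spills are common signatures of a noisy tenant.
SELECT datname,
       xact_commit + xact_rollback AS transactions,
       blks_read,                       -- blocks read from disk
       tup_returned + tup_fetched  AS rows_read,
       temp_bytes                       -- bytes spilled to temp files
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1', 'postgres')
ORDER BY blks_read DESC;
```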
[00:18:36] Unknown:
Digging more into the architecture that you've built and some of the driving factors, you spoke to the SLAs, SLOs, and some of the requirements around the scale and throughput of the database, but I'm also interested in the developer experience that you're trying to focus on supporting, some of the ways that you've been building relationships with the teams that are consuming the database engines, how that has driven the ways that you think about what to optimize for and how to build the interfaces, and some of the ways that developers are able to get access
[00:19:08] Unknown:
to new database instances, whether physical or logical. Yeah. This has become such a big topic. It used to be a black box: you create a Jira ticket or whatever service desk ticket and then just wait for a week or 2 until someone comes back to you. That was the norm, or at least it was when I first started, especially at a Fortune 100 company. But this is getting a lot more attention now that developers are really used to the feel of a cloud provider, and they want the same from the in-house services they use. Still, I think this is one of the areas where, to be honest, my team needs to do a bit of work to improve. Currently, anyone who wants a database goes ahead and files a template that we have for the teams, which asks some fundamental questions, like:
Do you want your database to be multi-AZ? For the most part, yes, almost all the time. Do you want your database to be multi-region? And we ask some fundamental questions to understand what kind of workload they are bringing: is it read heavy or write heavy, and what data retention do they want? After they've answered all these questions, one of our engineers evaluates and then identifies the cluster or the instance for this database. Sometimes we figure out, okay, this is such a massive new database that's coming to us that we can't collocate it and we need new, dedicated hardware for it. In other cases we don't, so we deploy it on existing hardware. That's the approach we take. The turnaround depends on whether we need a new instance or we are going to put it on an existing instance. For an existing instance, it's straightforward: we get a PR created by the developers themselves, we merge and apply it, and then wait for our configuration management software to deploy it. So the turnaround is close to 3 to 4 hours.
A new instance is a different kind of beast and depends on hardware availability, etcetera, but I would still say it's less than a day. So there is much work to be done here; we definitely understand that this is not on par with some of the other providers, and we need to improve. When developers create the PR, they get to pick the database name. They choose the users and the different roles for different purposes: a service user which has all the privileges, read-only users, and read-write operational users. They get all of those created when we deploy the databases. They also get the credentials: when the users are created, the credentials go to Vault, so developers don't keep them on their own local machines. From there, they can actually connect to the databases from their local laptops.
They don't have to go through a jump box or a bastion host. We use cloudflared, our own product, which provides a tunnel directly into our core network. That makes it a very seamless experience: people can just open up a terminal, provide the host name, username, and password, and connect directly to the Postgres database. And we actually wrote a blog post about how to set this up.
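To make that role layout concrete, here is a hypothetical sketch of the kind of provisioning SQL the pattern implies; the database and role names are invented, and the real workflow goes through the PR and configuration management process described above:

```sql
-- Hypothetical example database with the three role tiers described:
-- a service role with full privileges, plus read-only and read-write roles.
CREATE DATABASE example_app;
CREATE ROLE example_app_service LOGIN PASSWORD 'stored-in-vault';
CREATE ROLE example_app_ro      LOGIN PASSWORD 'stored-in-vault';
CREATE ROLE example_app_rw      LOGIN PASSWORD 'stored-in-vault';

GRANT ALL PRIVILEGES ON DATABASE example_app TO example_app_service;
GRANT CONNECT        ON DATABASE example_app TO example_app_ro, example_app_rw;

-- Run inside example_app: tables created by the service role become
-- automatically readable (or writable) by the other roles.
GRANT USAGE ON SCHEMA public TO example_app_ro, example_app_rw;
ALTER DEFAULT PRIVILEGES FOR ROLE example_app_service IN SCHEMA public
    GRANT SELECT ON TABLES TO example_app_ro;
ALTER DEFAULT PRIVILEGES FOR ROLE example_app_service IN SCHEMA public
    GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO example_app_rw;
```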
[00:22:41] Unknown:
And in terms of the journey that you've gone on from when you first started to where you are today, I'm curious if you can talk to some of the dead ends or roadblocks that you've run into, some of the ways that you've been able to lean on the overall Postgres community to help with this, whether that's in terms of actually talking to the Postgres developers or bringing in some of the surrounding ecosystem of tooling and some of the places that you've had to do your own custom development to be able to get to this end state where, I guess, it's not an end state, but to the current state of where you are today?
[00:23:16] Unknown:
Yeah, buckle up. This is the most interesting topic, I would say. To be honest, a lot of these developments happened because of the stronger, stricter SLAs and SLOs. They're the ones pushing us. When our organization came and told us, hey Vignesh, we want the recovery time to be 30 seconds, can you do that? I was like, okay, that's a hard challenge. My team has been constantly challenged to deliver that kind of recovery time, to provide that kind of service. So we can't really settle for something mediocre.
Rather, we have to push, and necessity is the mother of invention. We had to go back to the drawing board, and we spent close to 6 months just solving this one problem: how do we bring our recovery time under a minute? I can go deeper on this. For the way Postgres fails over and recovers, we use a piece of software called Stolon, which is open source software written in Go. It has the typical setup: watchdogs or keepalives that keep checking whether all the instances are up or down, and if one is down, they promote the synchronous replica, the standby, to become the new primary database. This is good, but there are some challenges. We run with synchronous replication, which means that at any point in time we need at least 2 healthy instances for our Postgres databases to accept any changes. There are 2 kinds of replication, synchronous and asynchronous.
Synchronous means the database needs to replicate its changes to at least one other instance before giving the acknowledgment back to the client. That obviously gives you protection from data loss, and we do want that; there's no negotiating on whether we can have any data loss. So going back to the idea that we have these 3 database instances in one region and the primary failed: great, we promote the standby database to become the new primary. But at the same time, we need another database to immediately start acting as a synchronous replica. If not, then all the DML, the inserts, updates, and deletes, are going to be blocked. So we use a utility called pg_rewind.
It's a Postgres utility, similar to pg_basebackup or pg_amcheck; it ships with the binaries, it comes with the Postgres installation, and you don't have to do anything extra. The interesting thing about pg_rewind is that it only copies the delta in the data between the source and the destination database. We rely on this to get our recovery time under a minute. But there was a caveat, a gotcha, in pg_rewind. It's supposed to only copy the delta in the data, but it also copies all the write-ahead log segments in the data directory. Write-ahead log (WAL) segments are basically an append-only log where all the database changes first get recorded before the data is actually written to the data files themselves. It's there for durability, and it's pretty much how most RDBMSs work.
Now, back to the challenge of copying all the WAL segments. Imagine you are keeping something like a terabyte worth of WAL. That WAL is really useful if people want to do point-in-time recovery, because since it's a time-ordered log, you can restore your database and then keep applying these changes so that you can go to pretty much any point in time in the past. Because of this, we keep more WAL around. Not only that, the WAL is also used for binary replication: for the other region across the pond, we ship these WAL segments. So having 1 terabyte worth of WAL meant that pg_rewind, instead of just copying the delta, was also copying this 1 TB of data. When I ran pg_rewind the first time, it took about an hour, and I thought, wait, there can't be an hour's worth of delta in the data here, it's minuscule. And then I found that pg_rewind had spent all its time copying WAL.
So we understood the problem, and I reached out to Heikki, who is a long-time contributor to Postgres and the author of pg_rewind, and sent an email that day. That's why I reached out to him, asking, hey, this is how it is working; is it really supposed to be this way, or am I misunderstanding it? And he acknowledged, okay, I know there is an improvement to be made here, we can do this. So we went back to the Postgres source code, spent 2 weeks, patched the change, and sent it upstream. It's still upstream and not yet merged; getting it merged for the next version is one of my checklist items, because we have had that patch running in our environment for close to a year now.
And we have seen it working like a charm every time. Every time a server goes down now, all of our failovers happen and our services are back within a minute. And this is possible because of pg_rewind itself and then the patch we made on top of it.
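For context on the synchronous replication behavior described here, a minimal sketch of the relevant Postgres settings and a check of standby state; the standby names are hypothetical, and in the setup described the actual topology is managed by Stolon rather than set by hand:

```sql
-- Require acknowledgment from at least one of the named standbys before
-- a commit is reported back to the client (standby names are hypothetical).
ALTER SYSTEM SET synchronous_commit = 'on';
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (standby_az1, standby_az2)';
SELECT pg_reload_conf();

-- On the primary: which standby is currently synchronous, and how far behind.
SELECT application_name, state, sync_state, replay_lag
FROM pg_stat_replication;
```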
[00:29:16] Unknown:
And I know from looking at one of the presentations you gave on the work that you're doing that you're also relying on load balancing and connection pooling to handle the scale-out aspect. I'm wondering if you can talk to some of the scaling considerations that you've had to deal with as far as multi-region, failover, and managing the latencies of replication, and some of the work that you've had to do in that load balancing tier to customize it to your use case.
[00:29:49] Unknown:
Yeah, definitely. That's another interesting one. This is where it goes back to: we are all hearing about the new distributed databases that seem to solve these Internet-scale or web-scale challenges. Should we use them, or should we use Postgres? As I said earlier, we were able to stretch quite far just using plain old Postgres. The first way we scaled out is by having multiple regions, so that if a request, especially a read-only request, comes from a different region, we can serve it from the local region. Again, we have only 1 primary database, so anything that needs to write still has to go across the ocean, or wherever, to that particular primary. Having a multi-primary database is on my list for some day.
But the support for that right now makes it much more challenging to do in Postgres. And, to be honest, it's also not a requirement for most of the applications; they don't really need a multi-primary setup. So scaling out region by region is very good to begin with. Then within the region itself we have to scale out, because you can only get so big a box; at some point you are going to hit the ceiling. Another interesting thing at Cloudflare is that we also don't run on any specialized hardware. This is not hardware custom built for us; it's hardware you can buy yourself by going to the vendor's website, specifying the spec you want, getting the rack, and spinning it up on your own.
Building on commodity hardware means that we need to scale out. The first and most straightforward way is by adding more read replicas, because 80% of our workload is, in general, read heavy; only 20% of the workload is actually writes. So we can scale out by adding more replicas even within a single region, which is great. Now this is where the load balancing comes in: we have, say, 5 replicas and 1 primary database. How do you make sure your connections are spread out equally across all 5 of those databases?
Well, we use HAProxy, and that's one of the talks I gave at the HAProxy conference in Paris last year, on how we use HAProxy to load balance the read-only traffic. Then it gets more interesting. Now you've got a primary database and 5 replicas, and there is obviously going to be some lag, because we're transferring bits and bytes over the network, which means there is replication lag. Do you want to serve stale data to your clients, or do you want to serve fresh data?
So it again comes back to the fundamental question: what's the SLA? If someone signs up and we immediately send their read to a replica where the data is not there yet, and we tell them their account was not created, that's not a good user experience. This is where the read-after-write problem comes in. In those scenarios, we have to stick to the primary database, where we know 100% that the data is there. But we did some smart things, because we have to abstract all of this away from the end engineers and developers. This is where the developer experience comes in: they don't want to be thinking about things like read-after-write, should I read from the primary here, or should I just read from replicas? We added health checks: if your replica is unhealthy, don't send connections to that replica, because it's obviously going to fail. The other thing that silently goes wrong is the replication lag. If your replica is lagging by, say, a minute, meaning the last minute's worth of data is not yet there on the replica, then we take it out of the pool automatically.
So whenever a client connects to that read-only endpoint, they are always guaranteed to find real-time data. That's how and where we use HAProxy.
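As a rough sketch of the kind of replication-lag check that could feed a load balancer's health probe (the episode doesn't give the exact query or threshold, so this is only illustrative):

```sql
-- Run on each replica: approximate seconds of replication lag.
-- A health-check agent can eject the replica from the read pool when this
-- exceeds a threshold. (If the primary is idle, this can overstate the lag.)
SELECT CASE
         WHEN NOT pg_is_in_recovery() THEN 0   -- this node is the primary
         ELSE coalesce(
                extract(epoch FROM now() - pg_last_xact_replay_timestamp()),
                0)
       END AS replication_lag_seconds;
```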
[00:34:12] Unknown:
And in your work of building this platform, what are some of the most complex engineering and architectural challenges that you've had to address, and some of the ways that the organizational structure has either helped or hindered that effort?
[00:34:30] Unknown:
I would say it's helped a lot. I couldn't even think of one case where the organization has hindered us from doing something. As for architectural challenges, I would say that, in general, the nature of managing and operating databases is hard, and not just because our team does it. It's under the radar for the most part, because it's way down the stack, so no one really sees it, and it doesn't get the kind of visibility that the rest of the stack gets. But we have so many examples in the recent past where you can see people trying to do what a cloud provider does and roll it on their own, and realizing, wow, this is a lot more challenging than it looks, just maintaining it. Nothing fancy.
Just providing a production-ready, operational database environment and keeping the SLAs and SLOs: if you do that, my kudos to you, because I have empathy for how hard it is just to do this. With respect to the challenges, nothing is unique about them; it's a distributed systems problem. We have to think about how our systems are going to behave if they can't talk to, say, the metadata system. You have health checks running against each database instance, trying to find out what's going on with the instance itself, but they also maintain their state in a metadata store, which in our case is etcd. A lot of times the question is: what do they do if they can't even communicate with etcd? Are they going to make some weird decisions, or are they just going to wait? Then interesting things come up. Networking is always a fun thing to account for, all the different ways it can fail, and we always need to look at how it's going to fail and how the systems are going to behave. Also, do we want a soft failure or a hard failure? A failure that is just "oops, it's cut, that's it" is sometimes actually much better than a soft failure where your systems are semi-healthy and start behaving really weirdly. All of these are fun challenges, I would say. In terms of how the organization really helped us, it's embracing the idea that failures are common, that we need more failures: we have to go through more failures to really find success. We aren't scared of touching things. For example, one of our requirements is that we want to reboot all of our database servers once a quarter.
The status quo used to be: don't do that; if it's running, don't touch it. An organization pushing the teams to go and challenge that is really beneficial. We even wrote custom software called PGMonkey. It's like Chaos Monkey, but for Postgres. What it does is go ahead and restart services in a random order. It restarts the load balancing software we were talking about a while ago, it does a failover, and it goes and, say, reboots a machine; we still need to do that. So many of these kinds of chaos events happen sporadically anyway, but we're expediting them using our chaos tool. It's not yet open source; currently it's too tightly coupled with our setup, so it's not ready for open source. But the idea is here, I'm happy to talk more about it, and hopefully people can implement their own version to fit their own systems.
These things let us think about failures, learn from them, and then go back to the drawing board and fix them. Not only that, within the organization we also do disaster recovery practice: don't wait for a disaster, simulate your disaster recovery and make sure your SOPs are ready and your systems behave the way you said they would. We practice those. So overall, thanks to the organization for helping us build a solid platform, and also for supporting us when things fail.
[00:38:55] Unknown:
This episode is brought to you by DataFold, a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. DataFold leverages data diffing to compare production and development environments and column level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. DataFold integrates with DBT, the modern data stack, and seamlessly plugs in your DataCI for team wide and automated testing. If you are migrating to a modern data stack, DataFold can also help you automate data and code validation to speed up the migration.
Learn more about DataFold by visiting dataengineeringpodcast.com/datafold today. In order to be able to run these database systems at the scales that you're talking about, there are a number of additional layers that are required beyond just that core database engine. I'm wondering what your sense has been, from going through this exercise, of ways that the database engines themselves can or should be evolved to remove the necessity of those extra layers, or some of the lessons that you've learned from going through this exercise that can also be translated into environments where teams are using a database as a service such as RDS or one of the other providers, and that is still valuable despite the fact that they're not dealing with the physical hardware underneath?
[00:40:20] Unknown:
Yeah, it's an interesting question. So: what things could be part of core Postgres itself so that managing and operating it is easier, if I can summarize the question? Postgres is really unopinionated on this topic. It doesn't really give you a guideline or tell you in the documentation, here is your backup tool, use it, or here is the failover tool you should use. Or even connection poolers: one of the things Postgres is notorious for is creating a new process for each connection, and that doesn't scale, so you have to use a connection pooler. There are something like 6 different connection poolers on the market right now. Which one do you use?
There isn't an answer, right? And that's fine; I understand, because these things evolve, and the community strongly recommending one thing is not the best approach. Putting out the pros and cons of each one and letting the users decide is, to be honest, the way to go. But it always creates more confusion for newcomers. Even for the old-timers, we go through this: hey, Patroni is an amazing project, why are you folks still sticking with Stolon? And I'm like, yeah, sure, it's definitely good, but we have shaped Stolon into what we want. Moving to another platform now is not an easy task; that's a year or two worth of commitment, and it's going to come with its own challenges that we don't even know about yet. So I would rather spend my time and energy on the existing platform, the one that we know, or at least think we know, and learn more about it rather than using something new as it comes up. That's going to remain a challenge; I don't think there's any easy solution for it. In terms of other things that could be part of the core, there's the performance and multi-tenancy challenge that we have been chatting about. Any kind of observability that could be added would be a really good value-add, because as more and more systems are cloud based, more and more people are bin packing.
So Postgres could provide the kinds of insights that make that easy to enable. Other than that, I couldn't think of anything else Postgres could roll in on its own to make things easier. Backups, we still have to do them. Multi-primary is probably one thing that would be interesting. There are some improvements happening: today, in Postgres 16, one can use logical replication and basically set up 2 primary databases and replicate the data between them. So it's coming. It has its own challenges, but I wouldn't say it's a Postgres challenge; it's the fundamental challenge of which data you should consider the source of truth when you have multiple systems. You, the owner of the data, should know that best, and you have to build your systems accordingly. I don't think Postgres is going to magically come in and solve that.
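To illustrate the logical-replication direction he mentions, here is a minimal sketch of a publication and subscription between two writable instances; host and table names are hypothetical, and a real active-active setup needs more care (conflict handling, sequences, and so on):

```sql
-- On instance A: publish changes to a table.
CREATE PUBLICATION app_pub FOR TABLE accounts;

-- On instance B: subscribe to instance A. In Postgres 16, origin = none
-- tells the subscription to skip changes that were themselves replicated,
-- which helps when replicating in both directions.
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=instance-a dbname=app user=replicator'
    PUBLICATION app_pub
    WITH (origin = none);
```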
[00:43:27] Unknown:
And talking about the different versions of Postgres also brings up the question of version upgrades, some of the ways that you're approaching that, and how you manage that complexity in this 100% uptime environment that you have to support.
[00:43:45] Unknown:
Thanks for bringing that up. Yeah, I missed it in the previous one. That's an area where Postgres could definitely do a better job; MySQL has had a way to upgrade your databases. You have the primary, a bunch of replicas, and then further replicas below them, and you can upgrade from the bottom up: you upgrade your replicas, keep going, and then finally just do a failover, and voila, your entire system is upgraded. Even today, Postgres doesn't support that, especially if you are running streaming replication.
There are maybe some conversations around how to do this using logical replication, but logical replication itself is not the best solution if you are really thinking about scale. It's good, but not yet ready for prime time at the Internet scale we're talking about. So pg_upgrade, Postgres upgrades, are definitely an area where we need improvements. Postgres is also releasing a new version every year now; the life cycle has changed, and 17 is going to come next year, which means people are going to be upgrading more and more often, and we need some tools. There are a lot of threads I can also link afterwards where the community has started thinking about how we can improve this. Even the cloud providers take a hard downtime of a minute or two to do this upgrade. How do we get to an approach where you can do your Postgres upgrade with zero downtime?
Nothing is out there so far. Our upgrades right now take close to a minute or two, and we are also going to share in the next month or so how we are doing Postgres upgrades. Spoiler alert: we have tweaked the rsync process, making the rsync smarter. If you just do a dumb rsync, it would take hours and hours, because these are 10 terabyte databases. We tweaked rsync in such a way that we are able to rsync the databases in under 2 minutes. So a blog post will be out, and hopefully we can also share the entire runbook on how we are doing the Postgres upgrades.
[00:46:07] Unknown:
In your experience of building this system, supporting Cloudflare to be able to manage their different database workloads and the developers who are building on top of this, what are some of the most interesting or innovative or unexpected ways that you have seen the business using these new capabilities or some of the interesting or unexpected ways that you've seen your team address the challenges that you're faced with?
[00:46:30] Unknown:
Sure. Yeah. Multi-region is one, to be honest, because we started multi-region for disaster recovery purposes. It wasn't meant to provide this new paradigm of edge computing, where you put something in multiple places and now you have what you'd call edge computing. That wasn't the intention when we started 4 or 5 years ago; we needed multi-region because a disaster can take out a complete region. We put it in place, and then people started thinking, well, we can actually make our applications much faster, we can reduce latency, if we just start using those regions. I like how applications started utilizing some of the resources we have. We also started using Postgres as a queue. Earlier it used to be dedicated queueing software, Sidekiq and so on; now a Postgres table itself can act almost like a queuing system. We also use Postgres for BI, business intelligence, purposes: we have a dedicated set of Postgres instances where we give access to the analysts and data analysts.
They go and run their weekly, monthly, or quarterly queries. We don't have a separate data warehouse system; Postgres is our warehouse system too. So, yeah, these are some of the ways I found interesting.
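For anyone curious what the table-as-queue pattern looks like in plain Postgres, here is a common sketch using SKIP LOCKED; this is a generic illustration, not Cloudflare's implementation:

```sql
CREATE TABLE job_queue (
    id          bigserial PRIMARY KEY,
    payload     jsonb NOT NULL,
    enqueued_at timestamptz NOT NULL DEFAULT now()
);

-- Each worker claims one job at a time; SKIP LOCKED lets many workers poll
-- concurrently without blocking on rows another worker already claimed.
WITH next_job AS (
    SELECT id
    FROM job_queue
    ORDER BY enqueued_at
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
DELETE FROM job_queue
WHERE id IN (SELECT id FROM next_job)
RETURNING id, payload;
```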
[00:48:00] Unknown:
And in your work of building this system, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:48:09] Unknown:
The interesting lesson is: don't give up, and don't settle for average. If you have a strong requirement, keep challenging yourself. That's how we were able to achieve the reduction in the recovery time objective, basically by looking at the Postgres source code. Another example of what we're trying to do is reduce our backup restoration time. Taking backups is one thing, but how fast you can restore when there is a real need is altogether a different thing. So we're constantly pushing ourselves: how do we reduce the recovery time even from a failure like, say, somebody dropped a table?
Now you have to go into your production system and get the data back as soon as possible. Or someone ran a delete statement without a where clause, so now the data is gone in all your transactional systems and you have to somehow restore and recover. There are so many areas to improve, and we are trying to improve. But I think the point is that we don't have to settle for average. Keep your options open, and chat with people when you don't know something; there's no shame in that. I used to feel like, oh, can I really ask them? No problem. The anecdote I shared opened my eyes: if you do the homework and then go to the people who have worked on something, most of the time they are more than happy to help you with whatever you need. Those are my key takeaways.
[00:49:43] Unknown:
And so for people who are considering whether or how to invest in the ways that they manage databases for their engineers and their organization, what are some of the cases where building a platform, even at a fraction of the scale that you're dealing with, is the wrong choice?
[00:50:02] Unknown:
Oh, when is it the wrong choice? I'm a supporter of rolling things on your own, so I'm biased. I even have a pet project called Spinup, which is an open source version of RDS. I have always been an advocate of telling people that you don't have to use a managed service unless you really have a good reason to. So I'll answer it the other way around: roll your own solution unless there is a specific reason not to. That's the default I would like to see. One reason is the autonomy we talked about: you can patch your own software. Don't treat your software as a black box. At the end of the day it's code, and even if you can't write the code, you have team members who can help you. That's the other fear I have seen: I'm not a day-to-day software developer, how can I go and fix this? That's fine. Half of the problem is actually understanding it and explaining to someone what the problem is. The actual fix itself is probably 50 or 100 lines of code, and you can rely on a C programmer or a Go programmer to come and help you with that. But can you get to that point? Can you find the interesting problem in the first place? If you can do that, people are ready to help you.
So go with the autonomous option, which is what I really prefer. It also gives you freedom in terms of adding extensions, for example. Postgres has this rich ecosystem of extensions, and if you are using another provider, you are at their mercy to support them: when am I going to get this new extension that is really cool, or when is this extension going to be upgraded? You don't have an option; you're waiting on them. The third thing, and this is more my engineer hat, is that I like systems where I'm not handed layer upon layer of abstraction and given only a minimized version where I can do one thing. There are trade-offs, obviously. I'm not saying that everyone should go and do it, but at least spend the time to learn those layers; it really helps you become a better engineer.
So that's overall how we think about rolling things on our own compared to using another provider. Another thing I've noticed recently is that people are more and more excited about teams that actually build things from fundamentals, from first principles, rather than just using a cloud provider. There is this enthusiasm of, okay, we can do this, let's do it on our own. And there is always the open source community behind you. Even if you are struggling with something complex that a cloud provider offers, you can start an initiative and people are ready to chime in and provide support.
[00:53:02] Unknown:
And so as you continue to build and evolve and iterate on the platform...
[00:53:14] Unknown:
Developer experience. Yeah, developer experience. There is more we can do in this area, and to be honest it's still a fairly young, evolving area. It's only recently that companies have come out just to improve the developer experience: what are we doing here, how are we differentiating on developer experience? That wasn't the case earlier. You had to do something fundamental: store data, move bits and bytes, which is networking, or secure something, which is where all the security companies come in. Then in the last four or five years there are companies where you ask, what do you do? Well, we make it nicer for developers to build on top of us. They could do it on their own, but that's our go-to-market. This whole field is relatively new, and that's one of our gaps: we haven't spent that much time on improving the way developers connect and communicate with databases.
Give them enough knobs that it's still flexible, but abstract it so that they don't have to deal with every single minute detail of how do I do this. That's all abstracted away, so they don't have to worry about those things. That's one fundamental area where I could line up two or three projects over the next three to five quarters just to improve developer experience. pg_upgrade is super critical; we talked about it. Another one is the OS upgrade. Postgres relies on glibc, the C library, for how it sorts and collates. Every time you do a major version upgrade of the operating system, whether it's Debian or anything else, your indexes can become corrupt, because the way data sorts between different glibc versions changes. You can end up with duplicated data; your unique indexes can hold duplicates (an illustrative check follows this answer).
Bad problems. Not the kind of problem you want in your database. And, to be honest, that's another burning area; there are so many threads on the Postgres hackers mailing list about how to address it. Again, a fundamental challenge, but we need a solution, because these things are not going to get better on their own in ten years. Can we say that, magically, Linux or some other innovation will happen and we won't have this problem anymore? Not really. Will pg_upgrade get better on its own in ten years? Not really. So we have to go back and do these fundamental things: provide a good roadmap for how we're going to do the Postgres upgrades and how we're going to do the OS upgrades. My team specifically is working on a kind of SLA for those upgrades to happen on a more periodic basis. Because these are internal systems, people have much more negotiating power than they would with an external vendor, who would just pick a date and do it. We understand the nature of our business and are more accommodating.
That leaves us playing a bit of catch-up in terms of upgrades and maintenance, but automating part of it is going to make it so much easier: we have the systems ready to go, we have a template, let's talk through the schedule and then do it. So those are the two areas I'm currently excited about and looking to improve.
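A hedged illustration of the glibc collation issue described above, using only stock PostgreSQL catalog queries rather than anything Cloudflare-specific. The first query spots collations whose recorded version no longer matches what the OS library now reports; the blunt remedy is to reindex. The database name is a placeholder.

```sql
-- List libc collations whose version recorded at build time no longer
-- matches the version the operating system currently provides (PostgreSQL 10+).
SELECT collname,
       collversion                       AS recorded_version,
       pg_collation_actual_version(oid)  AS current_os_version
FROM pg_collation
WHERE collprovider = 'c'
  AND collversion IS DISTINCT FROM pg_collation_actual_version(oid);

-- After an OS/glibc upgrade changes sort order, rebuild affected indexes and
-- record the new collation version so the mismatch warnings go away.
REINDEX DATABASE mydb;                         -- 'mydb' is a placeholder
ALTER DATABASE mydb REFRESH COLLATION VERSION; -- PostgreSQL 15+
```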
[00:56:34] Unknown:
Are there any other aspects of the work that you're doing at Cloudflare, or the overall space of running Postgres as a service, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:56:46] Unknown:
Yeah. Some interesting aspects are still emerging, and we are trying to understand how Postgres is going to evolve. One of them is the Postgres extension ecosystem; it's just blowing up right now. For example, you can run machine learning workloads on Postgres, so Postgres is becoming a platform. That's one aspect I find interesting. The serverless aspect is obviously gaining more and more adoption too, along with this idea of pushing more kinds of workloads into Postgres. So Postgres is stretching further and further out; look out for that.
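One concrete example of that extension ecosystem, not named in the conversation but widely used for machine-learning-adjacent workloads, is pgvector. The schema below is purely illustrative and assumes the extension is installed on the server.

```sql
-- Enable the pgvector extension (must be available on the server).
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    body      text,
    embedding vector(3)   -- tiny dimension purely for illustration
);

INSERT INTO documents (body, embedding)
VALUES ('hello world', '[0.1, 0.2, 0.3]');

-- Nearest-neighbour search by Euclidean distance.
SELECT id, body
FROM documents
ORDER BY embedding <-> '[0.1, 0.25, 0.3]'
LIMIT 5;
```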
[00:57:26] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:46] Unknown:
The biggest gap in tooling for data management? That's interesting. I think it still comes down to the fundamental challenges. Taking backups and restoring them is hard. Tools have come out, but none of them have made it a breeze. How do you provide near-instant point-in-time recovery? You drop a table; no problem, within a minute you can go back to just before it happened. Can any tool today do that? As far as I know, no. So data management still comes back to these fundamental challenges of backup and recovery. Another one I would point to is PII masking: how do we provide masking so that you can still have a single source of truth, but different people get to see or view it differently (a minimal sketch follows this answer)?
The third one is how we use the same data for different purposes. For example, we have an OLTP workload; that's what Postgres is doing. But then a requirement comes in: now we want to put Looker or some other BI tool on top of it. So the question is, do we stand up a new system just to support this, or do we keep using the existing one? There are pros and cons to both; we mostly pick one and stick with it. But in general these problems are not yet solved. It's hard. If you fork or copy your data, now you have two copies: how do you keep them in sync, how do you manage them, how do you set up backup, recovery, and high availability for that second system?
Okay, so don't do that, keep it in one place. Well, now a single query from a dashboard can bring down your entire OLTP system, or you have to work around that. It's a hard problem overall. So that keeps me up and excited about this space, and I look forward to learning from everyone each day.
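A minimal sketch of the PII-masking idea raised above, using nothing beyond vanilla Postgres views and grants; dedicated tools such as the postgresql_anonymizer extension go much further. The table, view, and role names are all invented.

```sql
-- The raw table holds the real data; analysts never query it directly.
CREATE TABLE customers (
    id    bigserial PRIMARY KEY,
    email text NOT NULL,
    name  text NOT NULL
);

-- The masked view exposes the same rows with PII obfuscated.
CREATE VIEW customers_masked AS
SELECT id,
       regexp_replace(email, '^.*@', '***@') AS email,
       left(name, 1) || '***'                AS name
FROM customers;

-- Analysts get the view, not the table: one source of truth, seen
-- differently by different people. 'analyst_role' is a placeholder
-- and must already exist.
REVOKE ALL ON customers FROM analyst_role;
GRANT SELECT ON customers_masked TO analyst_role;
```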
[00:59:54] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Cloudflare to support their database workloads. It's definitely a very interesting problem and an interesting approach that you've taken, so I appreciate the time and energy you've put into the work you're doing there and into sharing it with us.
[01:00:11] Unknown:
Thank you again for that, and I hope you enjoy the rest of your day. Yeah, thanks, Tobias. The community is a really, really huge part of this, especially in open source; these projects need the community and rely on the community. So I'm happy to help any time. Thank you.
[01:00:28] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Guest Introduction: Vignesh Ravichandran
Challenges of Database Management at Cloudflare
Database as a Service Platform
Choosing Postgres for Cloudflare
Multi-Tenancy and Performance Isolation
Developer Experience and Database Provisioning
Overcoming Technical Challenges
Scaling and Load Balancing
Complex Engineering and Organizational Support
Database Engines and Additional Layers
Innovative Uses and Lessons Learned
When to Build Your Own Platform
Future Improvements and Exciting Areas
Postgres Extensions and Final Thoughts