Summary
With the increased ease of gaining access to servers in data centers across the world has come the need for globally distributed data storage. In the first wave of cloud-era databases, the ability to replicate information geographically came at the expense of transactions and familiar query languages. To address these shortcomings, the engineers at Cockroach Labs have built CockroachDB, a globally distributed SQL database with full ACID semantics. In this episode Peter Mattis, the co-founder and VP of Engineering at Cockroach Labs, describes the architecture that underlies the database, the challenges they have faced along the way, and the ways that you can use it in your own environments today.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Your host is Tobias Macey and today I’m interviewing Peter Mattis about CockroachDB, the SQL database for global cloud services
Interview
- Introduction
- How did you get involved in the area of data management?
- What was the motivation for creating CockroachDB and building a business around it?
- Can you describe the architecture of CockroachDB and how it supports distributed ACID transactions?
- What are some of the tradeoffs that are necessary to allow for georeplicated data with distributed transactions?
- What are some of the problems that you have had to work around in the Raft protocol to provide reliable operation of the clustering mechanism?
- Go is an unconventional language for building a database. What are the pros and cons of that choice?
- What are some of the common points of confusion that users of CockroachDB have when operating or interacting with it?
- What are the edge cases and failure modes that users should be aware of?
- I know that your SQL syntax is PostgreSQL compatible, so is it possible to use existing ORMs unmodified with CockroachDB?
- What are some examples of extensions that are specific to CockroachDB?
- What are some of the most interesting uses of CockroachDB that you have seen?
- When is CockroachDB the wrong choice?
- What do you have planned for the future of CockroachDB?
Contact Info
- Peter
- petermattis on GitHub
- @petermattis on Twitter
- Cockroach Labs
- @CockroachDB on Twitter
- Website
- cockroachdb on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- CockroachDB
- Cockroach Labs
- SQL
- Google Bigtable
- Spanner
- NoSQL
- RDBMS (Relational Database Management System)
- “Big Iron” (colloquial term for mainframe computers)
- Raft Consensus Algorithm
- Consensus
- MVCC (Multiversion Concurrency Control)
- Isolation
- Etcd
- GDPR
- Golang
- C++
- Garbage Collection
- Metaprogramming
- Rust
- Static Linking
- Docker
- Kubernetes
- CAP Theorem
- PostgreSQL
- ORM (Object Relational Mapping)
- Information Schema
- PG Catalog
- Interleaved Tables
- Vertica
- Spark
- Change Data Capture
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so you should check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute, and go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey, and today I'm interviewing Peter Mattis about CockroachDB, the SQL database for global cloud services. So, Peter, could you start by introducing yourself? Hi, Tobias. My name is Peter Mattis.
[00:00:56] Unknown:
I am a cofounder of Cockroach Labs, which is the company behind CockroachDB. I'm also the VP of engineering there. I've been in the technology space for 20 years, half my lifetime now, and I've been involved in data storage for most of that time. And do you remember how you first got involved in the area of data management? Well, my first involvement was my job right out of college. I joined a company called Inktomi, which was making a caching HTTP proxy. But later on, it involved much more serious data management when I joined Google in 2002. At the time, Google was working on a prototype of Gmail. I was asked when I joined if I had any interest in email and in working on this prototype. And I was like, yeah, that sounds pretty cool.
And very shortly thereafter, I was put in charge of designing the real back end for the search and storage infrastructure for Gmail, which was quite an undertaking. As you know, Gmail's taken over the world of email since then. My work on Gmail eventually led to working on Google's second-generation distributed file system, and that in turn led to some work on Bigtable. Kind of minor amounts of work on Bigtable, but it was generally expanding my interest and knowledge of data management and data storage systems.
[00:02:09] Unknown:
I didn't realize you'd been involved in so many of those projects. It's definitely a lot of challenges there in terms of being able to manage data at scale. So I'm sure it prepared you well for the work you're doing with Cockroach. So given that, can you describe a bit about what CockroachDB is and the motivation for creating the database and building a business around it? Yes.
[00:02:29] Unknown:
So CockroachDB is a distributed SQL database. You know, we talk about it being designed as a cloud database. But, generally speaking, when we say distributed database, it runs on multiple nodes, it's horizontally scalable, and it is real SQL. It's not like these NoSQL systems that were the fad a couple years ago and seem to be tailing off a little bit now. This is full SQL semantics, you know, transactions, indexes, all of that. What was our motivation behind building this? We saw some of the genesis of NoSQL. My cofounders and I were all at Google at the same time. We saw the genesis of NoSQL inside Google, where they were worried about web scale and about building these systems. They saw that traditional SQL databases weren't scaling to their architectures, and there was this kind of adoption of NoSQL inside Google. And then very shortly after that, they saw how much of a burden it put on application developers to give up transactions, to give up secondary indexes.
So Google internally started moving away from this quite early on. They started building this system called Spanner. Spanner provides a SQL interface. It provides transactions, and yet it's horizontally scalable, very much in the same vein as CockroachDB. My cofounder, Spencer, eventually left Google and looked at the state of the open source distributed systems, and we saw that everybody was still full speed ahead at that time on NoSQL systems. And we saw this disconnect between where Google had been, with all that burden that was being placed on application developers, and what was being developed outside in the rest of the industry. And we saw an opportunity there. We saw this pain point of people trying to scale SQL systems, and the pain point there is that, you know, you can't traditionally just add more nodes. You have to have the application worry about sharding the database and splitting the data across multiple nodes, and that places a huge burden on application developers. Or you have to give up all the benefits of SQL and go with these systems that don't provide transactions, or provide transactions in some very limited form. So one of my cofounders actually started developing Cockroach as an open source side project, and it got enough steam that some VCs were like, hey, you should really do this for real. And we were all very interested in doing it for real. We knew that if we had a company behind it,
[00:04:52] Unknown:
we could actually spend much more of our life working on this rather than just doing this as a side project on nights and weekends. Yeah. It's definitely a product that has a lot of legs in terms of being able to provide the large scale that people are demanding these days, as it becomes easier to build a new application and actually have the infrastructure to support a globally or, you know, even largely distributed system. That's in contrast to when RDBMSes were initially created, which was largely the big iron era, where you had physical servers that would run the databases, and so in order to scale up you had to rack a new instance. So Yep. Definitely interesting to see the trend of people moving away from NoSQL and back to the SQL interface because of how well it expresses a lot of the paradigms for interacting with data, being able to store and retrieve it in manners that make sense, versus the document-oriented or key-value-oriented systems where you need to do all these back flips in the application code to ensure proper integrity of the data rather than relying on the data layer to do it for you.
[00:05:54] Unknown:
And one thing we've been hearing loud and clear from a lot of the customers and users: I mean, they care about scale. They worry about scale, and that naturally attracts them to NoSQL systems that can provide the horizontal scalability. But then they also care about how much burden it is on themselves as developers. We've heard instances of, you know, users switching to Cockroach and being able to delete large portions of code because they don't have to do those backflips in the application layer anymore.
[00:06:21] Unknown:
The original generation of SQL oriented databases have typically had difficulties in scaling horizontally because of the architecture. So wondering if you can describe a bit about the way that you've approached building CockroachDB to be able to support that distribution and the distributed ACID transactions that make it such a standout
[00:06:40] Unknown:
among the available databases? Yeah. Yeah. So it's kind of a doozy of a question, because it gets to the heart of the CockroachDB architecture. It's a little bit challenging to concisely describe this, but I'm gonna do my best. So at the lowest level, Cockroach maintains a key value abstraction, a key value storage layer. And when we talk about this key value layer, Cockroach provides a monolithic key space. This monolithic key space is logically divided into 64 megabyte ranges of data. Each of these ranges of data is replicated.
So for a given 64 megabyte range, we actually use a consensus protocol called Raft to replicate the data for that range to 3 or more nodes in the system. Raft is, like I mentioned, a consensus protocol. I don't wanna get into all the details of what consensus protocols provide and their challenges. That's a whole episode on its own right there. One of the interesting things that Raft provides, and all consensus protocols provide, is atomicity of writes. So if you're writing data to a given range, to a single range, that write is atomic.
Cockroach builds upon this atomic primitive provided by Raft to build general purpose transactions, and there's a variety of additional mechanisms that come into play there. So in order to provide atomicity for general purpose transactions spanning multiple ranges, each transaction has an associated transaction record that's stored on a single range, and we're able to flip the transaction from in progress to either aborted or committed with a single write. That's kind of the very high level, hand wavy description of how we bootstrap atomicity for these transactions. Another part of ACID guarantees is isolation.
And Cockroach has a variant of locks, which we call intents. All the data inside this key value storage layer is versioned, similar to MVCC, multiversion concurrency control. And this gives each transaction a stable view of the database while other transactions are in progress. In addition to this MVCC layer, Cockroach also keeps track of all the data that's read, and you need to keep track of the data that's being read in addition to the data that's being written in order to provide serializable isolation.
Serializable isolation, and isolation levels in general, are again a whole other podcast on their own. Serializable isolation is extremely important. It's kind of the gold standard of isolation levels. And even with the somewhat weaker isolation level of snapshot isolation: people use these weaker isolation levels, and they're not intuitive to application programmers. You think that you're doing everything right and everything's consistent inside your application, and you get these things called anomalies, and they lead to very serious programming bugs in your system. So that's a little bit of a whirlwind tour of the high level architecture.
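To make that hand wavy description of bootstrapping atomicity a little more concrete, here is a toy sketch in Go: all of a transaction's provisional writes (its intents) point at a single transaction record, and flipping that record's status with one write commits or aborts every intent at once. The names here (`TxnRecord`, `Intent`, `Store`) are purely illustrative and are not CockroachDB's actual internals.

```go
package main

import "fmt"

// TxnStatus models the single piece of state whose atomic flip
// decides the fate of every write the transaction performed.
type TxnStatus int

const (
	Pending TxnStatus = iota
	Committed
	Aborted
)

// TxnRecord lives on a single range, so flipping its status is a
// single (range-atomic) write.
type TxnRecord struct{ Status TxnStatus }

// Intent is a provisional value written by an in-progress transaction.
type Intent struct {
	Value string
	Txn   *TxnRecord
}

// Store is a toy versioned key-value map: committed values plus at
// most one intent per key.
type Store struct {
	committed map[string]string
	intents   map[string]Intent
}

func NewStore() *Store {
	return &Store{committed: map[string]string{}, intents: map[string]Intent{}}
}

// WriteIntent lays down a provisional value on behalf of txn.
func (s *Store) WriteIntent(key, value string, txn *TxnRecord) {
	s.intents[key] = Intent{Value: value, Txn: txn}
}

// Read resolves an intent by consulting its transaction record: a
// committed intent becomes the visible value, an aborted one is dropped,
// and a pending one is ignored in favor of the last committed value.
func (s *Store) Read(key string) (string, bool) {
	if in, ok := s.intents[key]; ok {
		switch in.Txn.Status {
		case Committed:
			s.committed[key] = in.Value
			delete(s.intents, key)
		case Aborted:
			delete(s.intents, key)
		}
	}
	v, ok := s.committed[key]
	return v, ok
}

func main() {
	s := NewStore()
	txn := &TxnRecord{Status: Pending}
	// The transaction writes to keys that could live on different ranges.
	s.WriteIntent("a", "1", txn)
	s.WriteIntent("b", "2", txn)

	// One atomic status flip commits both writes at once.
	txn.Status = Committed

	a, _ := s.Read("a")
	b, _ := s.Read("b")
	fmt.Println(a, b)
}
```

The point of the sketch is the indirection: readers never trust an intent directly, they consult the transaction record, so a single status write atomically decides writes that span many ranges.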
[00:09:36] Unknown:
I'd really point any listeners to the CockroachDB architecture documentation for more details. Yeah. I was reading through that while I was preparing for the show, and it's definitely very extensive and well presented, so I'll second that recommendation. And what are some of the trade offs that were necessary when you were building Cockroach to allow for this capability of geo replication and distributed transactions?
[00:09:59] Unknown:
Yeah. So geo replication is kinda interesting. When we're talking about geo replication, we're talking about geographically distributed nodes in the cluster. And what geographically distributed nodes implies is higher latencies. So if you're looking within a single data center, you know, a single Google data center or a single Amazon data center, the network latencies can be very low, in the low hundreds of microseconds, well under a millisecond. We're talking, you know, fairly fast to do a round trip between 2 nodes in the cluster. To give you some sense of what those times are, you know, you could actually read from the RAM of a remote machine sometimes faster than you can read from disk on a local machine. So that's within a local data center. But you move to geographic distances, and you'll start seeing network latencies of 50 or 100 milliseconds.
So much, much slower than reading from local disk, much, much slower than reading from another node in the local data center. So you have to start considering these network latencies, both in the protocols used internally inside Cockroach, as well as within the application. And, you know, we've taken pains inside Cockroach to do things in batches and to make sure we do a lot of the work on behalf of a query in parallel. But one of the things that falls out of this is applications also have to be aware of these geographic latencies when they're building a geo replicated, you know, system.
You're not able to just, you know, do many, many round trips to perform some operation against the database, because each operation on the database in turn involves round trips with latencies of 50 to 100 milliseconds, and very quickly that adds up into prohibitive query times. The bright side is there are techniques that you can use to avoid this. You know, do work in batches and do it in parallel. Applications just generally have to be aware of this. And when you were building the database,
[00:11:49] Unknown:
were you able to stick fairly closely to the sort of canonical definition of what Raft is or did you have to add your own workarounds or fixes to account for some variances that occur because of these latencies or because of conflicts that arise from having to coordinate these various nodes?
[00:12:08] Unknown:
Yeah. So we do adhere to kind of the canonical Raft protocol. I don't even know if you'd call it a canonical protocol, but there's a thesis around Raft, and we adhere to it. We actually share our Raft implementation with etcd. It's a library we use, but there are some complications that come up. So when we're thinking about the geographic distances, there are some time settings that you have to put into Raft. Raft is this leader based protocol, and the leader periodically heartbeats the follower replicas. And those heartbeats have timeouts, and you have to size the timeouts based on your network latencies.
Some of the other engineering complexities that came up from utilizing Raft: Raft itself provides this abstraction of a log, a log of commands. And, generally, you know, you propose a command to Raft, Raft does its thing, and then it comes out the other side saying, yes, this command has been committed, and it's now stable. And the log is append only. This is completely general purpose. This is the way a lot of consensus protocols are described, but it's not particularly useful on its own. If you were to try to read a piece of data that was written in a command, you don't wanna read through the entire log to find the data that was written. So what Cockroach does, and what all the other systems that use similar consensus protocols do, is they actually apply these commands to a state machine, and the state machine maintains some state in the system. You know, the commands might say, hey, write key x with value y, and then underneath you're applying that to the key value storage to write the data.
This is kind of easy and obvious, but what isn't obvious is you don't want the log to grow indefinitely. You have to have some heuristics about when to truncate the log, and you have to base those heuristics on what has been committed to the log and whether the followers are up to date. And those heuristics are kinda non obvious. There's trade offs. If you truncate too frequently, that becomes significant overhead. If you don't truncate frequently enough, your log is taking significant amounts of storage space. A kind of more subtle complexity is how complex the state machine itself should be. Kind of the naive thing, which apparently a number of different implementers have done because it's kind of the obvious thing to do, is like, ah, I could just make my state machine arbitrarily complex. I can put these commands in, and they can be kind of rich commands, and the state machine can do its thing. But there's this problem that only really comes up when you get to, you know, a production system, which is you wanna do software upgrades.
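The truncation trade-off Peter describes a moment earlier can be sketched as a toy heuristic in Go: only discard entries that every follower has already applied, and only once enough entries have accumulated to make truncation worthwhile. This is an illustration of the trade-off, not CockroachDB's actual policy, and the names are made up.

```go
package main

import "fmt"

// truncateIndex returns the highest log index that is safe to discard:
// everything up to the slowest follower's applied index, but only once
// the retained log has grown past minEntries, because truncating a tiny
// log is pure overhead.
func truncateIndex(firstIndex uint64, applied []uint64, minEntries uint64) uint64 {
	if len(applied) == 0 {
		return firstIndex
	}
	slowest := applied[0]
	for _, a := range applied[1:] {
		if a < slowest {
			slowest = a
		}
	}
	// Not enough retained entries yet: don't bother truncating.
	if slowest < firstIndex+minEntries {
		return firstIndex
	}
	return slowest
}

func main() {
	// Followers have applied up to indexes 90, 100, and 95; the log
	// starts at 10, and we only truncate once 50+ entries accumulate.
	fmt.Println(truncateIndex(10, []uint64{90, 100, 95}, 50)) // prints 90
}
```

Truncating past a lagging follower would force a full snapshot transfer instead of a cheap log catch-up, which is why the slowest applied index bounds the answer.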
And one of the things that's kind of an implicit requirement of Raft is the state machine has to be deterministic. Given an input command, the state machine has to produce identical output on all the nodes in the system that are executing it. And that makes software upgrades kind of fragile if your state machine is complex. Maintaining that invariant, making sure that even a small tweak to the state machine doesn't actually cause the output to change, can get very challenging. We've since moved Cockroach from having a fairly complex state machine to having a very simple state machine model. And this
[00:15:00] Unknown:
is completely motivated by making version upgrades easier. Yeah. That's definitely an interesting complication, because as everyone knows, statefulness in a software system is inherently difficult. But when you're dealing with having to upgrade versions, maybe one node at a time, and having differences in the behavior of the state machine change the way that the database behaves and achieves consensus, that's an interesting complication, and I'm sure one that
[00:15:29] Unknown:
led to a lot of head scratching in the process. Yes. Yes. There's actually one other final thing that we had to deal with in Raft. It's not really due to Raft per se, but just our usage of Raft inside Cockroach, which touches back to something I mentioned earlier, which is we divide all the data stored in Cockroach into these 64 megabyte ranges. And each of those ranges is a Raft consensus group. So if you go back and look at the Raft thesis, it's just talking about a single Raft consensus group and about the semantics and the correctness of operations upon that Raft consensus group. In our usage of Raft, we have thousands, tens of thousands of Raft consensus groups, and we actually allow these consensus groups to split, and we're also working on having them merge together. So you have tens of thousands, hundreds of thousands of Raft consensus groups in a single cluster. There are some bits of Raft that were considered not to be a problem if you only have one group. One of those is doing heartbeats between the leader and followers. You have, you know, one leader heartbeating each of the followers. If that's all that's going on in your system, yeah, you can do those heartbeats every couple hundred milliseconds. It's not a problem. But if you have 100,000 Raft leaders, each heartbeating, you know, 200,000 followers, the overhead of that becomes quite significant. And it's also kinda strange to see this going on in your system, where these heartbeats are occurring even if the system's otherwise idle. So we had to do a bunch of work to allow Raft consensus groups to become what we call quiescent.
Quiescent means, you know, there's no traffic on the Raft group, so we stop the heartbeats. And this is safe because we also have another layer that is actually monitoring for node outages. And when there are node outages, we reawaken the Raft consensus groups so that they can start participating in the normal Raft protocol again.
[00:17:12] Unknown:
That's interesting. So in theory, you could have a case where you have 3 nodes in a cluster and they each have, you know, maybe 50 shards. And so there is some distribution across those nodes of which one is the leader for each of the individual Raft groups, but the leaders aren't necessarily all going to be on the same node. Is that correct?
[00:17:32] Unknown:
Yeah. More or less. I mean, what you call shards, we call ranges. You've got these 50 ranges, and they'd be spread across the nodes in the cluster. Each node would be a leader for some of the ranges and a follower for some of the other ranges. You can imagine they're exchanging messages all the time, and they're doing this even when there's no other load on the system. You know, nothing's being written to these Raft groups. Do they really need to be doing all this work? And the answer is no. You don't want them to be doing all that work. You want some other kind of simpler mechanism to determine if a node is alive or dead. You don't have to have that down at the Raft level itself. That's interesting. And in terms of optimizations
[00:18:11] Unknown:
for supporting these geographic distributions
[00:18:15] Unknown:
and reducing latencies, are there any particular strategies that you had to work in to make that more optimal and easier for people to reason about? Yeah. I'm not sure about easier to reason about. This is certainly an ongoing process. Some of the work that we've done: we've extended the SQL language. Normally, SQL is this kind of interactive protocol where you begin a transaction and then you do a series of statements. The normal way to interact with SQL is you say, hey, I wanna insert a row into this table, and it returns saying, oh, the row was inserted. Then insert another row. The row was inserted. But we extended the language so that we can say, okay, insert the row, but do it asynchronously.
You don't have to tell me if it succeeded or failed. I'll figure that out when you commit. And this allows us to do a bunch of the inserts in parallel. There's also stuff that's built right into the SQL language. You can do inserts of multiple rows at a single time with one statement, and we take advantage of that and encourage users to use it when possible. It's called batching. So we encourage batched operations where possible, because the batched operations allow a lot of parallelism. Whereas if you're doing these things one row at a time, each statement operating on a single row, it just gets very slow, and you're hitting these geographic latencies, experiencing them much more than necessary.
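That batching advice applies from any Postgres-compatible client: issue one multi-row INSERT rather than many single-row statements, so the cross-region round trip is paid once. A small Go sketch that assembles such a statement with numbered placeholders (the table and column names are hypothetical, and this only builds the SQL text; executing it against a real cluster is left out):

```go
package main

import (
	"fmt"
	"strings"
)

// batchInsert builds a single multi-row INSERT with numbered
// placeholders, suitable for database/sql against a Postgres-wire
// database such as CockroachDB.
func batchInsert(table string, cols []string, numRows int) string {
	rows := make([]string, 0, numRows)
	arg := 1
	for i := 0; i < numRows; i++ {
		ph := make([]string, len(cols))
		for j := range cols {
			ph[j] = fmt.Sprintf("$%d", arg)
			arg++
		}
		rows = append(rows, "("+strings.Join(ph, ", ")+")")
	}
	return fmt.Sprintf("INSERT INTO %s (%s) VALUES %s",
		table, strings.Join(cols, ", "), strings.Join(rows, ", "))
}

func main() {
	// One statement, three rows: one round trip instead of three.
	fmt.Println(batchInsert("users", []string{"id", "name"}, 3))
	// prints: INSERT INTO users (id, name) VALUES ($1, $2), ($3, $4), ($5, $6)
}
```

At a 60 ms inter-region round trip, three single-row inserts cost roughly 180 ms of latency where the batched form costs roughly 60 ms, and the gap widens linearly with row count.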
[00:19:34] Unknown:
And digging a bit deeper into the geographical capabilities of the database, I know that one of the things that it enables is keeping a certain subset of the data within a particular geographical region, which is relevant for some of the GDPR requirements Mhmm. And some of the various other compliance frameworks. So I'm wondering if you can describe a bit about how that works and some of the complexities that might entail in terms of managing how that data is located, or joining in data across the cluster for certain queries?
[00:20:05] Unknown:
Yeah. So GDPR and all these regulations, they're very interesting and intriguing to us, because they're changing the landscape of what's necessary from your data storage system. What Cockroach provides is this concept of a zone and a configuration for a zone. So you get to tag the tables in your database as being part of a particular zone. And the zone configuration says, oh, for this table of data, I can specify which nodes in the cluster it can reside on. It doesn't force it to reside on those nodes, but you can exclude certain nodes in the cluster or, you know, nodes in a certain geographic region, and you can include nodes in other geographic regions. And you can actually control this at a fairly fine granularity. How does Cockroach know about the geographic regions? Well, you have to configure it at startup. When you start up each Cockroach node, you give it this command line flag that says, here's the locality.
So Cockroach doesn't actually really know about the geographic regions. It just knows the labels you've given to each of the nodes, and you can control where the replicas associated with a zone reside, which nodes they're allowed to reside on. So that was the original design of tables and zones, and we've since extended this so that you can take an individual table and partition it. There's a variety of mechanisms for partitioning it, but just imagine you're partitioning it by the primary key. So you can have ranges of data within a table itself and say, oh, here's my user data for users in France. I'm gonna restrict that to the France zone. And here's my user data for users in the US. I'm gonna restrict that to the US zone. And there are two benefits here. There's a regulatory benefit: GDPR or some other regulation might require you to do this. There's also a performance benefit. If your users are in the US, you don't want their data to be in Europe and have to do a transatlantic crossing for every access. So that's where we are right now. Our primary focus so far has been where the data physically resides. So far, the regulations and what we've provided haven't talked about where you're actually allowed to process the data, but that stuff's all coming down the pipe as well. Being able to control and say the data must not leave this particular region is something we're thinking hard about, and we'll probably be providing controls on that in the future. Going a bit more into the implementation of Cockroach itself, I know that at least initially, and I'm assuming this is still the case, it was written in Go, which at least traditionally is a bit of an unconventional language choice for building a database. So I'm wondering what you have found to be the pros and cons of that choice and whether you've been happy with it.
Oh, it's a good question. I mean, the traditional SQL databases haven't been written in Go, but a lot of the NoSQL landscape was written in Java, and there's a lot of similarities in the runtime performance between Go and Java. When I was at Google, I actually was something of a C++ guru, and I used it for, like, 10 years of my time at Google. I kind of loved the power and control of C++, but I also had, you know, a lot of frustrations with it. It's super easy to write code in C++ that your future self cannot understand, let alone your coworkers.
My experience with Java was one of frustration fighting the garbage collector when performance tuning. Now, Go is also a garbage-collected language, but it provides significantly more facilities than Java for controlling memory allocations and memory layout. That's been kind of important. You know, my experience using Go is that where there are issues with the garbage collector, we're able to use those facilities to work around them. Having said that, though, there are, you know, still ongoing frustrations. The Go GC, every now and then we've encountered issues where it does something surprising. It's caused increased latencies, and it has some peculiar behaviors. But overall, I'm fairly happy with it. So that's kind of the low level runtime of Go. If we step back up a level and think about what it's like to program in the language, Go doesn't have the bells and whistles of C++ or Java. I mean, the big notable exception is there are no templates or metaprogramming.
And I see a lot of new engineers get frustrated by this. And, frankly, you know, when I initially started using Go, I had a bit of that frustration as well. I used templates all the time when I wrote C++. I was very good at using them and decoding the compiler errors. You don't have to deal with any of that in Go. The Go compiler is super fast. It has this focus on simplicity, which is fantastic. The Go code I wrote 6 months ago, the Go code I wrote a year ago, is still perfectly understandable. Sometimes there's a little bit of additional verbosity to it, but, you know, I will happily take that trade off. And one of the things we notice is, you know, Go is a small language, and you can pick it up super quick. I mean, most of the engineers who come to Cockroach Labs either have minimal Go experience or no Go experience. And, really, after a week, they're doing, you know, PRs in the language without problem. After a month, they're proficient. And it takes a few more months after that to learn some of the smaller bits and pieces of the language, but it's really super easy to pick up. And I can contrast that with something like C++, where you can work in C++ for years and still not quite know what's going on under the hood. And there's another language out there that's, you know, kind of hot nowadays, which is Rust. I don't personally have experience with Rust, but everybody who does talks about the steep learning curve of learning Rust. It's kind of an open question, though. When we started working on Cockroach, Go was ready for use and Rust was not. If we were to go back in time and Rust had been available at that point and was ready to use, would we still have chosen Go?
I don't know. I I I just don't know enough about Rust to know the answer. Another question is now Google has released some c plus plus libraries that we were that make programming in c plus plus much more pleasant. Could we, you know, have written copies in c plus plus? Certainly. I'm just not sure if the trade off is is worth it. It certainly would solve some headaches, but it would have caused a massive amount of additional headaches.
[00:25:56] Unknown:
And I'm sure that it's also made your life easier in terms of distribution to end users, because one of the strong points of Go, particularly
[00:26:04] Unknown:
as an operations engineer, is that the binaries are easy to generate as statically linked artifacts. So you just drop it in place, give it a config file, and start it up without necessarily having a lot of system libraries or system dependencies at deploy time. Yeah. No system dependencies at deploy time. This is one of the architectural decisions of Cockroach as well, something my cofounder really pushed for: he didn't want us to be dependent on any other systems. Certainly, Go was part of that. There are no shared libraries that we're linking to that you have to deploy. It's kind of interesting that we took that approach, and Cockroach itself is a single binary. It's super easy to get onto your machines and push around. And now we have this whole container ecosystem, which gives you a little bit of the same flavor. From our perspective, I love containers; using Docker for certain development tasks is awesome. And yet when you toss Cockroach into a container, the container does something for you, but it doesn't do too much. In comparison to running something like a Java app inside a container, where the container takes care of getting the environment set up and your Java app needs that, Cockroach doesn't need that. So there's all this stuff that the container provides that Cockroach doesn't utilize.
For the most part, if I'm advising someone to try running Cockroach, I'm like, just use the binary. Don't bother with containers. You're just adding another level of complexity there. And looking at the documentation, though, one of the things that
[00:27:28] Unknown:
because of the easy scalability and easy distribution of Cockroach, it seems you're also strongly in favor of deploying it on top of Kubernetes, because that's a widely regarded and widely available deployment target that people are focusing a lot on these days.
[00:27:44] Unknown:
That's correct. We have a lot of customers coming to us saying, how do I run my database on Kubernetes? And with a lot of traditional databases, there's a bit of a mismatch there. It's so easy to horizontally scale out your application using Kubernetes, but you don't get a similar benefit from your database, because your database isn't horizontally scalable. We feel there's a really nice fit between Cockroach and Kubernetes: you want your stateless application to be horizontally scalable, and you also want your stateful database to be horizontally
[00:28:15] Unknown:
scalable, and Cockroach works well there. Just to spend a couple of minutes talking about the CAP theorem, because of its prevalence when discussing databases: looking through the documentation, I know that Cockroach, in terms of that framework, is strong on the consistency side, and there's a debate about how it manages to be, in some regards, both consistent and in some regards also available. So I don't know if you want to speak to that or just point people to the documentation.
[00:28:42] Unknown:
Yeah. I would point people to the documentation; I believe we have a blog post about exactly this topic. But the summary is that the CAP theorem is kind of confusing. What it's saying is consistency, availability, partition tolerance: you can have two of the three, but you can't have all three. This is proven. Another way to think about it is: in the absence of network partitions, what happens? Cockroach is consistent. When there is a network partition, what happens to the system? Cockroach is still available, but only on the majority side of the partition. So this makes Cockroach a CP system.
A system which is available, an AP system, is one where, when there's a partition, both sides of the partition are available. But one of the things you get into with these AP systems is that, with that availability, you might have inconsistencies in your data. Clearly, one side of the partition can't talk to the other side; they don't know what the other is doing. You could have two users withdrawing money from the same bank account, and that would be the nightmare scenario for a bank. So for a variety of data storage systems, you don't want that availability. You actually want to shut down and not try to stumble forward in those cases. That's the trade-off that Cockroach makes: we're saying, hey, we're going to maintain consistency at all costs. We're not going to allow you to get your bank accounts into an inconsistent state.
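The quorum arithmetic behind this CP behavior can be sketched in a few lines. This is a toy Python model, not CockroachDB's actual implementation: a partition side can make progress only if it can still reach a majority of a range's replicas.

```python
# Toy model of consensus availability under a network partition.
# Not CockroachDB's real logic; just the majority-quorum arithmetic.

def majority(n: int) -> int:
    """Quorum size for n replicas: strictly more than half."""
    return n // 2 + 1

def side_is_available(replicas_reachable: int, replication_factor: int) -> bool:
    # A partition side can serve reads/writes only with a quorum in reach.
    return replicas_reachable >= majority(replication_factor)

# Three-way replication, network splits replicas 2-vs-1:
assert majority(3) == 2
assert side_is_available(2, 3)        # majority side keeps serving (CP)
assert not side_is_available(1, 3)    # minority side refuses, avoiding stale data
```

An AP system would let both sides answer here; the sketch shows why a CP system like Cockroach shuts the minority side down instead.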
[00:30:01] Unknown:
For people who are coming to Cockroach from either NoSQL databases or traditional SQL databases, what have been some of the common points of confusion when they're first getting started with it or working on operating it? Yeah. So
[00:30:17] Unknown:
a pretty common point of confusion is just about how our replication works. It's very easy to think, oh, hey, Cockroach replicates the data, so I'm going to start up five nodes and the data is going to be replicated to all five nodes, with each node essentially an identical replica of the data. That's not how our data model works. Our data model is that we take all the data and it goes into this monolithic key space, which is divided into 64-megabyte ranges. Each of those ranges is replicated, but it's not replicated to all the nodes in the system; this is where you specify the replication factor. So let's say you have five nodes in your cluster. The default replication is three-way, so there are going to be three copies of the data, and they're going to be spread across the cluster. It helps to have the mental model that the nodes are not replicas of each other; the replicas are down at the range level, and those replicas are then spread across the cluster.
I think that's one of the more common sources of confusion.
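That mental model, a monolithic key space split into 64-megabyte ranges with each range replicated to a subset of nodes, can be sketched roughly as follows. This is a toy Python model with a hypothetical round-robin placement; the real placement logic is far more sophisticated.

```python
# Toy model of range-based replication: the key space splits into 64 MiB
# ranges, and each range gets `replication_factor` copies spread across the
# cluster, so no single node holds all the data. Placement here is a naive
# round-robin purely for illustration.
from itertools import cycle

RANGE_SIZE = 64 << 20  # 64 MiB per range

def split_into_ranges(total_bytes: int, range_size: int = RANGE_SIZE) -> int:
    # Number of ranges needed to cover the data (ceiling division).
    return -(-total_bytes // range_size)

def place_replicas(num_ranges: int, nodes: list, replication_factor: int = 3):
    placement = {}
    node_cycle = cycle(nodes)
    for r in range(num_ranges):
        placement[r] = [next(node_cycle) for _ in range(replication_factor)]
    return placement

nodes = ["n1", "n2", "n3", "n4", "n5"]
placement = place_replicas(split_into_ranges(10 * RANGE_SIZE), nodes)
assert len(placement) == 10
# Every range has exactly 3 distinct replicas, but no node has every range.
assert all(len(set(replicas)) == 3 for replicas in placement.values())
```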
[00:31:13] Unknown:
And are there any particular edge cases or failure modes that people should be aware of when they're getting started with Cockroach?
[00:31:21] Unknown:
Well, I think the edge case that comes up is that we do value consistency over availability, because we're using consensus replication. Consensus replication implies that in order to serve data, you have to have a majority of the replicas available. So let's take a very simple example. You have three nodes in your cluster, and you decide to take two of the nodes down. Then there's only one replica left for each of your ranges, and Cockroach won't serve the data at that point. And you might ask, why can't it serve the data? Well, if you think through some of the scenarios that could have occurred here, this node that's still up doesn't know what happened to the other two nodes in the system.
And because it doesn't know what happened, it might have stale data. Cockroach is erring on the side of consistency, so it doesn't want to serve that stale data. It doesn't want to allow new data to be written, because if it were, then perhaps the other two nodes come back up, this node that was alive goes down, and suddenly you've got more and more inconsistencies in your data. It's not really so much an edge case as a fundamental property of Cockroach: you need a majority of the replicas to be available in order to both read and write the data. As long as users have the right mental model, they're aware of this, they can understand it, and they can work with it. But the first impression is often something like primary/secondary replication between MySQL databases.
In those cases, if the primary goes down, I can just read from the secondary; if the secondary goes down, the primary is unaffected. With consensus replication, it just doesn't work that way. One of
[00:33:05] Unknown:
the main factors of Cockroach is, again, the SQL interface. And reading through the documentation, I noticed that the syntax is primarily modeled on PostgreSQL, with some exceptions. So, for somebody who has an existing application or is using an ORM that supports PostgreSQL, is it possible for them to point it at CockroachDB and use it in the same manner? And what are some examples of differences in syntax, where either Cockroach has extensions that aren't present in other systems or is missing some of the features that somebody might expect in Postgres?
[00:33:44] Unknown:
The answer is: maybe. It might just work out of the box. We try to adhere to the base of PostgreSQL, and by the base, I mean the part that's in the SQL standard. Examples of things outside the SQL standard that Postgres supports and we do not: full-text indexes, any of the plugins, geospatial indexing. There are also some more esoteric bits of the SQL standard that we just haven't gotten around to supporting yet, but most of those esoteric areas are not something your typical application will run into. You also asked about ORMs, whether an ORM will work unmodified. This is an area we're actively working on, because while we do support a large swath of PostgreSQL, and we actually support a large swath of the SQL used by ORMs, the ORMs themselves, when they start up, do a lot of introspection of the database. To do that introspection, they use two bits of functionality: they access something called the information_schema database, or they access pg_catalog. These two, information_schema and pg_catalog, are essentially the same bits of data presented in different ways. pg_catalog is Postgres-specific; information_schema comes from the SQL standard. But they allow you to essentially reflect on the data in your database: tell me what the tables are, tell me what the types of the columns are, and so on. We're still working through compatibility at that level, because the various ORMs seem to touch every nook and cranny of what Postgres provides. So while we might support everything needed by one ORM, a different ORM has a certain set of queries that it runs at startup. These aren't queries executed by the application, but queries executed by the ORM when it connects to Cockroach.
And we're still working through providing compatibility for all of those. Lastly, you asked about extensions: bits of SQL that Cockroach has added that aren't part of Postgres.
One example is our concept of interleaved tables. If you have two tables that are related via a parent-child relationship, which might be something you'd normally express via foreign keys, you can specify that the child is interleaved in the parent. The purpose of this interleaving is that it collocates the data. Normally, if you have two tables in Cockroach, their data is on different ranges; there's no colocation of that data. By using the interleaved syntax, you can tell Cockroach, hey, this data is going to be accessed at the same time, so it should be right near each other. And if it's right near each other, it's faster to access. Another example of an extension is one I mentioned earlier in the podcast, relating to being able to essentially run operations asynchronously and then wait for them to commit when the transaction is committed. This is via an extension of Postgres syntax.
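The collocation benefit of interleaving described above can be illustrated with a toy key encoding. This is a hypothetical Python sketch, not CockroachDB's actual key format: keying child rows under the parent's key prefix makes them sort adjacent to the parent in the ordered key space.

```python
# Toy illustration of interleaved tables: if child rows share the parent's
# key prefix, sorting the key space puts each order right next to its
# customer, instead of in a separate table-wide region.

customers = ["c1", "c2"]
orders = [("c1", "o1"), ("c1", "o2"), ("c2", "o3")]

# Without interleaving: each table gets its own key prefix, so a customer's
# orders live far from the customer row.
separate = sorted([("customer", c) for c in customers] +
                  [("order", c, o) for c, o in orders])
assert separate[0][0] == "customer" and separate[-1][0] == "order"

# With interleaving: orders are keyed under the parent customer's key.
interleaved = sorted([("customer", c) for c in customers] +
                     [("customer", c, "order", o) for c, o in orders])
assert interleaved == [
    ("customer", "c1"),
    ("customer", "c1", "order", "o1"),
    ("customer", "c1", "order", "o2"),
    ("customer", "c2"),
    ("customer", "c2", "order", "o3"),
]
```

Because the interleaved keys sort adjacently, a scan that reads a customer and its orders touches one contiguous stretch of the key space, which is what makes the access faster.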
Normally, in Postgres, you can do an insert statement and say RETURNING some column. We allow you to say RETURNING NOTHING. When Cockroach sees that, it asynchronously executes the insert, and then when you go to commit the associated transaction, it waits for all those asynchronous operations to commit.

Have there been any particularly interesting or unexpected uses of CockroachDB that you've seen?

The one I'm kind of excited about is everything regarding GDPR and what's coming up there: the geographic use cases, and how the regulatory environment has opened up this opportunity for us in the industry and is causing everybody to rethink their data storage. I don't know if there are any I can talk about in particular, other than that many companies are concerned about this, and it seems to be an increasing concern, especially with GDPR right on the horizon. Going forward, even relatively small companies are going global before you'd expect it and meeting these global storage requirements.
[00:37:27] Unknown:
Yeah. It's definitely interesting seeing the reaction that people are having, where some people maybe only focus on a particular region and so are largely unaffected. But it's strange, the distribution of sizes and types of companies being affected by these new regulations.
[00:37:45] Unknown:
Yeah. There's actually another kind of interesting use case for Cockroach, and it was a little bit unexpected to us: we have a number of users using Cockroach due to the ease of automation. What I mean by that is, if you were to use something like MySQL or Postgres, those would be the natural choices for your relational database, but you want to essentially embed those databases into your own product. You want to automatically set up replication and failover and whatnot, and doing that in an automated fashion is challenging.
Cockroach naturally has that automation and failover capability built right into it. So there's a handful of users we see in this case who are essentially embedding Cockroach into their systems. They're doing this because it provides the SQL functionality they could have gotten from MySQL or Postgres, but it also provides the reliability, and it does this in a fashion that's a little bit more hands-off for them to administer. They don't have to worry about the failover; it just comes for free out of the system. The thing I think is fascinating is that we initially designed Cockroach for these huge systems at scale, but this use case isn't one about scale. It's about ease of use, ease of automation.
[00:39:05] Unknown:
So it's starting to edge into the use case where people might use a SQLite database, because it's easy to just instantiate it and not have to worry about it, but they also want the capability of networking these various components together
[00:39:19] Unknown:
without having to worry about all of the additional logic that's necessary to do that with a SQLite database, for instance. Yeah. I'm not even sure how you'd set that up with SQLite. SQLite is an embedded SQL database that you're putting into your application. Cockroach is a little bit more something you're embedding into the product, not into the application, and it's giving you that redundancy for free. I'm not even sure what you'd do to set up SQLite to provide redundancy in the face of node failure. Yeah, nothing fun, I'm sure. There'd be a lot of additional work at the application layer, or in your product, to achieve that. That's what Cockroach is doing, so just use Cockroach instead. And are there any particular use cases where CockroachDB is the wrong choice for somebody? I think the broad class of use cases where Cockroach is not a good choice right now, and probably won't work well for you, is analytics use cases: ingesting large amounts of data and then trying to perform analytic-style queries on it. Systems that do this, like Vertica or Apache Spark, are designed for super high throughput and long-running queries. They have sophisticated optimizers for optimizing those queries, and this is just not an area we've focused on in Cockroach. Because we haven't been focusing on it, when we've done some testing here, it hasn't performed as well as you'd get from those other systems.
Our focus so far has been on transaction processing. We want to be the database of record. We've taken care of all the concerns about consistency and isolation; these all play into the transaction processing space. They can be important for analytics, but usually they're a secondary concern there. The primary concern for analytics is raw query-processing speed.
[00:41:04] Unknown:
And are there any new features or a general direction that you're hoping to see Cockroach go in the future?
[00:41:12] Unknown:
It feels like the longer we work on Cockroach, the more stuff we want to add to it. In terms of general features, we're just going to be enriching the experience for transaction processing. As for some of the analytics stuff, we're going to edge in that direction. The next release of Cockroach, which is scheduled for this fall, is going to have a much more advanced query optimizer, and that's going to take additional steps into some of those analytics spaces, though I don't think we're going to move too far in that direction. And then I think the big one that's coming in this next release is something called change data capture, which is a way of streaming data out of Cockroach.
You can essentially subscribe to a list of changes to a table or a set of tables and stream them out via Kafka. We've had immense interest from users in this feature, so that they can get data out of Cockroach, which is being used as a transactional data store, and into their analytics systems via Kafka.
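The change-data-capture idea can be illustrated with a toy in-memory model. This is a Python sketch with made-up names, not CockroachDB's actual CHANGEFEED or Kafka integration: every committed write to a watched table is also appended, in order, to any subscribed stream.

```python
# Toy illustration of change data capture: writes update the table and are
# also fanned out, in commit order, to subscriber streams (standing in for
# Kafka topics). All names here are hypothetical.
from collections import defaultdict

class ToyChangefeed:
    def __init__(self):
        self.tables = defaultdict(dict)       # table -> {key: value}
        self.subscribers = defaultdict(list)  # table -> list of event streams

    def subscribe(self, table):
        stream = []
        self.subscribers[table].append(stream)
        return stream

    def write(self, table, key, value):
        self.tables[table][key] = value
        event = {"table": table, "key": key, "value": value}
        for stream in self.subscribers[table]:
            stream.append(event)  # in a real system, this goes out via Kafka

db = ToyChangefeed()
feed = db.subscribe("accounts")
db.write("accounts", "alice", 100)
db.write("accounts", "alice", 90)

assert [e["value"] for e in feed] == [100, 90]  # changes arrive in commit order
assert db.tables["accounts"]["alice"] == 90
```

A downstream analytics system would consume such a stream instead of querying the transactional store directly.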
[00:42:07] Unknown:
And have there been any particular challenges in building or maintaining the technical or business aspects of Cockroach?
[00:42:16] Unknown:
It's a little bit of a broad question; there are challenges all over the place. Whenever you're building one of these systems, there's an enormous number of moving parts. You focus on one area, another area doesn't get as much love, and you're trying to lift the whole system up at the same time and get to a state of usability. The initial struggle of the first two years at the company, I would say, was getting to that broad usability. You're trying to find the minimum viable product, but for a database, the minimum viable product is actually pretty large. That's one of the challenges we've experienced. One of the ongoing challenges is that the SQL standard is large; the surface area of all the functionality provided by a SQL database is just huge. We have to prioritize what we do and make choices there, such that we provide the aspects our current set of users find useful, while also looking to the next set of users who are missing little bits of functionality.
[00:43:16] Unknown:
So for anybody who wants to follow up with you or follow the work that you're up to, we'll have you add your preferred contact information to the show notes. Sure. Thank you for taking the time to join me today and for the work that you're doing with CockroachDB. I appreciate it, and I hope you enjoy the rest of your day.
[00:43:36] Unknown:
Yeah. It's been a pleasure chatting with you. Take care.
Introduction to Peter Mattis and CockroachDB
Motivation Behind CockroachDB
CockroachDB Architecture and Distributed Transactions
Geo-Replication and Latency Challenges
Optimizations for Geographic Distributions
Building CockroachDB in Go
Deployment and Kubernetes Integration
Consistency and Availability in CockroachDB
SQL Interface and ORM Compatibility
Interesting Use Cases and GDPR
When Not to Use CockroachDB
Future Directions and Features
Challenges in Building and Maintaining CockroachDB