Summary
A majority of the scalable data processing platforms that we rely on are built as distributed systems. This brings with it a vast number of subtle ways that errors can creep in. Kyle Kingsbury created the Jepsen framework for testing the guarantees of distributed data processing systems and identifying when and why they break. In this episode he shares his approach to testing complex systems, the common challenges that are faced by engineers who build them, and why it is important to understand their limitations. This was a great look at some of the underlying principles that power your mission critical workloads.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Kyle Kingsbury about his work on the Jepsen testing framework and the failure modes of distributed systems
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what the Jepsen project is?
- What was your inspiration for starting the project?
- What other methods are available for evaluating and stress testing distributed systems?
- What are some of the common misconceptions or misunderstandings of distributed systems guarantees and how they impact real world usage of things like databases?
- How do you approach the design of a test suite for a new distributed system?
- What is your heuristic for determining the completeness of your test suite?
- What are some of the common challenges of setting up a representative deployment for testing?
- Can you walk through the workflow of setting up, running, and evaluating the output of a Jepsen test?
- How is Jepsen implemented?
- How has the design evolved since you first began working on it?
- What are the pros and cons of using Clojure for building Jepsen?
- If you were to start over today on the Jepsen framework what would you do differently?
- What are some of the most common failure modes that you have identified in the platforms that you have tested?
- What have you found to be the most difficult to resolve distributed systems bugs?
- What are some of the interesting developments in distributed systems design that you are keeping an eye on?
- How do you perceive the impact that Jepsen has had on modern distributed systems products?
- What have you found to be the most interesting, unexpected, or challenging lessons learned while building Jepsen and evaluating mission critical systems?
- What do you have planned for the future of the Jepsen framework?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Jepsen
- Riak
- Distributed Systems
- TLA+
- Coq
- Isabelle
- Cassandra DTest
- FoundationDB
- CRDT == Conflict-free Replicated Data-type
- Riemann
- Clojure
- JVM == Java Virtual Machine
- Kotlin
- Haskell
- Scala
- Groovy
- TiDB
- YugabyteDB
- CockroachDB
- Raft consensus algorithm
- Paxos
- Leslie Lamport
- Calvin
- FaunaDB
- Heidi Howard
- CALM Conjecture
- Causal Consistency
- Hillel Wayne
- Christopher Meiklejohn
- Distsys Class
- Distributed Systems For Fun And Profit by Mikito Takada
- Christopher Meiklejohn Reading List
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. If you've been exploring scalable, cost effective, and secure ways to collect and route data across your organization, RudderStack is the only solution that helps turn your own warehouse into a state-of-the-art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open source foundation, fixed pricing, and unlimited volume, they are enterprise ready but accessible to everyone.
Go to dataengineeringpodcast.com/rudder today to request a demo and get one free month of access to the hosted platform along with a free t-shirt. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host is Tobias Macey. And today, I'm interviewing Kyle Kingsbury about his work on the Jepsen testing framework and the failure modes of distributed systems. So, Kyle, can you start by introducing yourself? Hello. My name is Kyle Kingsbury. I work on Jepsen, which is a project to test distributed systems. And do you remember how you first got involved in the area of data management? Out of college, I couldn't get into grad school, actually; that happened during the financial crash in 2009. And I wound up working at a startup in San Francisco and doing distributed database work there. One thing led to another. And so can you dig a bit more into what the Jepsen project is and some of the backstory of how you came to create it and your original inspiration or motivation for digging into this particular area?
[00:02:57] Unknown:
Yeah. So Jepsen is a project to help make distributed systems safer by testing them and seeing what they do under real world failure modes. And back around 2010, distributed systems were becoming popular in industry, but a lot of folks had sort of optimistic expectations about what was possible. And there were some claims flying around that seemed a little bit irresponsible. There was a good deal of argument over this back and forth on Twitter and blogs and mailing lists. I was active on the Riak mailing list for quite some time. And eventually I started Jepsen to demonstrate that these problems were actually things that could happen in the real world. They weren't just theoretical issues.
[00:03:40] Unknown:
Yeah. I know that particularly when you are just testing and building a system, it's easy to see, oh, everything works great, and kind of push off the fact that there are going to be network faults and there are going to be clock skews and clock drift that cause things to not operate the way that you want them to. And I'm curious what you have seen as far as some of the common misconceptions or misunderstandings back when you first started this work.
[00:04:11] Unknown:
Back when we started, there were all sorts of interesting expectations. I remember MongoDB, for example, wrote something about how users shouldn't expect to preserve data when a process crashes or is killed with kill dash 9. And as a result, I think Riak started throwing some sort of cocktail party called "kill dash 9." Meanwhile, Riak was busy falling over anytime you tried to list keys in the database. So there was a lot of rough-and-tumble action in those days. Nowadays, I think people take partitions, crashes, that sort of thing more seriously. I see a lot of folks at least trying to do some degree of testing for those. And yet, when I do consulting work, often I'll find databases where it looks like nobody tried to verify behavior during faults.
[00:04:57] Unknown:
And aside from Jepsen, what are some of the other methods that people typically use for stress testing or evaluating the guarantees of these systems?
[00:05:07] Unknown:
I look at Jepsen, because it's an experimental approach, as being one part of a complete breakfast, a continuum of quality methods. One of those is to start with a proven algorithm, some sort of abstract model of your system, and either have a hand- or machine-checked proof of it, or at least a model checker, maybe with TLA+. And if you're lucky, you can do some sort of code extraction with Coq, or with HOL and Isabelle. But that's not, I think, how most production systems get built. Most systems get built by pointing at a whiteboard and writing a bunch of code. And then your tools for safety analysis tend to be more along the lines of unit tests and integration tests.
Jepsen takes a black box approach, which puts it sort of at the far end of integration testing. Other systems, like Cassandra's dtest suite, take a similar approach. You'll see people who do sort of manual testing, where they do fault injection via somebody just running commands on the cluster or toggling some sort of hardware power switch or network switch. That's an approach that's taken by FoundationDB. FoundationDB also does a lot of automated simulation testing, where they will run a virtualized network with many simulated nodes in the same process, and then they can intercept messages and reorder them to get interesting results. So I think that there's a lot of approaches people take to testing systems. Jepsen tries to be general and to work across many types of systems, which often means that it's not as deep or as capable at exploring the state space as, say, a simulation test that's written specifically for the database that you're working on might be.
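To make the simulation idea concrete, here is a toy sketch in Clojure. This is not FoundationDB's actual harness or Jepsen's API: the message shapes and the apply-msg model are invented for illustration, and a real simulator would also inject drops, duplicates, and delays across far larger schedules.

```clojure
;; Deliver the same two "network messages" in different orders against a
;; tiny in-process model, and inspect the outcome of each schedule.
(def messages
  [{:op :write :key :x :value 1}
   {:op :cas   :key :x :expected 1 :value 2}])

(defn apply-msg
  "Applies one message to the model's state map."
  [state {:keys [op key value expected]}]
  (case op
    :write (assoc state key value)
    :cas   (if (= (get state key) expected)
             (assoc state key value)
             state)))

(defn run-schedule
  "Runs one delivery order to completion, starting from an empty state."
  [schedule]
  (reduce apply-msg {} schedule))

;; Two delivery orders, two different final states: the CAS only succeeds
;; if the write arrives first. A simulator asserts invariants across every
;; schedule it explores, so order-dependent bugs like this surface.
(map run-schedule [messages (reverse messages)])
;; => ({:x 2} {:x 1})
```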
[00:06:47] Unknown:
And then, as far as some of the real world usage of things like databases and other mission critical distributed systems, what are some of the real world impacts that can occur as a result of these failures that aren't properly vetted or properly guarded against?
[00:07:06] Unknown:
Oh, you can get all kinds of faults. Obviously, data could be lost. So you write something, and then it just disappears later. You can have different nodes that disagree on some version of the history. So maybe one node thinks that the list you're adding things to is 1, 2, 3, and the other thinks it's 4, 5, 6. You can have temporal anomalies where you read some information from the past instead of the present. You could have transactional violations where two transactions kind of land on top of each other or interleave in some weird way. And all of these systems that I've tested typically exhibit one or more of these behaviors.
The question is sort of how hard do you have to push them to get there.
[00:07:46] Unknown:
And then in terms of the actual testing approach of using Jepsen for being able to find some of these error modes that exist in a particular system, what is your overall approach for designing the test suite and customizing it for a given system? Because all of these different platforms will have different guarantees that they're trying to provide or different approaches to linearizability or serializability or the strictness of the isolation that they provide. And I know that that provides a fairly extensive surface area to try and cover in any test suite. And I'm wondering just what your approach is for determining what are the key points to interact with, and then also any variability in the operations that are being performed.
[00:08:39] Unknown:
Yeah. You're quite right. There are sort of two huge surface areas to cover. One of them is the breadth of the API, all the different things databases do. Modern SQL databases have so many functions and so many interacting components that it's really difficult to predict how they're going to interact. And it also is tricky because when you get optimizers into the mix, depending on something like the size of your data, you might have two very different code paths that get involved, with different safety properties. So I don't have a good answer for that. I write as many tests as I can in the time that I have, but that's usually just a tiny fraction of the overall API. My guiding principle is to look for operations which are general and which will be sort of the most difficult for the system to do safely. As an example, there is a property called the consensus number, which sort of tells you how difficult it is to achieve a sort of operation in an atomic way using consensus.
So when you have a database which offers, say, reads and writes over a register, or offers compare-and-set operations, it turns out that the compare-and-set operation is actually more difficult to provide safely than just reads and writes over registers. And so I'll test compare-and-set to try and stress the system more. If I have an operation that works like appending something to a set or to a list, well, the set is order independent, whereas the list retains order. So I'll try to do something on a list, because it will expose more possible histories in which the outcome can discriminate between correct and incorrect executions,
[00:10:19] Unknown:
if that makes any sense. Yeah. That definitely makes good sense. Because as you said, the ordering is explicit in those types of data structures. And when you have these multiple concurrent clients that are interacting, and you're dealing with a system that is trying to provide maybe snapshot isolation, you need to ensure that the write and then the subsequent read produce the expected result given the point in time and the other clients that you're connecting with. And that also leads me to wonder how you synchronize the state of those multiple clients to be able to determine what the correct output should be at the end of a test suite, to be able to say, I'm writing all of these values, it's being run in this multithreaded fashion, and when everything is said and done, this is what the actual output should be, assuming that everything is running properly. And then also, particularly for eventually consistent systems, identifying how they're resolving those potential conflicts that are being created by those multiple clients?
[00:11:22] Unknown:
Yes. Obviously, real world problems with databases are typically concurrent problems, and Jepsen tests concurrently. So we have to be able to understand exactly what the concurrency structure was of all the clients that are interacting with the system. To do that, Jepsen puts them all in the same VM, the same JVM process, which means that we have a local sequentially consistent store that we can use to record the actions of each client, and that tells us what began first. But, of course, there's still concurrency there. Even though we have tight bounds on it, two transactions are likely running at the same time. And that means that when you're trying to verify whether the system is correct or not, you have to consider both possible orders, you know, one transaction executing first or the other transaction executing first. Writing efficient checkers for those permutations of operations turns out to be really tricky. Eventually consistent systems, like you mentioned, actually have this nice property, typically, that they should commute, that lots of different orders are legal. Any possible ordering should be okay. And so when you test eventually consistent systems, it's often a little bit easier to design the test, because all you need to do is apply the operations in one order, and they should be equivalent to any other order, at least for something like a CRDT. If it's not, well, then you have a combinatorial explosion again.
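As a rough illustration of that "consider both possible orders" point, here is a deliberately tiny Clojure sketch. It is not Jepsen's Knossos or Elle checker, and the function names are made up: a read that ran concurrently with a write is acceptable if some ordering of the concurrent operations explains the value it saw, and for a CRDT-style grow-only set the order doesn't matter at all.

```clojure
(require '[clojure.set :as set])

(defn possible-reads
  "All values a read could legally return, given the register's initial
  value and the set of writes concurrent with the read."
  [initial concurrent-writes]
  (into #{initial} concurrent-writes))

(defn read-ok?
  [initial concurrent-writes observed]
  (contains? (possible-reads initial concurrent-writes) observed))

(read-ok? 0 #{1} 1)  ; => true  (the concurrent write may have applied first)
(read-ok? 0 #{1} 0)  ; => true  (the read may have happened first)
(read-ok? 0 #{1} 2)  ; => false (no ordering of these operations explains 2)

;; For a grow-only set CRDT, merging in either order gives the same state,
;; so a checker only has to apply the operations once.
(= (set/union #{:a} #{:b})
   (set/union #{:b} #{:a}))  ; => true
```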
[00:12:37] Unknown:
And given the massive surface area of the problem domain that you're trying to validate, what is your heuristic for determining the overall completeness of your test suite, or determining a useful definition of done for the purposes of a particular engagement?
[00:12:55] Unknown:
Right. It's such a huge problem. And the fact that any of this works at all is sort of astonishing. One of the things that I use to guide my work is the idea of a falsifiable result. I want to design a test which yields a failing result before it passes. And I should be able to turn up and down a concurrency knob, or turn on and off some feature in the database. Say I enable the linearizable mode, and it should pass linearizability tests. And if I disable it, it should fail those tests. If it still passes with linearizability disabled, I know the test wasn't powerful enough to find some fault. So that's one notion of completeness. Another is just looking at the API and kind of staring at the algorithm and trying to guess all the ways in which it could go wrong, and then designing faults for those. But, ultimately, the space of faults is very large, and we're never gonna be able to cover all of it meaningfully. The only reason this works is because the state space is often degenerate, in that many different permutations of failures yield similar outcomes. Another metric I use to guide my work is that I watch the logs and I look at visualizations of the history very carefully. And I try to intuit: have the behaviors that I have introduced into the system, the network partitions, the process crashes, have those faults created some sort of phase change in the system? Do I see a characteristic shift in latency, or do I see a transition between returning linearizable and nonlinearizable results, even if both are legal? I wanna make sure that those transitions occur, because that tells me my test is actually measuring some sort of event in the system, like a failover or the promotion of a new node or some sort of reconciliation process. If I don't see that happen, I have a hint that my system isn't
[00:14:41] Unknown:
being tested rigorously enough. And then another element of complexity in dealing with testing these types of systems is being able to actually get them deployed in some representative configuration, making sure that there are multiple instances running, that they're all communicating properly and clustered together. And I'm curious what your experience has been as far as being able to do this in a reasonable fashion without having to do everything manually, particularly given the different ways of deployment and the different levels of maturity in terms of the operability of these systems and the configuration availability and things like that? Yeah.
[00:15:28] Unknown:
Luckily, most computer systems are meant to be deployed on *nix computers over IP, which is good, because that's how Jepsen works. But, you know, as modern deployments show up, like serverless things or hosted services, those are much harder for me to test, because I don't have control over their infrastructure. The good news is that when something is meant to be installed, I can typically write some sort of automation for it, like downloading a package and installing it and running some shell commands to configure it. All of that is automated by Jepsen. So Jepsen has a whole deploy system built into it, which pushes out different binaries and runs shell commands and uploads files and gets the cluster running. Running that automation is the first part of any test process. And because of the fact that the primary focus of Jepsen is on the distributed systems guarantees,
[00:16:19] Unknown:
another level of complexity in getting these systems set up properly is dealing with things like encryption and authentication and authorization. I'm assuming that you just completely leave that to the side so that you can focus on the durability and serializability
[00:16:33] Unknown:
guarantees of the system. Yeah. I typically ignore those, because they're often orthogonal. It's typically the case that there's an authentication layer at the top end of the database, when you first connect, and once you're past that, internally it's just the same function calls you would get either way. That's not always true. Some systems, like cryptocurrencies, will have a sort of signature verification baked into the very core of the consensus protocol. And there, it makes sense to test sort of adversarial methods. But generally with Jepsen, you look for getting the most bang for your buck. You look for the key couple of functions of the basic data model which will give you the best insight into how the concurrency control mechanism works as a whole, and then hopefully that generalizes.
So you might miss data-type-specific or authentication-specific problems, but those things could be incrementally added later if you have time to add those tests. And then, for the actual process of
[00:17:29] Unknown:
building the deployment and running the overall workflow of setting up the test suite, executing it, and then evaluating the output, can you just talk through that whole process and some of the challenges that you face, particularly at the end, in terms of interpreting the results and being able to provide useful feedback to the people who are building these systems and engaging with you to find errors that they need to correct? Yeah. My overall process: when I quote clients, I typically say there's a four-week minimum
[00:17:59] Unknown:
to get a report out the door. And that gives me enough time to get the system installed, which usually takes the first couple days to a week to get it installed reliably. It turns out a lot of systems, when you install them, don't necessarily set up reliably the first time. Then there's a lot of documentation reading that usually takes a couple days. And I try to understand the model of the system, what kind of API calls would be involved, what kind of invariants should be measured. Then I will design the workload and the client that interprets that workload and actually goes and makes network calls and gets results and understands them. And then once that's done, I'll start introducing faults. And that usually is the process of weeks two and three: adding different faults and running the tests.
The actual test workflow is: you run a shell command, Jepsen starts, it installs all the stuff on the nodes, starts making requests, introduces faults. Once it's recorded a history for a few minutes or a few hours, it will analyze that history and spit out some files and some textual output. So I'll read those explanations of what happened, I will look at the graphs it produces, I'll look at the charts and try to interpret those, and tell the people who built the database what I found. And in terms of Jepsen itself, how is it implemented? Originally, Jepsen was built quite quickly, just to demonstrate a couple of bugs in one or two particular systems. There's an early prototype in Erlang, which was actually intended to be a proxy that would sit between all of your nodes and disassemble their network flows and cut them together in new orders. That was a lovely idea, but it turns out that writing protocol-level decoders for every database is really tricky.
And it was much more interesting and effective to just focus on running systems sort of in their native environment. So round two of Jepsen is a JVM program which would run some basic workloads and requests against the server concurrently. And then I did automation, all the setup, introducing faults, separately, using a deploy tool I wrote called Salticid. So I would be manually sitting at the keyboard triggering a fault and watching what happens. Round three of Jepsen, which I think started around 2014, 2015, brought those things together and said the operating system and database deployment tools should be an integral part of the test. So when you're writing a Jepsen test now, you define what kind of operating system you wanna use, how you set up your database, what kind of schedule you want to perform, how that schedule should be applied to the database and the results recorded, and then how you wanna analyze that abstract history. And there's a decoupling between this abstract model of the world, like write 3 to the register named x, or read the current value of register y and observe 2, and the database itself. That abstract world is what the analysis tools in Jepsen are written to understand, which means that you can reuse the same analyzers across many different databases.
A lot of the interpretive work in Jepsen is now saying, well, what does it mean to write a register in FooDB? How am I gonna encode this particular abstract operation? And often there are many different ways to do it. So I'll write lots of different families of tests, which encode that sort of abstract operation differently in the database itself.
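As a rough picture of that decomposition, here is the approximate shape of a Jepsen test map in Clojure. The key names follow Jepsen's conventions as I understand them, but the exact helpers and signatures vary by version, and the values here are keyword placeholders rather than real implementations, so treat this purely as a sketch.

```clojure
(def foodb-test
  {:name      "foodb-register"
   :nodes     ["n1" "n2" "n3" "n4" "n5"]
   :os        :debian          ; how to prepare the operating system
   :db        :foodb           ; how to install, start, and tear down the database
   :client    :register-client ; turns abstract ops into real network calls
   :nemesis   :partitioner     ; injects faults such as network partitions
   :generator :register-ops    ; the schedule of abstract operations to perform
   :checker   :linearizable})  ; analyzes the recorded abstract history

;; The abstract operations themselves look roughly like:
{:process 0, :type :invoke, :f :write, :value 3}  ; "write 3 to the register"
```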
[00:21:15] Unknown:
It sounds like, from your description of actually running through the workflow, that Jepsen has components that sit on the database nodes themselves in order to be able to communicate back and forth and execute these operations and record different data points. And so, given that it is, at least on some level, a distributed system in itself, I'm curious if you have created a Jepsen test for Jepsen.
[00:21:38] Unknown:
It's a good idea. But there's no agent. There are no local processes involved in Jepsen, to try to keep things as simple as possible. Everything runs in one bog-standard JVM program on your local node. All remote interaction is done via SSH or Kubernetes or Docker interfaces.
[00:21:55] Unknown:
Another level of complexity in some of these systems is that some of the failure modes may only trigger when there are certain volumes of data in the system, or if you have a large volume of data in a cluster and one of the nodes goes down and then it creates a cascading failure of trying to rebalance information and then introducing
[00:22:16] Unknown:
new data into the cluster. And I'm curious if you explore those types of problem spaces as well. Yeah. This is something that Jepsen is really bad at. Jepsen excels at this kind of sweet spot of moderate concurrency, short temporal duration, medium-sized datasets. We're talking about histories of maybe a hundred thousand to a million operations, tops. But it's very true that a lot of times, systems only fall over for large datasets or for super high concurrencies or super high node counts. Those are important to test, but the difficulty in testing those systems with something like Jepsen is that trying to analyze the history for correctness might be computationally intractable.
It could take you millennia, maybe, to analyze it depending on the property you're trying to look at. So I think what often makes sense is to run Jepsen, or a Jepsen-like system, as one part of a suite of tests. You know, if you have distributed safety testing for Cassandra, for example, that exists in conjunction with other tests which do long run times and high data volumes but don't try to do rigorous safety analysis. They're just asking, does the system continue to respond to requests, and is latency okay? They're not trying to understand whether it was linearizable, because that problem is NP-hard. And then as far as the overall
[00:23:32] Unknown:
pros and cons that you've seen of building Jepsen in Clojure, I'm curious what you have found in that regard as far as the accessibility of the system and making it easy for other people to be able to use Jepsen for their own purposes or contribute new tests for new distributed systems? Yeah.
[00:23:50] Unknown:
You know, choosing a Lisp is not necessarily the idea you wanna adopt if you're trying to get a big community. Although I've done it twice now. Riemann and Jepsen converted systems administrators and database engineers to Clojurists, strangely enough. And when I teach classes on Jepsen, I find that people are usually able to pick up Clojure within a few hours and at least write some sort of basic test following along. You know, obviously, it takes longer to become a fluent engineer, but the language isn't necessarily the barrier that I initially thought it was. In terms of suitability, for somebody who's proficient with the language, is Clojure the right fit for this problem?
I think that the JVM is a good candidate for testing databases in general. So doing something in Java, Scala, Groovy, Kotlin, or Clojure is nice because you have access to JDBC. Almost every database publishes some sort of Java client of some type. And as long as you have interop between those languages and the JVM, you can call those clients. So from Clojure, I can call anybody's Java client, and that generally works pretty well. You also want to have reasonable performance. It doesn't have to be super fast, but you wanna be able to go somewhat quick, and you want real threads. So that rules out a single-threaded runtime like JavaScript, and it would rule out slower languages like Ruby or Python. You wanna have something more on the side of Haskell or C or Rust or the JVM, something with a reasonable profiler and some attention paid to VM performance.
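A minimal sketch of that interop argument: from Clojure you can call any database's Java client directly. This uses plain JDBC from the JDK; the connection URL and table are hypothetical.

```clojure
(import '(java.sql DriverManager))

(defn append-value!
  "Inserts a row into a hypothetical kv table via the vendor's JDBC driver."
  [jdbc-url k v]
  (with-open [conn (DriverManager/getConnection jdbc-url)
              stmt (.prepareStatement conn "INSERT INTO kv (k, v) VALUES (?, ?)")]
    (.setLong stmt 1 k)
    (.setLong stmt 2 v)
    (.executeUpdate stmt)))

;; (append-value! "jdbc:foodb://n1:5432/test" 1 42)  ; hypothetical connection URL
```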
And then because this work is experimental, I think you want a language which is concise and expressive. One of the pieces of feedback I get from people when I teach Jepsen in classes is, I'm really confused by all the parentheses, which is, of course, the classic Lisp critique. It's absolutely valid. It's a new way of doing syntax, and it's confusing. But people really like that Jepsen's API is so compact and that it is easy to read, understand, and explore new schedules. So I feel like the API design has worked out well there. And if I were to design a new version of Jepsen in another language, I would look for something which makes it easy to write compact code. And you mentioned that there are some considerations around the language. Are there
[00:26:03] Unknown:
any other elements that you would either redesign or rearchitect if you were to start over today and build a similar set of tooling?
[00:26:11] Unknown:
Yeah. At the risk of grabbing the traditional third rail of language arguments, a type system might be nice. Obviously, Clojure is a dynamic, semi-unityped language with protocols and basically the JVM's dynamic type checks underneath the hood. But there are some instances in Jepsen where you're gonna manipulate a whole bunch of data which has really similar representations for something with the same name. Like, a node might be a string, an internal identifier, a node number, a handle, a client, and you wanna discriminate between those things reliably. Having a type system helps there. So I've been using core.typed in some parts of Jepsen for places where it's hard to keep those things straight.
But it might be nice to write it in something like Kotlin or Haskell, where I could have some sort of nominal typing instead of just structural. I think in terms of architecture changes that I would make to Jepsen, one of the problems that I had for a long time with Jepsen was the generator system, the part that schedules operations and emits concurrent things to do for the database. That system was initially built as a mutable object that every thread, every worker in the system, would ask for an operation. So each thread says, hey, generator, give me something to do. And the generator would mutate some state and spit out an op. Internally, if you wanted to do something like wait for two minutes, you would just call Thread.sleep, and then the thread scheduler would take care of making sure that the thread went to sleep and woke up at the right time. This is intuitive and it's easy to write, but it can lead to really tricky cases of deadlock. Because if you want to interrupt a thread that's sleeping, or a thread that's engaged in some sort of barrier where it's blocking awaiting other threads, you run the risk of maybe double-interrupting and knocking a thread out of the interrupt recovery code itself. It becomes devilishly hard to write correctly.
The other big problem I had with the generators was that they weren't responsive to things that happened in the world. So the generator was kind of this pure source of information. It told you what to do, but it didn't take any feedback. And that meant that you couldn't write something like, try writing values until you see at least one success. And that's an important thing to do if you're gonna be testing a database. You might wanna say, keep reading until you actually get something, so that your test means something at the end. The way that you would work around that limitation would be to have some mutable state. So you would have the client and the generator share a piece of mutable information, some variable that they would both update and read in order to change the generator behavior. But that introduces this kind of ugly coupling, and it was confusing for users. So I've redesigned the generator system as a purely functional effects system with an interpreter.
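A sketch of that purely functional generator idea, in Clojure but not Jepsen's actual generator protocol: a generator is a value that, given the last completion it saw, returns the next operation plus its own successor, and a single-threaded interpreter loop feeds results back in. All names here are illustrative.

```clojure
(defn write-until-ok
  "A generator that keeps emitting writes until it has seen one succeed."
  [seen-ok?]
  (fn [last-completion]
    (let [seen-ok? (or seen-ok? (= :ok (:type last-completion)))]
      (when-not seen-ok?
        [{:f :write, :value (rand-int 10)}   ; next operation to invoke
         (write-until-ok seen-ok?)]))))      ; successor generator

(defn interpret
  "Single-threaded interpreter: ask the generator for an op, 'perform' it
  via apply-op, then feed the completion back into the successor."
  [gen apply-op]
  (loop [gen gen, completion nil, history []]
    (if-let [[op gen'] (gen completion)]
      (let [completion' (apply-op op)]
        (recur gen' completion' (conj history op completion')))
      history)))

;; A fake client whose writes fail half the time, to exercise the feedback loop.
(interpret (write-until-ok false)
           (fn [op] (assoc op :type (if (< (rand) 0.5) :ok :fail))))
```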
I thought that this was gonna be a terrible idea, and it may yet prove to be. But so far, it seems to be better. It means that we can do all the mutation in a single thread, which eliminates all of the thread scheduling ugliness and deadlocks that you had with the original generator system. And it also means that you can take feedback. The generator can be updated with each operation's completion as well as its invocation. This might be way more into the weeds than you want to talk about, but it's my big architecture change in Jepsen for the time being. Yeah. It's always interesting
[00:29:22] Unknown:
getting behind the scenes of the problems that exist in a particular tool. Because as a user, it's easy to gloss over those things or not understand the problem space effectively enough to be able to know what is wrong with a given implementation. But as somebody who has spent so much time in the code, there are always going to be warts or things that you don't like or things that you would do differently. And it's interesting to get some of the perspective on that. You know, it's funny. Jepsen's architecture has been basically unchanged for about five years. I take backwards compatibility really seriously.
[00:29:53] Unknown:
And this generator change just went live about a month or two ago, and it was the biggest kind of break I've made to Jepsen's API. But even then, it doesn't change the structure of tests. It's still a generator. It still has the same decomposition. And I feel like this is the third iteration, and it has been reasonably stable. It does feel like the right way to break up the problem. But there's a lot of experimentation and bad design choices that I'm still making in things like the checkers, the bits that verify histories and tell you if there are problems with the properties. So I am currently knee-deep in the weeds on a new checker based on Elle for doing set verification in transactional systems. And that puppy, I have no idea how to write correctly. This is my third attempt. Maybe this time it will work.
That still tends to be a never ending process.
[00:30:43] Unknown:
As somebody who has been using and working on Jepsen for quite a while now, you have worked with a number of different companies and open source projects on various types of distributed systems, so you have possibly the best perspective in terms of the overall problem space and the ways that it's being tackled. But I'm wondering what you see as some of the most common failure modes that exist in platforms, and some of the ones that continue to crop up and continue to be overlooked, maybe, in initial attempts at building some of these types of platforms?
[00:31:21] Unknown:
Yeah. And this has shifted over time. The industry is getting a lot better at building distributed systems. This is a little tricky. There is a class of what you might call an error where people rely on the well-ordering, or the k-ordering, of clocks. They assume the clocks are roughly on time and that they track time at the same rate, and then they use that to build, say, leases for leaders. This is safe and correct so long as the clocks are correct. But we don't have a lot of empirical data on how often clocks are correct and how badly they go wrong. I would like to see more of that. In the meantime, it seems valid that you would write a database based on these assumptions and make the clock skew thresholds tunable. But there's this sort of uncomfortable feeling there, like, well, eventually the clocks go wrong, and then what happens to my data? Am I okay with that? So this is sort of an active field of development. And I think all the people who are choosing to do clock stuff at this point, databases like TiDB, YugabyteDB, CockroachDB, I think. It's all blurring together at this point.
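A sketch of the clock assumption, not any particular database's lease implementation: a leader considers its lease valid as long as its own clock says so, which is only safe if clocks drift less than the configured bound.

```clojure
(defn lease-valid?
  "True if this node still believes it holds the lease. granted-at-ms and
  lease-ms come from the grantor; local-now-ms is this node's own clock."
  [granted-at-ms lease-ms local-now-ms]
  (< local-now-ms (+ granted-at-ms lease-ms)))

;; Lease granted at t=1000 for 500 ms. A node whose clock runs 400 ms slow
;; still believes it is the leader at real time 1600, after a correctly
;; clocked replacement could already have been elected: two leaders at once.
(lease-valid? 1000 500 (- 1600 400))  ; => true on the slow clock
(lease-valid? 1000 500 1600)          ; => false on a correct clock
```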
They all rely on some clocks, I believe, for leader leases in different ways, and they're making that choice in an informed way. They know that the clocks could be bad. So I don't think that's necessarily an error. It's maybe more a choice. One of the more serious errors that I see nowadays tends to be around understanding indeterminate results. So oftentimes, there'll be some sort of RPC mechanism inside of a database where one component is supposed to send a message to another and get a response. And if it doesn't get the response it expects, that's an error. But a lot of times, people will interpret that error as meaning that the operation did not take place, and it is therefore safe to retry it.
Sometimes it's safe to retry operations. But sometimes if you retry them, you wind up introducing some sort of anomaly, like a double execution or violating transactional isolation. So several of the bugs that I encountered recently had to do with internal retry mechanisms or client-level retry mechanisms which didn't understand that they weren't allowed to retry in that context. Tracking that stuff safely is something that I talk about with most of my clients. And most of them understand it, and they're trying to do it. But it's hard to keep track of all the different ways in which things can fail, because there are just so many different terrible things that computers can do to you.
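A sketch of that retry pitfall, with a hypothetical send! function rather than any real client library: a timeout is an indeterminate result, so treating it as "did not happen" and retrying can apply the operation twice.

```clojure
(defn append-with-naive-retry!
  "Retries on timeout as though the first attempt definitely didn't happen.
  send! is a hypothetical network call returning :ok, :fail, or :timeout."
  [send! v]
  (case (send! v)
    :ok      :ok
    :fail    (send! v)   ; definite failure: nothing was applied, retry is fine
    :timeout (send! v))) ; unsafe: the first append may already have landed

(defn append-without-guessing!
  "Safer: surface the uncertainty instead of guessing. Jepsen records such
  outcomes as indeterminate rather than assuming success or failure."
  [send! v]
  (case (send! v)
    :ok      :ok
    :fail    :fail
    :timeout :indeterminate))
```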
Yeah. Computers are wonderful and awful all at the same time, and generally more awful than not for people who are trying to actually build them and work on them. Oh, for sure. I have so much respect for distributed systems engineers, people who are trying to build databases, because it is this terribly complicated problem where you have to think of everything, and then here I come like some sort of rampant toddler knocking over all the blocks. Yeah. And yet they pay you to go and break things for them and tell them what they did wrong. I am weirdly effective at it. I don't know. The computers around me seem to break more than for other people. In fact, I'm currently in some sort of bureaucratic hellscape because of some sort of bank computer problem for the company. And it seems like every week, I discover some new, fresh way that computers have gone terribly wrong. I wasn't able to use my browser for, like, two weeks. Well, perhaps when you're done with distributed systems research, you can move on to security. Yeah, please. Don't give me ideas.
[00:34:44] Unknown:
And then in terms of the problems that you identify, what are some of the classes of issues that you have found to be the most difficult to resolve in some of these distributed systems?
[00:34:56] Unknown:
The ones that are difficult to fix once you understand them are generally architectural problems, where it's like, oh, we have to rip apart the whole thing and replace it with a new algorithm. But I do think that folks are using more and more proven algorithms nowadays. I see a lot of use of Raft. I see more deployment of CRDTs. I see people using Paxos, and that eliminates a whole class of errors where people just made up some sort of consensus system and then hoped that it worked. The thing that we don't know how to do yet is to build transactional protocols on top of those individual consensus systems.
And there's a lot of active research and a lot of industrial effort going into that. And I think that's a place where people will make sort of tricky mistakes, which is, again, not to say that the engineers doing this are bad. It's that they're tackling an almost impossible problem with a huge, you know, failure space and are encountering the kind of obvious
[00:35:53] Unknown:
difficulties in solving that impossible problem. And what are some of the areas of development or interesting new entries to distributed systems design and research that you're keeping an eye on and that you think are going to have a meaningful impact on the industry?
[00:36:17] Unknown:
A lot of the foundational work here happened in the late eighties to early nineties. We now understand that the state machine approach to consensus systems, where you run consensus on each operation, you form a log, and you execute the operations in order, is an effective way to build a state machine. It's how most people use Raft. That seems well understood. But what's not well understood is how you do operations between those state machines and still keep them legal under the laws of your system. So one approach to that is Calvin, which is implemented by FaunaDB. And then there's some recent work that I think Daniel Abadi has been doing, called SLOG, which looks really cool, sort of a generalization of this idea for, oh gosh, it's been so long since I read the paper, I'm not even gonna try to explain it. I will butcher it. I'll add a link to the paper in the show notes. But it looks cool. Yeah. Another thing that I'm really excited about is revisiting core ideas about consensus. Heidi Howard has been doing really cool research into, you know, do you actually need majority quorums to do consensus? We all thought so, but it turns out that that's an assumption that's not required for the proof in Paxos. And there are generalizations of Paxos where maybe instead of agreeing on one value, you agree on a set of values and you can skip around during contention. So I'm excited to see those ideas come into play, because in some cases it looks like you could do fewer round trips or shorter round trips, more local round trips, in exchange maybe for a longer trade-off during failover between domains. Another thing that I'm really excited about is CRDTs.
I know I've been plugging this for a decade now, but I think industry hasn't really caught up to this idea of commutative programming. And it is, at least the CALM conjecture suggests, the way that we need to handle geographically replicated systems, where latency prevents us from using a consensus mechanism or some other sort of strong consistency. So I'm excited to see people doing work with causal consistency, having commutative operations which get merged together either via operational transforms or CRDTs or some related structure. I wanna see more of that. And
[00:38:29] Unknown:
you have been running Jepsen tests for a while now, and Jepsen has come to be seen as a very strong signal of correctness or reliability in a system, where different database vendors will publish the fact that they have completed a round of Jepsen testing and that they resolved the different bugs that were identified by it. I'm curious what you see as being the overall impact that your work on Jepsen has had on some of the modern distributed systems products that we're using today.
[00:39:00] Unknown:
This is tricky because, you know, obviously, I have the desire to toot my own horn. Like, I am the most important person in my own story. But realistically speaking, I think I am a small part of a very large movement to do more correct distributed systems. People are advancing formal methods. Hillel Wayne, for example, has been pushing the use of TLA+. Christopher Meiklejohn has been working on Coq and other ways of doing verified systems. All the people in industry who are writing distributed systems are reading the literature and advancing. And I guess I can maybe claim some small part in pushing that along, but I think it's a lot of people doing a lot of work, too.
[00:39:42] Unknown:
As you have been working on building Jepsen and evaluating some of these mission critical systems, what have you found to be some of the most interesting or unexpected or challenging lessons that you've learned or outcomes of the work that you've done?
[00:39:55] Unknown:
One thing that was tricky for me to understand when I first started this work, and that still surprises people when I teach classes, is this notion that you have to record a concurrent history when you're doing a test. A lot of people are used to doing tests of stateful systems by performing some series of operations in sequence and then asserting that the state is a certain way. So I wanna test the counter: I add one, two, three; I do a read, and the value should be three. And that's how a lot of distributed systems tests are written at the very first pass, you know, when you're just trying to figure out, does this thing turn on and do its job? But when you're trying to do more sophisticated verification, you have to do these operations concurrently.
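A sketch of what such a concurrent history has to capture; the map shapes roughly follow Jepsen's conventions but should be read as illustrative. Process 0's write times out, so we only know it may have happened, and process 1's concurrent read could then legally see either the old or the new value.

```clojure
(def history
  [{:process 0, :type :invoke, :f :write, :value 3}
   {:process 1, :type :invoke, :f :read,  :value nil}
   {:process 0, :type :info,   :f :write, :value 3}   ; indeterminate: timed out
   {:process 1, :type :ok,     :f :read,  :value 3}])

;; A sequential test would assert one expected value. A concurrency-aware
;; checker instead asks: is there *any* ordering of the concurrent operations,
;; and any resolution of the indeterminate write, under which this read is
;; legal? Here there is, so the history passes.
```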
And maybe if the read is concurrent with the write, you could either observe it or not observe it. Now there are two possible outcomes. Realizing that I had to track that concurrency, and that I had to track indeterminacy, to have a notion that an operation may or may not have completed, and that both things have to be taken into account, that was a really subtle and tricky problem for me to wrap my head around initially. The other thing that I've really struggled with, and this is more recent, something that I've been trying to internalize over the last year, is: what is it exactly that you are proving when you say such-and-such an anomaly exists in a system?
Because anomalies, at least in the academic sense, are properties of a formal system. Like, a history is a set of operations, and each operation has a start time and an end time, and they have some sort of internal order. That's not how real systems work. Real systems are computers. They're sand that we tricked into thinking. They're quartz crystals we imprisoned and electrocute in order to watch them dance. There's a mapping between the abstract system and the real computer system you're testing, but that mapping isn't always faithful, nor is the system you're measuring the actual observation you have. There's a third thing, which is the history you record as the client.
And so when you say, I see this particular anomaly, I see a dirty read, or I see a G1c information cycle between transactions, what I'm saying is that, were there to be an abstract execution in this Adya formalism, say, which was executed by this real physical computer system, if those two things were truly isomorphic, and if all the information that I got from the system is accurate, if I wasn't lied to about the results of any of my writes and reads, then there cannot possibly have been an execution which did not contain this particular anomaly. I know this sounds really pedantic, but being forced to write it out, to develop a formal model of that separation between the abstract model, whatever the computer actually did, and what you observe of the system, has been critical in the research I've done with Elle to do transactional isolation checking.
And it's guided the way that I talk about anomalies in reports. I can say something like, this appears to evince
[00:43:07] Unknown:
a G1c anomaly, but it could also be a dirty read, and I'm choosing the most charitable interpretation of that possible behavior. Yeah. It's interesting trying to reason about these systems that have potentially unreasonable behavior, because of the fact that physics is a thing, and we have to contend with that in these sort of idealized systems that we're trying to build and rely on.
[00:43:27] Unknown:
Yeah. Physics is a thing. I have a sort of pet peeve there. This is a bit of a tangent, maybe not related here. But people often say, like, oh, relativity means there's no such thing as synchronized clocks. This is bullshit. Relativity gives us the equations for doing clock correction and figuring out what time it is. The problem is not the fabric of spacetime. The problem is that the clocks are wonky and generally bad.
[00:43:54] Unknown:
And in terms of the work that you're doing on Jepsen and your overall work with distributed systems, what are some of the things that you have planned for the future or projects that you look forward to being able to start or continue on?
[00:44:08] Unknown:
There's an open problem in my head around Elle, which is my checker for looking at transactions and telling you if they interleaved improperly, if they were non-serializable, had non-repeatable reads, or violated snapshot isolation. And that is that I understand the actual formalism for individual registers. So I have variables x, y, and z. I wanna do transactions over them. That's great. What about a transaction which says, read the current value of all registers which have an odd value in them right now? That's a predicate. And predicates are a part of Adya's formalism, which I completely ignored in my initial pass because I had no idea how to solve them.
It would be nice if I could model predicates in a history and analyze them in a way which lets me tell if there are anomalies, in the same way that I've done for individual registers. I have no idea how to do this thing. It's an open research problem.
[00:45:05] Unknown:
And as you continue to work on Jepsen, I'm curious what you have planned for the future of that project, or other projects or research in terms of distributed systems that you're looking forward to either continuing or starting anew?
[00:45:21] Unknown:
The transactional checker I've built, called Elle, has been remarkably productive in finding bugs. It's uncovered all kinds of behaviors that we couldn't see with earlier tests in Jepsen. However, there's a key problem. We can only identify anomalies at the key-value level. So Jepsen's, or Elle's, use of the formalism from Adya works over keys and values, like x equals 2, y equals 3. You can do transactions over x and y, but you cannot express a transaction over, say, all currently odd registers. That's a predicate. And Adya's formalism includes predicates and talks about anomalies over them. It would be great if I had some sort of first-class way to model those predicates and check their correctness, but I don't know how that's gonna happen yet.
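A sketch of that gap, with shapes that only approximate Elle's transaction format; the predicate form is hypothetical, since that is exactly what has no first-class representation yet.

```clojure
;; An Elle-style transaction: a list of key-value micro-operations.
[[:append :x 1]        ; append 1 to the list under key x
 [:r      :y [2 3]]]   ; read key y, observing the list [2 3]

;; A predicate operation ranges over whichever keys currently satisfy a
;; condition, so it cannot be written down as a fixed set of keys up front.
{:f    :predicate-read
 :pred (fn [[_k v]] (odd? v))}  ; "read every register whose value is odd"
```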
That's my current research project. That doesn't mean that we can't test predicates in real databases. You can still make a request to the database using a predicate in order to perform what looks like a key-value read in an abstract sense. And that lets us test whether, say, secondary indexes are correct. But we've got no first-class representation of them at the checker level, and I'd really like to get that far.
[00:46:31] Unknown:
And for people who want to dig deeper into some of the transactional guarantees or problems in distributed systems, what are some of the resources that you have found valuable or that you tend to recommend to people who are getting into this area?
[00:46:46] Unknown:
I've actually written a class, which I teach professionally to organizations, but the outline for it is remarkably comprehensive and has lots of links to other resources. So if you search for distsys-class on my GitHub, which is aphyr, there's a whole introduction to the field. I'd also recommend Mixu's book, and Christopher Meiklejohn has a really wonderful reading list for papers and introductions.
[00:47:18] Unknown:
Alright. I'll add links to all of those for people who want to dig deeper. And for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:47:37] Unknown:
Right now, databases are often tested via ad hoc tests, via some sort of generative, property-based testing like Jepsen, or maybe Jepsen itself. But there are fewer tests of the abstract underpinnings, either via simulation or via model checking or a proof system. I'd like to see model checkers and proof systems become more tractable for real industry people, maybe people who don't have the advantage of mathematicians sitting at their side to help them write the proof. I myself struggle a lot to write proofs in Isabelle or to write models in TLA+.
And I would love it if these systems, which I think have a lot of promise for industry, were more accessible.
[00:48:24] Unknown:
Alright. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Jepsen. It's definitely a very interesting problem domain, and you've built a useful approach to it. And it's something that I see mentioned very frequently across different database platforms and distributed systems. So I appreciate all the time and energy that you've put into that, and I hope you enjoy the rest of your day. Thank you, Tobias. You as well. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Interview with Kyle Kingsbury
Kyle Kingsbury's Background and Jepsen Project
Common Misconceptions in Distributed Systems
Real World Impacts of Distributed System Failures
Testing Approaches and Challenges with Jepsen
Deployment and Automation in Jepsen
Pros and Cons of Building Jepsen in Clojure
Common Failure Modes in Distributed Systems
Future Developments in Distributed Systems
Impact of Jepsen on Modern Distributed Systems
Future Plans for Jepsen and Distributed Systems Research
Resources for Learning About Distributed Systems