Summary
As we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources, the need to distribute the computational resources for managing that information becomes more pronounced. To ensure that all of the distributed nodes in our systems agree with each other, we need to build mechanisms to properly handle replication of data and conflict resolution. In this episode Christopher Meiklejohn discusses the research he is doing with Conflict-Free Replicated Data Types (CRDTs) and how they fit in with existing methods for sharing and sharding data. He also shares resources for systems that leverage CRDTs, how you can incorporate them into your systems, and when they might not be the right solution. It is a fascinating and informative treatment of a topic that is becoming increasingly relevant in a data-driven world.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers
- Your host is Tobias Macey and today I’m interviewing Christopher Meiklejohn about establishing consensus in distributed systems
Interview
- Introduction
- How did you get involved in the area of data management?
- You have dealt with CRDTs with your work in industry, as well as in your research. Can you start by explaining what a CRDT is, how you first began working with them, and some of their current manifestations?
- Other than CRDTs, what are some of the methods for establishing consensus across nodes in a system and how does increased scale affect their relative effectiveness?
- One of the projects that you have been involved in which relies on CRDTs is LASP. Can you describe what LASP is and what your role in the project has been?
- Can you provide examples of some production systems or available tools that are leveraging CRDTs?
- If someone wants to take advantage of CRDTs in their applications or data processing, what are the available off-the-shelf options, and what would be involved in implementing custom data types?
- What areas of research are you most excited about right now?
- Given that you are currently working on your PhD, do you have any thoughts on the projects or industries that you would like to be involved in once your degree is completed?
Contact Info
- Website
- cmeiklejohn on GitHub
- Google Scholar Citations
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Basho
- Riak
- Syncfree
- LASP
- CRDT
- Mesosphere
- CAP Theorem
- Cassandra
- DynamoDB
- Bayou System (Xerox PARC)
- Multivalue Register
- Paxos
- RAFT
- Byzantine Fault Tolerance
- Two Phase Commit
- Spanner
- ReactiveX
- Tensorflow
- Erlang
- Docker
- Kubernetes
- Erleans
- Orleans
- Atom Editor
- Automerge
- Martin Kleppmann
- Akka
- Delta CRDTs
- Antidote DB
- Kops
- Eventual Consistency
- Causal Consistency
- ACID Transactions
- Joe Hellerstein
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site.
To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host is Tobias Macey, and today I'm interviewing Christopher Meiklejohn about establishing consensus in distributed systems. So, Christopher, could you start by introducing yourself?
[00:01:00] Unknown:
Sure. So, my name is Christopher Meiklejohn. I'm a PhD student at the Université catholique de Louvain in Belgium and the Instituto Superior Técnico in Portugal. The area of my PhD research is programming large-scale, eventually consistent systems, specifically systems that are operating at the edge and mobile systems, so places where you can't really use consensus. And so a lot of my research is in trying to avoid consensus and trying to do things as
[00:01:32] Unknown:
coordination-free as possible. And how did you first get involved in the area of data management?
[00:01:37] Unknown:
I've worked in industry for a while, something like 15, 16 years, before I decided to go back and do my PhD. And so my first kind of introduction to distributed systems was when I was an engineer at Basho Technologies working on the Riak distributed database, which is a NoSQL, eventually consistent key-value store written in Erlang that's used by a variety of industry folks. And from there, I got involved in a European research project on CRDTs, trying to make working with eventual consistency easier. And I ended up leaving Basho Technologies to go pursue my PhD with the research group that I began working with while I was there. Since then, I've worked on a bunch of different NoSQL databases, and worked on some distributed programming work with Comcast and Mesosphere, kind of in the interim time between leaving Basho Technologies and starting my PhD.
[00:02:39] Unknown:
And as you mentioned, you spent a decent amount of time working in industry with distributed systems, and you're currently doing research into how to establish consensus in widely distributed systems. And one of the focal points of that research and your experience, from what I was able to gather while doing research for the show, is the concept of CRDTs. So I'm wondering if you can just start by explaining a bit about what that is, how you first began working with them, and some of the current manifestations of them, both in academia and in industry.
[00:03:12] Unknown:
If you think about the boundaries of the CAP theorem, you're left with this choice when building these large-scale distributed systems: you know that network partitions are going to happen, and you have to favor either consistency or availability. In a consistent system, you would want to maintain a total order. You want to ensure that all of the nodes agree in lock step, and you don't have divergence of values, if you're thinking of something like a database. From the availability side of things, you want to keep the system operational all of the time. So even when the system is partitioned and you can't get consensus, you want the system to be able to continue operating. And so you have these two bounds, this CP area on one side and this AP area on the other side. A lot of the CRDT work looks at these global-scale systems where, you know, I might have a database copy in Japan, a copy in Singapore, a copy in Virginia, and a copy in California.
And the problem of consensus becomes harder because, inevitably, the latencies are so long, since you're sending messages between Japan and California, for instance, or Japan and Virginia, and those latencies end up being high. Partitions can happen as well. So at the global scale, and even considering mobile, for instance, where devices are coming and going offline, you have to deal with these problems of latency and these inevitable network partitions. And so CRDTs come from the angle that you want to stay on the availability side.
But if you stay on the availability side and you let nodes continue to operate when they can't talk to all of the other nodes in the system, you might have a system where not everybody agrees on a value at a given time. The famous example of this, at least in the literature, is the shopping cart that Amazon talks about when they built their Dynamo database, which Riak and Cassandra are based off of. Rather than have a system that can't accept somebody adding an item to a cart or removing an item from a cart, they always want to accept the operation. And what ends up happening is that you might have a conflicting add and remove, and you have two copies of the object. And when the user goes to check out, you have to make a decision about which one is right.
Is the left one right, the one that has the item removed, or is the one that has the item added right? Without tracking a bunch of additional information, you don't know whether that add came before the remove or after the remove. In one case, it might result in the item not being there; in another case, it might result in the item being there. In Amazon's case, they basically said, well, the shopping cart is a set and we'll take set union, because that's well defined. And this results in these anomalies where things might come back. It's this insight that led a bunch of researchers, around 2011-ish, though the work has its genesis back in a paper from '83, to say: we can have a mathematical function that knows how to resolve these differences in a deterministic way. Can we generalize that? Can we build a general theory around that? And so this general theory and this framework for building data abstractions that have these properties is referred to as a CRDT. To make this a bit more straightforward to think about, you can think about it as a data type, like a normal data type, like a set or a dictionary or a counter or a register or a Boolean, where you just have that data type but with an additional merge function. So it says, for any two copies of the object, I always have a way to merge them together.
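The set-union anomaly described here is easy to see in a few lines. This is a hedged illustration in Python, modeling the cart naively as a plain set (not Dynamo's actual implementation):

```python
# Two replicas of a shopping cart, naively modeled as plain sets.
replica_a = {"book", "lamp"}
replica_b = {"book", "lamp"}

# While partitioned from replica A, replica B removes the lamp.
replica_b.discard("lamp")

# When the partition heals, union is the only well-defined merge for
# plain sets without extra metadata...
merged = replica_a | replica_b

# ...so the removed item reappears: the anomaly Amazon accepted by design.
assert merged == {"book", "lamp"}
```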
And so this allows us to say that if I have a shopping cart that can do adds and removes, and it's a CRDT, I can operate on one copy of it while there are concurrent changes happening to another copy of it. I can't get consensus because, let's say, all of the nodes are partitioned from one another. But when the partition heals and the nodes all communicate with one another, I have a way of merging all of those changes together in a way that makes sense. The easiest CRDT you can think about is a set where I can never remove values from it; then simply taking the set union is sufficient to merge any copies of this set. So I can apply changes to one copy over here, one copy over there, and then merge them all together just by taking the union, and that result will be correct regardless of how the executions are interleaved. Things get a bit more complicated when you want to start removing things, because now, if I remove something, set union is no longer sufficient: when a copy that is old or out of date is merged with a newer copy where the item has been removed, that item will reappear. And so as you enable data types that have more operations on them, these merge functions get more and more complex, and the amount of metadata you have to store grows as well. The CRDT literature basically takes these really simple types, like a set where you can never remove anything, and increasingly makes them more complicated, giving you richer and richer data types: a set where you can remove and add things an arbitrary number of times, or a dictionary where I can have a bunch of keys and set the values in the dictionary for those keys, and things like that. And so CRDT is a broad term for this collection of data types that have merge functions.
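The grow-only set, the simplest CRDT mentioned above, can be sketched directly. This is an illustrative Python sketch, not code from any particular CRDT library:

```python
class GSet:
    """Grow-only set CRDT: the only operation is add, so merging two
    replicas is just set union, which is commutative, associative,
    and idempotent; replicas converge in any merge order."""

    def __init__(self, items=()):
        self.items = set(items)

    def add(self, item):
        self.items.add(item)

    def merge(self, other):
        # Join of two replica states.
        return GSet(self.items | other.items)

# Two replicas accept writes independently while partitioned...
a, b = GSet(), GSet()
a.add("milk")
b.add("eggs")

# ...and converge once they communicate, regardless of merge order;
# merging a replica with itself changes nothing (idempotence).
assert a.merge(b).items == b.merge(a).items == {"milk", "eggs"}
assert a.merge(a).items == a.items
```

Supporting removal requires extra metadata (tombstones or unique add-tags, as in an observed-remove set), which is exactly the metadata growth the transcript describes.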
And for a lot of databases that are operating at distributed scale, they generally have the notion of eventual consistency,
[00:08:56] Unknown:
which is largely what you were talking about, where if there is a network partition, at some point they will reconvene and start to try and migrate data between each other. And in some database systems, for instance CouchDB, they use a sort of optimistic concurrency control where they'll keep a reference to both versions and arbitrarily, but deterministically, choose one of the states to be the reality. And then it's up to the programmer to determine the business logic to reconcile those states. And so with a CRDT, or conflict-free replicated data type, you are moving some of that business logic into the data type itself, as opposed to having it live in the application code to resolve those conflicts. Is that an accurate representation?
[00:09:42] Unknown:
Yeah. If you go back, the earliest work is this paper by D. Stott Parker in '83 on detecting conflicts when you have concurrent modifications without coordination. That paper is the first place that highlights this problem, but probably the most notable system that talks about it is the Bayou system from Xerox PARC, which basically says exactly this. It says that you'll have conflicting updates, and it will call back into user code asking the user to resolve the conflicting updates. They'll have to make a choice. Right? And so, actually, the early versions of Riak do do this. If you have concurrent changes at the same time, when the system resolves them, it will store both values. So you'll get two objects back instead of one, and then the database expects you to write one value back to the database. That's how Riak worked. And what's interesting is that that behavior is actually a CRDT. The idea that you have a value that diverges, and you create an object with two values in it, and then you resolve that to one, is actually a CRDT called the multi-value register. And so that idea nicely fits into this CRDT generalization, because the idea that you store both and then resolve them to one value that supersedes the previous two is in itself a CRDT.
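The multi-value register behavior described here (keep concurrent siblings, let a causally later write supersede them) can be sketched with version vectors. This is a hedged toy sketch of the idea, not Riak's implementation:

```python
class MVRegister:
    """Toy multi-value register CRDT. Each written value carries a
    version vector; merge keeps every value that is not causally
    dominated, so concurrent writes surface as siblings."""

    def __init__(self, replica_id):
        self.replica = replica_id
        self.siblings = []  # list of (version_vector, value) pairs

    def write(self, value):
        # A local write has seen the current siblings, so it supersedes
        # them: take the pointwise max of their vectors, bump our entry.
        vv = {}
        for v, _ in self.siblings:
            for r, n in v.items():
                vv[r] = max(vv.get(r, 0), n)
        vv[self.replica] = vv.get(self.replica, 0) + 1
        self.siblings = [(vv, value)]

    def merge(self, other):
        combined = self.siblings + other.siblings

        def dominated(v):
            return any(w != v and all(v.get(r, 0) <= w.get(r, 0)
                                      for r in set(v) | set(w))
                       for w, _ in combined)

        self.siblings = [(v, val) for v, val in combined if not dominated(v)]

    def read(self):
        return {val for _, val in self.siblings}

a, b = MVRegister("a"), MVRegister("b")
a.write("x")
b.write("y")
a.merge(b)
assert a.read() == {"x", "y"}  # concurrent writes: two siblings survive
a.write("z")                   # a write that has seen both supersedes them
a.merge(b)
assert a.read() == {"z"}
```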
So Riak, to this day, if you get a default install of the Riak database, operates precisely like that. It will store both values and ask the user to resolve them. However, Riak does have a series of data types that you can use if you want that automatic resolution to happen. A lot of systems in production have been built this way, and a lot of people create custom CRDTs because they want that kind of custom resolution logic. What's nice about the mathematical framework that CRDTs present is that it gives you a way of knowing whether those merge functions are going to be correct. It gives mathematical properties that a merge function must have for it to be correct in all cases. Whereas, if one person were to just write an ad hoc merge function for every possible conflict that they would have in their system, that's usually pretty error prone; with a sufficiently complicated data type, it's pretty trivial to write an incorrect merge function. And so establishing a base set of CRDTs that map to the data types you'd have in a normal programming language, like a register or a Boolean or stuff like that, enables the programmer to work with those and compose them in a way that's safe, that has safe merge functions, rather than writing custom data types with custom merge functions and potentially getting them wrong. And other than CRDTs, what are some of the other methods for establishing consensus across nodes in a system
[00:12:25] Unknown:
that have been used in the past and currently? And
[00:12:29] Unknown:
how does increased scale affect their relative effectiveness, and where do they break down? Yeah. So there's kind of an interesting point in here, because CRDTs don't necessarily get you consensus. If we look at the academic definition of consensus, CRDTs are not going to get you that definition. The academic definition of consensus comes down to three properties: termination, agreement, and validity. It basically says that the execution will terminate, all of the nodes will agree on a value, and that value was proposed by somebody. CRDTs don't actually give you that, because of the value that you might end up with. For instance, if we are operating on the same object and I add 1 to a set and you add 2 to the set, we might ultimately agree on a value where both 1 and 2 are in the set. That wasn't actually a value that any of the individual participants proposed, but that's the value that we arrive at. So CRDTs give you this agreement property, but they don't necessarily give you the validity property, which says that the value that is agreed upon was a proposed value. What they do guarantee is that the changes that all of the individual parties make will be incorporated in that final result. And this is because, with CRDTs, you're not getting a total order over the execution; you're getting a partial order over the execution. Traditional consensus would give you a total order over all of the operations: all of the nodes agree on each value as it's being proposed and come to some final agreement on a value. Now, in terms of mechanisms for achieving consensus, the gold standard here is Paxos and all of the subsequent variations of Paxos.
And then you have protocols like Raft, protocols like Viewstamped Replication Revisited, and things like this. In those systems, the way it works is that you use a ballot proposition mechanism, where somebody will propose a value, you'll have acceptors or rejectors of those values, and they'll continuously propose ballots until they come to agreement on a value based on a majority. These systems typically require a majority of participants to be online. If they need to be Byzantine fault tolerant, you need to have more participants online. They usually involve two rounds of messages to come to agreement on something.
And you can see this same standard technique in things like two-phase commit. Now, these protocols can get very slow when they operate at larger geographic scale. Obviously, in the mobile case, you have a problem, because if devices are transient, then coming to a majority may be difficult if a majority of those devices are offline at the time. So consensus becomes problematic if you actually want the mobile devices to be participants in the round, if they're all going to be proposing potential values. And if you have geographically distributed data centers across the world, this becomes problematic because you need to wait for round-trip messages, typically for these two rounds, between all of the participants that are geographically spread out. So the protocol can start to slow down when participants are further away, and partitions become more problematic.
Systems like Spanner, and databases like Calvin, try to mitigate these things with clever techniques: by electing sequencers, for instance, and then having deterministic dissemination of events downstream. Spanner has this TrueTime mechanism that uses synchronized clocks, via GPS receivers and atomic clocks, to try to approximate the windows of vulnerability around the transaction commit period. All of these mechanisms are trying to optimize within this fundamental bound. There are all these different versions of Paxos that try different things to get around the fundamental problem that ultimately you have to talk to the participants, and the further away the participants are, the longer it's going to take. Some of these variants try to leverage commutative operations, so some things can be done out of order, and try to game the system that way. Some try to use atomic clocks to optimize that way. It's all a game of trying to hide the latency. You have to pay the latency somewhere, because you have to talk to the node; you have to send it a message. The question is where you can hide that latency in the protocol, and different protocols optimize for different use cases.
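The majority requirements mentioned above follow simple, standard formulas. A hedged sketch of the usual quorum arithmetic (the bounds are standard in the literature; the function names are mine):

```python
def crash_fault_cluster(f):
    """Smallest cluster and quorum tolerating f crash faults, as in
    majority-quorum protocols like Paxos and Raft: n = 2f + 1."""
    n = 2 * f + 1
    quorum = n // 2 + 1
    return n, quorum

def byzantine_cluster(f):
    """Smallest cluster and quorum tolerating f Byzantine faults,
    as in PBFT-style protocols: n = 3f + 1, quorum = 2f + 1."""
    n = 3 * f + 1
    quorum = 2 * f + 1
    return n, quorum

# Tolerating one fault: 3 nodes (quorum 2) for crashes, but 4 nodes
# (quorum 3) once nodes may lie; Byzantine tolerance is strictly costlier.
assert crash_fault_cluster(1) == (3, 2)
assert byzantine_cluster(1) == (4, 3)
```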
[00:17:29] Unknown:
And one of the projects that you've been involved in, to try and make it easier for engineers to build systems that are capable of operating at these large, massively distributed scales, is the LASP project, which also relies on CRDTs in some of its layers. So I'm wondering if you can describe a bit about what LASP is and what your role in the project has been.
[00:17:54] Unknown:
Sure. So LASP is basically my PhD work. I started it in, I think, 2014, and I've been working on it basically every day since then. What it's trying to do is think about using CRDTs outside of the database. It's a dataflow programming language, so it looks something like Google Dataflow or Rx or something like Naiad or TensorFlow; it's similar in being a dataflow language, and it's for writing computations with CRDTs. The language is super restrictive. There's not a lot you can do in it at the moment, but all of the programs that you can write in the language are guaranteed to be safe for deployment on any topology. It makes no assumptions about ordering or anything like that, and the only data abstraction in the language is a CRDT. What that basically means is that events can disseminate through the network in any order, and it doesn't change the outcome of the program. So a program is completely safe to run on an edge network with sensors. It's safe to run on a mobile network. It's safe to run in the data center. It's safe to run across data centers. The work we've been doing over the past years is trying to make the language more expressive, to add more things to it. Right now, programs can only be directed acyclic graphs: you can't have cycles, and you can only do basic data transformations and a small amount of aggregation. It's sufficient to express some subset of SQL, so it's been useful for writing programs that have that data-access pattern. But what we'd like to do is grow it into a richer, functional-style programming language, and so we've been working on that. Along the road, it's been a lot of implementation and experimentation, trying to figure out what we can build with this.
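The order-insensitivity described above can be illustrated with a toy analogy in Python (this is my own sketch under my own assumptions, not LASP's Erlang API): a derived value computed as a map over a grow-only set depends only on the set's contents, so the order and duplication of update delivery don't change the outcome.

```python
def gmap(f, source):
    """Dataflow map over a grow-only set: since the source only grows
    and f is applied elementwise, the derived set is insensitive to
    the order (or duplication) of update delivery."""
    return frozenset(f(x) for x in source)

# Two replicas receive the same updates in different orders, with repeats...
a = frozenset()
for update in [1, 2, 2, 3]:
    a = a | {update}
b = frozenset()
for update in [3, 1, 2]:
    b = b | {update}

# ...yet the derived dataflow output converges to the same value.
assert gmap(lambda x: x * 10, a) == gmap(lambda x: x * 10, b) == frozenset({10, 20, 30})
```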
And so, from the academic side of things, LASP is a formal semantics for a programming language. But from the implementation side of things, it has grown to be quite a large code base now. It's a collection of reference implementations of every CRDT that's come out of our group; we have reference implementations of, I think, at least 30 different data types in Erlang. In addition to that, it's a lightweight programming layer on top of a peer-to-peer replicated database, which is the storage layer for the CRDTs that the programming language uses. And then, to scale that peer-to-peer database to the number of nodes that we want, we had to write an entire distribution layer from scratch, because we couldn't rely on Erlang's. We scaled it up to about 1,000 nodes last year, and this year into next year we're hoping to get to 4,000 nodes. So now we have this entire distribution layer, written from scratch, that runs in various topologies.
It supports running distributed Erlang applications that operate with either full connectivity, client-server, peer-to-peer, all sorts of stuff like that. And then, to evaluate this at scale, we ran a series of experiments on Google Cloud Platform and Amazon. We built an entire deployment infrastructure for it that allows us to deploy these Erlang applications there via Docker or through Kubernetes and things like this, and it uses our distribution layer for messaging between all of the nodes. So, you know, it started off as this really small programming layer with a little formal semantics, and now it's this full-stack distribution thing. And what's been really nice is that we've seen a bunch of adoption of the individual components. Our data storage layer and part of the programming model is used by a project called Erleans, which is a port of the Microsoft Orleans programming framework to Erlang. We have a couple of other industry users that haven't gone public yet, so I can't name them, that are using our distribution layer to get around problems they ran into in the standard Erlang distribution layer.
And we've seen some other adopters of the data types. So the artifacts that we've constructed during the project have made their way out into industry, and that has been a kind of nice reinforcement.
[00:21:56] Unknown:
So it sounds like people are treating it as sort of generally available or production ready. People are willing to experiment with it, and there's enough stability and usefulness in it that they're able to start building their own systems.
[00:22:08] Unknown:
Yeah. For sure. It's been really nice to have people reach out to us. To come across a project in the wild that's using part of the stuff that you made as your PhD is kind of a nice feeling. So, in the process of your research and your work on LASP and your work in industry, what have been some of the most challenging
[00:22:26] Unknown:
aspects of working with CRDTs and distributed consensus protocols, both from the aspect of being able to understand the operations that are at play, as well as actual implementation-level details?
[00:22:41] Unknown:
I think the hardest part, I mean, the CRDT stuff is our primary area; that's where all our expertise is, so that's actually been the least problematic. The most challenging has been trying to just run experiments with this stuff. It's really hard to run large-scale experiments on public infrastructure and draw conclusions from them, just because of the variability of everything. To give you an idea: for some of our experiments, we want to simulate things on, like, 2,000 or 3,000 nodes. Getting 3,000 instances to run an experiment on Amazon is difficult to do.
And then it's difficult to keep all of those instances running. Even if you sub-virtualize them, which we do to save costs, it's difficult to keep everything running and to know whether things are working correctly, just because of the nondeterminism that's inherent in the networks these cloud providers have. You can run one experiment now, and we could run the same experiment tomorrow morning and see completely different results. And so, when you're doing the CRDT work, you design a data structure with a particular space or time complexity, and you test it out locally and see that the behavior works. But then you want to run it on a real network. Let's say you get a workload from Facebook or Google or whoever, some workload that's available for academic use, and you want to run that workload. The workload has very particular characteristics, and some of these workloads need to be run on fairly beefy hardware if you want them to be representative.
And it's really difficult to draw conclusions. I think one of the big problems with experimental science around large-scale cloud applications today is that anything done over 1,000 nodes at this point is done in simulators. And simulators, at least a lot of the state-of-the-art simulators that we have in the academic space for doing this stuff, gloss over some of the low-level implementation details. Right? A simulator is going to be a lot different than a real TCP stack running a particular version of the Linux kernel or something like that. And so it's hard, because at least from my experience at Basho, what we found was that taking CRDTs out of an academic context, where they had a particular performance profile, and putting them in a real database with real users, you see a completely different performance profile.
And so, ultimately, we want to have adoption of these things, but we have to meet somewhere in the middle. Right? We can't rely on academics to build production-quality code. That's not what they're paid to do; that's not what they should do. They should be doing research. But we want there to be something that industry folks can take and actually work with, and so we need that gap to be as small as possible. At least with our work, what we're trying to do is build Erlang stuff that might not be bulletproof, might not be ready to go into a bank, but is at least good enough for some startup to use, or something like that. Right? So we're trying to get our stuff as close as possible to where somebody can pick it up and run a real workload on it. And then, hopefully, they will come back to us and say, this is what we saw; can you find a way to fix this? It would be nice to have that kind of feedback loop, because you don't want to do research on some idea that's never going to be looked at, or that's going to disappear into some paper archive somewhere. You want to do research on things that really matter and things that people can use. That's been the real challenging part: trying to bridge that gap, trying to close it as much as possible. And I think we're getting there. As we get more users and show off what we can do and build demos and talk about our stuff a lot, I think we'll get there. But it's not an easy challenge, let's say. And for people who are working in industry that want to take advantage
[00:26:38] Unknown:
of some of the concepts behind CRDTs and their capabilities, are there any applications or tooling available off the shelf to assist in implementing those CRDTs or using prebuilt CRDTs?
[00:26:56] Unknown:
And what are some of the complexities or challenges that they should be keeping an eye out for while they're working on that? Yeah. So, recently, there's been a bunch of CRDT stuff happening in industry. I'll highlight a few things. Some folks at GitHub have a CRDT library with some data types. I don't remember the name of it offhand, but I can try to get that for the show notes. But it's the basis for the collaborative editing extension for their editor, the Atom editor. And then there's a project called Automerge, with Martin Kleppmann and a few others, who have built a JSON CRDT for use in JavaScript applications, with an implementation. So you have a JSON object, you can store whatever in it, it has a merge function, you can synchronize it, and all sorts of stuff like that. That's probably what I would recommend a lot of people start with, because it's a familiar data structure, it's JSON, it has merge semantics, and they have a nice library that you can just install with npm to start playing around with. So that's one of the ones I know of that's probably the closest to being something that somebody could just pick up and start using today. At least from the Erlang side of things, our CRDT library is probably the most robust library that we have out of our academic artifacts. There's a library called types under the lasp-lang organization on GitHub that has an implementation of a variety of different data types, I think it's like 30 data types in 2 different flavors, that people, at least in the Erlang community, have started using. I think Mesosphere was using it at one point, and there were a couple of other people, I think, that were as well.
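All of these libraries rest on the same algebraic core: a state with a merge function that is commutative, associative, and idempotent, so replicas converge no matter how updates are reordered or duplicated. As a minimal illustration, here is a hypothetical sketch of a grow-only counter (G-Counter) in Python; it is not taken from any of the libraries mentioned:

```python
class GCounter:
    """Grow-only counter CRDT: each replica tracks its own increments."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> total increments observed from that node

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so merging in any order (or repeatedly) yields the same state.
        merged = GCounter(self.node_id)
        for node in set(self.counts) | set(other.counts):
            merged.counts[node] = max(self.counts.get(node, 0),
                                      other.counts.get(node, 0))
        return merged
```

Two replicas can increment independently while partitioned; merging their states in either order, or more than once, yields the same total, which is the convergence guarantee these data types provide.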
And then, at least from the Java side of things, Akka has a CRDT implementation as well, written by the guys over there, that's pretty up to date with, I think, the paper from last year on delta CRDTs. So those are probably the good implementations to look at if you're looking for implementations of the data types. In terms of database work, we have a database called Lasp KV, which is the data store that's embedded in Lasp and can be embedded inside of Erlang applications. So we have an embeddable peer to peer CRDT data store in Lasp that you can use, which is the one that Erleans is using. And then we have a database called Antidote, which is a standalone database as opposed to an embeddable one, a CRDT database with transactions that came out of our research group. It is not industry grade yet, but you can download it as a Docker image, boot it right up, and start playing around with it. So that's another way, if you wanna look at it from a database perspective. And then, at least in terms of databases that support CRDTs, Riak is the primary one, with a whole collection of CRDTs built into it that you can use, which me and
[00:29:51] Unknown:
a bunch of the other Basho engineers worked on. And for people who are working with some of the big name data engineering tools like Spark or Flink, or some of the different streaming solutions, or even Hadoop and MapReduce jobs, are there existing tools or support within those tools for being able to leverage CRDTs?
[00:30:14] Unknown:
I'm not really sure. That's a good question. At least on the Hadoop side of things, you're usually processing immutable data, so there would be no need for CRDTs there. But I imagine you could leverage CRDTs in Spark if you'd like to, using the Akka library, considering the Akka library is written in Scala. So I imagine you could use that. But I don't know of anything that's really integrated with those products.
[00:30:44] Unknown:
And going back to AntidoteDB, you mentioned that it's not quite industrial grade yet, but that it's available for people to experiment and play around with. So what's your vision for that project, and
[00:30:57] Unknown:
what are some of the particular use cases that you think it's well suited for? Yeah. So AntidoteDB is trying to be kind of the spiritual successor to systems like Riak and academic systems like COPS and GentleRain. Basically, where Riak provides eventual consistency, Antidote provides causal consistency. With causal consistency, you get stronger guarantees than you get with eventual consistency. You get the guarantee that if a particular client does a and then does b, those results will become visible as a and then b, whereas under eventual consistency, b may become visible before a. And that's nice because it matches well to writing programs that are sequential. So if you're writing a sequential program, and let's say you want to write an object to the database and then add that object to an index, under eventual consistency you can run into anomalies where the index might point to an object that's not there yet, where under causal consistency that can't happen. The other nice feature that Antidote provides is transactions.
So it has a transaction facility so you can do updates together, which is something that Riak, for instance, can't do. That's nice for grouping updates that need to happen together in a batch. Now, what we're working on adding to Antidote is ACID transactions. Antidote is kind of the product of the vision that we have called just-right consistency. What just-right consistency argues is that using consensus and coordinating for everything is too much, because there's only a subset of operations that you actually need to coordinate for to prevent seeing incorrect states in your application.
And so, by having causality and transactions, these things are always available. You don't need to coordinate for them, and they guarantee that things happen in batches and things happen in order. Well, they happen in some order, a kind of partial order. And then the only things you actually need to coordinate for, or get consensus for, are things where you have to check something and take an action based on that, because under eventual consistency, you might check something and then it might change before you take the action. So the Antidote vision is: only coordinate for those operations. Now, the obvious thing to attack with this design philosophy is: how do I know what the things I have to coordinate for are? Right? The Spanner approach is to coordinate for everything, so you don't have to think about it. The Antidote approach is to coordinate only when you need to, and now you have to figure out when you need to. And so, complementary to this database is a static analysis technique that is being developed, and that we hope to industrialize, that uses annotations in your source code to determine exactly which operations require coordination, based on a first order logic. And then what you can do is run an analysis of your program, and it will tell you exactly the points where it needs to insert coordination, and only coordinate there. So that's the vision for the product. We're not there yet. We have a database that has transactions and causal consistency.
And so we need to get the ACID transactions in there, we need to refine this static analysis technique, and then you'll have this comprehensive story of coordinate only where necessary, and get the benefits of eventual consistency and causal consistency where you can.
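The write-then-index anomaly described above comes down to the happens-before relation: a causally consistent store only exposes a write once everything that causally preceded it is also visible. A hypothetical sketch of that ordering check using vector clocks (an illustration of the concept, not Antidote's actual mechanism):

```python
def happened_before(vc_a, vc_b):
    """True if the event stamped with vector clock vc_a causally precedes
    the event stamped vc_b: every component is <=, at least one strictly <."""
    nodes = set(vc_a) | set(vc_b)
    leq = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    strict = any(vc_a.get(n, 0) < vc_b.get(n, 0) for n in nodes)
    return leq and strict

# A client writes the object (clock {"c": 1}) and then the index entry
# (clock {"c": 2}). Since the object's write happened before the index
# update, a causally consistent store must not expose the index entry
# until the object itself is visible.
write_object = {"c": 1}
write_index = {"c": 2}
```

Under eventual consistency no such check exists, so a replica may apply the index update before the object write, which is exactly the dangling-index anomaly described.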
[00:34:33] Unknown:
And you're working in academia now, so I'm wondering if there are any particular areas of research that you've come across that you're most excited about seeing further development on, or seeing bridge the gap to being used in industry? So
[00:34:48] Unknown:
I think the area I'm in is really great. Not specifically my thesis, but the general area. When I look at systems like Lasp, which is a system about programming with CRDTs, we wanna think not about databases or data stores or whatever, we wanna think about: can I make a programming language that's safe for distribution? Work that I think is in a similar area is this static analysis technique from Antidote, which is asking: can we apply programming language techniques to a distributed systems problem and solve it there? There's a group at Purdue that's doing the same thing. And, historically, one of the groups that did a lot of work in this area was UC Berkeley a few years ago, under Joe Hellerstein.
And those are the people I think that have done it really well in the 2000s. The idea that I think is so important in the area that I'm working in, and that all these folks are working on, is applying programming language techniques to distributed systems to solve the problems there. Because it's not just a database problem anymore. We have these applications that are made up of tens or hundreds of thousands of microservices. How do we know that the application is right? How do we test it? Ideally, we can have some sort of guarantee. Maybe we can do this through types. And what's interesting is that there are so few people working in the field. We need more people working in the field; it's an incredibly important field to work in. Historically, there was a lot of great work done towards the end of the seventies into the early eighties, '81, '82, and then a lot of work in the mid eighties. And then in the nineties it was quiet, at least on the academic side, because industry kind of took over with systems like CORBA and RMI and stuff like that, stuff that hasn't really solved the problems in the way they should be solved. And I think we're seeing a period right now where a lot of people are getting back into this programming languages and distributed systems area. And that area, I think, is really exciting, because it can have dramatic effects on the industry if we do it right. I mean, just imagine where we can do program slicing or something like this, where we can take a program that's written as a monolith and deploy it as microservices.
You know, something like this, and it's type safe and all of that, and we know that it's gonna work fine because it was correct and we had some synthesis mechanism or some code extraction mechanism. These are the types of things that I think would really make a big impact on industry. Because microservices, separating out applications, running different databases, putting data in different databases, duplicating data across databases, building processing pipelines: all of this stuff is about moving data between systems, and how do we think about the consistency guarantees between those systems? How do we think about a program that is running on a server and a mobile device? How do we know that those programs can communicate? How do we know that the APIs are compatible? Things like this. Right? All of these are problems that programming language people like to solve. They like to prove things. They like to know that something is gonna be sound. And the systems approach to it so far, and a lot of the approach we take in industry, is: we just try to test it as much as possible. And that approach, I think, isn't tenable in the long run. We need to be more principled about solving problems. So I think we need joint work between these groups; we need the systems community and the database community and the PL community to come together and focus on solving this practical problem of building distributed applications. Because I would argue that the majority of applications being written today, whether it's a web browser talking to a server, or a mobile app talking to a back end, whether that's AWS Lambda or an actual server running in a data center, or physical, bare metal hardware,
These are all distributed applications.
And so we really need to alleviate these issues for the programmer.
[00:39:06] Unknown:
And given that you're currently working on your PhD and I imagine nearing completion of that program, I'm wondering if you have any thoughts on the projects or industries that you're interested in being involved in once you do reach completion.
[00:39:21] Unknown:
When I do finally finish, I will probably try to go the academic route. I will probably try to find a professorship somewhere in this field, because this is where I'd like to continue working. I'd really like to stay focused on this problem of distributed programming, and I think there's a lot of work that needs to happen there. That being said, I do like the tech transfer aspect of it, so I will stay close to industry. Obviously, I have a lot of connections and friends there, considering I worked in industry for so long. What we'll probably try to do, and what we're in the process of thinking through now, is figuring out how we can industrialize the software artifacts that are coming out of Lasp. So expect to probably see something on that front soon from our group: hopefully some support, some resources, maybe some sort of open source consortium, something like that. We're hoping to industrialize these things. So, ideally, I think where I would be a good fit, when I finish eventually in 2 to 3 years, is a place where I can be involved in research and academia, but at least be close to getting these things into real products used by real people, whether that's open source or something else.
[00:40:45] Unknown:
For anybody who wants to follow the work that you're up to and get in touch with you, I'll have you add your preferred contact information to the show notes. Great. And as a parting question to give listeners something to think about from your perspective, what do you see as the biggest gap in the tooling or technology that's available for data management today?
[00:41:04] Unknown:
It's a good question. From my perspective, I think the biggest gap in this data management problem today is that application developers spend most of their time writing business logic in their applications. And one of the problems that bothers me, one case I like to think about a lot, is a case of the eventual consistency versus strong consistency trade-off. Right? So one thing I think about all the time is: let's imagine that I had some application that had to return a result to a user within 100 milliseconds. Right?
Now I have to make a request to the database to get some data item, and then I wanna return the result to the user. But that user is paying for some SLA. Let's say the user is paying to get a response within 100 milliseconds; this is kind of a soft real time application, and it has these particular limits. What the application developer would have to write is: fire off a request to the database, set some sort of timer, and wait for that timer to fire. And if the timer fired, they'd cancel the request and maybe return a cached copy that's sitting in, like, a local Redis instance on that machine, or is memoized, or something like this. Right? And I think it's crazy that the application developer should have to write that. What I would really like to write is: there's some value, and I would like to refresh it within 100 milliseconds or return the local copy. That is a much more declarative way of writing it. The developer shouldn't have to write this logic to say, try to refresh this value.
I think that should be something like a primitive, because it seems like a trade-off application developers wanna make all of the time: they wanna be able to say, can I wait long enough for this thing? I mean, if we could always get the most recent result within a certain number of milliseconds, we would. Right? The problem is that a lot of the time, we can't. That's why CAP matters. That's why all of this stuff matters. That's why eventual consistency matters. It always comes down to the fact that we can't wait forever. And so I would like to see programming languages, I'd like to see tooling, that supports a notion of time, a notion of bounds, a notion of service level agreements. I wanna see application code that allows me to say: guarantee this thing is as fresh as possible, but don't take longer than 100 milliseconds.
And if application developers could write code like that, it would be so much more declarative. It would be so much less error prone, because it would be provided as a primitive, and then we would be able to do all sorts of things. The runtime could optimize prefetching if it knew that instruction was coming up soon. We could have background processes that try to refresh things at given intervals; it could be some sort of stochastic modeling or something like this. There's all sorts of things we can do once the programming language knows the intent of the programmer. Because right now, if a programmer just says, set this timer and do this thing, the programming language, the runtime, doesn't know what the programmer is actually trying to do. It just knows, oh, there's a timer set, and there's some function that's gonna fire when it goes off. Right? And so the more we can build into the language, the better the analysis tools we can build, and the better the runtime support we can build for actually achieving these things really efficiently. One of the things that we've explored as a research direction, at least, is that in a language where operations are annotated with whether or not they're commutative, you can start reordering things at the runtime. You can start optimizing the execution of the program, because you know certain things can happen out of order. But that's only able to be done once you start putting that into the programming language, into the semantics.
And so, again, I think that's why this field is important. And I think this is an area that's severely lacking and that could have a tremendous impact on programmer productivity.
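The timer-and-cached-copy dance described earlier in this answer can at least be captured once as a reusable helper, which is roughly what a declarative "as fresh as possible within a deadline" construct would desugar to. A hypothetical Python sketch (`fresh_or_cached` and its parameters are invented for illustration, not an existing API):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def fresh_or_cached(fetch, cached_value, timeout_s=0.1):
    """Return fetch()'s result if it arrives within the deadline,
    otherwise fall back to the locally cached copy."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return cached_value  # deadline passed: serve the stale local copy
    finally:
        pool.shutdown(wait=False)  # don't block the caller on the slow fetch
```

Once this intent is expressed as a primitive rather than ad hoc timer logic, the runtime can see the deadline and the fallback, which is what opens the door to the prefetching and background-refresh optimizations argued for above.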
[00:45:16] Unknown:
Alright. Well, I wanna thank you very much for your time. It's been an absolutely fascinating discussion and has given me lots of food for thought, and I'm sure many of my listeners as well. So, thank you for that, and I appreciate your time and I hope you enjoy the rest of your evening and have a happy new year. Thanks. You too.
Introduction to Christopher Meiklejohn and His Research
Understanding CRDTs and Their Applications
Consensus Mechanisms in Distributed Systems
The Lasp Project: Goals and Achievements
Challenges in Working with CRDTs and Consensus Protocols
Tools and Applications for CRDTs in Industry
Antidote DB: Vision and Use Cases
Exciting Research Areas in Distributed Systems
Future Directions and Career Aspirations
Biggest Gaps in Data Management Tooling