Summary
The database market continues to expand, offering systems that are suited to virtually every use case. But what happens if you need something customized to your application? FoundationDB is a distributed key-value store that provides the primitives that you need to build a custom database platform. In this episode Ryan Worl explains how it is architected, how to use it for your applications, and provides examples of system design patterns that can be built on top of it. If you need a foundation for your distributed systems, then FoundationDB is definitely worth a closer look.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
- Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Ryan Worl about FoundationDB, a distributed key/value store that gives you the power of ACID transactions in a NoSQL database
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you explain what FoundationDB is and how you got involved with the project?
- What are some of the unique use cases that FoundationDB enables?
- Can you describe how FoundationDB is architected?
- How is the ACID compliance implemented at the cluster level?
- What are some of the mechanisms built into FoundationDB that contribute to its fault tolerance?
- How are conflicts managed?
- FoundationDB has an interesting feature in the form of Layers that provide different semantics on the underlying storage. Can you describe how that is implemented and some of the interesting layers that are available?
- Is it possible to apply different layers, such as relational and document, to the same underlying objects in storage?
- One of the aspects of FoundationDB that is called out in the documentation and which I have heard about elsewhere is the performance that it provides. Can you describe some of the implementation mechanics of FoundationDB that allow it to provide such high throughput?
- For someone who wants to run FoundationDB can you describe a typical deployment topology?
- What are the scaling factors for the underlying storage and for the Layers that are operating on the cluster?
- Once you have a cluster deployed, what are some of the edge cases that users should watch out for?
- How are version upgrades managed in a cluster?
- What are some of the ways that FoundationDB impacts the way that an application developer or data engineer would architect their software as compared to working with something like Postgres or MongoDB?
- What are some of the more interesting/unusual/unexpected ways that you have seen FoundationDB used?
- When is FoundationDB the wrong choice?
- What is in store for the future of FoundationDB?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- FoundationDB
- Jepsen
- Andy Pavlo
- Archive.org – The Internet Archive
- FoundationDB Summit
- Flow Language
- C++
- Actor Model
- Erlang
- Zookeeper
- Paxos consensus algorithm
- Multi-Version Concurrency Control (MVCC) AKA Optimistic Locking
- ACID
- CAP Theorem
- Redis
- Record Layer
- CloudKit
- Document Layer
- Segment
- NVMe
- SnowflakeDB
- FlatBuffers
- Protocol Buffers
- Ryan Worl FoundationDB Summit Presentation
- Google F1
- Google Spanner
- WaveFront
- etcd
- B+ Tree
- Michael Stonebraker
- Three Vs
- Confluent
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute.
Alluxio is an open source distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio, that's A-L-L-U-X-I-O, today to learn more and to thank them for their support.
And understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features, Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers' time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1,000,000 in free software from marketing and analytics companies like AWS, Google, and Intercom.
On top of that, you'll get access to the Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And please help other people find the show by leaving a review on iTunes and telling your friends and coworkers. Your host is Tobias Macey. And today, I'm interviewing Ryan Worl about FoundationDB,
[00:03:06] Unknown:
a distributed key-value store that gives you the power of ACID transactions in a NoSQL database. So, Ryan, could you start by introducing yourself? Hi. My name is Ryan Worl, and I'm a software engineer. I've been working with FoundationDB since it was open sourced, so for about the past year, and with databases in general and software engineering for about six years now. And do you remember how you first got involved in the area of data management? Yeah. So I've been kind of a generalist contractor for basically the entire time I've been in software, and I've seen a lot of different data management solutions at a variety of different scales.
And where that becomes important is that I've actually seen how things go wrong, like how data gets lost and how data gets corrupted in production. I've seen things like the complexities of multi-tenant SaaS applications and how you balance the workloads of large customers against the workloads of small customers, making sure there's capacity for the small customers without starving the large ones and vice versa. So I've seen a little bit of everything, and that's made me very skeptical of most of the data management solutions out there. As for what made me interested in distributed systems and databases — kind of an introduction to why I was interested in FoundationDB at that time — I'd been following Kyle Kingsbury's Jepsen series and Andy Pavlo's relational database management lectures from Carnegie Mellon, which are available on YouTube. Those are the formal things I'd learned about distributed systems that made me interested in FoundationDB.
[00:04:40] Unknown:
And so you mentioned that you've been involved in the community since around the time that it was open sourced. And I know that some of its history is that it started as its own startup that then got bought by Apple. And, as you said, about a year ago it got open sourced again. So I'm wondering how you first got involved with that community,
[00:04:56] Unknown:
and if you can also just explain a bit about what FoundationDB is. Yeah. So the way that I got involved is that when it was open sourced — I knew of FoundationDB, I knew it was really cool, but I never got a chance to use it. I was in college at the time, and the programming work I was doing and the clients I was working with just made it not relevant. So I never got to use it. But when it was open sourced, I got really excited. I jumped on the forums the first day, read everything I could, watched everything, and dug into the archive.org copies of the documentation — the stuff that they didn't put back into the open source documentation.
I just read everything and asked a bunch of questions and things like that. My first contribution to the project was a documentation change a while back, and then I've added one feature to the Go bindings so far. I also spoke at the FoundationDB Summit in December of last year. To give you a description of what FoundationDB is: it's a distributed key-value store with ACID transactions. The way that you use it is as a big tree of byte-string keys mapping to byte-string values. You can read any key and write any key within a transaction, and you don't have to worry about where the data is stored. It looks like a single-node database, and you interact with it like that, but it's really distributed. And so can you talk through a bit about how it's actually architected
[00:06:19] Unknown:
and some of the deployment topologies that are typical for it? Yeah. So
[00:06:24] Unknown:
I think that the best way to understand FoundationDB is to start at the most basic level, which in a strange way is the programming language it's written in. It is written in a language called Flow, which is about 95% C++ with some extra syntax on top that provides actor-model concurrency. Some people may be familiar with the actor model from Erlang. The reason that's relevant is that when you write code in Flow, it gets compiled down to really ugly C++ that uses callbacks for everything, and you'd never want to read it. But when it is compiled down to C++ using callbacks, it can be made entirely deterministic, and everything runs in a single thread in a single process. And inside that thread are the virtual processes, which are just actors, and you can start as many of them as you need. It's all a deterministic system, which ties into how they test it, which we can talk about later. The actor model provides kind of an indirection layer between the physical operating system processes that you deploy onto servers and the roles that a FoundationDB process can perform in the cluster. The smallest deployment of FoundationDB is literally a single operating system process on one machine, and it simulates everything about an entire FoundationDB cluster — all of the components — inside that single process. It's not like some other databases where, if you deploy it on a single node, it's really a different thing.
All of FoundationDB is happening in that one process, and you can just add more processes. So that's the level of the programming language it's written in, which is essentially C++. The next fundamental thing to know about FoundationDB is that every transaction and every operation is totally ordered, and every transaction appears to happen at a single point in time. This point in time is called the version number of the database. It's just an increasing number, represented by a 64-bit integer, and it increases over time at approximately 1,000,000 per second. The way this is exposed to you during a transaction is that when you start a transaction, you're given a read version, which is the highest version that has ever been committed into the database.
So you know that when you start a transaction, all previous transactions are visible to you. And when you go to commit your transaction, all of your operations happen at that single commit version where your transaction is committed. When you combine those two things together, along with the conflict detection that we'll talk about later, you get strict serializability — which is basically serializability, which people may be familiar with from relational databases, plus linearizability, which basically means things happen in a total order. So when you combine those together, you get strict serializability.
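To make that read-version/commit-version model concrete, here is a minimal sketch using the FoundationDB Python bindings. The counter key and the API version are arbitrary choices for illustration; it follows the read-modify-write pattern from the official tutorials rather than anything specific to this conversation.

```python
import fdb

fdb.api_version(610)  # any API version your cluster supports
db = fdb.open()       # uses the default cluster file

# The @fdb.transactional decorator starts the transaction (fetching a read
# version), retries the whole function automatically if the cluster reports a
# conflict, and commits at the end, so everything below appears to happen at a
# single commit version.
@fdb.transactional
def bump_counter(tr, key=b'stats/page_views'):
    current = tr[key]                                   # read at the transaction's read version
    count = fdb.tuple.unpack(current)[0] if current.present() else 0
    tr[key] = fdb.tuple.pack((count + 1,))              # buffered write, applied at commit
    return count + 1

print(bump_counter(db))
```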
When you get above the level of just one FoundationDB process running on a single machine, the typical deployment topology is to start with at least three machines. And if you're running in the cloud, you run them in three different availability zones. The most important initial process in FoundationDB — the initial role, I should say — is called the coordinator. This is kind of like ZooKeeper: it uses Paxos to manage membership in a quorum to do writes. The coordinators are really only used to elect a few different leaders in the cluster to do various things, and they're important when the database is recovering after a failure in the transaction subsystem. But the coordinators are not really involved in anything else. They're basically just there to be the leadership election system for some of the singleton processes in the cluster. So you don't have to think about them too much, but they're important, and you have to maintain a quorum of them. That's mostly automatic; you don't have to do much to think about it. It's not like, for example, deploying Kafka with ZooKeeper on the side — it's not a separate thing, it's just built in. The first process that gets selected by the coordinators is called the cluster controller. When any new client or server process starts up in the system, it asks the coordinators who the cluster controller is. After the new process knows who the cluster controller is, it registers itself with the cluster controller and provides certain administrative information about how it's configured — like, I prefer to be a storage process, and I have a disk with this much space, those types of things. Once the processes have all registered themselves with the cluster controller, the cluster controller will assign different roles to them — I'll talk in a little bit about all the different roles that can be assigned in the cluster. The cluster controller assigns them, and it also continuously heartbeats with all of the processes in the cluster to detect failures. The next role after the cluster controller is the master, which is primarily responsible for the data distribution algorithm that migrates the actual data stored on disk — the shards — between the different storage processes.
The actual mapping of keys to where the data is stored is stored in the database itself, which is kind of an interesting design. The way the master works is that it just commits transactions into the database to move data around. It's a two-phase process: the master says, I would like to move some data to a different storage process, and then it gets moved — the storage processes realize that they need to move data around. And when the data is moved, it commits another transaction saying, this is now the new location of the data. Another very important part of the master role is that this is actually where the version number of the database lives, and it's responsible for handing out those version numbers. If you put this on a box diagram, there's a lot of boxes.
So it's a pretty complicated architecture, and there are a few more components, but those are the most important ones, I would say, in terms of understanding where the basics of FoundationDB come from: the version number and the failure detection. After that, you get into the transaction subsystem. The first bit I want to talk about there is called the proxy. The proxy is basically what it sounds like: it sits in between clients and the rest of the FoundationDB system. The function the proxies perform is coordinating a bunch of what's going on, because the transaction pipeline is kind of complicated. But the part that is relevant for clients is that they cache the mapping of key ranges to storage servers in memory.
So if a client doesn't know where to go to get data, it can ask the proxy. The proxy doesn't actually go fetch the data for it; the client just asks for the mapping. And when clients are ready to commit a transaction, they send their transaction to the proxy, which batches together a bunch of transactions from a bunch of different clients and goes to the master to get a version number for that batch. All of the transactions receive their commit versions in a batch, which is why the master will never be a scalability bottleneck for FoundationDB: requests to get version numbers are all batched from the proxies. They batch all their transactions together, and then, all at once, they go and get a commit version. After you get that commit version, you have to go through conflict detection, which is multi-version concurrency control — that's the general class of algorithm that it is. This is handled by the resolver role in the cluster. The resolvers store a history of all the previously committed transactions and what key ranges were read and written by those transactions for a period of five seconds. And that's where the five-second transaction limit comes from — the five-second limit for read-write transactions, I should say. That comes from the resolvers.
And if you configure multiple resolvers in your cluster, which is common if you have a big cluster, they get sharded by key range. So if you imagine the key space from A to Z — which is not actually what it is, but imagine it from A to Z — you break the alphabet up into chunks, and those chunks are assigned to different resolvers. When the proxy submits transactions to the resolvers, they're broken up along those shard boundaries, and each resolver independently says whether or not transactions are in conflict with each other. What it means to be in conflict is that the resolver detects that a different transaction than the one you're committing wrote to a key that you read, between your read version and your commit version. That's basically how conflicts are detected.
And if they do detect a conflict, your transaction has failed, and you'll receive an error message saying that you need to retry the transaction, which is pretty common in MVCC systems — I'm sure people are used to that. If you do pass conflict detection, the proxy then forwards your data to the transaction logging processes. These are also sharded by key range, and their job is just to append data to the end of the log so that it's safe during failures. Your transaction must commit to all of the relevant transaction logs to be committed; it's not a quorum thing. The transaction logs are assigned to groups of storage servers, which actually manage the data after it's written. The storage server is the largest role by number in the cluster — your cluster should be roughly 90% storage processes — and their job is just to store the data. It's kind of what it sounds like: when clients ask for data, they give it back to them. The way it currently works with the disk-based storage engine is that they hold a history of five seconds of writes in memory, and then they write it out to disk after those five seconds. That's just to allow the MVCC — you have a window of five seconds to do your reads — and then there's no versioning on disk after that, just one version all together. And for the data that's in memory, you have to remember it's actually durably synced to disk on the transaction logs before the transaction is acknowledged.
So it's not as if data going away while it's in memory on the storage server is a problem — it's not a problem, it's safe on disk on the transaction logs. So, yeah, I know this all sounds like a lot. There are a lot of boxes if you put it on a box diagram, but this entire process from start to finish takes around four milliseconds to do a transaction. If you just start a transaction, write a key, and then commit, it takes about four or five milliseconds. Reads take some time as well; that's basically just how long it takes to do a single hop from your client process to the storage server that has the data, which is a network latency thing — probably on the order of one to two milliseconds in Amazon, depending on which availability zones are talking to each other. Another cool thing is that when you do writes within a transaction, they're queued up until you actually commit.
So there's no latency when you write data; all of the latency is saved until you commit. So if you're careful with the way that you write your transactions, you can make transactions that perform very well in FoundationDB, and you have a lot of freedom to order them, because the client library is all asynchronous. You can choose which parts need to block waiting for reads and which parts can proceed in parallel. That's pretty much all up to you to decide, so you can make really efficient transactions.
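As a small illustration of that asynchronous style, here is a sketch in the Python bindings, where reads are issued up front and only block when their values are actually used, and writes are buffered until commit. The keys are hypothetical and assumed to exist; this is just the parallel-reads pattern from the bindings' documentation, not anything specific to this episode.

```python
import fdb

fdb.api_version(610)
db = fdb.open()

@fdb.transactional
def combine(tr, out=b'report/combined', a=b'report/part_a', b=b'report/part_b'):
    fa = tr[a]          # both reads are issued immediately...
    fb = tr[b]          # ...so the storage servers can serve them in parallel
    tr[out] = fa + fb   # using the values blocks only here; the write is buffered until commit

combine(db)
```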
[00:17:26] Unknown:
And as you were going through that architecture, there were a couple of things that came up as questions that you ended up answering in the process of continuing through the layers. One of them was the idea of what happens after the master process dies — how you manage to find where the data is allocated. But from what you were saying about the proxy layer, that sort of handles that by caching those key range mappings for the discovery piece for clients reading from it. So that was one of the questions that came up. And then you also talked about how data is reallocated via this two-phase process, where it's just another transaction that's initiated from the database layer to move the data in the event that you scale up or down and have to reallocate where it's stored. So I don't know if there's anything that I've missed in that summary on those two points. Yeah. So the way that failure handling
[00:18:29] Unknown:
works, with respect to the master and all of the other processes in the transaction subsystem — so basically everything that's not either a coordinator or a storage server — is actually different than most systems. You cannot proceed during any failures of the transaction subsystem. When any process in the transaction subsystem fails, the cluster controller will attempt to recruit all new processes to take on the roles that existed in the old set of processes. So if the master dies, for example, you will not be able to commit any transactions until a recovery happens in the database, which usually takes a few seconds. In your example, if there were a failure of the master and a client asked the proxy where a range of keys was so it could go read them, it could go read them, and that would be fine; it would get the data back. But when it went to commit its transaction, the master would be unable to assign it a commit version, because it's dead. And then at some point the cluster controller would notice that the master died, and it would elect a new one. That's how the failure handling works. And another question
[00:19:39] Unknown:
is you mentioned that the way the transaction IDs are allocated is from this 64-bit counter, and I'm wondering what would happen in the event of a rollover.
[00:19:49] Unknown:
That's not gonna happen. It's a very big number. In Postgres, for example, it's a 32-bit number, so that is actually a problem. But it's not really
[00:20:00] Unknown:
it's not really a problem in the foreseeable future of FoundationDB. I don't actually know the answer to that question, but a 64-bit number is very large, so I can't imagine that would be a problem on the horizon for anyone. Right. It's just a matter of what that horizon is, because you may have a system that's using FoundationDB. And I know that, for instance, we've got this 2038 problem with Unix timestamps: because they started in 1970, in the year 2038 that Unix timestamp counter will roll over. So I know that at some point it will happen. But as you said, for the foreseeable future — near- to mid-term time horizons — it's not something that's going to come up, but it's worth considering.
[00:20:37] Unknown:
Yeah. So as I said, it advances at about 1,000,000 per second. So you can do the math — you out there as listeners can do the math and figure out how long it'll be.
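For listeners who would rather not do the math, a quick back-of-the-envelope check, assuming the roughly 1,000,000 versions per second mentioned earlier and a signed 64-bit counter:

```python
# Rough estimate only: ~1,000,000 versions/second against a signed 64-bit counter.
seconds = 2**63 / 1_000_000            # ~9.2e12 seconds until rollover
years = seconds / (60 * 60 * 24 * 365)
print(f"~{years:,.0f} years")          # roughly 292,000 years
```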
[00:20:47] Unknown:
And then another question that I had in there: you were saying that, in terms of the allocation of storage processes in relation to the other components that are necessary to run a full cluster, it should be on the order of 90%. So I'm just curious how that allocation happens — whether it's just at the subprocess level on a per-machine, per-core basis, or whether you have certain machines that are dedicated to those storage processes, and then the other coordinator and master and proxy nodes are located on separate instances within the overall network environment?
[00:21:19] Unknown:
That is up to you. You can choose to deploy it in whatever fashion makes sense for you. I have not heard of anyone in particular making completely dedicated storage servers, although I can see why one might choose to do that, just because if that machine fails, you don't have to go through recovery. But, yes, in general, the way you deploy processes is with one process per CPU core on the machine. You would also want about 4 gigabytes of RAM per process, maybe even up to 8. And when the roles are assigned by the cluster controller, it can make sure that there's enough of everything to go around, but some processes may be handling more than one duty. So, for example, in a very large cluster, heartbeating becomes a bottleneck on occasion, because the cluster controller is busy doing a lot of heartbeating. So you may want to split the cluster controller out into its own process, as you said, and that's something you can do. When you're configuring a cluster, you have complete freedom to split up roles as you see fit and determine the number of them that run on your cluster, if you choose to do that. You don't have to make these fine-tuning little decisions, though.
[00:22:35] Unknown:
A lot of the recommendations in the documentation will work just fine. And we'll get into the actual deployment piece a bit later, but just from the perspective of configuration and operability of the system, I'm wondering what format that takes — whether, for instance, it's like some systems that require you to configure them by actually writing records into the storage layer itself, or others where you can just place a text file on disk. So I'm wondering where FoundationDB falls on those axes. The way that you configure
[00:23:02] Unknown:
the processes in FoundationDB is that there is a configuration file that specifies all of these details that I just discussed. Basically, you set up the IPs and ports, you say what kind of process it is — all those types of things happen in the configuration file. Some of the configuration of the database is managed inside the database itself, in a portion called the system key space. I don't think that's particularly relevant for this part of the configuration, but there are other things that you do configure by writing keys into the database, and the database uses itself very frequently for storing data about administration. But for what you're specifically asking about, it's mostly the configuration file.
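For a sense of what that file looks like, here is a rough sketch of a foundationdb.conf in the spirit of what Ryan describes. The paths, ports, and keys are simplified for illustration, so treat it as a sketch rather than a reference:

```ini
## Illustrative sketch of a foundationdb.conf, not a complete reference.
## One [fdbserver.<port>] section per process; roughly one process per core.
[general]
cluster_file = /etc/foundationdb/fdb.cluster

[fdbserver]
command = /usr/sbin/fdbserver
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb

## Most processes hold data...
[fdbserver.4500]
class = storage

[fdbserver.4501]
class = storage

## ...while a few are pinned to other duties, such as transaction logs or
## stateless roles like the proxies and resolvers.
[fdbserver.4502]
class = transaction

[fdbserver.4503]
class = stateless
```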
[00:23:39] Unknown:
Another thing that we didn't talk about yet is how the ACID compliance is managed in this distributed system. So I'm wondering if you can just talk through that and how it's
[00:23:51] Unknown:
designed and implemented at the cluster level. Yep. So let's just take it one letter at a time. A stands for atomic, which basically means everything in a transaction either happens or it doesn't — there are no partially committed transactions. You batch up all of your writes together and commit them all at once, and they're assigned the commit version all at once. That's how atomicity works. Because your operations only happen at a single point in time, it's kind of naturally atomic from a logical perspective; there are obviously details that go a lot deeper to make it all work correctly, but that's more failure handling stuff, which we just covered somewhat. C stands for consistency, which very importantly is not the C in the CAP theorem. Consistency in ACID means you're enforcing specific rules about your data — foreign key constraints in a SQL database are an example. That's kind of an application-level concern from the perspective of FoundationDB, but it gives you all of the tools that you need to implement foreign keys, or a foreign-key-like feature, in your application if you need it. That's mostly up to you. I stands for isolation, and this is referring to transaction isolation and concurrency control. FoundationDB provides the strongest level of isolation available, called strict serializability, or external consistency. I covered this a bit when we were talking about the resolver role, but basically the resolvers make sure that no operations overlap in a way that could violate strict serializability.
When you combine this with making all operations happen at a single point in time that's basically correlated with wall-clock time, that's how FoundationDB provides external consistency. D stands for durability. There is some contention in the database community about what durability means specifically, but I think by anyone's definition FoundationDB provides it: data is guaranteed to be synced to disk on multiple machines before your transaction is acknowledged to the client.
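As a small aside on the consistency point — enforcing your own rules on top of the key-value API — here is a minimal sketch of a foreign-key-style check done inside a single transaction. The 'users' and 'orders' key spaces are hypothetical, and the tuple encoding used here is discussed later in the conversation:

```python
import fdb

fdb.api_version(610)
db = fdb.open()

# Because the whole function runs as one strictly serializable transaction,
# the existence check and the insert cannot be interleaved with a concurrent
# delete of the user: that case would show up as a conflict, the function
# would be retried, and the check would then fail cleanly.
@fdb.transactional
def create_order(tr, user_id, order_id, payload):
    if not tr[fdb.tuple.pack(('users', user_id))].present():
        raise ValueError(f"user {user_id} does not exist")
    tr[fdb.tuple.pack(('orders', user_id, order_id))] = payload

# create_order(db, 42, 1, b'{"item": "book"}')
```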
[00:25:53] Unknown:
And so, given all of these different features of FoundationDB and the levels of performance and reliability that are designed into it, I'm wondering if you can talk through some of the use cases that it's uniquely well suited for. That's a really good question.
[00:26:08] Unknown:
Uniquely well suited for — I think it would be hard to say it's uniquely well suited for anything, because there are other systems that provide similar guarantees in similar ways. A lot of people are used to using a SQL database, and there are features of FoundationDB that are really nice, like the automatic fault tolerance and such, but SQL databases work really well too. Where I think you should consider FoundationDB for your application is if you have people that are not distributed systems experts and you need to build fault-tolerant and scalable data solutions that fit an exact use case — if you have a very specific use case that you understand, and you can spend more time making it really good in FoundationDB rather than reworking it to fit the model of another database. There are basically no constraints in FoundationDB on the structure of the keys and the values, so you can put whatever makes sense for you in there. And the programming model basically feels like you're writing single-threaded code in one process, and the details of all the crazy failure modes of distributed systems are mostly hidden away from you, in a way that you can trust. And on top of all of these layers of reliability
[00:27:22] Unknown:
and distributed systems management that are baked into FoundationDB, there's also the concept of layers that they have released to provide different semantics on the way that you interact with the underlying storage — things like a relational interface or a document-oriented interface that gives you the same feel of working with a regular SQL database or something like MongoDB, where, at the end of the day, it's actually all being stored in this key-value space. So I'm wondering if you can discuss a bit about how the layers are implemented.
[00:27:58] Unknown:
There is nothing in FoundationDB that is strictly about layers — layers are more of a concept than a specific implementation. The way the concept works is that, basically, it's different pieces of software coordinating on the format of their keys and their values. When you have them coordinate on the format of their keys and their values, you can interoperate between different pieces of software. The most basic layer, which almost all applications written on top of FoundationDB and even higher-level layers use, is called the tuple layer. The tuple layer specifies an encoding for keys that allows you to encode what's called a tuple, or an array, depending on the language bindings that you're using. You can put different scalar values, like strings and numbers, into the tuple, and it writes them out into a binary string that sorts correctly.
It sorts how you would intuitively understand it to sort. And when you read data back out of the database in another process using the tuple layer, you can do range queries that come back in the order you would expect. So, for example, you could make a tuple whose first component is the string "users" and whose second component is a user ID, and that could contain a record for a user. A completely different process, maybe in a different programming language, could come along and do a range read for tuples that begin with the string "users", and that would sort correctly — it would read all of them back out. To contrast this in a way that I think most people will understand: if you're using something like Redis, the key format is completely ad hoc, and in order for one program to read data that another program wrote into Redis, you need to very strictly coordinate on the format of the keys in a way that's not general for every program to use. A very common thing in Redis is to use colons between the parts of the keys as a separator, but that doesn't necessarily sort correctly, because a lot of people just use ASCII for the numbers they put into their keys in Redis — not many people are writing binary numbers into their keys — so they don't sort correctly. The tuple layer writes the keys out in a big-endian format that sorts correctly. So that's kind of the basic understanding of what layers are.
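To make the tuple layer concrete, here is a small sketch with the Python bindings, using a hypothetical 'users' subspace along the lines of Ryan's example:

```python
import fdb

fdb.api_version(610)
db = fdb.open()
users = fdb.Subspace(('users',))   # every key is prefixed with the packed tuple ('users',)

@fdb.transactional
def add_user(tr, user_id, name):
    # ('users', 7) packs to a byte string that sorts before ('users', 42),
    # so range reads come back ordered by user ID.
    tr[users.pack((user_id,))] = name.encode()

@fdb.transactional
def list_users(tr):
    result = []
    for kv in tr[users.range()]:          # every key under the ('users',) prefix
        result.append((users.unpack(kv.key)[0], kv.value.decode()))
    return result

add_user(db, 42, 'ada')
add_user(db, 7, 'grace')
print(list_users(db))   # [(7, 'grace'), (42, 'ada')]
```

Any other process, in any language with FoundationDB bindings, can read those same keys back as tuples, which is the interoperability point being made here.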
The most interesting layer, in my opinion, is the record layer, which is an open source project from Apple. It provides something kind of analogous to, as you were saying, a relational database, and it's used at Apple as a part of CloudKit. It is, in my opinion, the highest quality and most developed layer that is available as open source today. There is also the document layer, which is an implementation of the MongoDB API — MongoDB version 3.6 is what I think they're saying it's compatible with right now. Those two layers, I think, provide a really good example for people who want to understand what layers are. It's basically just people coordinating on the format of the keys and values so that they can interoperate between different pieces of software. And so, in principle, it should be possible to apply different layers, such as the relational and the document, to the same underlying data, though I imagine
[00:31:22] Unknown:
what the interaction patterns are with the overall data could potentially throw some wrenches in the works. Yes, you're exactly right. It is 100%
[00:31:32] Unknown:
possible in theory, but there are certain optimizations that a layer may want to provide that would make it more difficult to interoperate with another layer, potentially for efficiency reasons. One that you can imagine, specifically for a relational layer, is choosing to store the data in a fixed-width format. For example, if you had a table with just two columns that were each an integer — say they're 4-byte integers — a record on the relational side could just be 8 bytes long, a spot for each of the two 4-byte values, and that's all it stores. For a document layer, that would be much more difficult to work with, but definitely not impossible if they coordinated on the schema — the schema in the relational database sense. If they coordinated on the schema, they could potentially interoperate, but not necessarily automatically. It would take some work.
And it's really up to you as the developer of the layer what you want to choose to allow. For example, you could just store JSON in the value, and then pretty much anybody can do anything they want with it. But you may want to store something that's more optimal for your use case, and then it would be a bit harder. It's up to you. And one of the aspects of FoundationDB that's called out in the documentation,
[00:32:44] Unknown:
and also that I've heard other people refer to, most notably in my interview with the CTO of Segment, is the performance that FoundationDB is capable of. So I'm curious if you can talk through what some of the optimizations are that exist in FoundationDB to provide these levels of performance, particularly in a distributed system, and some of the tuning parameters that are available for customizing where that performance takes place — sort of what kind of performance you're optimizing for, whether it's read or write, things like that. Yeah. And it's very interesting
[00:33:20] Unknown:
that you call out Segment, because if you go look around on the forums right now, you'll see Segment people there, and I hope they're going to be very happy with it. I've talked with some of their developers — not about their specific system, but just about FoundationDB in general. As for the way the performance works, I think it's very easy to get caught up in the performance of individual operations like reading and writing. But what's more important is that FoundationDB provides really smart workload management that will allow you to run the system at a higher utilization than you might otherwise be comfortable with in other systems. The way this works is that there's internal monitoring to detect backups in any of the queues in any of the different processes. If the internal monitoring detects backups in any of those queues, it will slow down the rate at which it hands out read versions to new transactions, which naturally rate-limits how much work you can make the cluster do. And when the client is being throttled, it doesn't really know what's going on; it just takes a while for it to get that read version. So if you're really hammering it, it might take you a second to get that read version. But what doesn't happen is slow operations once you're inside your transaction.
It's not like when the cluster is under load, your individual operations start to get slower. Because all of the latency is buffered up into getting the read version, once you have a read version and you're doing your transaction, the individual reads and writes are pretty low latency — it's pretty flat. So that makes it possible to run the system at maybe a higher utilization than you may be comfortable with in other systems, because if you get a big burst of work, it's not going to take down your cluster. The latency will just get buffered up, and some transactions may time out and have to retry later, but it's not going to take your cluster down. And also, in terms of conflicts
[00:35:20] Unknown:
with something that happened in between your retrieval of the read version and when you actually create the commit, you mentioned that it will automatically error out. But I'm wondering if it's possible to relax that constraint a little bit, so that you can potentially write conflicting data and then resolve it as maybe a batch process after the fact, in order to improve the overall throughput of your write volume, if you're willing to accept some measure
[00:35:42] Unknown:
of consistency issues. I think what you're getting at, and what would be more useful for people, is that conflicts are more about reading — they're more common with reads, like a write conflicting with a read. Write-write conflicts are probably less common. And the way that you can get around that is to use what are called snapshot reads, which do not add conflict information to your transaction. So you could read a key at the snapshot isolation level, which basically just means you read at your read version and, if there would have been a conflict, it's not detected. That is a knob that you should tune very carefully, because it's hard to understand all of the implications of using snapshot isolation. But for certain types of workloads — analytics, for example — snapshot isolation may be perfectly acceptable, and you may choose to use it. Where you need to be careful is when you use a snapshot read to then base a decision on whether or not to write further data in the transaction. But, yes, there are ways to relax the isolation a little bit. Another way is that you can reuse read versions across transactions.
You may want to be careful with that as well, because you're kind of invalidating the workload management features I was just talking about. But if you don't want to wait to get a read version and you're okay with slightly stale data — as in not causally consistent, I think that's the best way to describe it — you can reuse a read version from an old transaction. And that's another way that you can relax the guarantees in order to get lower latency out of FoundationDB.
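To make the snapshot-read idea concrete, here is a small sketch with the Python bindings. The key names are hypothetical, and the same caveat Ryan gives about basing writes on snapshot reads applies:

```python
import fdb

fdb.api_version(610)
db = fdb.open()

@fdb.transactional
def total_balance(tr):
    # tr.snapshot reads at the transaction's read version but does NOT add
    # read-conflict ranges, so a concurrent write to these keys will not force
    # this transaction to retry. Fine for approximate analytics; risky if you
    # then write data that depends on what you saw.
    r = fdb.tuple.range(('acct',))
    total = 0
    for kv in tr.snapshot.get_range(r.start, r.stop):
        total += int(kv.value)
    return total

print(total_balance(db))
```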
[00:37:22] Unknown:
And another thing that you mentioned earlier is the fact that FoundationDB is very well tested. On the landing page, it mentions that it uses a deterministic simulation engine for executing those tests. So I'm wondering if you can just talk through a bit of the overall strategy that goes into testing such a complex and distributed system, and some of the challenges that come about in those efforts. Yeah. So my personal interaction with the testing system
[00:37:49] Unknown:
comes from using the binding tester, which is something off to the side — it's not part of testing the main database, but it's how the bindings are tested. You're given a state machine where you implement the transitions, and then you do operations on the database with your bindings, and the results can be validated against another set of bindings. You're typically validating against the Python bindings, because those are kind of the reference implementation. That makes sure that the bindings are all very high quality, because they can be ensured to be compatible with the reference implementation. The way that the database itself is tested, yes, is using a deterministic simulation system. The best way I can describe it is that because FoundationDB can run all of the processes in a cluster in a single process, you can even start multiple nodes, on a conceptual level, within that single operating system process. And because it's all running deterministically in a single thread, you can introduce faults into the application.
For example, when you attempt to send a packet across this virtual network — when you're in one process it's a virtual network, but it may be a physical network when it's deployed across multiple processes; the interface in FoundationDB is the same — you could say that 10% of the time the packet just doesn't get delivered, and we get no feedback about it, no error messages. It just doesn't work. And the way that manifests is that you write a bunch of test workloads that try to verify different properties of the system — say, that it's ACID. It's kind of hard to write a generalized workload to say that something is ACID, but one example is called a ring benchmark; people can Google that. It's not particularly interesting, but basically it's a benchmark where you maintain a bunch of pointers between a bunch of different objects and check that all the pointers stay valid — that's a simple way to describe it. And you can run that workload against a database that is constantly throwing crazy failures all over the place — disks are failing, the network's failing — and you verify that the workload is still correct. Maybe you're not making a whole lot of progress if 10% of your packets have a random error and you don't get any feedback about it — maybe you're not moving very fast, you're not committing a lot of transactions per second — but you're still correct. And the way you debug this is that, because it's an entirely deterministic simulation, when you detect an error you can see exactly what happened, every operation in the exact same order, by using the same seed value — the random number generator seed that I'm sure people have heard of before. You just take the same seed value, run it over again, and then you can go into a debugger. You can watch your distributed system in the debugger, which is pretty much the dream for people developing distributed systems: they wish they had a debugger that let them step through the distributed system. And that's what FoundationDB does, and that's why people can trust it — because it's just pounded on with this testing system. They're running these tests every night on the single-operating-system-process simulations, and — I don't know if they have this anymore, but back in the day when it was a commercial system — they had a physical cluster of machines with network switches they could toggle on and off over the network, and they would just toggle them on and off and test that the same workloads were still correct. So that's why I think people should look at FoundationDB if they're interested in extremely high reliability: because they can speed up time in their testing, the number of hours of simulated testing is way more than you would get from a bunch of customers running your product in production. And the good thing is that when they find bugs, the bugs are in the simulator, not out with customers in production.
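The FoundationDB simulator itself is written in Flow/C++, but the core trick — drawing every fault from one seeded random number generator so a failing run can be replayed exactly — can be shown with a toy sketch. This is purely illustrative and has nothing to do with FoundationDB's actual test workloads:

```python
import random

def run_simulation(seed, steps=1000, drop_rate=0.10):
    """Toy illustration only: a seeded RNG decides which 'packets' get dropped,
    so the exact same sequence of faults is reproduced for the same seed."""
    rng = random.Random(seed)
    delivered, dropped = 0, 0
    for _ in range(steps):
        if rng.random() < drop_rate:
            dropped += 1          # fault injected deterministically
        else:
            delivered += 1
    return delivered, dropped

# Re-running with the same seed replays the identical fault schedule, which is
# what lets you attach a debugger and step through a failure after the fact.
assert run_simulation(seed=42) == run_simulation(seed=42)
```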
[00:41:49] Unknown:
And so, now talking about how you would actually go about getting this deployed and managed: I know that you were saying the configuration of the different processes is managed in this flat file, and you can deploy it across multiple nodes in a network environment. But I'm just wondering if you can discuss what some of the typical issues are that come up, or some of the approaches for getting the system deployed and clustered, and just any of the ongoing maintenance tasks and edge cases that people should be aware of?
[00:42:12] Unknown:
Yeah. So the typical deployment would start out with either three or five machines, depending on how many availability zones you're going to deploy in. You should prefer a smaller number of larger machines with many cores over a bunch of small machines — that's pretty much a universal database thing, as long as your database scales well across multiple cores, which FoundationDB does. The current storage engine works best when you deploy it with multiple storage processes per physical disk. The reason is that there's only ever one outstanding write IOP from each storage process at any given time. So if you want to get the peak write throughput out of your disk, especially if it's a physical NVMe disk, you would want to deploy multiple storage processes per disk. Modern disks basically require a queue depth of somewhere between 8 and 16 4 KB writes to reach their peak throughput. For reads, the storage process will issue up to 64 of those concurrently, so there's much less of an issue for reads. So if you have a read-heavy workload, you can back that off a little bit and don't need to deploy multiple processes per disk, but if you have a write-heavy workload, you might want to. After you've outgrown those three machines, it's really not that hard to add more. There are some specific things about this that I am admittedly not an expert in, but it will make sure to schedule the processes appropriately for fault tolerance — you don't really have to do much for that; you can just add more machines into the cluster. And as I said before, you can choose to specialize certain processes to different roles for efficiency, and you can use the included tool called fdbcli to get monitoring information out of the database — it has a lot of monitoring information. Specifically, you can look at the memory, CPU, and disk usage for every process in the cluster, along with which roles each process is performing at the present moment. So you might look at one process and see that, for some strange reason, it's both the cluster controller and a storage server at the same time, and decide, hey, I want to move the cluster controller out into its own process because we're using too much CPU right now. The included tools are very helpful for this. There are other modes of operating that may make sense for different people. One thing to consider is disaster recovery. There are two methods to do this with FoundationDB: there's an older tool included with FoundationDB called fdbdr, and there's a new multi-region configuration that effectively provides disaster recovery out of the box. There are a lot of specifics and nuances detailed in the documentation that depend on what your needs are, but it's not that complicated if you just read the manual about it, and you'll find what works best for your company, because there may be different requirements for different people. There are a lot of edge cases that you should think about, and that's the case when you're deploying any distributed system. And for the case where you're doing a version upgrade of the
[00:45:17] Unknown:
clusters, is it something where you can just do a rolling deployment where you upgrade one node at a time? That depends. If you're doing a patch version
[00:45:25] Unknown:
in the semantic versioning sense, yes, you can upgrade the processes one at a time — patch versions are compatible with each other. The major and minor versions are not compatible with one another, which means you incur a small window of downtime when you upgrade from one version to the other. There are ways to do this where the downtime is on the order of seconds if you're using the SSD storage engine, because the data is just on disk, so when you reboot, it's all right there. If you're using the memory storage engine, it's a little bit different, because you have to read all the data back off disk first, which kind of limits the speed of your recovery to the speed of your disk. So that's one thing you should consider if you're deploying FoundationDB. But upgrading between different versions of FoundationDB has been done many times at multiple companies at this point, so there's some information on the forums about best practices. And there is talk about changing the way this works — some of that is coming from Snowflake, where they have some ideas about how to use FlatBuffers, which is similar to Protocol Buffers but a little bit more efficient, to replace the current RPC system within FoundationDB. The reason there are potentially incompatibilities between major and minor versions is that, as far as I know, the way the RPC system works is that you're just taking C++ structs and writing them out onto the network. So using FlatBuffers would potentially introduce the ability to have interoperability between different versions of the database. But there's kind of a philosophical reason why you still may not want to do this, and why it may be a bad idea for other systems that are not just FoundationDB. It's easy to illustrate: if you have version 1 of your software and version 2, and you want to upgrade from 1 to 2, you have hopefully tested version 1 all by itself very well, and you've also hopefully tested version 2 all by itself very well. But have you tested all of the possible interactions between version 1 and version 2 when they're running at the same time? I don't know. That's kind of a difficult question, and it's why FoundationDB may not ever introduce that feature — there are potential dangers to it, and it requires a lot more testing to be confident. So I don't want to say that it's impossible or would never happen; there are just potential reasons why it may not. And my understanding from looking at the documentation and some of the presentations
So I don't want to say that it's impossible or that it would never happen. There are just potentially reasons why it may not.
[00:47:39] Unknown:
And my understanding from looking at the documentation and some of the presentations out of the summit is that a typical way of using FoundationDB is to deploy a single large cluster and then have different logical key spaces that coincide with different applications or different use cases. So I'm wondering, what are some of the ways that having a FoundationDB system deployed impacts the way that you approach application architecture or data engineering architecture?
[00:48:04] Unknown:
That is a really interesting question, because I think using FoundationDB is extremely freeing. You get the freedom to build the exact solution that you need instead of adapting your solution to fit the data model of another system. For example, as you said, there's a layer called the directory layer, which is kind of how it sounds. It's like a directory in a file system: you can open up a directory in FoundationDB, which is basically just an indirection layer between a name, like a directory name, and a very short prefix to save space in the keys. This is managed for you by the database; you don't have to do anything. You just give it the string, and it will find the right prefix for you. And you can build whatever application you want inside there. As an example, you could model a durable queuing system like Kafka in FoundationDB trivially using a feature called versionstamps.
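To make the directory layer a bit more concrete, here is a minimal sketch using the Python bindings. The directory path, key names, and API version are illustrative assumptions rather than anything from the conversation.

```python
import fdb

fdb.api_version(610)  # assumed; use the version matching your client library
db = fdb.open()

# The directory layer maps a human-readable path to a short, managed prefix,
# so keys stay small no matter how descriptive the path is.
app = fdb.directory.create_or_open(db, ('my_app',))   # hypothetical path
users = app.create_or_open(db, ('users',))

@fdb.transactional
def put_user(tr, user_id, blob):
    # Keys are packed under the directory's short prefix via the tuple layer.
    tr[users.pack((user_id,))] = blob

@fdb.transactional
def list_users(tr):
    # Range-read everything under the 'users' directory prefix.
    r = users.range()
    return [(users.unpack(kv.key), kv.value) for kv in tr.get_range(r.start, r.stop)]

put_user(db, 42, b'{"name": "example"}')
print(list_users(db))
```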
Building on that, you could then transactionally transform your data across all the steps in your data pipeline and then write aggregated data, again transactionally, back into FoundationDB for serving, all in one system. You no longer have to glue together a bunch of distributed systems and make a scalable and fault-tolerant solution for each of them. This was the focus of my talk at the FoundationDB Summit: FoundationDB just makes it easier to build these complex data processing systems, and it's all happening in the same cluster. Now, you may want to deploy multiple clusters for different reasons, like reducing the blast radius of failures, or you may want to deploy different clusters in different regions.
Most of the big users of FoundationDB don't operate out of a single large cluster. They deploy multiple clusters for different use cases and in different locations. The largest cluster that I know of belongs to a company I don't think is comfortable with me naming them, and they're not any of the ones that you've heard of that are using FoundationDB. They have 3 petabytes of disk attached to a FoundationDB cluster, and it's for an analytics workload. So there are a lot of different ways to deploy and use FoundationDB, all the way from super high throughput online transaction processing use cases to multi-petabyte analytics use cases, and there are different ways of using it across all of those. For example, Snowflake uses it to manage metadata, and I'm sure they have petabytes of data too: they use FoundationDB to manage the metadata for all of the data in S3, where the actual tables of their data warehouses are stored. You get the freedom to build these exact solutions, and it's just so much easier than using a lot of other systems. The best example I can think of, coming in a future version of FoundationDB, of making something easier for you is called the metadata version feature. You would use it if you're developing something like a SQL database layer and you need to manage the schema of a bunch of tables.
The metadata version is just another version number of the database; you can think of it as a third version besides the read version and the commit version of a transaction. This metadata version can be updated like any other key in a transaction, and you can use it to key a cache of the schema on each one of your clients. So the clients all have the exact schema of what's going on in the database, it can be updated transactionally, and it's consistent and atomic. The F1 database from Google, which is an internal system for the ad business built on top of Spanner (a key-value store, similar to how FoundationDB layers work), implements a pretty complicated schema change protocol that involves managing two different schema versions active at the same time within a cluster. The metadata version feature makes it trivial to implement this protocol. Previously, layers would have to reimplement the complicated F1 design or read the schema at the beginning of every transaction, which would cause a scalability bottleneck because the key holding the schema would get really hot. So FoundationDB just keeps adding new features that make it easy to build the kinds of advanced distributed systems and data processing systems that you would previously have had to tie together out of a bunch of pieces, and it does that in a way that is scalable and fault tolerant to the highest degree. It's tested way better than some of the systems that I'm sure people are using in production today, and it's especially tested way better than the boundaries between the systems people are building today, like the boundary between Kafka and Flink or the boundary between Kafka and Spark. There are a lot of people using those systems, but they're not all using them together in exactly the same way, so those combinations just aren't tested as well. FoundationDB is one coherent, cohesive system.
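Going back to the metadata version idea, here is a rough sketch of the client-side schema cache pattern being described, written with the Python bindings. To keep it simple it uses ordinary, hypothetical keys and an atomic counter; the real 6.1 feature uses the database's metadata version, which is delivered to clients along with the read version so checking it does not create a hot key.

```python
import struct
import fdb

fdb.api_version(610)  # assumed API version
db = fdb.open()

# Hypothetical keys standing in for the real metadata-version mechanism.
SCHEMA_VERSION_KEY = b'schema/version'
SCHEMA_BODY_KEY = b'schema/body'

_cache = {'version': None, 'schema': None}

@fdb.transactional
def cached_schema(tr):
    version = tr[SCHEMA_VERSION_KEY]
    tag = bytes(version) if version.present() else None
    if tag is not None and tag == _cache['version']:
        return _cache['schema']                # cache hit: no schema read needed
    body = tr[SCHEMA_BODY_KEY]
    schema = bytes(body) if body.present() else None
    _cache.update(version=tag, schema=schema)  # cache miss: reload transactionally
    return schema

@fdb.transactional
def update_schema(tr, new_schema):
    tr[SCHEMA_BODY_KEY] = new_schema
    # Atomically bump the version so every client's cache is invalidated the
    # next time it checks the version key.
    tr.add(SCHEMA_VERSION_KEY, struct.pack('<q', 1))
```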
And you've listed a number of different ways that you've seen FoundationDB used and interesting use cases, but I'm wondering if there are any other sort of interesting or unusual or unexpected ways that you've seen it used?

Yeah. I think the most interesting one, which I don't think anybody would come up with even though it seems obvious once you hear it, is Wavefront, a time series monitoring system that is now part of VMware. They used FoundationDB back before it was open source, and they still use it today. The way they deploy it is that they put the memory engine and the SSD engine onto the same machines: they receive high volumes of writes into the memory engine, and they periodically write that data out into the SSD storage engine for longer-term storage. When I heard that, it just blew my mind, because it's never something I would have thought to do, but it makes perfect sense. The memory engine is very efficient at receiving lots of small writes, and the SSD engine makes a lot of sense if you're going to store a bunch of data, because the time it takes to recover after, say, a failure or a version upgrade is almost nothing since the data is already on disk; you don't have to do anything. So the balance of those two things makes perfect sense for their solution. It's just an idea I had never thought of before I heard it.

And when is FoundationDB the wrong choice?

That's the tough question. Today, I think it is difficult to run FoundationDB within a container orchestrator.
I won't say nobody is doing it, but I have not heard of anybody doing it. So if that is a requirement for you, that might be tough. Another thing I would not necessarily recommend is putting 3 petabytes of disk onto a FoundationDB cluster unless you really needed to. I would suggest going with something more along the Snowflake route, where they keep the data in S3 and use FoundationDB for transactionally modifying it; the metadata is stored in FoundationDB rather than storing all the data directly in FoundationDB. Basically, the way Snowflake works is they have big immutable files that are chunks of the tables in their data warehouse, and they use FoundationDB to manage inserting new records, the concurrency control around inserting new records, and the metadata of the different versions of that table over time as it changes.
So you can do a big snapshot isolation transaction across all of your data in Snowflake, and because the files in S3 are immutable, it's really easy to do; they manage all the metadata and the locks, and all of the locks are implemented in the FoundationDB layer. Another situation where it may be the wrong choice is if you know that you don't need transactions. If you don't need transactions, say it's read-only data, it may still be useful to use FoundationDB because of the operational characteristics, since it's just really easy to use, but you don't necessarily need it there. Another situation is if you need the absolute most extreme low latency. You may be better off using something like Redis or something you write yourself, because you have to go through basically 4 network hops to commit a transaction in FoundationDB, so there's some latency there. Whereas with Redis, you send the data over and it can immediately commit it and acknowledge you without any waiting. So that may be a good fit for some applications, like things that are not necessarily transactional or where you don't care very much; there you could maybe get away with using something like Redis.
One thing I'm trying to get across here is that FoundationDB is not just a database; it is a way to build distributed systems. As an example, I was just talking about Redis, and maybe you'd be better off using Redis, but you could use FoundationDB to manage different Redis instances across multiple machines. You could build the orchestration to know reliably which instance is currently the primary and which is the secondary. You can augment existing systems with FoundationDB to make them easier to use or safer, much like the ways people use ZooKeeper or etcd. You get the same guarantees out of FoundationDB as you get from ZooKeeper or etcd, but you can actually store large amounts of data in it, and you can do many, many transactions a second, which I think is why it's so useful. So it's not so much that you need to think about when FoundationDB is the wrong choice; it's that FoundationDB can be very helpful in many situations, especially if you have a specialized solution and you don't want to implement everything under the sun about a distributed system and make it correct. FoundationDB can handle a lot of it for you, and you can build your specific thing and shim it in where you need to.

And so what is in store for the future of FoundationDB?

The most exciting things I see coming on the horizon are the improvements to multi-region support that will allow you to deploy a cluster to more than 2 regions at once, more than just a primary and a backup region. The way this multi-region system works is that you pay a latency cost to go get a read version from the primary region, and then you obviously pay a latency cost when you're committing your transaction.
But if you don't commit, like it's just a read-only transaction, you only pay to get the read version. And you can reuse read versions, as I said, if you can tolerate slightly stale data in your secondary regions, like if you're doing analytics, for example. In the secondary region, the reads are all local, so I think it's a great trade-off that they're letting you make. In future versions there will be multi-region deployments that can span more than just 2 regions, so you still have only one primary region, but you can have many secondary regions that you can fail over to whenever you need to. I think this is a good fit if you need a global view of data. But if you don't need a global view of data, you can just deploy multiple clusters in multiple different places; that's another perfectly fine solution. The next feature coming up after that is a new storage engine called Redwood. Its primary feature will be that you can do read-only transactions for longer than 5 seconds, so you can read data from more than 5 seconds in the past. It doesn't extend the length of a read-write transaction; importantly, that is a limitation of the resolvers. But it will let you do read-only transactions, for analytics, for example, at that snapshot isolation level for a longer period of time. As for how it works: a naive implementation would be to write the keys into the database with the version they're valid at as part of the key, which would mean you're storing the data multiple times, and you have to skip over those keys when reading if you can see that they're not valid yet. That's one way to implement it, and you can do that today as part of your layer if it's relevant for you. But the way this is going to work in the new Redwood storage engine is that the B+ tree itself manages the history with an indirection table from database versions to pages on disk. So you don't pay a penalty for those old versions in terms of reduced read performance; you just pay for the disk space holding the old versions, and you would be able to configure how long you want read transactions to run for in that model, to say how much disk space you're willing to use.
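The naive versioned-key layout mentioned above is easy to sketch with the Python bindings: store each write under (key, version), and when reading "as of" some version, take the newest entry at or below it. This is only an illustration of the idea under assumed names; Redwood's point is to move this bookkeeping into the storage engine so reads don't pay for the old versions.

```python
import fdb

fdb.api_version(610)  # assumed API version
db = fdb.open()

# Hypothetical directory for the illustration.
hist = fdb.directory.create_or_open(db, ('examples', 'versioned'))

@fdb.transactional
def write_versioned(tr, key, version, value):
    # Every version of a key is kept as its own row.
    tr[hist.pack((key, version))] = value

@fdb.transactional
def read_as_of(tr, key, version):
    # Scan (key, 0) .. (key, version) inclusive, newest first, and return the
    # first hit, i.e. the latest value visible at `version`.
    begin = hist.pack((key, 0))
    end = fdb.KeySelector.first_greater_than(hist.pack((key, version)))
    for kv in tr.get_range(begin, end, limit=1, reverse=True):
        return kv.value
    return None

write_versioned(db, 'color', 10, b'red')
write_versioned(db, 'color', 20, b'blue')
print(read_as_of(db, 'color', 15))  # b'red'
```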
[00:59:52] Unknown:
And are there any other aspects of FoundationDB that we didn't cover yet that you'd like to discuss before we close out the show?
[00:59:56] Unknown:
Yeah. I think I would like to describe an example solution that is somewhat similar to what I talked about in my FoundationDB Summit talk, just to give people a full picture of how you use FoundationDB and the way it can be useful to you. The way I'm going to do this is to describe how you can reimplement something analogous to Kafka using the new metadata version feature that's coming in 6.1, which is coming out soon. With Apache Kafka and other streaming systems like Kinesis from AWS, the way they work is they break up a key space, using hashing, into multiple partitions or shards, depending on what they're called in the system you're using. FoundationDB works very similarly, except it breaks the key space up lexicographically along ranges of keys. That's not super important for this example; it's just something you should know. Now, the way you can write data into FoundationDB and ensure that it's going to different storage servers, assuming you have enough storage servers (maybe you only have 3, or maybe you have 100), is to put a different prefix on the keys. So you could have a prefix of the number 1 for shard number 1, a prefix of number 2 for shard number 2, et cetera. And the way this metadata version feature, along with a few other features I'll describe, lets you implement something like Apache Kafka is that you can take these prefixes and say: my metadata version right now is 1, and in version 1 this topic that is storing all the data has 4 partitions.
So the clients read the metadata, and if they don't have it in their cache, they go and actually read the mapping that says you have 4 shards, and then they write their data into those 4 shards. One really cool thing is that you can write data into all of those shards atomically and transactionally: if you have data that needs to go into multiple shards, you can do that as part of a single transaction. Then at some point you decide, hey, right now we're only hitting 4 storage servers because we have 4 shards; we want to make this more scalable, so let's go to 8 shards. The reason the number of shards matters is that it's the concurrency level for how you process the data further down the pipeline, exactly as in Kafka. So you say, I want to scale up from 4 to 8 shards. The way you do this is you write a new key that says, hey, I've got 8 shards now, and you update the metadata version key. The next time anyone starts a transaction, they will see that their metadata version is out of date, go read the key that says how many shards there are, see that there are now 8, and start writing to the new prefixes where those 8 shards live. You can store this mapping over the long term so that a client can see, at any given version of the database, how many shards there were and which prefixes the data was stored under. You read using the versionstamps feature: you can write keys into the database that have the commit version placed into the key for you at commit time. So you can read data out of FoundationDB in commit order, which is a total ordering over all writes in the entire database. To do the equivalent of a get-records request from Kafka, you say, I want to start reading data in this shard from this version, and you just do a forward range read and get all your data back out. You can also do range reads of more than one partition at once. Say you want to read 2 partitions from 2 different topics so you have an exact join between the 2 partitions at a specific point in time, transactionally, atomically, consistently.
Within that same transaction, you can read the data from the 2 partitions and write it back out into a different stream, transactionally. All of this can happen in a single application or in multiple applications; it doesn't matter. And the way this manifests itself when you're writing code is that you get to write the most obvious, simple code to accomplish things that previously took giant distributed systems, a ton of configuration, and lots of thinking to get correct. You get to do a lot less thinking, which is wonderful when you're using FoundationDB.
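Here is a minimal sketch, again with the Python bindings, of the versionstamped log pattern just described: appends use an incomplete versionstamp that the cluster fills in with the commit version, and a forward range read over a shard returns records in commit order. The directory path and shard numbering are made up for the example, and the exact tuple-layer helpers are worth double-checking against the docs for your client version.

```python
import fdb

fdb.api_version(610)  # assumed API version
db = fdb.open()

# Hypothetical topic: each shard is just a distinct key prefix under it.
topic = fdb.directory.create_or_open(db, ('examples', 'topic'))

@fdb.transactional
def append(tr, shard, payload):
    # The incomplete versionstamp is replaced with the transaction's commit
    # version at commit time, so records in a shard sort in commit order
    # without any coordination to pick an offset.
    key = fdb.tuple.pack_with_versionstamp(
        (shard, fdb.tuple.Versionstamp()), prefix=topic.key())
    tr.set_versionstamped_key(key, payload)

@fdb.transactional
def read_shard(tr, shard, after_key=None, limit=100):
    # Rough equivalent of a "get records" call: a forward range read over
    # one shard, optionally resuming after the last key already seen.
    r = topic.range((shard,))
    begin = (fdb.KeySelector.first_greater_than(after_key)
             if after_key is not None else r.start)
    return [(kv.key, kv.value) for kv in tr.get_range(begin, r.stop, limit=limit)]

append(db, 0, b'event-1')
append(db, 0, b'event-2')
print(read_shard(db, 0))
```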
[01:04:10] Unknown:
Alright. And for anybody who wants to get in touch with you or follow up, I'll have you add your preferred contact information to the show notes. And so for my final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:04:41] Unknown:
Yeah. That's such a tough question, because there are so many things that I think are wrong with data management today, and with software engineering in general; it's really hard. But I think the most important one, which a lot of people are feeling today, is that the number of distinct data systems involved with their applications at their company makes it really difficult to integrate data together in a cohesive way. With the problems of big data, I think it was Michael Stonebraker who said this, there are the problems of variety, volume, and velocity. We have solved the volume and velocity problems very well: there are data warehouses that process lots of data, and there are systems like Redis, for example, that can process a huge number of operations a second. But the variety problem I don't think we've handled quite so well yet. I know it's kind of already happening, but I think that integrating all of these disparate systems together across big companies will be the next area that has yet to be tackled.
I think Confluent is doing a good job of trying to get all the data into Kafka, but I don't know if that's going to be enough.
[01:05:38] Unknown:
But I think they're off to a good start there. Alright. Well, thank you very much for taking the time today to join me and discuss your experience working with FoundationDB. It's definitely a very interesting system, and I'm excited to see where it goes in the future. So thank you for that, and I hope you enjoy the rest of your day. Thank you.
Introduction to Ryan Worl and FoundationDB
Ryan's Journey into Data Management
FoundationDB: History and Community Involvement
FoundationDB Architecture Overview
ACID Compliance in FoundationDB
Use Cases for FoundationDB
FoundationDB Layers and Interoperability
Performance Optimizations and Tuning
Testing FoundationDB with Deterministic Simulation
Deploying and Managing FoundationDB
Impact on Application and Data Engineering Architecture
Interesting and Unusual Use Cases
When Not to Use FoundationDB
Future of FoundationDB
Example Solution: Reimplementing Kafka with FoundationDB
Biggest Gap in Data Management Tooling and Technology