Summary
Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and EtcD, and what is in store for the future.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Patrick Hunt about Apache Zookeeper and how it is used as a building block for distributed systems
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Zookeeper is and how the project got started?
- What are the main motivations for using a centralized coordination service for distributed systems?
- What are the distributed systems primitives that are built into Zookeeper?
- What are some of the higher-order capabilities that Zookeeper provides to users who are building distributed systems on top of Zookeeper?
- What are some of the types of system level features that application developers will need which aren’t provided by Zookeeper?
- Can you discuss how Zookeeper is architected and how that design has evolved over time?
- What have you found to be some of the most complicated or difficult aspects of building and maintaining Zookeeper?
- What are the scaling factors for Zookeeper?
- What are the edge cases that users should be aware of?
- Where does it fall on the axes of the CAP theorem?
- What are the main failure modes for Zookeeper?
- How much of the recovery logic is left up to the end user of the Zookeeper cluster?
- Since there are a number of projects that rely on Zookeeper, many of which are likely to be run in the same environment (e.g. Kafka and Flink), what would be involved in sharing a single Zookeeper cluster among those multiple services?
- In recent years we have seen projects such as EtcD which is used by Kubernetes, and Consul. How does Zookeeper compare with those projects?
- What are some of the cases where Zookeeper is the wrong choice?
- How have the needs of distributed systems engineers changed since you first began working on Zookeeper?
- If you were to start the project over today, what would you do differently?
- Would you still use Java?
- What are some of the most interesting or unexpected ways that you have seen Zookeeper used?
- What do you have planned for the future of Zookeeper?
Contact Info
- @phunt on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Zookeeper
- Cloudera
- Google Chubby
- Sourceforge
- HBase
- High Availability
- Fallacies of distributed computing
- Falsehoods programmers believe about networking
- Consul
- EtcD
- Apache Curator
- Raft Consensus Algorithm
- Zookeeper Atomic Broadcast
- SSD Write Cliff
- Apache Kafka
- Apache Flink
- HDFS
- Kubernetes
- Netty
- Protocol Buffers
- Avro
- Rust
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the data engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at dataengineeringpodcast.com/chat.
[00:00:48] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Patrick Hunt about Apache Zookeeper and how it is used as a building block for distributed systems. So, Patrick, could you start by introducing yourself? Hi, Tobias. Yes. My name is Patrick Hunt. I work for Cloudera. I am currently the tech lead for data science and data engineering. Although I've worked on a number of different projects, including creating our original configuration management and monitoring systems for Hadoop that we sell as part of our product. I've worked for Cloudera for about 8 and a half years. Prior to that, I was at Yahoo for about 5.
[00:01:21] Unknown:
And the last 2 years, I was the tech lead for the zookeeper project. And do you remember how you first got involved in the area of data management?
[00:01:28] Unknown:
I'm an engineer at heart. I've worked on a bunch of different projects. I like to work on interesting things. I'm sort of a generalist. I've worked in the networking space. I've worked in search. I got involved with ZooKeeper at Yahoo because we were looking at building out our internal cloud infrastructure, and we needed some distributed coordination capabilities. Zookeeper was, at the time, part of Yahoo's research efforts. So Ben Reed and Flavio Junqueira had done some efforts around creating algorithms and protocols for distributed systems coordination.
And we were looking at productionizing that as part of our operations. And that's how I originally got involved with ZooKeeper, and I would say, you know, data systems in general. And can you give a bit of an overview of what ZooKeeper is
[00:02:23] Unknown:
and some of the history of the project and how it got started? Sure.
[00:02:27] Unknown:
ZooKeeper has been around for quite a while. I think it's over 12 years now since the project was originally conceptualized, around, like, the 2006 time frame. At that point in time, Mike Burrows, who's amazing, had published the Google Chubby paper, which is a distributed lock system that Google had been using internally. I believe Ben and Flavio had looked at that and said, hey, that would be something great for Yahoo to have, and it would solve a number of problems that we were seeing in production, especially around operations. So 1 of the nice things about ZooKeeper is it addresses issues not only for developers, but also from an operational perspective. And that was 1 of the key areas initially that we looked at addressing because it was kind of a tough issue.
So like you said, 12 years ago, Zookeeper started in Yahoo Research. When we started productionizing it, that's when I got involved. And initially, around 2007, it was open sourced on SourceForge. And at that point in time, I was working with the Hadoop team at Yahoo as part of the work that I was doing. And we saw that Hadoop was open sourcing their project in Apache, and we decided to move the project to Apache. And that's how, around June of 2008, we moved Zookeeper to Apache. And at that point in time, it got a lot of uptake, especially from companies. I think initial users were folks like LinkedIn, Twitter.
1 of the bigger projects to adopt it publicly was HBase, to solve some of their problems. It's actually pretty interesting, sort of the why of why would you use zookeeper. Right? We saw this a lot in the early days where people would say, well, I could just have NFS and I can store some information on there and, you know, share it between a number of different systems, or I could put some data into MySQL or some database and share data. Why do I need Zookeeper? I think 1 of the key things was we provide a service that's reliable and available, and that allows you, as a developer of some other distributed system, to focus on your domain problem. Right? So if you're building a key value store and you need it to be highly available, let's say, it'd be great if you could use a service to get that done rather than implementing it yourself, which can be pretty challenging.
So a lot of these companies benefited both from a development perspective, like HBase is a good example where they wanted to fix their single point of failure problem, but also from an operations perspective. At Yahoo, what we were seeing is that a lot of the teams who were building distributed systems were doing so in a variety of different ways, as I described. It might be NFS, it might be MySQL, etcetera. You can imagine the kind of difficulty that makes for the operations team who then has to learn a bunch of different ad hoc ways to manage these distributed systems. Zookeeper centralized that and provided some best practices for how to do that as a developer, which would then flow through to the operations staff. And all of those other options that you mentioned, NFS or MySQL, are just their own single points of failure that introduce their own particular headaches for how things might fail. So I imagine having a,
[00:06:05] Unknown:
more robust mechanism and 1 that is more widely used and widely understood will help with the overall recovery effort so that you don't have to try and remember all the different quirks of any given system. Because even though they might all, centralize on MySQL, for instance, they might do it in any of 15 different ways. So Absolutely.
[00:06:26] Unknown:
And that was 1 of the, you know, that was in the early days when we were first, you know, promoting Zookeeper. That was quite a bit of the feedback that we would get is, hey, you know, it used to be so easy. Right? I would just kinda stuff something in NFS, and that's worked perfectly fine. And, generally, it works fine, you know, until it doesn't work fine, you know, Saturday at 1 o'clock in the morning, and you end up with some issues. And that is another piece of feedback that we often get with zookeeper is, wow, this is a lot more complicated. Right? Like, I have to worry about all these issues that I've never worried about before. Like, the network may fail. And, you know, what do I do about it? Having a service such as zookeeper that facilitates the distributed coordination and is a central point that you can then share between different services addresses a lot of those problems. I think you may have heard of the fallacies of distributed computing. Have you ever heard of that? And also the,
[00:07:27] Unknown:
the fallacies that programmers believe about networking?
[00:07:30] Unknown:
Absolutely. So like Bill Joy, Deutsch, Peter Deutsch, and those guys, I think, came up with the initial 3 or 4. It was added to by Gosling and a few other folks. But, you know, things like, people assume that network latency is 0. They assume that bandwidth is infinite. They assume that the network is reliable and secure. Those are some of the main fallacies of distributed computing, and Zookeeper puts that right in your face. It requires you to consider those issues. 1 thing I also usually have to mention to people is that zookeeper doesn't really solve all these problems for you. Right? It provides you a framework to address them, but it's not like, you know, magic pixie dust that we sprinkle on your system and everything is addressed. You still have to worry about all these issues, but zookeeper provides a framework for you to do so. And so
[00:08:22] Unknown:
can you dig a bit into some of the different primitives that zookeeper provides as part of that framework and the ways that it encourages developers and operators to think about and plan for some of these different failure cases? Absolutely. Zookeeper was 1 of the early
[00:08:39] Unknown:
distributed coordination frameworks. There are a few that have come out relatively recently, such as Consul and etcd, which are great. Initially, as I mentioned before, there was, like, Google Chubby, which is not an open source project, but was available at Google and kind of conceptualized through some of the papers that they published. But that was more of a distributed lock service. It was primarily involved with how do I take distributed locks. What zookeeper wanted to do is try to provide some primitives that would allow you to implement different use cases. We call those recipes in zookeeper. So there are various recipes.
There are recipes for configuration, so distributed configuration management. That's probably 1 of the simplest ones. Discovery, service discovery. We also provide distributed locks, leader election, distributed queues. We also support notifications as well, and that's usually used as part of these recipes in order to implement higher level primitives. So out of the box, zookeeper kind of looks like a file system. There's a hierarchy of what we call znodes. Znodes can have children. They can also store data, which is a little bit different from a traditional file system's directories and files. And there are various capabilities such as ephemeral nodes and sequential nodes as well as notifications. Optimistic locking is also supported. These allow you to then take these various primitives and build the use cases, the things like distributed locks and leader election that I mentioned previously.
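To make those building blocks concrete, here is a minimal sketch using the raw ZooKeeper Java client. The ensemble address, paths, and payloads are hypothetical, but the calls shown, create with persistent and ephemeral sequential modes, getData with a watch, and a versioned setData for optimistic locking, are the standard client API for the primitives described above.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZnodeBasics {
    public static void main(String[] args) throws Exception {
        // Hypothetical 3-server ensemble, 10 second session timeout; the lambda receives watch events.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10_000,
                event -> System.out.println("watch event: " + event));

        // Persistent znodes form the hierarchy and can carry small payloads (e.g. configuration).
        if (zk.exists("/app", false) == null) {
            zk.create("/app", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            zk.create("/app/members", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            zk.create("/app/config", "feature=on".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // An ephemeral sequential znode: it disappears when this session ends and gets a
        // monotonically increasing suffix, the building block for lock and leader election recipes.
        String me = zk.create("/app/members/member-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("registered as " + me);

        // Read with a watch set: the watcher above fires once the next time /app/config changes.
        Stat stat = new Stat();
        byte[] config = zk.getData("/app/config", true, stat);
        System.out.println("config: " + new String(config));

        // Optimistic locking: this write only succeeds if the version is still the one we just read.
        zk.setData("/app/config", "feature=off".getBytes(), stat.getVersion());

        zk.close();
    }
}
```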
And out of the box, we, ZooKeeper, provide these primitives. There are other tools such as Curator that provide a more high level abstraction. So if you're looking to just get started and want to do leader election, you probably wanna pull in another Apache project called Curator, which is separate, and which provides this higher level of abstraction using Java object oriented primitives. Let's see. So zookeeper itself is a distributed system. It's made up of a number of servers that we call an ensemble. And the way that the system works is we elect a leader in that ensemble, and we form a quorum. So there'd be a leader with some number of followers that are made up from the initial list of servers. So you might have 3 servers, you might have 5 servers. You typically want an odd number because it is a majority leadership election that happens within zookeeper itself. So this is often a confusing area, which is that zookeeper itself is selecting a leader even though within your application you might wanna do leadership election. That's different.
But zookeeper itself is electing a leader amongst these servers, and as long as the majority of that initial ensemble of servers is talking to each other, has formed a quorum, the service is up and available. And that means, for example, you could have 3 servers. If 1 of the servers fails, the service is still live. You could have 5 servers. If 2 of the servers fail, the service is still available. So for the clients, we provide a client library that you embed into your application. As long as the clients can talk to the quorum, the servers that are able to communicate with each other, you get the services that I described previously. You could implement leadership election, and the leadership election would be valid and current as long as your client can talk to the quorum, so 1 of the servers that's making up the majority. Does that kinda make sense? Yeah. Absolutely.
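A quick way to see the sizing rule he describes: with an ensemble of N servers, a majority of N/2 + 1 (integer division) must be in contact for the service to stay available, so the number of tolerated failures is N minus that majority. This is only an illustration of the arithmetic, not ZooKeeper code.

```java
public class QuorumMath {
    public static void main(String[] args) {
        for (int ensembleSize : new int[] {1, 3, 4, 5, 7}) {
            int quorum = ensembleSize / 2 + 1;      // majority needed to keep the service available
            int tolerated = ensembleSize - quorum;  // servers that can fail or be partitioned away
            System.out.printf("%d servers -> quorum of %d, tolerates %d failure(s)%n",
                    ensembleSize, quorum, tolerated);
        }
    }
}
```

Running it shows why odd ensemble sizes are preferred: 4 servers tolerate the same single failure as 3, while 5 tolerate 2.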
[00:12:14] Unknown:
And you mentioned that most of the higher order capabilities, such as being able to easily do leader election without implementing your own logic on top of it, or being able to do optimistic locking is delegated to other client frameworks. So it sounds like most of those aren't built into ZooKeeper itself, and the project has opted to stick with just having these primitives and building blocks to make it more of a framework type approach and leaving a lot of the specifics of how those primitives are used up to the application and system developers?
[00:12:52] Unknown:
Sort of. Yes. We do provide, as I mentioned previously, this concept of recipes. So we essentially define how the various primitives can be put together to implement the recipes, or the use cases such as leader election, distributed queues, etcetera. And many projects could certainly have just done that themselves on top of the zookeeper primitives. I personally think, from seeing a number of systems use these things, it's better to use 1 of these third party higher level frameworks on top of zookeeper such as Curator, just because if all you want is vanilla leadership election, it's a lot easier and safer to use the Curator provided capabilities, the Curator code, to do so, because it allows you to just really focus down on your domain specific problem. Because, again, you're probably a developer who's building, let's say, a key value store. You really wanna focus on building a key value store, and the less you have to focus on how exactly does distributed locking work, the better.
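For the vanilla leader election case he describes, a minimal sketch with Curator's LeaderLatch recipe might look like the following. The connection string and latch path are hypothetical; the recipe itself handles the ephemeral sequential znodes and watches underneath.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble with an exponential backoff retry policy.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Every instance of the service creates a latch on the same path;
        // the instance holding the lowest sequential znode becomes the leader.
        LeaderLatch latch = new LeaderLatch(client, "/apps/kv-store/leader");
        latch.start();

        latch.await();  // blocks until this instance is elected leader
        System.out.println("has leadership: " + latch.hasLeadership());

        // ... do leader-only work here; closing the latch relinquishes leadership ...
        latch.close();
        client.close();
    }
}
```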
Now there might be cases, and there certainly are cases, where you have some specialized use case that could benefit from implementing things on top of the primitives directly. And, certainly, projects do that as well. And ZooKeeper is not the be all, end all of distributed coordination. Raft is very popular because sometimes people want to implement the distributed coordination themselves rather than offloading it to a project such as zookeeper, because you can squeeze out potentially more performance or you can optimize it for particular use cases. When you have a generic or general system such as zookeeper, you're obviously going to have some trade offs. And it's kind of the 80/20 rule. Although I probably would argue in this case it's more like the, you know, the 99/1 rule, where for most people zookeeper is probably the best choice, especially initially when you're originally building your system. And then over time, you might get to a level of sophistication where, you know, in order to optimize certain cases, you would either use the zookeeper primitives directly or potentially even implement
[00:15:07] Unknown:
distributed coordination yourself. Beyond things like leader election or locking, what are some of the other types of system level features that somebody who's building an application or a distributed system on top of ZooKeeper will typically need to implement themselves, that are still too special case to have just a general solution for? I would say things like distributed locking,
[00:15:32] Unknown:
discovery, distributed configuration. So essentially, a key value store, leadership election, and distributed queues cover most of the cases. Where people often get into trouble is where they try to use Zookeeper for things it wasn't originally intended for. So, it looks like a file system, so I'll use it as a file system. And I often get the question: I have 1 gigabyte files, and I want to store them in Zookeeper. Will that work? And I need to share these files between various systems. Now, of course, it will work if you have one 1 gigabyte file, but if you start scaling that out as your application grows, you can certainly run into problems. Also, zookeeper itself is not horizontally scalable.
So as I mentioned previously, you have a number of servers that form the ensemble, 3 or 5. You typically don't want 1, by the way, because that's not very reliable. There's no failover support there. But having 23 is not the best idea either, because all the operations go through the leader, and having 23 servers to coordinate with adds a lot of overhead. So you typically want 3 or 5 in order to facilitate that failover characteristic, but it's not going to give you higher performance. So that's 1 of the things certainly to consider with zookeeper is that there is a limit to the number of operations that you can perform per second, the throughput of the system. You know, that is limited by the networking, etcetera, the architecture of the system.
And there's no way to say, you know, add more nodes and make it more performant. It actually decreases the performance as you add more nodes. So that's something to consider is when you're building an application, you need to do the distributed coordination parts in ZooKeeper, but overloading it with all kinds of other things, like, hey, it'd be nice if I could use ZooKeeper as a traditional key value store or as a database, is not the best choice. I don't know if that's answering
[00:17:37] Unknown:
your question directly. Yeah. No. That that's a good answer. It's definitely useful to have that additional context. And 1 of the things that that leads me to question is in terms of, you mentioned that adding new nodes to the system doesn't increase the throughput, but from an operational perspective, does it support being able to dynamically remove and add nodes into the cluster for recovery purposes?
[00:18:02] Unknown:
Until recently, the configuration of ZooKeeper itself, surprisingly enough, was fairly static. So there are configuration files that define the ensemble, and really the only way to make changes to that would be through changing the configuration files and doing a rolling restart of the servers. Our most recent versions that we're working on right now introduced, I believe in 3.5, which is about to come out of beta to the stable release train relatively soon, dynamic configuration support in zookeeper itself. So that is the first version that actually supports dynamic changes to ZooKeeper aside from just doing a rolling restart of the service, which is the way things have been managed previously. And that's actually a very popular feature.
A number of companies, a number of users are already using that capability even though it's not out of beta, just because it's considered and seen as super valuable to be able to modify the ensemble size dynamically, but we do support it through rolling restart. Generally, you don't make too many changes to zookeeper once it's, you know, once you've defined the ensemble and rolled that out into production. Let's say, generally, in my experience, that doesn't change very often. Typically, it changes when there's a failure. So if 1 of the machines has a hardware failure or you need to make changes within your networking environment, within your production environment, you might wanna make some changes to the zookeeper ensemble. But once it's created and up and running, it's pretty hands off. 1 of the precepts, 1 of the principles that we built into zookeeper from the very early days was that it should be pretty hands off. So once you set up the ensemble, it forms a quorum itself. It elects a leader.
If 1 of the machines fails or goes offline or gets partitioned, say there's a network partition where 1 of the servers of the ensemble is partitioned off from the main cluster, from the quorum, for some period of time, when that partition heals and the server rejoins the quorum, it will automatically update itself to the latest state of the service. So that's without any operator
[00:20:25] Unknown:
involvement at all. And can you dig into how zookeeper itself is architected and some of the evolution of the internal design over the years that it's been, built and maintained?
[00:20:38] Unknown:
Like I said, zookeeper has been around for quite a while, for about 12 years or so, and it's a Java application. So I think Java 1.4, 1.5 was the primary Java version at the time. So if you look at the code, there's definitely some legacy aspects of the fact that zookeeper was written at a time before many of the capabilities and niceties that have been added to Java over the years were available. So, you know, we use, like, synchronized locks in many cases rather than some of the higher level primitives that are available today. Once you have a system that works and is used in production by many folks and has been tested, you know, both directly and tested by time, it makes it hard to make changes to the core of the application. So 1 of the things you'll definitely see is that, you know, the code was written in a previous era, if you will.
The basic architecture of Zookeeper is a message passing system. So the clients pass messages to the server. The servers pass messages to each other. If you are familiar with the SEDA architecture, s e d a, which stands for staged event driven architecture: if I remember correctly, I think Eric Brewer was 1 of the folks who originally helped to define that. That's basically the underlying architecture of Zookeeper. Another aspect of our initial work and our initial principles was to try to build a system that favored availability and reliability over all else.
So zookeeper is your distributed coordinator. It is your distributed single point of failure, as I, you know, often like to say. So we definitely take very seriously the fact that zookeeper is the underlying engine for your system. So if something needs to fail over, it needs to be highly reliable, highly available. Otherwise, your system won't work. Obviously, people won't use Zookeeper if that's the case. So we definitely favored availability and reliability over all else. So many of the underlying implementation specific details, such as the use of SEDA, were chosen because we want to make the code as simple as possible, because we knew that we would make mistakes, and we wanted to limit the number of mistakes that we made.
So the main event pipeline, for example, for the leader and for the followers that process these change requests and read and write requests from the clients, has traditionally been single threaded as a result. We knew that we could improve performance, but at the same time we would reduce reliability, because the likelihood we would introduce bugs would be higher. So we actually reduced things. We limited the amount of, for example, threading that we did internally to improve availability and reliability, at the cost of the higher performance we could otherwise have had. Now recently, a number of folks have been doing work to improve performance and throughput, and those changes are starting to go in now. So there are some multi threaded pipeline pieces in the processing of the client requests.
And those are relatively recent changes, and they've been, you know, battle tested. The nice thing about having large companies that run zookeeper in production is that it allows us to really get some battle hardening before we, you know, ship it out to the rest of the folks. And the Apache process itself facilitates that as well, right, since it's open source. And so in terms of the design of the system,
[00:24:33] Unknown:
as far as where it falls on the axes of the cap theorem, it seems that you're favoring availability and partition tolerance at the expense of some amount of consistency, because in some cases, if 2 different parts of the system get a different response, it's not necessarily the end of the world as long as everything can keep running.
[00:24:53] Unknown:
Yeah. Surprisingly though, zookeeper is actually more of a CP system on the CAP side. So if you're a client and you make a request to Zookeeper, if you get a success back from zookeeper, you're guaranteed that it's consistent. And again, the service itself is using a leader election with a majority rule. So if a majority of the ensemble, so 2 out of 3 if you're running 3 servers in Zookeeper, are able to talk with each other and coordinate, the service is up and available. If you fall below the majority, then the service will stop and it won't be available.
So the client, as I said, the client is guaranteed that if it gets a success, that it's consistent. 1 of the trade offs that we make, however, is that we don't guarantee that the clients see a consistent view of the world. So as an individual client, you're always guaranteed that the world moves forward. You never move back in time, and the information that you're seeing is consistent at least with respect to the server that you're talking with. So I didn't mention this before, but the clients are talking to a particular server in the ensemble, in the quorum. And the servers that make up the ensemble will only broadcast themselves as available and will only accept connections from the clients if they're a member of the quorum. So if they're able to form part of that majority that I described, they'll allow clients to talk to them. They'll service requests.
However, you might have client A that is talking to 1 of the servers that's currently in the quorum, 1 of the majority servers, and that would be up to date. However, you might also have a client B that's talking to a follower of the ensemble, of the quorum, that has been periodically partitioned off from the service. And there are timeouts. So there are session timeouts for the clients. So when you establish a session with the service as a client, you specify a session timeout. I'm diverging here a bit, but if you're doing leader election, you're basically saying, I'm willing to tolerate a certain amount of time that I'm willing to be out of date. Of course, it's a distributed system, so you're always going to be out of date, right, fractionally.
But how much can you tolerate being out of date? So if you wanna be the leader and you can tolerate being out of date for 10 seconds, you would establish a session initially of 10 seconds. And what that guarantees you is you're always up to date with the main service within 10 seconds, and that's where that CAP aspect kind of comes into play. So if you're a client that's talking to a follower that's been partitioned from the service, you might not know about that for, let's say, 10 seconds. Again, your client will always see a consistent view of the world in that things are always moving forward. For example, say the server that you're connected to actually fails, you would have to disconnect and reconnect to another server. You're always guaranteed again that things will move forward in time, but you're not guaranteed that 2 clients will see a consistent view of the world, which can be a little bit challenging for folks who are building on top of zookeeper, but it kinda goes back to the, you know, the fallacies of distributed computing. You're always guaranteed that you're going to be behind. It's just a question of how much behind you are going to be. And in terms of building and maintaining the zookeeper project itself, particularly as a centralized
[00:28:36] Unknown:
component of distributed systems? What have you found to be some of the most complex or difficult aspects of, building and maintaining the project and educating users on how best to take advantage of it? I think in the early days, a
[00:28:52] Unknown:
big issue was just educating people and making them aware of the capabilities of the system, what it could do, what it couldn't do. We ourselves were working out, you know, what the best use cases were for ZooKeeper and when a different system would be a better choice. We were building community as part of the Apache aspect of building ZooKeeper. Initially, we were the only ones who were writing the code. Right? We were using that internally at Yahoo. And then once we moved to Apache, a big part of it was just bringing on additional folks, you know, like minded folks, as part of the community who could work with the community to continue zookeeper development, keep it focused on distributed computing, distributed coordination, versus potentially adding other features that weren't towards that main goal. I think over time, as the system has matured, there's a lot of information out there about ZooKeeper. We're starting to see a few additional systems over the last few years that are providing similar kinds of capabilities, so more people are educated.
I would say just the education part, especially initially, that was 1 of the biggest things. Today, I think we face a lot of issues just from the perspective of having a system that's been around for lots of years, and a lot of people using it and relying on it to build their systems on. So things like maintaining backward compatibility is a huge problem today, because we wanna make changes. We find a bug. We find a security issue. We wanna make changes. That really can be difficult because, again, zookeeper was, you know, initially created, let's say, 12 years ago. Things like Netty didn't exist at the time. Right? Java itself didn't support SSL at the time for network communication.
So we ended up building a lot of this ourselves. Right? Protobufs didn't exist, etcetera. So we ended up building our own version of such things. If I was building a system today, I'd be taking advantage of all these things, and it would make it a lot easier to do schema migration on my transport level, on my marshaling and unmarshaling of data, etcetera. We didn't have all those things, so we ended up building it in. And I think over time, just as you add layers of code, as you add complexity, as you try to fix bugs that you didn't expect, the code just gets more and more complex. I'd say that's 1 of the bigger challenges that we face today. Again, especially maintaining backward compatibility, because that is a value that people place on zookeeper as well. The fact that our versions are backward compatible, they can upgrade to the new version with fixes, with improvements, and not need to really change their code that much, if at all. And
[00:31:38] Unknown:
you mentioned that in terms of scaling for zookeeper, it's mostly for being able to improve the reliability of the cluster more so than the throughput, but that there have been some recent changes to improve overall performance of the system by introducing some of these threading capabilities. So I'm wondering what are some of the main scaling factors for zookeeper and the systems that rely on it and some of the edge cases that users should be aware of in terms of utilizing ZooKeeper and ensuring that it is operationally reliable?
[00:32:16] Unknown:
Sure. There's a lot to unpack there. Let me see if I can go through some of them. So 1 aspect I didn't talk about too much before, but under the covers, zookeeper is using a write ahead log and a snapshotting mechanism to store persistent state. Zookeeper is an in memory database effectively, so all of the information needs to be in memory. So that's certainly 1 thing to consider: when you're putting information into ZooKeeper, you're really limited by the heap size that you're allocating for the JVM that's running the servers. ZooKeeper itself is backing up that information using traditional WAL and snapshotting operations onto the file system. So we're using a write ahead log and a snapshotting mechanism in order to store the information to the file system. 1 of the primary limitations as a result is how frequently and how quickly we can fsync the write ahead log.
So as the clients are making changes, they're doing writes into the system. That's 1 of the primary limitations of Zookeeper in terms of throughput: we need to write the changes to the write ahead log and fsync it before we can respond to the client that the write operation that it's trying to perform was successful. So the client will propose a change to the zookeeper service, and that will get forwarded to the leader. As part of the atomic broadcast protocol, Zab, the leader will communicate with the followers, and as soon as a majority of the followers have fsynced the change to their disks as part of the write ahead log, that change will be communicated back as successful to the client. So that's 1 of the primary limitations of Zookeeper in terms of the throughput.
What we often see in operations is people are trying to colocate as many services as possible to reduce costs, etcetera, reduce their overhead in terms of the number of systems that they need to manage. So if you end up colocating Zookeeper with another service that is heavily utilizing the underlying disk, that can actually slow down Zookeeper quite a bit. We actually recommend that if you're using a mechanical drive, you dedicate a spindle. SSDs help in that regard, although SSDs have their own failure modes that need to be considered. Write cliff, in particular, has been an issue in the past, and that's an issue that's typically dependent on the firmware of the drive, which makes it particularly difficult to track down. But from an operations perspective, you wanna size the zookeeper service based on the number of failures that you wanna be able to support. So if you have 3 servers, as I mentioned, 1 of the servers can die. The service will still be available. If you have 5, 2 can die. 5 is kinda nice for production serving because, while it doesn't improve your throughput, it does mean that if you have to take 1 of the servers out for maintenance, you can still suffer an unexpected failure and the service will still be live.
Let's see. So, hopefully, that covers it. I'm sure I've missed probably a couple of your points, if you wanna redirect. But those are some of the primary things to consider from an operations perspective and in terms of sizing Zookeeper.
[00:35:45] Unknown:
Yeah. That that's all valuable information. And given that particularly the in memory aspects, it helps in terms of the instance type selection for somebody who might be running in a cloud environment to know that they should, optimize on the memory side more so than on the disk side, for instance. So, for people who are doing that deployment, it it's good to have that knowledge. And, you were talking about co locating services as far as ZooKeeper running alongside some other, server process, but, the other direction of using ZooKeeper as a multi tenant service, I'm wondering what the capacity is as far as being able to, for instance, use ZooKeeper as the coordination layer for maybe a Kafka cluster and a Flink cluster using the same underlying, ZooKeeper deployment?
[00:36:35] Unknown:
Yeah. That's a great question. For most distributed coordination efforts, you're not gonna be maxing out ZooKeeper. Right? You're going to have 3 machines that are trying to manage a distributed lock. You might have thousands of machines that are managing a distributed lock, but the number of operations that you're performing as part of that are relatively small. And zookeeper is also optimized for read heavy workloads. So, not something that I mentioned before, but in terms of multi tenancy and scalability, zookeeper is optimized for read heavy workloads. So the thought is that you're going to be doing a lot more reads than you are going to be doing writes. So let's say you're doing leader election. That requires a small set of writes, and then all the followers need to basically read the znode, in this particular case, read 1 of the pieces of data from ZooKeeper, and you're gonna have potentially large amounts of systems doing that. And ZooKeeper handles that particularly well. ZooKeeper is a hierarchical namespace. As I mentioned, it looks very much like a file system, and that does facilitate, you know, having a /apps, you know, /apps/kafka, /apps/solr, /apps/flink, etcetera.
And we have a very primitive chroot kind of facility. So when you have an application that's connecting to ZooKeeper, it can actually root itself in 1 of these sub namespaces. So you could, like, have a /apps/kafka, and Kafka could root itself at /apps/kafka. So its root would look like slash. So there are some multi tenancy capabilities built into Zookeeper. There's quite a bit of overhead if you're using Zookeeper properly. There's a lot of unused, again, it's not like you're doing large amounts of operations on zookeeper all the time. So there's unused capacity. Maybe that's the word that I'm looking for. But there's quite a bit of unused capacity, typically, in a zookeeper system if you're using it properly, because you're just not doing that much. Right? You're doing leadership election. It's kind of bursty. Like, when something fails, you might have a burst of activity. But for the most part, it's just quiescent. In terms of operations in a multi tenant environment, there definitely are some things that we could do, and we're seeing some. That's 1 of the areas that has had recent development efforts.
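The chroot facility he mentions is exposed through the client connection string: appending a path roots that client's view of the namespace at that znode. A small hedged sketch, with hypothetical hosts and paths, assuming the /apps/kafka subtree already exists on the ensemble:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ChrootedClient {
    public static void main(String[] args) throws Exception {
        // The "/apps/kafka" suffix chroots this session: "/" for this client maps to
        // /apps/kafka on the shared ensemble, so each tenant stays in its own subtree.
        ZooKeeper kafkaZk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181/apps/kafka", 15_000, event -> { });

        // This creates /apps/kafka/brokers on the ensemble, but the client only ever sees /brokers.
        kafkaZk.create("/brokers", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        kafkaZk.close();
    }
}
```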
Facebook uses zookeeper extremely heavily to do their internal configuration management and internal coordination. They've contributed a lot of changes. They're in the process of contributing a lot of changes, because they use it at an extremely high scale in a multi tenant environment. So I would say just the fact that we have the namespaces facilitates the multi tenancy aspect pretty well. But in my own day to day here at Cloudera, we're running Hadoop systems, and there are quite a few of the systems in the Hadoop ecosystem that rely on ZooKeeper nowadays for their distributed coordination, HDFS, HBase, etcetera. And those systems are all sharing a single ZooKeeper, typically, at least in a Cloudera deployment, for example. And that works perfectly fine. There's quite a bit of overhead or unused capacity, let's say, in a Zookeeper environment. Because you need to have, like, 3 or 5 servers to facilitate a zookeeper service, there's just a lot of extra capacity that you have based on the fact that you need multiple machines in order to facilitate that. And as you mentioned before,
[00:40:11] Unknown:
in recent years, there have been some other projects that were created and released for doing some of this centralized coordination of distributed systems, such as etcd or Consul. And I'm wondering how ZooKeeper compares with some of those other projects, and some of the cases where ZooKeeper might be the wrong choice and you would prefer to use something else? Yeah. That's a good question.
[00:40:36] Unknown:
Just as ZooKeeper itself, you know, looked at, and was inspired by, systems such as Chubby back when we were conceptualizing, you know, what we should build, obviously optimized for our particular use cases, I think the same thing has happened in these other spaces. So Consul, which is a great product, and etcd, facilitating a lot of, you know, the Go space, and especially with Kubernetes now, quite a few of the capabilities of Kubernetes rely directly on etcd. I think etcd in particular was influenced not only by ZooKeeper, but also by Raft.
So John Ousterhout and some of his folks did a bunch of work, more on the education side, I would say. So they took some of the concepts that had been floating around, kinda put them together again slightly differently, and that provides you with Raft, which is a great tool set for learning distributed coordination, especially if you're in that minority, I would say, of folks who need to implement distributed coordination yourself. Raft was really sort of conceived, at least in my opinion, more from an educational perspective, right, which is: Paxos is great, but it's really hard to implement properly.
Zookeeper is good because it provides you this service. But what if I need to build my own distributed consensus system? How should I go about doing that? What's the best way to do it? Like, how would I be successful? Raft documented really, really well how to do that. And then many of the systems such as etcd were able to take those concepts and build around it. So if you look at etcd, and I'm less familiar with Consul, but I believe it also to be the case, you know, they looked at some of the best things out of ZooKeeper and some of the other systems and, you know, implemented their own system based on that, again, for their specific use cases. Also, ZooKeeper was created, again, 12 years ago, before things like JSON and some of these other de facto standards were common. Java is the best interface that it provides. It also provides a C interface.
Many people have built their own language specific implementations of zookeeper clients on top of ZooKeeper. The same thing is happening here. Right? If you're in the Go space and you're building a Kubernetes app, of course, you're gonna use etcd. Right? Hopefully, you're not going to have to run 2 distributed coordination systems. I personally would hate to have to do that. So I think, if you're in an environment where you have a distributed coordination system already, you probably would want to lean in. And given the length of time that zookeeper has been around and the
[00:43:22] Unknown:
massive changes and developments that have occurred in that time span, I'm wondering how you have seen the needs of distributed systems engineers and the systems themselves change since you first began working with ZooKeeper.
[00:43:36] Unknown:
How has it changed over time? I think people are just much more educated now than they were when we originally started. As I mentioned, there wasn't, you know, there was Paxos and some of these other things, but I think building large distributed systems was a really sort of niche area. It's become more and more common. You know, if you go to Amazon and you wanna build an application, you know, of course, you're gonna wanna build something that has capabilities and is likely going to need to scale to the point where you're gonna have to run multiple machines. And you need to then consider the operational aspect as well, because you have multiple teams that need to work together. How have things changed over time? I think people have just become much more sophisticated.
The demands have just increased in terms of the kinds of capabilities that you need. The tolerances have, you know, just decreased. Right? Like, if it was an inch before, now it's a fraction of an inch. Right? And I think people rely on things like Consul, etcd, zookeeper more and more. Have they changed? I mean, I've been a little bit surprised that there haven't been more distributed coordination systems available. It's only relatively recently that etcd and Consul have really, you know, come out and become available. I guess maybe the thing that I've been more surprised at is that there aren't more systems like this. That's probably the thing that I've been most surprised by. And
[00:45:03] Unknown:
as you were saying, the code base itself has accrued a number of different, peculiarities because of the age of the original implementation. So if you were to start everything over today with greenfield and knowing what you know now, I'm curious what you would do differently and if you might choose a different language runtime to build it in, or if you've been happy with Java as the implementation target? I think we've been super happy with Java. It's a known quantity.
[00:45:33] Unknown:
Again, if you're building something, you're gonna spend, you know, months, a year, 2 years building a system. You're gonna spend many, many more years operating it. Right? Once you build a system that relies on another system, that thing is going to exist for a long period of time. And you mentioned a number of systems, Kafka, Flink, etcetera, and HBase that I mentioned, that currently rely on Zookeeper. That's not an easy thing to change from their perspective, either. 1 of the big changes today is that there are a lot more facilities available for building a distributed system, such as Netty and other capabilities for, like, transport level communication, protobufs, Avro, etcetera, that support schema migration, which is 1 of the areas that we kinda suffer from in ZooKeeper today, because our implementation is kind of bespoke. We didn't build in certain niceties like schema migration, which can make it difficult for us to change our on the wire protocol, which we occasionally need to do if we're adding more capabilities or trying to fix a bug. So, certainly, I would be leveraging a bunch of things that exist, a bunch of tools that exist today in the open source ecosystem that didn't exist at the time. I think Java was a great basis for that. I might be tempted to potentially use Rust today. But, again, getting back to the operational aspect, you know, I'd definitely wanna choose something that the people who are operating the system felt comfortable with, both in terms of making a system that's reliable and available, but also building something that I can operate, which would include things like monitoring
[00:47:09] Unknown:
and debugging. And we've listed some of the, well known projects that have built on top of ZooKeeper, but I'm wondering if there are any particularly interesting or unexpected ways that you've seen ZooKeeper used.
[00:47:23] Unknown:
Well, I've seen a number of sort of anti patterns used with zookeeper certainly through the years. I mean, 1 of the things I'm kind of consistently surprised by is the number of projects that continue to adopt zookeeper as their distributed coordination mechanism. So, you know, it's easy when you're making the sausage to kind of see the problems under the covers, see the bugs, and the kinds of issues that we're facing. But, you know, again, it's pretty encouraging to see that there are many new systems that are picking up ZooKeeper as their distributed coordination mechanism even though there are other systems available today that are potentially great options as well. I don't know. I can't think of anything off the top of my head that is particularly surprising. At the end of the day, there is a very common set of problems that people who are building systems need to solve, such as leader election, service discovery, locks, distributed queues, and, you know, systems are picking that up and using it. And those are the most common things that I see. I can't say that I see sort of crazy things. Nothing, at least, that comes to mind. And going forward, what are some of the plans that you have for the future of ZooKeeper, either in terms of improvements
[00:48:43] Unknown:
or new features or capabilities or anything in terms of the community or educational aspects of the project? Zookeeper is a
[00:48:53] Unknown:
community based project and has been for quite a long time at Apache, which is great, because people come and go, new requirements kinda come and go, and companies come and go. But at the end of the day, you always have that Apache community that you can go back to. We've been working really hard over the past few years to come out with an updated release that includes things such as the dynamic configuration support of ZooKeeper itself, so changing the ensemble dynamically. I'd say that's 1 of the biggest things that we've been working on, really kinda, like, shipping that. There's been quite a bit of work on security, and we continue to work on security, just hardening the whole system, both in terms of availability and reliability, but also in terms of security.
Getting so many eyes on it is great from that perspective as well. It's a pretty mature project, you know, being around for 12 years. I think 1 of the big things that we've been seeing recently, like I said, Facebook has been contributing a number of changes. They've been doing a lot of work on observability, because they use it so heavily throughout their infrastructure. Monitoring and managing of these clusters becomes even more important for them. So they've been contributing a lot back to zookeeper around observability, which I think is really great. So, making it even easier to manage. You mentioned multi tenancy before. I think that's 1 of the areas we could also do some more work on. And, you know, as we get more sophisticated in our use cases, multi tenant use cases in particular, that's probably 1 of the areas that we'll continue to focus on, or focus on more, but it is a pretty mature project. We've added some capabilities recently around, like, TTL, which I think is a capability that we identified out of, like, etcd, or maybe it was Consul. But we ourselves are seeing some of the changes that are going into these other systems and saying, hey, wouldn't it be great if you could set a time to live on zookeeper data, which is not a capability that we had until relatively recently?
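The time to live capability he refers to landed in the 3.5 line as TTL znodes: persistent znodes that the server may expire if they have no children and are not updated within their time to live. A hedged sketch, assuming a 3.5.3 or newer ensemble with extended types enabled and hypothetical paths:

```java
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class TtlNodeExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10_000, event -> { });

        // A persistent znode with a 5 minute TTL: if it has no children and is not
        // modified within that window, the server is allowed to delete it.
        // (Assumes the servers run with zookeeper.extendedTypesEnabled=true.)
        zk.create("/app/heartbeat", "alive".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_WITH_TTL,
                new Stat(), TimeUnit.MINUTES.toMillis(5));

        zk.close();
    }
}
```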
So we're learning things as we go along as well, which is great. And are there any other aspects of the zookeeper project
[00:51:05] Unknown:
or distributed systems, primitives, or centralized coordination for these types of systems that we didn't cover yet, which you think we should discuss before we close out the show?
[00:51:16] Unknown:
Other aspects. You know, I think we touched on it, but the operational side of things is a huge, huge issue. It's easy to provide these capabilities and to convince people to use these systems, but putting them in production and making sure that they're actually really reliable, really available, proving them out for the particular user's use case, is a lot of work and a lot of effort. And it's 1 of the main things that separates a successful project from an unsuccessful project.
[00:51:47] Unknown:
So I would say that that's 1 of the main areas. Alright. And for anybody who wants to get in touch with you and follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think 1 of the main things that I see, again, goes back to the
[00:52:12] Unknown:
operational aspect of things. 1 of the areas for concern that I have is around observability. So whether you're building on top of Zookeeper or you're trying to build a data pipeline or trying to build a data science model that you wanna get into production, there are going to be unexpected hiccups. Zookeeper still has bugs, surprisingly, after all these years, and, occasionally, you'll run into 1 of these bugs. How do you identify that issue? It's going to impact your system. How do you track that? How do you identify it? How do you debug it? How do you make sure that you get that notification and resolve the issue as quickly as possible?
Whether you're using ZooKeeper, whether you're writing a data pipeline where the data may change unexpectedly and suddenly your ETL processing is starting to fail, or you've written a model that you deployed into production for data science and you're getting unexpected results suddenly for some unexpected reason. How do you identify that before it impacts your business in a negative way? I see a lot of systems and a lot of work being done, and it's great work. But how do we integrate all these different, there are so many components these days that have to be pulled together to build a system. How do we make sure that we can run that in production and quickly identify problems and resolve them? That's 1 of the areas I would say that I spend a lot of time on today, helping people not only with Zookeeper, but in other aspects of the work that I do. And I think that's 1 of the areas that we need to continue to focus on as a community.
[00:53:51] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you're doing with ZooKeeper. It's a system that I've been aware of for a long time, and I've heard a lot of people talking about it. So it's great to be able to get more of a deep dive and a better understanding of the overall scope and history of the project. So thank you for that, and I hope you enjoy the rest of your day. Thank you for having me on and, chatting about Zookeeper. I really appreciate it. Thank
[00:54:17] Unknown:
you.
Introduction and Guest Background
Overview and History of Apache ZooKeeper
ZooKeeper Primitives and Use Cases
Operational Considerations and Scaling
ZooKeeper's Architecture and Evolution
Scaling Factors and Multi-Tenancy
Comparisons with Other Coordination Systems
Changes in Distributed Systems Engineering
Future Plans and Community Contributions
Operational Challenges and Observability