Summary
Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and EtcD, and what is in store for the future.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Patrick Hunt about Apache Zookeeper and how it is used as a building block for distributed systems
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Zookeeper is and how the project got started?
- What are the main motivations for using a centralized coordination service for distributed systems?
- What are the distributed systems primitives that are built into Zookeeper?
- What are some of the higher-order capabilities that Zookeeper provides to users who are building distributed systems on top of Zookeeper?
- What are some of the types of system level features that application developers will need which aren’t provided by Zookeeper?
- Can you discuss how Zookeeper is architected and how that design has evolved over time?
- What have you found to be some of the most complicated or difficult aspects of building and maintaining Zookeeper?
- What are the scaling factors for Zookeeper?
- What are the edge cases that users should be aware of?
- Where does it fall on the axes of the CAP theorem?
- What are the main failure modes for Zookeeper?
- How much of the recovery logic is left up to the end user of the Zookeeper cluster?
- Since there are a number of projects that rely on Zookeeper, many of which are likely to be run in the same environment (e.g. Kafka and Flink), what would be involved in sharing a single Zookeeper cluster among those multiple services?
- In recent years we have seen projects such as EtcD which is used by Kubernetes, and Consul. How does Zookeeper compare with those projects?
- What are some of the cases where Zookeeper is the wrong choice?
- How have the needs of distributed systems engineers changed since you first began working on Zookeeper?
- If you were to start the project over today, what would you do differently?
- Would you still use Java?
- What are some of the most interesting or unexpected ways that you have seen Zookeeper used?
- What do you have planned for the future of Zookeeper?
Contact Info
- @phunt on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Zookeeper
- Cloudera
- Google Chubby
- Sourceforge
- HBase
- High Availability
- Fallacies of distributed computing
- Falsehoods programmers believe about networking
- Consul
- EtcD
- Apache Curator
- Raft Consensus Algorithm
- Zookeeper Atomic Broadcast
- SSD Write Cliff
- Apache Kafka
- Apache Flink
- HDFS
- Kubernetes
- Netty
- Protocol Buffers
- Avro
- Rust
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the data engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at dataengineeringpodcast.com/chat.
[00:00:48] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Patrick Hunt about Apache Zookeeper and how it is used as a building block for distributed systems. So, Patrick, could you start by introducing yourself? Hi, Tobias. Yes. My name is Patrick Hunt. I work for Cloudera. I am currently the tech lead for data science and data engineering. Although I've worked on a number of different projects, including creating our original configuration management and monitoring systems for Hadoop that we sell as part of our product. I've worked for Cloudera for about 8 and a half years. Prior to that, I was at Yahoo for about 5.
[00:01:21] Unknown:
And the last 2 years, I was the tech lead for the zookeeper project. And do you remember how you first got involved in the area of data management?
[00:01:28] Unknown:
I'm an engineer at heart. I've worked on a bunch of different projects. I like to work on interesting things. I'm sort of a generalist. I've worked in the networking space. I've worked in search. I got involved with ZooKeeper at Yahoo because we were looking at building out our internal cloud infrastructure, and we needed some distributed coordination capabilities. Zookeeper was, at the time, part of Yahoo's research efforts. So Ben Reed and Flavio Junqueira had done some efforts around creating algorithms and protocols for distributed systems coordination.
And we were looking at productionizing that as part of our operations. And that's how I originally got involved with ZooKeeper, and I would say, you know, data systems in general. And can you give a bit of an overview of what ZooKeeper is
[00:02:23] Unknown:
and some of the history of the project and how it got started? Sure.
[00:02:27] Unknown:
ZooKeeper has been around for quite a while. I think it's over 12 years now since the project was originally conceptualized, around, like, the 2006 time frame. At that point in time, Mike Burrows, who's amazing, had published the Google Chubby paper, which is a distributed lock system that Google had been using internally. I believe Ben and Flavio had looked at that and said, hey, that would be something great for Yahoo to have, and it would solve a number of problems that we were seeing in production, especially around operations. So 1 of the nice things about ZooKeeper is it addresses issues not only for developers, but also from an operational perspective. And that was 1 of the key areas initially that we looked at addressing because it was kind of a tough issue.
So like you said, 12 years ago, Zookeeper started in Yahoo Research. When we started productionizing it, that's when I got involved. And initially, around 2007, it was open sourced on SourceForge. And at that point in time, I was working with the Hadoop team at Yahoo as part of the work that I was doing. And we saw that Hadoop was open sourcing their project in Apache, and we decided to move the project to Apache. And that's how, around June of 2008, we moved Zookeeper to Apache. And at that point in time, it got a lot of uptake, especially from companies. I think initial users were folks like LinkedIn, Twitter.
1 of the bigger projects to adopt it publicly was HBase, to solve some of their problems. It's actually pretty interesting, sort of the why of why would you use zookeeper. Right? We saw this a lot in the early days where people would say, well, I could just have NFS and I can store some information on there and, you know, share it between a number of different systems, or I could put some data into MySQL or some database and share data. Why do I need Zookeeper? I think 1 of the key things was we provide a service that's reliable and available, and that allows you, as a developer of some other distributed system, to focus on your domain problem. Right? So if you're building a key value store and you need it to be highly available, let's say, it'd be great if you could use a service to get that done rather than implementing it yourself, which can be pretty challenging.
So a lot of these companies benefited both from a development perspective, like HBase is a good example where they wanted to fix their single point of failure problem, but also from an operations perspective. At Yahoo, what we were seeing is that a lot of the teams who were building distributed systems were doing so in a variety of different ways, as I described. It might be NFS, it might be MySQL, etcetera. You can imagine the kind of difficulty that makes for the operations team who then has to learn a bunch of different ad hoc ways to manage these distributed systems. Zookeeper centralized that and provided some best practices for how to do that as a developer, which would then flow through to the operations staff. And all of those other options that you mentioned, NFS or MySQL, are just their own single points of failure that introduce their own particular headaches for how things might fail. So I imagine having a,
[00:06:05] Unknown:
more robust mechanism and 1 that is more widely used and widely understood will help with the overall recovery effort so that you don't have to try and remember all the different quirks of any given system. Because even though they might all, centralize on MySQL, for instance, they might do it in any of 15 different ways. So Absolutely.
[00:06:26] Unknown:
And that was 1 of the, you know, that was in the early days when we were first, you know, promoting Zookeeper. That was quite a bit of the feedback that we would get is, hey, you know, it used to be so easy. Right? I would just kinda stuff something in NFS, and that's worked perfectly fine. And, generally, it works fine, you know, until it doesn't work fine, you know, Saturday at 1 o'clock in the morning, and you end up with some issues. And that is another piece of feedback that we often get with zookeeper is, wow, this is a lot more complicated. Right? Like, I have to worry about all these issues that I've never worried about before. Like, the network may fail. And, you know, what do I do about it? Having a service such as zookeeper that facilitates the distributed coordination and is a central point that you can then share between different services addresses a lot of those problems. I think you may have heard of the fallacies of distributed computing. Have you ever heard of that? And also the,
[00:07:27] Unknown:
the fallacies that programmers believe about networking?
[00:07:30] Unknown:
Absolutely. So like Bill Joy, Deutsch, Peter Deutsch, and those guys, I think, came up with the initial 3 or 4. It was added to by Gosling and a few other folks. But, you know, things like, people assume that network latency is 0. They assume that bandwidth is infinite. They assume that the network is reliable and secure. Those are some of the main fallacies of distributed computing, and Zookeeper puts that right in your face. It requires you to consider those issues. 1 thing I also usually have to mention to people is that zookeeper doesn't really solve all these problems for you. Right? It provides you a framework to address them, but it's not like, you know, magic pixie dust that we sprinkle on your system and everything is addressed. You still have to worry about all these issues, but zookeeper provides a framework for you to do so. And so
[00:08:22] Unknown:
can you dig a bit into some of the different primitives that zookeeper provides as part of that framework and the ways that it encourages developers and operators to think about and plan for some of these different failure cases? Absolutely. Zookeeper was 1 of the early
[00:08:39] Unknown:
distributed coordination frameworks. There are a few that have come out relatively recently, such as Consul and etcd, which are great. Initially, as I mentioned before, there was, like, Google Chubby, which is not an open source project, but was available at Google and kind of conceptualized through some of the papers that they published. But that was more of a distributed lock service. It was primarily involved with how do I take distributed locks. What zookeeper wanted to do is try to provide some primitives that would allow you to implement different use cases. We call those recipes in zookeeper. So there are various recipes.
There are recipes for configuration, so distributed configuration management. That's probably 1 of the simplest ones. Discovery, service discovery. We also provide distributed locks, leader election, distributed queues. We also support notifications as well, and that's usually used as part of these recipes in order to implement higher level primitives. So out of the box, zookeeper kind of looks like a file system. There's a hierarchy of what we call znodes. Znodes can have children. They can also store data, which is a little bit different from a traditional file system's directories and files. And there are various capabilities such as ephemeral nodes and sequential nodes as well as notifications. Optimistic locking is also supported. These allow you to then take these various primitives and build the use cases, the things like distributed locks and leader election that I mentioned previously.
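To make those building blocks concrete, here is a minimal sketch using the raw ZooKeeper Java client. The ensemble address, paths, and payloads are hypothetical, but the calls shown, create with persistent and ephemeral sequential modes, getData with a watch, and a versioned setData for optimistic locking, are the standard client API for the primitives described above.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZnodeBasics {
    public static void main(String[] args) throws Exception {
        // Hypothetical 3-server ensemble, 10 second session timeout; the lambda receives watch events.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10_000,
                event -> System.out.println("watch event: " + event));

        // Persistent znodes form the hierarchy and can carry small payloads (e.g. configuration).
        if (zk.exists("/app", false) == null) {
            zk.create("/app", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            zk.create("/app/members", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            zk.create("/app/config", "feature=on".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // An ephemeral sequential znode: it disappears when this session ends and gets a
        // monotonically increasing suffix, the building block for lock and leader election recipes.
        String me = zk.create("/app/members/member-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("registered as " + me);

        // Read with a watch set: the watcher above fires once the next time /app/config changes.
        Stat stat = new Stat();
        byte[] config = zk.getData("/app/config", true, stat);
        System.out.println("config: " + new String(config));

        // Optimistic locking: this write only succeeds if the version is still the one we just read.
        zk.setData("/app/config", "feature=off".getBytes(), stat.getVersion());

        zk.close();
    }
}
```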
And out of the box, we, ZooKeeper, provide these primitives. There are other tools such as Curator that provide a more high level abstraction. So if you're looking to just get started and want to do leader election, you probably wanna pull in another Apache project called Curator, which is separate, and which provides this higher level of abstraction using Java object oriented primitives. Let's see. So zookeeper itself is a distributed system. It's made up of a number of servers that we call an ensemble. And the way that the system works is we elect a leader in that ensemble, and we form a quorum. So there'd be a leader with some number of followers that are made up from the initial list of servers. So you might have 3 servers, you might have 5 servers. You typically want an odd number because it is a majority leadership election that happens within zookeeper itself. So this is often a confusing area, which is that zookeeper itself is selecting a leader even though within your application you might wanna do leadership election. That's different.
But zookeeper itself is electing a leader amongst these servers, and as long as the majority of that initial ensemble of servers is talking to each other, has formed a quorum, the service is up and available. And that means, for example, you could have 3 servers. If 1 of the servers fails, the service is still live. You could have 5 servers. If 2 of the servers fail, the service is still available. So for the clients, we provide a client library that you embed into your application. As long as the clients can talk to the quorum, the servers that are able to communicate with each other, you get the services that I described previously. You could implement leadership election, and the leadership election would be valid and current as long as your client can talk to the quorum, so 1 of the servers that's making up the majority. Does that kinda make sense? Yeah. Absolutely.
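A quick way to see the sizing rule he describes: with an ensemble of N servers, a majority of N/2 + 1 (integer division) must be in contact for the service to stay available, so the number of tolerated failures is N minus that majority. This is only an illustration of the arithmetic, not ZooKeeper code.

```java
public class QuorumMath {
    public static void main(String[] args) {
        for (int ensembleSize : new int[] {1, 3, 4, 5, 7}) {
            int quorum = ensembleSize / 2 + 1;      // majority needed to keep the service available
            int tolerated = ensembleSize - quorum;  // servers that can fail or be partitioned away
            System.out.printf("%d servers -> quorum of %d, tolerates %d failure(s)%n",
                    ensembleSize, quorum, tolerated);
        }
    }
}
```

Running it shows why odd ensemble sizes are preferred: 4 servers tolerate the same single failure as 3, while 5 tolerate 2.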
[00:12:14] Unknown:
And you mentioned that most of the higher order capabilities, such as being able to easily do leader election without implementing your own logic on top of it, or being able to do optimistic locking is delegated to other client frameworks. So it sounds like most of those aren't built into ZooKeeper itself, and the project has opted to stick with just having these primitives and building blocks to make it more of a framework type approach and leaving a lot of the specifics of how those primitives are used up to the application and system developers?
[00:12:52] Unknown:
Sort of. Yes. We do provide, as I mentioned previously, this concept of recipes. So we essentially define how the various primitives can be put together to implement the recipes, or the use cases such as leader election, distributed queues, etcetera. And many projects could certainly have just done that themselves on top of the zookeeper primitives. I personally think, from seeing a number of systems use these things, it's better to use 1 of these third party higher level frameworks on top of zookeeper such as Curator, just because if all you want is vanilla leadership election, it's a lot easier and safer to use the Curator provided capabilities, the Curator code, to do so, because it allows you to just really focus down on your domain specific problem. Because, again, you're probably a developer who's building, let's say, a key value store. You really wanna focus on building a key value store, and the less you have to focus on how exactly does distributed locking work, the better.
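For the vanilla leader election case he describes, a minimal sketch with Curator's LeaderLatch recipe might look like the following. The connection string and latch path are hypothetical; the recipe itself handles the ephemeral sequential znodes and watches underneath.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble with an exponential backoff retry policy.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Every instance of the service creates a latch on the same path;
        // the instance holding the lowest sequential znode becomes the leader.
        LeaderLatch latch = new LeaderLatch(client, "/apps/kv-store/leader");
        latch.start();

        latch.await();  // blocks until this instance is elected leader
        System.out.println("has leadership: " + latch.hasLeadership());

        // ... do leader-only work here; closing the latch relinquishes leadership ...
        latch.close();
        client.close();
    }
}
```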
Now there might be cases, and there certainly are cases, where you have some specialized use case that could benefit from implementing things on top of the primitives directly. And, certainly, projects do that as well. And ZooKeeper is not the be all, end all of distributed coordination. Raft is very popular because sometimes people want to implement the distributed coordination themselves rather than offloading it to a project such as zookeeper, because you can squeeze out potentially more performance or you can optimize it for particular use cases. When you have a generic or general system such as zookeeper, you're obviously going to have some trade offs. And it's kind of the 80/20 rule. Although I probably would argue in this case it's more like the, you know, the 99/1 rule, where for most people zookeeper is probably the best choice, especially initially when you're originally building your system. And then over time, you might get to a level of sophistication where, you know, in order to optimize certain cases, you would either use the zookeeper primitives directly or potentially even implement
[00:15:07] Unknown:
distributed coordination yourself. Beyond things like leader election or locking, what are some of the other types of system level features that somebody who's building an application or a distributed system on top of ZooKeeper will typically need to implement themselves, that are still too special case to have just a general solution for? I would say things like distributed locking,
[00:15:32] Unknown:
discovery, distributed configuration. So essentially, a key value store, leadership election, and distributed queues cover most of the cases. Where people often get into trouble is where they try to use Zookeeper for things it wasn't originally intended for. So, it looks like a file system, so I'll use it as a file system. And I often get the question: I have 1 gigabyte files, and I want to store them in Zookeeper. Will that work? And I need to share these files between various systems. Now, of course, it will work if you have one 1 gigabyte file, but if you start scaling that out as your application grows, you can certainly run into problems. Also, zookeeper itself is not horizontally scalable.
So as I mentioned previously, you have a number of servers that form the ensemble, 3 or 5. You typically don't want 1, by the way, because that's not very reliable. There's no failover support there. But having 23 is not the best idea either, because all the operations go through the leader, and having 23 servers to coordinate with adds a lot of overhead. So you typically want 3 or 5 in order to facilitate that failover characteristic, but it's not going to give you higher performance. So that's 1 of the things certainly to consider with zookeeper is that there is a limit to the number of operations that you can perform per second, the throughput of the system. You know, that is limited by the networking, etcetera, the architecture of the system.
And there's no way to say, you know, add more nodes and make it more performant. It actually decreases the performance as you add more nodes. So that's something to consider is when you're building an application, you need to do the distributed coordination parts in ZooKeeper, but overloading it with all kinds of other things, like, hey, it'd be nice if I could use ZooKeeper as a traditional key value store or as a database, is not the best choice. I don't know if that's answering
[00:17:37] Unknown:
your question directly. Yeah. No. That that's a good answer. It's definitely useful to have that additional context. And 1 of the things that that leads me to question is in terms of, you mentioned that adding new nodes to the system doesn't increase the throughput, but from an operational perspective, does it support being able to dynamically remove and add nodes into the cluster for recovery purposes?
[00:18:02] Unknown:
Until recently, the configuration of ZooKeeper itself, surprisingly enough, was fairly static. So there are configuration files that define the ensemble, and really the only way to make changes to that would be through changing the configuration files and doing a rolling restart of the servers. Our most recent versions that we're working on right now introduced, I believe in 3.5, which is about to come out of beta to the stable release train relatively soon, dynamic configuration support in zookeeper itself. So that is the first version that actually supports dynamic changes to ZooKeeper aside from just doing a rolling restart of the service, which is the way things have been managed previously. And that's actually a very popular feature.
A number of companies, a number of users are already using that capability even though it's not out of beta, just because it's considered and seen as super valuable to be able to modify the ensemble size dynamically, but we do support it through rolling restart. Generally, you don't make too many changes to zookeeper once it's, you know, once you've defined the ensemble and rolled that out into production. Let's say, generally, in my experience, that doesn't change very often. Typically, it changes when there's a failure. So if 1 of the machines has a hardware failure or you need to make changes within your networking environment, within your production environment, you might wanna make some changes to the zookeeper ensemble. But once it's created and up and running, it's pretty hands off. 1 of the precepts, 1 of the principles that we built into zookeeper from the very early days was that it should be pretty hands off. So once you set up the ensemble, it forms a quorum itself. It elects a leader.
If 1 of the machines fails or goes offline or gets partitioned, say there's a network partition where 1 of the servers of the ensemble is partitioned off from the main cluster, from the quorum, for some period of time, when that partition heals and the server rejoins the quorum, it will automatically update itself to the latest state of the service. So that's without any operator
[00:20:25] Unknown:
involvement at all. And can you dig into how zookeeper itself is architected and some of the evolution of the internal design over the years that it's been, built and maintained?
[00:20:38] Unknown:
Like I said, zookeeper has been around for quite a while, for about 12 years or so, and it's a Java application. So I think Java 1.4, 1.5 was the primary Java version at the time. So if you look at the code, there's definitely some legacy aspects of the fact that zookeeper was written at a time before many of the capabilities and niceties that have been added to Java over the years were available. So, you know, we use, like, synchronized locks in many cases rather than some of the higher level primitives that are available today. Once you have a system that works and is used in production by many folks and has been tested, you know, both directly and tested by time, it makes it hard to make changes to the core of the application. So 1 of the things you'll definitely see is that, you know, the code was written in a previous era, if you will.
The basic architecture of Zookeeper is a message passing system. So the clients pass messages to the server. The servers pass messages to each other. If you are familiar with the SEDA architecture, s e d a, which stands for staged event driven architecture: if I remember correctly, I think Eric Brewer was 1 of the folks who originally helped to define that. That's basically the underlying architecture of Zookeeper. Another aspect of our initial work and our initial principles was to try to build a system that favored availability and reliability over all else.
So zookeeper is your distributed coordinator. It is your distributed single point of failure, as I, you know, often like to say. So we definitely take very seriously the fact that zookeeper is the underlying engine for your system. So if something needs to fail over, it needs to be highly reliable, highly available. Otherwise, your system won't work. Obviously, people won't use Zookeeper if that's the case. So we definitely favored availability and reliability over all else. So many of the underlying implementation specific details, such as the use of SEDA, were chosen because we want to make the code as simple as possible, because we knew that we would make mistakes, and we wanted to limit the number of mistakes that we made.
So the main event pipeline, for example, for the leader and for the followers that process these change requests and read and write requests from the clients, has traditionally been single threaded as a result. We knew that we could improve performance, but at the same time we would reduce reliability, because the likelihood we would introduce bugs would be higher. So we actually reduced things. We limited the amount of, for example, threading that we did internally to improve availability and reliability, at the cost of the higher performance we could otherwise have had. Now recently, a number of folks have been doing work to improve performance and throughput, and those changes are starting to go in now. So there are some multi threaded pipeline pieces in the processing of the client requests.
And those are relatively recent changes, and they've been, you know, battle tested. The nice thing about having large companies that run zookeeper in production is that it allows us to really get some battle hardening before we, you know, ship it out to the rest of the folks. And the Apache process itself facilitates that as well, right, since it's open source. And so in terms of the design of the system,
[00:24:33] Unknown:
as far as where it falls on the axes of the cap theorem, it seems that you're favoring availability and partition tolerance at the expense of some amount of consistency, because in some cases, if 2 different parts of the system get a different response, it's not necessarily the end of the world as long as everything can keep running.
[00:24:53] Unknown:
Yeah. Surprisingly though, zookeeper is actually more of a CP system on the CAP side. So if you're a client and you make a request to Zookeeper, if you get a success back from zookeeper, you're guaranteed that it's consistent. And again, the service itself is using a leader election with a majority rule. So if a majority of the ensemble, so 2 out of 3 if you're running 3 servers in Zookeeper, are able to talk with each other and coordinate, the service is up and available. If you fall below the majority, then the service will stop and it won't be available.
So the client, as I said, the client is guaranteed that if it gets a success, that it's consistent. 1 of the trade offs that we make, however, is that we don't guarantee that the clients see a consistent view of the world. So as an individual client, you're always guaranteed that the world moves forward. You never move back in time, and the information that you're seeing is consistent at least with respect to the server that you're talking with. So I didn't mention this before, but the clients are talking to a particular server in the ensemble, in the quorum. And the servers that make up the ensemble will only broadcast themselves as available and will only accept connections from the clients if they're a member of the quorum. So if they're able to form part of that majority that I described, they'll allow clients to talk to them. They'll service requests.
However, you might have client A that is talking to 1 of the servers that's currently in the quorum, 1 of the majority servers, and that would be up to date. However, you might also have a client B that's talking to a follower of the ensemble, of the quorum, that has been periodically partitioned off from the service. And there are timeouts. So there are session timeouts for the clients. So when you establish a session with the service as a client, you specify a session timeout. I'm diverging here a bit, but if you're doing leader election, you're basically saying, I'm willing to tolerate a certain amount of time that I'm willing to be out of date. Of course, it's a distributed system, so you're always going to be out of date, right, fractionally.
But how much can you tolerate being out of date? So if you wanna be the leader and you can tolerate being out of date for 10 seconds, you would establish a session initially of 10 seconds. And what that guarantees you is you're always up to date with the main service within 10 seconds, and that's where that CAP aspect kind of comes into play. So if you're a client that's talking to a follower that's been partitioned from the service, you might not know about that for, let's say, 10 seconds. Again, your client will always see a consistent view of the world in that things are always moving forward. For example, say the server that you're connected to actually fails, you would have to disconnect and reconnect to another server. You're always guaranteed again that things will move forward in time, but you're not guaranteed that 2 clients will see a consistent view of the world, which can be a little bit challenging for folks who are building on top of zookeeper, but it kinda goes back to the, you know, the fallacies of distributed computing. You're always guaranteed that you're going to be behind. It's just a question of how much behind you are going to be. And in terms of building and maintaining the zookeeper project itself, particularly as a centralized
[00:28:36] Unknown:
component of distributed systems? What have you found to be some of the most complex or difficult aspects of, building and maintaining the project and educating users on how best to take advantage of it? I think in the early days, a
[00:28:52] Unknown:
big issue was just educating people and making them aware of the capabilities of the system, what it could do, what it couldn't do. We ourselves were working out, you know, what the best use cases were for ZooKeeper and when a different system would be a better choice. We were building community as part of the Apache aspect of building ZooKeeper. Initially, we were the only ones who were writing the code. Right? We were using that internally at Yahoo. And then once we moved to Apache, a big part of it was just bringing on additional folks, you know, like minded folks, as part of the community who could work with the community to continue zookeeper development, keep it focused on distributed computing, distributed coordination, versus potentially adding other features that weren't towards that main goal. I think over time, as the system has matured, there's a lot of information out there about ZooKeeper. We're starting to see a few additional systems over the last few years that are providing similar kinds of capabilities, so more people are educated.
I would say just the education part, especially initially, that was 1 of the biggest things. Today, I think we face a lot of issues just from the perspective of having a system that's been around for lots of years, and a lot of people using it and relying on it to build their systems on. So things like maintaining backward compatibility is a huge problem today, because we wanna make changes. We find a bug. We find a security issue. We wanna make changes. That really can be difficult because, again, zookeeper was, you know, initially created, let's say, 12 years ago. Things like Netty didn't exist at the time. Right? Java itself didn't support SSL at the time for network communication.
So we ended up building a lot of this ourselves. Right? Protobufs didn't exist, etcetera. So we ended up building our own version of such things. If I was building a system today, I'd be taking advantage of all these things, and it would make it a lot easier to do schema migration on my transport level, on my marshaling and unmarshaling of data, etcetera. We didn't have all those things, so we ended up building it in. And I think over time, just as you add layers of code, as you add complexity, as you try to fix bugs that you didn't expect, the code just gets more and more complex. I'd say that's 1 of the bigger challenges that we face today. Again, especially maintaining backward compatibility, because that is a value that people place on zookeeper as well. The fact that our versions are backward compatible, they can upgrade to the new version with fixes, with improvements, and not need to really change their code that much, if at all. And
[00:31:38] Unknown:
you mentioned that in terms of scaling for zookeeper, it's mostly for being able to improve the reliability of the cluster more so than the throughput, but that there have been some recent changes to improve overall performance of the system by introducing some of these threading capabilities. So I'm wondering what are some of the main scaling factors for zookeeper and the systems that rely on it and some of the edge cases that users should be aware of in terms of utilizing ZooKeeper and ensuring that it is operationally reliable?
[00:32:16] Unknown:
Sure. There's a lot to unpack there. Let me see if I can go through some of them. So 1 aspect I didn't talk about too much before, but under the covers, zookeeper is using a write ahead log and a snapshotting mechanism to store persistent state. Zookeeper is an in memory database effectively, so all of the information needs to be in memory. So that's certainly 1 thing to consider: when you're putting information into ZooKeeper, you're really limited by the heap size that you're allocating for the JVM that's running the servers. ZooKeeper itself is backing up that information using traditional WAL and snapshotting operations onto the file system. So we're using a write ahead log and a snapshotting mechanism in order to store the information to the file system. 1 of the primary limitations as a result is how frequently and how quickly we can fsync the write ahead log.
So as the clients are making changes, they're doing writes into the system. That's 1 of the primary limitations of Zookeeper in terms of throughput: we need to write the changes to the write ahead log and fsync it before we can respond to the client that the write operation that it's trying to perform was successful. So the client will propose a change to the zookeeper service, and that will get forwarded to the leader. As part of the atomic broadcast protocol, Zab, the leader will communicate with the followers, and as soon as a majority of the followers have fsynced the change to their disks as part of the write ahead log, that change will be communicated back as successful to the client. So that's 1 of the primary limitations of Zookeeper in terms of the throughput.
What we often see in operations is people are trying to colocate as many services as possible to reduce costs, etcetera, reduce their overhead in terms of the number of systems that they need to manage. So if you end up colocating Zookeeper with another service that is heavily utilizing the underlying disk, that can actually slow down Zookeeper quite a bit. We actually recommend that if you're using a mechanical drive, you dedicate a spindle. SSDs help in that regard, although SSDs have their own failure modes that need to be considered. Write cliff, in particular, has been an issue in the past, and that's an issue that's typically dependent on the firmware of the drive, which makes it particularly difficult to track down. But from an operations perspective, you wanna size the zookeeper service based on the number of failures that you wanna be able to support. So if you have 3 servers, as I mentioned, 1 of the servers can die. The service will still be available. If you have 5, 2 can die. 5 is kinda nice for production serving because, while it doesn't improve your throughput, it does mean that if you have to take 1 of the servers out for maintenance, you can still suffer an unexpected failure and the service will still be live.
Let's see. So, hopefully, that covers it. I'm sure I've missed probably a couple of your points, if you wanna redirect. But those are some of the primary things to consider from an operations perspective and in terms of sizing Zookeeper.
[00:35:45] Unknown:
Yeah. That that's all valuable information. And given that particularly the in memory aspects, it helps in terms of the instance type selection for somebody who might be running in a cloud environment to know that they should, optimize on the memory side more so than on the disk side, for instance. So, for people who are doing that deployment, it it's good to have that knowledge. And, you were talking about co locating services as far as ZooKeeper running alongside some other, server process, but, the other direction of using ZooKeeper as a multi tenant service, I'm wondering what the capacity is as far as being able to, for instance, use ZooKeeper as the coordination layer for maybe a Kafka cluster and a Flink cluster using the same underlying, ZooKeeper deployment?
[00:36:35] Unknown:
Yeah. That's a great question. For most distributed coordination efforts, you're not gonna be maxing out ZooKeeper. Right? You're going to have 3 machines that are trying to manage a distributed lock. You might have thousands of machines that are managing a distributed lock, but the number of operations that you're performing as part of that are relatively small. And zookeeper is also optimized for read heavy workloads. So, not something that I mentioned before, but in terms of multi tenancy and scalability, zookeeper is optimized for read heavy workloads. So the thought is that you're going to be doing a lot more reads than you are going to be doing writes. So let's say you're doing leader election. That requires a small set of writes, and then all the followers need to basically read the znode, in this particular case, read 1 of the pieces of data from ZooKeeper, and you're gonna have potentially large amounts of systems doing that. And ZooKeeper handles that particularly well. ZooKeeper is a hierarchical namespace. As I mentioned, it looks very much like a file system, and that does facilitate, you know, having a /apps, you know, /apps/kafka, /apps/solr, /apps/flink, etcetera.
And we have a very primitive chroot kind of facility. So when you have an application that's connecting to ZooKeeper, it can actually root itself in 1 of these sub namespaces. So you could, like, have a /apps/kafka, and Kafka could root itself at /apps/kafka. So its root would look like slash. So there are some multi tenancy capabilities built into Zookeeper. There's quite a bit of overhead if you're using Zookeeper properly. There's a lot of unused, again, it's not like you're doing large amounts of operations on zookeeper all the time. So there's unused capacity. Maybe that's the word that I'm looking for. But there's quite a bit of unused capacity, typically, in a zookeeper system if you're using it properly, because you're just not doing that much. Right? You're doing leadership election. It's kind of bursty. Like, when something fails, you might have a burst of activity. But for the most part, it's just quiescent. In terms of operations in a multi tenant environment, there definitely are some things that we could do, and we're seeing some. That's 1 of the areas that has had recent development efforts.
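The chroot facility he mentions is exposed through the client connection string: appending a path roots that client's view of the namespace at that znode. A small hedged sketch, with hypothetical hosts and paths, assuming the /apps/kafka subtree already exists on the ensemble:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ChrootedClient {
    public static void main(String[] args) throws Exception {
        // The "/apps/kafka" suffix chroots this session: "/" for this client maps to
        // /apps/kafka on the shared ensemble, so each tenant stays in its own subtree.
        ZooKeeper kafkaZk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181/apps/kafka", 15_000, event -> { });

        // This creates /apps/kafka/brokers on the ensemble, but the client only ever sees /brokers.
        kafkaZk.create("/brokers", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        kafkaZk.close();
    }
}
```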
Facebook uses zookeeper extremely heavily to do their internal configuration management and internal coordination. They've contributed a lot of changes. They're in the process of contributing a lot of changes, because they use it at an extremely high scale in a multi tenant environment. So I would say just the fact that we have the namespaces facilitates the multi tenancy aspect pretty well. But in my own day to day here at Cloudera, we're running Hadoop systems, and there are quite a few of the systems in the Hadoop ecosystem that rely on ZooKeeper nowadays for their distributed coordination, HDFS, HBase, etcetera. And those systems are all sharing a single ZooKeeper, typically, at least in a Cloudera deployment, for example. And that works perfectly fine. There's quite a bit of overhead or unused capacity, let's say, in a Zookeeper environment. Because you need to have, like, 3 or 5 servers to facilitate a zookeeper service, there's just a lot of extra capacity that you have based on the fact that you need multiple machines in order to facilitate that. And as you mentioned before,
[00:40:11] Unknown:
in recent years, there have been some other projects that were created and released for doing some of this centralized coordination of distributed systems, such as etcd or Consul. And I'm wondering how ZooKeeper compares with some of those other projects, and some of the cases where ZooKeeper might be the wrong choice and you would prefer to use something else? Yeah. That's a good question.
[00:40:36] Unknown:
Just as ZooKeeper itself, you know, looked at, and was inspired by, systems such as Chubby back when we were conceptualizing, you know, what we should build, obviously optimized for our particular use cases, I think the same thing has happened in these other spaces. So Consul, which is a great product, and etcd, facilitating a lot of, you know, the Go space, and especially with Kubernetes now, quite a few of the capabilities of Kubernetes rely directly on etcd. I think etcd in particular was influenced not only by ZooKeeper, but also by Raft.
So John Ousterhout and some of his folks did a bunch of work, more on the education side, I would say. So they took some of the concepts that had been floating around, kinda put them together again slightly differently, and that provides you with Raft, which is a great tool set for learning distributed coordination, especially if you're in that minority, I would say, of folks who need to implement distributed coordination yourself. Raft was really sort of conceived, at least in my opinion, more from an educational perspective, right, which is: Paxos is great, but it's really hard to implement properly.
Zookeeper is good because it provides you this service. But what if I need to build my own distributed consensus system? How should I go about doing that? What's the best way to do it? Like, how would I be successful? Raft documented really, really well how to do that. And then many of the systems such as etcd were able to take those concepts and build around it. So if you look at etcd, and I'm less familiar with Consul, but I believe it also to be the case, you know, they looked at some of the best things out of ZooKeeper and some of the other systems and, you know, implemented their own system based on that, again, for their specific use cases. Also, ZooKeeper was created, again, 12 years ago, before things like JSON and some of these other de facto standards were common. Java is the best interface that it provides. It also provides a C interface.
Many people have built their own language specific implementations of zookeeper clients on top of ZooKeeper. The same thing is happening here. Right? If you're in the Go space and you're building a Kubernetes app, of course, you're gonna use etcd. Right? Hopefully, you're not going to have to run 2 distributed coordination systems. I personally would hate to have to do that. So I think, if you're in an environment where you have a distributed coordination system already, you probably would want to lean in. And given the length of time that zookeeper has been around and the
[00:43:22] Unknown:
massive changes and developments that have occurred in that time span, I'm wondering how you have seen the needs of distributed systems engineers and the systems themselves change since you first began working with ZooKeeper.
[00:43:36] Unknown:
How has it changed over time? I think people are just much more educated now than they were when we originally started. As I mentioned, there wasn't, you know, there was Paxos and some of these other things, but I think building large distributed systems was a really sort of niche area. It's become more and more common. You know, if you go to Amazon and you wanna build an application, you know, of course, you're gonna wanna build something that has capabilities and is likely going to need to scale to the point where you're gonna have to run multiple machines. And you need to then consider the operational aspect as well, because you have multiple teams that need to work together. How have things changed over time? I think people have just become much more sophisticated.
The demands have just increased in terms of the kinds of capabilities that you need. The tolerances have, you know, just decreased. Right? Like, if it was an inch before, now it's a fraction of an inch. Right? And I think people rely on things like Consul, etcd, zookeeper more and more. Have they changed? I mean, I've been a little bit surprised that there haven't been more distributed coordination systems available. It's only relatively recently that etcd and Consul have really, you know, come out and become available. I guess maybe the thing that I've been more surprised at is that there aren't more systems like this. That's probably the thing that I've been most surprised by. And
[00:45:03] Unknown:
as you were saying, the code base itself has accrued a number of different, peculiarities because of the age of the original implementation. So if you were to start everything over today with greenfield and knowing what you know now, I'm curious what you would do differently and if you might choose a different language runtime to build it in, or if you've been happy with Java as the implementation target? I think we've been super happy with Java. It's a known quantity.
[00:45:33] Unknown:
Again, if you're building something, you're gonna spend, you know, months, a year, 2 years building a system. You're gonna spend many, many more years operating it. Right? Once you build a system that relies on another system, that thing is going to exist for a long period of time. And you mentioned a number of systems, Kafka, Flink, etcetera, and HBase that I mentioned, that currently rely on Zookeeper. That's not an easy thing to change from their perspective, either. 1 of the big changes today is that there are a lot more facilities available for building a distributed system, such as Netty and other capabilities for, like, transport level communication, protobufs, Avro, etcetera, that support schema migration, which is 1 of the areas that we kinda suffer from in ZooKeeper today, because our implementation is kind of bespoke. We didn't build in certain niceties like schema migration, which can make it difficult for us to change our on the wire protocol, which we occasionally need to do if we're adding more capabilities or trying to fix a bug. So, certainly, I would be leveraging a bunch of things that exist, a bunch of tools that exist today in the open source ecosystem that didn't exist at the time. I think Java was a great basis for that. I might be tempted to potentially use Rust today. But, again, getting back to the operational aspect, you know, I'd definitely wanna choose something that the people who are operating the system felt comfortable with, both in terms of making a system that's reliable and available, but also building something that I can operate, which would include things like monitoring
[00:47:09] Unknown:
and debugging. And we've listed some of the, well known projects that have built on top of ZooKeeper, but I'm wondering if there are any particularly interesting or unexpected ways that you've seen ZooKeeper used.
[00:47:23] Unknown:
Well, I've seen a number of sort of anti patterns used with zookeeper certainly through the years. I mean, 1 of the things I'm kind of consistently surprised by is the number of projects that continue to adopt zookeeper as their distributed coordination mechanism. So, you know, it's easy when you're making the sausage to kind of see the problems under the covers, see the bugs, and the kinds of issues that we're facing. But, you know, again, it's pretty encouraging to see that there are many new systems that are picking up ZooKeeper as their distributed coordination mechanism even though there are other systems available today that are potentially great options as well. I don't know. I can't think of anything off the top of my head that is particularly surprising. At the end of the day, there is a very common set of problems that people who are building systems need to solve, such as leader election, service discovery, locks, distributed queues, and, you know, systems are picking that up and using it. And those are the most common things that I see. I can't say that I see sort of crazy things. Nothing, at least, that comes to mind. And going forward, what are some of the plans that you have for the future of ZooKeeper, either in terms of improvements
[00:48:43] Unknown:
or new features or capabilities or anything in terms of the community or educational aspects of the project? Zookeeper is a
[00:48:53] Unknown:
community based project and has been for quite a long time at Apache, which is great, because people come and go, new requirements kinda come and go, and companies come and go. But at the end of the day, you always have that Apache community that you can go back to. We've been working really hard over the past few years to come out with an updated release that includes things such as the dynamic configuration support of ZooKeeper itself, so changing the ensemble dynamically. I'd say that's 1 of the biggest things that we've been working on, really kinda, like, shipping that. There's been quite a bit of work on security, and we continue to work on security, just hardening the whole system, both in terms of availability and reliability, but also in terms of security.
Getting so many eyes on it is great from that perspective as well. It's a pretty mature project, you know, being around for 12 years. I think 1 of the big things that we've been seeing recently, like I said, Facebook has been contributing a number of changes. They've been doing a lot of work on observability, because they use it so heavily throughout their infrastructure. Monitoring and managing of these clusters becomes even more important for them. So they've been contributing a lot back to zookeeper around observability, which I think is really great. So, making it even easier to manage. You mentioned multi tenancy before. I think that's 1 of the areas we could also do some more work on. And, you know, as we get more sophisticated in our use cases, multi tenant use cases in particular, that's probably 1 of the areas that we'll continue to focus on, or focus on more, but it is a pretty mature project. We've added some capabilities recently around, like, TTL, which I think is a capability that we identified out of, like, etcd, or maybe it was Consul. But we ourselves are seeing some of the changes that are going into these other systems and saying, hey, wouldn't it be great if you could set a time to live on zookeeper data, which is not a capability that we had until relatively recently?
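The time to live capability he refers to landed in the 3.5 line as TTL znodes: persistent znodes that the server may expire if they have no children and are not updated within their time to live. A hedged sketch, assuming a 3.5.3 or newer ensemble with extended types enabled and hypothetical paths:

```java
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class TtlNodeExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10_000, event -> { });

        // A persistent znode with a 5 minute TTL: if it has no children and is not
        // modified within that window, the server is allowed to delete it.
        // (Assumes the servers run with zookeeper.extendedTypesEnabled=true.)
        zk.create("/app/heartbeat", "alive".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_WITH_TTL,
                new Stat(), TimeUnit.MINUTES.toMillis(5));

        zk.close();
    }
}
```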
So we're learning things as we go along as well, which is great. And are there any other aspects of the zookeeper project
[00:51:05] Unknown:
or distributed systems, primitives, or centralized coordination for these types of systems that we didn't cover yet, which you think we should discuss before we close out the show?
[00:51:16] Unknown:
Other aspects. You know, I think we touched on it, but the operational side of things is a huge, huge issue. It's easy to provide these capabilities and to convince people to use these systems, but putting them in production and making sure that they're actually really reliable, really available, proving them out for the particular user's use case, is a lot of work and a lot of effort. And it's 1 of the main things that separates a successful project from an unsuccessful project.
[00:51:47] Unknown:
So I would say that that's 1 of the main areas. Alright. And for anybody who wants to get in touch with you and follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think 1 of the main things that I see, again, goes back to the
[00:52:12] Unknown:
operational aspect of things. 1 of the areas for concern that I have is around observability. So whether you're building on top of Zookeeper or you're trying to build a data pipeline or trying to build a data science model that you wanna get into production, there are going to be unexpected hiccups. Zookeeper still has bugs, surprisingly, after all these years, and, occasionally, you'll run into 1 of these bugs. How do you identify that issue? It's going to impact your system. How do you track that? How do you identify it? How do you debug it? How do you make sure that you get that notification and resolve the issue as quickly as possible?
Whether you're using ZooKeeper, whether you're writing a data pipeline where the data may change unexpectedly and suddenly your ETL processing is starting to fail, or you've written a model that you deployed into production for data science and you're getting unexpected results suddenly for some unexpected reason. How do you identify that before it impacts your business in a negative way? I see a lot of systems and a lot of work being done, and it's great work. But how do we integrate all these different, there are so many components these days that have to be pulled together to build a system. How do we make sure that we can run that in production and quickly identify problems and resolve them? That's 1 of the areas I would say that I spend a lot of time on today, helping people not only with Zookeeper, but in other aspects of the work that I do. And I think that's 1 of the areas that we need to continue to focus on as a community.
[00:53:51] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you're doing with ZooKeeper. It's a system that I've been aware of for a long time, and I've heard a lot of people talking about it. So it's great to be able to get more of a deep dive and a better understanding of the overall scope and history of the project. So thank you for that, and I hope you enjoy the rest of your day. Thank you for having me on and, chatting about Zookeeper. I really appreciate it. Thank
[00:54:17] Unknown:
you.
Introduction and Guest Background
Overview and History of Apache ZooKeeper
ZooKeeper Primitives and Use Cases
Operational Considerations and Scaling
ZooKeeper's Architecture and Evolution
Scaling Factors and Multi-Tenancy
Comparisons with Other Coordination Systems
Changes in Distributed Systems Engineering
Future Plans and Community Contributions
Operational Challenges and Observability