Summary
When working with large volumes of data that you need to access in parallel across multiple instances, you need a distributed filesystem that will scale with your workload. Even better is when that same system provides multiple paradigms for interacting with the underlying storage. Ceph is a highly available, highly scalable, and performant system that has support for object storage, block storage, and native filesystem access. In this episode Sage Weil, the creator and lead maintainer of the project, discusses how it got started, how it works, and how you can start using it on your infrastructure today. He also explains where it fits in the current landscape of distributed storage and the plans for future improvements.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Sage Weil about Ceph, an open source distributed file system that supports block storage, object storage, and a file system interface.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start with an overview of what Ceph is?
- What was the motivation for starting the project?
- What are some of the most common use cases for Ceph?
- There are a large variety of distributed file systems. How would you characterize Ceph as it compares to other options (e.g. HDFS, GlusterFS, LionFS, SeaweedFS, etc.)?
- Given that there is no single point of failure, what mechanisms do you use to mitigate the impact of network partitions?
- What mechanisms are available to ensure data integrity across the cluster?
- How is Ceph implemented and how has the design evolved over time?
- What is required to deploy and manage a Ceph cluster?
- What are the scaling factors for a cluster?
- What are the limitations?
- How does Ceph handle mixed write workloads with either a high volume of small files or a smaller volume of larger files?
- In services such as S3 the data is segregated from block storage options like EBS or EFS. Since Ceph provides all of those interfaces in one project is it possible to use each of those interfaces to the same data objects in a Ceph cluster?
- In what situations would you advise someone against using Ceph?
- What are some of the most interesting, unexpected, or challenging aspects of working with Ceph and the community?
- What are some of the plans that you have for the future of Ceph?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Ceph
- Red Hat
- DreamHost
- UC Santa Cruz
- Los Alamos National Labs
- Dream Objects
- OpenStack
- Proxmox
- POSIX
- GlusterFS
- Hadoop
- Ceph Architecture
- Paxos
- relatime
- Prometheus
- Zabbix
- Kubernetes
- NVMe
- DNS-SD
- Consul
- EtcD
- DNS SRV Record
- Zeroconf
- Bluestore
- XFS
- Erasure Coding
- NFS
- Seastar
- Rook
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And are you struggling to keep up with customer requests and letting errors slip into production? Wanna try some of the innovative ideas in this podcast but don't have time? DataKitchen's DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and datasets while improving quality.
Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement today and sign up for the newsletter at datakitchen.io/de. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen. And go to dataengineeringpodcast.com
[00:01:19] Unknown:
to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey. And today, I'm interviewing Sage Weil about Ceph, an open source distributed file system that supports block storage, object storage, and a native file system interface. So, Sage, could you start by introducing yourself? Sure.
[00:01:38] Unknown:
My name is Sage. I'm the Ceph project lead. I've been working on Ceph since the beginning, which is almost, I think, 13 years now. The first line of code was written in 2004, or thereabouts. These days, I work at Red Hat. I'm part of the office of the CTO, but my primary focus is on the Ceph upstream project. I'm setting the technical roadmap and trying to get as much time as I can to spend actually doing coding. Yeah. It's difficult.
[00:02:05] Unknown:
Some days you end up having to just write a bunch of prose. Lots of meetings. Lots of video conferencing. And do you remember how you first got involved in the area of data management and data storage? I
[00:02:17] Unknown:
cofounded a web hosting company out of college in the Los Angeles area, DreamHost. And so storing all the customer data was sort of one of the problems that we dealt with. That wasn't really my primary focus. But when I went to grad school at UC Santa Cruz, I joined a research group that was focused on storage, and became involved in a project that was funded by the national labs, Los Alamos, Livermore, and Sandia, basically focusing on building a petabyte scale distributed file system. And that's really where Ceph began.
[00:02:49] Unknown:
And as I understand the history, you ended up using Ceph fairly extensively at DreamHost as well to serve as sort of the underlayment
[00:02:57] Unknown:
for a lot of the virtual machines that you ran. Yeah. Well, it took us a long time to get there. So the project began as sort of a research project, and I think I underestimated the amount of work it would require to turn into something that you could actually use in production. So even after leaving Santa Cruz and focusing just on developing the upstream project, it was years before it was actually used in production at DreamHost or anywhere else. But I guess it was 2011, 2012 thereabouts, we launched an S3-compatible object storage service called DreamObjects, and it was also around that time that Ceph separately took off in the OpenStack space and was used by lots of people for block storage for virtual machines, and DreamHost was among many users in that space.
[00:03:42] Unknown:
And beyond just being used for the sort of platform layer for virtualization and cloud hosting, what are some of the most common and widely used use cases for Ceph?
[00:03:55] Unknown:
So Ceph provides object, block, and file, as well as sort of a lower level native object store. So there are sort of, like, four main interfaces that people use to talk to the storage system. The most common are definitely, currently, in the block storage space for virtual machines. There Ceph is used by the OpenStack community extensively, and by lots of other virtualization tools. Like, there's a system called Proxmox that's really popular, especially in Europe. And then also on the object side through the RADOS Gateway, which provides an S3- and Swift-compatible object storage service, both for individual clusters, and you can federate multiple clusters together. So those are, I'd say, where the bulk of the users are, in those two spaces. In the last few years, as we've finally declared that the file system is stable and production quality, the file system usage has also picked up significantly. But I think it's not quite as extensively used as block and object still. There are a large number of different
[00:05:00] Unknown:
distributed file systems out there, some being general purpose and others being purpose built for a particular use case. So how do you view Ceph
[00:05:05] Unknown:
in terms of the broader landscape of offerings for distributed file systems? Well, a couple of things. It was originally intended to be a distributed file system. And so the overall architecture was targeting high performance computing workloads on supercomputers with thousands and thousands of nodes, you know, piling into the same file system with these big, crazy workloads, you know, a zillion people creating files in the same directory, that sort of thing. That was really where it started. But the architecture that we came up with to support that use case was layered, in the sense that we had an underlying object storage layer that came to be known as RADOS. And then the file system functionality was really built on top of that. And I think it was sort of an accident of fate, it wasn't really what we intended, that that object layer was generically useful for all kinds of different storage.
So it was a few years later that we added the block device, which basically just stripes a virtual disk across objects. Initially it was pretty simple; now it's got all kinds of other features as well. And similarly, we built an S3-compatible RESTful object storage gateway that also stores its data inside RADOS. The system wasn't really intended to be all of those things initially. What really did motivate us, though, was sort of a relentless focus on scale, so that the way that RADOS was architected, it was designed to try to eliminate any sort of single point of failure, clearly, and also any sort of single point that could limit the overall scalability of the system. That, and a view that we wanted to build the system out of components that were expected to fail, whether it's a hard disk or SSD or a network card or a host, or an entire rack of servers. So the way that RADOS is put together is very different from the other distributed file system that was used at the time in HPC, which was Lustre, where you had pairs of nodes in a sort of traditional failover pair with dual-connected disk arrays and so on. There was really specialized hardware and sort of a traditional approach to doing storage. Ceph was very much of the view that when you're building a system that's really big, things are always gonna fail. Sometimes they're gonna fail in place. You're not even gonna be able to replace them. And you really wanna be able to build a system out of commodity components and have it still perform at a high level of reliability.
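For a concrete sense of what that underlying RADOS object layer looks like to a client, here is a minimal sketch using the Python librados bindings. The config file path and pool name are assumptions and would need to match an actual cluster; the block device, object gateway, and file system interfaces discussed above are all built on primitives like these.

    import rados

    # Connect using a local ceph.conf (this path is an assumption for the sketch).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on an existing pool (the pool name is hypothetical).
    ioctx = cluster.open_ioctx('mypool')

    # Write a raw RADOS object and read it back.
    ioctx.write_full('hello-object', b'hello from librados')
    print(ioctx.read('hello-object'))

    ioctx.close()
    cluster.shutdown()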
[00:07:17] Unknown:
So in the fact that you were designing for high failure rates and commodity hardware, you were, in a lot of ways, anticipating the current trend towards what's being referred to as cloud native architecture, and also a lot of the work that's being done on some of the systems like Hadoop, where you're relying on just off-the-shelf hardware or virtualization components to be able to build these massively scalable and reliable systems, rather than having to invest in very high cost purpose built hardware that doesn't necessarily allow you to scale as quickly and as easily
[00:07:57] Unknown:
in the cloud era. That's exactly right. And I think it's interesting that, my recollection at least of what we were thinking at the time, we really weren't thinking cloud at all. Like, cloud wasn't even really a buzzword at that point. This is 2004. We were really thinking about supercomputers. And it's interesting because today, the way that these systems are deployed, it's like, you know, a lab will buy a new supercomputer every, like, four or five years or whatever it is. And they deploy a bazillion racks, and it sits there and it doesn't change for the life cycle of that machine, which is gonna be, like, five or seven years, however long they keep it running. And then they, you know, tear it all down and build a new one the next building over. But when we were designing it, we were very much thinking about the idea that these systems had to be dynamic. So we were thinking beyond the way that the HPC systems were run at the time, that you would deploy new hardware incrementally. You'd have a mix of different types of hardware. Some stuff would fail. It'd be decommissioned. New stuff would be deployed and so on. So we were imagining a system with a much longer life cycle that would live continuously. But I don't know. It's interesting because I don't remember thinking of it in terms of cloud, but that's very much the way that the life cycles of real systems look today.
[00:09:09] Unknown:
Yeah. And particularly, the fact that it was built with no single point of failure is what allows it to continue to be relevant, whereas some of the older systems such as Gluster that you mentioned are starting to fade from the common discourse because they aren't really able to keep up in this cloud era where you have new instances coming into and out of existence very rapidly and automated deployment capabilities. So having that scalability and the ability to delegate across multiple nodes in the cluster for that scaling really makes it able to keep up as new types of workloads are introduced. Yeah. I mean, we definitely focus on,
[00:09:52] Unknown:
the idea that a cluster is a very dynamic entity. You're always adding or removing nodes, and they can sort of be put anywhere. And anything could happen at any time. To be fair, Gluster does have a lot of these same features, but it is true that it's a lot more structured in the way that you add and remove nodes. They have to be added in pairs, and the way that rebalancing happens is a little bit less flexible. But I think that the sort of relentless focus on unreliable hardware, being able to cope with that, and sort of arbitrary changes to the cluster configuration are what makes the Ceph architecture very compelling in these environments.
[00:10:25] Unknown:
And one of the most notorious issues in these types of distributed and scalable systems is network partitions and network failures, and the whole concept of the CAP theorem and the limitations that that imposes on computing structure. So what are some of the ways that you have worked to mitigate the impact of network partitions and some of those other types of events that can happen in these large clusters? So in Ceph,
[00:10:56] Unknown:
the way that we deal with partitions is by relying on the monitors, which are sort of the gatekeepers and cat herders of the system. They run our implementation of Paxos. And so you always have to have a majority of the monitors online and available in your partition in order to make progress in the system. But as far as the way that the actual data is laid out on disk and how we replicate writes and so on, we faced a design decision of whether we wanted to have sort of a two-phase rollback type system, or how complex and how involved we wanted to make the protocol of actually applying a write to multiple replicas. And what we ended up doing is sort of walking this fine line where the system, what's the word, optimistically assumes that all replicas are gonna be able to apply the write, and it does so. And the way that we maintain global consistency in the system is by relying on a global view of who the replicas are and which ones are allowed to write at a particular point in time, and that is managed by the monitors themselves. So if you end up in a network partition on the side that doesn't have enough of the monitors available to form a quorum, then you aren't able to form a view of the system that allows you to write. And so that half of the cluster isn't able to make progress.
And it means that we don't have to use, you know, a quote, unquote heavyweight protocol like Paxos for every IO, but instead only use Paxos for forming that consistent view of who is allowed to write, and then can use a more efficient protocol,
[00:12:29] Unknown:
to actually do it. And so in the event of a network partition, it sounds like you would go into a read-only mode where you're still able to access existing data, but the systems on the partitioned side just aren't able to create new data, so that you don't end up in an inconsistent state when that partition is resolved.
[00:12:49] Unknown:
Right. Although, in reality, you can't actually allow clients to read data either on the partitioned side, because they might be reading stale data. And the level of consistency that RADOS is trying to provide to clients is, I think the technical term is full serializability, although I'm not actually a database expert, so I might be wrong there. But for a user of, for example, a block device or a file system, those applications are expecting, you know, perfect read-after-write consistency. And so you can't actually allow them to read from the partitioned, stale side or whatever, because it might not be the most up to date version of that content. And also in the case of the POSIX compliant file system view, any read would also
[00:13:30] Unknown:
entail a write by modifying the atime or the mtime of the file that is being interacted with. If we
[00:13:38] Unknown:
aspired to do fully consistent atime, then, yeah, that'd be correct. In reality, we just don't really do atime because it's so inefficient and so expensive. What Linux has done is the relatime concept, where the atime is only updated if it's older than the mtime, so that you at least have this notion of whether the file has been accessed since it was last modified, which in reality is the only real semantic that actual user applications rely on. And to be fair, Ceph doesn't even actually implement that directly. We just rely on the client kernels that are actually mounting the file system to apply those atime updates when they're relevant.
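As a rough illustration of the relatime behavior described here (this is the Linux mount option semantics, not Ceph code): the atime is only refreshed on a read when it would otherwise look stale.

    # Sketch of the Linux relatime policy (illustrative, not Ceph code).
    ONE_DAY = 24 * 60 * 60

    def should_update_atime(atime: float, mtime: float, ctime: float, now: float) -> bool:
        if atime <= mtime or atime <= ctime:
            return True   # the file changed since it was last read
        if now - atime >= ONE_DAY:
            return True   # refresh a very stale atime at most once a day
        return False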
[00:14:19] Unknown:
And so in terms of the implementation of Ceph, how is the software actually architected, and what sorts of evolution has it gone through over the period of time that you've been working with it? Right. So the system is broken down into a
[00:14:38] Unknown:
set of, I think, four or five, six different daemons, depending on how far you go down the list. I already mentioned the monitors. There are usually, like, five of these in a cluster, maybe seven. If it's a small cluster, maybe you only have three. These are the nodes that are running Paxos and are managing sort of the critical cluster state about who's a member of the system, all the authentication metadata, and so on. What the current IP addresses of nodes are, whether they're up or down, that sort of thing. Those are managed by the monitors. And they're actually the only component in the system that has sort of a well known, advertised IP address. So when you quote unquote mount the system, whether you mount the file system literally or whether you're accessing a block device or whatever, they're the first ones you contact to authenticate and then find out about all the other daemons and nodes in the system and what their current state is. So those are the monitors. The next most important daemon is the OSD, the object storage daemon. And these are daemons that manage an individual local storage device, or possibly a pair of them. So usually there's one OSD per hard disk or per SSD.
And so in a typical box of, like, I don't know, 12 drives or something, you'd have 12 of these daemons running. And they're the ones who manage storing objects on that local device. And then they also talk to each other and to clients to service IO requests and to replicate writes or to migrate data between daemons. The driving concept with the way that Ceph and the RADOS object layer are put together is that these are smart daemons. So most distributed systems and most storage systems end up with something similar to an OSD. But it's usually sort of a passive entity, where it comes up and it sits there and waits for you to ask it to read and write data. And then something else external to them is reading data and writing it and moving things around. But they're just sort of sitting there as a passive entity, sort of like a hard disk. In contrast, the OSDs are more active, in the sense that they talk to each other.
They talk to the monitors to find out what the state of the system should be and where the data should be distributed. And then they work in cooperation with other OSDs to make that happen, by migrating data around and ensuring that IOs are properly ordered with respect to data migration tasks and so on. So together, the monitors and OSDs provide this low level RADOS pool of objects that everything else is built on. There's a relatively recent addition of what's called a manager daemon. That's sort of similar to the monitor, but it's stateless, and it's more pluggable.
So earlier on in the system, the monitor was also tasked with keeping track of stats about how much data is being stored and how it's being used, so we can implement things like df. All of that workload has been moved onto the manager, so that it isn't being, like, funneled through Paxos, which ended up being a performance issue and scaling bottleneck. And then the manager is also written in such a way that you can write Python plugins to implement various management style things. So for example, there are new modules that will expose Ceph health and metrics to Prometheus or InfluxDB, or, you know, send health reports to Zabbix. There's a new module in progress that's scraping SMART statistics, and it's gonna run a prediction model about disk failure. All those sorts of things are not critical to actually making sure the data is stored in the right place at the right time and isn't gonna get lost, but they are important background management functions for the cluster that are much easier to write in Python, and they sort of open up this whole ecosystem of contributions and integrations.
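To give a flavor of what those manager plugins look like, here is a rough sketch of a ceph-mgr module in Python; the base class and call names are approximate and vary between Ceph releases, so treat this as the general shape rather than a working module.

    # Approximate shape of a ceph-mgr module (names may differ by release).
    import time
    from mgr_module import MgrModule  # provided by the ceph-mgr runtime


    class Module(MgrModule):
        """Toy module that periodically logs cluster usage, in the spirit of the
        Prometheus, InfluxDB, and Zabbix modules mentioned in the episode."""

        def serve(self):
            self.run = True
            while self.run:
                df = self.get('df')  # usage stats tracked by the manager
                used = df.get('stats', {}).get('total_used_bytes')
                self.log.info("cluster bytes used: %s", used)
                time.sleep(60)

        def shutdown(self):
            self.run = False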
So that's a relatively recent addition to the system, and it's been pretty successful. We've seen, like, a half dozen user contributed modules already, adding all sorts of interesting new functionality. And then for the file system, there's another daemon called the metadata server that handles the file system namespace. So for the file system, clients talk directly to the OSDs to read and write file data, and then they talk to the metadata server to traverse the file system namespace and to list directories. And there's a very complicated client server protocol there for providing leases to the clients, to let them know when they're allowed to write directly to files and when they're not, when they have to, you know, do all the locking and so on, and to manage all the client caches to make sure they're fully coherent.
In the case of CephFS, we aim for fully coherent client caches, so that two processes interacting through the file system will have the same result whether they're running on the same host or running on different hosts talking to CephFS, which is a lot of work, but it provides a lot of value for applications. And then more recently, there are a couple other sort of extraneous things. Oh, there's the RADOS Gateway, I guess, I should mention. That's also a daemon that sits on top and provides an S3 endpoint. So it's allowing you to store S3-style RESTful objects in RADOS and exporting, you know, an API that implements a whole bunch of the stuff in S3. You know, everything from, like, quotas and ACLs and so on to doing multi site federation of Ceph clusters to replicate data across sites, and all that kind of stuff. Yeah. It's definitely a very
[00:19:41] Unknown:
broad project that seems to encompass a lot of different use cases. And in many ways, it appears as though you could use it as part of a multi cloud strategy to abstract out some of the specific services from different cloud providers
[00:19:56] Unknown:
and manage them on your own if that's something that you are particularly concerned with. Yeah. I mean, the scope of Ceph has definitely grown over time. Originally, it was just a file system, then it was also block and an S3 object store. And today, it has all kinds of multi site federation capabilities. And that's very much, I think, part of our roadmap too. As we look at the IT landscape, it's clear that organizations don't just have their, like, one data center where they put all their stuff, but they have multiple data centers and multiple sites at branch offices and so on. And then they're also leveraging the public cloud. And so tying all that together and managing data across them, this whole, like, new data services term, I guess, applies to the, you know, the mobility and the policy and insights into what data you're storing, where it is, and where it should be stored, and automating that management. That's very much where we're going. I think that's most relevant in the object storage space, because I think a lot of these new generation of cloud native applications are and should be targeting object storage, because the semantics of object storage provide the right fit for what the mobility and scalability requirements are gonna be.
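Because the RADOS Gateway speaks the S3 wire protocol, standard S3 tooling can be pointed at it. Below is a minimal boto3 sketch; the endpoint URL, credentials, and bucket name are placeholders for whatever a given RGW deployment exposes (7480 is just a common default port).

    import boto3

    # Point a standard S3 client at a RADOS Gateway endpoint (all values are placeholders).
    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.com:7480',
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )

    s3.create_bucket(Bucket='demo-bucket')
    s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'stored in Ceph via RGW')
    print(s3.get_object(Bucket='demo-bucket', Key='hello.txt')['Body'].read())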
So it's much easier to make an object store replicate across multiple sites in an asynchronous fashion, and do that in a consistent fashion, than it is doing that at, like, you know, a block device or a file system or whatever. It's much more challenging to make sure that you don't introduce conflicts into the system. So there's a lot of work going into the RGW roadmap to make that all a reality. We already have this multisite federation capability where you can set up multiple Ceph clusters all over the place and tie them all together into a single unified namespace of buckets and users. This is sort of analogous to what you get with S3 from AWS, where you have a bucket name and the system sort of knows via DNS and all its internal metadata where to send you, what region that bucket is actually stored in. RGW provides you a similar capability, in addition to letting you do disaster recovery style replication, and even active-active replication where you can write to either side of a replicated set, and it'll send the data in the other direction. But we're looking to extend this to a much broader set of capabilities as well, where we can, like, automatically manage the migration of buckets between sites, apply policy to where objects go, push individual objects out to tier them to external public cloud storage and possibly encrypt them, and the whole sort of gamut there. Yeah. And as we get more powerful abstractions for enabling some of these multi cloud capabilities,
[00:22:31] Unknown:
most notably things like Kubernetes and other container orchestrators that provide a unified API for being able to run regardless of what the underlying platform looks like, being able to also extend that into the file system and object storage capabilities makes what used to be a fantasy of multi cloud capability
[00:22:54] Unknown:
closer to being a reality. Yeah. Kubernetes is a really big piece of it. It sort of struck the right balance and the right note in allowing people to automate and abstract the way that applications are deployed across multiple clouds, or at least across a single cloud. I think the whole multi cloud operator is sort of the evolving next step, but lots of people are certainly working on that. But sort of the Achilles heel of that, or the elephant in the room or whatever metaphor you choose, is really that there's all this state associated with your application, all this data that also needs to be portable and accessible from multiple locations. You know, it's one thing to sort of redeploy your application on another cloud, but it's another to, like, migrate an existing application with a whole bunch of state, all the databases and objects and so on. Actually pick that up and move it somewhere else, and do that in a way that keeps that application running and online the entire time. And so I think today we have most of the underlying building blocks in Ceph to allow that to be possible with the object storage replication and federation capabilities. And even in the block storage layer, we have an asynchronous replication capability where you can take a block device and start replicating it to another site.
And it'll be, you know, a few seconds behind or whatever you configure, whatever the bound is on the asynchrony that you allow. But the missing piece is all the stuff that sits above this that actually orchestrates it and automates it. Because at the end of the day, the user wants to be able to say, this is my application and all the, you know, all the YAML stuff that you're feeding into Kubernetes to define how it's actually being deployed, and you just wanna, like, click a button that says move it to another cloud, and have it do that while it's all online. And so, you know, coordinating the movement of the storage, as well as the movement of the application that's consuming that storage, is sort of where the trick is. In principle, it's all possible to do manually, but it's really hard to do. Anybody who's actually tried to do a big data center migration knows this all too well. But we're getting to the point where we have all the tools to actually automate it. And so it's sort of getting to the point where we actually have to build them and make them available. And for somebody who's interested
[00:25:00] Unknown:
in getting started with running their own Ceph cluster, what are some of the considerations that they should be planning for in terms of the system capacity and network capacity and the
[00:25:16] Unknown:
scaling factors that are enabled by running Ceph? I think it really depends on what your plan is to use it for. So, I mean, I run a Ceph cluster in my basement, and the requirements there are very different than what you'd have if you're deploying something in a data center to support a real application. I think one of the first things to do is make sure you're buying the right type of storage devices. For, like, a hobbyist basement cluster, you know, consumer SSDs are probably fine because you're not actually writing that much data. But for most production systems, you have to be really careful to buy, you know, data center SSDs that have the number of write cycles that you're gonna need in order to actually have the cluster survive for whatever its intended lifetime is, at least for those devices.
That's one thing. It's a similar situation with the network. For casual usage, for just, like, you know, my personal data archive, gigabit Ethernet is probably fine. I don't really need a lot of performance out of it. But in real systems, usually you need 10 gig or 25 gig or whatever it is. And, you know, although you can cobble together all this random hardware and just, like, throw it at the cluster, and it'll slurp it up and you can build something that's replicated or erasure coded or whatever, the overall reliability of the system is sort of a complicated function of the reliability of all the individual components. So if you have, like, really flaky hard drives or old hard drives that you're just tossing into the system, you know, they're gonna fail at a pretty high rate. So you wanna make sure that you compensate by having a higher degree of replication or whatever it is. So, I mean, there's no sort of easy answer, I guess. It depends on what you're gonna do with it. But we see users doing the full gamut. Right? People are buying, you know, prebuilt, pre validated, high end hardware that's, like, all NVMe and very reliable, high performance networks, all the way down to people who are, you know, taking their Raspberry Pis and plugging in a USB stick or, like, the cheapest consumer SSDs they can find via, like, a SATA adapter or something. And it'll all work. It's just a question of how well. It's kind of fun to see what people will build.
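To make the point about compensating for flaky hardware with more replication concrete, here is a deliberately naive back-of-envelope model. It treats drive failures as independent and ignores rebuild windows, failure domains, and erasure coding, so the numbers are illustrative only.

    # Naive illustration: chance that every replica of a piece of data sits on a
    # drive that fails in a given year, assuming independent failures.
    def chance_all_replicas_lost(annual_failure_rate: float, replicas: int) -> float:
        return annual_failure_rate ** replicas

    for afr in (0.02, 0.10):          # 2% for decent drives, 10% for flaky ones
        for replicas in (2, 3):
            p = chance_all_replicas_lost(afr, replicas)
            print(f"AFR {afr:.0%}, {replicas}x replication -> ~{p:.6%} worst-case loss")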
You just have to, you know, hope that they're paying attention
[00:27:24] Unknown:
to what their real requirements are. And as you're scaling the system up and down, does it support things like service discovery, whether it's DNS-SD or Consul or etcd or anything like that? Internally,
[00:27:38] Unknown:
Ceph is written so that there's no manual configuration of, like, nodes and IP addresses and everything like that. Every daemon has an ephemeral IP address that it comes up and binds to and then registers with the monitor to make it discoverable to the rest of the cluster. So the monitors are really the only thing in the system that have fixed, quote unquote, IP addresses, and everything else is totally dynamic, depending on what happens to be running at the time. As far as the monitors go, there are a couple ways to deal with that. One is that you can either give them fixed addresses and just configure that on every node. An easier way to do it is to use DNS, where you can have an A record for every monitor. And then you sort of have one central place where the monitor IPs are registered, and everybody's configuration file just points to the DNS name. It's also possible to use SRV records and have the monitor IPs get discovered that way. But it's not entirely clear to me what value that actually adds over just having a DNS name that points to the cluster in that sense. As far as having sort of another level beyond that, where you're using sort of Zeroconf and you just sort of magically notice that there's a cluster on your LAN, we haven't done anything like that. I haven't really seen a strong use case for that, but I think in most cases where Ceph clusters are deployed, it's a big enough thing that you wanna have sort of a controlled way that you know which cluster you're talking about and where it is, and are explicitly referencing it instead of having too much magic around, like, finding it and consuming it.
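From a client's point of view, that bootstrapping looks roughly like the sketch below using the Python bindings: you hand the client the monitor addresses (directly, via a ceph.conf, or via a DNS name that resolves to them), and everything else is learned from the monitors. The addresses and keyring path here are placeholders.

    import rados

    # Bootstrap a client by pointing it at the monitors; OSD addresses and
    # up/down state are then learned from the maps the monitors hand out.
    cluster = rados.Rados(
        rados_id='admin',
        conf=dict(
            mon_host='10.0.0.1,10.0.0.2,10.0.0.3',          # placeholder monitor IPs
            keyring='/etc/ceph/ceph.client.admin.keyring',  # placeholder keyring path
        ),
    )
    cluster.connect()
    print(cluster.get_fsid())  # the cluster UUID, fetched from the monitors
    cluster.shutdown()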
[00:29:06] Unknown:
For file systems, there are any number of different types of workloads that are gonna be some mix of read and write, and they could be a high volume of small files or a small volume of larger files or some combination thereof. So what are some of the capabilities that Ceph has for being able to address these different types of workloads? Are there just different tunable parameters for the expected workload, or does it have some sort of dynamic allocation for being able to adjust to the actual real world use case? There are
[00:29:40] Unknown:
a lot of tunable parameters that are available, but our goal is always to not require anybody to actually ever have to touch them. And how far you can get to that ideal will vary depending on how weird the workload is. Generally speaking, Ceph does a pretty good job of dealing with whatever you throw at it, provided the hardware is appropriate. If you have, like, a high IOPS block workload with, like, random reads and writes, and you have, like, slow hard disks, then you're not gonna get very good performance. It's not gonna work super well. But assuming your hardware is appropriate, then it'll do the best that it can regardless. With the new storage back end that was recently introduced, called BlueStore, we sort of took over management of the individual storage devices.
Prior to that, we just layered on top of an existing file system, normally XFS. But with BlueStore, we consume raw block devices directly, and we have our own sort of custom built file system like thing that stores objects on those devices. And BlueStore does a number of things that try to make sure that it's storing data as efficiently as possible for the given object that you're storing. An example of that is the way we deal with checksums. So one of the big additions that BlueStore brings over layering things on top of XFS is that every byte that you write into BlueStore has a checksum associated with it that's stored. And so normally, if you're just writing objects, we'll checksum on a 4K block granularity.
But if there are hints from the client that the write is a sequential write and it's gonna be read back sequentially, then we'll checksum in larger chunks, you know, like 128K or something like that, because it's much more efficient. It results in less metadata that we have to store internally. The downside being that if you read a single 4K block, you have to read the full 128K chunk and verify the checksum on the large chunk just to return the 4K. But if your workload sort of knows that that's not gonna happen, then we can avoid that work. And the various components of Ceph are feeding those hints in. So for example, RADOS Gateway, when you're doing object storage, it's pretty much all big puts and big gets. The API supports doing sort of a range get, where you read a small piece of an object, but it's a rarely used part of the API. And so RGW is hinting that all the stuff that it writes is gonna be written sequentially and read sequentially, and so the way that it gets stored on the back end is much more efficient in terms of the way that the checksums are calculated and stored. Whereas in the block device case, it's sort of the opposite. We don't know what the IOs are gonna look like. They frequently are small and random, and so we always checksum on a small granularity.
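An illustrative sketch of the trade-off being described (a toy model, not BlueStore's actual logic): with a larger checksum granularity you store fewer checksums, but a small read has to pull in the whole checksummed chunk to verify it.

    # Toy model of checksum granularity versus read amplification.
    def bytes_read_for(request_bytes: int, csum_chunk: int) -> int:
        # Round the request up to whole checksummed chunks so they can be verified.
        chunks = -(-request_bytes // csum_chunk)  # ceiling division
        return chunks * csum_chunk

    for chunk in (4 * 1024, 128 * 1024):
        read = bytes_read_for(4 * 1024, chunk)    # a single 4K read
        print(f"checksum chunk {chunk // 1024}K -> read {read // 1024}K to return 4K")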
So that's just one example, I guess. But with that BlueStore back end, one of the goals in finally taking over that last mile and controlling the way the data is laid out on the block device is our ability to implement those sorts of heuristics. So, you know, BlueStore also supports compression, which has a whole bunch of similar trade offs, where depending on how big the chunks are that you compress, it changes how much read amplification you might have if you're reading small bits and pieces. But Ceph allows you to set hints and tuning parameters on pools or on file systems or on block devices or whatever, so that all that stuff eventually trickles down to the back end, and it can know, for any particular object that it's storing, how it should optimize
[00:32:56] Unknown:
the way that it's laying out data on disk. And one of the interesting aspects of Ceph as well is that for most platforms that people might be familiar with, the object storage is distinct from the block storage and distinct from shared file systems, whereas Ceph provides an abstraction layer on top of a sort of unified storage underlayment. Is there any capability for being able to interact with the same data from each of those different sort of abstraction layers on top of it?
[00:33:31] Unknown:
The answer is a little bit of both. So in designing the way that we lay out data for the file system, for the S3 object store, and for block, we sort of optimized for each of those use cases separately. So they're all consuming RADOS, but they're using RADOS and laying out their data in RADOS differently. And the idea there is that because the semantics of those APIs are different, they have different workload patterns and different requirements. And so it didn't make sense to layer our S3 object storage on top of a file directory, for example, mostly because of the way that objects are atomically replaced when you do a put and you might have versioning, and also because of the way that object enumeration in the S3 API works. You couldn't really check all the boxes and do it optimally and also do it the same way for two different sets of requirements.
So generally speaking, each protocol lays out data differently. That said, we've added support in the RGW gateway for exporting a bucket as NFS, because what we found in the real world is that what most organizations have today are, like, a bunch of NFS servers that they're dumping their data into. And they're trying to make this organizational transition to object storage, and they can't just do it overnight and they can't, like, rewrite everything all at once. And so they need to, a, copy all their data over, and, b, they frequently have some parts of their infrastructure that still want to dump all their data out over NFS, even though the new parts of their infrastructure are reading it out of the object store. And so having this sort of dual protocol access for objects is critical to making that transition. So that's something that we added, I guess, about two years ago now. We've talked about the idea of having RGW buckets that are sort of backed by a directory in the file system.
Also, so you could sort of have that dual protocol access to the file system storage. But we haven't seen a whole lot of user demand for that, because in most cases when you're using object storage, the object store is sort of where you're trying to get to, and having sort of suboptimal file access in order to get there is fine. And we don't really see that happening in the other direction. And on block, we haven't had a lot of need for it. Usually, when people wanna access a block device as a file, it's for backup or just to, like, import or export. And there's a whole set of tools around doing that already, and being able to, like, mount a directory on your server that shows your block devices as files isn't all that important. Although for a while, we did have a FUSE tool that did exactly that, but I think it's not supported.
It might even have been removed. I can't remember. And in terms
[00:35:58] Unknown:
of use cases, there are a huge variety because everybody needs to store data in some fashion, and there are multiple interfaces to that. And so it sounds like a big driver of Ceph within an organization is for being able to provide this multi cloud capacity or being able to migrate data across infrastructure. And when I was looking through the documentation, it looks like one of the pieces that's actually supported by that as well is for being able to use it as the underlying storage system for database engines
[00:36:34] Unknown:
or for sort of more big data workloads as well? Yeah. We see a number of common use cases fall out. Backing an OpenStack cloud, and virtualization generally, is one of them. Just general purpose object storage, where you have your reams of data that you wanna dump into a scalable and cost effective system, is another. One of the emerging use cases that we're seeing a lot of right now is the sort of quote unquote data lake, where you're storing a bunch of data in RGW in the object store, but then you're running Hadoop on top of it. And that's something that we've invested a lot of work in making perform really well. And we find that a lot of people find Ceph to be a good fit there, because they have sort of a mix of wanting to store a lot of data in a general purpose object store, with all the advantages of a rich S3 API that lets you access it, and also needing to run their analytics on top of it. In the past, you would have to have, like, a Hadoop cluster where you would copy all your data into HDFS and then run your analytics and then copy the results back or something like that. And this sort of eliminates that step. And interestingly, one of the motivating concepts of the way that HDFS was built was this idea that you'd break up the map part of the computation into pieces and run the compute right where the data is. In reality, with today's network speeds, that's become less of a requirement. There's really not a noticeable benefit from doing that. And so having everything go through, you know, a set of RADOS gateways or whatever is fine with today's hardware, when you're not optimizing for the gigabit links and slow hard drives that you were in the past. And what are some of the situations or use cases where you would advise against using Ceph and direct somebody to use an alternative project instead? This discussion actually came up recently for me when I was thinking about the NVMe over Fabrics sort of hardware trend and where that was going. And the thing that really came out of that is that Ceph is really focused on providing a reliable storage service. So every bit that you store in Ceph is replicated or erasure coded on multiple devices, and it's all set up so you can tolerate multiple points of failure and all that stuff. And there's a cost that's associated with that. Right? It costs latency. You have to buy more hardware and so on. And there are a lot of use cases that just don't require that, especially as we move further into the sort of cloud native way of building applications. A lot of, you know, worker nodes or containers are stateless. They need some storage because they're, like, deploying a little mini operating system and doing a bunch of stuff, but a lot of the data they store is either reproducible, or throwaway, or temporary, or can be rebuilt, or whatever. And you really don't need to pipe it into this, like, highly reliable storage system. So I think the biggest thing is to separate out your ephemeral storage from your reliable storage and think of those two sets of requirements. It's often easy to just deploy a cloud, you know, your OpenStack cloud, and back everything by Ceph because you just don't have to think about it.
But you're sort of throwing away some money when you do that, because a lot of times those VMs actually just need ephemeral storage that is fine to lose in the event that the system crashes or whatever happens. And I think the challenge there is really around building the tooling to make that easy to manage. One of the benefits of having sort of a unified storage system like Ceph that gives you file, block, and object is that you can basically take all of the storage devices in your system and feed them into Ceph and manage them all centrally, and it'll deal with balancing data and setting up different pools and different types of storage devices and all that, instead of having to manage different sets of hardware for different workloads.
And when you have a bunch of storage that's ephemeral and you actually don't wanna replicate it at all, that introduces sort of a second class of storage devices, and figuring out how much to allocate to each of those pools becomes a bit of a burden. So one of my goals is to figure out how to tackle that problem from sort of a whole data center perspective of how you allocate your storage to the system in a way that makes sense in a Kubernetes environment. I think that's something that eventually we're gonna have to make easier for users. And as somebody who has been involved in
[00:40:53] Unknown:
one project for so long, and particularly a project of this breadth, what have you found to be some of the most interesting or unexpected or challenging aspects of that work and of that effort of building Ceph and the community, or new things that you've learned that you didn't expect to have to encounter in the process?
[00:41:14] Unknown:
It's been a very educational experience as far as how communities work, and open source communities in general. I was definitely naive when I started and we open sourced Ceph. I thought that the contributions were just gonna pile in and everyone was gonna make it all better, and it was just gonna be sunshine and roses all the time. The reality is that building communities, especially developer communities, takes a lot of work, and it takes a long time, for one. I think the other thing is that it's tempting to try to make Ceph all things to all people, because there's such a breadth of users and a sort of crazy variety in, like, the environments that people wanna deploy Ceph in. And in theory, you can sort of fit it to do all kinds of different things, but you can't, because you have to limit the scope or else you'll just be overwhelmed with the complexity of the system. And when it comes down to it, there are certain things you have to choose about what you're optimizing for. And so figuring out what the right balance is, in terms of where to focus the project and where to target it and what not to do, is always hard. And I think the last thing that I didn't anticipate is that it's actually a lot of work to keep up with what people wanna do with the system. There are a lot of instances where you have a contribution from somebody who's, like, trying to implement some new feature. You know, they wanna add, you know, some geo replication capability to the file system, for example.
And oftentimes, you'll have a group of developers that go off and do a lot of work and actually build, like, a fairly significant set of changes to do it. And it's a lot of work for the core maintainers to actually find the time to look at it in detail and provide meaningful feedback, to figure out how to review it and fit that into where the larger project is going, and to try to get those developers involved with the core development sooner in the process, so that they build something that is aligned with where the project is going, and avoid instances where you invest a lot of time in something that's pretty complicated and it just isn't quite right and ends up not getting merged and being somewhat wasted effort.
It's a challenge. Right? Because you have a small set of core developers, and they have all this other stuff that they're trying to do, and figuring out the balance of how to engage with those people and get them to engage, and also get the things done that are sort of more on your immediate task list,
[00:43:26] Unknown:
is pretty hard. And what are some of the plans that you have going forward for the future of Ceph?
[00:43:32] Unknown:
World domination. No. I think at a very high level, my goal for Ceph is that the first choice for reliable storage in any data center environment should always be an open source option. One should never feel like the first thing that you reach for is a proprietary system, and one should never feel like you have to reach for the proprietary system because the open source system isn't good enough. It should be quite the opposite. And it should be that the place where, you know, the most innovation and the most exciting new development is happening is the open one, because that's the one that's most accessible to researchers and experimenters and all the rest of it. Right? It provides all the transparency. So that's sort of the high level goal. The more tangible, immediate goals for Ceph really center around a few key areas. One focus area is performance. You know, Ceph grew up with hard drives and later SSDs that were SAS and SATA attached. The new world is, like, PCI attached flash, and even, like, nonvolatile memory technologies that are sitting right on the memory bus. A whole new level of performance is required, and just the way that the software itself is written needs to change. So there's a big effort underway right now to refactor a lot of the code in the OSD in particular and rewrite it using new software frameworks. So we're looking at Seastar, which is a C++ framework that explicitly shards work across cores and tries to never move control across cores, so all the CPU caches work very efficiently. And it's all sort of futures based, run to completion style, putting software together. So that's one big effort.
And it'll enable us to use things like DPDK and SPDK, these software toolkits from Intel that allow you to move the drivers for both your NVMe drive and your network card into user space and into this whole sort of co-scheduled runtime environment for very high performance. So that's one effort. Performance is really important because hardware is changing. The other big area is around usability. Ceph was sort of written by Linux hackers and developers, whatever, and initially targeted, like, OpenStack cloud users who are very technical. And so it's built up this reputation of being sort of complicated, lots of knobs, hard to deploy, hard to manage, hard to understand. And that's inhibited adoption in a lot of other spaces. And so a lot of effort now is going into making the system more self managing, which is important both for sort of nonexpert users and also for making the system work at scale when, you know, you have 10,000 nodes. Like, you really can't have somebody who's looking at all the details to try to do things manually. You really need to automate all of it. So there's a big focus there. And then the last thing is really around platforms, and making Ceph work well with the emerging application platforms, specifically Kubernetes being the main one. Ceph should be, if it isn't already, the first choice and obvious choice for providing storage in a Kubernetes environment. And so we're investing heavily in projects like Rook, which is an operator for Ceph that automates the deployment and management and makes it very easy for Kubernetes application environments to consume
[00:46:38] Unknown:
Ceph storage. Briefly, the mention of Seastar is very interesting because I know that it's been very, sort of, battle tested in the ScyllaDB project where it was originally created. So it'll be interesting to see how that manifests in terms of its utility within Ceph. That definitely sounds very exciting. And are there any other aspects of the Ceph project or anything related to it that we didn't cover yet that you think we should discuss before we close out the show? I think we covered it. So for anybody who wants to follow the work that you're up to or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get your perspective on what you view as being the biggest gap in the tooling or technology for data management today?
[00:47:25] Unknown:
Let me think about that one. I think it's really around this emerging data services space, as organizations move to multi cloud and hybrid cloud environments. There isn't a lot of software and tooling and sort of underlying capabilities that allow you to manage data across these multiple footprints, and certainly no open source tools to do it. And so this is a key area where we're investing, and I think it's gonna be increasingly important as people, you know, shift from their more traditional IT styles into this new world of cloud native, multi cloud, hybrid cloud environments.
[00:47:59] Unknown:
Alright. Well, thank you very much for taking the time today to join me and discuss the work that you're doing with Ceph. And thank you also for the work that you have done on that project. It's one that I have been keeping an eye on for a while and will most likely be working on deploying in my own environment for shared file storage. So thank you very much for that, and I hope you enjoy the rest of your day. Yep. Thank you very much for having me.
Introduction and Sponsor Messages
Interview with Sage Weil on Ceph
Common Use Cases for Ceph
Handling Network Partitions and Failures
Ceph Architecture and Components
Getting Started with Ceph
Interacting with Data Across Different Abstraction Layers
Use Cases and Considerations for Ceph
Challenges and Learnings in Building Ceph
Future Plans for Ceph