Summary
Object storage is quickly becoming the unifying layer for data-intensive applications and analytics. Modern, cloud-oriented data warehouses and data lakes alike rely on the durability and ease of use that it provides. Amazon's S3 has become the de facto API for interacting with this class of service, so the team at MinIO has built a production-grade, easy-to-manage storage engine that replicates that interface. In this episode Anand Babu Periasamy shares the origin story for the MinIO platform, the myriad use cases that it supports, and the challenges that they have faced in replicating the functionality of S3. He also explains the technical implementation, innovative design, and broad vision for the project.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Anand Babu Periasamy about MinIO, the neutral, open source, enterprise grade object storage system.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you explain what MinIO is and its origin story?
- What are some of the main use cases that MinIO enables?
- How does MinIO compare to other object storage options and what benefits does it provide over other open source platforms?
- Your marketing focuses on the utility of MinIO for ML and AI workloads. What benefits does object storage provide as compared to distributed file systems? (e.g. HDFS, GlusterFS, Ceph)
- What are some of the challenges that you face in terms of maintaining compatibility with the S3 interface?
- What are the constraints and opportunities that are provided by adhering to that API?
- Can you describe how MinIO is implemented and the overall system design?
- How has that design evolved since you first began working on it?
- What assumptions did you have at the outset and how have they been challenged or updated?
- What are the axes for scaling that MinIO provides and how does it handle clustering?
- Where does it fall on the axes of availability and consistency in the CAP theorem?
- One of the useful features that you provide is efficient erasure coding, as well as protection against data corruption. How much overhead do those capabilities incur, in terms of computational efficiency and, in a clustered scenario, storage volume?
- For someone who is interested in running MinIO, what is involved in deploying and maintaining an installation of it?
- What are the cases where it makes sense to use MinIO in place of a cloud-native object store such as S3 or Google Cloud Storage?
- How do you approach project governance and sustainability?
- What are some of the most interesting/innovative/unexpected ways that you have seen MinIO used?
- What do you have planned for the future of MinIO?
Contact Info
- @abperiasamy on Twitter
- abperiasamy on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- MinIO
- GlusterFS
- Object Storage
- RedHat
- Bionics
- AWS S3
- Ceph
- SwiftStack
- POSIX
- HDFS
- Google BigQuery
- AzureML
- AWS SageMaker
- AWS Athena
- S3 Select
- Azure Blob Store
- BackBlaze
- Round Robin DNS
- Service Mesh
- Istio
- Envoy
- SmartStack
- Free Software
- RocksDB
- TanTan Blog Post
- Presto
- SparkML
- mc admin trace
- DTrace
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage and a 40 gigabit public network, you've got everything you need to run a fast, reliable and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.
And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI Conference, the Strata Data Conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona.
Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey. And today, I'm interviewing Anand Babu Periasamy about MinIO, the neutral, open source, enterprise grade object storage system. So, Anand, can you start by introducing yourself? Hi. This is,
[00:01:52] Unknown:
AB, or Anand Babu Periasamy. I'm one of the cofounders
[00:01:56] Unknown:
and CEO. And do you remember how you first got involved in the area of data management? No. It was kind of accidental.
[00:02:03] Unknown:
I did not start out this way. Our original plan, it happened in my previous startup when we did the Gluster project. Gluster, the name, came from cluster. It was supposed to be a shared memory distributed operating system. And within, like, six months of the project, our customers started telling us that their problem was not computing. They had petabytes of data sitting on tapes, and they wanted to move them to drives to do large scale simulation. I actually could not find a good file system that I could adopt. So we built a file system from scratch. That's how I fell into this market. And once you've come into data management, you cannot leave. Right? So I stayed on. It's funny how many times I've had that conversation
[00:02:57] Unknown:
where the actual product that somebody ends up offering had absolutely nothing to do with the original intent of their business or their project, and it was just because they happened to build something useful that everybody else then started asking for that they decided to turn it into their actual project and company. The most notable one that's coming to mind is TimescaleDB. Yeah. In fact, even the company name behind Gluster, no one knows. It's called Z Research. And at some point, Gluster became very popular
[00:03:26] Unknown:
because of GlusterFS. And we killed every other initiative and changed the company name to Gluster so people would recognize us. And Gluster became a file system company. It being open source, you know, if you build a good product, it gets noticed.
[00:03:43] Unknown:
And that's how it happened last time. And so now you can't seem to escape from file systems, and you have become the cofounder of MinIO, which is a different type of storage. So I'm wondering if you can just start by explaining a bit about what that product is and some of the origin story behind the project and the business. This time, well, MinIO is an object storage,
[00:04:06] Unknown:
most people think that I am doing this because I became a storage guy, because of my past experience with the file system. That's actually not the real story. I was at Red Hat for a while after the acquisition, to make sure the transition went smoothly and they were able to carry on without me. But what actually happened was I was thinking about doing something more fun. Me and a small team were working on bionics, like the ability to see through skin; you could close your eyes and see things. I wanted to do something more fun. But then, when it came to actually doing a startup, the moment you have VCs involved,
you want to do something that makes an impact in the next 6 months, the next 12 months; you want to show progress. Right? And I went back to asking a very simple question, and the object storage was an accident of that. The real question I asked was: 10 years from now, what problem, if you pick it today, will still stay relevant? Not only stay relevant, it has to compound and grow; don't pick something that is trendy and short term. Everywhere I looked, the direction was simply that the world will produce more and more data. And it was never about storage, right? GlusterFS was about building a storage system, but data is at the heart of every modern enterprise, and it is how you build a powerful brand of trust and love. And then I saw that everyone was struggling with just managing vast amounts of data. The closest thing they saw was HDFS and some distributed file systems like Lustre.
And in 2008, it became quite clear, even while I was doing Gluster, that Amazon would convince the rest of the world that if you are willing to let go of all the legacy interfaces like file and block, you can build a significantly better data storage system. So I saw that building an object storage was a good starting point. My simple thesis was that if the data sits on a technology that we built, we can do many powerful things on top, and a good starting point would be to build an object storage. Everyone else thought object storage is a hard problem to go attack. And I thought that object storage is simply a distributed set of web servers. Right? And we started out with the object storage, the idea being: if you are inside AWS, you should use Amazon S3. But if you are outside AWS, what choice have you got? That's the part that MinIO should address. And my calculation, and it's also a bet, right, is: the world will produce so much data; what percentage of the world's data will be sitting on Amazon S3 compared to the rest? I saw that the bulk of the world's data would be outside of Amazon S3. And I wanted to give the rest of the world a powerful, open source, better alternative to Amazon S3. And it turned out to be great in the last 4 years. It really picked up virally. So in the object storage space, there are a number of different contenders, the most notable being S3 as you said, but there are also products from Google and Azure. And then there are also some other open source offerings, most notably the Swift object store from OpenStack, as well as Ceph. And I'm curious if you can just characterize the overall landscape of object storage, both now and at the time when you were first trying to tackle this problem,
[00:07:53] Unknown:
and some of the main use cases that you see for object storage, and how that has evolved over the past few years? There are plenty of choices, which is a good thing. But,
[00:08:03] Unknown:
if I am going to just add another alternative to one of those systems, if it's about, like, KDE and GNOME, or Emacs and vi, there is really no good reason to just add another flavor to it. This is a market where users want to standardize on one system that is most proven. You can imagine, if there are many choices, you pick one of them because it somehow caught your eye, and then tomorrow the project is not well maintained; your data is going to be held hostage. Right? The real reason why we had to do Gluster back then, and this time MinIO, is that I actually could not put my own data in a system that was out there. Back when I built GlusterFS, every other system was so complicated with metadata management.
If your metadata gets corrupted, you're just out of luck, right? Complex systems don't scale. And this time around, when it came to object storage, Amazon showed something very powerful: if you strip down everything and reduce it to basic get/put/list-type atomic, immutable operations, you would actually be able to significantly simplify the whole object storage implementation. And everywhere else I looked, either they built a block store or a file system or a unified storage system and added an S3 object API gateway. If not, they are a SaaS product. Right? Like Google Cloud and others are SaaS products, and they're incompatible with Amazon S3.
What we wanted was something that is fully S3 compatible and built for the object API from scratch, something that did only one thing really, really well. Even some of the others, SwiftStack, for example, with the Swift API: you know the fate of OpenStack itself. Right? Kubernetes took off and OpenStack has not made any dent in terms of replacing Amazon for the rest of the private cloud. And with Swift, when I started, I asked the community: Swift API versus S3 API, pick one. And it was very clear the industry wanted the S3 API. And it's not just the API itself. If you look underneath most of these object storage systems, either they are a file or block store with an object API gateway on top, or they have built a file-system-like storage system and then added an object API gateway.
MinIO is a very different breed from the ground up. It is actually designed to be a single layer object storage system. I'll go into the details later. But by architecture, by implementation, it's a very different system. And why does it matter to the end user? It is really about a very simple system that can do very powerful things. So why is it important today? I found that the data is only growing bigger and bigger, and every enterprise today is struggling with the data management problem, and with performance and scale. Everything has grown many fold. The traditional systems don't scale to handle these kinds of workloads in terms of performance and data reliability at scale. If you have tried any one of those systems, just installing them requires significant expertise; then think about operating them at scale. And another thing with Ceph is that, as you said, it has this API gateway for being able to provide the object interface.
[00:11:58] Unknown:
And at the foundational layer, it's essentially a distributed key value store for storing the bits and bytes of the files, but there still isn't any compatibility layer between the object store and the POSIX interface for being able to interact with the same files in a different format. So you don't really get any real benefit from running that system if all you really care about is the object interface. And because it's adding additional abstractions, you're adding additional overhead as well. And I know that one of the factors that you're focusing on for MinIO in terms of positioning and feature set is the speed as well as the S3 compatibility, which I know is sometimes lacking with Swift. And I also noticed that a lot of your positioning is based around the use of MinIO and object storage for use cases around machine learning and artificial intelligence workloads. And I'm curious how you have seen the overall market for object storage for those types of use cases, as compared to a distributed POSIX interface? So the use case part evolved organically.
[00:13:06] Unknown:
As we watched the market in terms of the community, how they grew, and what all they did with it, the AI/ML part, if you track the history of the project, you would have seen happened only in the last 1 or 2 years. When we looked at the use cases, it was all over the map. And as open source, you can't really control what all they use it for. But what we found was that the bulk of the enterprise data was actually sitting in HDFS and very little on scale-out NAS. In the public cloud, the shift already happened. Right? Every major database, analytics, and machine learning system, from Google BigQuery to Azure ML, Power BI, SageMaker, EMR, Athena, everything you look at, they are built on object storage. Private cloud is still sitting on HDFS, and users are struggling with managing that vast amount of data on top of the complexity of HDFS and Hadoop. They want their on-prem infrastructure to reflect how AWS is built. This is where Kubernetes took the computing side, and they want object storage to be the data management side. And this drew the AI/ML and big data applications to MinIO.
And for us, the performance was there; the other object storage systems were eventually consistent and the performance was not as good. Most of them are built for archival needs. Right? Sure, you can put MinIO on a hard-drive-based system and use it for archival. It'll be cheaper and faster even for archival needs. But where it really differentiated itself compared to other projects out there was the performance and the business critical needs. Simplicity is what the users like. And we naturally gravitated towards that use case. And then in terms of the
[00:15:10] Unknown:
S3 API compatibility, that provides a certain number of constraints, where you are committing yourself to ensuring that you have this interface for the object storage. And I'm wondering what challenges you have faced in terms of ensuring the completeness of adherence to their API, particularly as it has changed and evolved, most notably with the S3 Select capability. And then also, with the fact that you are accepting these constraints,
[00:15:39] Unknown:
what have you found to be the areas of innovation for the product? So, the good and bad part of the S3 API, I would say it's mostly good. The only bad part is that it is not an open standard API. Right? It is a standard simply because it's the most popular implementation. And I don't regret it, because it is true for pretty much any system out there, whether it's your Apple charger or pretty much any device, any standard out there: if you are the most popular implementation, you automatically become the standard. And it's something that we don't control, or the industry experts don't control. But an opinionated player like Amazon driving the direction and controlling it instead of a consortium, I'm actually fine with that.
Now the details are where it mattered to us. If you look at the AWS REST API spec, the spec is just like a guideline, and how exactly the API is implemented is quite nuanced. The problem we found was that if you took different SDKs, different tools, even developed just by Amazon itself, right, whether it's the AWS Java SDK, different versions of the Java SDK, if you compare from old to new, you will see that they are slightly different. And different SDKs, different language bindings, they actually have different interpretations too.
You will also see open source tools; sometimes they even have bugs. When the tools have bugs, the API on the server side needs to be forgiving, the way Amazon's S3 service is forgiving with whatever is hitting the server. Now when it comes to us mimicking exact compatibility, the challenge we have is that it only takes one API to break. And it's not just one API. The APIs themselves have different variations and sometimes even bugs. Right? We have to make sure that every single detail of it is captured. And we found that the only way you can get to that level of granularity and correctness is you have to be focused. This is where we decided very early on to do one thing really, really well. And when we decided that it is the S3 API, we never turned back. Right?
Within the S3 API, if you notice the S3 V4 API, we were the first to implement S3 V4. Everybody else either copied our code or they just copied our product itself into their product offering. And since then, anything Amazon introduced, we will have it right away. And like you mentioned, S3 Select, for example: only Amazon has S3 Select, and we have it. If anybody else has S3 Select, it is simply MinIO in their product. By being focused, we are able to keep up with Amazon. When Amazon introduced S3 Select, we were alongside; we implemented S3 Select. In fact, before S3 Select, we had a more powerful implementation inside MinIO: you could upload JavaScript as part of your REST call, the JavaScript gets executed on the data, and the output of the JavaScript function would be the output of the object. But anytime Amazon innovates, we wanted to actually be closer and be compatible with Amazon. We constantly remove features and rewrite to be compatible with Amazon. It's only a good thing for the users, because in the short term, if I lock somebody in to MinIO, in the long term it's bad for the user, and what is bad for the user is bad for us too. So we stay very close to the Amazon S3 API. Everyone claims that they are S3 compatible, but the details are where it matters. Right? For us, given the scale of adoption, we are able to keep up with the S3 compatibility.
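As an illustration of that compatibility, the stock SDKs that target Amazon work unchanged against a MinIO endpoint, including for S3 Select. A minimal boto3 sketch, assuming a MinIO server at a placeholder endpoint with placeholder credentials and an already uploaded CSV object:

```python
# Hypothetical sketch: S3 Select against a MinIO endpoint via boto3.
# Endpoint, credentials, bucket, key, and schema are placeholder assumptions.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # a MinIO server, not AWS
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

resp = s3.select_object_content(
    Bucket="mybucket",
    Key="people.csv",
    ExpressionType="SQL",
    Expression="SELECT s.name FROM s3object s WHERE CAST(s.age AS INT) > 30",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response payload is an event stream; Records events carry query output.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```

The point is that the client side is identical to code written against Amazon S3 itself; only endpoint_url and the credentials differ.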
[00:20:12] Unknown:
Continuing on the idea of data lock-in and providing a consistent interface to make it easier for people to move their workloads between different environments, I was also noticing the federation capabilities that you've built into MinIO, both in terms of federating between different clusters of MinIO itself, and also in terms of providing an API compatibility layer for placing in front of things such as Google Cloud Storage. And I'm wondering if you can just talk through some of the strategies and implementation that go into that overall idea of federation and providing those compatibility layers, to make it easier for people to migrate their workloads while keeping their client code the same? Yeah. So the compatibility part, like an API gateway,
[00:21:01] Unknown:
was relatively easy for us to do, because of the design that we adopted inside MinIO: everything from erasure code to bit rot protection, all those capabilities are at the object level granularity. So all we had to do was make the erasure code module pluggable so we could write other storage adapters and make everybody else look Amazon S3 compatible. But the story behind how it happened is the interesting piece. Right? Microsoft actually was the first to ask for this feature. Microsoft came to us and asked if we could make their Blob store S3 compatible. And I showed them the data on how many instances of MinIO they were running inside Azure.
I think Azure was the third most popular deployment base for MinIO, and second is Google Cloud. And in fact, the number one deployment base for MinIO is actually Amazon: AWS EC2 on EBS. Yeah, there are, like, 600,000 plus, I think now 700,000 plus, unique IPs of MinIO running inside Amazon itself. Now when people are running inside public cloud, I showed Microsoft that, look, there is already a lot of MinIO running inside Azure. So when people want S3 compatibility, they just run MinIO on your cloud. But what Microsoft wanted was something different. They told me that when MinIO was running on Azure, it was running on file and block storage, and that was not very useful for them. The reason being, if you uploaded data to MinIO on Azure running on a file or block share, then while you can read and write that data through the S3 API, you can't run Azure ML or Power BI or any other Azure cloud service on that data, because they don't speak the S3 API.
But what was interesting to us was they told us that those services can't even read the data sitting on their file and block; they're all built on object storage. And what Microsoft wanted us to do was to make MinIO store the data on top of Azure Blob Store. And for me, it's like, wait a minute. Right? What you're asking me is to put object storage on top of object storage. Why would I do that? And they explained to me that all the other cloud services inside Azure only speak the Blob API. The second part was that they treated file and block as legacy, and us storing the user data on file and block meant staying in that legacy enterprise pattern. Making every other system that is not S3 compatible become S3 compatible is also a very important problem for us, right? Today, the biggest problem that the industry is facing is that applications are still speaking legacy APIs. They need to be rewritten to be cloud compatible.
And every cloud being a different API is not helping it. If we made everybody look like the S3 API, more applications would be speaking the S3 API, and migration, whether to Kubernetes or MinIO or public cloud, gets easier even if you want to stay with your existing legacy. Say your NAS and SAN systems: we can make even them look like S3. Even HDFS can look like S3; Backblaze, Alibaba Cloud, Google Cloud, we made everybody look like S3. And what it did to the industry was make a huge number of applications and the private cloud market adopt S3, now that there are more applications.
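A hedged sketch of what that uniformity means for client code: with boto3, only the endpoint and credentials change between a MinIO server and a MinIO gateway fronting another backend; the application logic is identical to code written for Amazon S3. All hostnames, credentials, and bucket names below are placeholders.

```python
# Hypothetical sketch: the same S3 client code against different backends.
import boto3

def make_client(endpoint, key, secret):
    """Build an S3 client for any S3-compatible endpoint."""
    return boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id=key,
        aws_secret_access_key=secret,
    )

# A MinIO server on premises, and a MinIO gateway fronting Azure Blob or GCS:
# to the application they are indistinguishable from Amazon S3.
onprem = make_client("http://minio.internal:9000", "ACCESS", "SECRET")
gateway = make_client("http://minio-gw.internal:9000", "ACCESS", "SECRET")

for client in (onprem, gateway):
    client.put_object(Bucket="mybucket", Key="hello.txt", Body=b"hello")
    print(client.get_object(Bucket="mybucket", Key="hello.txt")["Body"].read())
```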
[00:25:09] Unknown:
And they also see they have been built on MinIO in most cases. And when they go to production, they just carry us along with them. It's funny that it was actually a request from Microsoft that led to this feature. And I like the point that you made, that the S3 API is about more than just the storage, because by allowing people to interface with these other object stores, such as in the Azure case where that's all the other services know how to talk, or in the Google case where you're able to load data into Google Cloud Storage and then from there directly load it into BigQuery, it simplifies adoption of not just the storage system, but also additional
[00:25:51] Unknown:
services and capabilities that you wouldn't necessarily think of at first. And you can see, right there, it doesn't really help us at all. We are actually helping Google Cloud or other public cloud services. But I think in the end, if you do what is right, if you are altruistic,
[00:26:06] Unknown:
it actually pays off. It is profitable to be altruistic. One of the challenges I'm sure that you faced on that front is how to map some of these S3 APIs, particularly the Select capability, onto the other object stores that don't necessarily have them natively. And I'm curious how much effort that was, and how much involvement you've had from those different companies, in terms of either working with you to add those capabilities to your code base, or in terms of modifying the capabilities of their platforms to make it easier for you to implement?
[00:26:41] Unknown:
So there are actually some minute details that matter. Say, for example, whether it's encryption or S3 Select, features like that: the way we did it in the gateway layer is, if the back end does not support the functionality and I supported it, then if you go directly to the back end, can you still read the data? This detail actually matters a lot. What users really want is for us to be a transparent layer on top of the existing system, whether it's an existing cloud storage provider or a NAS, like an NFS mount point, or even your ZFS; even SAN vendors do that. What they really want is their existing data. So if you mount your NFS volume, and you already have, say, 2 terabytes of data sitting on that NFS volume, you're not going to migrate that data to MinIO. If I write the data in a proprietary format on your NFS, you can access the data through MinIO's S3-compatible object API, but if you go to the back end directly, it will all be binary blob type data, and you won't be able to access it. That actually is not a good idea. Right? So what people really want is that whatever data they put through MinIO is written natively: like a file on NFS, a GCS object in Google Cloud, or a blob in Microsoft's cloud. We need to translate it and write it natively to the back end. Now when we do that, there are some features that the back end does not implement. For example, even the encryption APIs that Amazon has, SSE, server side encryption with client supplied keys, and SSE-S3 and KMS, I can't really do a full translation to the back end. Now if I encrypt it at our gateway layer, the catch is that now I have encrypted it with MinIO. If you go to the back end directly, the Blob API won't know how to decrypt the object that we encrypted.
So how did we handle cases like these? We actually implemented the functionality. In cases where we could translate it completely, by somehow masking it, we did it transparently, but we never wanted to write it in some proprietary format. Encryption is a case where some of the large financial customers wanted the data to be encrypted before it left their data center, and they were using public cloud just as a DR copy. So it made sense for them to use the encryption feature. But for the rest of them, generally, what I tell them is to be selective about features; nowadays, object storage has more features than you need.
Generally, don't pick features that are very specific, such that if you move, let's say from Azure to Google Cloud, you are going to miss that feature; it's okay for you to not use it. It's really important not to adopt features that will hold you hostage. If you use one, you have to have a very strong reason. Right? In our case, with MinIO being a complete alternative, we were able to catch up with everything that Amazon has that the rest of the world needs. In fact, in some cases, like WORM-like capabilities, we were able to add more without breaking the compatibility.
But for the gateway part, it's not possible for us to mimic a subset of the APIs without breaking the idea of writing the data in a native back end format. So we implemented them as options. The users are pretty knowledgeable; they know how to turn them on and off depending on their needs, and we have good guidelines on how to make these choices.
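To make that encryption trade-off concrete, a hedged boto3 sketch of SSE-C, where the client supplies the key on every request, so reads through the gateway must present the same key and the backend sees only ciphertext. The endpoint, credentials, and names are placeholders, and the key handling is illustrative only (note that SSE-C requires a TLS endpoint in practice):

```python
# Hypothetical sketch: server-side encryption with client-supplied keys (SSE-C)
# through an S3-compatible endpoint. All names and the key are placeholders.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.internal:9000",  # SSE-C requires TLS
    aws_access_key_id="ACCESS",
    aws_secret_access_key="SECRET",
)

key = os.urandom(32)  # a 256-bit key; in practice it would come from your KMS

s3.put_object(
    Bucket="mybucket",
    Key="secret.bin",
    Body=b"sensitive payload",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=key,  # boto3 base64-encodes the key and adds its MD5
)

# Reads must present the same key; anything going to the backend directly
# sees only ciphertext, which is exactly the trade-off described above.
obj = s3.get_object(
    Bucket="mybucket",
    Key="secret.bin",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=key,
)
print(obj["Body"].read())
```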
[00:30:54] Unknown:
In terms of the actual implementation of MinIO, I'm wondering if you can talk through the overall system architecture and some of the ways that it has evolved, and then also some of the additional services that you've built alongside it, because I know that you have replicated some of the functionality around the key management store and IAM, or identity and access management, and just some of the overall tertiary projects that have come to be necessary as you evolve the overall capabilities and reach of MinIO? So the architecture is where it fundamentally differentiates.
[00:31:31] Unknown:
It is also the reason why we saw that object storage was the right starting point for us. If I am going to implement another textbook theory of object storage, it wouldn't be any better, right, other than for cosmetic reasons. What I saw was that when you look at the S3 API, it is simply a get/put/list-type atomic, immutable web service API. Right? And if that is the case, your entire storage stack could be reduced to just a web server. And MinIO, if you have noticed, you just download the static binary and run it.
And if you have a distributed set of machines, you just download that same binary to all the machines and type the same exact command line, and they all cluster themselves up. There is not even an installation process. Right? It's a static binary: download and just run. That is where the real differentiation starts. It is a web server at its heart. All of the storage layers got collapsed into a single layer. Everybody else, if you look at the design, there is a distributed block layer that handles erasure code or replication or whatever, then there is a virtual file system layer, and they have a multi-protocol API gateway on top. This is where they get it wrong, right? They end up actually following the same old file system theories that are not applicable today. It felt like the same arguments from when Gluster was implemented in user space: even the kernel hackers thought that user space file systems were toys and we would never be able to build a meaningful file system. But in every benchmark, we showed that we were faster than the kernel file systems. There were no good kernel based file systems out there. And now I see the same thing, made even more simple.
An object storage server is actually not a file system with an API gateway that speaks S3. Instead, an object storage server is a web server, and a distributed object storage system is nothing but a collection of web servers that are stateless, distributed, and cooperating. Right? So we collapsed everything into just a web server. And all of the storage functions, like erasure code, locking, bit rot check, you know, encryption, all of these functionalities, where do they go? They are simply web handlers attached to the web server. Literally, as soon as the API entry comes in, the muxer decodes the S3 API. Mostly they are XML content as part of the HTTP request body. It translates that into a generic object API. And then when the data comes, it's simply atomic blobs of the object. Your erasure code or bit rot protection or your encryption or whatever you want to do, they are simply stateless transformation functions.
Once an object is erasure coded, you've got a collection of blocks, and you scatter them across other web servers. It's a really simple design, and that is what made MinIO a very different breed: there are no multiple layers, there is no metadata database, and it is a collection of web servers with a cooperating distributed lock. And this made it very resilient, very high performance, stateless. You can crash a system in the middle of a busy workload; you won't lose any data. There is no caching. There is no eventual consistency.
Everything is committed. In fact, we even write to the disks with O_DIRECT turned on, bypassing the kernel buffers. Right? Otherwise, you could lose power in the data center and data can get corrupted because XFS did not journal the data.
[00:35:22] Unknown:
We don't want to run into those kinds of problems. It's really a simple architecture, and that is what made MinIO different. So can you talk through the overall clustering strategy that you use and some of the ways that you actually manage the file metadata, given that you don't have a centralized
[00:35:39] Unknown:
storage layer or database for being able to reference that? So the centralized metadata layer came from the legacy idea. Right? A file system historically meant translating a POSIX-like operation into a collection of blocks. And once you break things into a collection of blocks that are mutable, then you need to have the association of what blocks make an object and what objects are in which bucket; that's where the metadata comes in, right? We actually did not find any such need when the blob comes into the system. So within the MinIO architecture, the concept is like this: every 16 drives (this is a default setting you can override if you want) form a set.
And then, when you have a collection of machines, a server set is a cluster. Within the server set, say if you have 32 servers each having 32 drives, you have 1024 drives, right? Now what MinIO does is deterministically pick 16 drives, making an erasure set. And when the object comes, even the location of the object, in terms of which drive set it has to go to, is simply a hash-mod operation on the object name. Object names are unique across the namespace, and a hash-mod operation deterministically places the object on the same exact 16-drive set each time.
Now when the request comes to any one of the servers, each of the servers has the same logic. It's simply a deterministic lookup algorithm. It doesn't have to check a metadata database, and all object data and metadata are written together. By doing that, you actually have a significant advantage in terms of resiliency and performance, because you don't need to hold a lock while you make sure a metadata database is updated. We have no such requirement. Each object is granular and completely parallel. There is no global lock here. Because object names are unique, each object holds an object level lock and then takes the data. Often, the data comes in bits and pieces, like get object range or put object multipart. All of them are committed atomically, erasure coding and scattering across the drives. The location lookup is simply deterministic. So within a cluster, this is how it behaves.
Now what happens when I have multi data center, multi cluster? That is when different regions behave exactly like Amazon AWS regions. Depending on the bucket location, your request gets forwarded to the right cluster. Within that cluster, every one of the nodes is fully symmetric. There is no metadata server or NameNode-like property here. Right? Every node is equally capable. Once your bucket location tells you this is the cluster where your bucket is, any one of the nodes in that cluster knows exactly how to serve the data. It may be a little hard to grasp without a picture, but think of it as if you wrote a simple web service for a file uploader and file getter. Right? You would find that you would actually gravitate towards this kind of model very closely.
And the only difference would be: if you don't have to remember the location in a database, how would you do it? Simply use a deterministic lookup algorithm.
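A toy sketch of that deterministic lookup, assuming a simplified model where a cluster is a flat list of drives grouped into 16-drive erasure sets; the hash function here is illustrative, not MinIO's actual algorithm:

```python
# Toy sketch of deterministic object placement: hash the unique object name,
# mod by the number of erasure sets. Every node runs the same pure function,
# so no metadata database or lookup service is needed.
import hashlib

DRIVES_PER_SET = 16

def build_erasure_sets(drives):
    """Group a flat list of drives into fixed-size erasure sets."""
    return [drives[i:i + DRIVES_PER_SET]
            for i in range(0, len(drives), DRIVES_PER_SET)]

def locate(object_name, erasure_sets):
    """Deterministically pick the erasure set holding an object."""
    digest = hashlib.sha256(object_name.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(erasure_sets)
    return erasure_sets[index]

# 32 servers x 32 drives = 1024 drives = 64 erasure sets of 16 drives each.
drives = [f"server{s:02d}/drive{d:02d}" for s in range(32) for d in range(32)]
sets = build_erasure_sets(drives)

# The same name always lands on the same 16-drive set, on every node.
print(locate("mybucket/photos/cat.jpg", sets))
```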
[00:39:40] Unknown:
So how do you characterize MinIO in terms of the axes of availability and consistency in the CAP theorem? And how do you handle network failures in that clustered context, to ensure that the blocks that you're trying to write to, based on these modulo hash operations, are going to be accessible at the time of the commit?
[00:40:03] Unknown:
Yeah. So when I designed the system, I never thought of a CAP theorem based approach. It makes sense when you are actually doing a replication model, where, when you are doing scale-up or replication, these problems come into the picture and you need to make a hard decision. In the case of MinIO, just to map it onto the CAP theorem, what is our model? I would say that it is CP. Consistency is actually very important in a way that most people don't realize. In eventually consistent systems, just before the data is propagated, one drive holds the write-back cache. If that drive died, and the odds of that drive dying are higher than the other drives, you lose the data.
It is really important to actually be done with it and let the application know exactly what failed. We took consistency very seriously. And also partition tolerance, I think, is a no-go area: you can't be forgiven if you make mistakes there. Because in any system, it's okay to actually fail, but not to corrupt the data. Right? In any storage system, this assurance has to be there. So partition tolerance and consistency are super critical. The availability part is not well understood by most architects. It is where erasure code comes in handy, a combination of erasure code and the S3 API itself. The S3 API being atomic and immutable in nature, the application remembers the context. Right? In an object storage system, every operation, as long as I committed it atomically, and I continue to keep that atomicity, then I'm good to go. And I only need to take care of metadata updates and data updates together; because there is no metadata database, they're written together atomically.
I'm able to actually solve these problems more elegantly. Now how are we dealing with the availability problem then? Here is where, when you take an object, break it into multiple parts, and spread it across 16 drives across 16 servers, you have plenty of parity. If I have, let's say, 4 parity shards, I can lose up to 4 nodes and my data is still available. Because of erasure code and the very nature of the S3 API and how MinIO was designed, we got plenty of availability to withstand failures. That's the part where it is completely alright to compromise. So, we built a CP system here.
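A back-of-the-envelope sketch of that availability claim, using the 16-shard, 4-parity example above; the parity count is configurable, and this toy model only illustrates the arithmetic:

```python
# Toy illustration of erasure-coded availability: with N total shards and
# K of them parity, any N - K surviving shards reconstruct the object.
N_SHARDS = 16  # one shard per drive in the erasure set
PARITY = 4     # parity shards, per the example above
DATA = N_SHARDS - PARITY  # 12 data shards

def readable(surviving_shards):
    """An object is readable if at least DATA shards survive."""
    return surviving_shards >= DATA

for failures in range(6):
    ok = readable(N_SHARDS - failures)
    print(f"{failures} failed drives -> object {'available' if ok else 'lost'}")
# 0 through 4 failures: available; 5 failures: lost.

# Storage overhead of parity: 16/12, about 1.33x raw capacity, versus 3x for
# triple replication, which is the efficiency argument for erasure coding.
print(f"overhead: {N_SHARDS / DATA:.2f}x")
```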
[00:42:59] Unknown:
For somebody who is deploying MinIO, what are some of the operational characteristics and server capabilities that they should be considering? And in the clustered context, how is the load balancing handled? Is it something where you would put a service in front of the cluster, or is that something that the nodes themselves handle, as far as routing the requests to the different nodes within that cluster to ensure proper distribution of the data? So the way we designed it
[00:43:32] Unknown:
in MinIO is that the nodes themselves should handle all the routing, but it's not always the case. Right? So when do you really need a load balancer or a service mesh in front? It depends on the use case. I find that when people deploy this for AI/ML-like data processing workloads where they need high performance, every hop counts. And usually these applications are running alongside, over 100 gigabit Ethernet and NVMe. When they do that, you want the fastest path to the data and the shortest path to the data. And NVMe and SSDs are all about time to first byte. And in that case, they go directly.
Any one of the nodes you hit, they all have the same capability; it's fully symmetric. So I wouldn't recommend putting a load balancer there. Now then, how do they discover which node they want to go to? Say if you have a 16 node cluster, or say a 100 node cluster, you simply have these node names mapped in DNS. And with a simple round robin DNS based approach, the clients are spread uniformly across all the servers, and they do the job. And the S3 API is quite resilient: even if the server restarts or anything, the clients actually reconnect. HTTP itself is a stateless interface. Right? Now when do you need a load balancer or a service mesh in front?
When you are actually building an application use case where you are using this like a photo store, or some mobile application data that is accessed across the Internet: TLS termination, bandwidth throttling, and multiple other reasons like routing and service discovery. There I actually see almost two classes emerging. The slightly old school, I don't know if I can call them old school, but the most common approach is a load balancer approach. The emerging one, particularly for the large scale deployments, is actually switching to a service discovery model.
Here is where Istio and Envoy type solutions come in; there are actually multiple solutions out there in the market, and you evaluate a bunch of them. If you're building a very large stack, the service discovery part actually takes care of scaling: as you add new clusters, they get registered automatically and applications discover these capabilities. But then, when it comes to actually hitting the data, they don't want to go through a load balancer. This is where, like Envoy, a sidecar proxy brings the load balancer, almost like a personalized proxy, closer to the application.
These are some really cool emerging ideas. If people are building large scale applications, like SaaS applications, they would go this route.
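A small sketch of the round robin DNS approach described above: the cluster hostname resolves to one A record per node, and clients spread themselves across the results with no load balancer in the path. The hostname and port here are placeholder assumptions:

```python
# Hypothetical sketch: client-side round robin over DNS A records.
import itertools
import socket

def resolve_all(hostname, port=9000):
    """Return every address the hostname resolves to (one per cluster node)."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

# "minio.internal" is a placeholder name carrying one A record per node.
nodes = resolve_all("minio.internal")
rr = itertools.cycle(nodes)  # simple round robin over the resolved nodes

for _ in range(4):
    endpoint = f"http://{next(rr)}:9000"
    print("next client connects to", endpoint)
```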
[00:46:48] Unknown:
And in terms of the project itself, I'm wondering how you approach the ideas of project governance and sustainability and just some of the overall strategy from a business perspective
[00:47:00] Unknown:
of having it be open source. Yeah. Actually, this is the one that is close to our heart. And, you know, while we are doing startups, the reality for us is that when we did Gluster last time, the team came together because they were passionate about open source. In fact, what I call open source, we are the free software guys. Right? You know the difference. Right? And,
[00:47:29] Unknown:
the GNU kind: free as in speech, not as in beer.
[00:47:33] Unknown:
Yeah. Exactly. Right? And it was out of passion that it grew. When you build products with passion, the attention to detail and craftsmanship get noticed. And that is how it led to success, not because we are a bunch of entrepreneurs who wanted to make money and got together and figured out what would make the most money. It didn't happen like that. And because we were passionate about open source back then, I call it open source simply
because when I say free software, being a purist, very few people really understand it. But the truth at the end of the day shows up. Right? With Gluster it was the case, and with MinIO it is the case: you see there is nothing proprietary here. It is a 100% pure open source or free software model. And why does it matter? Back then, customers used to ask me, hey, open source means it's inferior, is it not secure? They used to ask all these questions, and it has come a long way. Now you can see even VMware has to embrace Kubernetes.
In every large deployment, particularly in the data space, we find customers actually mandating open source for their infrastructure, because they have seen even big companies shutting down products every few years, products that enterprises depend upon. With open source, you can easily hire top talent across the world, and it is no longer a problem, right? Now when it comes to governance, how do we run our work compared to other open source projects? We took a benevolent dictator model. Right? In fact, it's not a pure benevolent dictator.
I would say it's like benevolent dictators all the way down. My job is to make sure that I continue to groom more and more leaders. Everyone in the team is empowered to make decisions within their scope. The hardest part for me is that it's like you have one giant canvas and a bunch of artists, and our goal is to create one piece of artwork with one mind. It is not easy, right? But you have to be opinionated and you have to bring everybody to see the same vision. And we are able to see the scale at which we grow, like 324,000 downloads a day.
Maybe some of them are CI/CD systems. But by all measures, say the 5,300 members on our Slack channel, the 18,000 GitHub stars, any number we look at, we have deep penetration. But the key here is we are not an Apache Foundation or a Linux Foundation project. What I found was that the moment you bring in a consortium on the board, you have multiple chefs, and every vendor has their own commercial interest. It becomes really hard to drive a project. If you are opinionated and true to your cause, you will be able to build a powerful community. And the community showed faith in us: they don't need a third party nonprofit organization to endorse that we are a company that can be trusted.
And they have seen this even in the past with Gluster: I'm no longer involved, nor are some of the core members, but it is a project that is still thriving without me, right? That is what gives users a lot of confidence that we understand what we do. For us, open source is not a business strategy; it is a philosophy we believe in, right? And they're able to see that. Combined with that, you also have to build the product with craftsmanship, thinking like an artist, a minimalist culture. If you establish the culture, it actually brings the right kind of people together. And once you bring those people together, then you can't stop it, right? It becomes a way of life for them, and they get emotionally attached. Some of these folks, wherever they go, not only with MinIO, everything they do afterward, they want to do it like this.
That is what gives the stability for governance, and the stickiness and sustainability
[00:52:23] Unknown:
for the long term. Yeah. It's definitely great seeing businesses that have that understanding of the ethos and culture of open source projects, and ensuring that they're actually in it for the technology itself, and not just for the business opportunities that come along with it. Yeah. No, we are kind of fortunate
[00:52:44] Unknown:
now, because the industry has come a long way; you no longer have to convince the customers why open source and free software is great for them. Now they are telling us that that is all they need. In fact, the surprising part is even investors are actually favoring open source, particularly in the enterprise and infrastructure space. They favor open source over proprietary startups. Investors are even advising these proprietary software startup entrepreneurs to actually consider open source. And the thing is that you have to believe in it. Right? It's not something that
[00:53:19] Unknown:
you think of as a business strategy. Because of the fact that you have gained this measure of popularity, I'm sure that there have been some interesting use cases that have come about. And I'm wondering what you have found to be some of the most interesting or innovative or unexpected ways that you've seen MinIO used. Okay. That's actually the fun part. Right? I can say this is a true private cloud use case. Like, literally,
[00:53:41] Unknown:
there is MinIO running inside Royal Caribbean ships, because they have no connectivity. I've seen similar cases in movie production: on site, they have to capture, edit, and process on prem. And from cases like that, the one that surprised me the most was the 700,000 unique IPs of MinIO running inside AWS. I don't know how many of them are serious, growing deployments. I actually kept telling these users that they should be using AWS S3 when they are inside AWS, because they are running us on top of EBS. EBS, I think, is like 3 times more expensive than S3. You're not really saving money by running on top of EBS block storage.
When I spoke to these users, what they were telling me was that they wanted cloud portability. They have fully automated their stack through Kubernetes, and they are able to burst their deployment into public cloud; whether it is CI/CD or application development, they are able to move between clouds. Portability and convenience are more important than the cost. But if you ask me, I would still recommend that if your data is growing, you stick to S3 inside AWS; if you are outside AWS, then MinIO will make more sense. Then the one that came recently, that was quite interesting to me, was the Tantan use case. Tantan is like the Tinder of China, and they have an insane number of really small files. These are like sub-100K objects, and the largest is like 200K. And when you have a few kilobytes worth of objects, you should really not be looking at object storage; you should be storing them in a database.
But the problem is that they had petabytes of such data, and no database would scale that big. So what they did, interestingly, was replace our drive file system layer with RocksDB, so they can actually store petabytes of really, really small files that should actually be stored in a database. Instead, in this case, they're stored as objects, and the objects are indexed as a collection of compacted RocksDB databases. And essentially it is MinIO doing erasure code, bit rot protection, all those capabilities, with strict consistency across many nodes at petabyte scale.
This is almost like a distributed database with the S3 API. That actually surprised me. If they had asked me up front, I would have completely discouraged them. And this surfaced after they were running us in production at scale; they published a blog post recently, and it came as a surprise, and I'm actually encouraged by the innovation that they did. Then I also see the medical imaging use cases; usually they are the most conservative ones. They are storing all the VNA, vendor neutral archive, and PACS data. And, like, 12 of the 15 largest banks in the US are using us, and 13 of the 15 in Europe. Organizations that are supposed to be conservative are the ones adopting open source object storage. I think they have come a long way, and also, I think, they're struggling with the data explosion problem. Yeah. It's definitely interesting.
[00:57:30] Unknown:
Sticking files inside a database, inside an object storage, to be able to query them at scale.
[00:57:35] Unknown:
Yeah. You know exactly what I mean.
[00:57:39] Unknown:
In terms of the future direction of MinIO, I'm wondering what you have planned in terms of the product road map, and any additional projects that you may build to incorporate with the object storage and provide additional capabilities?
[00:57:55] Unknown:
Yeah. So some of the newer things that we are working on: more and more, I see even the databases are coming to object storage. Because even the small data is growing big, from Splunk to Vertica and Teradata in the proprietary world, to open source, all the way from Presto to Spark ML to Drill, all the databases are turning to object storage. They're moving their storage back end to object storage and working on just the SQL processing and query engine. So we are working with these folks in terms of integration and validation; that's just a matter of ecosystem, in terms of enabling more and richer applications to come to object storage. But in terms of features inside MinIO itself, the most important thing that I care about is actually supportability: supportability in the sense of how you really operate a large infrastructure at scale with very little expertise.
But in terms of features inside MinIO itself, the most important thing that I care about is actually supportability: supportability in the sense of how you really operate a large infrastructure at scale with very little expertise. As you scale bigger and bigger, if you need to hire more people, then you are not scaling at all. Right? So how do you do that? This is where I see that supportability is not a separate problem; it is very much a product problem. You would have seen some recent features, simple things but very powerful, like mc admin trace. It is like DTrace, but for object storage: running a trace is as simple as an admin trace command. You point it at a running system, and it gives you good details on every single operation going on at that point. Anytime an application hits some bug, whether it's a bug in our code or something else, this is where S3 compatibility comes in, right? Is it something that we broke, or is some new application using some kind of legacy API, or invoking it in the wrong way?
Run admin trace, and immediately we can tell: bug in their code, bug in our code, or whatever went wrong. From that to even console logs: you remotely attach to a running system, and it gives you the entire history of the console log. Simple things, from that to how you can detect slow drives and bad networks. They are all not someone else's problem, right? Almost always, you will see that a batch of new drives is slow, and you would actually end up blaming MinIO, saying MinIO is slow and timing out. Debugging and troubleshooting that by hand is not a scalable approach, right?
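A hypothetical session showing that workflow; the alias, endpoint, and credentials are placeholders, and exact flags can vary across mc releases:

```shell
# Register the deployment under an alias (older mc releases used
# `mc config host add` instead of `mc alias set`).
mc alias set myminio http://minio.example.com:9000 ACCESS_KEY SECRET_KEY

# Stream every S3 API call the running cluster is serving, live:
mc admin trace myminio

# Verbose mode adds request/response detail, usually enough to tell
# whether the bug is in the client's API usage or in the server:
mc admin trace -v myminio
```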
This is where the supportability capabilities that we are adding into the system will actually enable not only the users, but also us, to reduce the burden at scale. At this pace, we make new releases every week, sometimes even multiple times a week. We are able to move fast and operate private cloud deployments on par with, or better than, public cloud deployments. These guys are able to run large infrastructure with no previous storage experience, right? To me, enabling them through this tooling is super important.
Then the other feature that we are adding is SUBNET. SUBNET is how we can remotely help these users: lifecycle management, the ability to expand clusters on demand, things like that. But in general, I would say we are the anti-roadmap company. I always tell our architects and our users that if you send a pull request to remove a feature, I will immediately accept it, as compared to adding one. Ask ten times whether you really, really need this feature, because adding is easy and maintaining is expensive. We try very hard not to make it feature rich, because that would lead to collapse eventually, right? So we are very particular about making sure every feature out there is a very important
[01:02:21] Unknown:
part of the system. So that's about it. Are there any other aspects of the work that you're doing at MinIO, or the overall space that you're working in, that we didn't discuss yet that you'd like to cover before we close out the show? I think,
[01:02:34] Unknown:
in general, I see that object storage has come a long way. I would like to see more tools on top, for example, better data governance. Data governance and data management themselves are kind of old-school thinking; I'm not thinking in those terms. The real problem is that your data is at such a large scale and is constantly changing. Now, the ability to control access, define policies, even establish identity: in the old days, you would give access based on user IDs. Nowadays, there are no users; these are applications, and you need application identities.
Applications cannot actually do two-factor authentication; they need to be doing certificate-based authentication. A lot has changed in recent times, and the industry is far behind in its understanding of how to manage object storage, how to establish these identities, and then also how to even discover data, right? When you have data getting generated faster than all of your past data, how do you even organize this data? Cataloging it, indexing it, putting it in organized folders and buckets is just going to be impossible.
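To make the application-identity idea concrete, here is a hedged sketch of giving an application its own credentials scoped by an AWS-style policy, using the mc admin commands of roughly this era (later mc releases renamed `policy add`/`policy set` to `policy create`/`policy attach`); every name and key below is a placeholder:

```shell
# A policy that lets this application read exactly one prefix and nothing else.
cat > recommender-read.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject"],
    "Resource": ["arn:aws:s3:::ml-artifacts/models/*"]
  }]
}
EOF

# Create an identity for the application itself, not for a human user.
mc admin user add myminio app-recommender APP_SECRET_KEY

# Register the policy and attach it to that application identity.
mc admin policy add myminio recommender-read recommender-read.json
mc admin policy set myminio recommender-read user=app-recommender
```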
The way you then have to look at it is to think of this as a search problem, right? If you have too much data, you can't organize it as folders; all you need is a better search engine, better access mechanisms, and policy control, all of that. I think there is a lot of room for these kinds of powerful tools to emerge on top of the data management and data storage system. I would like to see more new projects or startups
[01:04:45] Unknown:
going after them. For anybody who wants to follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And you've touched on this a little bit just now, but I'd still like to get your perspective on what you see as being the biggest gap in the tooling or technology for data management today, if you have anything else to say on that matter.
[01:05:07] Unknown:
Yeah. To summarize, the tooling gap is the search part and access management, in terms of policies and access control. In the past, because you had many different storage systems, a variety of SAN and NAS vendors and database vendors, data management was hard. This is where the data lake got all the bad rep, right? In the new world, the good part is that all of the data is getting consolidated onto object storage, and there is only one storage system at the heart of the data infrastructure. Everything else is simply stateless containers and VMs around object storage, and they are all accessing it through the S3 API. If this is the case, finally, data management is something tractable.
Right? But then the key here is the platform. I haven't seen a good product yet in the market, but certainly I keep hearing about new startups wanting to go after this market.
[01:06:27] Unknown:
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with MinIO. It's definitely an interesting project, one that I have been keeping an eye on for a while and look forward to using for my own purposes. So thank you for all of your efforts on that front, and I hope you enjoy the rest of your day. Oh, thank you, Tobias.
[01:06:45] Unknown:
This is a common pattern we see. Users use us on their home NAS or just their home drives, and also in the corporate world on large-scale machines. That level of simplicity, simple enough for a personal use case, is what is needed to actually operate a very large infrastructure, right? And often, even when we talk to our enterprise users, it turns out it is also running on their laptops and home NAS appliances. It's cool to hear that. Thank you again, and,
[01:07:25] Unknown:
have a good rest of your day. You too, Tobias. Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Anand Babu Periasamy and MinIO
The Origin Story of MinIO
Object Storage Landscape and MinIO's Unique Approach
Use Cases and Performance of MinIO
Federation and Compatibility Layers in MinIO
System Architecture and Clustering Strategy
Project Governance and Open Source Philosophy
Interesting Use Cases and Customer Stories
Future Directions and Product Roadmap