Summary
Distributed storage systems are the foundational layer of any big data stack. There are a variety of implementations that support different specialized use cases and come with associated tradeoffs. Alluxio is a distributed virtual filesystem which integrates with multiple persistent storage systems to provide a scalable, in-memory storage layer so computational workloads can scale independently of the size of your data. In this episode Bin Fan explains how he got involved with the project, how it is implemented, and the use cases that it is particularly well suited for. If your storage and compute layers are too tightly coupled and you want to scale them independently then Alluxio is the tool for the job.
Introduction
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Bin Fan about Alluxio, a distributed virtual filesystem for unified access to disparate data sources
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Alluxio is and the history of the project?
- What are some of the use cases that Alluxio enables?
- How is Alluxio implemented and how has its architecture evolved over time?
- What are some of the techniques that you use to mitigate the impact of latency, particularly when interfacing with storage systems across cloud providers and private data centers?
- When dealing with large volumes of data over time it is often necessary to age out older records to cheaper storage. What capabilities does Alluxio provide for that lifecycle management?
- What are some of the most complex or challenging aspects of providing a unified abstraction across disparate storage platforms?
- What are the tradeoffs that are made to provide a single API across systems with varying capabilities?
- Testing and verification of distributed systems is a complex undertaking. Can you describe the approach that you use to ensure proper functionality of Alluxio as part of the development and release process?
- In order to allow for this large scale testing with any regularity it must be straightforward to deploy and configure Alluxio. What are some of the mechanisms that you have built into the platform to simplify the operational aspects?
- Can you describe a typical system topology that incorporates Alluxio?
- For someone planning a deployment of Alluxio, what should they be considering in terms of system requirements and deployment topologies?
- What are some edge cases or operational complexities that they should be aware of?
- What are some cases where Alluxio is the wrong choice?
- What are some projects or products that provide a similar capability to Alluxio?
- What do you have planned for the future of the Alluxio project and company?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Alluxio
- Carnegie Mellon University
- Memcached
- Key/Value Storage
- UC Berkeley AMPLab
- Apache Spark
- Presto
- Tensorflow
- HDFS
- LRU Cache
- Hive Metastore
- Iceberg Table Format
- Java
- Dependency Hell
- Java Class Loader
- Apache Zookeeper
- Raft Consensus Algorithm
- Consistent Hashing
- Alluxio Testing At Scale Blog Post
- S3Guard
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello. Welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them. So check out Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And don't forget to go to dataengineeringpodcast.com/chat
[00:00:56] Unknown:
to join the community and keep the conversation going. Your host is Tobias Macey. And today, I'm interviewing Bin Fan about Alluxio, a distributed virtual file system for unified access to disparate data sources. So, Bin, could you start by introducing yourself?
[00:01:10] Unknown:
Yeah. Hi, Tobias. My name is Bin. I'm glad to be here. I'm currently a founding member of the Alluxio company and also a PMC member of the Alluxio open source project. I've been working on the project for almost 4 years. Before I joined the team, I was working at Google on large scale distributed storage systems, in the same space as Alluxio. And before I joined Google, I was a PhD student at Carnegie Mellon working on distributed systems, focusing on storage systems. So I've been working in this space for a while.
[00:01:47] Unknown:
And do you remember how you first got involved in the area of data management?
[00:01:51] Unknown:
It's from my early PhD years. Around my third year I started to look at key-value storage systems as part of my PhD thesis. We hacked a lot on memcached at that time, focusing on memory efficiency and throughput, and we had one paper come out of that. Because of my work on that, I visited the UC Berkeley AMPLab, where I got to know the founder of Alluxio, Haoyuan Li, who was a PhD student at the AMPLab at that time, because he was also working on in-memory storage, in the same space.
So we knew each other. After that, I graduated and went to Google to work, and Haoyuan Li continued studying at UC Berkeley. Eventually his research project was funded by VCs and he created this company. He called me, I picked up the call, and I felt this was a very interesting and exciting opportunity. So I quit Google and joined Alluxio. So, yeah, that's my
[00:03:09] Unknown:
story. And as you mentioned, Alluxio started its life in the AMPLab at Berkeley, and I know that it was originally released under the name of Tachyon. So can you give a bit of an explanation about what the Alluxio project is and the organization that you've built up to support it? Yeah. So this is a great question.
[00:03:37] Unknown:
So in the early days, it was a research project called Tachyon. Actually, it had an even less known name before Tachyon. But in the Tachyon days, it was open sourced from the very beginning. The motivation for the project is heavily related to Apache Spark. At that time, Spark was already taking off at UC Berkeley, and a lot of people were looking at Spark and seeing the great potential of this new computation framework that aggressively uses memory. One complaint people always had, at that time and even now, is that memory is still considered a more expensive storage resource than other storage media.
And Haoyuan Li, the founder of the project, was looking at this and found it could be more efficient if you had a system external to Spark that lets different Spark jobs or different Spark contexts share data more efficiently. So that's how he started this project in the beginning, and he open sourced it. Gradually this became more like a general purpose distributed file system, rather than just a sharing or caching layer for Spark, serving other computation frameworks like Presto, MapReduce, or even TensorFlow nowadays. So the project evolved from a very specific purpose, just helping Spark manage data more efficiently, and gradually became an in-memory file system. And this in-memory file system needs to handle data persistency, because once you talk about something in memory, the first thing people ask is: if you restart the machines and you lose the data from memory, how do you deal with this? So from the first day, Tachyon, or Alluxio now, was designed to handle this data persistency issue even with volatile storage media like memory. This is very fundamental to the design of the system, and it becomes fundamental in the architecture if you compare Alluxio with other distributed storage systems like HDFS or the Google File System from the paper. So I would say this is fundamental to our entire design. And it turns out to be very interesting nowadays, when people are talking about separating storage and compute.
Storage can be far, far away from the compute, because the storage can be some cloud storage. In this case, Alluxio works perfectly in this architecture because we can sit in the middle between compute and the real persistent storage, since we are not trying to solve the data persistence problem ourselves. Long story short, we went from a caching layer for Spark using memory to a general purpose file system. It's a file system, not a caching system: a distributed file system with awareness of what we call under stores, which provide data persistency. So the architecture is different from 10 years ago, when people were saying, we should move compute to storage, let's have deeply coupled storage and compute. And we built open source around this idea. We have more than 900 contributors from all over the world joining this project because they want to contribute, for example, an adapter to read their own data source, their own persistent data store. And we have a lot of interfaces for people to customize, so they can have different cache policies.
The open source project is getting more and more popular, and we also provide an enterprise offering. The core features, all the things I mentioned so far, are in the open source edition, what we call the community edition. As a company, we also need to survive, so we need a business model around this open source. So we provide an enterprise offering, which adds enhancements in security, large scale deployments, and high availability: the kind of enterprise readiness features a lot of large corporations are looking for when you run services in their environment.
[00:08:19] Unknown:
Yeah. So this is our business model. And a couple of the things that you mentioned in there preempted some of the questions I have, particularly in terms of the persistence of the data in memory, as far as what happens when you have a memory failure in a machine or the instance gets rebooted and how that data distribution gets managed, at least in terms of the long term persistence of the data, where you rely on the underlying storage systems to provide those guarantees. I'll probably have a few more questions later on about how you manage distribution within the Alluxio layer once the data has been retrieved. But before we get into that, I want to talk a bit more about some of the use cases that Alluxio enables. Because, as you mentioned, there was a big push of moving compute to the data because of the inherent gravity that these large datasets have. But as we move more toward a cloud oriented environment, you have these object stores that don't have any guaranteed physical location, so you have varying levels of latency as you're accessing them. Or you're trying to run in a hybrid cloud environment where you have your own local data centers, and then you're also trying to interface with public cloud for bursting capacity or for taking advantage of some of the specialized features that they offer. So, yeah, going back again, just wondering if you can talk about some of the use cases that Alluxio enables that would be impractical or too difficult to deal with without it. Yeah. Good question.
[00:09:49] Unknown:
The first case I would recommend exploring is really if you have data remote from compute. We see cases where people have different data centers, and they have cross data center traffic to load data because they want to do a join, and one table is in a remote data center while the other tables and the rest of the computation are in the local data center. In this case, having Alluxio in place will help greatly to reduce the computation time, because you can either preload the table from the remote data center into the Alluxio caching layer, or Alluxio has the intelligence built in to bring the data on demand. And after the cold read, the next time you read that same data again it will be cached locally. So we do see a great performance gain in cases like this. We have a published use case with Baidu, the search giant in China; they saw performance benefits of 30x in these cases. The other case, which interestingly I see more of, is really the cloud, as you mentioned. This is also what I observe: the companies of the new generation are typically born on the cloud. They start on AWS or Azure from the first day. And even the older generation of companies have aggressive plans nowadays to move their data or compute or both to the cloud.
And once you are moving to the cloud, the natural storage choice for you on AWS is S3, right? S3 is awesome. I like S3 a lot. It's cheaper compared to other storage options, it's scalable, and it's very easy: the semantics are very simple, it's very easy to manage, and it's a single global namespace. However, there are also costs that come with all that convenience. For example, once you have all your data in S3, your compute, say on EC2, will always have to drag the data from S3 to the local compute engine, so this will go across the network. And S3 sometimes has different semantics compared to a traditional file system. For example, rename can be expensive, and listing a huge bucket, a bucket with, say, a thousand or a million objects inside, can be very slow. So in these cases, people do want more familiar, more traditional performance characteristics from the storage service.
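As a rough illustration of why rename is expensive on an object store, here is a minimal Java sketch against a hypothetical `ObjectStore` interface (the names are illustrative, not any real SDK): renaming a "directory" has to copy and delete every object under the prefix, so its cost grows with the number of objects.

```java
import java.util.List;

// Hypothetical object-store client interface, for illustration only.
interface ObjectStore {
    List<String> listKeys(String prefix);          // may paginate over millions of keys
    void copy(String sourceKey, String targetKey); // server-side copy of one object
    void delete(String key);
}

final class DirectoryRename {
    // O(number of objects): there is no atomic rename, so each key under the
    // old prefix must be listed, copied to the new prefix, and deleted.
    static void renamePrefix(ObjectStore store, String oldPrefix, String newPrefix) {
        for (String key : store.listKeys(oldPrefix)) {
            String target = newPrefix + key.substring(oldPrefix.length());
            store.copy(key, target);
            store.delete(key);
        }
    }
}
```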
Also, the new trend of machine learning applications emphasizes iterations a lot. They take data from the last iteration, do some computation, and output to the next iteration. And from then on, the data from the previous iteration may not be important at all. So you need a temporary data store rather than a persistent data store. So we see cases running computation on the cloud where they want a caching or another tier on top of the cloud storage provided by vendors like S3 or Microsoft, and Alluxio can fit perfectly in this case. So I named two cases. I think there is a third case, which is very interesting, that I see mostly from users in China, the Internet companies in China. They like a topology where they have a centralized main HDFS data source, but that HDFS service is heavily, heavily loaded because a giant number of different applications depend on this data. So what we see is, to guarantee SLAs, they also set up what we call satellite clusters, and each cluster is maybe a zone just for one specific computation service, for example Presto.
In this case, the data and compute are still in the same data center, but they are still on different machines, going across the network. And because they have seen issues with high pressure on their HDFS service, they say, okay, let's put another storage layer here, collocated with my satellite compute cluster, mirroring the data in the main storage. And we don't need the whole dataset, just the working set, right? So Alluxio works perfectly in this case too. So we see a lot of use cases around this,
[00:14:37] Unknown:
especially for Internet companies in China. And so can you talk a bit about how the Alluxio project is implemented to allow for these data reflections and high performance access to data, particularly in the face of varying levels of latency, whether it's the case of interfacing with cloud object storage or the situation that you just described, where you're using it as a way to keep a working set accessible to a compute layer that's remote from a centralized Hadoop cluster or other sort of data lake that would be too large to keep in memory on the compute instances, but
[00:15:21] Unknown:
still needs fast access to that data lake? Yeah. So as I mentioned in the previous question, Alluxio was born in the beginning to solve challenges like this. There are a few things we are doing to help reduce the latency and reduce the performance variation, the retrieval latency variation. One thing is we aggressively use memory as the main storage media. We provide optional choices: you can use Alluxio to manage memory, plus optionally SSD, plus optionally hard disk. But we still see a lot of users use Alluxio to manage memory, which has a higher bandwidth when you have very hot data inside. We have also done a lot to provide what we call short circuit data reads and data writes. Essentially, we aggressively leverage data locality. For example, take a distributed application like Spark: if Spark is using the HDFS interface to access data, it will try to allocate the tasks closer to the storage, and it does this through APIs exposed by the nodes serving this data. So Alluxio, if it's deployed collocated with Spark or another computation framework, provides a compatible interface as well.
In this way, we will try hard to match the tasks to the workers serving the data, so they don't need to go across the network to fetch it. So, yeah, memory and data locality. The third part is related to metadata. Alluxio is not just a data caching layer; as I mentioned, this is a file system, so we have our own dedicated metadata service. We maintain the inode trees, the journal, all the basic elements required for a file system.
That means if you have a slow list operation on S3, that will happen once. But after that, Alluxio understands what the metadata on S3 looks like, and we will serve it instead of having S3 serve these metadata queries. So by doing this, we can reduce even the variation in the metadata operations.
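To make the metadata caching idea concrete, here is a minimal Java sketch (hypothetical names, not Alluxio's actual metadata service) in which an expensive listing call is performed at most once per path and later queries are served from a local map.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative sketch: cache the result of a slow listing call (e.g. an S3 LIST)
// so repeated metadata queries for the same path are served locally.
final class ListingCache {
    private final Map<String, List<String>> cached = new ConcurrentHashMap<>();
    private final Function<String, List<String>> slowList;

    ListingCache(Function<String, List<String>> slowList) {
        this.slowList = slowList;
    }

    List<String> list(String path) {
        // The expensive call happens at most once per path; later calls hit the cache.
        return cached.computeIfAbsent(path, slowList);
    }

    void invalidate(String path) {
        cached.remove(path); // force a reload when the under store is known to have changed
    }
}
```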
[00:17:58] Unknown:
So to your point about the tiering of storage, where your primary operations are being done in memory but you also have the capability of using the local disk on the Alluxio nodes: I'm wondering if you can talk a bit about the life cycle possibilities that you have as far as managing when data gets moved from one layer to the other in the Alluxio nodes, but also as far as aging out data, where it gets initially pulled into the working set by a request from the compute engines and at some point is no longer relevant, or if you need to page out data because more data is being requested than can be held within the Alluxio cluster. Just how that overall tiering and life cycle management happens within the Alluxio layer
[00:18:47] Unknown:
itself. This is also a core feature provided by Alluxio. Essentially, at a very high level, you can think of Alluxio as a cache, a distributed cache for file data. Each Alluxio worker is the component handling the data capacity, really providing the data capacity, and you can add more workers into the cluster. Each Alluxio worker implements a cache, and by default we use LRU as the caching policy. We also make this pluggable, so in addition to LRU we have some other policies built in, and we do see users provide their own caching and replacement policies too, because they understand their workloads better. So that's on the caching side.
On the other side, we also provide functionality for users to actively control the data life cycle. For example, we provide commands to set a TTL for a file or a directory. Say I set the TTL to be 1 day; then after a day, the data in Alluxio can be freed. It's up to the workload, but we do not guarantee it's in memory anymore if you set a TTL. And there are also policies like, I want to really pin this data in the memory layer. That's another command you can use to really keep data in the top tier, closer and faster to the computation.
So, yeah, that's basically what we do for life cycles. And we are actually looking at more sophisticated policies in the future, because we do see, especially for enterprise users, that they have more sophisticated life cycle management requirements. But right now, these are all available in the open source community edition.
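As a rough sketch of the two mechanisms just described, the following Java snippet combines LRU eviction with an optional per-entry TTL. It is illustrative only; Alluxio's real worker cache is pluggable and considerably more sophisticated.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU cache with optional TTL per entry.
final class LruTtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis; // Long.MAX_VALUE means "no TTL"
        Entry(V value, long expiresAtMillis) { this.value = value; this.expiresAtMillis = expiresAtMillis; }
    }

    private final int capacity;
    private final LinkedHashMap<K, Entry<V>> map;

    LruTtlCache(int capacity) {
        this.capacity = capacity;
        // accessOrder=true makes iteration order least-recently-used first.
        this.map = new LinkedHashMap<>(16, 0.75f, true);
    }

    synchronized void put(K key, V value, long ttlMillis) {
        long expiry = ttlMillis <= 0 ? Long.MAX_VALUE : System.currentTimeMillis() + ttlMillis;
        map.put(key, new Entry<>(value, expiry));
        evict();
    }

    synchronized V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        if (e.expiresAtMillis < System.currentTimeMillis()) { // TTL elapsed: treat as a miss
            map.remove(key);
            return null;
        }
        return e.value;
    }

    private void evict() {
        // Drop expired entries first, then the least recently used, until we fit.
        long now = System.currentTimeMillis();
        map.values().removeIf(e -> e.expiresAtMillis < now);
        Iterator<K> lru = map.keySet().iterator();
        while (map.size() > capacity && lru.hasNext()) {
            lru.next();
            lru.remove();
        }
    }
}
```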
[00:20:48] Unknown:
And as far as being able to manage the life cycle of the data in the underlying systems, where, for instance, you might have initial data getting loaded into S3 from an inbound data pipeline, then you do some analysis or processing on it, and then you want to store it long term in S3 or some form of cold storage: is there any capacity in Alluxio for being able to manage that life cycle? Or would it just be based on the mount points for those underlying storage systems, and then having the computational framework be responsible for reading from one location and writing back to a different mount point for that more persistent long term storage and archival?
[00:21:37] Unknown:
This is a great question, and this is basically what I mentioned. We do see large enterprises that have requests like this. They really want to have a very intelligent data management system to migrate data from hot storage to warm storage and from warm storage to cold storage, to improve data efficiency and also reduce their data cost. Right now, with the current Alluxio open source edition, what you can do is, as I just mentioned, write ETL jobs to move data from one location to another if they are backed by different mount points representing different under stores.
So doing that is possible. We don't have automation there yet, but I believe this is on the roadmap for the future.
[00:22:29] Unknown:
And another question I have, particularly as it relates to metadata: in the computational layer, if you're trying to search for a particular set of records or just do a discovery process of seeing what data is available in which systems, does Alluxio proactively retrieve the metadata from those storage systems so that those initial operations, for instance an ls of the file structure, return more quickly? Or does it do a fetch at the initial read time and then just cache it for a particular period? And then also, I'm wondering if it has any native support for some of the more high level data storage formats, such as Parquet or Avro, where you can potentially query from the computational layer into the underlying files to determine if a particular set of records has the information that you're looking for before you necessarily retrieve it into the Alluxio layer for further processing?
[00:23:29] Unknown:
That's a great question. Let me go back to the metadata part first. The first mode is loading the metadata on the first operation and putting it into our metadata store. The second mode is that you can also proactively load the metadata: if you know your computation will touch this bucket in the next hour or so, you can run something first to preload that metadata. So both are supported. That's how the metadata enters Alluxio and gets remembered by Alluxio. Regarding how we manage the life cycle of that metadata, we also provide different policies. For example, you can set a cache expiration time: the metadata I remember for the inode tree for this part of the file system will be trusted for maybe an hour, and later on we will just do another load once some application accesses data there. So that's one policy. The other policy is: okay, there might be frequent out-of-band modifications to my under store, so let me not trust the cached metadata at all, and on every single operation check for updates on the under store side. So that's another, very aggressive policy.
But you can see that this will make the metadata operations slower because it's just adding more overhead, so it really depends on the application. We are also adding new features so that for certain under stores like HDFS, there are hooks you can use: once there is some modification to their metadata, you can register an action on their side, so we can proactively update the metadata whenever there's a modification there. This is also something we'll do as a new feature in the new release. The second part is regarding how we can optimize if we understand more about the data structure in the file, like Parquet, and whether we do some push down. Right now, Alluxio is a file system, so we treat the data format as just a plain stream of bytes. We do not go into the file structure to provide more optimization.
But on the other hand, I'm actually working with some researchers from universities to see: if I know this is a Parquet file, can I do something smarter? For example, can I do some clustering, or can I do some more intelligent replication? So I think that's a very interesting direction,
[00:26:24] Unknown:
and we will see how the results turn out. Yeah. And another possible approach to that would be to rely on other metadata storage systems, particularly for Hadoop, where you have something like the Hive metastore or the Iceberg table format that's being worked on at Netflix and with other people. You could potentially interface with those metadata storage systems that are already doing the work of parsing the records within those more high level storage formats, and then use that to determine which subset of files have the information you need to then retrieve into Alluxio?
[00:27:01] Unknown:
Yeah, that's definitely another way to go. We are very interested to see some experimental results or some benchmarks, and to see if that can influence the roadmap for Alluxio in the future. I think this is something very interesting. And because this is an open source project, if university researchers or other people feel this is something worth doing for their workloads, we are very happy to provide some help or some collaboration
[00:27:37] Unknown:
on this end. And in terms of the actual code itself, I'm wondering if you can discuss some of the particular challenges that you've faced in building, maintaining, and growing the project, and some of the evolution that it has undergone in terms of the overall architecture and design of the system? Yes, I have a lot. So,
[00:27:59] Unknown:
I'll share one interesting story. My first big project in Alluxio after joining the team was to make Alluxio modularized. If you know a bit about Java and JVM applications: Alluxio is built in Java, like 90, maybe 95 percent of the source code is Java. The way it works is you build Alluxio into jars, binary jars, and the JVM executes the jars. In the early days, because it was a research project and the developers and maintainers were really optimizing for fast growth, just move fast, all the components were in the same single Java module, which means you compile a single jar. Even on the client side, the jar you put into Hive or MapReduce is also the jar you use to run the service, the RPC service, the long running daemons. This is a really bad design, but it gives a lot of convenience, because in Java land, when everything is in a single module, you can reference all the other classes without worrying about managing dependencies.
My first project on Alluxio was to break this single monolithic jar into multiple jars with clean dependencies, so we can distribute a much thinner jar to applications like Hive, MapReduce, or Spark. It took me quite a few months, I would say, to figure out the relationships between all these dependencies and their callers, and then figure out how to break this into different jars, different modules. One lesson I learned from that is that once the project starts to grow to a certain degree, you should start thinking about architecture much earlier.
That way you can have much better management of all the dependencies, and it will simplify life down the road. After that, we were actually pretty happy with the structure. But still, as you add new features, more modules come in, and how to make this pluggable and how to make different components easy to upgrade is an ongoing process forever. So that's something regarding source code management. Also, with Java, if you work in the Java land you will understand the JAR hell issue, which is the pollution of binary libraries on other applications' classpaths. It's a particular challenge for us because we are supposed to talk to all different kinds of data stores. Originally we just put everything together on the classpath, and then you see all the dependency hell issues. We spent a lot of time dealing with that, and you have to do a lot of dirty work, workarounds or hacky ways, to solve these JAR hell issues. So later on, we decided to do this in a very clean way and use the class loader to isolate the libraries. In this way, we see far fewer issues in terms of JAR hell.
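The class loader isolation idea can be sketched in a few lines of Java: each connector jar is loaded in its own `URLClassLoader`, so its transitive dependencies never land on the main application classpath. The class and jar names below are hypothetical, not Alluxio's actual code.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative sketch: load an under-store connector (and its dependencies) from its
// own jar in a dedicated class loader, so conflicting library versions stay isolated.
final class ConnectorLoader {
    static Object loadConnector(Path jar, String className) throws Exception {
        URL[] urls = { jar.toUri().toURL() };
        // Classes not found in the parent (JDK + core application) are resolved
        // inside this child loader, where the connector's own jars live.
        URLClassLoader isolated = new URLClassLoader(urls, ConnectorLoader.class.getClassLoader());
        Class<?> cls = Class.forName(className, true, isolated);
        return cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical jar path and class name, for illustration only.
        Object s3Connector = loadConnector(Paths.get("connectors/s3/connector.jar"),
                "example.S3UnderStore");
        System.out.println("Loaded " + s3Connector.getClass().getName());
    }
}
```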
But it also makes the code a little bit more complicated. I think this is totally worth it, though. Code wise, this is something we learned that is very important. The other part, one last thing, is really resource management. People know Java is a good language because you don't need to delete something you allocated from the heap; the garbage collector lets you be lazy. But that's only one kind of resource. There are a lot of other resources: for example, the sockets you open, the locks you acquire, which can even be distributed locks, and also all the file handles, all these different pieces. Sometimes, for efficiency, we also use lower level Java APIs, and there can be memory leaks related to that. So how do you handle this well to make sure you don't leak resources? Because it's a distributed file system, once you have this kind of issue it's very hard to diagnose. Once you know the cause, it might be relatively easy to fix, but even diagnosing the issue is very hard. So what we later decided to do in the source code is to have a very strict pattern that everyone should follow when you acquire some resource, and we leverage some good functionality Java provides to make sure that when you exit normally, the resource will be returned, but also, for example, if you encounter some exception you don't really expect, and you go through a code path you don't really expect, we make sure that in that case we also return the resource correspondingly.
So this also helps a lot to make the source code much more robust.
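A minimal Java sketch of that strict acquire/release pattern, assuming a simple lock as the resource: wrapping the acquisition in an `AutoCloseable` lets try-with-resources release it on both the normal path and any unexpected exception path.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch: a resource wrapper that guarantees release via try-with-resources.
final class LockResource implements AutoCloseable {
    private final Lock lock;

    LockResource(Lock lock) {
        this.lock = lock;
        this.lock.lock(); // acquire on construction
    }

    @Override
    public void close() {
        lock.unlock(); // always released, even if the guarded code throws
    }

    public static void main(String[] args) {
        Lock stateLock = new ReentrantLock();
        try (LockResource r = new LockResource(stateLock)) {
            // ... mutate shared state; an exception here still releases the lock ...
            System.out.println("holding lock: " + stateLock);
        }
    }
}
```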
[00:33:26] Unknown:
In terms of being able to write it in such a way that you can easily add new back end capabilities for interfacing with different storage systems, and also making it easy for different computational frameworks to interface with Alluxio: what are some of the particular challenges that have come up on that front? And what are some of the trade offs that you've made in order to be able to provide a single API to the computational layer that works effectively across all of those different underlying storage systems that might have varying capabilities or more advanced functionality that's not necessarily exposed?
[00:34:05] Unknown:
Yeah. So one challenge, as I just mentioned, is that if you want to talk to different under stores, they usually require different libraries, and you have to make all those libraries work together happily. As I mentioned, we use class loading to isolate all the dependencies to make all of them happy. The other part is how to find a common API that is useful enough and also common enough to cover all the different storage types. We have had multiple iterations on that. In the beginning we had one set of APIs, what we call the under store API. Later on, we realized this was not sufficient, especially when we saw more use cases with object stores, which have very different semantics, and not just semantics but also security requirements, compared to traditional file systems.
We put a lot of emphasis on that to make it work, and after multiple iterations we gradually converged to a version that works for most of the cases. This is what we call the southbound API, basically how Alluxio talks to the under store. We also have what we call the northbound API, which is what we provide to different applications. On that side, most of the applications we see so far use our Hadoop compatible API. In that way, applications can just assume this is a Hadoop file system, just a different implementation of the Hadoop file system, so they can continue to use whatever they assume and just replace Hadoop HDFS with Alluxio.
On top of that, we also have our own file system API, which is more similar to the Java file system API, adding more features like setting TTLs, doing mounts, and a lot of Alluxio specific functionality. One interesting trend we see nowadays: in the beginning, people had legacy applications they wanted to run on the distributed file system, so they wanted the POSIX API, and we had a contributor, I think from IBM, who contributed the POSIX API for us.
But later on, we realized that a new generation of machine learning applications also leans heavily on the POSIX API; they typically assume the data is in a local directory. So this has become popular again among machine learning tools. Essentially, the way we handle the northbound side is that we provide different sets of APIs: the HDFS compatible one for distributed applications like Hadoop or Spark, and for more traditional applications or the new wave of machine learning applications, we provide the POSIX API.
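To illustrate the shape of such a southbound adapter, here is a hypothetical Java interface (illustrative names only, not Alluxio's actual UnderFileSystem API) that each under store implementation would provide, so the rest of the system is written against one common API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

// Illustrative "southbound" adapter interface: each under store (HDFS, S3, ...)
// implements this, hiding its own client library behind one common surface.
interface UnderStore {
    InputStream open(String path) throws IOException;
    OutputStream create(String path) throws IOException;
    List<String> list(String directory) throws IOException;
    boolean rename(String source, String destination) throws IOException; // may be emulated on object stores
    void delete(String path) throws IOException;
}
```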
[00:37:18] Unknown:
And from the operational perspective, I'd like to get an understanding of what the different scaling factors are in Alluxio, and some of the edge cases that come up as you try to add and remove nodes, particularly as it relates to data distribution within the cluster?
[00:37:33] Unknown:
Great question. It depends on the workloads and the environment; we see a few different scalability factors here. The major one is really the bottleneck on the master node. You can think of our master node as equivalent to the Hadoop file system's name node. Because the master node uses on heap Java memory to store all the mappings, the inode trees and all the detailed information, it can be very memory hungry, and it can also be GC intensive if there are a lot of small, intensive modifications to the file system. So this becomes one scaling bottleneck once you store, say, hundreds of millions of files in an Alluxio namespace, and it becomes an unstable part of the master service. So in the new release in the coming months, the whole new 2.0 release, we are moving this part, the on heap memory for the basic file system information, the inode tree, to an off heap implementation. By doing this, we can scale the file system to be much larger, containing billions of files and directories. So that's one thing. The other part is really how many workers you can handle, because there are heartbeats between workers and the master, and there are different control messages sent over. If you scale to thousands of nodes or beyond, you will see the master node become a bottleneck, or at least very pressured, because of this kind of workload. One way we handle this is by testing aggressively on each release to make sure we can scale to thousands of worker nodes, but we also optimize a lot on the RPC side, the threading pools, all these different things. We really want to make it work; I think in the new release, we can handle at least a few thousand workers in one deployment. So those are the bottlenecks we have seen in terms of scalability and our efforts to improve on them. So, what was the other part of the question?
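As a toy illustration of the off-heap idea mentioned above, the Java sketch below keeps bulky per-file metadata in a direct `ByteBuffer` outside the garbage-collected heap, with only a small index on heap; real implementations typically use an embedded store such as RocksDB, and the names here are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative, append-only sketch with no bounds checking: the metadata bytes live
// off heap and are invisible to the GC; only the small offset index stays on heap.
final class OffHeapNameStore {
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024 * 1024); // 64 MiB off heap
    private final Map<Long, int[]> index = new HashMap<>(); // inode id -> {offset, length}

    void put(long inodeId, String metadata) {
        byte[] bytes = metadata.getBytes(StandardCharsets.UTF_8);
        int offset = buffer.position();
        buffer.put(bytes);
        index.put(inodeId, new int[] { offset, bytes.length });
    }

    String get(long inodeId) {
        int[] loc = index.get(inodeId);
        if (loc == null) return null;
        byte[] out = new byte[loc[1]];
        ByteBuffer view = buffer.duplicate(); // independent read position
        view.position(loc[0]);
        view.get(out);
        return new String(out, StandardCharsets.UTF_8);
    }
}
```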
[00:39:59] Unknown:
Just wondering how scaling the node count up or down impacts the data distribution in the cluster. I'm wondering if you use things like consistent hashing to ensure that you're always able to retrieve a record from a given instance, or if you use more of the name node style approach where you have metadata in the master about where all the files are located. And then also, any sort of redundancy, so that if one node does go away, you don't have to retrieve the data all over again from the source system and can instead just replicate it from within the Alluxio cluster? Two parts. One is on
[00:40:36] Unknown:
the metadata service part. We do provide a high availability mode: you can run multiple master nodes, multiple master services, inside an Alluxio deployment. Internally we use ZooKeeper, and in the next release we're using Raft to decide who is the primary master node serving the metadata service. On the data side, we do have a replication scheme built in. This is actually something very interesting I want to highlight. If you think about HDFS, the default replication factor is 3, meaning we want to have 3 copies of the data to avoid losing data in some time window. In our case, because we are not aiming to provide persistence, we do not have a target replication factor by default. It really depends on the workload, the temperature of the data. For example, if the data is hot, the application may just trigger more accesses to retrieve the data, and it turns out this will make more replicas in the Alluxio space, cached on different workers. So in the coming release, we also provide replication controls.
We can say, okay, I want the minimum replication to be 3 copies for this data, or at most 10 copies, and beyond that I don't really need more. So it's more fine grained control of replication against failures. And regarding consistent hashing, as you said, we actually use both. We use the master node to remember where the data is located across different workers. But on the other hand, we have a policy that when we load data into Alluxio, to avoid data skewness, we can also place the data in a way similar to consistent hashing, so we can get a more uniform distribution across the Alluxio space when we pull data in from the data source. And you can say, okay, I want the replication to be 3 or 4, and then even if you have 10 different workers reading the same data, they will not compete too much for the under store bandwidth, and this will guarantee that the 3 or 4 copies you specified are uniformly distributed. Basically, they will coordinate.
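For readers unfamiliar with consistent hashing, here is a minimal, self-contained Java sketch of a hash ring with virtual nodes: each block id maps to the next worker clockwise, and adding or removing a worker only remaps a small fraction of blocks. This is a generic illustration, not Alluxio's placement code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

final class ConsistentHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();

    ConsistentHashRing(List<String> workers, int virtualNodesPerWorker) {
        // Place each worker at many points on the ring for better balance.
        for (String worker : workers) {
            for (int i = 0; i < virtualNodesPerWorker; i++) {
                ring.put(hash(worker + "#" + i), worker);
            }
        }
    }

    String workerFor(String blockId) {
        long h = hash(blockId);
        SortedMap<Long, String> tail = ring.tailMap(h);
        // Wrap around to the start of the ring if we fall past the last virtual node.
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long value = 0;
            for (int i = 0; i < 8; i++) {
                value = (value << 8) | (digest[i] & 0xffL);
            }
            return value;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(List.of("worker-1", "worker-2", "worker-3"), 100);
        System.out.println("block-42 -> " + ring.workerFor("block-42"));
    }
}
```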
[00:43:07] Unknown:
And when I was looking through your documentation and blog posts to prepare for the show, I was impressed by the scale of testing that you do yourselves rather than just relying on users or beta testers to ensure that new releases are bug free and not subject to regressions. So I'm wondering if you can talk a bit about your overall approach for being able to run these verifications at scale, and also any operational considerations that you've built into Alluxio to make it easy to build and deploy these clusters in an automated fashion?
[00:43:44] Unknown:
Yeah. We put a lot of emphasis on that, in terms of testing and also the ease of deployment. Regarding testing, there are multiple different levels of testing happening. First, because I'm from Google, and in my team at Google we emphasized unit tests and integration tests a lot, when I joined the team I tried to bring the same mindset. We do have requirements for unit tests: anyone who contributes code will be asked, okay, can you also write the unit test, and if possible add an integration test too. These are built with the source code, and once you try to submit something to our GitHub repository these tests will be automatically triggered so we can have some sanity check. But later on, we realized this is not sufficient, especially after a few releases. We noticed that some issues will only be identified if you're running the service for a while, like for hours or for days or even for weeks. So what we do is build more tests, what we call nightly integration tests, where we wrote the entire infrastructure on AWS to launch these tests nightly, so we can run different workloads, including HiBench, including TPC-DS, or more micro benchmarks that just pressure the master node or just pressure the worker node and see what the results are in terms of performance and in terms of correctness. So we do have nightly builds too. And before a release, we have stricter tests, a more complete test suite for the release.
We will also ask partners, friends, or heavy users in the community to help us test the beta in their environment. Even with that, there will always be bugs, right? So we also try hard to help users get bugs fixed immediately. We just recently launched the Slack channel so we can talk to users in a more real time fashion, but we also talk to these users periodically to check whether they see any issues. If they hit any issues, we will try to really solve them, help them solve these issues early. I think all of these are actually common practices from different industries and companies, and we just try to implement them in a way suiting our own team and our environment, and it works pretty well so far. And regarding the scalability testing you mentioned, that's something we are actually very proud of. We are doing these tests with thousands of worker instances to really pressure test our service, and we spend a lot of effort on making the cost affordable, because we always use AWS or other cloud infrastructure to build our tests. As I just mentioned, we have a blog post on that, and we actually have a white paper summarizing all the caveats we saw running this scalability test on AWS. I think it's a really good read.
[00:47:11] Unknown:
And going back to the deployment topology, as you mentioned, the primary use cases are for being able to provide data reflections so that you can get faster access to some of these underlying storage layers, possibly at a remote location in the case that you described with the Chinese Internet providers. So can you discuss a bit about what the typical topology is in terms of how Alluxio is deployed in relation to the storage systems and particularly to the compute? I'm wondering if you have the compute operations running directly on the same nodes that Alluxio is using, or if they're just collocated from a network perspective.
[00:47:50] Unknown:
Our recommendation is to collocate Alluxio, so the set of nodes running Alluxio is the same set of nodes running the computation, for example Presto or Spark. In this way, you can trigger what we call short circuit data IO, reads or writes, because we try to aggressively leverage data locality: if they happen to be on the same nodes, you don't have to go through the network. So that's the recommended topology for deploying Alluxio. We also see cases where Alluxio is not collocated on the same set of nodes, due to a variety of reasons. In that case, it really depends on what the bandwidth and latency are from your compute to your data source versus from your compute to the Alluxio deployment.
If it is still a large ratio, meaning even if it's not on the same set of machines, it's within the same rack or the same data center, while the remote data source is really remote, across the ocean, then I think that's still considered close enough, and it works. That's basically what we see. Essentially, the recommendation I would give is to make sure Alluxio is close enough
[00:49:10] Unknown:
to the compute, compared to how close the data source is to the compute. And for somebody who's planning a deployment of Alluxio, what should they be thinking of in terms of system capacity as far as available RAM and the disk that's on the system, whether it's spinning disk drives or SSDs, and any sorts of optimizations that they can make either at the node or network layer? The system requirements
[00:49:38] Unknown:
are not really fixed; I would say it depends on whether it's the master or the worker nodes. For our master service running on the master node, you probably want relatively beefy nodes with enough memory, say on the order of 10 to 100 gigabytes, so you have enough space to store the file system information and make sure you don't really see a lot of GCs. For the worker nodes, it really depends on your target SLA. If you're really talking about a very strict SLA, returning data with very short latency compared to reading from disk, then use Alluxio to allocate and manage memory for you. But if your target is really to reduce the latency of reading data from a remote data center, in that case I would say having Alluxio manage SSD or hard disk is already good enough for the workload. In that case, having data in memory versus having data on a local SSD or disk does not really provide a significant gain on top of the network latency you already removed by not dragging data from a remote data center. So, yeah, that's my philosophy for setting up Alluxio deployments.
[00:51:04] Unknown:
And what are some of the cases where Alluxio is the wrong choice for a given use case? And I'm curious if there are any other projects or products that are working in a similar or possibly adjacent space that might be a better fit for those situations.
[00:51:20] Unknown:
There are cases where we think Alluxio is not really the best fit, including if a database is your source of truth. Alluxio right now integrates with file systems or object stores, and a database is not really an available or ready to use data source for Alluxio. In that case, having Alluxio in between is not really, I don't think that's the target use case. Another use case we think is not really a good fit for Alluxio is if you have already collocated computation and storage, and your network bandwidth is awesome, and your hard disk bandwidth is under no pressure at all, and your memory is large. Then from time to time we see there are only marginal benefits in that case. But we do also see some other benefits if you run in the collocated case; when I say collocated, I mean the data source is already collocated with the computation.
We do see use cases putting Alluxio in there again for SLA reasons, because different applications are competing for the disk spindles and they see large variance, and serving data from Alluxio memory will just increase the data bandwidth. But if that is not a problem, then the collocated environment is not our target use case. As for the similar or adjacent space, we do see, for example, a product called S3Guard. That is really more like a cache for S3, but I think that's only for the metadata part, so if you want to cache data, it does not provide that. Alluxio does both data caching and metadata management. S3Guard is one way to translate the S3 APIs of your store into a file system API for different applications to consume, but that's only on the metadata side. There are also some projects we see doing something similar but only on the data path; they do not handle the metadata, or they do a very straightforward translation on the metadata side. For example, with Hadoop itself, you can use Hadoop to access S3 directly, that's no problem, but it's just doing a client side translation and using a different client to access S3, and there will be no data caching and no metadata caching either. And also we see cases where they do only data caching but not the metadata management.
I think we see a few GitHub repositories providing functionality like this. Essentially, once you read from S3, you download the data to your local node, and when the same node requests the data again, you can serve it from local storage. But this is not really distributed; there's no coordination between the different local caches. I would say this is very single node, and there is no coordination to help applications access data from a cache on a different node. So in general, my takeaway is that our project is pretty unique. I guess one of the reasons is that in the early days it was a research project.
We just did whatever we thought was interesting to do, and later on it turned out to have real use cases and to solve important problems for users. And it's open source, and I think a lot of users just take it instead of launching a new similar project. And what do you have in store for the future of the Alluxio project and company? So in the near future, we are releasing Alluxio 2.0. As I mentioned, there are major upgrades to Alluxio, including moving the inode trees to off heap storage. We are also changing the entire RPC system to be more scalable and more extensible, and a lot of new features are being added, like asynchronous propagation from the Alluxio space to the data source. So that's the near future, in a few months. In the long run, we really just want Alluxio to be the unified data access layer for any application. We started from the big data world, and in the future we hope this can cover other areas and different workloads.
For example, we are starting to see and serve more machine learning workloads, so we are moving across more areas. Also, we used to see people mostly use big data for unstructured data, but in the last few years people have definitely built more systems to derive insight from structured data, for example Presto or Spark SQL, all the different big data SQL query engines. That's also becoming a trend we see, and we hope we can have deeper integration with these kinds of systems. As for the company, we are still early stage, I would say. In the early days, we focused a lot on engineering, and now we think the open source community, I mean the user community, is something we really want to focus on in the next few years, to really understand how users are using Alluxio and how we can help users improve the value they get. And actually, I'm part of that effort. If you need any help or have any interesting ideas, feel free to talk to me and we'll be very happy to have more discussion.
[00:57:26] Unknown:
And for anybody who does want to give you that feedback or follow the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today? I think one actually very
[00:57:45] Unknown:
important change in the last few years is really the trend of moving data to the cloud. I would say a change this huge is something you see maybe every 10 years. And cloud providers, cloud storage vendors, mostly just provide their own object store service, maybe with some file system wrapper on top of that. But we are still in the early days of learning how to build the right architecture when you really move your computation to the cloud, whether it's a public cloud or an on premise cloud. When this happens, how do we really architect the system well?
And based on this new architecture, where computation and storage are more and more disaggregated, what are the new possibilities, and what is the tooling associated with them? Like one thing you mentioned: how to automatically migrate data from hot to warm and from warm to cold storage. And also, for example, in the future, can we have an even smarter way to decide which cloud storage vendor I should pick based on the price or based on the topology? With all this data management, I just see we are entering a new era where we are able to do a lot more things.
And I think it will take a few years for us to really understand
[00:59:10] Unknown:
what the new things are that we can do in this new era. Well, thank you very much for taking the time today to describe the Alluxio project. It's definitely a very interesting platform and one that seems to fit a very big need in the big data community, so I'm happy to see the work that you're doing on it. I want to thank you again for your time, and I hope you enjoy the rest of your day. Yeah. Thank you so much, Tobias. I'm very happy to share my experience.
Introduction and Sponsor Message
Interview with Bin Fan: Introduction and Background
The Evolution and Purpose of Alluxio
Use Cases and Benefits of Alluxio
Implementation and Performance Optimization
Challenges in Building and Maintaining Alluxio
Scalability and Data Distribution
Testing, Deployment, and Operational Considerations
Deployment Topology and System Requirements
When Alluxio is Not the Right Fit
Future of Alluxio
Biggest Gaps in Data Management Technology
Closing Remarks