Summary
Data lakes are gaining popularity due to their flexibility and reduced cost of storage. Along with the benefits there are some additional complexities to consider, including how to safely integrate new data sources or test out changes to existing pipelines. In order to address these challenges the team at Treeverse created LakeFS to introduce version control capabilities to your storage layer. In this episode Einat Orr and Oz Katz explain how they implemented branching and merging capabilities for object storage, best practices for how to use versioning primitives to introduce changes to your data lake, how LakeFS is architected, and how you can start using it for your own data platform.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- Your host is Tobias Macey and today I’m interviewing Einat Orr and Oz Katz about their work at Treeverse on the LakeFS system for versioning your data lakes the same way you version your code.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what LakeFS is and why you built it?
- There are a number of tools and platforms that support data virtualization and data versioning. How does LakeFS compare to the available options? (e.g. Alluxio, Denodo, Pachyderm, DVC, etc.)
- What are the primary use cases that LakeFS enables?
- For someone who wants to use LakeFS what is involved in getting it set up?
- How is LakeFS implemented?
- How has the design of the system changed or evolved since you began working on it?
- What assumptions did you have going into it which have since been invalidated or modified?
- How does the workflow for an engineer or analyst change from working directly against S3 to running against the LakeFS interface?
- How do you handle merge conflicts and resolution?
- What are some of the potential edge cases or foot guns that they should be aware of when there are multiple people using the same repository?
- How do you approach management of the data lifecycle or garbage collection to avoid ballooning the cost of storage for a dataset that is tracking a high number of branches with diverging commits?
- Given that S3 and GCS are eventually consistent storage layers, how do you handle snapshots/transactionality of the data you are working with?
- What are the axes for scaling an installation of LakeFS?
- What are the limitations in terms of size or geographic distribution of the datasets?
- What are some of the most interesting, unexpected, or innovative ways that you have seen LakeFS being used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building LakeFS?
- When is LakeFS the wrong choice?
- What do you have planned for the future of the project?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Treeverse
- LakeFS
- lakeFS Slack Channel
- SimilarWeb
- Kaggle
- DagsHub
- Alluxio
- Pachyderm
- DVC
- ML Ops (Machine Learning Operations)
- DoltHub
- Delta Lake
- Hudi
- Iceberg Table Format
- Kubernetes
- PostgreSQL
- Git
- Spark
- Presto
- CockroachDB
- YugabyteDB
- Citus
- Hive Metastore
- Immunai
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy analytics in the cloud. Their comprehensive data-level security, auditing, and de-identification features eliminate the need for time-consuming manual processes, and their focus on data and compliance team collaboration empowers you to deliver quick and valuable analytics on the most sensitive data to unlock the full potential of your cloud data platforms.
Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta. That's I-M-M-U-T-A. Your host is Tobias Macey. And today, I'm interviewing Einat Orr and Oz Katz about their work at Treeverse on the lakeFS system for versioning your data lakes the same way you version your code. So, Einat, can you start by introducing yourself?
[00:01:57] Unknown:
So I'm currently the CEO of Treeverse, and my past experience starts with being an algorithms developer, or a data scientist as it is referred to today. I have a PhD in mathematics. And most of my professional life, I worked around data, with data, or managing teams developing data-intensive applications.
[00:02:18] Unknown:
Oz, how about yourself?
[00:02:20] Unknown:
Hi. I'm currently the CTO of Treeverse. I'm working on all the technological aspects of the product. Before that, I worked with Einat at SimilarWeb, which is an analytics and data company, for about 5 years. Prior to that, different startups and all kinds of companies that, I guess, what they all have in common is large quantities of data that need to be processed quickly.
[00:02:40] Unknown:
Einat, do you remember how you first got involved in the area of data management?
[00:02:43] Unknown:
So I guess when you develop algorithms that are working with large amounts of data, and large has changed over the years. So once upon a time, several gigabytes of data were a challenge. Now we're talking about completely different scales. But when that is the input to all your work and no infrastructure exists (I'm pretty old), you have to create that yourself. So I would say day 1 when I started developing algorithms, I encountered the challenges of managing data.
[00:03:15] Unknown:
And, Oz, do you remember how you got involved in data management?
[00:03:18] Unknown:
Yeah. So, actually, for me, it was in the military in the Israeli army. I was in the intelligence corps, and most of what we did was around signals and how to process them efficiently. Of course, I won't get into too many details there. And from there, my beginning was mostly around telecom systems, which also deal with very large quantities of data with a very short SLA to process them. For me, it was always kind of like a bread and butter of what I'm used to doing.
[00:03:43] Unknown:
As you mentioned, this has led you to building the lakeFS platform for being able to handle data lakes in a more iterative workflow. So I'm wondering if you can just give an overview of what that platform is and what motivated you to build it. As Oz mentioned, we were both part of the engineering team at SimilarWeb.
[00:04:05] Unknown:
SimilarWeb holds over 7 petabytes of data on S3, which is an object storage. And the reason we chose an object storage was, of course, the throughput, the scalability, and the low cost. But it also introduced problems because we were actually managing all this data in a shared folder. So there are a lot of pains around manageability that we have suffered over the years. And when we decided we wanted to do something else, we thought a systematic way of managing data as a data lake over an object storage is something that requires more tools that we can offer the world. And Oz had that amazing idea of doing lakeFS, and here we are.
[00:04:50] Unknown:
So there are a number of different tools and platforms available right now that support data virtualization and data versioning. I'm wondering if you can give some context as to how lakeFS compares to those options for things such as Alluxio or Pachyderm or DVC.
[00:05:06] Unknown:
So data virtualization tools are all about giving 1 endpoint for managing data that is actually stored in different data sources, while data versioning tools are meant to provide different versions of the same dataset over time and to allow time travel between those different versions. Now if we look at data versioning tools, there are different categories of tools. I'd say there are about 4. The first 2 are a bit similar. The first group would be the group of tools that are actually meant for MLOps. So you have a machine learning pipeline that you need to manage from experimentation all the way to deploying it into production. And 1 of the things you would need to manage, along with the versioning of the code, would be versioning of the data that you are using. And then Pachyderm and DVC would be the right tools for you to use. The other group is a group that allows sharing and collaborating over data. That would be tools like DoltHub or Kaggle. DoltHub is actually a repository, a database that allows versioning and Git-like operations.
And Kaggle is actually exposing versions of universally available datasets so that you can take different versions of the data while using it. So it's all about collaborating over data in a simple way. The 3rd group is completely different because it's actually data formats, like Hudi or Delta Lake or Iceberg. Those are actually data formats that allow you to access the data over time within a specific collection or table. And the reason they have time travel is not necessarily to allow versioning, but rather because they want to support mutability of data in an immutable environment such as an object storage. And the way to do that would be to save deltas of changes of the collection that you are managing and then, on read or on write, merge all those changes into 1 coherent version of the data. It's just that in order to prevent concurrency issues, they have to version the evolution of the data.
And as a byproduct, they actually provide time travel, but for short periods of time. And the last group, the 4th group, is the group that includes lakeFS. lakeFS gives you Git-like operations over an object storage, so it lives in the same environment where the data formats work. But while the data formats allow you to move forward and backward on 1 collection or 1 table of data, lakeFS is actually format agnostic. And it allows you not only to move forward or backwards, but also to open branches and actually to move sideways and to run things in parallel within the environment.
[00:08:10] Unknown:
And for the 3rd and 4th categories there, where things like Iceberg allow for multiversion concurrency control and lakeFS allows for branching and safely integrating new data sources or onboarding new processing pipelines, do you see those as being potentially complementary, where you can use lakeFS for being able to branch the set of data that the Iceberg table format or something like Presto is operating on? Absolutely. That's actually,
[00:08:39] Unknown:
1 of the goals. So since lakeFS is format agnostic, it also works over Delta or Iceberg or Hudi and actually enriches them with the ability to have branches and commits over a certain version of the collection that is managed by those formats. So it works very well together, and lakeFS serves as an enhancement for those formats.
[00:09:06] Unknown:
And so for somebody who is interested in using lakeFS, what is actually involved in getting it set up and starting to use it with their data?
[00:09:16] Unknown:
So 1 of the things we focused on early on is trying to reduce the amount of work needed to get a lakeFS environment running. So to get started with a local environment, it's a 1 liner, basically. As long as you have Docker and Docker Compose installed, you run a single command, and you get an environment that you can start using and exploring. To take something into production, we target Kubernetes, something that we see many companies adopting right now. So if you're running on a cloud provider and you have a Kubernetes environment, setup is as easy as probably 10 to 15 minutes. For our metadata storage, we use RDS or any other managed Postgres service.
So the idea is to kinda reduce the friction of getting something into production quickly.
[00:09:59] Unknown:
You've released this as an open source project, and you're using the S3 API so that it is essentially transparent to other tools that are already able to work with object storage. I'm wondering if you can talk a bit about the motivation for releasing this as an open source project.
[00:10:24] Unknown:
As a data engineer, most of the tools in this area tend to be open source, which is something that we also wanted to be. We wanted to enjoy the community that comes with that and all the innovation and the ideas that they can bring, and not build this as kind of, like, an isolated product that only we get to decide how it looks and how it works with all the other parts of the ecosystem. And in general, if you look at a modern data lake, it includes so many different technologies and parts that we would quickly become a bottleneck if we tried to do everything ourselves. So we believe in the power of the community here.
[00:10:58] Unknown:
And, also, we view that as the best practice around big data technologies. The vast majority of big data technologies are open source or based on open source.
[00:11:08] Unknown:
And in terms of being able to version assets within object storage, similar to the branching model that you use in version control systems, I'm wondering how that helps with the overall collaboration capabilities and some of the interesting use cases that it unlocks that are either impractical or impossible just using the direct S3 or Google Cloud Storage APIs?
[00:11:35] Unknown:
Actually, the logic that runs under the hood is not necessarily Git-like. But we thought that the Git language is really helpful here because the pains are around the management of the data life cycle, similar to the code life cycle. And the solutions that we have offered in the code world should apply, with some adjustments, to good practices of the data world. So we chose a language everyone actually knows and also has a good expectation of what it would do, so they could feel free with the system and use it. As I said earlier, the pain is that we are actually working over a shared folder. So this is an environment that is very error prone on the 1 hand. And there are a lot of technologies that allow ingesting of data into the lake and consuming data from the lake.
So there is a lot going on when you're working with data. And when things are error prone, usually the problems cascade down, and it's very hard then to find the root cause. So what we were imagining is a world where you can actually have an easy way of creating a development environment and experimenting over data without risking production. For example, by opening a branch of your repository or your lake. And then this branch is isolated, and you can make changes there without influencing anyone else. You can make changes, revert changes, and once you're done experimenting, discard the branch, and all that would not hurt your production data or pollute your data lake in the long run.
You can open several branches with several different experiments and compare them easily. All those things today require copying of data, a lot of pipelining, and usually assistance from DevOps, who are never available. So it really holds everything up for a very long time. So running pipelines in parallel and comparing them is not an easy task today, and we really wanted to alleviate this pain and make it easier to experiment. And then later in the life cycle, many organizations find themselves integrating new data all the time.
We mean new data sources. And those data sources should adhere to good practices around governance, such as PII, good practices around technology, such as schema and formats, and other quality assurance requirements that the organization has. And today, there is no easy way to create a quality assurance pipeline during the ingest process. So when you have a Git-like tool that allows you to, again, ingest the data on an isolated branch, where no one except this branch is exposed to the data, then run a set of validations as a pre-merge hook to production, and only if the tests pass is the data exposed to the consumers, the entire environment becomes less error prone, and the quality of the data is much higher when it is exposed to the users. And, of course, there is the continuous deployment of data. So we imagine this data lake where, over it, you run hundreds of jobs from different applications and populate very many dashboards. Even if the code doesn't change, the data keeps on flowing. Updated data keeps on flowing and should be deployed into production in a continuous manner. And simply taking the data that you just collected or received from the 10 or 15 or, in some cases, hundreds of systems within the organization, and pouring it into production with no validation, creates a quality problem. So again, just deploying the data first to an isolated branch, and running a set of pre-merge hooks that make sure, continuously throughout the day, that the data that you expose to production is of high quality, is what we refer to as continuous deployment of data. So we actually bring the power of branching, merging, and committing, and, of course, reverting and rolling back, to the data world. So after we expose the data to our consumers, either new data that was integrated or new data that was deployed to production, in both cases, we might have not covered everything, and there's still a problem with the quality of the data. When you work with Git-like capabilities over the lake, you can roll back and revert to the last point in time where you had quality data and then have the tools, if you worked properly, to identify the bug much faster or create, if needed, an environment to reproduce the bug.
So we feel way more secure. We feel we work in an application development environment when we have all those tools over the data. So this is what we aspire to build.
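To make the ingest-validate-merge flow described above concrete, here is a minimal Python sketch of the loop. Every helper in it (create_branch, ingest, quality_checks_pass, merge, discard) is a hypothetical stand-in for whatever lakeFS operation or in-house tooling you would wire up; this is not the lakeFS API.

```python
from datetime import date

# --- Hypothetical helpers -------------------------------------------------
# Each stands in for whatever lakeFS operation (CLI, API, or an
# S3-compatible write to a branch-prefixed path) you actually wire up.
def create_branch(repo: str, name: str, source: str) -> None:
    print(f"[stub] create branch {name} from {source} in {repo}")

def ingest(repo: str, branch: str, files: list[str]) -> None:
    print(f"[stub] upload {files} onto {repo}/{branch}")

def quality_checks_pass(repo: str, branch: str) -> bool:
    # Schema, PII, and row-count checks would run here, against the branch only.
    return True

def merge(repo: str, branch: str, into: str) -> None:
    print(f"[stub] merge {branch} into {into}")

def discard(repo: str, branch: str) -> None:
    print(f"[stub] delete branch {branch}; production never saw this data")

# --- The continuous-deployment-of-data loop described in the episode ------
repo = "my-data-lake"                       # hypothetical repository name
branch = f"ingest-{date.today().isoformat()}"

create_branch(repo, branch, source="master")   # isolate the new data
ingest(repo, branch, ["events.parquet"])       # only this branch sees it
if quality_checks_pass(repo, branch):          # pre-merge validation
    merge(repo, branch, into="master")         # expose to consumers
else:
    discard(repo, branch)                      # production stays clean
```

The point is the shape of the workflow: new data only ever lands on an isolated branch, and master is touched by nothing except a merge that passed validation.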
[00:17:01] Unknown:
Yeah. That definitely sounds very powerful. Because for anybody who has had to try and work with data and figure out a way to replicate the production environment into a lower environment, there are a lot of issues in terms of being able to manage the volume of data that you're working with, figuring out ways to address PII problems, and not wanting to duplicate that information into an environment that isn't subject to the same levels of control. And so being able to, as you said, have this continuous delivery of data without a lot of the risk of potentially polluting your production environments is pretty powerful. So I'm wondering if you can dig a bit more into how the lakeFS platform itself is actually implemented and some of the ways that you allow for those versioning and snapshot and rollback capabilities.
[00:17:53] Unknown:
So 1 of the earliest design choices that we've made is to treat S3 as the storage layer, but not necessarily the metadata layer. So the way lakeFS is constructed is that there are basically 3 main parts to it. There is a PostgreSQL database, which stores the metadata itself, and this provides all the consistency and all the ACID guarantees that we would expect from a system like that. There's S3 itself, which provides the actual storage. Like, this is where you keep storing the data itself, which is, of course, at a much higher scale than the metadata. And then there is the way you access the system, which is currently API compatible with S3. The idea there was to provide kind of a lowest common denominator to all applications that know how to work with a modern data lake. They all know how to communicate with S3. So we thought this would be a good way to kind of do our go-to-market and try to cover all bases, and not have a single system in the organization that cannot talk to lakeFS because we haven't supported it yet, and therefore you can't use the system and enjoy it. So we decided to start with that.
We are planning down the road to also provide better integration into specific applications like Spark and the Hadoop ecosystem, Presto, and many others. But we decided to start with, like, the lowest common denominator, which is being API compatible with S3.
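Because lakeFS exposes an S3-compatible API, an existing S3 client can be pointed at it simply by swapping the endpoint. Here is a minimal sketch with boto3; the endpoint URL, credentials, repository name, and object path are hypothetical, and it assumes the repository is addressed where the bucket would normally go:

```python
import boto3

# Point an ordinary S3 client at a lakeFS deployment instead of AWS
# (hypothetical endpoint and credentials; lakeFS speaks the S3 API).
lakefs = boto3.client(
    "s3",
    endpoint_url="http://lakefs.example.internal:8000",
    aws_access_key_id="AKIA...",
    aws_secret_access_key="...",
)

# The repository plays the role of the bucket here (assumption), and the
# first path element names the branch or commit you want to read from.
obj = lakefs.get_object(
    Bucket="my-data-lake",
    Key="master/collections/events/part-0000.parquet",
)
print(obj["ContentLength"], "bytes read from the master branch")
```

Existing S3-speaking tools keep working unchanged; only the endpoint and the branch prefix in the key differ.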
[00:19:09] Unknown:
Because of the fact that you do have this separation of the metadata storage and the actual information storage with S3, and the fact that S3 itself is eventually consistent, I'm wondering if there are any cases that you've had to deal with or potential issues with synchronization or commit failures, where you push the data into S3 but there's a conflict with the metadata or vice versa, and some of the ways that you've been able to work around that?
[00:19:38] Unknown:
So that's a good question. I think, out of all the object stores, S3 has, like, the weirdest quirks when it comes to consistency. For us, like, we decided to reduce the API that we use internally on top of the object store to, like, the minimal set of operations that we can do that are consistent. So as long as you have an object store that lets you write an object and then retrieve it immediately, and this is something that S3, Google Cloud Storage, and Azure all support, our system can support it. We'll never try to do stuff like listing or any other metadata operation on S3 itself and rely on whatever it returns, because we know that there is no guarantee that this will be consistent.
For that, we rely on our own database, which provides much better consistency guarantees. For us, atomicity is kind of a major building block, and especially atomicity across different operations. And this is something that even if a single API request against S3 was consistent, they probably will never provide consistency across operations or across objects. For us, this is something that is key. So as a byproduct, we also solve that problem of these storages. Yeah. I mean, we see a lot of companies using tools like S3Guard on Hadoop or EMRFS if they're inside the AWS ecosystem to try to work around some of those issues, because a lot of the tools that rely on S3 as storage are already kind of falling into that place where the consistency issues can cause corruption.
[00:21:03] Unknown:
Another interesting aspect of this separation of metadata and storage is when you start to work with multiple regions, where, using S3 again as the use case, you can potentially have your data stored across US East and US West and EU regions. And I'm curious how you handle scaling across those multiple storage locations and being able to reduce latency of queries to lakeFS when you are working across those different geographies.
[00:21:34] Unknown:
In general, replicating data, even if it's just metadata, across geographies is kind of a risky thing to do. You probably should be doing it asynchronously. So at the moment, we're not trying to target the use case of having a data lake that's spread across continents. So the way we see it is that you have a lakeFS installation which is local to every region that you work in, and not try to create a repository that, like, spans different geographical regions. However, most cloud providers would let you do replication on both, like, their object store and their databases. And using that, you can build a pretty powerful disaster recovery story for lakeFS as well.
[00:22:13] Unknown:
And have you looked at the possibility of using systems such as CockroachDB or YugabyteDB or Citus for being able to scale out the PostgreSQL layer to be able to handle that geographic replication?
[00:22:28] Unknown:
Yeah. So, technically, it's possible. It's not something we're looking at right now just because we don't see, like, a very big need at the moment. This might change in the future. We're currently trying to solve the problems that occur when you're working within a single region, which is already kind of difficult enough if you look at the scale of things. But choosing Postgres as an API would probably allow us to do this pretty easily, especially with Cockroach supporting the Postgres API.
[00:22:53] Unknown:
And in terms of the overall design and implementation of the system, what are some of the ways that it has evolved since you first began working on it and some of the initial assumptions that you had, which have had to be updated or modified?
[00:23:07] Unknown:
Yeah. So, like, our initial assumption was that as long as we can support that lowest common denominator, like, as long as we can pretend to be S3, it's probably the least amount of friction for a data engineer to implement. But what we learned is that just by having something that you'll need to deploy and monitor and keep in production, which all the data goes through, it kind of becomes a DevOps effort more than a data engineering effort. And DevOps tends to be, like, the most constrained resource in the organization pretty much, and their attention span is already pretty low. So if a data engineer wants to start using the system, they'll probably need some DevOps assistance just to get that part set up. And it's actually much easier for them to implement something within their own code or add a jar to whatever cluster they're running on than to hold a server in production. So this is actually 1 of the directions that we're currently working towards: being able to support, like, a richer client so that the data doesn't have to go through a centralized server.
[00:24:03] Unknown:
Because from a reliability perspective, that would also be a better solution. Yeah.
[00:24:07] Unknown:
Another interesting aspect of this is when you're looking at the metadata layers for things like Presto or Spark for being able to interact with the data on S3, and the reliance on things like Hive or the Iceberg table format. And I'm curious how you think about interacting with those types of systems and being able to populate and provide the necessary metadata for those systems to be able to stay up to date and interact with the different branches of objects on the lakeFS storage layer?
[00:24:42] Unknown:
Yeah. So, like, the Hive metastore, or the Hive metastore API, became kind of an industry standard. So, usually, if you can represent your table inside the Hive metastore, many tools will be able to support it. This is true for Athena, and for Spark as well. And 1 of the things that the metastore API allows you to do is to create a table out of what they call symlinks, which is basically just a text file that points to S3 URIs or storage URIs that are stored elsewhere, like, not in the common Hive hierarchy of partitions. And this is also basically kind of a metadata operation. Right? You create a file that is actually directing you to other files.
So what we do is that if you are using those tools and you want to use lakeFS underneath, we have a utility that lets you export those symlinks so that, as far as Hive is concerned, it looks just like a regular table pointing to the underlying storage for lakeFS. So we'll just create the proper symlinks to point to whatever branch or commit you're looking at, which just allows you a quick way of integrating without going through lakeFS.
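As a rough illustration of that export, the symlink mechanism boils down to writing small text manifests whose lines are the underlying storage URIs for the branch or commit being exposed; Hive's SymlinkTextInputFormat then reads those manifests as the table data. The bucket names, object IDs, and layout below are hypothetical:

```python
import os

# Hypothetical mapping from a table partition to the physical objects that
# back it for one specific branch or commit.
data_files = {
    "dt=2020-11-01": [
        "s3://underlying-bucket/my-data-lake/<physical-object-id-1>",
        "s3://underlying-bucket/my-data-lake/<physical-object-id-2>",
    ],
}

for partition, uris in data_files.items():
    manifest_path = f"/tmp/symlinks/events/{partition}/symlink.txt"
    os.makedirs(os.path.dirname(manifest_path), exist_ok=True)
    # Each manifest line points at an object belonging to the exported
    # branch or commit; Hive reads these directly and never talks to lakeFS.
    with open(manifest_path, "w") as f:
        f.write("\n".join(uris) + "\n")
```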
[00:25:47] Unknown:
And as you're adding branches and populating that branch with new data, are there any potential challenges that you face in being able to update the table representation in something like Hive, and then having to go and revert that in lakeFS and remove that information from Hive, and being able to reflect those changes in the ways that the data are being used?
[00:26:10] Unknown:
Yeah. So the way we see it is that once your storage layer becomes aware of branches, which is basically what lakeFS does, you can also take that awareness to higher levels in the stack, for example, into Hive as well. So let's imagine that we have a specific table inside Hive. Up until now, it's basically represented whatever data you have in production for that table. What you can do with lakeFS is create a table which is basically a copy of the original table, but it's pointing at a specific branch. And this way, I can still use, like, the Hive API, but now it's suddenly kind of branch aware because I created a snapshot of the same table, but for a specific branch.
This gives you all the power of being able to use branches, applied to systems that don't directly talk to lakeFS, through the Hive API.
[00:26:58] Unknown:
Can you talk a bit more about the workflow for people who are using lakeFS and trying to leverage it for being able to handle these different branching and merging strategies with other tools across the ecosystem?
[00:27:12] Unknown:
Basically, lakeFS tries to be, like, not very opinionated on how it's being used. So we provide the primitives and some good practices that we think are correct. But it's kind of like with Git. If you look at 10 different organizations, you'll find 15 different ways of how to use Git properly. So you can decide on whatever branching model works for you. I guess the simplest way would be to have something which you call release or master or however you wanna call it. That's kind of like, this is my production data. This is my single source of truth for everyone who's consuming data from the lake. And then writers or applications that are generating or ingesting data into the lake would typically want to work in isolation. So they'll create their own branch and then merge it into master once the data is ready.
Another common use case would be kind of, like, experiment branches. So if I have data scientists or analysts that want to kind of work with the data, massage it a little bit, maybe make some changes, they can create a branch simply to get isolation. So I create my own branch. I can do whatever I want to the data. It's completely isolated inside that branch. Then I can decide whether or not those changes look good and I want to merge them, or I just wanna throw this experiment away. And I have the guarantee that once I've created the branch, data is not gonna change, like, under my feet because someone else might be touching it. So if you think about it, like, from the way experiments are managed, where I might have some assumption, I'm going to run a job or a test to see if it's correct, then I'll make a change and run it again. I don't want the data itself to change between those experiments. I want it to remain exactly the same so that I can compare results.
Usually, nowadays, it's done by copying the data to some other location, which I might also forget to remove. From a governance standpoint, that's kind of a nightmare. With lakeFS, you can just define that these branches have a very short lifespan and are deleted no matter what after, I don't know, a certain amount of time. And at least you have the ability to track and see what these branches are and who created them and the metadata that comes with that. Yeah. So as long as you have a system that is able to interact with S3, the way we work right now is that the only change you have to make to your code is basically add a prefix to whatever path you're reading, which denotes the commit or the branch that you're looking at. So say I used to have an existing job on Spark, for example, that reads data from /collections/events.
All I have to do right now to use the branching abilities of lakeFS is change it to /master/collections/events. From there on, I'm using whatever branch I decided to work on. Externally, I can use a simple command line or an API to create my own branches or to merge them, but the job itself and, like, the business logic itself doesn't have to change in order to support this new model. This is true for Spark. This is true for Presto. Pretty much any application that knows how to communicate with S3 or an object store.
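Here is a minimal PySpark sketch of that change; the repository, branch, paths, and column name are hypothetical, and it assumes the s3a client is configured with the lakeFS endpoint and credentials:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("branch-aware-job").getOrCreate()

# Before lakeFS: the job read straight from the shared production path.
# df = spark.read.parquet("s3a://my-data-lake/collections/events")

# With lakeFS: the only change to the business logic is the branch prefix.
# (Assumes s3a is pointed at the lakeFS endpoint; names are placeholders.)
df = spark.read.parquet("s3a://my-data-lake/master/collections/events")

df.groupBy("event_type").count().show()  # the job itself is unchanged
```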
[00:30:00] Unknown:
Today's episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning-based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between data engineering, operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. And if you start a trial and install Datadog's agent, they'll send you a free t-shirt. Another interesting aspect of the overall strategy of being able to branch and merge data, and some of the potential issues, is when you have long running branches of data and you potentially have a lot of divergence between the main branch, where people are committing new objects and new data on a daily basis, and you're maybe running a long running experiment where you need to iterate a lot based on a previous snapshot.
I'm curious what the potential edge cases are or some of the issues that users should be aware of in terms of being able to reconcile those branches back together or maybe
[00:31:17] Unknown:
rebase a experiment branch based on the current master data or what types of capabilities are available for that? Yeah. And, actually, here, the practices would be pretty similar to what we already have with Git nowadays. So even if you create, like, a feature branch inside your own development environment, the repository itself is bound to change. Right? So the longer you keep your branch isolated from other branches, you get the benefit of isolation, but also the downsides of it, which is that the world around you has changed and you have no idea. So the recommendation would be if you have a long running branch, you should probably merge whatever source of truth you have. That's called master for now.
Merge that back into your branch periodically, just to make sure that whatever conflicts are bound to occur, if there are any, you wanna be able to address them quickly and, like, as soon as possible, and not build on top of something that might have to be rewritten later. And this is true for pretty much any versioning system, and data is no exception. 1 thing to mention is that data and code are slightly different, like, in the way that they're being updated, in that the files themselves will probably never be updated when you're talking about data on S3. They might be overwritten. So, like, there is no point in trying to merge 2 versions of the same object. And, also, the meaning of someone changing a file that already exists, when you talk about data systems, would typically be 2 different jobs that ran in 2 different branches and wrote into the same location.
And this is something you probably wanna address because, like, maybe the way your application is laid out is not correct.
[00:32:48] Unknown:
For the cases where you do have potential conflicts, where maybe somebody overwrote the same file name in their branch and then they want to merge it back into the main repository, what safeguards do you have in place to be able to identify and alert on those potential conflicts?
[00:33:07] Unknown:
The way we look at it is that lakeFS works at the object level, and we're agnostic to whatever the object itself might contain. So, like, there's no scenario where we'll try to merge changes across 2 different versions of an object. The way we see it is that if there is a conflict, it means 2 different branches tried to write the same data. This is something that you have to resolve yourself. So in this case, if you have a merge operation in which a conflict occurs, it will stop. As far as you're concerned, there's no, like, intermediate state that has to be cleaned up. It's as if the operation never occurred, and you have to resolve that conflict manually. Like, you have to decide which of those sources of truth is correct. We won't take a side in, like, trying to decide which 1 you probably want. That would probably mean data loss.
[00:33:50] Unknown:
And another interesting aspect of this overall workflow, which you touched on briefly earlier, is the idea of these premerge hooks where you can run certain quality checks or validation checks before an experiment branch or a new data source is merged into the main repository. So I'm wondering if you can talk a bit about the way that that is implemented and how you execute those checks.
[00:34:15] Unknown:
So the idea behind that is, like, first and foremost, to have some separation between whoever is writing the data and whoever might be consuming it. Right? So right now, there are, like, amazing data quality tools and data monitoring tools. But, like, the biggest hurdle here is that I have 2 choices when I try to use them. Either I let them run after my job is completed and the data is already created, but by the time that they start running, someone else might have already consumed the data. And if there is a problem, they've consumed data that I decided is of bad quality. Or I have to write to some other location, then have the tests run on that, and then I have to copy to whatever final destination it might have, which, again, with S3's eventual consistency, is also kind of hard to manage.
So the idea of, like, having a point in time where you can run those checks before you decide to publish your data to whoever is consuming it, we think that's, like, 1 of the key aspects of lakeFS. So the way it works right now is that before you merge something, you have the ability to look at the diff. So you get a list of objects that have changed, either they were overwritten or have been added to your data lake, and you have the ability to run whatever arbitrary code you want on top of that. Like, we don't have a compute environment where we run those tests for you. It's kind of up to you to make that work. And once you decide that the data looks good from your perspective, you can run the merge operation. 1 of the things on the road map is to have the least amount of friction around that. So integrating with all the existing data quality and governance and, like, security checks that already exist and just allowing them to have that hook to run before data is being published. And, of course, automating the merge of the commit once the tests have passed. Yeah. We wanted it to feel kind of like Jenkins or GitHub Actions in that regard. The way we kind of see it, it's either green and it's mergeable, or it's red and you have to do something about the quality because it doesn't pass whatever checks your organization decided exist.
[00:36:11] Unknown:
For the case where you do have all of that information available, where these are the objects that are added, these are the ones that are changed, do you have the capability for being able to fire some sort of a webhook to an external system, where maybe you want to execute a Dagster job or a Spark run or something like that that uses as input the list of objects that are being modified, and then run some arbitrary logic to reconcile whether or not you want to then execute, and then return back to lakeFS with a confirm or deny?
[00:36:43] Unknown:
Yeah. That's exactly how we plan on implementing it, actually. So we wanna be as agnostic as possible to whatever checks you have. A webhook is pretty much the easiest way to get that working. As long as you can reply with an HTTP response, you can probably integrate whatever checks you have into lakeFS.
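A pre-merge check wired up as a webhook could look roughly like this sketch: a small HTTP service receives the diff of the branch about to be merged and answers with a 2xx status to allow the merge or an error status to block it. The payload shape, field names, route, and port are assumptions for illustration, not the actual lakeFS hook contract.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/pre-merge-check", methods=["POST"])
def pre_merge_check():
    # Hypothetical payload: the diff of the branch about to be merged.
    payload = request.get_json(force=True)
    changed_objects = payload.get("changed_objects", [])

    # Stand-in validation: block the merge if anything landed outside the
    # agreed table layout (a real check might test schema or scan for PII).
    bad = [o for o in changed_objects if not o.startswith("collections/")]
    if bad:
        # Non-2xx response: the merge is rejected and the data stays isolated.
        return jsonify({"ok": False, "violations": bad}), 412
    # 2xx response: the hook passes and the merge can proceed.
    return jsonify({"ok": True}), 200

if __name__ == "__main__":
    app.run(port=8080)
```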
[00:36:59] Unknown:
Another element that you've alluded to is the capability of managing the life cycle of branches and objects with different versions within the lakeFS repository. And so I'm curious if you can talk a bit about the way that you manage that, or things like garbage collection, to avoid having long running branches that aren't being used anymore or the potential for unbounded expansion of the amount of storage that's in S3 that isn't actually providing any real value?
[00:37:29] Unknown:
I guess this is 1 of the key differences, maybe, if you look at Git. Git works at a human scale. So, like, as far as Git is concerned, we're gonna store all versions of our files forever, and that's probably gonna be okay because it's gonna be megabytes, maybe hundreds of megabytes, but usually not much bigger than that, unless, like, you're Microsoft or Google, where you have to solve that problem too. With data, that's not the case. And here, you have an interesting trade off. You want reproducibility. Like, you wanna be able to go back to a point in time on 1 hand, and on the other hand, you don't wanna pay for all the storage that this entails. So the way it works is that we provide an interface that lets you kind of decide on which side of the trade offs you wanna be on for different branches.
So I can decide, for example, that for experiment branches, I'm gonna keep their data around for up to a week or up to a month. But for my master branch, I wanna be able to go back in time up to 6 months, for example. And, of course, this very much depends on the type of data that you're managing and what the trade off is for you in terms of costs, but it's tunable. Like, the more you decide to store historically, the more reproducibility you gain, and vice versa.
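As a hypothetical illustration of that per-branch trade off, a retention policy could be expressed along these lines; the structure and field names are made up for the sketch and are not lakeFS's actual configuration format:

```python
# Hypothetical per-branch retention policy: long history on master for
# reproducibility, short-lived data on experiment branches to control cost.
retention_policy = {
    "master": {"keep_history_days": 180},        # 6 months of time travel
    "experiment/*": {"keep_history_days": 7},    # throwaway branches
    "ingest/*": {"keep_history_days": 30},
}

def days_to_keep(branch: str) -> int:
    """Pick the matching rule (simple prefix match, good enough for a sketch)."""
    for pattern, rule in retention_policy.items():
        prefix = pattern.rstrip("*")
        if branch == pattern or branch.startswith(prefix):
            return rule["keep_history_days"]
    return 0  # no rule: nothing retained beyond the current state

print(days_to_keep("experiment/new-model"))  # -> 7
```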
[00:38:35] Unknown:
This is another byproduct of lakeFS that is very good, because right now, when you run retention over S3 as a shared folder, it's very hard to create those retention scripts in a way that represents business logic or business interests. And when you use lakeFS, you can manage those trade offs that Oz just described very clearly and very easily, and make sure that you have the data that you need, and you don't have anything that you don't need, and hence don't pay for things that are being retained for nothing. And of course, you don't delete production data when you run a retention script, which never happened to anyone, definitely not to us. Right?
[00:39:15] Unknown:
Yeah. And the idea of separating between the logical act of deleting, which is I don't want others to be able to see my data, and the actual data being deleted, meaning I'm no longer paying for it or I can't ever access it, is something pretty powerful. We had jobs inside SimilarWeb that had to do with data retention. Like, there was a common joke that whenever you decided to change them, you would first sweat a lot and then pray a little bit and only then click deploy for those jobs. Because if you mess something up, if there is a bug there, it's going to be very hard to recover from. And this has happened. Petabytes of data were deleted by mistake. Like, we had object-level versioning on our S3 bucket. So in theory, it was supposed to be pretty easy to recover.
But just to understand, like, all the different objects that were actually deleted and what was the previous version that you have to go back to, and then issuing millions of API calls to Amazon, which, of course, enjoyed this very much because it costs a lot, was painful. So people would always say, I'd rather not ever delete my data because there's no way back. So, actually, you might end up spending less just because you now have some room for error to do that and actually test that code before you run it. So you're less risk averse around that.
[00:40:28] Unknown:
Calling out the fact that the different API calls to S3 can become expensive is also worth digging into a bit, where, because you have this separate metadata layer and the separate service that has the full catalog of all the objects that are present in your lake, that can potentially save a lot of money in terms of being able to just run scans of what data is in the lake without actually having to retrieve it. So I don't know if you can talk through some of the ways that that influences or changes the ways that people think about working with their data lake.
[00:41:00] Unknown:
Sure. So, basically, there are 2 main elements here. The first 1 is that right now, when people talk about isolation, they mean I'm going to copy some data to another location, because I just don't have any other way to get isolation on S3. It provides no primitives to do anything smarter than that. With lakeFS, when you create a new branch, you get an isolated environment without copying any data. So it's a very cheap operation. And this is all because we separate the metadata, like the pointers to the data, from the data itself. So you can have the same files, just many different pointers to them. The second part of that equation is basically the ability to do deduping.
So even if you do copy it and you have the same data in several different locations inside your data lake, you're only going to pay for that storage once, because there is only 1 object inside S3. So being able to separate metadata from the data itself is a very powerful capability there.
[00:41:51] Unknown:
And in terms of the ways that you've seen people using lakeFS, what are some of the most interesting or unexpected or innovative ways that you've seen it employed?
[00:42:01] Unknown:
We were approached by a company called Immunai. They're actually an amazing company because they work on early diagnosis of cancer. And they have biologists and labs and, you know, all this real actual science going on. And they have research people that are sharing files in very strange formats that can't really be managed in any file sharing systems that exist today. So they manage this data over Google Cloud Storage, actually. And because they are changing those files, sometimes manually, sometimes automatically from systems, and they are collaborating around them, sometimes by sending them by email,
their engineering team thought they could provide them with a versioning system. They actually use lakeFS as the basic versioning layer and built on top of the lakeFS UI, which is really cool. So they enriched the UI in a way that helps those researchers manage their files easily. So we were really surprised and happy with this use case. Just to be involved with a company that does something so amazing is great. An innovative way of thinking about their operational problems, and the guts to take a completely new technology and just have it work for them.
And they now have a better version of a really beautiful system.
[00:43:22] Unknown:
And as far as your own experience of building lakeFS and working with people who have been testing it out, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:43:34] Unknown:
So as Oz mentioned earlier, we were surprised that people would rather integrate a library than use a middleware.
[00:43:41] Unknown:
Yeah. I think for me, maybe something else is pretty interesting. Like, when we talk about a data lake, what I picture in my head is, like, a single layer of storage, something like S3 or GCS or something like that, and then, like, a very large number of different applications doing analytics and machine learning and very many different tools. Like, that's kind of the premise of a data lake. You can bring your own tools and work with the same data. So, like, we expected the average organization to have some, like, very big number of those different applications. But it seems that, like, in most of the organizations, at least from our perspective, the ones that we've talked to, there's usually, like, 1 or 2 main applications that are, like, 80 or 90% of the stack, and then many small, like, satellite ones.
People tend to build their stack around, like, 1 or 2 major building blocks. The operational overhead is very, very big if you do anything else.
[00:44:28] Unknown:
And what are the cases where lakeFS is the wrong choice?
[00:44:32] Unknown:
I think in the use case where you're interested in getting mutability within the object storage environment, which is immutable. For example, supporting upserts, deletes, and so on. lakeFS would not be helpful there. And, of course, the formats such as Hudi and Delta Lake would be extremely useful. You should mind performance, though, but this is their use case. While lakeFS works very well with them to provide the versioning and the Git-like capabilities over the entire lake, we do not provide any assistance when it comes to mutability.
[00:45:08] Unknown:
And as you continue to work on lakeFS and work on building out your overall platform with Treeverse, what are some of the new capabilities that you have planned for the future of the lakeFS project or new projects that you have in mind?
[00:45:25] Unknown:
Our focus right now is on 3 main things. The first 1 is getting that development life cycle approach to be, like, as seamless as possible. So as we mentioned, like, advanced CI and CD capabilities, being able to support not just webhooks, but also integrate, like, with existing products that do data quality or schema evolution, to make that as seamless as possible. That would be 1 thing. The second thing is the way the system is being integrated inside organizations. So right now, we have the S3-compatible API. But as we mentioned, we want to be able to also have deeper integration into tools like Spark and Presto and, like, all the common compute utilities that are out there, which should also reduce, like, the overhead of installing and maintaining lakeFS.
And the third thing would be around, I guess, scalability. Postgres can take you pretty far, but in some cases, like, the metadata itself is so big and changes so often that maybe other approaches might be required. So, like, for a few of the people we talk to, where the data lake is extremely huge and, like, very volatile, we want to also be able to support them.
[00:46:34] Unknown:
Are there any other aspects of the work that you're doing on lakeFS, or the use cases that you're enabling for people building and maintaining data lakes, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:46:46] Unknown:
So 2 things. We are currently working with design partners who are innovative and interested in joining this journey of lakeFS. And, of course, we will be happy to have more companies who are interested in the use cases we described and would like to take part in the journey. And the second thing, very important, we are looking for an advocate or a developer relations person, preferably in California, with a lot of background in data engineering, enthusiasm regarding blogs, conferences, meetups... Podcasts. Podcasts. Right? To help us build this community around lakeFS.
[00:47:23] Unknown:
For anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:47:39] Unknown:
We talked a little about the DevOps effort that is required everywhere when a data engineer needs to work. So sometimes 70% of the time, data engineers would find themselves either waiting or testing their own capabilities as DevOps engineers in order to make sure their systems are working and they're capable of progressing. So I think any contribution around allowing faster deployment of big data tools and big data environments in a smooth way would go a long way.
[00:48:12] Unknown:
Yeah. I think even to do something pretty simple on top of a data lake, you have so many different moving parts. So, like, you have your storage there, and then you have a few compute engines and maybe some way of orchestrating them, maybe doing scheduling and queuing and streaming. And, like, if you try to stitch all those together, there's a lot of effort involved. And it's still not easy. It's not trivial to get this working repeatedly. So, yeah, we're seeing data engineers spend a lot of their time on, like, how to deploy stuff on Kubernetes or how to set up Airflow in a production-resilient way. And, like, I would assume that a data engineer should probably not be spending all their time doing that. So I think there's still a big gap in the market in trying to get this right.
[00:48:54] Unknown:
Well, thank you both for taking the time today to join me and share the work that you've been doing on lakeFS. It's definitely a very interesting project and 1 that I think is going to provide a lot of value and a lot of flexibility for teams who are working on data lakes and trying to provide the flexibility for experimentation while being able to maintain the overall safety of their production data. So thank you both for all of your time and effort on that front, and I hope you enjoy the rest of your day. Thank you for having us. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Einat Orr and Oz Katz
Challenges in Data Management
Overview of LakeFS
Comparison with Other Tools
Setting Up LakeFS
Versioning and Collaboration
Implementation and Design Choices
Evolution and Assumptions
Workflow and Integration
Pre-Merge Hooks and Quality Checks
Lifecycle Management and Garbage Collection
Interesting Use Cases
Future Plans and Focus Areas
Call for Collaboration and Job Opening
Biggest Gaps in Data Management
Closing Remarks