Summary
Building and maintaining a data lake is a choose-your-own-adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allows for generating significant value, but it can also lead to anti-patterns and inconsistent quality in your analytics. Delta Lake is an open source, opinionated framework built on top of Spark for interacting with and maintaining data lake platforms that incorporates the lessons learned at Databricks from countless customer use cases. In this episode Michael Armbrust, the lead architect of Delta Lake, explains how the project is designed, how you can use it for building a maintainable data lake, and some useful patterns for progressively refining the data in your lake. This conversation was useful for getting a better idea of the challenges that exist in large scale data analytics, and the current state of the tradeoffs between data lakes and data warehouses in the cloud.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Michael Armbrust about Delta Lake, an open source storage layer that brings ACID transactions to Apache Spark and big data workloads.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Delta Lake is and the motivation for creating it?
- What are some of the common antipatterns in data lake implementations and how does Delta Lake address them?
- What are the benefits of a data lake over a data warehouse?
- How has that equation changed in recent years with the availability of modern cloud data warehouses?
- How is Delta Lake implemented and how has the design evolved since you first began working on it?
- What assumptions did you have going into the project and how have they been challenged as it has gained users?
- One of the compelling features is the option for enforcing data quality constraints. Can you talk through how those are defined and tested?
- In your experience, how do you manage schema evolution when working with large volumes of data? (e.g. rewriting all of the old files, or just eliding the missing columns/populating default values, etc.)
- Can you talk through how Delta Lake manages transactionality and data ownership? (e.g. what if you have other services interacting with the data store)
- Are there limits in terms of the volume of data that can be managed within a single transaction?
- How does unifying the interface for Spark to interact with batch and streaming data sets simplify the workflow for an end user?
- The Lambda architecture was popular in the early days of Hadoop but seems to have fallen out of favor. How does this unified interface resolve the shortcomings and complexities of that approach?
- What have been the most difficult/complex/challenging aspects of building Delta Lake?
- How is the data versioning in Delta Lake implemented?
- By keeping a copy of all iterations of a data set there is the potential for a great deal of additional cost. What are some options for mitigating that impact, either in Delta Lake itself or as a separate mechanism or process?
- What are the reasons for standardizing on Parquet as the storage format?
- What are some of the cases where that has led to greater complications?
- In addition to the transactionality and data validation that Delta Lake provides, can you also explain how indexing is implemented and highlight the challenges of keeping them up to date?
- When is Delta Lake the wrong choice?
- What problems did you consciously decide not to address?
- What is in store for the future of Delta Lake?
Contact Info
- @michaelarmbrust on Twitter
- marmbrus on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Delta Lake
- Databricks
- Spark SQL
- Microsoft SQL Server
- Databricks Delta
- Spark Summit
- Apache Spark
- Enterprise Data Curation Episode
- Data Lake
- Data Warehouse
- SnowflakeDB
- BigQuery
- Parquet
- Hive Metastore
- Great Expectations
- Optimistic Concurrency/Optimistic Locking
- Presto
- Starburst Labs
- Apache NiFi
- Tensorflow
- Tableau
- Change Data Capture
- Apache Pulsar
- Pravega
- Multi-Version Concurrency Control
- MLflow
- Avro
- ORC
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, speedy SSDs, and a 40 gigabit public network, you get everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and 1 opening in Mumbai at the end of the year. And for your machine learning workloads, they just announced dedicated CPU instances where you get to take advantage of their blazing fast compute units.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers for engineers. Clubhouse lets you craft a workflow that fits your style, including per team tasks, cross project epics, a large suite of prebuilt integrations, and a simple API for crafting your own. With such an intuitive tool, it's easy to make sure that everyone in the business is on the same page. And Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial.
Support the show and get your data projects in order. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. And coming up this fall are the combined events of Graphorum and the Data Architecture Summit in Chicago. The agendas have been announced, and super early bird registration is available until July 26th for up to $300 off, or you can get the early bird pricing until August 30th for $200 off your ticket.
Use the code BNLLC to get an additional 10% off any pass when you register, and go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register for this and other events. And you can go to dataengineeringpodcast.com to subscribe to the show. Sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey. And today, I'm interviewing Michael Armbrust about Delta Lake, an open source storage layer that brings ACID transactions to Apache Spark and Big Data workloads. So, Michael, could you start by introducing yourself?
[00:03:02] Unknown:
Yeah. Hi. My name is Michael Armbrust. I'm an engineer at Databricks, and I'm also the original creator of Apache Spark SQL,
[00:03:10] Unknown:
structured streaming, and now I'm the tech lead for the Delta Lake project. And do you remember how you first got introduced to the area of data management?
[00:03:17] Unknown:
You know, that's a good question, but I think the moment when the decision was made was probably back when I started my first internship at Microsoft. I remember, you know, I was a college undergrad, excited to have this internship, and I was given a choice between SQL Server and the C Sharp runtime. And I had no idea what to pick and I just kind of picked randomly. And ever since then, data management has followed me. When I went to UC Berkeley, I tried to do security, I tried to do operating systems, but I just couldn't resist the siren song of the database.
[00:03:53] Unknown:
So a flip of the coin just dictated the rest of your career. That's funny. And so as you mentioned, now you've been involved in architecting and overseeing
[00:04:05] Unknown:
the work on Databricks Delta, which has now become Delta Lake. So can you start by giving an overview about what it is that the Delta Lake project is and the motivation for creating it? Yeah. So Delta Lake is a transactional storage layer designed to integrate deeply with Apache Spark and also to take advantage of the cloud. And, of course, it also works on prem. And what it does is it brings full ACID semantics, you know, which means that when you perform large scale Spark jobs, we're guaranteeing that what happens, happens either completely or not at all. We also guarantee if there's multiple people working at the same time that they'll be isolated from each other and always see a consistent view of what's happening. And we do that while still maintaining the scalability and elasticity properties of these underlying systems. And, you know, as we've expanded the project, it's actually moved a lot into additional features for incrementally improving data quality.
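As a concrete illustration of that write path, here is a minimal PySpark sketch of creating and reading a Delta table; the path and column names are made up for illustration, and it assumes a Spark build with the open source delta-core package available.

```python
# Minimal sketch, assuming Spark is launched with the delta-core package
# (e.g. --packages io.delta:delta-core_2.12:<version>). Path and columns
# are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-quickstart").getOrCreate()

events = spark.createDataFrame(
    [(1, "login"), (2, "purchase")],
    ["user_id", "event_type"],
)

# Each write is committed atomically to the table's transaction log:
# readers see all of these rows or none of them, never a partial write.
events.write.format("delta").mode("append").save("/tmp/events")

# Concurrent readers always see a consistent snapshot of the table.
spark.read.format("delta").load("/tmp/events").show()
```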
We first started it actually in collaboration with this team at Apple. It started as a conversation at a Spark Summit, this big conference that we have for the open source Apache Spark project, where I was just hanging around and this engineer came up to me. He runs the infosec team at Apple. And basically, his problem was he has network monitors all over Apple's corporate network that ingest, you know, basically every single TCP connection, every DHCP connection, everything that goes on in the networks. We're talking trillions of records per day and petabytes of data. And he wanted to build a system with Apache Spark that could archive this so they could use it for detection and response. And my response was, whoa, it's gonna tip over, but I think we could solve it. So, you know, we kinda collaborated and built it, and then it became a much more general tool after that. That's funny that
[00:05:45] Unknown:
yet again, an accidental happenstance helped dictate your future work, and something that has ultimately ended up providing a lot of value to people outside of both Databricks and Apple.
[00:05:58] Unknown:
Yeah. It was really serendipitous. You know, now I think it was at this, like, customer advisory board, which normally kinda sounds like a boring thing, but now that is my favorite meeting of Spark Summit because you get to talk to so many interesting people about their problems. And finding those problems, I think, is what really allows you to build cool software. Yeah. It's easy to sort of go off into the vacuum of thinking that you know what the best solution is for any given problem or understanding even what the problems are that somebody might run into.
[00:06:29] Unknown:
But, ultimately, all of that is worthless unless somebody actually uses it. And so being able to get that firsthand feedback of, I've been trying to use this, and these are the pain points I'm running into, helps you understand where you can provide engineering that will actually end up being useful to somebody versus just wasting time on something that nobody ultimately cares about. That's definitely great to be able to have that feedback loop built into the community. Yeah. Definitely. And so Delta Lake is targeted at addressing some of the issues in data lake implementations and some of the modern trends in terms of how data warehouses are used and data lakes and just trying to provide some level of sanity to handling some of these workloads that have large volumes of data as you mentioned.
And I'm wondering before we get too much into the Delta Lake implementation itself, what you have seen as being some of the common anti patterns and edge cases and difficulties
[00:07:30] Unknown:
are in the data lake space that you're working to address with Delta Lake. Yeah. So I think the biggest anti-pattern that I've seen in the data lake space is people believing that all you need to do is collect the data and dump it into a data lake, and it'll be ready for consumption by end users, whether that's machine learning or reporting or business intelligence or whatever. I think the ability to collect anything without thinking about it is really powerful in a data lake, but you have to plan to do this kind of whole cleaning step to figure out what the data is, what's wrong with it, what's missing, what needs to be augmented. And basically what I've seen is when you get to that step, this kind of cleansing, refining process, I have seen so many one-off solutions that try to do this in a correct way, but it turns out it's very hard to get transactional semantics and exactly-once processing right. And so trying to do that for each application, I think, is just setting yourself up for failure. Yeah. I had a great conversation
[00:08:28] Unknown:
a while ago about the overall concept of data curation and how data lakes provide the ability to have an easy way of gathering data, but that, ultimately, it's not useful until you've had somebody test out what they can actually do with it and then use that to inform ways to clean and augment and enrich the data and then ultimately land it in a data warehouse potentially. And I'm wondering what your thoughts are on the benefits and trade offs of data lakes versus data warehouses, and whether data warehouses in general are a
[00:09:04] Unknown:
necessary sort of technology and architectural paradigm in the current landscape of data capabilities? Yeah. So I think the biggest benefit of a data lake over a data warehouse is the cost, the scale, and the effort that it takes to ingest new data. So, you know, data lakes are much cheaper. You can store huge amounts of data in them without doing anything. But really, I think the key thing here is the effort that it takes to collect data. By bringing that cost to ingest a new data source down to almost 0, you allow people to capture everything, and you can then later figure out what's actually useful. With a traditional data warehouse, you kind of always have to start by creating the table, defining its schema.
And, you know, while that's useful and you need to do that eventually, you don't always want that to be the first step. I think it's actually great to start with these raw data sources and then later figure out how to clean them and make them useful for insight. Because sometimes you don't realize the value of a piece of data until years later, when you can actually join it with some other piece of data and get insight from it. So that's why I think a data lake is better than a data warehouse. You know, today data warehouses have some advantages: concurrency and performance. But, you know, I think in a couple of years, we'll see if that is still the case. And I know that recent additions to the data warehouse market are leveraging
[00:10:22] Unknown:
cloud native technologies, so things like Snowflake or BigQuery have tried to address some of the trade-offs of data lake versus data warehouse by letting you land raw data and then do some of the transformations within the data warehouse itself. And I'm wondering what your thoughts are in terms of how that changes the equation as far as what technology stack to use and when. Yeah. So I think the biggest
[00:10:48] Unknown:
benefit of the cloud is it makes it possible to do all of these things without any upfront cost. It's basically the elasticity and the fact that somebody else is doing that elasticity for you. So you both don't have to build out a data center to get started, but you also don't have to be experienced enough to build out a data center. You can kind of pay as you go, get started collecting data, cleaning it, using these tools on top of it. And then, you know, as you grow, the system will kind of seamlessly scale with you. And so I think really that kind of fundamental change in the barrier to entry is why the cloud and a lot of these more modern technologies are really exciting.
[00:11:24] Unknown:
And so for Delta Lake itself, I know that you said that the original idea came about with this conversation with the Apple engineer trying to capture all of their packet traces and analyze them. And I'm wondering how Delta Lake itself has been implemented and how the overall design has evolved once you began working on it and getting other customers involved in testing it and using it for their own various workloads?
[00:11:51] Unknown:
Mhmm. Yeah. So Delta Lake started as just a scalable transaction log. So it was a collection of parquet files out in your storage system, and then this record of metadata alongside that allowed us to build atomicity, but also scalability of the entire metadata plane. Like, the kind of cute trick that Delta does is the transaction log itself is actually written in a way so it can also be processed by Spark. So when you have metadata that covers, you know, thousands to millions of files, that's actually becoming a data problem in itself. And so by using Spark here, we were able to very quickly build a system that had the same scalability properties of Spark, but, you know, gave us these nice ACID semantics that people were, you know, used to having. So that was the beginning and that was kind of the core thing. We built that first, but the way it's evolved is we realized that that's only 1 step.
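To make the "log as data" idea concrete, here is a small sketch of peeking at a Delta table's transaction log with Spark itself; the table path is hypothetical, and in normal use you would never need to read the log directly.

```python
# The log lives next to the data as JSON files under _delta_log/. Each line
# is an action (add a file, remove a file, update table metadata, ...), so
# Spark can read and aggregate it like any other dataset. Illustrative only;
# assumes the SparkSession `spark` from the earlier sketch.
log = spark.read.json("/tmp/events/_delta_log/*.json")

# Files added to the table, with their sizes, plus the table schema action.
log.select("add.path", "add.size", "metaData.schemaString") \
   .where("add IS NOT NULL OR metaData IS NOT NULL") \
   .show(truncate=False)
```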
And once you have this powerful underlying ACID primitive, you can do so many cool things on top that just weren't possible before. So you can build the whole suite of DML operations that people expect from a traditional database, so update, delete, merge into, and you can build streaming. And then now as we've continued to evolve, we've started to even get into the ideas of expressing quality constraints on the data as well. So it really, you know, it started with just this idea about scalable transactions
[00:13:13] Unknown:
and it became an entire system for managing and improving the quality of data in your data lake. And as far as the ideas that you had going into it and the assumptions about what the needs were and the ways that Delta Lake would ultimately end up operating and being used, I'm wondering what your thoughts were at the outset and how those assumptions and ideas have been challenged and updated as it has been used and expanded in terms of the scope and capabilities?
[00:13:42] Unknown:
Yeah. So, you know, thinking back, when we were first designing Delta, we looked a lot at the systems that people were using today to solve these problems. And, you know, 1 of them in particular is the Hive metastore, where people store all of the metadata today about the data that's in their data lake. And, you know, I think 1 of the initial assumptions I had was that if you built a system that managed its own metadata in a much more scalable way, you could actually just completely dump the metastore in one fell swoop. But, you know, I think what that was missing was the fact that data discovery is actually its own challenge that needs solutions as well. So as the project evolved, we actually ended up building full support for storing pointers to your data still in the metastore so that people could discover it and comment on it and other stuff there. And I actually think that that's a place that, as the project continues to grow, we're going to see a lot more investment. It's not just about schema enforcement and expectations and transactions.
[00:14:38] Unknown:
It's also about making the data available to people in a place where they can find it and understand it as well. As you mentioned, 1 of the features that you ended up adding in more recently is the idea of enforcing these quality constraints on the data coming into the storage layer. And I know that that is something that is often either overlooked or undervalued when dealing with, quote, unquote, big data or working with data lakes. And so I'm wondering how the definitions of these constraints are established and validated and just the overall workflow of building these constraints, building the validation tests, and what happens when a record comes in that doesn't fit the expectations and how you can address those out of band so that you don't just completely either throw away the data and lose it or stop all processing
[00:15:36] Unknown:
and block everything else that's trying to come through. Yes. You said exactly the word that I like to use when I think about data quality within a data lake, and it's expectations. So that's actually the name of a feature that we've been working on for a while and that we will be open sourcing this quarter. You know, when you think about an expectation, an expectation is a way for a user to tell us about their domain's definition of quality. So you can say things like, for this table, I expect this column to be a VIN for a car. So it needs to start with a letter and have this many numbers and it must be present, or whatever it is for your particular domain. Expectations are similar to this concept of invariants in a traditional database, but I think something that rigid just doesn't work in the big data space. An invariant would just reject anything that doesn't fit into, you know, the predicate that the user has given.
And that works, and we, you know, we also support that mode as well. But I think in a big data system, if every time you see something unexpected, you stop processing and make a user intervene, you're never gonna get anything done. And so what expectations bring to the table is the ability to tune the severity when an expectation is violated. So you can fail-stop, like abort any transaction that violates this expectation, for something very important, or if this is a table where, you know, basically it's getting towards the end of the quality journey and you don't want to allow any bad data into it. Accounting is going to read this table directly, for example. But you can also tune them down. So you can say things like, I want to just alert when the number of failed expectations goes over some threshold, or I only want to fail the transaction if it's above some threshold.
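The expectations API itself had not been open sourced at the time of this conversation, so the following is only a rough PySpark sketch of the tunable-severity idea described here, with a made-up VIN-style rule, thresholds, and paths.

```python
# Illustrative only: a hand-rolled version of "expectations" severity,
# not the actual API. Column name, rule, thresholds, and paths are made up.
from pyspark.sql import functions as F

batch = spark.read.format("delta").load("/tmp/raw_vehicles")

meets_expectation = F.col("vin").rlike("^[A-Z][A-Z0-9]{16}$")
bad_count = batch.filter(~meets_expectation).count()

if bad_count > 1000:
    # Fail-stop severity: abort the whole write if too much data is bad.
    raise ValueError(f"{bad_count} records violated the VIN expectation")
elif bad_count > 0:
    # Alert severity: record the violation but keep going.
    print(f"warning: {bad_count} records violated the VIN expectation")

batch.filter(meets_expectation).write.format("delta") \
     .mode("append").save("/tmp/clean_vehicles")
```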
And I think moving forward, we'll have even more powerful, you know, functionality here where you can also quarantine the data. So you can basically say, I don't want to let data that does not meet my expectations go any further in my pipeline, but I don't want to lose it either. So I'd like to redirect it to some other table where a human can come in on their own schedule, figure out what's wrong with it and, you know, remediate the situation. And expectations are deeply built into the system, so we can actually kind of give you insights into: this was the record that came in, here's how you processed it, here's the expectation that it violated. And so it basically brings debuggability to your data pipelines when it comes to quality. And another thing that factors into the overall idea of data quality
[00:17:59] Unknown:
and particularly in the long term, is how you manage schema evolution, particularly with these large volumes of data and the potential for different varieties and levels of consistency based on whatever the source might be. And so I'd like to get your perspective on how you approach the idea of schema evolution in a data lake context, whether you periodically go through and reprocess all of the old records and either update column definitions or remove them or just elide the fact that they exist within the metadata
[00:18:37] Unknown:
representation of them, and just your overall thoughts on maintaining consistency of a schema over, you know, potentially terabytes or petabytes of data. Yeah. So I'm not sure there's 1 right answer here. And Delta actually supports a variety of different ways to think about schema evolution. We kind of automatically support safe schema evolution, where you can add columns with data types that, you know, don't conflict with anything that's already there. You can also rewrite all of the data, or we also support kind of reprocessing. But to me, you know, when I think about this problem and how I manage my own internal pipelines that I run for Databricks' managed platform, I like to think back to some rules that I learned when I was at Google on how they evolve schemas, because they really think about these things as contracts between people. And so I generally have a rule that once a column has existed and other people have consumed this data, it's often actually a bad idea to rename that column or to reuse that name for something that has different semantics.
It's much better to create a new one. You know, if you find something wrong with it, hey, this column is wrong in this weird way and actually wasn't calculating what we want, a better option often is to deprecate that column, create a new one with a clear name, and then switch people over to that. Because if you change stuff out from under people who are basically the end consumers of your data, that can have other unexpected consequences.
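Here is a small sketch of the "safe" evolution Delta supports out of the box, appending a batch that introduces a new column without rewriting old files; the column names and path are hypothetical and reuse the earlier example table.

```python
# Additive, compatible schema change: a new "region" column appears in the
# incoming batch. Existing files are untouched; old rows read back as NULL
# for the new column. Illustrative names and path.
new_batch = spark.createDataFrame(
    [(3, "refund", "US")],
    ["user_id", "event_type", "region"],
)

(new_batch.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # opt in to the additive schema change
    .save("/tmp/events"))

spark.read.format("delta").load("/tmp/events").printSchema()
```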
[00:20:00] Unknown:
And another problem too is that if you have ETL jobs that are populating the data lake or that are working on the raw data that gets landed into the data lake, managing version updates when they're trying to process historical records is a complicated issue. And as you said, there isn't 1 universal solution, but it's definitely something that bears a lot of consideration and merits a good amount of upfront planning as you're first implementing some of these systems so that you don't end up boxing yourself into a suboptimal
[00:20:34] Unknown:
situation where you don't have any way to work your way out of it. Definitely. And that's why I think we, out of the box, support what I consider to be the safe way of doing it, where you certainly can reprocess, you certainly can add new additional information, but you generally don't want to go and change semantics. I find that that can confuse people, and, you know, you don't even know who is relying on those semantics.
[00:20:56] Unknown:
And so the primary purpose of Delta and Delta Lake, as you said going into it, was the idea of transactionality and managing that across these large volumes of data, potentially with multiple users. And so I'd like to get an understanding of how you ended up implementing that, and particularly when you have issues or conflicts in terms of the data ownership, where Delta Lake is responsible for processing the data and adding transactionality on top of it, but you might have some other system that's either writing in new data or consuming data that Delta Lake is working with or potentially even manipulating it in some way. Yeah. So let me start by describing
[00:21:38] Unknown:
transactionality in Delta. So basically, you know, as I said before, Delta is a whole bunch of parquet files stored in your storage system of choice plus a transaction log. And what the transaction log has in it is files that are, you know, each an atomic unit, so a transaction. And each transaction has a set of actions that are taken against the table. So an action could be: I'm going to change the schema of this table, or I'm going to add this file to this table, or I'm going to remove this file from this table. And we also have some other kinds of things for upgradability and for idempotence of transactions. But really, you can think about it as this relatively simple list of things that you're going to do to the table. And when you get to the end of those actions, you now have the current state of the table. So people modify a Delta table by adding a new transaction to it.
When we have multiple readers and writers, basically we only need 1 property from the underlying storage system, and that's mutual exclusion. So if 2 people try to create the same file for the same version of the table, so I'm, you know, I'm at version 0 of the table and 2 people try to create version 1. The thing that we need from the underlying storage system is it needs to allow 1 of them to succeed and tell the other 1, no, I'm sorry, this file already exists. You cannot have 2 writers that both think they succeeded. And then when there's a conflict like this, we basically use standard optimistic concurrency.
So when 2 people are modifying the table, when you begin your transaction, you start by recording the version that your transaction began at. And then when you go to commit, you see what versions exist, and you check to see how the things that happened since you started have changed the table. And in many cases, they haven't changed the table in any interesting way, so you can just ignore them and go ahead and commit anyway. But in some cases, maybe you actually conflicted with them, and then you need to retry your operation. So let me give an example here. A very common use case is 2 streams writing into the same table. Well, they both read the schema of the table and then they write files in, and that's it. But since they only read the schema of the table and since the schema isn't changing, it doesn't matter what order they commit in; you'll get the same result either way. And so Delta will just automatically retry until it succeeds.
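As a toy illustration of that commit protocol (not Delta's actual code), the only primitive the storage layer has to provide is "create this file only if it does not already exist"; everything else is optimistic retry.

```python
# Toy sketch of optimistic commits over a mutually exclusive file create.
# This illustrates the idea, not Delta Lake's implementation.
import json
import os

def try_commit(table_dir: str, version: int, actions: list) -> bool:
    path = os.path.join(table_dir, f"{version:020d}.json")
    try:
        # Mode "x" fails if the file already exists, so exactly one writer
        # can claim a given version number.
        with open(path, "x") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")
        return True
    except FileExistsError:
        return False

def commit(table_dir: str, read_version: int, actions: list) -> int:
    version = read_version + 1
    while not try_commit(table_dir, version, actions):
        # Someone else won this version. Check their actions for a real
        # conflict (omitted here), then optimistically retry the next one.
        version += 1
    return version
```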
So that's kind of the underlying transactional core. Now the question is, how do you use this with other systems? And so, yeah, I think answer number 1 here is this is why we're so excited about open sourcing Delta Lake, because we can actually put the protocol spec out there so anybody can implement it. We've been in talks with, you know, the Presto people over at Starburst, and some of the committers on Apache NiFi, on what it would take to build Delta connectors for these other open source projects. We're particularly excited about that. But even before that happens, you know, I think this is where Delta Lake's integration with Apache Spark is especially powerful.
So I like to think of Spark as the kind of skinny waist of the big data ecosystem. It has connectors to read every file format. It can read from Kafka, Kinesis, Azure Event Hubs, HDFS; like, wherever your data is stored, Spark can read from it and write to it. And so a pretty typical pattern that I see people set up is they use Spark on the periphery to read from all of these sources. They capture the data in some, like, raw tables within their Delta Lake, and then they move it through the data quality steps using Spark and then eventually out to some other system, whether that's machine learning using TensorFlow or reporting with Tableau. But really what they've used Delta and Spark in the middle for is taking this raw data and getting it ready for consumption by downstream consumers.
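A sketch of that periphery pattern with Spark Structured Streaming might look like the following; the Kafka brokers, topic, JSON fields, and table paths are all placeholders.

```python
# Spark on the periphery: land raw Kafka records in a Delta table, then
# refine them into a cleaner table downstream. All names are placeholders.
raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "network_events")
    .load())

(raw.selectExpr("CAST(value AS STRING) AS raw_json", "timestamp")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/raw_events")
    .start("/tmp/raw_events"))

# A second stream reads the raw table and writes a refined one.
refined = (spark.readStream.format("delta").load("/tmp/raw_events")
    .selectExpr("get_json_object(raw_json, '$.src_ip') AS src_ip", "timestamp")
    .filter("src_ip IS NOT NULL"))

(refined.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/refined_events")
    .start("/tmp/refined_events"))
```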
[00:25:15] Unknown:
And so in terms of the transactionality itself, I'm wondering if there are any limits or any sorts of edge cases that come up when you're dealing with particularly large volumes of data that you're trying to encompass within the boundaries of a single transaction, such as if you have maybe a petabyte of data and you don't have any way to capture and process all of it in RAM to be able to then write it all out atomically. I'm just wondering how you approach some of those edge cases or extremes within the idea of transactionality
[00:25:49] Unknown:
on these large volumes of data. Yeah. That's a good question. So, you know, in terms of just the scalability question about Delta, I think there are some fundamental limitations in the number of tasks that Spark can schedule in a single stage. But, you know, even with those, we've successfully scaled a single Spark job to over a petabyte, and that was like 5 years ago. So I think in general, you can process tons and tons of data. You know, Delta Lake of course also keeps information about statistics in memory, and there's nothing fundamental preventing all of that from being pipelined. But really, I think that the real answer to this question here is, if you have a petabyte of data and you want to process the whole thing and do something, and you wanna make sure that you process that entire petabyte, streaming can be a really powerful tool here.
And I, you know, I kind of often have to explain this to people, because when people hear streaming, they usually think, oh, that's a more complicated system that I need to set up in order to get real time semantics. But another really useful thing that streaming does is it can take a petabyte job, break it into much smaller jobs, and then execute each 1 of those while guaranteeing that when you get to the end of the stream, you have exactly-once semantics and you've processed that entire petabyte. So I think, you know, you can kind of shift your thinking, where streaming is actually also a tool for easing the process of dealing with such a massive amount of data. And it's worth digging deeper too into this idea of streaming in the context of these data lakes where
[00:27:21] Unknown:
particularly for processing large batch jobs of historical data, the original approach when Hadoop was fairly new and people were still trying to figure out how to handle these types of systems was the idea of the Lambda architecture. And I've noticed that in recent years, I haven't really heard that terminology come up very often. And so I know that with Delta Lake, you have unified the API as far as being able to process these large batch jobs versus the streaming jobs into a single interface. And so I'm curious how that manifests in terms of the ways that you approach designing and implementing these processing jobs, and also your thoughts on the necessity of the Lambda architecture in general given the current capabilities of our data platforms?
[00:28:13] Unknown:
Yeah. So I guess, why did people start with the Lambda architecture originally? And I think it was basically because there was no way to do streaming and batch on a single set of data while maintaining exactly-once semantics. At the time, we believed that streaming was just too costly, and so the only way to do streaming was to do this approximate processing. And then later you had to do this, like, more expensive, slower computation to get the correct answer. And so what Delta really brings to the table with its transactionality is it allows you to do both streaming and batch with exactly-once semantics on the same set of data and get, you know, the correct answer.
And so now there's no reason to have all of this complexity of maintaining 2 separate systems. And, you know, I think this is actually really important because I think each of these different paradigms still has utility. So, you know, like I was saying before, streaming is not about low latency. It's about incrementalization. It's about splitting the job up, but it's also about not having a schedule, not having to worry about dependencies amongst jobs, and making sure that they run at the right time, and worrying what happens if this job finishes late. It's about figuring out what data is new since you processed last. And it's also about failure management. What happens when your job crashes in the middle and you need to restart it? Streaming kind of takes care of all of those things automatically. It allows you to focus only on the data flow problems of your business logic and not the systems problems of how do I actually run this thing efficiently when data is continually arriving.
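A rough sketch of using streaming purely as an incrementalization tool might look like this: the source option bounds how much of the table each micro-batch touches, and the checkpoint tracks exactly-once progress through the whole table. Paths and the filter are placeholders.

```python
# Streaming as incrementalization: process a huge Delta table in bounded
# chunks, with progress tracked in the checkpoint. Illustrative only.
(spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 1000)   # bound the size of each micro-batch
    .load("/tmp/huge_table")
    .filter("bytes_sent > 0")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/backfill")
    .start("/tmp/filtered_table"))
```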
So that's why I think streaming is very powerful. But in spite of all of this, you know, batch jobs still happen. You may have retention requirements where the business mandates that all data over 2 years old is deleted. You might have GDPR requirements where some user has served you with a DSR and you have 30 days to eliminate them from all of your data lake. Or you might have change data capture coming from an existing operational store. So you have, you know, your point of sale data that's coming in every day. It's super important to your business, but you wanna merge that into an analytical store, where you can run kind of long term longitudinal analysis that you would never wanna run on your actual operational store for performance reasons.
And so bringing these together into a single interface, not requiring you to manage 2 completely separate pipelines, I think drastically simplifies the work of data engineers.
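For the change data capture case, the open source DeltaTable merge API looks roughly like the sketch below; the table paths, the join key, and the `op` column are hypothetical.

```python
# Applying a daily batch of CDC changes to an analytical Delta table.
# Paths, the join key, and the "op" column are illustrative.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/tmp/point_of_sale")
changes = spark.read.format("delta").load("/tmp/daily_changes")

(target.alias("t")
    .merge(changes.alias("c"), "t.order_id = c.order_id")
    .whenMatchedDelete(condition="c.op = 'DELETE'")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```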
[00:30:40] Unknown:
And going a little bit further afield here, I'm wondering what your thoughts are on systems such as Pulsar, which has the ability to natively tier the storage or sort of age out the data that is maintained in this sort of streaming and queuing architecture, or things like Pravega, which provides a stream native storage layer for workloads such as Flink and, I believe, Spark as well, and your thoughts on the capabilities and limitations of those approaches versus what you're doing with managing parquet files in these cloud storage architectures and maintaining a unified interface within Spark for being able to
[00:31:26] Unknown:
work across the stream and batch jobs. Yeah. So I have to be honest, I'm not super familiar with those 2 systems, but I can talk about what we do inside of Delta with respect to this. And, basically, this comes down to the same trick that we're playing before, which is leveraging underlying storage systems and letting them do what they do best while providing nice guarantees on top. And so, you know, Delta Lake is fully compatible with Azure's Glacier storage, or sorry, S3's Glacier storage, and Azure's kind of cheaper version of that storage as well. And so, you know, I think kind of leveraging those systems is very important, because you wanna keep a lot of data, but you only wanna, you know, pay for what you actually need. And it turns out most data you don't need fast access to. And on that idea of managing cost and
[00:32:15] Unknown:
the access to data, particularly in historical contexts, I'm interested in the ways that you approach data versioning within Delta Lake, because I know that that's something that you have also built in and is particularly useful when you're doing these transactional workloads or working with multiple systems that might be interacting with the data. And so I'd like to understand how you approach minimizing the differences in terms of the different versions of the datasets, how those different versions manifest,
[00:32:51] Unknown:
and how the parquet file format helps in any of those efforts. Yeah. So data versioning in Delta Lake is implemented using MVCC, or multi-version concurrency control, which is just a fancy way of saying that when we change something in the table, we make a new copy. There's a couple of reasons for that. 1, it's just simple and it works. It's a great way to do versioning, but also it works really well with eventually consistent storage systems like S3. If you never modify anything, then, you know, there is no such thing as eventual consistency, which is great. It's very easy to reason about the correctness. And so every time we change something, we make a new copy, and then the transaction log is the authoritative source for which files are part of any version of the table. So you can basically look at the transaction log, play it up to the end to see what the table currently is, but you can also stop playing it somewhere in the middle to see what the table looked like at that moment. And this is a really powerful feature because it gives you snapshot isolation. People who have started a job on an older version of the table will just continue to see that version of the table. There's no inconsistencies.
There's no failures because data is being deleted out from underneath you. You just continue to see a consistent snapshot of the table for the duration of your job. But it also allows you to do cool things like time travel. So we've been working a lot with the MLflow guys. MLflow is another open source project that basically allows you to manage the entire life cycle of machine learning, from, like, training and tuning parameters and figuring out what really is the best way to build your model. And when you combine MLflow's tracking with Delta Lake's versioning, you have a really powerful tool for experiment reproducibility.
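In the open source release, those time travel reads look roughly like this; the version number, timestamp, and path are placeholders.

```python
# Reproduce an experiment against the exact snapshot a model was trained on.
# Version, timestamp, and path are illustrative.
current = spark.read.format("delta").load("/tmp/training_data")

as_of_version = (spark.read.format("delta")
    .option("versionAsOf", 412)
    .load("/tmp/training_data"))

as_of_time = (spark.read.format("delta")
    .option("timestampAsOf", "2019-06-01")
    .load("/tmp/training_data"))
```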
So when you tweak parameters in your model, you can actually go and see: was it the data that changed, or was it the changes in my tuning parameters that changed the efficacy of my model? And that's something I think really just wasn't possible before. Another thing that I'm curious about with the decision of standardizing on parquet
[00:34:42] Unknown:
is, I guess, what the primary benefits are that it provides as opposed to ORC or Avro, or, I mean, it's pretty obvious what it provides compared to JSON, but what the decision making process looked like when you were determining which technologies to standardize on and any cases where
[00:35:02] Unknown:
Parquet itself has actually led to a greater deal of complexity that you've had to work around? Yeah. So really, the reason here is Parquet is not a bad choice. You know, as you pointed out, it's better than JSON. And in general, when I build systems, I believe in building a system with as few knobs as possible. And I think what that does is it guides users, you know, especially ones who are less experienced, into making good choices. So 1 of my favorite kind of stories here is we had a customer who had a massive table. They were storing all this data in sequence files, you know, kind of a Hive format, and they loaded the data into Delta, and they made no other changes to it. And they ran their job, and it went from taking hours to taking 15 minutes. And they were like, oh my god, Delta is the greatest thing ever. And I was like, well, really, I just forced you to use Parquet. Like, it does have performance features, but you didn't hit any of them this time.
And so I think just by making a good choice, you can build a system that is much easier for people to use. The reason we picked Parquet specifically is just because it has good integration in Spark. So Spark has a vectorized Parquet reader that reads it directly into vectors that go directly into whole stage code gen. So it's just a really fast path into the Spark execution engine. Now, to be clear, the actual underlying transaction protocol that Delta Lake uses is actually agnostic to the format, and it actually even already has support for any file format that Spark supports. So you could use it for JSON or CSV or ORC or whatever it is. And in fact, we have other systems inside of Databricks that use the transaction log for that. But in general, you know, when I'm exposing it to users, I want to simplify things. I want to guide them to making a good choice.
I think, you know, as we continue to expand the open source project though, I think it's very likely that there will be large companies who already have tons of data in ORC, where it might make sense to actually open up this knob for advanced users.
[00:37:00] Unknown:
And diving into the open source aspect of Delta Lake, I'm wondering how that has impacted your overall product road map and your approach to development now that it is public, as opposed to being an internal product and something that you were managing directly with customers, and now that you have a broader community of people who are using it and potentially contributing, how the development cycles are designed, and just your overall thoughts on the benefits
[00:37:37] Unknown:
and trade-offs of providing this as an open source project. Yeah. So this is very young. We've only been doing this for a little over a month now. But I can definitely tell you kind of what we've seen so far. And really what it comes down to is, you know, Databricks has a long history of success with open source. It started with Apache Spark; Spark SQL was originally created here and donated into Apache. And so, you know, it's kind of in our DNA to do this. And so it's been really exciting to see the uptake. And I think really the kind of fundamental difference here is now people can basically start with Delta, even when they're not running in the cloud or they're not running on Databricks. So, you know, there's a lot of users who have HDFS clusters on prem who just never had the ability to get this transactionality and other stuff. And so I think in terms of changes, it's just more open now, which is great. We can get feedback while we're in the coding process.
We've already had a bunch of contributions to the open source repository. And what I'm really excited about, you know, coming up in this next quarter is, as we release some of these APIs for doing declarative management of data and data quality, we'll really get to kind of co-develop this with customers and, you know, the greater open source community. Rather than having it be something where we build something in secret and then dump it over the wall and then see if people like it, it can be more of a collaboration. So I think it's actually pretty good. In the short term, it means that my team, we call it the stream team inside of Databricks, is almost entirely in the process of open sourcing more and more parts of Delta Lake. So, you know, I believe pretty strongly that APIs need to be open in order for a system to be successful.
And so I want to take all of the APIs that are currently proprietary and push them into open source. So this is update, delete, and merge. This is vacuum. This is history. All of this needs to be pushed into open source. And that's basically our whole roadmap for this quarter: to, you know, detangle those from our internal platform and get them out there so that they work with just stock Apache Spark. And so going back to the Delta Lake project itself, I think the last piece, or at least the last major piece, that we haven't talked about yet is the capability
[00:39:51] Unknown:
of creating indexes across the data that you're storing in your data lake. And so I'm wondering how that manifests and any issues that you have encountered as far as scalability and maintaining consistency of those indexes as the volumes and varieties of data that you're working with grow? Yeah. So I think traditional secondary indexes would be very difficult,
[00:40:14] Unknown:
to keep in sync. You know, all the problems you just talked about would come up. So we have a couple of tricks that we play that give you kind of similar performance to indexing, but, you know, operate a little bit differently. So we support, you know, standard partitioning of data, where basically you say partition by date or partition by region, some kind of relatively low cardinality grouping of the data. And as data is inserted into Delta, we automatically segment it into these different partitions. So that's like a standard trick that Spark and Hive and these other systems have supported. What Delta changes here is, first of all, we make the management of those partitions totally scalable by taking the metadata and turning it into a data problem. So you can have a table that has millions of partitions and Delta can still handle that rather than tipping over. The other tricks that we play are when you do some of these DML operations, for example, like a delete. Let's say I partition by date and I want to delete all data that is older than a year. Well, the magic of having that partitioning there is that can be a metadata only operation. We can perform that simply by changing the transaction log without doing any actual processing. So that's pretty powerful.
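A small sketch of that partitioning trick: write the table partitioned by a low-cardinality column, and a delete whose predicate aligns with the partitioning can be satisfied from the transaction log alone. Paths, columns, and the cutoff date are placeholders.

```python
# Partition-aligned delete: because the predicate only touches the partition
# column, Delta can drop whole partitions via the log without rewriting data.
# Illustrative names; reuses the `events` DataFrame from the earlier sketch.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

(events.withColumn("date", F.current_date())
    .write.format("delta")
    .partitionBy("date")
    .mode("append")
    .save("/tmp/partitioned_events"))

DeltaTable.forPath(spark, "/tmp/partitioned_events") \
    .delete("date < '2018-07-01'")
```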
[00:41:24] Unknown:
And just overall, in your experience of building Databricks Delta, now Delta Lake, and going through the process of open sourcing it and just evolving the system as a whole, what have you found to be some of the most challenging or complex aspects of it? And what are some of the lessons that you've learned in the process that were particularly
[00:41:43] Unknown:
unexpected or valuable? Yeah. So there have been a couple of challenges. 1 is actually education. You know, you were asking about Parquet and what complications it led to. Well, 1 problem is Parquet is almost too open, in that as soon as people started using Delta Lake, they would also go and start trying to read those Parquet files without understanding the transaction log. And so we had a lot of people say, oh man, you know, it's duplicating my data, not understanding that that was the MVCC aspect and it was actually by design. So you spend a bunch of time adding kind of guardrails around the system to make it possible to use it without needing to understand everything that's going on under the covers. The other aspect is just how many different workloads exist in the world and how they vary in, you know, all different dimensions.
So, you know, Delta is now used at thousands of customers. There's over 100,000 Delta tables in existence. Last month, we processed over an exabyte of data. But what that means is people have stressed out every different dimension of this system. And so I have seen jobs crash in ways that I never expected, and we've gone in and fixed each little individual scaling bottleneck, you know, so that it runs without you needing to set a whole lot of tuning knobs. Going back to the point we talked about before, I think, you know, what's really important to building a usable system is to minimize the amount of tuning that you need to do in order to be able to use it successfully. And so I think that's been both very interesting, but also very challenging to make that possible. And going back to the duplication of data that your customers have been seeing
[00:43:13] Unknown:
and issues in terms of cost control, I'm wondering what some strategies are that you can potentially employ, whether it's something like compaction or going back and pruning old versions of data and any cases where that might be a bad idea or cases where it would be necessary either in terms of space savings or cost savings, particularly when using cloud resources?
[00:43:37] Unknown:
Yeah. So Delta supports this command called vacuum, which basically looks at a table, and it looks at what's on the file system and what's in the transaction log, and it does like a left anti join to basically find everything that should be there and then remove everything that by definition shouldn't be there. And this is totally tunable. So when you delete a file from Delta, we create a tombstone that tells us when it was deleted. And by tuning that retention window, you can decide how long you want to keep stale snapshots for. So at 1 far end of the spectrum, if you vacuum retaining 0 minutes, you know, as we call it, that would basically just turn it into a normal parquet table. It would remove everything that is not part of the current snapshot.
And so now it is just a normal parquet table and you could read it with other things, and there's no extra storage being consumed other than a transaction log, which is pretty small. At the other end, we have customers who, for compliance reasons, want to never delete anything, and they just don't run vacuum. And they're willing to pay the cost of keeping everything forever, because it means if they get audited, they can go back and say, oh, here's exactly what the data looked like on that day.
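The retention tuning described here maps onto the vacuum call roughly as follows; the path and retention values are illustrative, and going below the default safety threshold requires explicitly disabling Delta's retention check.

```python
# Remove files that are no longer referenced by any retained snapshot.
# Path and retention values are illustrative.
from delta.tables import DeltaTable

table = DeltaTable.forPath(spark, "/tmp/events")

# Keep a week of old snapshots around for time travel.
table.vacuum(retentionHours=168)

# Retaining 0 hours collapses the table to just the current snapshot; by
# default Delta refuses retention below its safety threshold, so this needs
# the retention duration check to be disabled first.
# table.vacuum(retentionHours=0)
```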
[00:44:46] Unknown:
And so Delta Lake is definitely a very interesting system. It appears to be very well designed and to provide a lot of great value in terms of useful capabilities that will potentially obviate the need for data warehouses by adding transactionality
[00:45:05] Unknown:
to these data lake systems. But when is it the wrong choice? So Delta Lake is not an OLTP database. You should not try to use it for lots of small transactions. If you have a lot of tiny updates and you apply each of them individually, it will be horribly inefficient. And so really what we're looking for is scale and throughput. That's what we've actually optimized for. So if you can take all of those updates, feed them into Kafka, and apply them in batches, that's a great pattern. But if you try to call update every time 1 arrives, you're certainly going to back up the system. Similarly, you know, I would not use this as a key value store. It's not good at doing individual point lookups, but it is great at doing massive table scans. And it's even good at locating relatively small needles in a haystack without the cost of actually having a key value store holding all of that information.
[00:46:00] Unknown:
What are some of the opportunities that you could have addressed but consciously decided not to work on? Yeah. So Delta, at least the current implementation,
[00:46:08] Unknown:
and this is, you know, nothing to say about what the format actually would allow. But in our implementation, we don't implement our own processing. And, you know, actually in some early versions we tried, and we learned that was a bad idea. So I'll tell you a story about the first version, when we built the merge operator. So this is where you take a set of changes, updates and inserts and deletes, and apply them as 1 batch into the table. It's for, like, change data capture. In the first version, we actually wrote our own implementation of join that we thought would be slightly more efficient and kinda hand coded it. And when we benchmarked it, it turns out Spark's joins with whole stage code gen are much better, much more scalable. They spill to disk, they do all this stuff. So Delta in general is actually just a bunch of cleverly written Spark SQL queries that then leverage the processing of Spark. And we try not to get into that business. When we need to fix Spark, we actually go to Apache Spark and fix it instead.
[00:47:02] Unknown:
And so looking forward for the project, I know that you said that in the near term, you have some work to extricate some of the code from the internal repositories and bring it into the open source repo as far as some of these DML capabilities. But looking to the medium and long term, what do you have planned for the future of Delta Lake? Yeah. So what gets me really excited is the idea of having a full declarative spec for defining
[00:47:29] Unknown:
the graph of data pipelines that you need to run. So what I've seen is as soon as someone starts with Delta and starts with streaming and starts with the other capabilities, they almost immediately go from 1 Delta table and 1 stream to thousands of Delta tables and thousands of streams. We actually have customers who have that many. And then the orchestration, the management becomes very difficult. And we're gonna start this quarter, and I think, you know, we'll continue this for the next year or so, open sourcing APIs that allow you to declaratively specify the layout, the quality constraints, the metadata, the kind of human readable descriptions of the data at rest within your Delta Lake, but also the flows of how data moves in between these different systems. And I think the goal here is to really ease the burden on the people who have to manage these production data pipelines.
So I like to think about the whole life cycle here. People usually just think you build a production pipeline and you're done. But really, I think there's a bunch of different steps here. You start by building it locally. You write some code, you get in your IDE, you write some Python, you write some Scala. You want to test it so that you know that it's actually correct and that it stays correct, and then you need to deploy it. But that's actually only the beginning. Then you need to worry about when someone comes to you with, or you find, a bug, so you need to be able to start that cycle again and then upgrade it. You need to monitor it so that you know that it's working, you know that you're meeting your SLAs, and you need to tune it so that you're minimizing your costs. And so our idea here is, you know, in Delta Lake, to give you the language for describing this whole system, to make it easy to do the testing, to do the deployment, and really make this whole system more declarative.
[00:49:10] Unknown:
And are there any other aspects of Delta Lake or Spark or the work that you're doing at Databricks or just the overall landscape of data lakes and data warehouses that we didn't cover yet that you'd like to discuss before we close out the show? No, I actually think I'm tapped out. Alright. Well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Yeah. To me, the biggest gap is cobbling together all of the different solutions that you need to build something that is correct.
[00:49:53] Unknown:
You know, I think Spark started this journey by unifying a bunch of APIs into a simple system that you could use. But really that didn't handle storage, that didn't handle metadata.
[00:50:03] Unknown:
And so Delta Lake, you know, I think we're kind of trying to bridge that gap. And I think there's a lot of work to be done. Like I said, you know, it's still a lot of work to actually manage a full production pipeline, but I think that that's the gap that I see. And, you know, I think it's something we're trying to address. Alright. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Delta Lake. It's a very interesting system, and I was excited when I first saw the announcement. So I appreciate you taking the time to discuss your experiences of working with it. I definitely am planning to dig a bit deeper into it and keep track of it as it continues to progress. So thank you for all of that, and I hope you enjoy the rest of your day. Yeah. Thanks for having me.
Introduction and Sponsor Messages
Interview with Michael Armbrust: Introduction
Overview of Delta Lake
Common Issues in Data Lakes
Delta Lake's Evolution and Features
Schema Evolution in Data Lakes
Transactionality in Delta Lake
Streaming and Batch Processing
Open Sourcing Delta Lake
Indexing and Scalability
When Not to Use Delta Lake
Future Plans for Delta Lake
Closing Thoughts