Summary
Data is a team sport, but it's often difficult for everyone on the team to participate. For a long time the mantra of data tools has been "by developers, for developers", which automatically excludes a large portion of the business members who play a crucial role in the success of any data project. Quilt Data was created to make it easier for everyone to contribute to the data being used by an organization and to collaborate on its application. In this episode Aneesh Karve shares the journey that Quilt has taken to provide an approachable interface for working with versioned data in S3 that empowers everyone to collaborate.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!
- Your host is Tobias Macey and today I'm interviewing Aneesh Karve about how Quilt Data helps you bring order to your chaotic data in S3 with transactional versioning and data discovery built in
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Quilt is and the story behind it?
- How have the goals and features of the Quilt platform changed since I spoke with Kevin in June of 2018?
- What are the main problems that users are trying to solve when they find Quilt?
- What are some of the alternative approaches/products that they are coming from?
- How does Quilt compare with options such as LakeFS, Unstruk, Pachyderm, etc.?
- Can you describe how Quilt is implemented?
- What are the types of tools and systems that Quilt gets integrated with?
- How do you manage the tension between supporting the lowest common denominator, while providing options for more advanced capabilities?
- What is a typical workflow for a team that is using Quilt to manage their data?
- What are the most interesting, innovative, or unexpected ways that you have seen Quilt used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Quilt?
- When is Quilt the wrong choice?
- What do you have planned for the future of Quilt?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Quilt Data
- UW Madison
- Docker Swarm
- Kaggle
- open.quiltdata.com
- FinOS Perspective
- LakeFS
- Pachyderm
- Unstruk
- Parquet
- Avro
- ORC
- CloudFormation
- Troposphere
- CDK == Cloud Development Kit
- Shadow IT
- Delta Lake
- Apache Iceberg
- Datasette
- Frictionless
- DVC
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) Looking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use. Materialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features. Sign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Truly leveraging and benefiting from streaming data is hard. The data stack is costly, difficult to use, and still has limitations. Materialize breaks down those barriers with a true cloud native streaming database, not simply a database that connects to streaming systems. With a Postgres compatible interface, you can now work with real time data using ANSI SQL, including the ability to perform multi way complex joins, which support stream to stream, stream to table, table to table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring.
Your host is Tobias Macey, and today I'm interviewing Aneesh Karve about how Quilt Data helps you bring order to your chaotic data in S3 with transactional versioning and data discovery built in. So, Aneesh, can you start by introducing yourself? Sure. I'm the CTO of Quilt Data. My cofounder and I met at graduate school at UW Madison,
[00:01:13] Unknown:
and that school is really well known for high performance database systems. And we got together around this frustration with the simultaneous power and inaccessibility of those databases, and we had this crazy idea: what if you could put all the world's data in one logical database? And that kind of started the journey for Quilt. Before computer science, I did chemistry and math, and so I'm really interested in structured ways, as in mathematically structured ways, for people to think about data. But at the same time, how do you make it so that people who don't wanna write code or don't have the ability to write code can participate in the data management process? Because data is very much a team sport. And so, as much of a mathematical background as you have, you need to think about the humans that are in that equation.
[00:01:59] Unknown:
And do you remember how you first got started working in data?
[00:02:02] Unknown:
Yeah. So I think there was an alumni event. I was living in Mountain View at the time, right around the corner from Google, and Kevin and I were just standing in my kitchen and talking about the change that we wanted to see in the world. At that time, so the year is about 2015, GitHub was kind of a phenom, and we've definitely changed what our mission is since then, but we asked ourselves the question of what that would look like for data. And so I got started in data really with an appreciation for how well CI/CD worked for code, and then being frustrated that data didn't have these structured methods of contributing that the whole company could participate in.
And, of course, before actually starting Quilt, I worked as a product manager, so I spent some time, I think immediately before Quilt, at Matterport, and we would have questions all the time like, oh, how many iOS clients are gonna be affected by this app change, or can we break compatibility with an older version of iOS? And I think I really became curious about data when it was my job to find an analytics platform, and so that was my first kind of hands-on experience of how data affects decisions and affects companies
[00:03:14] Unknown:
was through analytics and products like Mixpanel, and I've been kind of obsessed with it since then. And so in terms of the Quilt project, I actually had your cofounder on the show all the way back in 2018 when you were very early in your product journey. But for people who haven't listened to that episode, I'm wondering if you can give an overview of what it is that you're building and what it is about this particular problem space that made you want to dedicate your time and energy to it? Yeah. Our mission really is to verify the integrity of enterprise data so that our customers
[00:03:46] Unknown:
can release discoveries to market more efficiently. And what's changed, I think, since you spoke to Kevin is we have a lot more customers and a lot more field experience. So what got us into the vein that became Quilt was really our motto, which we launched on Hacker News with: manage data like code. And that came from this frustration of, hey, we have really good ways for people to collaborate in a structured way around code, pull requests, unit tests, CI/CD, so how do we make a process like that possible for data? So that was really the genesis of the thought. And then what we realized as we developed this idea was that the ideas were very compelling, but, you know, they say no product survives first contact with the market.
So I'd say what's changed since you last spoke to Kevin is just putting a lot of enterprise customers through the product, through the platform, and learning the difference between expectations and reality, between what we designed for and how people actually use the product. And I'll mention just one of those, and that's that we really feel a lot of pull from the market in what's called the biopharma space or life sciences space, and we didn't expect this. They weren't our first market. We started out in different things: finance, horizontals like machine learning, and I do wanna talk about some of the go-to-market challenges there. But in life sciences, they're deluged with multi-structured data or heterogeneous data. So what we didn't know is that they've got data from instruments, which is totally unstructured.
They've got a bunch of business files, some of which are unstructured like PowerPoint, some of which are structured like Excel files, and then they have this really strong need for human beings and machines to be able to look at the same data entities. And so that's a long way of saying our passion is really what we call universal data containers. That's what the open source standard has done for us, and then we kinda built an enterprise product on top of that. So our mission has really been to make these universal data containers, which I wanna talk more about, useful in the enterprise to cross-functional teams, not just coders, not just data engineers, but product managers, leadership, and other people who need to consume data in this team sport dynamic.
[00:05:56] Unknown:
As far as the overall focus of the product, within the past 5 years since I last spoke with Kevin it definitely has a very different feel in terms of how you're presenting it. Initially, it was very focused on the data sharing element of being able to create these different packages of data that you could then publish either internally within your organization or share externally, sort of like a data exchange. And I'm wondering what you have seen in that intervening 5 years around the practical elements of how people are working with data, maybe some of the ways that the regulatory environment has shaped the way that you think about the problem space, and how the surrounding ecosystem of data tooling might have influenced the specific pain points that people are experiencing around this question of how do I work with my data, how do I manage versioning my data, what data am I even working with, and that overarching question of where does this fit within this large problem space?
[00:07:03] Unknown:
Yeah. There's a lot of different threads to tease out there. I think the first thing is that during this time since we founded the company, so between, let's say, 2015 and the present, 2023, over those 8 years, some really interesting things happened in the landscape. First of all, we had two really trying economic downturns. Right? Which means that we had to focus on the economics of our business a lot more. The other significant thing is what happened to companies like Docker. Now they're recapitalized, and I wish them well and I think they have a very important product, but they had to completely recapitalize and essentially went for pennies on the dollar after having a $1 billion valuation.
And so the first thing that changed in the market was that this idea of, hey, just get eyeballs and you'll figure out monetization later, became false. And so living through those economic downturns as a startup, we had to figure out, okay, what is the economic value in our product? And the first thing we realized is that it's all in private data. And so there are, I guess, two ends of the pole. On the far side, you've got completely air-gapped, which you're gonna see at something like the Department of Defense. And then you've got fully online hosted multi-tenant infrastructure. Right? And in between, there's virtual private cloud. And so what we realized, first of all, is that the shift to cloud was really obvious. It's still taking place, and, you know, there are a lot of large companies that are very slow to undergo the digital transformation process.
But we had to focus on private data as a matter of really creating a fundamentally sound business. Because with open data, you tend to get a lot of demands from the community, but there's no actual business model or monetization model. And, you know, looking at the trajectory of things that we all know and use and love like Docker, it's very instructive what happened to them, because their monetization strategy was on the Swarm end. They wanted to monetize orchestration with Swarm. And then Kubernetes came along and, you know, they didn't really have a business model. So of necessity, we focused on private data. We still have the open source APIs and universal data containers, and the only place to credibly do that is in the open source.
And the way we weaponized our open source strategy is we told our customers you should never be in a data hostage situation. What does that mean? We said, well, your data should be in S3 buckets that you control, under IAM policies that you control, under open source APIs. And so that was the blend. That was the finesse. That's how we kept our open source roots but didn't die on the hill of open data, which is, you know, even Kaggle, I think they've been very successful now, but they had, I guess, an okay exit to Google. And it's just very hard. It's a very thankless business to run a lot of open data. We do care about those data. We still host petabytes of open data. You can go to open.quiltdata.com.
We do that in collaboration with Amazon and their Registry of Open Data. But I don't know. I think we probably have some entrepreneurs or would-be entrepreneurs listening to the podcast. So I love this hard question of business models. And the way we kept our integrity was to just tell our customers, no, you should have full control of your data. And we designed our product in a way that you can turn the product off and your data is just sitting there, just as you left it, in S3, which I think is pretty amazing.
[00:10:11] Unknown:
Yeah. The kind of ownership and control of data is definitely a very common thread that has come up a lot in recent years, where maybe more in the mid-2010s there was the premise that, oh, well, people just don't wanna have to deal with working with their data, so we'll run the platform. You just send it all to us, and we'll worry about it. And then with a combination of more large organizations actually getting into the game of working with data and moving into the cloud and saying, we can't do that, or we don't feel comfortable doing that. And then also the regulatory regime, particularly in terms of GDPR and CCPA, of, hey, you need to make sure that you know exactly what you're doing with that data. People really started to be leery of handing it over to anybody. And I also think that this is coincident with the downfall of the original framing of big data of just capture everything, and maybe it'll be useful eventually.
And people actually being much more cognizant and judicious about what data they're capturing and making sure that they actually have a plan of what to do with the data once it is captured and aggregated, and also being cognizant of the need to actually delete data periodically.
[00:11:26] Unknown:
I love that you mentioned that. It was just on Hacker News the other day. One of the former GCP engineers posted this article on how big data actually has nothing to do with the size of data. And then, you know, there are your typical V's: velocity, volume, variety, veracity. That's actually not a good definition. And his definition was big data occurs when it's cheaper to store data than it is to figure out what you need to delete. And this happens a lot because storage is so cheap and people are just hoarding data, and to your point, they very oftentimes do not have a plan for what they're gonna do with that data. You also brought up compliance regimes, and I wanna go back to this idea of a data hostage situation. You know, some platforms are better actors than others, but I want all the listeners to really think about how they can avoid that. And some of the failure modes are you spend, I don't know, some number of years or months accumulating data in a third party system, and then the APIs to get data out are either very slow or unavailable, number one. And number two, that data is not adjacent to your own compute. And one of the key benefits of the cloud, at least in AWS, is if you stay in the same data center, you don't pay anything for data egress, and this is just a huge efficiency savings. And so one of the things that kills us is that there are all these third party systems, things like Box and Dropbox, where people are piling up data. It's not connected to the rest of the cloud, number one.
It's not in your single security perimeter, number two. Right? Like, if you're a CIO or a CISO, you wanna be like, hey, can't we have a single data security perimeter in the cloud? That's not possible with a third party service. And then the third piece, on the compliance side, and I think this really touches on some of the pull we're seeing from the market in what I'm calling biopharma, is that when you have clinical data, it oftentimes needs to stay in a specific country. You don't even have the authority to egress that data. And so what I love about this strategy, and there is a risk in this strategy, I'll mention it in just a minute, what I love about this strategy is the customer gets to decide the tenancy and they get to decide the residency of their data. Now the risk, of course, is that now you do have to trust Amazon. Right? Or GCP or whomever your cloud provider is. I don't know if we'll ever get to fully trustless systems, but it's really interesting.
Let's just say we've just spent time moving away from on-prem as an industry, and I don't know how we're gonna manage this tension of not controlling the hardware but controlling the data plane and the control plane at a virtual level through the cloud. As far as the
[00:13:54] Unknown:
challenges that customers are experiencing when they ultimately come to Quilt Data, I'm wondering what are some of the main problems that they're trying to solve and some of the ways that they have tried to address them up to the point where they say Quilt Data is actually the right solution for me, and this solves the thing that I am dealing with at the moment.
[00:14:12] Unknown:
Yeah. Let me give you a one-sentence answer. It took us 8 years to figure out what this one sentence was, but it's really data chain of custody from instrument to scientist to filing. Okay? And what that means is that this is the problem that they wanna solve, and, you know, this is true whether you're in an ML organization, again, that's a horizontal, not really a vertical, or whether you are in finance. At the end of the day, you are a data-driven business, and there are certain discoveries that you wanna make, either through analytics or through model building, that are gonna give you insights into your customer base or into your own products or allow you to create new products. And so first of all, let me talk about the state that literally everyone is in. And I think some people who hear this will cringe and be like, how does he know what state our data is in? But that's because everybody starts in what we call the FOUL state: fragmented, obscure, uncertain, and lost. So everyone has Excel spreadsheets, or spreadsheets of some kind.
They're all used as a poor man's database, and they break really quickly. They break as rows scale. They break as you have multiple editors. There are any number of things that spreadsheets don't do well, but they are a lingua franca for people who can program and people who can't. So they're always the first go-to. They just break very quickly. The second thing is everybody has data silos. Like, okay, something's in Egnyte, something's in Box, something's in Dropbox, something's in email, something's in Smartsheet. And, by the way, our scientific teams or our computational teams are using, whatever, S3 and, let's say, Postgres. So everybody kinda starts in that state, and then they really need a crawl-walk-run approach to slowly knit together data from those different silos, and that's kind of where our company name came from and what our approach has been.
So, yeah, the themes are age old, and, you know, there's a saying that whenever someone uses Microsoft Excel, it's a business opportunity. I think that's still true. Yeah. That's definitely
[00:16:08] Unknown:
absolutely true. It's interesting too to see the people who are trying to reimagine the use cases for Excel in a more data-native environment
[00:16:21] Unknown:
while still being able to maintain some of the ease of use aspects of it? Oh, two interesting things there. So we found a great open source project that we use as part of our open source core catalog. So, you know, at a very high level, we have the immutable dataset APIs in Python for moving immutable datasets to and from S3, but also the S3 catalog. And that itself is open source, and it's strong because of the open source libraries that we incorporate. And Morgan Stanley did a piece called, I believe it was Morgan Stanley, FinOS Perspective. And what this does is it lets you parse tabular data in the browser, and you can do self-service or self exploration, which is a big theme in data mesh, this category that we've been working with AWS to define: self-service. Right? So the benefit of decentralization is everyone can create data in their own way. The challenge of decentralization is everyone can create data in their own way. It's like, how do you mesh that all together?
And so, just speaking to spreadsheets, one of the simple things that we can do is land them in S3. Right? And if we could give S3 an interface a little bit more like Box, a lot more people would stay in cloud, which has all these advantages, like, oh, we can do in-data-center transfers. Like, data science can write a pipeline over things that used to be in Box. And so that's kind of the journey we're on, and where we're really challenged is that there's so much opportunity, and just as a small team, how do we prioritize which features we actually need to incorporate first? Because those are huge companies, right, that we're talking about disrupting.
[00:17:50] Unknown:
As far as the overall ecosystem of solutions for being able to work with these kind of amorphous datasets in the cloud, there are a number of different approaches to it. From the versioning aspect, there are projects such as LakeFS and Pachyderm. From the unstructured data aspect, there are tools like Unstruk and any number of other kind of ad hoc solutions for being able to process bulk data. And I'm wondering what you see as the Venn diagram of the different problem spaces that you are addressing and what you see as your strengths and weaknesses within the different components of that diagram?
[00:18:32] Unknown:
Yeah. This is a super interesting question. I think the first trend that we see industry wide is separation of compute and storage. So one of the decisions we made as Quilt, and this would be a differentiator, we're pretty friendly with the founders from Pachyderm. They just got acquired, by the way. They went to HPE. They came out of Y Combinator as well, just like Quilt. And, to me, those products are kinda logical duals, and I wanna explain that. So with the separation of compute and storage, we made a decision as a company that we were gonna focus on storage and be 100% compute agnostic. So people use Quilt with Databricks, people use Quilt with Snowflake, you can use Quilt with systems like Pachyderm.
Pachyderm is primarily focused on compute, and they really wanted to be, I guess, Spark for Kubernetes. One way to think about that is, like, how do you write all your pipelines? I think they have a great system. I'm super impressed by their execution in the open source, and they had to eat a lot of snakes, a lot of technical snakes. Like, making Kubernetes work and making it easy for people to bring up pods is a totally non-trivial task. So I think the first differentiation is that we are 100% focused on the storage layer. And you can run Quilt, like, you don't need any compute at all for Quilt packages to be at rest in S3. You asked a kind of interesting question about what our approach to multi-structured data is, and this is something that I'm really proud of. It's a very simple innovation, and we did it in the open source, and it's something we call the eternal schema.
And the idea is that a Quilt package is ultimately a table, if you really x-ray it as a computer scientist, and that table only needs three columns to handle any type or structure of data: physical key, logical key, and metadata. And I just wanna break those down. So physical key is where it resides in storage. That could be s3://. It could be a database connection string. You name it. Any type of URI or URL, that's a physical key. Actually parsing and reading data from that physical key is a separate question. Logical key is how the user thinks about the data, so let me give you an example. Let's say you have a file in S3, a Parquet file, and you want it to be in 5 or 6 or 50 or a hundred analytical datasets. You don't wanna copy that file over and over again.
So the logical key is what allows you to nest physical data anywhere you want in the virtualized structure of the package. And then metadata is just a string. Okay? And, you know, this gets into schema-on-read magic and all that. So the first key thing is, here's what we just did. What we just did is we created a schema that will always be true whether you have semi-structured, unstructured, or structured data. So as long as you're dealing with something that is URI linkable, which includes, you know, all types of files, you can put anything you want in these universal data containers. So that's our contribution: these containers should be totally structure agnostic, and we literally don't care. So we have customers that will put, like, a PDF, a JSON fragment, an Excel file, and 10,000 cell images in the same container.
And that's something that's very hard to do with a database. Like, I'm enamored of Snowflake and how well the system works, but you're not gonna put images in Snowflake. And so, anyway, hopefully that's an answer to your question of dealing with multiple forms of structure and where we sit in the space. Our focus is universal data containers. We'll work with any compute, like, any compute that talks to S3 at least. And I think that, you know, going back to the beginning where we started, that was one of our major frustrations with databases. It's like, okay, we have the data warehouse. Great. We have Postgres.
The very first thing that happened is that business users are like, okay, where do I put my cell image? Where do I put this PDF? And if it's not in some kind of data lake or lakehouse, then the business gets deprived of that data. Anyway, there are so many angles to that question.
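To make that three-column idea concrete, here is a rough sketch of what package manifest rows can look like. The field names follow the physical key / logical key / metadata description above and the values are made up, so treat this as an illustration rather than Quilt's exact manifest layout:

```python
# Hedged sketch of the three-column "package as a table" idea described above.
# Field names follow the description (physical key, logical key, metadata);
# Quilt's real manifest format has additional fields and may differ in detail.
manifest_rows = [
    {
        "logical_key": "results/metrics.parquet",                    # how the user sees the entry
        "physical_key": "s3://example-bucket/raw/metrics.parquet",   # where the bytes actually live
        "meta": {"assay": "imaging", "instrument": "scope-07"},      # arbitrary metadata
    },
    {
        "logical_key": "docs/protocol.pdf",
        "physical_key": "s3://another-bucket/protocols/v3.pdf",
        "meta": {},
    },
]

# The same physical object can appear in many packages under different logical keys,
# so nothing needs to be copied to include it in 5, 50, or 100 analytical datasets.
```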
[00:22:07] Unknown:
And so in terms of the implementation of Quilt itself, I'm wondering if you can talk through some of the technical aspects of how you've architected it, some of the design challenges and engineering challenges that you've had to address in the process of getting to where you are.
[00:22:24] Unknown:
Yeah. So there are two major languages we use for the open source. Well, three really. So there's Python and JavaScript; the front end is done in JavaScript, and Python is how we implemented the Quilt client. We're generally very happy with that. I think we made some suboptimal decisions that we'll fix in the coming year as we think about universal data containers. One of them was just using JSONL as a file format, and one of the challenges we ran into with our customers is that these universal data containers, or Quilt packages, I've been talking about are backed by manifests, and the manifests are just tables with this internal schema. And as you know, JSONL files are not splittable in systems like Athena. They're also really huge, and this is one of the things we got bitten by. Deserialization libraries for JSON in Python aren't always efficient.
So your memory footprint bloats, and you spend time parsing and destructuring. So I think the first issue that we ran into is that JSONL is something we're looking to replace. It's really interesting because there are a bunch of row-based file formats. There's Avro and ORC. I mean, we kind of looked at these different systems. But from a performance perspective, last time I looked at Wes McKinney's blog, Parquet is still really good and it's so widely used. So one of the surprising limitations we ran into is going over a million keys for a package, having a package with over a million keys, and Parquet is one of the answers there. I still think Python was a good decision. The engineers still look at me and, you know, bring up Rust every once in a while. I don't know.
I'm sure we'll get lots of comments. I don't know that the Rust packaging and distribution ecosystem is as mature. Like, I just think about, oh, we've got PyPI, we've got umpteen different wheels for different platforms. So, Python for the client and JavaScript on the front end, and on the back end, we use CloudFormation. I'm being very transparent here because these are decisions that I'd make differently today. So, when we chose CloudFormation, the year is like 2019, right around there, and a few things. So first of all, I'm really happy we never write YAML code. We use a nice Python library called Troposphere that gives us Python bindings for generating the YAML. So we're still the same people who write Python.
I pity anybody who has multiple engineers in a YAML file, you know, trying to do a pull request around that. With the declarative Python layer, you just emit YAML at the end. So you build the tree, you have a bunch of keys, and it's clean, and you get Python modules. I think it was a smart choice at the time. In retrospect, I would have picked something else. I would look to Terraform, and I would have looked at CDK. I'm seeing a lot of customers that have been successful with CDK today. So I guess that's a very honest view on technologies that we picked and what we might do differently today and some of the things that we're planning to change moving forward.
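For readers who haven't seen Troposphere, here is a minimal sketch of the pattern described above: declare resources as ordinary Python objects and emit the CloudFormation template at the end. The bucket and its properties are illustrative only, not Quilt's actual stack:

```python
# Minimal Troposphere sketch: declare resources in Python, emit CloudFormation at the end.
# The bucket below is illustrative and not part of Quilt's real template.
from troposphere import Output, Ref, Template
from troposphere.s3 import Bucket, VersioningConfiguration

template = Template()

# Build the resource tree as Python objects instead of hand-written YAML.
bucket = template.add_resource(
    Bucket(
        "DataBucket",
        VersioningConfiguration=VersioningConfiguration(Status="Enabled"),
    )
)
template.add_output(Output("DataBucketName", Value=Ref(bucket)))

# Emit the template once the tree is assembled (to_yaml relies on the cfn-flip package).
print(template.to_yaml())
```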
[00:25:05] Unknown:
Another interesting aspect of what you're building, and something that you touched on earlier, is the question of data being a team sport, having lots of different people who need to be able to collaborate on the data, and different ways that people want to be able to produce data and make it available. And I'm curious how you have approached that challenge of being able to address the different personas or patterns of producing and consuming the data and how to make that at least a solvable experience, if not a pleasant
[00:25:37] Unknown:
one? Yeah. That's a very deep, existential question. So I wanna zoom out for a second, and this all comes back to what I think data mesh is gonna be for the industry. AWS, Amazon Web Services, whom we work pretty closely with, has kind of three bullet points for data mesh architectures. One of them is connecting data lakes, and that's just, you know, the reason we have a new data-something every 5 years, because we exhausted or retired the data-something before. So data mesh should connect data lakes, number one. Number two, and this is gonna be the answer to your question, it should allow people to work in decentralized ways. We're gonna talk about what that means. And then third, the data mesh should allow for self-service. Right? And what does that mean? When you need an analysis, if I'm a nontechnical user or non-developer, I shouldn't have to go file a ticket with IT. Like, this is the worst. It's hell for the data engineers too. It's like, oh, send this guy an email. Like, can you run this job for me? That's so horrible.
It creates lag. It creates losses in translation and errors. So think about those three characteristics of data mesh. Okay. So the first thing is the simple fact that people are creating data in distributed ways. What does that mean? That means some people are using APIs, some people are using GUIs. It means you're gonna have every structure under the sun. So the first trick is to have an abstraction that lets you bring structured, semi-structured, and unstructured data together, and this is the idea of universal data containers. The next piece, and this is a really interesting part of our business that we learn more about every day, is that you have to, in the data discovery process, bring what we call the business users, people who don't write code, to the table. Okay? Because what happened with data lakes, data catalogs, data warehouses is that they were designed by developers for developers. And the reason we have shadow IT is our fault. It's the fault of the technical people, because we made things that were so hard to use, and I wanna make sure we talk about this tension between security and usability. We made systems that were so hard to use that people just did something else.
Right? And, like, I know big organizations, even government organizations with billions of dollars, where employees pay $7 a month on their personal credit cards for their own tools because, you know, all the systems are too hard to use. And think about what happens from a security perspective: your security perimeter is defeated, basically. And we've seen, like, you know, who that works in government hasn't seen a classified email leak? Right? And a lot of it is because using the real system means, you know, signatures in triplicate and passwords in triplicate. So all of that to say, I think a lot of it, when we do discovery with our customers, has been bringing the business users to the table and allowing them to not be embarrassed by the fact that we have Excel sheets. This is one of the big failures of big data too: we talk about scale, and then a lot of users get embarrassed out of even using your product or thinking about your product because they don't have terabytes of anything. They have 6,000 Google Sheets, and, you know, can you help me munge all these together? So all of that comes down to building interfaces for business users, interviewing the business users, and making these containers two-sided. Right? That's one of the really important differences between code and data. Code has a fairly limited audience, and Docker containers have a fairly limited audience, but data containers need to be for everyone. They need to be human readable. So human readability is a big part of this, so that you can have this team sport where some people are using APIs and other people are using GUIs.
[00:28:52] Unknown:
In terms of actually working with Quilt, can you talk through a typical workflow, both at the individual practitioner level and then how that translates to the broader team being able to interact with Quilt and how it gets integrated into the broader data ecosystem of an organization?
[00:29:12] Unknown:
Yeah. Good question. We subscribe to a philosophy called bright spots, which is to say that you can just start with one Quilt package. Right? And that means a lot of times where people will start is they'll go to the open source. It's just pip install quilt3 and then quilt3 push something. And then they can use the quilt3 catalog to walk around the company and say, hey, I made a thing. And, you know, here's your PowerPoint presentation plus my Parquet file plus whatever other business data in a visual format. So we like people to start simple, start with one package, and then you go one workflow or one team at a time.
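As a rough illustration of that "start with one package" flow, here is a sketch using the quilt3 open source client. The bucket, package name, and file paths are made up, and exact method signatures may vary across quilt3 releases, so check the documentation for your installed version:

```python
# Sketch of the "start with one package" flow described above.
# Bucket, package name, and file paths are illustrative placeholders.
import quilt3

pkg = quilt3.Package()
pkg.set("report.pptx", "local/report.pptx")                   # a business file
pkg.set("results/metrics.parquet", "local/metrics.parquet")   # an analytical file
pkg.set_meta({"project": "demo", "owner": "data-team"})       # package-level metadata

# Publish an immutable, hash-addressed revision of the package to S3.
pkg.push(
    "data-team/quarterly-report",
    registry="s3://example-quilt-bucket",
    message="First bright-spot package",
)
```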
And a lot of people talk about FAIR data: findable, accessible, interoperable, reusable. One of the challenges we've had is that that's a destination. It doesn't tell people how to get there. So we like to go one package at a time towards FAIR, and then you build an automation. And, like, okay, now our Airflow DAG is emitting these packages, and that means that this is publishable to the business, and it's consumable as a report to wet scientists, to leadership, to product managers. So we very much recommend a gradual approach, and, you know, this is one of the things about data mesh. The data warehousing architectures required you to commit hard. Like, it's a big 6 month interview: what's the snowflake schema for the whole company? And blob storage doesn't care what you put in it. Now that could be a liability as well.
We encourage our customers to know that they're gonna throw away the first rev of the design. So a lot of it is very iterative. It's, hey, build a bright spot. Start with one person. Move on to one team. Move on to one pipeline. And then before you know it, you're slowly draining out the dysfunctional data swamps, and you've got these data products, which are consumable.
[00:30:50] Unknown:
Digging a bit more into that integration question, because of the fact that at the end of the day it's just S3 data, it seems like Quilt, from the tooling perspective, is largely transparent, where your Spark cluster doesn't need to know that Quilt is part of the equation, or your business intelligence system doesn't need to know that Quilt is how this data was published. But I'm wondering what are some of the points where people are maybe building direct integrations or automation around the fact that Quilt is a component in their data platform?
[00:31:23] Unknown:
Yeah. There are kind of a few big areas there. So one is integration with notebooks. There are data scientist notebooks like Jupyter. There are specialist scientist notebooks like Benchling. Notebook systems tend to be really good at recording compute and documentation of how you got where you were going, and they tend to be really weak at capturing data. So one of the simple integration points, and, you know, we ate our own dog food here, was that you should just be able to put a link or a Python import statement in your notebook that says, hey, I depend on these datasets. I depend on these packages. That's the simplest integration point. The next big place is where we have data in foreign systems. So if you think about something like Delta Lake from Databricks, now you need a way to not just point to data. Even though the data may be in S3, it's managed by Delta Lake. So you kinda need different connection strings and different URIs.
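Here is a hedged sketch of what that notebook-level data dependency can look like with the quilt3 client; the package name, registry, and revision hash below are hypothetical placeholders, and signatures may differ across quilt3 versions:

```python
# Sketch of pinning a data dependency at the top of a notebook.
# Package name, registry, and top_hash are hypothetical examples.
import quilt3

# Browse a specific, immutable revision of a package (no bulk download needed).
pkg = quilt3.Package.browse(
    "data-team/quarterly-report",
    registry="s3://example-quilt-bucket",
    top_hash="abc123...",  # pin the exact revision this analysis depends on
)

# Load just the entry the analysis needs; downstream cells work on this object.
metrics = pkg["results/metrics.parquet"].deserialize()
```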
So I think the integration points have really been around these links to Quilt packages, and that's one of the things that a data mesh should do for an organization: you should have a canonical address for every piece of data in your organization. And that alone is a big change, because if you don't have canonical addresses, people end up copying things, and then you just don't know which version of anything to trust. And I talk to customers about this all the time: if you don't have uniqueness from a data governance perspective, you don't have trust, because how do I know if, you know, file_a_b_version2.xlsx is actually what we're talking about? So hopefully I answered your question on integration points. We try to keep it really simple and light and make S3 the integration point. And one important question which you asked, and I wanna answer, is we don't care if you use S3 through the front door or the back door. So Quilt kind of is a management layer for your S3 buckets, but you don't have to go through the Quilt API. You could just use S3, and this is part of this philosophy of your data, your rules. Like, it'd be jarring if we made Quilt the only front door to S3 data, and so you can use multiple systems. I'll tell you one of the challenges is that the prior generations of systems, like Athena and, I think, Drill and some of these other systems, require physical folder structure for partitioning.
And so we're thinking a lot: what does it mean that you have to put a file in a folder of a specific name in order for Athena to jump to the right partition? That's in tension with, hey, all these things are in a package. And so we're working a lot and thinking a lot about separation of physical and logical structure. And where I'd like to take the industry, and, you know, we can talk about Iceberg, which came out of Netflix, and these other systems, where I'd like to take the industry is we should develop the physical structure on the fly, on demand. So what does that mean? If you talk directly to S3, there's only one physical structure, one layout of the data. If you talk through an abstraction layer like Quilt, you can say, oh, well, you want this partitioning scheme? Okay. Alright. Here's how it looks.
And so that's what I'm super interested in, and we're kind of big fans and users. I said Athena, but it's really PrestoDB, now Trino. And, yeah, how can we work better with schema-on-read systems and at the same time be location agnostic? It's a really interesting question.
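As a toy illustration of "developing the physical structure on demand" (not Quilt's implementation), the sketch below maps logical entries plus their metadata onto the Hive-style partition paths that engines like Athena expect:

```python
# Toy illustration of deriving a Hive-style physical layout on demand from
# logical entries plus metadata. Not Quilt's implementation; names are made up.
entries = [
    {"logical_key": "metrics/run1.parquet", "meta": {"assay": "imaging", "date": "2023-01-05"}},
    {"logical_key": "metrics/run2.parquet", "meta": {"assay": "rna-seq", "date": "2023-01-06"}},
]


def hive_path(entry: dict, partition_cols: list) -> str:
    """Build a partitioned key like assay=imaging/date=2023-01-05/run1.parquet."""
    parts = [f"{col}={entry['meta'][col]}" for col in partition_cols]
    filename = entry["logical_key"].rsplit("/", 1)[-1]
    return "/".join(parts + [filename])


for e in entries:
    print(hive_path(e, ["assay", "date"]))
# assay=imaging/date=2023-01-05/run1.parquet
# assay=rna-seq/date=2023-01-06/run2.parquet
```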
[00:34:26] Unknown:
And that's also an interesting problem to explore of allowing users to be able to layer Quilt on top of existing data without having Quilt necessarily be the authority on the data itself and being the choke point for that data. Because with a lot of different systems that are working on some of the underlying data, you do have to route everything into and out of a particular location through that system. Obviously, the most notable case of that is your data warehouse, so your Snowflake or your BigQuery, because that's the thing that owns the data. But when you're talking about data in S3, another example from the S3 case is LakeFS, where if you want to be able to get LakeFS to properly version your data and give you those transactional semantics, you have to use that as the interface. And so I'm wondering how you think about the tension of allowing for S3 being the lowest common denominator, but also wanting to be able to provide a more advanced user experience on top of that data, and how to direct people into the fact that, hey, that data is in S3, but it's also visible and accessible through Quilt, and you can get a much better experience of exploring it if you go over here.
[00:35:41] Unknown:
Yeah. I think this boils down to the agony and ecstasy of decentralization. So I think you pointed out that there are other systems, and Delta Lake is a good example, that are front-door systems. You have to do the inserts through it. You have to do your database operations against Delta Lake to get the advantage of Delta Lake. And the decentralization of Quilt means, as I said earlier, that you can come in through the front door and the back door. So now the question becomes, okay, I can come in through any angle. That's great for a business from a flexibility standpoint. It's very bad from a canonicity or normalization standpoint. And so one of the ways we're thinking about toeing this balance, in addition to this physical versus logical structure, is that first, we said the data containers are totally universal and accommodate any kind of structure, but we attach workflows to them, which means you can make declarative statements in JSON Schemas about what the shape of the dataset should be. What does that mean? It doesn't even have to be about structure. You can say, I require that there's at least one .md Markdown file. Okay? So you're ensuring that people are documenting the data. And then you can have what we call package-level or container-level metadata, and you can have a schema applied to that. And so it's this gradual structuring process that we call schema hardening. And I think this is a really good way to put the difference: you mentioned that the data warehouses have these central front doors. This is the top-down mentality.
And it's the mentality that if it's not in the database, it doesn't exist, and the DBAs are kind of like the high priests, and either you can get through the DBA and into the temple or not. And what we found the need for is, how do you go from totally unstructured files up to the warehouse, not down from the warehouse? So that is the schema hardening process. And, actually, I'll let something slip here which I'm really proud of. The same way that Git has branches, in Quilt we use S3 buckets like branches. So you have different levels of data doneness, and you can merge things up to harder and harder schemas. That sounds a little abstract, but we call these workflows. And I don't know, I'm fascinated by this question, because I think that the industry has had a really top-down bias.
Again, because of this by-programmers, for-programmers mentality, and I wanna see if we can bring this bottom-up, and it's very organic. Right? It's a very organic structuring process where you don't know the schema upfront, and you kinda start to assemble things and figure it out as you go on.
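To make the "declarative statements in JSON Schemas" idea concrete, here is a small illustrative sketch of validating package-level metadata against a schema before promoting a package. This is not Quilt's actual workflows configuration format, only the general shape of such a check:

```python
# Illustrative sketch of schema hardening applied to package-level metadata.
# This is NOT Quilt's actual workflows configuration format; it only shows the
# general idea of validating metadata against a JSON Schema before publishing.
from jsonschema import ValidationError, validate

package_meta_schema = {
    "type": "object",
    "required": ["project", "assay", "operator"],
    "properties": {
        "project": {"type": "string"},
        "assay": {"type": "string", "enum": ["rna-seq", "imaging", "proteomics"]},
        "operator": {"type": "string"},
    },
}

candidate_meta = {"project": "demo", "assay": "imaging", "operator": "akarve"}

try:
    validate(instance=candidate_meta, schema=package_meta_schema)
    print("Metadata conforms; the package can move to the next bucket / stage.")
except ValidationError as err:
    print(f"Rejected: {err.message}")
```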
[00:37:57] Unknown:
And in terms of your experience of building Quilt and working with customers and exploring this space of making work with the underlying datasets more reproducible and, you know, easier for more people to be involved in, what have been the most interesting or innovative or unexpected
[00:38:17] Unknown:
ways that you've seen Quilt applied? I have one that hit me really hard, and I think you mentioned compliance earlier on. With compliance, the bad thing is, boy, it's a lot of regulation. It's very complex, and it's often written by teams or committees that don't have a deep understanding of technology, and I'll give you an example. Like, in GDPR, when someone invokes the right to erasure and says, hey, take me out of your system, you have to provide a certificate of destruction for the data. Like, what does that mean? Is that a piece of paper? So, anyway, I think that's the first challenge with compliance. The good thing about it from a business perspective is you basically have to do it. So if you have a good compliance product, then people have to buy your product. Now here's a surprising connection.
If you don't have a data versioning strategy, you don't have a compliance strategy. And the thing that really blew us away, and continues to blow us away to this day, is that in our industry, or in biopharma, there's this concept of an IND filing, investigational new drug. And I'd say compliance is proving that you did what you said you did when you said you did it. Well, guess what? That's exactly what a data versioning tree, a revision control history, does for you. And so, I don't know, I'm very tantalized by this idea of compliance and data versioning actually being the same thing. And I wanna keep pushing on that, because compliance is a process of gradually hardening your schemas and getting approvals from people. And now I'm interested in thinking about, you know, what's the human in the loop? Like, what does a pull request for data look like? Right? Because this team sport of data, I can't do it on GitHub. Right? Like, you know, you're gonna get hate mail from GitHub over a gigabyte. Right? And how do I hand something off to my manager and get a signature and say, hey, this data is okay, let's move to the next stage? I don't know. And
[00:39:59] Unknown:
one thing that I'd like to revisit, in terms of the versioning aspect, is I'm curious what are some of the, I guess, edge cases or pain points that you've had to address around how to allow for versioning of the dataset while maintaining some sort of canonical reference to that data, so you don't end up in the situation of, you know, financial report, version 2, final, no really, I mean it this time, dot xlsx.
[00:40:28] Unknown:
Yeah. It's a great question. So the second problem we solve by having logical as opposed to physical keys. So, like, in a package, in this universal data container, if I have foo.xlsx, it can point to any physical storage, number one. Number two is we create a top hash for all of those things. So it's the hash of all the SHA-256 hashes. So the canonical location of the data is not only, like, a namespace, okay, a repository, which is an S3 bucket, and then a package name, but also the actual hash of the data. Oh, challenges. I can write you a book on this. So there's a couple. The first is Amazon S3, which is the eighth wonder of the world. Like, I so much love S3, and let me just give them a quick shout out. For them to go from eventual consistency to strong consistency, do you realize that just happened? Like, S3, I don't know, 2 or 3 years ago just became globally consistent. That is amazing. So congratulations to them. There are a few silly design decisions they made, though. So first of all, the default hash for S3 is MD5.
Okay? And people are like, why? It's not a cryptographically secure hash, so you can develop collisions pretty easily. Why is that dangerous? Well, say you have a package. If we used an MD5 signature for a Quilt data package, someone could say, oh, Pfizer created this data, and then, you know, go find an MD5 collision and falsify Pfizer's data. Right? So our first challenge was we needed to use SHA-256. That created a bunch of I/O issues, like, oh, now we need to stream the data. We need to see every byte of the data. And there are a lot of interesting innovations here. There are different ways to fingerprint data. Of course, the problem if you invent your own hashing algorithm, your own hashing scheme, is how do other people reproduce the data now. So that was the first problem. Then Amazon came out and did S3 checksums, which were very helpful, and you can now get S3 to calculate a checksum for you, which saves you a lot of I/O, because if you're outside of the Amazon cloud and, you know, you wanna version a 1 terabyte package, you don't wanna stream everything out just to hash it and stream it back. Well, that's all well and good, but of course they did it in their own way.
And, you know, the SHA-256 checksum for an Amazon object is not the same as the SHA-256 checksum of the object itself. It's the checksum of all the parts if it was a multipart upload, or it's just the checksum of the single object otherwise. So the answer to your question, and this is something we think about every day, is that getting hashing consistency, and getting the cloud provider to do the hashing, has been the biggest trick. And one last funny story for you: when Amazon announced S3 checksums, they basically said, we did this because we have customers who have huge armies of EC2 machines whose only job is to hash data. Right? And if you can push that complexity down to the object layer, great.
But then you also have to reverse engineer Amazon's hashing scheme, which isn't too bad.
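A short sketch of why those two values differ: a whole-object SHA-256 hashes every byte in sequence, while a multipart-style composite checksum hashes the concatenated per-part digests. The part size and the "-N" reporting format below are illustrative rather than a precise reproduction of S3's behavior:

```python
# Sketch of why a multipart "checksum of checksums" differs from a whole-object hash.
# Part size and file name are illustrative; S3's exact reporting format may vary.
import hashlib

PART_SIZE = 8 * 1024 * 1024  # assume 8 MiB parts for illustration


def whole_object_sha256(path: str) -> str:
    """SHA-256 over every byte of the file, the way a client-side hash would work."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def composite_sha256(path: str) -> str:
    """SHA-256 of the concatenated per-part digests (multipart-style composite)."""
    part_digests = []
    with open(path, "rb") as f:
        for part in iter(lambda: f.read(PART_SIZE), b""):
            part_digests.append(hashlib.sha256(part).digest())
    combined = hashlib.sha256(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"  # "-N" suffix marks the part count


# For any file larger than one part, these two values will not match:
# print(whole_object_sha256("big.bin"), composite_sha256("big.bin"))
```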
[00:43:15] Unknown:
And maybe this is the same answer, but in your experience of building this platform and growing it to where it is now, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:43:26] Unknown:
I go back to, first of all, getting developers and non-developers to work together, from this team sport aspect. Also, one of the biggest challenges was we started the company with manage data like code, and great, we're gonna make this GitHub-for-data-like thing. That's not our company anymore, and there were a lot of dead bodies on that road. And so the biggest challenge for us has been understanding that just naively adopting code metaphors for data doesn't work. And there are a couple of important reasons. Data is not executable. Data is much larger than code. Like we mentioned earlier, go try and put a file of any size on GitHub and see what happens to performance. LFS is scarcely better. And then the audience is much different. And so I think our IP is really in understanding how you can have code-like data management without doing the naive thing and just copying what code did, because that doesn't work.
And one of the reasons is that Git, as you know, under the hood has its own database. Right? That's totally unacceptable for an enterprise data solution. Like, if we told our customers, hey, you can use our data versioning, you can have data chain of custody, but put your thing (a) in my system or (b) in my format, we could just go home. Because you can't move your data warehouse overnight. You don't want to, and nobody should ask you to. So, anyway, these are kind of the interesting things, and that's why it's hard to even describe our category, frankly.
[00:44:53] Unknown:
And for people who are interested in being able to take advantage of these versioning and discoverability aspects and being able to bring more people into collaboration on some of these data assets in S3, what are the cases where Quilt is the wrong choice?
[00:45:09] Unknown:
Yeah. There are a few things that Quilt doesn't do. When you need heavy compute, you should look at Databricks, you should look at Snowflake. We do tell our customers, hey, break Athena first before you go out and buy Snowflake. That's just crawl, walk, run, just an intelligent thing to do. So we don't do a lot of compute. There is an Elasticsearch cluster if you're an enterprise Quilt user, and we do make Quilt packages available so that they're Athena-queryable. But for any compute workloads, you're gonna need to use something in tandem with Quilt. That's number one.
Number two, and I think there are very few of these use cases, but if all of your data is structured and you already have a reporting and dashboarding system, there's no need to use Quilt, right, because everything's gonna stay where it is. Again, this is a rare case, but if it's all transactional data, that's another one. Quilt is not really for that; use a database.
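For the Athena point above, here is a rough sketch of what querying package contents from SQL can look like using boto3. The Glue database, table, and column names are hypothetical placeholders; the actual names depend on how a given Quilt deployment registers its manifests.

```python
import boto3

athena = boto3.client("athena")

# Database, table, column, and bucket names below are hypothetical.
response = athena.start_query_execution(
    QueryString=(
        "SELECT logical_key, size "
        'FROM "quilt_manifests"."example_bucket_packages" '
        "WHERE package_name = 'genomics-team/sequencing-run-42' "
        "LIMIT 10"
    ),
    QueryExecutionContext={"Database": "quilt_manifests"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```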
[00:46:00] Unknown:
And as you continue to iterate on Quilt and try to evolve alongside the broader ecosystem, what are some of the things you have planned for the near to medium term or any particular projects or problem areas that you're excited to dig into?
[00:46:15] Unknown:
Well, I have to answer this question in a way that my product management team doesn't attack me when the podcast is over, so I'll tell you the areas that we're really curious about. One is cross-company collaboration. The biggest silo we see is the document store. Right? So you've got object stores, everybody knows what those are, S3, Google Cloud Storage, Azure Blob Storage, and then you've got document stores: Box, Egnyte, Dropbox. Why can't those be the same thing? That's one of our big themes. The second thing I'm super excited about: I mentioned earlier that we're thinking about what the next manifest file format is for Quilt, something like Parquet, so obviously scaling the solution. And I'm very excited about the Apache Iceberg project. I think it solves a lot of these problems in reproducibility, and as CTO I would very much like to have a proper SQL-queryable back end. Whether or not people interact with it that way is separate from Quilt packages.
And so, I love Iceberg. Now, Iceberg, I think, is more scalable than Trino or PrestoDB, but it has a problem: you have to copy the data into Iceberg. Right? Partitions, though, are much more flexible. So anyway, I'm super excited about how we can scale these universal data containers, or Quilt packages, to at least hundreds of millions, if not billions, of objects. Billions is always hard no matter what system you use. With underlying pieces like Iceberg, we can continue this schema-on-read approach, which isn't ideal from a performance standpoint, but it's super ideal from a flexibility standpoint.
So: integrating the document stores and the object stores, scaling packages to billions of objects with things like Iceberg, and then continuing to work on this schema hardening idea of, hey, my data team's gonna come to the party with no structure, and how do we figure out the structure, so that six months from now, when it's all hardened, we can load it into Snowflake or whatever we're gonna do next.
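The schema-on-read idea mentioned here can be illustrated without any Quilt- or Iceberg-specific machinery: the schema is discovered from the files at query time instead of being declared up front in a warehouse. A minimal sketch with pyarrow, using made-up paths and column names:

```python
import pyarrow.dataset as ds

# No schema is declared ahead of time; pyarrow infers and merges it from the
# Parquet files found under the (hypothetical) prefix.
dataset = ds.dataset("s3://example-data-bucket/curated/run-42/", format="parquet")
print(dataset.schema)

# Column pruning and predicate pushdown still work without loading a warehouse.
table = dataset.to_table(
    columns=["sample_id", "quality_score"],
    filter=ds.field("quality_score") > 30,
)
print(table.num_rows)
```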
[00:48:02] Unknown:
And are there any other aspects of the Quilt project or the specific problems that you're trying to solve that we didn't discuss yet that you'd like to cover before we close out the show?
[00:48:12] Unknown:
Yeah. I would love to talk a little bit about open source and how closely we work with the community. I personally think this idea of a universal data container is sorely needed, and I guess that's a call to action. It's important to us to develop in the open source, and that really goes to the next level when contributors come in. So if you're a Pythonista, or if you love React, come join us in the open source and help us this year to define the universal data container standard. That's where we'd love to have help from the community, and those standards don't work unless multiple people sign on.
[00:48:48] Unknown:
And from that data container standard perspective, I'm assuming that you may have already spoken with him, but I'm curious how much engagement you've had with, or how much influence you've taken from, the Datasette project.
[00:49:00] Unknown:
I actually know very little about Datasette. I've seen Frictionless Data before this; they were in the package manager arena. Google has, at least from a JSON-LD perspective, their own concept of a dataset. But I'm gonna look that up, the Datasette project. I'm curious. Yeah. So Datasette,
[00:49:15] Unknown:
that's spelled sort of like a portmanteau of data and cassette, is effectively: how do you take a read-only SQLite database and publish it alongside your code and analysis so that it's read-only and interactive in a web environment. It's about being able to publish these kinds of small datasets so that people can interact with them.
[00:49:35] Unknown:
There are a number of projects. This reminds me a little bit of DVC as well. I think our position in this is gonna be, hey, what's the simplest possible kernel we could make inside of this? So we don't wanna depend on a database. And I think there's room for these standards to play together, and that's the beautiful thing about a universal schema: anybody can implement it. Right? If you have these three columns, let's go. It's a package, it's a dataset, whatever you wanna call it. So, yeah, it's interesting, and I think we should help each other, because it's hard to have anything that's universal if there are three slightly different ways of doing it. Absolutely.
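To give a flavor of the "three columns and let's go" point, here is a purely illustrative manifest-style record. It is not the actual Quilt manifest format or any agreed standard, just a sketch of how little a universal data container might need per entry: a logical name, a physical location, and a verifiable hash.

```python
import json

# Illustrative only: one JSON Lines record per entry so any tool can stream it.
entries = [
    {
        "logical_key": "raw/reads.fastq.gz",
        "physical_key": "s3://example-data-bucket/runs/2023-01/reads.fastq.gz",
        "hash": {"type": "sha2-256", "value": "<hex digest>"},
        "size": 1_073_741_824,
    },
]

with open("manifest.jsonl", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```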
[00:50:06] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:21] Unknown:
I have to say it's around this bottom-up data management. The biggest gap is the gap between whatever you have in terms of a lakehouse or warehouse and the business users' data, which is in Box or Dropbox. And, with some luck, we're building exactly that bridge. It's an evergreen problem. I mean, this is just a question for the audience, I won't even answer it: how can your business function at an analytic level if 30% or 50% of the data is in a silo that's not connected to your cloud compute? I don't know.
[00:50:52] Unknown:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you and your team are doing at Quilt. It's definitely a very interesting project, and it's great to see the work that you've been doing over the past few years. I appreciate all of the time and energy that you've been putting into making this a more tractable problem, and I hope you enjoy the rest of your day. Okay. Thanks for your time. I really enjoyed your questions, and I never get tired of geeking out about data engineering. So here we go.
[00:51:21] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Episode Overview
Guest Introduction: Aneesh Karve
Journey into Data and Founding Quilt Data
Quilt Data's Mission and Evolution
Market Changes and Business Model Adaptations
Customer Challenges and Quilt Data Solutions
Ecosystem and Competitive Landscape
Technical Implementation and Design Challenges
Data as a Team Sport: Collaboration and Usability
Typical Workflow and Integration Points
Decentralization and Data Management
Innovative Applications and Compliance
Versioning Challenges and Solutions
Future Plans and Exciting Projects
Open Source and Community Engagement
Closing Remarks and Contact Information