Summary
Collaboration, distribution, and installation of software projects are largely solved problems, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and publishing them for use in the software systems that rely on them. The CEO and founder of Quilt Data, Kevin Moore, was sufficiently frustrated by this problem to create a platform that attempts to be the means by which data can be as collaborative and easy to work with as GitHub and your favorite programming language. In this episode he explains how the project came to be, how it works, and the many ways that you can start using it today.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Your host is Tobias Macey and today I’m interviewing Kevin Moore about Quilt Data, a platform and tooling for packaging, distributing, and versioning data
Interview
- Introduction
- How did you get involved in the area of data management?
- What is the intended use case for Quilt and how did the project get started?
- Can you step through a typical workflow of someone using Quilt?
- How does that change as you go from a single user to a team of data engineers and data scientists?
- Can you describe the elements of what a data package consists of?
- What were your criteria for the file formats that you chose?
- How is Quilt architected and what have been the most significant changes or evolutions since you first started?
- How is the data registry implemented?
- What are the limitations or edge cases that you have run into?
- What optimizations have you made to accelerate synchronization of the data to and from the repository?
- What are the limitations in terms of data volume, format, or usage?
- What is your goal with the business that you have built around the project?
- What are your plans for the future of Quilt?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Quilt Data
- GitHub
- Jobs
- Reproducible Data Dependencies in Jupyter
- Reproducible Machine Learning with Jupyter and Quilt
- Allen Institute: Programmatic Data Access with Quilt
- Quilt Example: MissingNo
- Oracle
- Pandas
- Jupyter
- Y Combinator
- Data.World
- Kaggle
- Parquet
- HDF5
- Arrow
- PySpark
- Excel
- Scala
- Binder
- Merkle Tree
- Allen Institute for Cell Science
- Flask
- PostgreSQL
- Docker
- Airflow
- Quilt Teams
- Hive
- Hive Metastore
- PrestoDB
- Netflix Iceberg
- Kubernetes
- Helm
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello. Welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And are you struggling to keep up with customer requests and letting errors slip into production? Wanna try some of the innovative ideas in this podcast but don't have time? DataKitchen's DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality.
Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement today and sign up for the newsletter at datakitchen.io/de. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey, and today I'm interviewing Kevin Moore about Quilt Data, a platform and tooling for packaging, distributing, and versioning your data. So, Kevin, could you start by introducing yourself? Hi. Thanks, Tobias. So I'm the cofounder and CEO of Quilt Data. My background is in computer architecture and database systems. But a few years ago, I got to thinking about how,
[00:01:44] Unknown:
there were so many amazing tools for managing code, but we didn't have the same things in the data world. And having worked in data processing, thinking about it from the hardware systems up through memory through the internals of database systems, it was really fascinating to think about it from the application standpoint. And I just couldn't help but be struck by the lack, for datasets, of the collaborative tools that we had for code.
[00:02:08] Unknown:
And do you remember how you first got involved in the area of data management? So my last regular job was
[00:02:15] Unknown:
in the research labs at Oracle. And after deciding it was time to move on from there, I was thinking about the transition in computing to the cloud and what a fascinating transition this was, really, for the whole world of computing: moving essentially the bulk of the world's computing into a few massive data centers. And the thing that kept hitting me was what was happening to the data. One of the major challenges of working with large datasets was always bringing data together, getting the pieces, if you will, from the Oracle days when we were working in tables, getting the table fragments close enough together to join them. And now, instead of datasets being distributed on servers and in server rooms all across the country, those servers were all in the same room on the same super high speed network. And I got to thinking, well, this is really an exciting time to move into and really an inflection point in terms of data management. So I decided that it was time to think about switching from the internals to thinking more holistically about the whole system. And as you mentioned,
[00:03:24] Unknown:
the intent for Quilt is to create some tooling to enable collaborative use cases for data. So I'm wondering if you could describe a bit about the origin story of Quilt and what the initial use case was for it.
[00:03:40] Unknown:
Yeah. I'd be happy to. So going back a ways, I mentioned just a second ago this idea that there were collaborative tools for code but not for data. I was actually sitting at dinner, having a conversation with a woman who, of all things, worked at a charity in Kenya, and her passion is helping women and girls stay in school by managing their health. And she said something that struck me. She said, you know, we don't have any way to hold the government and the schools and the organizations, even the NGOs that she works with, accountable, and said that where she worked, she dealt with corruption all the time, and what was really needed was accountability and evidence. And the piece that she was missing, as grandiose and ironic as it sounds, was actually a way of pulling the data together, putting that evidence together. And this is where that idea came into play of, like, gosh, you know, if this was about software, I would know exactly where she could set up a code repository to share some open code. But how could I tell her? And, you know, here I am, PhD in computer science, I was working at Oracle, I should know to tell her, you know, where to go to just start working on a database and get set up and running and, you know, share that and let people upgrade. And it occurred to me, well, I could tell her about RDS, I could tell her about BigQuery, but none of those tools fit her need. There wasn't any place where they could just get a simple tool together, put a dataset together, have others react to it, contribute to it, collaborate on it, and distribute it in an efficient way. So I guess that's in some ways the origin story: that moment got me and my cofounder thinking, like, there should be a better way. For somebody who is using Quilt now that you've been able to build it up and iterate on it a few times, what does a typical workflow look like for being able to step through the life cycle of the data that they're using within the Quilt platform?
Happy to do it, but I just wanna give at least a little hint of what's in between. So we started with this idea of creating easy collaborative tooling around datasets, with the vision of making a place and a system for people to connect all the world's data and do it a lot more easily than what was available at the time. But we really had to change our strategy dramatically. So the first product we put together to execute on that vision, I guess it's the same vision, was very, very different. So when I describe what people are doing with Quilt, it'd be important to shift the mental model a bit. We tried to build something that we thought was more like a GitHub for data, where there was this sort of social hub in the cloud, and the idea was you could just sort of start a project from a clean slate or you could bring your datasets there. And that really didn't work, for several reasons that maybe are not as related to this interview, but we switched our mental model because we really needed something that helped people get started right away. And it occurred to us, after watching people use this, you know, social database in the cloud that we had built, that the most common thing people were doing with it was export to CSV, download the CSV, read it into pandas, and pull it into their Jupyter notebook. And so, having read Lean Startup and all those sorts of things, and having gone through Y Combinator, with them telling us to listen to our users, listen to our customers, and iterate, we scratched our heads and said, you know, I think we're doing this wrong. So we scrapped that first product completely. And instead, the first inspiration was honestly package managers, but it morphed into Docker. We had been using Docker to deploy our software, and we realized that installing, you know, a runnable software environment was so easy. In the same way, installing and importing a library, with pip if you're running Python, with npm if you're running JavaScript, was so much of an easier process than this: searching Google for the dataset that we needed, finding out what the formats were, finding out where the files were, what the URLs were, downloading them somewhere, finding a staging area, figuring out how to unpack them and interpret them. We started thinking, well, can we get that kind of experience, that pip install, for data? So, as a very long and winding way of explaining what the workflow is with Quilt, the simple answer is it varies entirely depending on what each person's or organization's datasets are and what their tools are. But a very simple getting-started workflow is: I'm collecting some files, say CSVs, maybe there are some images, and I wanna pull those into a Jupyter notebook. And now I wanna share that Jupyter notebook with my collaborators.
And instead of having to throw everything into the same git repository or, you know, worse, like, I don't know, park it all in some S3 bucket or something, we found a way of separating the data versioning and the data transport from the code transport. So you can check your notebooks right into GitHub as you always would, and now you can wrap your data assets in this Quilt package and, you know, quickly and seamlessly, in a single line of code, import that into your notebook, and the notebook will be portable. The dataset is identified by its name. And inside Python, it acts like you would want, which is to say, as a Python object.
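To make that concrete, here is a minimal sketch of the install-then-import workflow being described, in Python. The package handle (uciml/iris) and node names are illustrative only, and the exact commands and module paths may differ between Quilt versions:

```python
# Minimal sketch of the workflow described above; the handle and node names
# are examples, and the exact API may differ between Quilt releases.

# One-time download into the local cache (run in a shell):
#   $ quilt install uciml/iris

# Inside the notebook, the dataset is addressed by name and arrives as
# ready-to-use Python objects rather than raw files:
from quilt.data.uciml import iris   # the single line that pulls the data in

df = iris.tables.iris()             # a leaf node materializes as a pandas DataFrame
print(df.head())
```

With the notebook checked into GitHub and the data handled this way, a collaborator only needs the package name to reproduce the same inputs.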
[00:08:54] Unknown:
Backtracking a bit more to what you were saying about the social platform for data, is that something that, when you abandoned it, you ended up leaving some remnant of the code for people to explore? And also, I'm curious how that original vision compares to some of the work that they're doing at data.world, if you're familiar with that platform.
[00:09:14] Unknown:
I am. I think it's very similar to data.world. I mean, I think how it works under the hood and the way of interacting with the datasets was different, but the spirit of it is very similar. And I have followed them a little bit and, you know, wish them the best of luck in doing what they're doing. We also saw that a really successful example in that space is actually what Kaggle is doing. You know, Kaggle has put together a tremendous collection of datasets and kernels, and it's a fantastic source of open data.
[00:09:46] Unknown:
And now with the current model that you have of this package of data, I'm wondering if we can just start by stepping through what the actual data package consists of and what the various layers of it are for somebody who's first getting started with working with it? Sure. Sure. Happy to. So a data package is a collection of serialized data and metadata.
[00:10:11] Unknown:
Essentially, it's a metadata wrapper around a transportable dataset. The libraries that come with Quilt are able to take that dataset all the way into memory, into code. So inside Python, you can see a dataset as a Python module, if you will, or a collection of ready-to-use, ready-to-analyze objects. The kind of canonical example of that is the pandas data frame. But it's certainly not limited to tabular data; that's simply one example of an object that is ready to go in Python, in the form that, say, a data scientist would wanna work with it. And when I was reading through the documentation, it looks like if you are able to identify
[00:10:56] Unknown:
or easily see that the data that somebody is first importing into the package is in that tabular format, then when you actually store it in the package, you automatically will serialize it to Parquet files or HDF5 depending on the way that it's formatted. Is that accurate?
[00:11:12] Unknown:
Mostly. We don't think of it as automatic so much as that's the default. The package creator can choose the serialization format, and eventually we'll be able to supply plugins for serialization and deserialization. So the idea is that you can, as a package creator, create both the in-memory, you know, in-code experience for that dataset and standardize that, and also help guide the serialized, transportable representation. That said, we wanna provide high-performance, generally good-choice defaults, because a lot of our users, of course, will be using the default. So for tabular data, Parquet is a great choice. It's much smaller, in that, you know, compression works very well on it by rotating to columnar form. It's easily shardable and distributable, and it's immediately readable in, you know, the whole Hadoop and Spark ecosystems and, of course, also now Python with the Arrow project.
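For the tabular default, the round trip through Parquet via Arrow looks roughly like this in plain pandas/pyarrow terms (generic Arrow usage, not anything Quilt-specific), which is why the columnar compression and cross-ecosystem readability come essentially for free:

```python
# Generic pandas -> Arrow -> Parquet round trip, illustrating why Parquet
# works well as a default serialization for tabular package nodes.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"cell_id": range(5), "intensity": [0.1, 0.4, 0.2, 0.9, 0.7]})

table = pa.Table.from_pandas(df)              # rotate rows into columnar form
pq.write_table(table, "fragment.parquet",     # compressed, shardable on disk
               compression="snappy")

df_back = pq.read_table("fragment.parquet").to_pandas()
assert df.equals(df_back)                     # lossless round trip
```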
[00:12:08] Unknown:
Do you have any integration points for some of those other ecosystems where somebody can easily import a Quilt package into a running Spark system or Hadoop file system and then automatically
[00:12:22] Unknown:
be able to easily unpack the data from that package and then just start processing it as they would if it were just a flat file? With Spark, yes. And at the moment, only PySpark, but we're working on that. So we've started with Python and PySpark. We have Excel and R waiting in the wings, and Scala probably eventually, but we're gonna see how far PySpark takes us. It'll be interesting to see in the Spark community when someone really clamors for the Scala or, you know, native Java implementation or interface.
[00:12:52] Unknown:
So you mentioned that one of the initial and main use cases for people using Quilt is within the context of a Jupyter notebook. So when they version the notebook in GitHub and the data in Quilt, what's the process for being able to then recombine those two pieces on somebody else's computer, who is using the notebook to try and reproduce the previous person's work? It's very easy. So there are two pieces. If they want to
[00:13:21] Unknown:
download a local copy of the dataset, they would do a quilt install, sort of analogous to a pip install. We have a mechanism that's akin to requirements.txt. There's actually a quilt.yml; it's a YAML file instead of a TXT file. I'll let that debate happen in GitHub issues. But with this file, you can just quilt install from this quilt.yml file just like you would pip install from a requirements.txt. And that would work with something like Binder: a popular way of sharing notebooks in a reproducible environment is to use the Binder project, if you're familiar with that. So we've actually written a little plugin for Binder, and you can add a quilt.yml to your Binder folder, you know, the folder that you're using for Binder, and that will just work. And when the Docker image is created, it can pre-install all of those Quilt packages, or, of course, you could just have it set to grab that data on demand from the cloud.
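As an illustration of that dependency-file idea, pre-installing data packages for a Binder or Docker image might look something like the sketch below. The file contents shown in the comment, the package handles, and the exact install API are assumptions based on the description here, not a definitive spec:

```python
# Sketch of pre-installing the data dependencies listed in a quilt.yml-style
# file so that a Binder/Docker image has the packages cached locally.
# The file format in the comment and the handles are illustrative.
#
#   # quilt.yml -- analogous to requirements.txt
#   packages:
#     - uciml/iris
#     - examples/wine
#
import quilt

DATA_DEPENDENCIES = ["uciml/iris", "examples/wine"]   # hypothetical handles

for handle in DATA_DEPENDENCIES:
    quilt.install(handle)   # cache the package so the notebook can import it by name
```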
[00:14:18] Unknown:
And for the data package itself, can you go a bit deeper into how it's structured and what the interface looks like for somebody who is building or consuming one of those packages?
[00:14:24] Unknown:
Sure. For the actual nuts and bolts of the bytes, if you will, of a Quilt package, there's a package manifest, which is JSON, and that specifies the tree, you know, a tree of Python objects, if you will. It's worth thinking about it in Python. But that tree is actually structured as a Merkle tree, so it's a hierarchical hash tree. And each of the leaves can actually be a list of hashes of individual data fragments. And those data fragments are identifiable by their SHA-256 hash, and you can fetch them from the Quilt system by that hash. So it's possible to send, if you will, only the manifest and then load the fragments on demand. We call it a metadata install. And each of the fragments is deduplicated and stored only once at any given location. So the local storage on, say, your laptop or your compute server is essentially a local cache of the data that's backed by Quilt and a Quilt registry. And the Quilt registry, by the way, it's not like there's the one Quilt registry.
It's open source. You can run your own. There are multiple Quilt registries running around in the wild; the intention is for there to be lots of them. And each Quilt registry will only store one copy of a unique data fragment, which could be, you know, a file or a piece of a file, and we use that to allow hierarchical caching and efficient replication of the data.
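The content-addressing scheme described above is simple enough to sketch: hash each fragment, derive parent hashes from child hashes, and use the fragment hash as the cache key. This is only a schematic illustration of the idea, not Quilt's actual manifest format or cache layout:

```python
# Schematic illustration of content addressing and a Merkle-style hash tree;
# not Quilt's actual on-disk format.
import hashlib
from pathlib import Path

def fragment_hash(data: bytes) -> str:
    """A fragment is identified by the SHA-256 of its bytes."""
    return hashlib.sha256(data).hexdigest()

def node_hash(child_hashes):
    """A parent node's hash is derived from its children's hashes."""
    return hashlib.sha256("".join(sorted(child_hashes)).encode()).hexdigest()

CACHE_DIR = Path("quilt_cache")  # hypothetical local cache location

def fetch_fragment(frag_hash, fetch_remote):
    """Return a fragment from the local cache, falling back to remote storage."""
    CACHE_DIR.mkdir(exist_ok=True)
    cached = CACHE_DIR / frag_hash
    if cached.exists():              # content-addressed: it is either present or not
        return cached.read_bytes()
    data = fetch_remote(frag_hash)   # e.g. fetch via a permissioned storage URL
    cached.write_bytes(data)         # deduplicated: stored once per unique hash
    return data
```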
[00:15:41] Unknown:
And within the manifest, is it possible to add some additional metadata about what the datasets are and their intended purposes, or perhaps where they came from, so that you can record provenance of the packages?
[00:16:00] Unknown:
Oh, great question. Absolutely. So there are a few ways to add metadata to the package. Each node can have a user-defined or, you know, package-builder-defined blob of metadata, just a little JSON fragment, or you can supply it in Python as a simple dictionary. And that'll allow programmatic filtering of a package. So for example, one of our largest data publishers in Quilt right now is the Allen Institute for Cell Science, and they published this amazing collection of cellular images, some of the best cellular microscopy images in the world. And these are very, very large collections of data, at least by most people's standards; large is a relative term in the data world, but several terabytes per package and tens of thousands of images. And a researcher who wants to work with that, say to feed it into a machine learning model, could go through that large data package, run a filter on it, and choose only the nodes that match certain criteria in the metadata: a particular tissue type, a particular cellular condition, looking for images that have a certain feature in them, and pick out a reduced set of nodes. And that will also act like a Quilt package. They could then, you know, make a local copy of just the fragments that they want, or they could navigate that and feed it into downstream code, like feed it into their model for training.
They could make a derived package out of it. They could add to it and publish it as a new data package, a new dataset with the results of their predictions. For example, one of the use cases in the Allen Institute case is predicting from the images with machine learning where each of the cellular features is. And, apparently, they've gotten quite good at that. Then the next piece of metadata is simply the human-targeted metadata: we support a README in Markdown format. So if you attach a README file to the data package, that will automatically be displayed in the data catalog that goes along with Quilt.
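A metadata filter over a package like the one described might look roughly like the following. The node layout, the `_meta` attribute, and key names such as `tissue_type` are hypothetical stand-ins for whatever schema a package's builder actually recorded:

```python
# Hypothetical sketch of filtering a large package by per-node metadata.
# The `_meta` attribute and the key names are illustrative; the real
# package defines its own metadata schema.

def select_nodes(package_nodes, **criteria):
    """Yield (name, node) pairs whose metadata matches every given key/value pair."""
    for name, node in package_nodes.items():
        meta = getattr(node, "_meta", {}) or {}
        if all(meta.get(key) == value for key, value in criteria.items()):
            yield name, node

# e.g. pick out only images of a particular tissue and condition, then feed
# just those fragments into a training loop:
# training_set = dict(select_nodes(cells.images,
#                                  tissue_type="cardiomyocyte",
#                                  condition="wild_type"))
```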
[00:18:05] Unknown:
And in terms of these packages, particularly in the case that you were describing with the Allen Institute, where somebody might be creating a package that's a derivative of an original one, is it possible to specify a dependency chain or a reference chain, where somebody is able to publish the results of their research and have it refer back to the source data that was actually pulled into it, based on the versions that they used at the time, so that there's one canonical source of the data rather than having to copy it around a bunch of different places?
[00:18:37] Unknown:
Oh, absolutely. And we encourage that. You know, the typical place that you see that is right in the README, but it can also go in the package metadata, which would allow it to be, you know, more efficiently searched, perhaps. But the reason we don't have anything more formal is we don't yet know all the ways that people wanna express lineage. There are so many complicated dependencies. Typically, or at least commonly, a data package will consist of data from many different sources, and sources that are external to Quilt. So there are often links, even if it's just a web page, or the page of a lab that produced it, or a government report with census results, even information about, you know, things that are private packages.
For example, if some of these datasets came from queries of a transactional database, they might have the information about the actual query in there so that the lineage can be traced and the source can be traced. It's also possible, and not uncommon, to see references to git hashes, which will specify a version of code, perhaps the ETL code that produced some of the derived data. Yeah. That's really cool.
[00:19:49] Unknown:
It was definitely an interesting project at face value when I first came across the work you've been doing, but the more we talk about it and the more use cases you describe, it definitely starts to sound even more compelling. And the registry piece in particular is interesting, particularly since there isn't just one canonical location. It's something that many people can run depending on their particular needs, and they're able to possibly pull data from multiple registries to build various packages. So can you describe a bit more about how the registry itself is implemented and the architecture that you've built for it?
[00:20:27] Unknown:
Yeah. Happy to. So, yeah, we've designed the registry to be as simple and lightweight as possible. It is really a metadata-only service, and its function is to provide versioning and permissions for the large data collection and then let the raw storage engine, you know, usually in the cloud, something like S3, do as much of the heavy lifting as possible. So for the registry service, there's a catalog, which is JavaScript and React, in our open source community edition usually hosted on NGINX. We have a Docker container for that, though in production I think there's a CDN involved. And the registry itself is a Flask service that runs the registry API. It sits on top of a Postgres database, and all of that, again, is deployed with Docker. So for anyone out there that wants to run their own registry, we have Docker images ready to go, and you can get at least a testable development Quilt registry up and running in about 5 minutes with Docker Compose. And then, as I said before, the goal is to use the storage system for the actual transfer of the data. So the client, which right now is just the one implementation that we support in Python, but others are coming down the road, the client will talk to the registry and get back a package manifest or even, you know, a subset of a whole package manifest. We call it a subpackage.
And it'll contain a list of permissioned URLs. And then from those URLs, it can directly talk to the storage.
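Put differently, the client makes one small metadata call to the registry and then does the heavy transfers directly against the storage URLs it gets back. A rough sketch of that split follows, with a made-up registry endpoint and response shape; the real registry API will differ:

```python
# Rough sketch of the client/registry split described above. The endpoint
# path and JSON fields are invented for illustration only.
import requests

REGISTRY_URL = "https://registry.example.com"   # hypothetical registry

def install_package(owner, name, token):
    # 1. Small metadata request: the registry checks permissions/versions and
    #    returns a manifest containing permissioned (e.g. presigned S3) URLs.
    resp = requests.get(
        f"{REGISTRY_URL}/api/package/{owner}/{name}",
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    manifest = resp.json()

    # 2. The heavy lifting goes straight to object storage, not the registry.
    for fragment in manifest["fragments"]:
        data = requests.get(fragment["url"]).content     # direct transfer from storage
        save_to_local_cache(fragment["hash"], data)      # dedup by content hash

def save_to_local_cache(frag_hash, data):
    ...  # write under the hash, as in the caching sketch earlier
```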
[00:22:05] Unknown:
And so I imagine that helps a lot, both in terms of managing your hosting costs, but also for people who want some sort of optimization in terms of the way that the data is stored, either in terms of cost based on the volume of the data or optimizing for latency of updates and retrievals for where their main users are going to be. And I'm wondering what other optimizations you've made in terms of the way that either the registry or the client are implemented to accelerate the speed or effectiveness of synchronization from the source repository
[00:22:42] Unknown:
and the client that's interacting with it? Oh, great question. So, you know, as I mentioned at the very beginning, I'm a computer architect by training, so the two things I think about most often are latency and bandwidth. The first thing that we thought about with Quilt and the architecture itself is bandwidth. And the solution for high bandwidth was really to get out of the way. You know, these cloud providers are providing really incredible scale in their storage and also incredible bandwidth in and out of it. So the first step is to just make sure, to the degree that we can, that we don't screw up anything in the way that we're laying out the data on the storage resources, and then not have the registry be a bottleneck anywhere that would reduce the maximum bandwidth available. Now, if you want higher performance than that, oftentimes it's a question of latency. And that's where content addressing, being able to address the data fragments by their content, makes caching much easier, because there's not much of a coherence problem if everything is addressed by its contents. Either it's present or it's not present, and you would just need a chain of where to look for it if you don't find it. So it's very easy to set up caching for Quilt packages. And we've seen that in use at the Allen Institute. It's in use in a couple of university settings where they still have big network drives that are actually, you know, much closer to the compute resources that they're using than S3. So it's easier to kind of install, if you will, a lot of the commonly useful packages into this cache.
And then, that saves them, of course, upstream bandwidth, but it also cuts down the latency tremendously.
[00:24:19] Unknown:
Going back a bit more to the workflow for somebody who is using these Quilt packages, when you get into a team setting where you have data engineers and data scientists, what do you see as being the typical interaction as far as how the data engineers might create and curate these packages and how the data scientists might use and version the data?
[00:24:43] Unknown:
Alright. Really glad you asked that question. So when we've gone into larger data shops, you know, companies that have really forward-looking and really sophisticated data engineering and data science teams and have built, you know, really scalable infrastructure in the cloud, we hear a lot of common problems out there. And one of the first ones we hear is: man, we just have all these Hive tables, and we don't really know what they all are or who maintains them or which ones we need to keep. So let me just sort of imagine Quilt in a scenario like that, or at least tell you about some of the nascent projects that we have ongoing in a couple of those places, to oversimplify and generalize what we've heard at several companies now. Data are streaming in all the time, and, you know, maybe it's coming in through Kafka, maybe there are just Airflow pipelines; a common pattern is that the ETL or the streaming engines are putting together table partitions in Parquet and adding those to tables in Hive. So every night or every day or, you know, every hour, every minute, whatever the granularity, new partitions are being created, tables are updated, and this dataset is just constantly growing. What we imagine Quilt could do there is be kind of a final step in that ETL. Data is still streaming in as it was, all the ETL processes that happened are still happening, but those fragments are being registered with Quilt in a way that does a couple of things. First of all, it's gonna add reproducibility and versioning on top of that. So instead of Hive, which is really designed for everybody seeing the latest and greatest, you know, whatever the state of the world is right now, we can have essentially snapshots of tables that will say, okay, this was the state of these tables when we did this query, when we trained this model that we're now deploying. So if we have some event that happens in production, we could go all the way from existing model weights and trace that all the way back to the training data and the snapshots that fed it.
It also can handle permissioning. So, particularly in companies that deal with more sensitive data, a common ask that we hear is that different groups, different data science teams, should be allowed to see different subsets of the data. And this is particularly interesting to me in the biotech and pharma industries, where they have research studies and very fine-grained requirements about who can see which pieces of data. And even more interesting, perhaps, data can be pulled back, and they need to have an auditable trail to say, you know, who has had access to this? How can we be sure that this data is no longer being used, because someone has opted out of this research study or, you know, this patient has withdrawn their permission to be part of this cohort, for example? So anyway, going back to the big picture, imagining just a general scenario, though every company, every organization would be different: let's say you have Airflow running, and the last step of that Airflow pipeline is building a new Quilt package every time it's run. And these packages are kind of the logical view, a set of data on which decisions are driven. So maybe you're used to making decisions in the data science team with, you know, the last 30 days of a time series; that could be snapshotted in a package. One of the things that's also exciting about Quilt is that it's very easy to combine tabular data and non-tabular data like images. So for insurance, for example, you might have all of the claim scene photos and all of the tabular data that goes with the history of the parties involved, or the aggregated statistics of the area or the industry. And again, each of those can be updated as usual in the ETL pipeline. So the data engineers are the keepers of the recipes, if you will, these build specifications, and they hold the keys to which data scientists have access to which datasets, which views of the data, which of these data packages. And they are also now, with Quilt, given an audit trail of which data scientists have pulled not only which packages, but exactly which versions of which packages.
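As a concrete sketch of that "last step of the pipeline" idea, an Airflow DAG might end with a task that builds and pushes a package snapshot. The package handle, the build spec, and the quilt.build/quilt.push calls below are assumptions drawn from the workflow described here, not a definitive recipe:

```python
# Illustrative Airflow (1.x-style) task: the final ETL step snapshots the
# latest partitions as a versioned Quilt package. The handle, build spec,
# and quilt client calls are sketched from the description above.
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def publish_snapshot(**context):
    import quilt
    handle = "analytics/claims_last_30_days"   # hypothetical package name
    quilt.build(handle, "build.yml")           # build the package from a build spec
    quilt.push(handle, is_public=False)        # register the snapshot with the team registry

dag = DAG("claims_etl", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

publish = PythonOperator(
    task_id="publish_quilt_snapshot",
    python_callable=publish_snapshot,
    provide_context=True,
    dag=dag,
)
# upstream_etl_tasks >> publish   # runs after the usual partition-building steps
```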
[00:28:53] Unknown:
Yeah. That's definitely a very compelling use case. And with being able to create that audit trail, is that some custom tooling that they might build to automatically register additional metadata anytime a client interacts with the registry, or is that something that's already, by default, part of the current Quilt tooling?
[00:29:14] Unknown:
The audit workflow is part of a product we have, our more enterprise-tier product called Quilt Teams, and that just gives an admin view. I'm going out on a limb; I believe it's actually available in the open source project as well if you run your own Quilt registry. I haven't gotten any emails that any of our open source partners are using it yet, though if anyone out there wants to, I'd be happy to respond to that email. But, no, that's been a feature that's been asked for more in the corporate, enterprise setting, where some of our early enterprise partners have really wanted a couple of key features. And the most important was the administrative ability to revoke access. And then the second one is that audit trail of saying, okay, where did a particular data package go? Who saw it? And it can even be more sophisticated. It can even go down to: if one particular file, or even, you know, one particular fragment of a table, contains an offending or redacted datum, that fragment can be traced to all of the packages that contained it, and all of the versions of all the packages that contained it, and all of the users who installed it. And that's a good
[00:30:23] Unknown:
entree into discussing the business model for the company that you've built up around this project and the goals that you're trying to achieve through the course of that business.
[00:30:34] Unknown:
Sure. So as in all things startup, you know, we're iterating constantly. But in terms of what we imagine the big business to be, we're certainly on a trajectory right now to be an enterprise company. And we're working with some very large enterprises who have very complicated data management and data governance problems. So our eventual business model is to have an on-prem, fully supported enterprise offering. What we're concentrating on right now, however, is community and developer adoption. So we really wanna get the Quilt data package established as a standard first. I mean, the first thing we wanna do is get the community of data engineers and data scientists believing what we believe, which is that datasets should be versioned. We're using data and deploying data right now without even necessarily thinking about it in the way that we think about versioning and deploying code. So the first goal is to be one of the many voices that are hopefully changing the industry and the communities for the better and saying, okay, we need this extra capability. We need to be managing our data with the same level of care that we manage our code, with the same level of reproducibility and auditability. And then the second piece is we'd like to have Quilt be an integral part of the stack that people are using to do that versioning and deployment. So, to go back to Docker simply as an analogy, you know, Docker is an essential part of so many companies' software development workflow. Think about all of the projects that are built and deployed in Docker, and with the rise of Kubernetes, I think that's even getting more and more popular; we're doing more things inside Docker containers. You know, we'd love Quilt to be an open standard for data snapshotting and data versioning, and really for data deployment, the way you can deploy your software in Docker environments. We feel that, you know, Quilt is really a great way to deploy the data asset, to get that data into the code that needs to operate on it. One of the other questions that I have
[00:32:29] Unknown:
about the project as a whole and the business and how it relates to it is about some of the biggest challenges that you either have faced or are facing, and some of the limitations or edge cases that you've come up against that you're currently working through. Oh, man.
[00:32:46] Unknown:
Where do I start? Yeah, I could talk the full hour just on challenges. It's very exciting, and it's full of lots of interesting problems. The biggest challenge for adoption is getting started. And really, I think this may be true for any data project or any data tools: getting data in. So either writing the connectors or writing the documentation, writing the tutorials, teaching users how to use it, and just making it incredibly easy to get data in. And this is part of what you see in our strategy for having automated builds from simple data sources like CSV files and images, but we need to go further. You know, we need to have automatic connections to, say, important pieces of data infrastructure, like the Hive metastore in particular. That's probably, you know, maybe the single most burning needed connector. But we also need to work with relational databases and find a better way of snapshotting database tables from relational databases like Postgres and MySQL without losing our deduplication. So getting data in is definitely a big one. The other one: we're still iterating on what exactly the right abstractions are in terms of data packaging.
It's really working for certain use cases where package builders find, you know, small is the wrong word, it doesn't have to be small in terms of number of bytes, but logical subsets of objects that can be, you know, installed as a package. But there are other use cases where users like the general approach of Quilt, they like how easy it is to get data into code, but they kinda just want all their data to be accessible as the mental equivalent of one giant package. And so, from an architecture standpoint, the implementation and the design of the system would need to be a little bit different there. We had approached this with the thought that users would want to have millions of smaller packages, and we've gotten pressure from at least a couple of early users to say, well, I'd just rather have everything under one, you know, package namespace, and I wanna do the same kind of filtering on everything.
Yeah, can you filter across all my packages? Or can I just, you know, put everything in one giant package and filter from there? And I'm sure it's possible to build that system. It's just hard to know how to be the best of both worlds in that case, or which one is gonna ultimately be more useful and catch on faster.
[00:35:23] Unknown:
And do you have any specific plans for the near to medium term, in terms of either new features or new capabilities that you're planning on adding to Quilt, or new services that you'd like to build around it?
[00:35:37] Unknown:
Yeah. Let me mention a few of those. So the roadmap is really exciting, and I would love to talk to people from this audience about contributing to the project. It's open source, and we're glad to take contributors in all of these areas and more. We are working on a few things. So one of the things I mentioned before was an interaction with the Hive metastore. First and foremost, a first baby step in that endeavor is to have a better concept of a table inside a Quilt package. Right now, you know, we started with data frames, which, if you will, would sort of be the logical equivalent of a partition of a table. This is something that almost by definition, or perhaps actually by definition, fits inside your memory in a pandas data frame. But, of course, the tables that we work with in practice are much, much larger than this. So what we've done as a, you know, an MVP, if you will, of tables is that you can group together like-schemed data frames backed by Parquet fragments, and you can use a tool like PySpark to generate a view of a much-larger-than-memory distributed table. So that works today in PySpark: inside a data package, you can have a large table that's distributed into a set of many nodes. Those individual nodes are then materializable as pandas data frames. But in Spark, for example in PySpark, you can go up the tree one level, and by materializing the parent, you would have a Spark data frame that represents the distributed set of all the children.
What we really need to do is present a proper table abstraction that maintains the higher-level metadata and perhaps even adds some things that, for example, such a system uses that are not natively in the Parquet format. So we've been looking at a project from Netflix called Iceberg and speaking to the primary authors of Iceberg, and perhaps we can use that as a table snapshot format inside Quilt packages. So that's a really exciting potential development. We're also working on a plugin where we could very easily take a snapshot of a Hive table and then create one of these table snapshots as part of a package. So you'd be able to build a package out of queries from Spark SQL, or you could build a package from just actually specifying Hive tables and including those as nodes in the package. So that's one, and that sits on the bigger data side. On the smaller data side, we're working on a plugin for Excel that actually lets you load data from Quilt packages directly into an Excel sheet. We're also working on a more scalable format for the package representation. You know, what if people wanna put millions, or better yet tens or hundreds of millions, of objects into the same Quilt package? Then, you know, one giant JSON file isn't really the answer. So we're kicking around some more scalable representations of the tree structure.
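The Spark side of that table MVP amounts to pointing Spark at the package's like-schemed Parquet fragments and treating them as one distributed table, roughly as below. The paths and the column name are hypothetical; in practice they would come from the package manifest and local cache rather than being hard-coded:

```python
# Rough PySpark sketch of reading a group of like-schemed Parquet fragments
# as one larger-than-memory table. Paths and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quilt-table-sketch").getOrCreate()

# Each child node is a Parquet fragment small enough to materialize in pandas;
# reading them together yields one distributed Spark DataFrame (the "parent").
fragment_paths = [
    "quilt_cache/objs/fragment_0001",
    "quilt_cache/objs/fragment_0002",
]
table = spark.read.parquet(*fragment_paths)

table.groupBy("tissue_type").count().show()   # operate on the whole distributed table
```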
And we're also doing simple management things; for example, we'd really like to make it easier for the community to run their own registry. So we wanna port the registry deployment from plain Docker and Docker Compose to Kubernetes. It'd be great to make a Helm chart for Quilt so that you could really get a production version of Quilt up and running in your infrastructure
[00:38:53] Unknown:
in a short amount of time. And are there any other topics or areas of discussion that we didn't cover yet that you think we should touch on before we start to close out the show? No, I think you asked some great questions.
[00:39:06] Unknown:
And,
[00:39:06] Unknown:
no, I think we covered a lot. Yeah, there are definitely a lot of subareas that we could cover in much greater detail and spend another two or three podcast lengths' worth of time on. But for anybody who wants to get deeper into those weeds, I'll have you add your preferred contact information to the show notes. And as a final parting question, from your perspective, what do you see as being the biggest gap in the tooling or technology that's available for data management today? Well, I think, you know, having visited a lot of really amazing,
[00:39:40] Unknown:
data engineering teams, I'm really glad that some of the companies, particularly here in San Francisco, but even across the country, have been so open with us and taken us in a lot. It's absolutely amazing what they've done. So working with data at scale, I don't wanna call it a completely solved problem, but they seem to have a handle on it. You know, the tools are in place. Workflows are in place to work with data at tremendous scale, harnessing enough compute power. The algorithms they're using for machine learning, I'm sure that work is not done, I'm sure it's gonna change very quickly, but it doesn't seem to be the problem; they're not limited by that. The limitations seem to be more in the area of organization and collaboration, and particularly focused on the metadata. So, as I mentioned earlier: what are all these Hive tables, and which ones do we need and which ones do we not? And did we get the right version, or did we, you know, run the query against the right tables, and things like that. So that kind of organizational understanding, and being able to share the knowledge of a group in an effective way, I think, are the biggest challenges. So in particular, you know, this is fine-grained permissions. It's fine-grained versioning, being able to construct that lineage graph. You know: the raw source of the information that's behind this visualization was here, it went through the following transformation steps, it was augmented in this way by adding these features, and that was done with this particular code. You know, understanding all of that so that companies have
[00:41:11] Unknown:
a way of debugging problems that arise, but also a toolkit for starting the next project four or five steps ahead. Alright. Well, thank you very much for taking the time out of your day to join me and discuss the work that you're doing with Quilt. It's definitely a very interesting project and one that I'm likely to find some use cases for in the near term. So I appreciate that, and I hope you enjoy the rest of your day. Likewise. Thanks for having me.
Introduction to Kevin Moore and Quilt Data
Kevin's Journey into Data Management
The Origin Story of Quilt Data
Initial Vision and Pivot of Quilt Data
Current Workflow and Use Cases of Quilt Data
Comparison with Other Platforms
Structure and Functionality of Data Packages
Integration with Other Ecosystems
Metadata and Provenance in Quilt Packages
Registry Implementation and Architecture
Optimizations for Speed and Efficiency
Team Workflow and Data Management
Business Model and Goals
Challenges and Limitations
Future Plans and Roadmap
Closing Thoughts and Biggest Gaps in Data Management