Summary
Some problems in data are well defined and benefit from a ready-made set of tools. For everything else, there’s Pachyderm, the platform for data science that is built to scale. In this episode Joe Doliner, CEO and co-founder, explains how Pachyderm started as an attempt to make data provenance easier to track, how the platform is architected and used today, and how the underlying principles manifest in the workflows of data engineers and data scientists as they collaborate on data projects. He also shares his thoughts on their recent round of fundraising and where the future will take the company. If you are looking for a set of tools for building your data science workflows then Pachyderm is a solid choice, featuring data versioning, first class tracking of data lineage, and language agnostic data pipelines.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
- Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Joe Doliner about Pachyderm, a platform that lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Pachyderm is and how it got started?
- What is new in the last two years since I talked to Dan Whitenack in episode 1?
- How have the changes and additional features in Kubernetes impacted your work on Pachyderm?
- A recent development in the Kubernetes space is the Kubeflow project. How do its capabilities compare with or complement what you are doing in Pachyderm?
- Can you walk through the overall workflow for someone building an analysis pipeline in Pachyderm?
- How does that break down across different roles and responsibilities (e.g. data scientist vs data engineer)?
- There are a lot of concepts and moving parts in Pachyderm, from getting a Kubernetes cluster set up, to understanding the file system and processing pipeline, to understanding best practices. What are some of the common challenges or points of confusion that new users encounter?
- Data provenance is critical for understanding the end results of an analysis or ML model. Can you explain how the tracking in Pachyderm is implemented?
- What is the interface for exposing and exploring that provenance data?
- What are some of the advanced capabilities of Pachyderm that you would like to call out?
- With your recent round of fundraising I’m assuming there is new pressure to grow and scale your product and business. How are you approaching that and what are some of the challenges you are facing?
- What have been some of the most challenging/useful/unexpected lessons that you have learned in the process of building, maintaining, and growing the Pachyderm project and company?
- What do you have planned for the future of Pachyderm?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Pachyderm
- RethinkDB
- AirBnB
- Data Provenance
- Kubeflow
- Stateful Sets
- etcd
- Airflow
- Kafka
- GitHub
- GitLab
- Docker
- Kubernetes
- CI == Continuous Integration
- CD == Continuous Delivery
- Ceph
- Object Storage
- MiniKube
- FUSE == Filesystem in Userspace
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features, Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events.
You only need to maintain one integration to instrument your code and get a future proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers' time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1,000,000 in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that, you'll get access to the Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.
For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And please help other people find the show by leaving a review on iTunes and telling your friends and coworkers.
Your host is Tobias Macey. And today, I'm interviewing Joe Doliner about Pachyderm, a platform that lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance.
[00:02:37] Unknown:
So, Joe, could you start by introducing yourself? Yeah. Sure. My name is Joe Doliner. It's great to be here talking to you today, Tobias. I am the founder and CEO of Pachyderm, and I started life as a software engineer. The first company I ever worked at was RethinkDB, and that's basically the only other company I've worked at, besides a little while at Airbnb in between.
[00:02:57] Unknown:
And so do you remember how you first got involved in the area of data management?
[00:03:02] Unknown:
Yeah. Absolutely. I mean, I have always been interested in data infrastructure tools. So RethinkDB was an open source database, and I knew coming out of college that I wanted to work on these types of data management, data analysis, data manipulation tools. So I joined that company right out of college and got to cut my teeth doing, like, open source software development and data infrastructure and things like that. Absolutely fell in love with it. And then after I left RethinkDB, I got really interested in big data. You know, RethinkDB is more of a, like, transactional database. You use it as, like, the back end of your website. And so I wanted to learn what the world of, like, data science and data analysis and everything looked like. And so I started sort of hacking on Pachyderm in my spare time. It actually started because I wanted to use the Hadoop platform to analyze some chess games. I'm a big chess fan.
And the system was just really, really kludgy, and it was all based on Java, which I didn't like that much. So I sort of started hacking on what an alternative to this might be. And along the way, I spent some time working at Airbnb, and so I got a chance to see, you know, what their Hadoop infrastructure looked like and what the challenges were there. And so doing this concurrently with hacking on my own stuff, it sort of eventually turned into the platform that became Pachyderm. And, you know, then we managed to get funding as a company, and the company sort of took off from there. And so I actually had Dan Whitenack on to talk about
[00:04:33] Unknown:
Pachyderm way back in episode 1, about two years ago. But I'm wondering if you can talk a bit about what has happened in those two years, both in terms of the platform itself and the company and just the overall environment of big data and data analytics that you're fitting your platform into?
[00:04:51] Unknown:
Yeah. Absolutely. The core mission of the company hasn't really changed much. You know, when I was working at Airbnb, I saw a lot of gaps in the data infrastructure that existed in that day and age. The biggest one I saw was sort of an absence of the ability to track any sort of provenance or lineage of the data. And the way that this really came up for us at Airbnb was we, you know, had this massive pipeline of data analysis tasks that had been written by a bunch of different data scientists. And it was really, really challenging to keep all of these green at the same time because, you know, everybody's modifying them, and they're all sort of working independently. And someone makes a change that's incompatible with the ones downstream, and then the whole thing just cascades red all the way down. And so we would have important tasks like our fraud models that would just sort of start coming out blank when something went wrong. And when that happened, I'd be going in to debug it and sort of trying to figure out, like, alright, where along the way did this break? And I didn't have any way to ask the system, like, give me the full lineage of this data, you know, because it looks wrong or something like that.
And so that hasn't really changed. What has changed is sort of the rest of the platform maturing around us. So when you first talked to Dan, we were probably about six months into using Kubernetes, and that was because Kubernetes had hit 1.0; it had been released probably five months before. And so we were sort of trying to figure out what we could do on this platform and what sort of stuff it could provide. And now that's a lot more clear. And there have been a lot of features that have been figured out in Kubernetes that we've been able to just sort of, like, pass along to our users. We've also figured out a lot about how to integrate with various machine learning packages that exist. So Kubeflow didn't exist at the time when you talked to Dan or when Kubernetes first came out. But it does now, and it gives you a very, very good way to sort of deploy a machine learning pipeline on Kubernetes, which by extension gives you a good way to deploy these machine learning tasks in Pachyderm. And it's actually very complementary with Pachyderm, because Pachyderm basically takes the data right up to the point where it gets into the machine learning model. So, you know, any type of sophisticated machine learning model is gonna have a lot of steps in it that are cleaning the data, getting it into the right format, joining it with the right data you need to train stuff. And then the actual, you know, training process happens inside of Kubeflow.
[00:07:20] Unknown:
And then that comes back out into Pachyderm, and then we start doing the inference steps and the, you know, checking how good this machine learning model is and stuff like that. And that all happens within Pachyderm as well. And I know that there are a number of other features of Kubernetes itself that have arrived in those past two years, including things like StatefulSets. So I'm wondering what are some of the other primitives of the platform that have come along that have simplified or obviated certain parts of the Pachyderm code base itself?
[00:07:48] Unknown:
StatefulSets are definitely one of them, because, you know, we're not a stateless service. Pachyderm, in fact, is all about storing state, because it's, like, for storing large amounts of data. The majority of the data is actually stored in object storage, and so that worked before StatefulSets. But we also rely on etcd as the sort of metadata and consensus system for Pachyderm. And so having a StatefulSet set up to manage etcd is really nice and manages things really well. Another thing that's been really, really big for our customers in particular is GPU support. So, you know, you can now in Kubernetes, and this has been true for a while, submit a resource request to say that this pod needs this much memory, it needs this much CPU, and you can also have it ask for GPUs.
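To make that resource-request idea concrete, here is a minimal sketch, written in Python and emitting a JSON manifest, of a pod that asks Kubernetes for CPU, memory, and a GPU. It is an illustration rather than an official Pachyderm example: the pod name, container image, and the exact GPU resource key are assumptions that depend on the cluster and the device plugin installed.

```python
import json

# Minimal pod manifest asking the Kubernetes scheduler for CPU, memory,
# and one GPU. Image name and GPU resource key are illustrative.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-training-example"},
    "spec": {
        "containers": [
            {
                "name": "train",
                "image": "tensorflow/tensorflow:latest-gpu",  # hypothetical image choice
                "resources": {
                    "requests": {"cpu": "2", "memory": "4Gi"},
                    # Asking for a GPU tells the scheduler this pod must land
                    # on a node that has one available.
                    "limits": {"nvidia.com/gpu": 1},
                },
            }
        ]
    },
}

# Kubernetes accepts JSON manifests as well as YAML, so this file could be
# applied with something like `kubectl apply -f gpu-pod.json`.
with open("gpu-pod.json", "w") as f:
    json.dump(pod_manifest, f, indent=2)
```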
And what that'll do is it'll tell the scheduler that this needs to be scheduled on a machine that has a GPU, and it needs to be given a GPU and have that available to it during processing. And so this is really nice when you wanna run these high powered machine learning tasks that train a lot faster on a GPU. Other things in Kubernetes that we've used: we've been relying a lot on the ingress features for some of the cloud stuff that we're building now. We're in the process of rolling out our cloud offering for Pachyderm. And the fact that Kubernetes can do, you know, a lot of sophisticated ingress things with load balancers, and you can build authentication right into those, has been really, really useful for us. And so
[00:09:17] Unknown:
as you mentioned, Kubeflow has come along, and that's, as you said, complementary to the capabilities of Pachyderm. But I'm wondering if you can just briefly talk about what are the sort of main pieces of Pachyderm itself. I know that there's the Pachyderm file system for supporting versioning. There's the Pachyderm pipeline system. And I'm wondering if you can talk a bit more about any sort of additional complementary aspects of the overall big data ecosystem, things like Airflow or Kafka or, you know, various other sort of big data pieces that fit together nicely with Pachyderm, or that Pachyderm
[00:09:57] Unknown:
sort of supplants in terms of the overall workflow of somebody who's building an analytics pipeline on the Pachyderm platform? Yeah. Absolutely. So at a very high level, the two pieces that you just touched on, the Pachyderm file system and the Pachyderm pipeline system, are basically all of Pachyderm. Everything that we have, we think of as being in one of those two camps. The file system, like you mentioned, is responsible for providing version control for your big data. So, for those of you who haven't heard the first episode where Dan talked about it, its semantics are very similar to Git. You've got, you know, commits. You've got repos. You've got branches. But it can store massive amounts of data, and it's storing it in cloud storage. So it's storing it in, like, S3 or GCS or something like that. The Pachyderm file system is also the thing that's responsible for enforcing the provenance constraints. So it's got basically this constraint solver built into it where you say, you know, here's a repo that contains images, and here's a repo that contains tags on those images. And then here's a branch that is associated with those two branches, meaning that it contains computations that have been done using those images and those tags on those images. And the Pachyderm pipeline system uses this API to then implement a machine learning pipeline that takes these tags and these images and trains a classifier based on those. But in theory, something else can use that, and other people do implement their own things on top of this and basically use the provenance system without using our containerized execution system. And, you know, you can insert SQL queries in there. You can insert all sorts of things in there. The pipeline system is what's responsible for the scheduling of these tasks. And so it uses Kubernetes to say, I want this branch to be materialized that contains machine learning models trained on the images that come in here and the tags that come in here. And the pipeline system knows, okay, when a new commit comes in, I need to spin up these pods. I need them to have GPUs. I need them to, you know, have these containers in there so that they have TensorFlow.
And then it runs all the code, and it slurps up all the data and makes sure, you know, the data gets into the pod and then the data gets out of the pod. And ultimately, you get your results. And because of the provenance system, your results will always be linked to the inputs that created them. So there's no way to short circuit this. It's not like a system where, when you check in your results, you also need to check in a manifest of where it came from. It's basically, you know, hard enforced by the system. To give you an idea of some of the sort of new things and how this plays into other data systems, we recently released this feature called spouts, and these are sort of like a pipeline in that they schedule a pod on Kubernetes.
What's different about them is that rather than pipelines, which normally take inputs, process that data, and produce outputs, these just stay up all the time and they produce outputs. So it's like a spout of data coming into your system. And so this is really, really useful for subscribing to a Kafka topic, for example. And this sort of allows you to have, like, a very convenient shim between Pachyderm and any other system that you can subscribe to, because it's a container. You can put whatever code, whatever libraries you want in there. So, you know, you can very easily have something that subscribes to a feed on Twitter and gets new tweets coming in, and those will just show up in your Pachyderm file system. And then downstream of that, you can have all sorts of sophisticated pipelines and stuff,
[00:13:26] Unknown:
that are processing those tweets, that are training models on those tweets, stuff like that. Yeah. And I think that that's definitely one great differentiating factor between the Hadoop platform that you're working to sort of replace, where it's entirely batch oriented, and there are sort of streaming capabilities that have been bolted onto it. But having it built into Pachyderm as a first class feature, I think, is definitely useful given that there is, particularly in the past couple of years, a lot more of a push to doing real time and streaming analytics.
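As an illustration of the spout idea described above, here is a rough sketch of what such a pipeline spec might look like, built as a Python dict and dumped to JSON. The field names follow Pachyderm's pipeline-spec conventions as best as can be recalled and may differ between versions, and the container image, command, and Kafka settings are hypothetical.

```python
import json

# A spout-style pipeline: no input repo, it just stays up and continuously
# writes new data into its output repo as commits.
spout_spec = {
    "pipeline": {"name": "kafka-spout"},
    "transform": {
        # A container you build yourself: it subscribes to a Kafka topic and
        # writes whatever it receives to /pfs/out, which Pachyderm turns into
        # commits in the pipeline's output repo.
        "image": "example/kafka-subscriber:latest",
        "cmd": ["python3", "/app/subscribe.py"],
        "env": {"KAFKA_BROKER": "kafka:9092", "KAFKA_TOPIC": "tweets"},
    },
    # Marking the pipeline as a spout (assumed field name).
    "spout": {},
}

print(json.dumps(spout_spec, indent=2))
# The resulting JSON would then be handed to Pachyderm with something like
# `pachctl create pipeline -f kafka-spout.json`.
```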
[00:13:59] Unknown:
Yeah. Absolutely. And this was one of the sort of earliest features that we conceived of, because we have very sophisticated streaming capabilities, but it's not like when you're making a Pachyderm pipeline you choose, like, okay, this is gonna be a batch pipeline, and this is gonna be a streaming pipeline. There's really no difference between the two. And the reason that that is, and the reason that we can do that, is the underlying version control system. So because we can always say, alright, this data has this hash. It's part of this commit. It hasn't changed since the last commit. We processed it then. We got a successful result. Here's the result, again, identified by a hash. So we know that it corresponds to the same code, the same data, and everything. We just get to reuse that result. And so, really, the reason that it's a streaming system is that we've got this pretty sophisticated computation deduplication system in the background that, whenever it goes to compute something, tries to figure out if it's already computed it. And if it has, it just uses that result.
And so this is often a bit of a magic moment for people when they first start using Pachyderm: they put in a bunch of data. They churn through it. It takes a little while because it's an expensive computation. And then they add a little bit more data, and it happens super quickly. And we actually get people coming into our user channel asking, like, why did this happen so quickly? You know, I think something's broken. It didn't process it. Like, no. The system just figured out that it didn't need to reprocess all of that data, and so you got a result really quickly because there actually wasn't much to do. Yeah. And when I was reading through the documentation, I was definitely impressed by the deduplication
[00:15:34] Unknown:
and data hashing capabilities that you have in the file system and how that supports the incrementality of computation so that, as you said, you don't have to do a complete rebuild of an entire batch job. You can just work on the data that's new since the last time you ran something.
[00:15:49] Unknown:
That was one of the things that I was most excited about having before I even really started working on Pachyderm, because I spent so much time waiting for things to recompute. You know? And I think probably anybody who's tried to do a decent sized data project has experienced this, where, like, you write out all of your code, you run it on all of your data, and then you find that there are these, like, one or two files that have, like, some slightly different format that crashed the whole thing. And so then you fix your code and try to get it to run, and you can't get it to run on just the stuff that it failed on. And so you have to sit there and wait for two hours to see if it works on these, like, two files. And then if it doesn't, you have to do that again. And this was even worse at Airbnb, because we had, like, so many things depending on each other and we had so much data there that basically the granularity that we had was running stuff once a day, because the pipelines would run every single night. And so if things were broken, then we basically write some new code, we commit it, and then we come in the next morning and hope that it worked. And if it didn't, then we do the same thing the next night. Yeah. That's definitely a quick way to
[00:17:06] Unknown:
build a lot of frustration and burnout on a data team. Mhmm. Yeah. You know, and that's really the
[00:17:13] Unknown:
biggest reason that I wanted to do this company and this open source project: I just felt like data teams in general were in a state where they could be a lot more productive if the tools looked a lot better. It reminded me a lot, and still does to a certain extent, of what making websites looked like before the LAMP stack, when, you know, people had all these CGI scripts. There were all these things that you could sort of cobble together, but there wasn't just this, like, well known, good platform that you could just get out of the box and build a website with in, like, a weekend in your garage or something like that. And then once that platform existed and people started to congeal around it and the tooling started to explode, you got this, like, explosion of websites, and people were able to make all of this cool stuff. And I feel like that still hasn't quite happened yet for data science and data engineering, but we're getting a lot closer to it. And that's definitely something that I'd like to talk through in the context of Pachyderm is how
[00:18:12] Unknown:
the sort of collaboration between data scientists and data engineers and the sort of breakdown of responsibilities and workflow happens within data teams, both when it's just one data scientist doing everything or when you're working at a medium to large organization where you actually have that separation of roles, and just the overall process of going from conception to delivery of a data project?
[00:18:37] Unknown:
Yeah. Absolutely. So one of the first and most important things to say about this, because it's often sort of a misconception that people have that throws them off a lot at the beginning, is that Pachyderm is not trying to be a replacement for Git or GitHub or any of these other, you know, code version control and collaboration tools. We're version controlling different things. And so when people are successfully collaborating on Pachyderm, normally what this looks like is you have your code in GitHub or GitLab, you know, somewhere stored in version control, and you have a repo. I like to have it all in one repo, but you can have it across multiple repos. You have a repo that has your analysis code that can be compiled into Docker containers and then also has your pipeline manifests that explain how to deploy this onto a Pachyderm cluster.
And then from there, you set up a CI pipeline that basically redeploys these pipelines when commits come in, so that you can basically, like, merge into master and you can have a CI/CD process on top of this. And then from there, where you start to leverage the Pachyderm features is the fact that when you wanna have a branch that people are working on that's a sort of experimental thing, you can have your CI/CD process deploy that into separate branches and separate pipelines in Pachyderm that can still share all of the underlying data. So you don't need to make a copy of the data. It's still all version controlled and deduped. But you can have, you know, these two pipelines running concurrently, and you can see, you know, okay, this one's succeeding, whereas this one's failing, so we wanna move to the one that's succeeding. And this one is performing this much better based on, like, these metrics pipelines that we've put on at the end. And you can basically have a collaborative process around this because the tools enable it. It's a very open ended tool, you know, similar to Git. Like, people have a million different branching strategies on Git. People use monorepos.
People use, like, small micro repos for their projects. And Pachyderm isn't particularly more prescriptive than Git in that regard. So we see people using this in a bunch of different ways, but the core, like, underlying concept is that you can collaborate because the system is tracking your versions for you. And so you sort of always know which way is up, because you can always just ask the system, what's the history of this data? What's the lineage? Meaning, take me back to how this data was produced, versus history, which takes me back to what it looked like, you know, yesterday, a year ago, etcetera.
And you can do things like bisects. You know, you can say, like, this looks bad now. It looked good a week ago. Where in between did it change?
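For readers who want to see what asking the system for history and lineage can look like in practice, here is a small sketch that drives the pachctl CLI from Python. The repo names are made up, and the exact subcommand syntax has changed across Pachyderm versions, so treat the command strings as an illustration of the workflow rather than copy-and-paste commands.

```python
import subprocess

def pachctl(*args):
    """Run a pachctl command and return its stdout (assumes pachctl is installed)."""
    return subprocess.run(
        ["pachctl", *args], check=True, capture_output=True, text=True
    ).stdout

# Version-controlled data lives in repos; adding data creates a commit.
pachctl("create", "repo", "images")
pachctl("put", "file", "images@master:/cat.png", "-f", "cat.png")

# History: what has this repo looked like over time?
print(pachctl("list", "commit", "images"))

# Lineage: inspecting a commit in a downstream (pipeline output) repo shows
# its provenance, i.e. the input commits it was computed from.
print(pachctl("inspect", "commit", "models@master"))
```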
[00:21:12] Unknown:
And one of the challenges inherent in Pachyderm is just understanding some of the primitives of things like Docker and Kubernetes. And so I'm wondering what you have found to be some of the common challenges or points of confusion or stumbling blocks for people who are coming into this project and trying to get up and running with it. Because even just trying to define a Dockerfile can oftentimes be Mhmm. A nightmare in and of itself. Yeah. So that's definitely one of them, just understanding,
[00:21:44] Unknown:
you know, this idea that Docker is like a machine that you're sort of setting up every single time, but it's not really a VM. And sometimes, like, the details of your machine poke up into it because it's the same Linux kernel and everything like that. That's definitely one of the challenges. I think that's one that people normally get past, at least these days. That used to be a lot more of a challenge maybe three years ago, but I think that the sort of communal knowledge of that has really started to take hold. So, you know, most people at this point, if you're working at a decent sized company, even if you don't know Docker, somebody there does and will be happy to sit you down and explain, like, here's how you make a Dockerfile, here's how you build things. I don't think the same can really be said for Kubernetes yet. And in some ways, I think that makes sense, because Kubernetes is newer, and it's also a more specific tool and a more complicated tool.
So definitely the biggest stumbling block for people getting started with Pachyderm is just getting Kubernetes set up. And, you know, we help people with that all the time as much as we can, but we're actually not super Kubernetes experts either. You know, we understand how to use it and we understand how to deploy it in our system and stuff like that. But, you know, people who wanna deploy it on prem, people who wanna deploy it in sort of weird settings and stuff like that, we don't always know what to tell them about how to get Kubernetes to work. I think that those are the biggest two. The other one that is kind of interesting is getting, like, the underlying storage set up. So to run Pachyderm, you need access to an object store, and you need some sort of a persistent volume for etcd to run on. And on AWS or GCE or Azure, this is all pretty well known, and we have, you know, a deploy assistant that will basically just spit out a manifest that you can give to Kubernetes, and that will set up all of these things for you.
But the variety of object stores that people wanna run against seems to be growing, in our experience. And so there are all of these sort of slightly off the beaten path ones like Ceph and SwiftStack and ECS and things like that. And each one of those is a little bit of a new adventure to get the system set up on. And then it's also a bit of a new adventure for us because, while they all ostensibly support the same S3 API, there are little subtle differences in how they support that S3 API that occasionally trip our system up. And so we've been doing a decent amount of work recently on just, like, trying to cover all of these different subtle differences between them and get it to work on all of these object stores.
[00:24:32] Unknown:
And another thing that can often be challenging when working with cloud oriented workflows is trying to figure out what the local dev story looks like. So I'm curious what the general approach is or at least what your general approach is for trying to do local experimentation and iteration
[00:24:51] Unknown:
on some code, or maybe trying to pull in some subset of the data in the Pachyderm file system for getting things ready to go before you ship it off to production? Yeah. Absolutely. I mean, I'll say sort of upfront that this is one of the parts of Pachyderm that I'm least satisfied with how it is right now. There's, I think, a lot of work to be done on it. And I think that there's a decent amount of work in just Docker land in general to make this really good. The sort of anti pattern that you get into that really sucks is that your development loop looks like: write some code, build a Docker container, push that Docker container to Docker Hub, redeploy a pipeline that points to that container, which then pulls down the container and runs it. And then, you know, that's possibly taken, you know, 10 minutes or so, and then you get some results back on what you need to change. And it's like, oh, you know, this Python code doesn't run. Like, you're referencing a variable that doesn't exist. And then you try it again. And you can't just run it on your local machine because you don't have the data accessible to you. What I do when I'm developing pipelines on Pachyderm that works pretty well is I do everything entirely on the same Docker host. So I have Minikube running, and that's just running on Docker on my local machine. And then when I build my image, it just builds on my local Docker host. And then when I run it, the image is right there. So I don't need to push it anywhere. I don't need to pull it from anywhere, because it's right there. And that leads to a pretty quick development loop. The other thing that you can do that can be pretty effective is that Pachyderm supports a FUSE mount for your file system. So you can just do pachctl mount, and a directory will show up that has all of the data that's available within your distributed file system, within PFS.
And it's kind of cool, because you, like, ls this directory and you're like, oh, shoot, here's, like, a file that's terabytes in size. And, of course, this is only working because it's not actually on your file system. And then you can run your code against this FUSE mount, and you can run with actual data and see how things are gonna work. The challenge with this is that, one, it doesn't create the Kubernetes environment around it. So if you wanna have, like, a secret available to you in Kubernetes such that you can access some outside service, then you need to sort of mock that up. And sometimes the time spent mocking it is, like, not really canceling out the time that you're saving by not just pushing this into the Kubernetes cluster.
The other thing is that Pachyderm gives you this pretty nice way to describe how data gets split up, which is just using glob patterns. If you're familiar with ls-ing around on a command line, when you do, like, ls *, that star is a glob character. And this in Pachyderm is how you define that you can process all of these things in parallel, and it parallelizes it. But when you just mount data in, it's not respecting that in any way. So we have some work to do in terms of the local development story for Pachyderm, for sure. Right now, it's good enough that people can get things done.
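To show how the glob pattern mentioned above appears in practice, here is a sketch of a pipeline spec where the glob on the input repo controls how the data is split into datums that can be processed in parallel. The repo, image, and command are hypothetical, and some field names may vary slightly between Pachyderm versions.

```python
import json

pipeline_spec = {
    "pipeline": {"name": "resize-images"},
    "transform": {
        "image": "example/resize:latest",  # hypothetical image
        "cmd": ["python3", "/app/resize.py", "/pfs/images", "/pfs/out"],
    },
    "input": {
        "pfs": {
            "repo": "images",
            # "/*" treats each top-level file or directory in the repo as its
            # own datum, so datums can be spread across workers and only new
            # or changed datums need to be reprocessed.
            "glob": "/*",
        }
    },
    # Assumed field for asking for a fixed number of parallel workers.
    "parallelism_spec": {"constant": 4},
}

with open("resize-pipeline.json", "w") as f:
    json.dump(pipeline_spec, f, indent=2)
```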
And where things really get nice is when you have some code that you sort of want to be running in production and you wanna be able to rely on it just running every single night. And then Pachyderm is great at just, like,
[00:28:14] Unknown:
running every single night, keeping it going, and letting you know when there's an error. And going back to the idea of data provenance and data lineage, you've mentioned that some of the way that it's tracked is through these versioning capabilities of the file system. But I'm wondering if you can just dig deeper into the underlying way that it's represented, as far as tracking it both from source to delivery, and how that actually is exposed when you're trying to trace back from the end result all the way back to where the data came from and what's happened to it along the way? Yeah. Absolutely. So
[00:28:49] Unknown:
the level that we track provenance at is the commit level. And the sort of first problem that you have to solve if you wanna track provenance is you wanna store a reference to some data that you know isn't gonna change. Right? Because if I tell you, like, this machine learning model was created using all of the images in this image directory, and then I go and add ten new images to that image directory, well, then that doesn't tell you anything anymore. Right? Because you don't know what was actually used to create the model. You just know where that data happened to be stored at the time when it was used. So commits allow us to have this immutable snapshot of what data looked like at a certain point in time. From there, we link these commits together. So if you've got pipelines in Pachyderm, then the input to those pipelines is data commits, and the output from those pipelines is also data commits. And the relationship between these commits is the provenance relationship. And so any commit in Pachyderm basically has this metadata attached to it that is just all of the commits that it is provenant on. And you can, you know, inspect these commits using the command line, using our API, using the web interface, and it'll just show you a list of these commits. And then, of course, you can, like, track those commits up and look at what's in those commits. And so the actual structure of this is a pretty standard directed acyclic graph structure from computer science.
Now something that's sort of a cool aspect of the provenance system is that we actually track provenance at another level, which is the branch level. And this doesn't quite mean the same thing as commit provenance. Commit provenance is this sort of immutable snapshot that tells you, here's where this data came from. The provenance on branches basically describes how your data is flowing at the time. So if a branch is provenant on another branch, then that means that every time you get a commit to the upstream branch, you also get a commit to the downstream branch. And that downstream commit is the result of processing the upstream commit, which means, of course, these commits are gonna be linked via provenance as well. What are some of the other advanced capabilities of Pachyderm that you think are worth calling out that are often overlooked or underutilized?
I wouldn't say it's necessarily underutilized, but it's definitely not something people immediately associate with it, but that gets a lot of use, which is our sort of cron functionality. And that's the ability to have a pipeline that isn't triggered by putting data in the top and getting data out the bottom, but rather is triggered just on a cadence. And so people use this a lot of times as a way to, you know, do something every hour, do something every night. They use it to scrape things. They use it to push things and stuff like that. I think that's one of those features that's not actually super sexy, it's just super useful. Let's see. I think that the fact that you can sort of expose a lot of the underlying Kubernetes things is something that hasn't been fully explored. People are sort of finding new things to do with that every single day. So in Pachyderm, you can attach to pipelines any sort of random modifications to your pods. And so this can be useful for assigning affinities. It can be useful for, you know, declaring resources that you need. But there are always, like, these new things being added to Kubernetes that are really, really useful, and those sort of just naturally propagate up into Pachyderm.
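As a quick illustration of the cron functionality just described, here is a sketch of what a schedule-triggered pipeline spec might look like. The field names follow Pachyderm's cron input conventions as best as can be recalled and may differ by version, and the scraper image and schedule are hypothetical.

```python
import json

cron_pipeline = {
    "pipeline": {"name": "nightly-scrape"},
    "transform": {
        "image": "example/scraper:latest",  # hypothetical scraper container
        "cmd": ["python3", "/app/scrape.py", "/pfs/out"],
    },
    "input": {
        # Instead of an input repo, a cron input fires on a schedule; each
        # tick produces a new input commit, which triggers a pipeline run.
        "cron": {"name": "tick", "spec": "@every 24h"}
    },
}

print(json.dumps(cron_pipeline, indent=2))
# As with any pipeline spec, this JSON would be submitted with something
# like `pachctl create pipeline -f nightly-scrape.json`.
```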
[00:32:27] Unknown:
And for the file system and for interacting with other source systems, does Pachyderm support things like the S3 Select API, or being able to run pushdowns on the different data sources, for trying to optimize for,
[00:32:45] Unknown:
speed and latency and reducing the amount of data that actually needs to be transferred over the wire? So it does. We can sort of, like, select individual pieces of it, if that's what you're talking about. I actually don't know what the Select API does specifically.
[00:32:59] Unknown:
So my understanding is that S3 recently added an API where, for certain file types, you can actually run a select query, so that rather than just pulling down a blob, it can actually index into the data itself and understand what's contained within it, so that you don't have to return the entire object.
[00:33:16] Unknown:
We'll probably have some trouble leveraging this because Pachyderm is designed to work on a bunch of different object stores, and so we're pretty reluctant to implement anything that's only gonna work on S3. One thing that this did remind me of, though, that's a cool new feature that we just added, and so it hasn't gotten anywhere near enough love because it was only very recently released, is that we now support an S3 API on top of PFS. And so if you have applications that are used to sort of writing data into S3 as their data lake, then you can just swap in Pachyderm and it speaks the S3 API, and you can put things in there and those will turn into files in PFS that are committed and stuff like that. And underneath the hood, this is all still going into S3. So it's gonna have much the same, you know, storage characteristics that you are used to in terms of costs and everything like that. But you're gonna get this version control and the ability to, like, you know, run pipelines on top of it
[00:34:13] Unknown:
in addition. That's definitely really cool, being able to just transparently put Pachyderm in there so that the end user doesn't even have to be aware of it. But at the same time, they're getting some of that added benefit of provenance and
[00:34:29] Unknown:
Yeah. This is also how we support reading stuff out of S3. So for example, this is how we support Spark: Spark, you know, can be told, like, read this data out of S3, perform this Spark operation on it, and then write it back into this other place in S3. And now, because we speak the S3 API, that can just be under the hood. And, you know, you now have provenance on your Spark operations.
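To make the S3 gateway idea concrete, here is a sketch of pointing a standard S3 client at Pachyderm's S3-compatible endpoint. The endpoint_url override is a normal boto3 feature; the gateway address, the placeholder credentials, and the branch.repo bucket naming shown here are assumptions that depend on how the gateway is deployed and configured.

```python
import boto3

# An ordinary S3 client, except it talks to the (assumed) Pachyderm S3
# gateway instead of AWS.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:30600",  # assumed local gateway address
    aws_access_key_id="anything",           # placeholder credentials
    aws_secret_access_key="anything",
)

# Objects written through the gateway become versioned files in PFS, so
# downstream pipelines can run on them with full provenance.
s3.put_object(
    Bucket="master.raw-data",  # assumed branch.repo bucket naming
    Key="events/2019-07-01.json",
    Body=b'{"example": true}',
)

for obj in s3.list_objects_v2(Bucket="master.raw-data").get("Contents", []):
    print(obj["Key"])
```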
[00:34:58] Unknown:
And so in terms of the provenance, I know that because you're versioning the containers that are executing as part of the pipeline, that is an added piece of information that goes into it, as far as: this is the data that was there when we started, this is the code that actually executed, and then this was the output. But for external systems, do you have any means of tracking the actual operations that were performed to enrich the metadata associated with the provenance? Yeah. So those can basically use the same system we use, which is that we track the information about
[00:35:31] Unknown:
all of the code that ran and, you know, the Docker container and everything like that. But we actually just do that by piggybacking on PFS's provenance system, because we just add that as a commit. So every job has what we call a spec commit that specifies how the job is supposed to be run, and that includes the code and the Docker container and everything like that. And so outside systems are basically just expected to, you know, serialize this information however they can and just put it in a commit. And then, you know, in essence, it's considered as an input into the pipeline. Like, in terms of the provenance tracking in the storage system, it's really not any different than any other input. It's just that this one happens to define the code that's running in the computation.
[00:36:15] Unknown:
And so earlier, you were saying how Pachyderm, because it is so flexible, the ways that people are using it are sort of up to everyone's imagination. And so I'm curious what you have seen as far as being the most interesting or innovative or unexpected ways that people have been leveraging the Pachyderm platform.
[00:36:34] Unknown:
Man, let's see. I mean, so there are things that are interesting because the end results are interesting. So I think that, you know, a lot of the image processing and machine learning things that I've seen being trained are the most interesting to me. They're not really, like, you know, sort of cute little hacks in the system or, like, interesting abuses of the system. Some of the really interesting things that people do, in terms of the things that I never thought anyone would do with the system, are sort of calling out to other Pachyderm APIs from within the pipeline.
So, you know, you can have a pipeline that, as part of its operation, like, creates another pipeline or does something like that. And this is something that we don't officially recommend that people do, because we haven't really thought about it enough and think there might be some weird things, but we've seen people do some really, like, cool things with it and stuff. So, you know, we don't police this in any way or anything like that. That's sort of the great thing about open source software. You know, we're not gonna stop you from doing stuff. It's yours. You can do whatever you want with it. But it's not something that we had sort of officially thought about as a use for it. We'll let you aim your foot gun at whichever foot you want. Right. Exactly. That's, you know, a very important principle to us, that we don't wanna give you, like, foot guns in disguise.
You know, we don't wanna trick people into using foot guns, but also, you know, if you can't shoot your foot off with the system, you also can't do anything clever with it. And this is, you know, true of all of the sort of Unix ecosystem and things like that. It's like, you can shoot your foot off with it, but you also, in those abuses, can find really cool, useful things to do. And so we feel like we have to be open to, you know, people doing that with Pachyderm, because a lot of our best, you know, features and our best understanding did originally come from people abusing the system. And so a few months ago, you announced that you had raised a Series A round of funding. And I know that with most venture capital, that usually comes with some strings attached, where they're hoping for some measure of hypergrowth.
[00:38:42] Unknown:
And so I'm curious how you're approaching that stage of growing and scaling the Pachyderm platform and business.
[00:38:50] Unknown:
Yeah. Absolutely. And this, I think, always has an extra wrinkle to it when you're talking about an open source company, because, you know, there have been some, I think, notable cases where an open source company has sort of, like, raised money and, like, stopped really subscribing to the open source roots that got them where they were, and it's gone pretty badly for the community. We feel like we are very aligned with our investors in terms of what Pachyderm needs to do long term to be successful. And so we're not as much focused on, like, okay, we need to have this amount of revenue by this date, you know, this quarter or this year or something like that, or we need to have this number of users.
And we're much more focused on what it takes to build a long term sustainable open source project and a company that is also long term sustainable around that. And so we are much less focused on any particular revenue goal in the short term and much more focused on basically making Pachyderm into the platform that we've always believed it could be, and making it into something that's, like, a ubiquitous tool for sort of the underlying data infrastructure, particularly on top of containers. Because we feel as if containers are kind of gonna be the underlying, like, cloud infrastructure for everything, and so the data infrastructure that goes on top of them is really gonna be the de facto infrastructure for everybody. Of course, you know, investors invest because they ultimately wanna see a return. And so we need to make money off of Pachyderm, and that comes from support contracts, our enterprise product, and the cloud offering we're currently rolling out, which we think is going to, over time, become basically the vast majority of our revenue. And so far, we don't feel as if any of these things are at odds with each other. They're not misaligned. It's just a little bit, you know, tricky to get all the puzzle pieces to fit together to make sure that we're staying true to the open source community and, you know, everybody who's used and contributed to this product up until this point, and also keeping the company around it going, because the reality is the open source project probably isn't gonna survive without the company contributing to it. And so in terms of your overall experience
[00:40:55] Unknown:
of building and maintaining and scaling the Pachyderm project and business, what have you found to be some of the most challenging or useful or unexpected lessons that you've learned? Definitely,
[00:41:06] Unknown:
the most useful lesson I think I've learned is just to really listen to your users and see how they're using the product and try to go from there. You know, I came into this with a whole bunch of ideas of what I thought a cool data infrastructure system would look like and what I thought was gonna be important to people. And I wouldn't say that I was wrong about everything, but I was surprised how much I didn't know. It wasn't even so much that the things I knew were wrong, just that there were these, like, massive things that I hadn't even thought about. I mean, provenance is kind of a great example, actually. We initially implemented provenance as a sort of internal thing, where we were like, okay, we need to do this to track and keep things consistent and everything like that and, you know, be able to sort of see this stuff. And then it started to get more and more important for people, and people wanted it more and more. And then there started to be, like, things like the GDPR that actually legislated provenance into the system and stuff like that, or at least legislated it onto companies, so they had to be able to, like, give people an explanation for decisions made by machine learning and things like that. And so all of these things would have been easily missed if we hadn't really been listening and sort of going back every single day and seeing, like, okay, how are people using this? How are people failing to use this? Things like that. The other thing I think that I've learned and been rewarded for is taking risks on new open source projects. Like, Docker was pretty new when we started using it, and Kubernetes was, like, brand new when we first started using it.
There were a decent amount of internal discussions about, like, do we wanna use a platform this new? And, like, even at the beginning, there were a lot of discussions of, like, why are you guys building this instead of, like, a Dockerized thing on top of Hadoop, or, like, a provenance tracking thing for Hadoop? And it took a lot of conviction to just say, no, we're gonna build something new. We're gonna, like, take a stab at doing this our way and see what happens. And, ultimately, I feel like we've been very rewarded for that, but it took a lot to be confident in doing that. And
[00:43:15] Unknown:
what are some of the limitations or edge cases of Pachyderm, and when is it the wrong choice?
[00:43:22] Unknown:
So it's definitely the wrong choice when what you're doing is sort of, like, a very well established data pattern that there are very good tools for. I think the best example of this is SQL. You know, we have a lot of people ask, like, you know, imagine I wanna do Redshift style, like, data warehouse queries against Pachyderm. What's the best way to do that? And right now, the best answer to that is to just use Redshift, because it's really good at that, or, you know, any of the various options. There's, like, BigQuery. There's Hive. There's Presto and things like that. You can sort of start to integrate those things into Pachyderm. Like, people will build Pachyderm pipelines that basically just orchestrate Redshift pipelines or BigQuery pipelines or things like that. But SQL is not something where we're able to beat the things that just do SQL, because there's just a lot there, and it's not the most interesting challenge to us right now. Pachyderm really tends to do well in kind of, like, the everything else data case. You know, when people are thinking, like, I've got these genetics files, and while, you know, there are some pretty good toolkits for analyzing these, like, on a single machine or something like that, there isn't really, like, the distributed genetics pipeline tool or anything like that. And so for those, because Pachyderm is a super generic system and you can just package those tools up into Docker containers and run them, it's very, very nice for that. It gives you some structure to these tools that otherwise you'd just be, like, firing off with scripts ad hoc on, like, random EC2 boxes,
[00:44:53] Unknown:
things like that. And looking forward, what do you have planned for the future of Pachyderm?
[00:44:58] Unknown:
So the biggest sort of change in terms of what the company offers is the rollout of our cloud offering, which is called Pachyderm Hub. And if you sort of think of everything in open source Pachyderm as Git, in that, you know, it enables collaboration on data science, things like that, then Pachyderm Hub is kinda like GitHub for data science. And so it's basically an online site where you can go and you have your account, and it contains, you know, your data repositories and your pipelines that process those repositories, and you can fork other people's pipelines, and you can pull in other people's repositories and things like that. It's a way for people to actually collaborate on live running big data pipelines. That's the thing that we're most excited about in terms of what's different. There's also, of course, tons and tons of work to be done on the core open source project. So there are a lot of upgrades to the storage layer that are gonna make it a lot more sophisticated and a lot faster that I'm very, very excited about. And there are a lot of sort of new pipeline features that are coming out. I mean, spouts was
[00:46:04] Unknown:
one of the first of those. We're also sort of implementing more sophisticated join support, so that you can join two datasets together, and the ability to have more sophisticated pipelines that do, like, loops and conditional branching and things like that. And so for anybody who wants to follow along with you and the work that you're doing at Pachyderm, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Yeah. I mean, that's sort
[00:46:34] Unknown:
of the biggest gap that I see: it's the one that I'm trying to fill, because I felt like that's why I wanted to do this company and what I felt the opportunity was. But, you know, I would basically describe it as the absence of a really good set of tools that are just sort of prescriptive in how you're supposed to do these things. There are ways to do all of the things that Pachyderm allows you to do. You know, you can write some form of version control on top of object storage. You can, like, use Git repos, in theory, and stuff like that. But there's nothing that really ties it all together and gets out of your way and lets you focus on the actual data science that you're really good at. And, again, I'll go back to the analogy of the LAMP stack, wherein, you know, it used to be that to build a website, you needed somebody who was an expert on actually implementing databases, because none of them worked that well for you. You needed somebody who, like, understood how to run all of these servers and all of this stuff. And then once you get this stack that people can congeal around, it's just a very well known, well trodden path, the documentation starts to get really good because there are so many people using it, and then we can stop thinking about that stuff and do all of the interesting stuff that that allows, like building, you know, Facebooks and eBays and things like that. And so we still feel that that hasn't really happened with data science, and that the way to get that to happen is to focus on the infrastructural layer that's needed to tie everything together, and do it in a generic enough way that people can use all of their different tools on top of it. So, you know, I think that the LAMP stack worked really well because you could do all sorts of things that you wanted to do. And, you know, the P part, the PHP part, became very generic, and people started swapping Python in there, and people started swapping Perl in there, everything like that. We have that same level of flexibility with our Docker container centric workloads,
[00:48:23] Unknown:
but we provide the same underlying, like, storage and orchestration primitives that we think are basically what people need to get stuff done. Well, I appreciate you taking the time today to join me and discuss the work that you're doing on Pachyderm and how it has grown and evolved in the past couple of years. I definitely think that it's a great project. It's one that I've been keeping track of for a long time now, and I hope to be able to use it for my own purposes soon. So thank you for all that, and I hope you enjoy the rest of your day. Thank you, Tobias. It was great to be here.
Introduction to Joe Doliner and Pachyderm
Evolution of Pachyderm Over Two Years
Integration with Kubernetes and Machine Learning
Core Components of Pachyderm
Collaboration and Workflow in Data Teams
Local Development and Data Provenance
Advanced Capabilities and Use Cases
Funding and Future Plans for Pachyderm
Challenges and Lessons Learned
Limitations and When Not to Use Pachyderm
Closing Remarks