Summary
Do you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for building a data lake with a versioned file system. Its container-based task graph also lets you run your analysis in whatever languages you want. This week Daniel Whitenack shares the story of how the project got started, how it works under the covers, and how you can get started using it today!
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
- Your host is Tobias Macey and today I'm interviewing Daniel Whitenack about Pachyderm, a modern container-based system for building and analyzing a versioned data lake.
Interview with Daniel Whitenack
- Introduction
- How did you get started in the data engineering space?
- What is Pachyderm and what problem were you trying to solve when the project was started?
- Where does the name come from?
- What are some of the competing projects in the space and what features does Pachyderm offer that would convince someone to choose it over the other options?
- Because the analysis code and the data that it acts on are versioned together, you can track the provenance of the end result. Why is this such an important capability in the context of data engineering and analytics?
- What does Pachyderm use for the distribution and scaling mechanism of the file system?
- Given that you can version your data and track all of the modifications made to it in a manner that allows for traversal of those changesets, how much additional storage is necessary over and above the original capacity needed for the raw data?
- For a typical use of Pachyderm would someone keep all of the revisions in perpetuity or are the changesets primarily just useful in the context of an analysis workflow?
- Given that the state of the data is calculated by applying the diffs in sequence, what impact does that have on processing speed and what are some of the ways of mitigating that?
- Another compelling feature of Pachyderm is the fact that it natively supports the use of any language for interacting with your data. Why is this such an important capability and why is it more difficult with alternative solutions?
- How did you implement this feature so that it would be maintainable and easy to implement for end users?
- Given that the intent of using containers is for encapsulating the analysis code from experimentation through to production, it seems that there is the potential for the implementations to run into problems as they scale. What are some things that users should be aware of to help mitigate this?
- The data pipeline and dependency graph tooling is a useful addition to the combination of file system and processing interface. Does that preclude any requirement for external tools such as Luigi or Airflow?
- I see that the docs mention using the MapReduce pattern for analyzing the data in Pachyderm. Does it support other approaches such as streaming or tools like Apache Drill?
- What are some of the most interesting deployments and uses of Pachyderm that you have seen?
- What are some of the areas that you are looking for help from the community and are there any particular issues that the listeners can check out to get started with the project?
Keep in touch
Free Weekend Project
Links
- Airbnb
- RethinkDB
- Flocker
- Infinit
- Git LFS
- Luigi
- Airflow
- Kafka
- Kubernetes
- rkt
- scikit-learn
- Docker
- Minikube
- General Fusion
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data infrastructure. You can go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music and tell your friends and coworkers. There's also a community site at community.dataengineeringpodcast.com, where you can join the conversation with other listeners and help out by giving suggestions and feedback. Your host is Tobias Macey. And today, I'm interviewing Daniel Whitenack about Pachyderm, a modern container-based system for building and analyzing a versioned data lake. So Dan, could you please introduce yourself? Yeah. I'm Daniel Whitenack, as you mentioned. I'm a data scientist
[00:00:46] Daniel Whitenack:
and developer advocate with Pachyderm.
[00:00:49] Tobias Macey:
And how did you get started in the data engineering and data analysis space? Originally,
[00:00:55] Daniel Whitenack:
back in the day, I started out in physics. Eventually, I decided that I didn't want to stay in academia, so I moved into industry. Finally, I kind of landed in this data science and engineering space. I was fortunate to start out at a small company, a telecom startup called Telnyx, where I had to wear a lot of hats. So I did a lot of data engineering along with doing data science. And I gained an appreciation for, I guess, this kind of end to end ownership of a data project. So
[00:01:31] Tobias Macey:
that's kind of how I got into this stuff and how I developed the interest that I have. Today, we're gonna be talking about the Pachyderm project. So I'm wondering if you can describe a bit about what that is and the problem that you were trying to solve when the project was started. Yeah. Definitely.
[00:01:45] Daniel Whitenack:
Pachyderm is an open source project. It's under the pachyderm org on GitHub. And what it does is it provides a system for data pipelining and data versioning. It's a totally language agnostic framework because it's built on containers, as you mentioned. And thus, you can build data pipelines with basically any language or framework you like, including Java, R, Scala, Go, or even things like TensorFlow. And the input and output data of each stage of these data pipelines are versioned in data versioning. So we can talk about that a little bit further, but it does some nice things for us. When we commit new data into Pachyderm, we can automatically trigger pipelines, so doing things like streaming and batch analysis.
That's the basic overview. Regarding, you know, where it came from, the cofounders, Joey and JD, who kinda started things, they ran data infrastructure at Airbnb for a while. They were on that team, so they were working with Hadoop, and they were also among the earliest engineers at RethinkDB. So they worked a lot with data infrastructure and basically started seeing that new things like NoSQL databases and Redis and Docker and CoreOS felt really awesome and were great to work with, and things like Hadoop felt a little bit ancient. So the vision in starting Pachyderm was really kind of a reimagining of Hadoop that's built on top of modern tools like Docker and Kubernetes.
[00:03:26] Tobias Macey:
And I'm sure I could probably hazard a guess, but where does the name come from? Pachyderm, the name in general, is
[00:03:32] Daniel Whitenack:
kind of this defunct classification of mammals that doesn't really make any sense. It kind of means, like, thick skinned mammals, like elephants or hippos or rhinos. But being that it kind of refers to elephants, I guess the name Pachyderm originally came from kind of a not so subtle pointer to the elephant imagery used in the Hadoop ecosystem. Because again, this really,
[00:04:01] Tobias Macey:
the inspiration came as a reimagining of what Hadoop would look like in a modern world, I guess. So what are some of the competing projects in the space, and what are some of the features that Pachyderm offers that would convince someone to choose it over the other options? There's a couple things.
[00:04:17] Daniel Whitenack:
They mostly fit into either the category of data versioning or data pipelining, not necessarily both. So there's some projects like Flocker from ClusterHQ and Infinit, which, I think it was this week or last week, was just announced as acquired by Docker, or even Git Large File Storage. These projects kind of latch on to the idea of, like, stateful Docker containers and data versioning. Whereas on the other side, you have projects like Luigi or Airflow that are doing data pipelining things. So in comparison to Pachyderm, I guess the one distinction would be that these things are targeting one or the other of those topics. However, at Pachyderm, we're very passionate about the idea that combining data versioning with data pipelining is really way more powerful than trying to glue something like Luigi onto another framework for data versioning.
Basically, that data pipelining is best paired with data versioning. There's this kind of symbiotic relationship between the two in our framework, and it really allows data engineers and data scientists to produce workflows that are reproducible, collaborative, and easily distributed as well. And one of the
[00:05:46] Tobias Macey:
factors that you call out in the marketing site and the documentation is the fact that because of the integration of the data versioning and the pipelining, it makes it easy to track the
[00:05:57] Daniel Whitenack:
provenance of the data. So why is that such an important capability of the system? There's a couple answers to that question. There's different aspects or different, I guess I should say, benefits when it comes to data provenance. The first thing is that, really, as we build more data pipelines and models and statistical analyses and other things that impact users directly, we're gonna be held more and more accountable for that. And this is already being seen in some, like, EU regulations that are coming out that basically are telling people that they have a right to an explanation for algorithmic decisions.
So without that part, the data provenance, you can't really give that explanation. And so there's, like, a compliance element to this or an auditing element, maybe if you're an insurance agency or something like that. But then the other side of it is that data provenance is really a precursor to true collaboration in an organization. So if I'm developing some nifty model or some analysis or some amazingly efficient pipeline, I'm not working in a silo. And I want people to be able to reproduce what I've done using the same data that I used to produce the same results that I saw.
And this isn't really possible at all if you don't have some way of tracking how your data is moving and changing and, some way of pairing that with the actual analysis that you're doing. And
[00:07:30] Tobias Macey:
the versioning capability of the data is definitely something that is not as widely used in a lot of contexts, largely because, from what I've been able to gather, the primary storage mechanism for distributed data lakes is the Hadoop file system, which I know is based on Java. But what does Pachyderm use as the base layer for being able to provide that versioning capability, as well as the distribution and scaling mechanisms that are used in conjunction with that? One of the kind of major
[00:08:03] Daniel Whitenack:
mindsets that we have at Pachyderm is that we want people to stay focused on their problems, solving these data engineering challenges and using their expertise to solve them, and not necessarily having to worry about this file system. So, basically, PFS sits on top of any generic object store, like S3 or GCS or Ceph or Swift. And, basically, what we do is treat this object store like a CDN for data. That, along with some smart caching, allows us to have this distributed, scalable, and resilient data along with kind of the solid performance. And because all of this is scheduled on Kubernetes, it means that, basically, for end users, we can scale up and down arbitrarily to service more requests or add more cache space, for example. And if I recall properly, I believe that in one of the earlier permutations of the Pachyderm system, it used btrfs for being able to handle the versioning of the data as well. Is that correct? It's possible. I'm a recent addition to the team, so I'll have to defer to my colleagues on that particular one. I know that we've utilized FUSE, although in our most recent release, there's been some changes to that where we actually prefetch some data without relying on FUSE, which gives some performance advantages.
[00:09:34] Tobias Macey:
And the versioning capability of the data, I know, is handled at least partially by tracking the diffs as you apply different manipulations to the data itself. So given that the current state of the data is computed by applying the various diffs on top of the base state of the data, I'm wondering how that affects the additional storage capacity that's necessary for storing those changes, as well as the performance impact that having to apply those various diffs to compute the current state might have when somebody is trying to do analysis on top of it? Yeah. That's a great question and kind of basically getting into
[00:10:15] Daniel Whitenack:
the trade offs of data versioning. But I would say, so Pachyderm definitely doesn't save full copies of files in terms of saving the different commits of your data into versioning. We only store the diffs, so it's space efficient in that way. And to just give you an idea as far as, like, the space efficiency in terms of numbers, we store maybe about 64 bytes of metadata per 8 megabyte block of actual data that you're pushing into versioning. So it's pretty space efficient in that way. Also, in terms of, like, the actual computations, one of the things that's nice is, like, a Pachyderm job or a Pachyderm pipeline either subscribes to a certain repository of data or is actually run on a specific commit of data.
So for each analysis, basically, you're analyzing data at a certain state. You're not having to kind of scrub through a history of data commits in order to figure out how to process your data. So on a per job or per pipeline basis, we're processing one commit of data that's coming in, and then we're making one commit of data out. And for a typical use of Pachyderm, would somebody keep all of the revisions
[00:11:35] Tobias Macey:
of the data in perpetuity, or are the changesets primarily just useful in the context of the analysis workflow, and then the final state of the data would then get merged back into the base layer? This persistence question is really, I guess, on a case by case basis.
[00:11:50] Daniel Whitenack:
We've seen some users that do want to keep the whole history of their data. So maybe if you're thinking back to what I said earlier about an insurance company, or maybe a company that's, like, labeling people as fraudulent or whatever it is, it could be that you want an audit trail for your data a good ways back, and so you might be persisting this whole commit history in perpetuity. Other companies, like, let's say, maybe a smaller web company that has certain policies around privacy or whatever it is, or maybe they're only concerned about the data for the past 48 hours, they might stream that data in and then basically delete those repositories or those commits afterwards.
Or they might have a part of their pipeline that specifically anonymizes and persists only certain data after a period of time. Pachyderm is designed in general to store as much or as little data as you would like to persist. The mechanism is the same around commits and branches.
[00:12:57] Tobias Macey:
It's just up to you how you'd like to organize and persist it. And given the versioning capability, is this something that would be feasible to use as sort of a secondary store for being able to keep a historical record of transactions for a primary database, for somebody using, for instance, Postgres? Yeah. We've actually seen a couple of uses
[00:13:18] Daniel Whitenack:
specifically with Postgres, actually, in the last couple weeks. Basically, one use case had to do with a series of tables being generated in a pipeline and then dumping those tables into Pachyderm, in order to basically have the whole history of those tables that were being generated in Postgres in Pachyderm's file system. So it's definitely something that we see, and it's something that is oftentimes maybe a first step for people as well. So maybe they have this pipeline that's creating, pushing data to Postgres or whatever data warehouse they're using, and they just want the history of that, such that they can either revert or such that they can analyze diffs between certain times or whatever it is. This might be a first step in order to kind of start getting into Pachyderm and learn about the versioning: to actually use that as kind of a time machine for your database. But then those companies also, once they kind of get there, they begin to think about, okay, now I have this kind of time capsule for my data.
This also allows me to kind of keep my analyses in sync with that data over time. And this is what sometimes leads into
[00:14:41] Tobias Macey:
bringing in the analysis pieces as well. And what's the typical interface for somebody trying to load their data into the Pachyderm file system, either streaming or in bulk? Is it that they would generally upload it as a batch job to S3 or Google Cloud Storage, or is there an interface that Pachyderm provides to sort of abstract the backing data store away so that they could just interact directly with Pachyderm itself? The object store and Kubernetes
[00:15:07] Daniel Whitenack:
can be basically transparent to a user. So Pachyderm does provide its own interface to these things, and you can think of it as very similar to how you interact with GitHub via Git. You have files locally, or you have files somewhere or data somewhere, and you can commit that data to Pachyderm via commits, and you define a branch, and then you can transfer those files or that data. And this could be everything from, you know, some of our users are committing hundreds of gigabytes at a time into their data repositories, maybe slightly less frequently than other customers who are maybe feeding pipelines off of Kafka or something like that. And they might have a service that basically is just making high frequency commits into Pachyderm, which is then triggering streaming analysis within Pachyderm. So it's everything in there. We have a CLI tool that will let you do these things manually and inspect things manually, but also you can utilize our Go client. We have a nice Go client, and there are other clients in the works as well, a Python client and actually a Rust client in the works, that will basically make this kind of interaction universal as well.
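To make the Git-like workflow Daniel describes a bit more concrete, here is a rough sketch of loading data with the Go client he mentions. The repo name, file name, and address are made up for illustration, and the import path and function signatures are assumptions based on the 1.x client of that era, so treat this as a sketch rather than a definitive reference.

```go
package main

import (
	"log"
	"os"

	// The Pachyderm Go client Daniel mentions; the import path and the
	// signatures below are assumptions based on the 1.x client and may
	// differ between releases.
	"github.com/pachyderm/pachyderm/src/client"
)

func main() {
	// Connect to the pachd service exposed by the cluster (the address is
	// a placeholder for a local deployment; adjust as needed).
	c, err := client.NewFromAddress("127.0.0.1:30650")
	if err != nil {
		log.Fatal(err)
	}

	// Create a versioned repository, analogous to creating a Git repo.
	if err := c.CreateRepo("events"); err != nil {
		log.Fatal(err)
	}

	// Open a commit on the master branch, write a file into it, and close
	// the commit.
	commit, err := c.StartCommit("events", "master")
	if err != nil {
		log.Fatal(err)
	}
	f, err := os.Open("events.json")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := c.PutFile("events", commit.ID, "events.json", f); err != nil {
		log.Fatal(err)
	}
	if err := c.FinishCommit("events", commit.ID); err != nil {
		log.Fatal(err)
	}
}
```

The important bit is that finishing the commit is what makes the new data visible, and, as described above, is the event that triggers any pipelines subscribed to the repository.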
[00:16:34] Tobias Macey:
And are there any particular data or serialization formats that are better supported in Pachyderm? Or is it largely just a matter of the language and tools being used to actually interact with the data once it ends up in the Pachyderm file system? Like, are there any particular capabilities that the versioning layer adds on top of serialization formats such as Avro or Thrift? Or is it largely agnostic to that? Generally, I would say it's largely
[00:17:00] Daniel Whitenack:
agnostic to that. So the Pachyderm file system and the way that your analyses work with it is basically just like you would work with any other file system. So how you interact with versioned data is the same way that you would interact, via file IO, with any other files. So you can utilize whatever libraries, whatever formats you like. There are some interesting features for certain data types. So there's some functionality that's been built in around JSON data, like line based delimitation of data. And there are other things to possibly be aware of. For example, if you have lines of JSON data and you modify a certain field within one of your JSON blobs, the way that we do parallelization, it might have to reprocess that entire blob, rather than just processing the new part of your data as it would if you had only added a new line.
So there's some caveats here and there, but for the most part, it's just generally
[00:18:08] Tobias Macey:
whatever files you wanna use and interacting with them the same way you would with any other file system. So as you mentioned earlier, another one of the compelling features of Pachyderm is the fact that it can natively support the use of any language for interacting with and analyzing the data. So I'm wondering why that's such an important capability, and what is it about some of the other systems that might make it more difficult to achieve that same result? This is a hugely important point in my mind,
[00:18:35] Daniel Whitenack:
working both with data scientists and data engineers. I've seen that basically everyone comes from a different background. They've used different tools. They prefer different tools. Data engineers are maybe more interested in JVM languages like Java or Scala, whereas some data scientists might have come from, like, a statistics background and work with Mathematica or R, and other ones maybe are, like, scripting with Python. So there's this whole range of tools and whole range of backgrounds, and this is really the situation we're in and will be in for quite some time. And I would say that no one really knows the tools that they're gonna be using even a year from now. So building up your pipelines in a language agnostic way is hugely important, and also in the sense of, like, bridging this gap between data engineers and data scientists.
Again, data engineers, we love to think about how something will scale and might get very frustrated sometimes with scientists who build incredibly inefficient code. One of the nice things about Pachyderm is you're utilizing these very simple file IO operations to access your data. You can write something simple, whether it be in R or whether it be in Java or whether it be in Python, and insert it into Pachyderm and instantly be able to distribute that and scale it across your data, and also instantly be able to have your data scientists' Python script interacting in this distributed, scalable way with your data engineers' parsing and other pipelining stages that are maybe written in Java or Scala.
So I think this is hugely important. Other frameworks, for example, like Airflow or Luigi, they might be tied to specific languages. A lot of times you see things tied to Python, which is kind of limiting, especially, as I say, for data engineers that really like JVM languages. And it really doesn't promote the kind of autonomy for individual data scientists and data engineers to be able to say, like, this is the best tool for this situation, I'm gonna choose it, and I will be able to deploy it in a consistent way. That's kind of what the Pachyderm philosophy is: choose the best tools for these different stages, and we'll deploy them in a very consistent way that will also be reproducible across your data.
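As a small illustration of the plain file IO model being described here, this is roughly what a single pipeline stage could look like in Go: Pachyderm mounts each input repo under /pfs inside the container and versions whatever the stage writes to /pfs/out. The repo name, record fields, and line-delimited JSON input are assumptions made for the example; the same stage could just as easily be a Python or R script.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// A hypothetical record shape for the line-delimited JSON input.
type record struct {
	User  string `json:"user"`
	Event string `json:"event"`
}

func main() {
	counts := map[string]int{}

	// Walk every file Pachyderm has placed under the (hypothetical)
	// "logs" input repo mount.
	err := filepath.Walk("/pfs/logs", func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()

		// Treat the input as line-delimited JSON, one record per line.
		scanner := bufio.NewScanner(f)
		for scanner.Scan() {
			var r record
			if err := json.Unmarshal(scanner.Bytes(), &r); err != nil {
				continue // skip malformed lines
			}
			counts[r.Event]++
		}
		return scanner.Err()
	})
	if err != nil {
		log.Fatal(err)
	}

	// Anything written to /pfs/out becomes the versioned output commit
	// of this stage.
	out, err := os.Create("/pfs/out/event_counts.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	for event, n := range counts {
		fmt.Fprintf(out, "%s\t%d\n", event, n)
	}
}
```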
[00:21:10] Tobias Macey:
So a couple of questions off of that. The first one being, I'm wondering if there are any built in capabilities for being able to alert on any failures in the data pipeline so that they can be caught and remedied early on. And if there are failures, I'm assuming that it's easy to resume from where that failure occurred because of the fact that it does have that versioning step, so that each step of the pipeline, I'm assuming, is associated with a particular version marker in the file system itself. Yeah. That's
[00:21:42] Daniel Whitenack:
a great deduction and very much the way it is. So Pachyderm does have some failure capability built in, including notifications of failures, but also being able to get your logs from your containers that are running in Kubernetes for particular jobs that are running off of particular commits. But also, like you said, there's this feature that is maybe a little bit subtle when you start thinking about it, but very powerful when you grasp it. The fact is that, like, if you have this complicated pipeline with maybe 7 different stages that branches out these ways, and then there's 5 other stages over here and 2 others over here, what Pachyderm allows you to do is, on commits into the input of your pipeline, things will run and maybe something fails, but you don't have to rerun every single stage in order to recreate your result. Because things are versioned, all of those commits to the other data repositories, the successes, still exist.
So basically, what you can do is gain some efficiency on the next run. You know, once you fix your problem, maybe you fix your image and you update your pipeline, you can run just on that new data, on those stages that need to be run, in order to produce your result. In the
[00:23:07] Tobias Macey:
versioning of the file system, does it also record any sort of metadata as far as the time that the version was created, so that you can go back and see when a failure happens? Can you see the actual time of the last data commit that happened? Because I can see where that would be potentially useful in a number of situations.
[00:23:24] Daniel Whitenack:
Yeah. And this gets to another really great implication, I guess, of data provenance. And by data provenance, what I'm meaning is this data came out, or this result came out, of this pipeline, or maybe a stage of a pipeline failed. With things being versioned and with the provenance features of Pachyderm, what you can say is, this series of commits to this series of data repositories was processed by this series of pipelines, eventually leading to the point that you're interested in, whether that be a weird result that you get or whether that be a certain failure in a certain stage. So this also has some really great debugging functionality, because once something fails, you can look at the whole provenance of what went where, what was transformed where, in order to produce the input that is actually fed to the stage that failed.
And you can look back, and like you said, you can look at the different timestamps of those commits and the different commit IDs, and look at the state of that data wherever you'd like in any series of data repositories that you would like. As well, the pipelines and the jobs have time information about when they ran, when they finished running, and there's some inspecting
[00:24:49] Tobias Macey:
functionality that's built into Pachyderm there. And I can imagine that another thing that would be fairly important for being able to properly track the provenance and reproduce the results at various stages of development is having a good way of being able to reliably determine the exact container images that were being used for each iteration of the pipeline. So I'm wondering if there's any sort of integration with a Docker registry or any sort of mechanism for caching the actual container images that were used at various points in the pipeline? You're definitely right. I mean, you can version your data, but if you don't know
[00:25:27] Daniel Whitenack:
what images were paired with that, then you don't explicitly know what the complete story was. There's a couple of points here. The first point being that Pachyderm will work with, or can pull images from, any Docker registry that you're running, whether that be an internal private registry or Docker Hub or whatever it is; you can pull images from any of those places. Because of that, and because that interaction is just as you would expect when working with a registry, you can make use of tags in order to properly tag your images as far as which tagged image was used in which analysis. This information will be tracked. As well, there's some nice development functionality related to this, because it's very possible that during development, you pull an image in a Pachyderm pipeline and then it fails and then you have to correct it, so you have to change the image. And, basically, you might not want to keep updating the tag, or it might not be ready for you to tag your production container yet. But what you can do is utilize this update pipeline functionality in Pachyderm, which, under the hood, will pull the latest build of your Docker image, and it will basically create a unique tag for that image and store it in an internal Docker registry.
And each time you update, it will do that. So basically, during this development cycle where you're iterating on the images you're using, you can do that very quickly with this kind of internal tagging. And given that you're using Kubernetes as the orchestration mechanism for the containers, does it also support the rkt container image format? Right now, we only work with Docker containers, which I think covers most cases. There is some conversation going on around that, and I can also provide some links to that and to our Slack channel for more information on that after the podcast. Yeah. So I'll make sure that all that gets in the show notes.
[00:27:35] Tobias Macey:
And another question that I had about the pipelining capability and the intent of being able to move the analysis code that's generated by data scientists into the operational pipeline without having to reimplement all the algorithms, because I know that that can be a fairly large investment when you're using one language to do the analysis and running the actual production deployment of it in another language. But it also potentially runs into the situation where the analysis code itself, even though you don't have to rewrite it, might not be appropriately scalable to have the necessary performance characteristics. So I'm wondering what are some of the things to watch out for there, and what are some of the mitigating
[00:28:20] Daniel Whitenack:
capabilities that Pachyderm might provide? In this sense, the way that we distribute our analysis via containers and Kubernetes can be very powerful. Again, as I've kind of switched back and forth from data engineering to data science throughout the things that I've done, I've seen exactly what you've talked about. So this hard transition from data scientists having something they like on their laptops and transitioning that into something that can actually be used in production. A really nice feature of Pachyderm is that Pachyderm can really smartly distribute your analysis across your data, even if that analysis is kind of, quote, unquote, dumb analysis, or kind of simplistic, without a lot of thought as far as parallelization or anything like that. For example, a recent example I did, I was just using basic scikit-learn models, which don't necessarily scale super great. But I was able to do kind of this massive hyperparameter optimization with just these simple scikit-learn models, because I could put a simple scikit-learn model in a container, and then I could use Pachyderm to say, spin up a hundred of these containers and supply each one with one hundredth of my data, or you could say, like, supply each one with different parameters for the optimization.
And then at the end, we can have a container that reduces these results. And in that sense, your data scientists didn't have to worry about making some model that was, out of the box, easily distributed, but they were able to do what made sense in their mind and distribute that easily with Pachyderm. So that's one thing to keep in mind. And that, kind of paired with this language agnostic feature, allows your data engineers to write very efficient things in Java, Scala, or Go, or whatever it is, and kind of feed that into this distribution of data science tools.
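A hedged sketch of the fan-in side of the hyperparameter example Daniel describes: assuming each of the fanned-out training containers wrote a one-line "params<TAB>score" file into an upstream repo that Pachyderm mounts at /pfs/scores (both the repo name and the file format are hypothetical), a final reducing stage only needs ordinary file IO to pick the best run.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"path/filepath"
	"strconv"
	"strings"
)

func main() {
	bestParams, bestScore := "", -1.0

	// Each training container is assumed to have written a small result
	// file of the form "<params>\t<score>" into the upstream repo, which
	// Pachyderm mounts here at /pfs/scores.
	files, err := ioutil.ReadDir("/pfs/scores")
	if err != nil {
		log.Fatal(err)
	}
	for _, fi := range files {
		if fi.IsDir() {
			continue
		}
		data, err := ioutil.ReadFile(filepath.Join("/pfs/scores", fi.Name()))
		if err != nil {
			log.Fatal(err)
		}
		parts := strings.SplitN(strings.TrimSpace(string(data)), "\t", 2)
		if len(parts) != 2 {
			continue
		}
		score, err := strconv.ParseFloat(parts[1], 64)
		if err != nil {
			continue
		}
		if score > bestScore {
			bestParams, bestScore = parts[0], score
		}
	}

	// Whatever this stage writes to /pfs/out becomes its versioned output.
	result := fmt.Sprintf("%s\t%f\n", bestParams, bestScore)
	if err := ioutil.WriteFile("/pfs/out/best.txt", []byte(result), 0644); err != nil {
		log.Fatal(err)
	}
}
```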
[00:30:29] Tobias Macey:
And the integration of the data pipelining and dependency graph is definitely very useful, particularly given the level of integration with the file system and the versioning interface. So I'm wondering if that precludes any requirement for using external tools such as Luigi or Airflow. Basically, the answer is yes. I mean, again, we're very passionate that the
[00:30:51] Daniel Whitenack:
combination of data versioning with the pipelining piece is really what empowers these innovative, reproducible pipelines. And you can define this whole pipeline, whether it's a very simple ETL pipeline, maybe it just consists of one stage, all the way through a very, very complicated pipeline used for research or whatever it is; you can define all of the specification with Pachyderm and pull in whatever images you need and whatever languages you need. Pachyderm provides this way of doing the pipelining, as well as something that Airflow and Luigi kind of don't provide, which is a natural way for you to keep your pipeline in sync with your input data. Because the versioning is there, you know exactly which data has been updated.
And so your dependency scheduler should only process that new data. And in terms of the
[00:31:48] Tobias Macey:
analysis patterns and sort of algorithmic approaches, I noticed that the documentation mentions the MapReduce pattern. And I'm wondering what other sorts of approaches are supported, such as streaming or interfacing with tools like Apache Drill. MapReduce is definitely
[00:32:05] Daniel Whitenack:
possible in Pachyderm. I don't think we could be a proper reimagining of Hadoop without some MapReduce pipelines. Developers have flexibility here. So you could have MapReduce, you could have multiple maps, and you could have filters. You could have a single nonparallelized pipe where you're basically utilizing the data versioning capabilities, maybe to do some ETL. Some of our users are doing just that. They're processing log lines with a Pachyderm pipeline to parse values out, or format output, or maybe process web related events. Pachyderm is a natural fit for both streaming and batch processing. And once you start kind of thinking about this data versioning combined with analysis mentality, really the issue of whether something is streaming or batch to some degree disappears, because we have users that are doing, quote, unquote, streaming with Pachyderm because they're making very high frequency commits into data versioning. And those high frequency commits are triggering pipelines in a streaming fashion to process all of those commits, those events coming off of maybe a message queue, maybe Kafka, and process that, while other users are doing very, quote, unquote, batch things where they're maybe making a commit of a large amount of data daily and doing some aggregation and driving a dashboard or whatever it is. So, yeah, all of those scenarios are covered.
One of the challenges in kind of starting with Pachyderm is, because there are so many possibilities with those, sometimes it can be hard to think, what should I start with? But I would say, for those that are interested in these more straightforward ETL things, that's kind of where you should start with Pachyderm. And maybe for those looking to do distributed MapReduce, we have examples of that too,
[00:34:07] Tobias Macey:
and all of that's in our docs. And speaking of getting started, what does the deployment story look like for somebody who wants to start experimenting with Pachyderm, either locally on their laptop or in a production context? Locally, it's very easy.
[00:34:21] Daniel Whitenack:
Kubernetes now has this Minikube program. So, basically, you can use that to deploy Pachyderm within, I think, 3 commands now, and I think we're getting rid of one of those commands fairly soon. But, basically, you have to deploy Minikube, and then you run pachctl deploy to deploy your Pachyderm cluster within Minikube locally. And that's actually how I do a lot of development work too. I'll just have Minikube running on my laptop and develop my pipelines in that way. So that's another really good use of it. But then also we have deploy scripts as well and instructions for how to deploy on Google or Amazon and Azure.
Most of the time, if there are any issues, it might be in spinning up your Kubernetes cluster. And once you have your Kubernetes cluster up, deploying Pachyderm again is just like deploying a Kubernetes application. So if you have Kubernetes up in your respective cloud provider, or if you can get it up, then that next step to Pachyderm is very straightforward.
[00:35:30] Tobias Macey:
And what are some of the most interesting or unexpected uses of Pachyderm that you've seen? The one that I really like, coming
[00:35:38] Daniel Whitenack:
from a physics background especially, is our work with General Fusion. So General Fusion has probably been, I think, the most interesting and longest running use of Pachyderm. They're a company that is designing the world's first full scale demonstration of a fusion power plant. So it's pretty cool stuff. Some of the things that they liked about Pachyderm were, one, the language agnostic piece, which allowed them to not have to worry about rewriting their whole tech stack, and also this data versioning piece.
So they had a bunch of different research teams that were working on various things. Before Pachyderm, they would pull things down onto their laptops and work on their analyses. But inevitably, those analyses would fall out of sync with their latest calibration. So this has basically eliminated that kind of frustrating cycle and allowed them to collaborate in a more sane way. Another project that I'm pretty excited about is a project with one of the major newsrooms that we're working with right now, basically building a time machine for a comment thread.
So as we all know, comment threads on blog posts or Reddit or wherever it is can get a little bit hairy. So what we're trying to think about is building a way for moderators to basically have time snapshots of what a thread looked like at various stages in the past. So that's this data versioning piece, as you can imagine. But then on top of that, we can pair the different snapshots with different statistics and aggregations of metrics for those different times. So we could use NLP, for example, to get sentiment for various users at a certain time window and that sort of thing. Those are the ones that I'm definitely enjoying watching right now. And building something like Pachyderm is certainly a fairly time consuming process,
[00:37:41] Tobias Macey:
and I'm sure that you and everybody else who you work with isn't doing it for free. So I'm wondering what the business model behind Pachyderm looks like. Right now, I think we're 8 people in total.
[00:37:52] Daniel Whitenack:
I mean, everybody's an engineer, really. We've focused a lot of the time until now on building the core platform. But kind of the business model around that, and something we're already seeing, is that, along with other open source projects that are out there that are actually backed by some business entity, people love open source because they can dive right in and they get everything for free, of course. And we provide a great community around it, including a Slack team and all of that. And there's a lot of open discussion about it. But as well, a lot of enterprise users want a little bit more hands on support.
So providing that engineering support to them, and support in general, is one way to bring in those funds that are kind of powering the development.
[00:38:43] Tobias Macey:
So what are some of the areas that you're looking for help from the community? And are there any particular issues that the listeners can check out to get started with the project? We are really looking for two things right now. I mean, users and contributors. Those are the two big ones. Right? We're definitely
[00:38:59] Daniel Whitenack:
wanting users to come online, join our Slack team, which you can find on our website, pachyderm.io, and basically discuss with us what your use cases are. And like I said, we are very, very much wanting users right now. So if you come on there, our whole engineering team will be very quick to respond to you and help you get your toy model up, and then a prototype and maybe a POC, and help you all the way along that road. And that's really helpful for us, because along that road, we get feedback about kind of little kinks in workflows. So everyone has a slightly different workflow that they'd like to see, and figuring out trends in where people are getting stuck, that's really helpful to us at this point. So anyone who is even slightly interested, I would encourage you to get online and just try out a few things, give us some feedback, and that would be much, much appreciated.
On the contributor side, there are definitely a lot of open issues online. In the past, we have tagged quite a few as noob friendly, I think, for new contributors that want to start contributing. I'm not sure the current status of how many of those there are, so that would be one thing to look for. But if you're not finding those, some initial things that would be really helpful: we put up all this documentation, and just going through each of the examples and doing two things. So fixing any documentation errors is obviously helpful. But also, as you go through these different examples, everyone has a different setup on their laptop, or they're deploying to maybe a different cloud provider or whatever it is. So as you're going through those examples, you're also finding potential bugs or maybe issues with our deploy scripts. And those are things where you can hop on our Slack channel, see if anybody is experiencing the same thing, or look through the issues, see if it's been brought up, and then bring up those issues and maybe even address them as well. And what is the language or languages that Pachyderm is actually implemented in, so that people who are familiar with those languages can go ahead and check it out? Yeah. Pachyderm is written in Go. So that's primarily the language that you'll find throughout the repository.
And, if you're wanting to contribute, although we are in the process of doing some front end things as well, as far as the UI is concerned. If there's interest on that side as well, that's definitely
[00:41:36] Tobias Macey:
something that we're exploring. So are there any other topics that we should cover before we close out the show? No. I think that covers,
[00:41:43] Daniel Whitenack:
most things. It's a good discussion. Great.
[00:41:46] Tobias Macey:
So for anybody who wants to keep up to date with you and the project and, get in touch, what would be the best place for them to do that? You can find me online.
[00:41:55] Daniel Whitenack:
I'm dwhitena on Twitter. And also, regarding Pachyderm itself, I mentioned our website, pachyderm.io. Make sure and check that out. And on the website, there are links to our GitHub and our Twitter and our Slack. Even if you're just experimenting or playing around with the examples, it would be great if you joined our Slack channel and just had some basic discussions with us about Pachyderm. So that's a great way to get connected. And various of us are also on the Gophers Slack if you're a Go programmer or maybe a potential contributor and wanna talk to us there. So Alright. So as an aside,
[00:42:37] Tobias Macey:
this is the first interview I've recorded for this podcast. So I'm currently trying to figure out a good sort of closing question to ask all of the guests. Some of the ones I'm considering are: what's your favorite dataset? Or, if you had a free weekend, what sort of analyses would you do? Although, given that it's a data engineering focused podcast, I'm not sure that those kinds of questions would be the appropriate focus. I don't know if you have any suggestions. I think the free weekend one is a pretty good one. I mean, it could be phrased not only in terms of analysis, but in terms of your
[00:43:11] Daniel Whitenack:
side projects or what you're involved in with the main project. You know, what would you tackle on a free weekend that you can't get to this next week? Okay. That's a good one. I'm sure I'll probably go through a few permutations of it. But,
[00:43:27] Tobias Macey:
let's go with that one. Alright. Sounds good. Okay. So if you had a free weekend to spend working on anything, either related to Pachyderm or not, what do you think you would be spending it on? Well, there's a lot of things in the queue. But I think if I had to choose,
[00:43:43] Daniel Whitenack:
I would work on my side project, gophernotes, which is the Go kernel for Jupyter notebooks. And the one thing I'm wanting to implement in that, which I think is a major feature that needs to be there, is inline plotting in the notebooks.
[00:44:01] Tobias Macey:
And that's probably what I would work on. Well, I really appreciate you taking the time out of your day to join me and talk to me about the work that you guys are doing at Pachyderm. It's definitely a very interesting project, and it seems to be solving a lot of important issues for people in the data engineering and data management space, as well as people doing data analysis, potentially making it easier for them to manage their own pipelines. So it's certainly something I'll be keeping an eye on. And I'm sure that a number of the listeners will be interested in taking a look at it once they hear a bit more about it. So thank you again. Yes. And thank you for the opportunity. It was a great time.
Introduction and Podcast Information
Guest Introduction: Daniel Whitenack
Overview of Pachyderm
Origins and Inspiration Behind Pachyderm
Competing Projects and Unique Features
Importance of Data Provenance
Versioning and Storage Mechanisms
Use Cases and Data Persistence
Integration with Databases
Loading Data into Pachyderm
Language Agnostic Capabilities
Failure Handling and Debugging
Tracking Container Images
Scalability and Performance
Pipelining and Dependency Management
Supported Analysis Patterns
Getting Started with Pachyderm
Interesting Use Cases
Business Model
Community Involvement and Contributions
Technical Implementation
Closing Questions and Final Thoughts