Summary
Do you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for building a data lake with a versioned file system. Its container-based task graph also lets you run your analysis in whatever languages you want. This week Daniel Whitenack shares the story of how the project got started, how it works under the covers, and how you can get started using it today!
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
- Your host is Tobias Macey and today I'm interviewing Daniel Whitenack about Pachyderm, a modern container-based system for building and analyzing a versioned data lake.
Interview with Daniel Whitenack
- Introduction
- How did you get started in the data engineering space?
- What is Pachyderm and what problem were you trying to solve when the project was started?
- Where does the name come from?
- What are some of the competing projects in the space and what features does Pachyderm offer that would convince someone to choose it over the other options?
- Because the analysis code and the data that it acts on are versioned together, you can track the provenance of the end result. Why is this such an important capability in the context of data engineering and analytics?
- What does Pachyderm use for the distribution and scaling mechanism of the file system?
- Given that you can version your data and track all of the modifications made to it in a manner that allows for traversal of those changesets, how much additional storage is necessary over and above the original capacity needed for the raw data?
- For a typical use of Pachyderm would someone keep all of the revisions in perpetuity or are the changesets primarily just useful in the context of an analysis workflow?
- Given that the state of the data is calculated by applying the diffs in sequence, what impact does that have on processing speed and what are some of the ways of mitigating that?
- Another compelling feature of Pachyderm is the fact that it natively supports the use of any language for interacting with your data. Why is this such an important capability and why is it more difficult with alternative solutions?
- How did you implement this feature so that it would be maintainable and easy to implement for end users?
- Given that the intent of using containers is for encapsulating the analysis code from experimentation through to production, it seems that there is the potential for the implementations to run into problems as they scale. What are some things that users should be aware of to help mitigate this?
- The data pipeline and dependency graph tooling is a useful addition to the combination of file system and processing interface. Does that preclude any requirement for external tools such as Luigi or Airflow?
- I see that the docs mention using the MapReduce pattern for analyzing the data in Pachyderm. Does it support other approaches such as streaming or tools like Apache Drill?
- What are some of the most interesting deployments and uses of Pachyderm that you have seen?
- What are some of the areas that you are looking for help from the community and are there any particular issues that the listeners can check out to get started with the project?
Keep in touch
Free Weekend Project
Links
- Airbnb
- RethinkDB
- Flocker
- Infinit
- Git LFS
- Luigi
- Airflow
- Kafka
- Kubernetes
- rkt
- scikit-learn
- Docker
- Minikube
- General Fusion
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data infrastructure. You can go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music and tell your friends and coworkers. There's also a community site at community.dataengineeringpodcast.com, where you can join the conversation with other listeners and help out by giving suggestions and feedback. Your host is Tobias Macey. And today, I'm interviewing Daniel Whitenack about Pachyderm, a modern container-based system for building and analyzing a versioned data lake. So Dan, could you please introduce yourself? Yeah. I'm Daniel Whitenack, as you mentioned. I'm a data scientist
[00:00:46] Daniel Whitenack:
and developer advocate with Pachyderm.
[00:00:49] Tobias Macey:
And how did you get started in the data engineering and data analysis space? Originally,
[00:00:55] Daniel Whitenack:
back in the day, I started out in physics. Eventually, I decided that I didn't want to stay in academia, so I moved into industry. Finally, I kind of landed in this data science and engineering space. I was fortunate to start out at a small company, a telecom startup called Telnyx, where I had to wear a lot of hats. So I did a lot of data engineering along with doing data science. And I gained an appreciation for, I guess, this kind of end to end ownership of a data project. So
[00:01:31] Tobias Macey:
that's kind of how I got into this stuff and how I developed the interest that I have. Today, we're gonna be talking about the Pachyderm project. So I'm wondering if you can describe a bit about what that is and the problem that you were trying to solve when the project was started. Yeah. Definitely.
[00:01:45] Daniel Whitenack:
Pachyderm is an open source project. It's under the pachyderm org on GitHub. And what it does is it provides a system for data pipelining and data versioning. It's a totally language agnostic framework because it's built on containers, as you mentioned. And thus, you can build data pipelines with basically any language or framework you like, including Java, R, Scala, Go, or even things like TensorFlow. And the input and output data of each stage of these data pipelines are versioned in data versioning. So we can talk about that a little bit further, but it does some nice things for us. When we commit new data into Pachyderm, we can automatically trigger pipelines, so doing things like streaming and batch analysis.
That's the basic overview. Regarding, you know, where it came from, the cofounders, Joey and JD, who kinda started things, they ran data infrastructure at Airbnb for a while. They were on that team, so they were working with Hadoop, and they were also among the earliest engineers at RethinkDB. So they worked a lot with data infrastructure and basically started seeing that new things like NoSQL databases and Redis and Docker and CoreOS felt really awesome and were great to work with, and things like Hadoop felt a little bit ancient. So the vision in starting Pachyderm was really kind of a reimagining of Hadoop that's built on top of modern tools like Docker and Kubernetes.
[00:03:26] Tobias Macey:
And I'm sure I could probably hazard a guess, but where does the name come from? Pachyderm, the name in general, is
[00:03:32] Daniel Whitenack:
kind of this defunct classification of mammals that doesn't really make any sense. It kind of means, like, thick skinned mammals, like elephants or hippos or rhinos. But being that it kind of refers to elephants, I guess the name Pachyderm originally came from kind of a not so subtle pointer to the elephant imagery used in the Hadoop ecosystem. Because again, this really,
[00:04:01] Tobias Macey:
the inspiration came as a reimagining of what Hadoop would look like in a modern world, I guess. So what are some of the competing projects in the space, and what are some of the features that Pachyderm offers that would convince someone to choose it over the other options? There's a couple things.
[00:04:17] Daniel Whitenack:
They mostly fit into either the category of data versioning or data pipelining, not necessarily both. So there's some projects like Flocker from ClusterHQ and Infinit, which, I think it was this week or last week, was just announced as acquired by Docker, or even Git Large File Storage. These projects kind of latch on to the idea of, like, stateful Docker containers and data versioning. Whereas on the other side, you have projects like Luigi or Airflow that are doing data pipelining things. So in comparison to Pachyderm, I guess the one distinction would be that these things are targeting one or the other of those topics. However, at Pachyderm, we're very passionate about the idea that combining data versioning with data pipelining is really way more powerful than trying to glue something like Luigi onto another framework for data versioning.
Basically, that data pipelining is best paired with data versioning. There's this kind of symbiotic relationship between the two in our framework, and it really allows data engineers and data scientists to produce workflows that are reproducible, collaborative, and easily distributed as well. And one of the
[00:05:46] Tobias Macey:
factors that you call out in the marketing site and the documentation is the fact that because of the integration of the data versioning and the pipelining, it makes it easy to track the
[00:05:57] Daniel Whitenack:
provenance of the data. So why is that such an important capability of the system? There's a couple answers to that question. There's different aspects or different, I guess I should say, benefits when it comes to data provenance. The first thing is that, really, as we build more data pipelines and models and statistical analyses and other things that impact users directly, we're gonna be held more and more accountable for that. And this is already being seen in some, like, EU regulations that are coming out that basically are telling people that they have a right to an explanation for algorithmic decisions.
So without that part, the data provenance, you can't really give that explanation. And so there's, like, a compliance element to this or an auditing element, maybe if you're an insurance agency or something like that. But then the other side of it is that data provenance is really a precursor to true collaboration in an organization. So if I'm developing some nifty model or some analysis or some amazingly efficient pipeline, I'm not working in a silo. And I want people to be able to reproduce what I've done using the same data that I used to produce the same results that I saw.
And this isn't really possible at all if you don't have some way of tracking how your data is moving and changing and, some way of pairing that with the actual analysis that you're doing. And
[00:07:30] Tobias Macey:
the versioning capability of the data is definitely something that is not as widely used in a lot of contexts, largely because, from what I've been able to gather, the primary storage mechanism for distributed data lakes is the Hadoop file system, which I know is based on Java. But what does Pachyderm use as the base layer for being able to provide that versioning capability, as well as the distribution and scaling mechanisms that are used in conjunction with that? One of the kind of major
[00:08:03] Daniel Whitenack:
mindsets that we have at Pachyderm is that we want people to stay focused on their problems, solving these data engineering challenges and using their expertise to solve them, and not necessarily having to worry about this file system. So, basically, PFS sits on top of any generic object store, like S3 or GCS or Ceph or Swift. And, basically, what we do is treat this object store like a CDN for data. That, along with some smart caching, allows us to have this distributed, scalable, and resilient data along with kind of the solid performance. And because all of this is scheduled on Kubernetes, it means that, basically, for end users, we can scale up and down arbitrarily to service more requests or add more cache space, for example. And if I recall properly, I believe that in one of the earlier permutations of the Pachyderm system, it used btrfs for being able to handle the versioning of the data as well. Is that correct? It's possible. I'm a recent addition to the team, so I'll have to defer to my colleagues on that particular one. I know that we've utilized FUSE, although in our most recent release, there's been some changes to that where we actually prefetch some data without relying on FUSE, which gives some performance advantages.
[00:09:34] Tobias Macey:
And the versioning capability of the data, I know, is handled at least partially by tracking the diffs as you apply different manipulations to the data itself. So given that the current state of the data is computed by applying the various diffs on top of the base state of the data, I'm wondering how that affects the additional storage capacity that's necessary for storing those changes, as well as the performance impact that having to apply those various diffs to compute the current state might have when somebody is trying to do analysis on top of it? Yeah. That's a great question and kind of basically getting into
[00:10:15] Daniel Whitenack:
the trade offs of data versioning. But I would say, so Pachyderm definitely doesn't save full copies of files in terms of saving the different commits of your data into versioning. We only store the diffs, so it's space efficient in that way. And to just give you an idea as far as, like, the space efficiency in terms of numbers, we store maybe about 64 bytes of metadata per 8 megabyte block of actual data that you're pushing into versioning. So it's pretty space efficient in that way. Also, in terms of, like, the actual computations, one of the things that's nice is, like, a Pachyderm job or a Pachyderm pipeline either subscribes to a certain repository of data or is actually run on a specific commit of data.
So for each analysis, basically, you're analyzing data at a certain state. You're not having to kind of scrub through a history of data commits in order to figure out how to process your data. So on a per job or per pipeline basis, we're processing one commit of data that's coming in, and then we're making one commit of data out. And for a typical use of Pachyderm, would somebody keep all of the revisions
[00:11:35] Tobias Macey:
of the data in perpetuity, or are the changesets primarily just useful in the context of the analysis workflow, and then the final state of the data would then get merged back into the base layer? This persistence question is really, I guess, on a case by case basis.
[00:11:50] Daniel Whitenack:
We've seen some users that do want to keep the whole history of their data. So maybe if you're thinking back to what I said earlier about an insurance company, or maybe a company that's, like, labeling people as fraudulent or whatever it is, it could be that you want an audit trail for your data a good ways back, and so you might be persisting this whole commit history in perpetuity. Other companies, like, let's say, maybe a smaller web company that has certain policies around privacy or whatever it is, or maybe they're only concerned about the data for the past 48 hours, they might stream that data in and then basically delete those repositories or those commits afterwards.
Or they might have a part of their pipeline that specifically anonymizes and persists only certain data after a period of time. Pachyderm is designed in general to store as much or as little data as you would like to persist. The mechanism is the same around commits and branches.
[00:12:57] Tobias Macey:
It's just up to you how you'd like to organize and persist it. And given the versioning capability, is this something that would be feasible to use as sort of a secondary store for being able to keep a historical record of transactions for a primary database, for somebody using, for instance, Postgres? Yeah. We've actually seen a couple of uses
[00:13:18] Daniel Whitenack:
specifically with Postgres, actually, in the last couple weeks. Basically, one use case had to do with a series of tables being generated in a pipeline and then dumping those tables into Pachyderm, in order to basically have the whole history of those tables that were being generated in Postgres in Pachyderm's file system. So it's definitely something that we see, and it's something that is oftentimes maybe a first step for people as well. So maybe they have this pipeline that's creating, pushing data to Postgres or whatever data warehouse they're using, and they just want the history of that, such that they can either revert or such that they can analyze diffs between certain times or whatever it is. This might be a first step in order to kind of start getting into Pachyderm and learn about the versioning: to actually use that as kind of a time machine for your database. But then those companies also, once they kind of get there, they begin to think about, okay, now I have this kind of time capsule for my data.
This also allows me to kind of keep my analyses in sync with that data over time. And this is what sometimes leads into
[00:14:41] Tobias Macey:
bringing in the analysis pieces as well. And what's the typical interface for somebody trying to load their data into the Pachyderm file system, either streaming or in bulk? Is it that they would generally upload it as a batch job to S3 or Google Cloud Storage, or is there an interface that Pachyderm provides to sort of abstract the backing data store away so that they could just interact directly with Pachyderm itself? The object store and Kubernetes
[00:15:07] Daniel Whitenack:
can be basically transparent to a user. So Pachyderm does provide its own interface to these things, and you can think of it as very similar to how you interact with GitHub via Git. You have files locally, or you have files somewhere or data somewhere, and you can commit that data to Pachyderm via commits, and you define a branch, and then you can transfer those files or that data. And this could be everything from, you know, some of our users are committing hundreds of gigabytes at a time into their data repositories, maybe slightly less frequently than other customers who are maybe feeding pipelines off of Kafka or something like that. And they might have a service that basically is just making high frequency commits into Pachyderm, which is then triggering streaming analysis within Pachyderm. So it's everything in there. We have a CLI tool that will let you do these things manually and inspect things manually, but also you can utilize our Go client. We have a nice Go client, and there are other clients in the works as well, a Python client and actually a Rust client in the works, that will basically make this kind of interaction universal as well.
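To make the Git-like workflow Daniel describes a bit more concrete, here is a rough sketch of loading data with the Go client he mentions. The repo name, file name, and address are made up for illustration, and the import path and function signatures are assumptions based on the 1.x client of that era, so treat this as a sketch rather than a definitive reference.

```go
package main

import (
	"log"
	"os"

	// The Pachyderm Go client Daniel mentions; the import path and the
	// signatures below are assumptions based on the 1.x client and may
	// differ between releases.
	"github.com/pachyderm/pachyderm/src/client"
)

func main() {
	// Connect to the pachd service exposed by the cluster (the address is
	// a placeholder for a local deployment; adjust as needed).
	c, err := client.NewFromAddress("127.0.0.1:30650")
	if err != nil {
		log.Fatal(err)
	}

	// Create a versioned repository, analogous to creating a Git repo.
	if err := c.CreateRepo("events"); err != nil {
		log.Fatal(err)
	}

	// Open a commit on the master branch, write a file into it, and close
	// the commit.
	commit, err := c.StartCommit("events", "master")
	if err != nil {
		log.Fatal(err)
	}
	f, err := os.Open("events.json")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := c.PutFile("events", commit.ID, "events.json", f); err != nil {
		log.Fatal(err)
	}
	if err := c.FinishCommit("events", commit.ID); err != nil {
		log.Fatal(err)
	}
}
```

The important bit is that finishing the commit is what makes the new data visible, and, as described above, is the event that triggers any pipelines subscribed to the repository.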
[00:16:34] Tobias Macey:
And are there any particular data or serialization formats that are better supported in Pachyderm? Or is it largely just a matter of the language and tools being used to actually interact with the data once it ends up in the Pachyderm file system? Like, are there any particular capabilities that the versioning layer adds on top of serialization formats such as Avro or Thrift? Or is it largely agnostic to that? Generally, I would say it's largely
[00:17:00] Daniel Whitenack:
agnostic to that. So the Pachyderm file system and the way that your analyses work with it is basically just like you would work with any other file system. So how you interact with versioned data is the same way that you would interact, via file IO, with any other files. So you can utilize whatever libraries, whatever formats you like. There are some interesting features for certain data types. So there's some functionality that's been built in around JSON data, like line based delimitation of data. And there are other things to possibly be aware of. For example, if you have lines of JSON data and you modify a certain field within one of your JSON blobs, the way that we do parallelization, it might have to reprocess that entire blob, rather than just processing the new part of your data as it would if you had only added a new line.
So there's some caveats here and there, but for the most part, it's just generally
[00:18:08] Tobias Macey:
whatever files you wanna use and interacting with them the same way you would with any other file system. So as you mentioned earlier, another one of the compelling features of Pachyderm is the fact that it can natively support the use of any language for interacting with and analyzing the data. So I'm wondering why that's such an important capability, and what is it about some of the other systems that might make it more difficult to achieve that same result? This is a hugely important point in my mind,
[00:18:35] Daniel Whitenack:
working both with data scientists and data engineers. I've seen that basically everyone comes from a different background. They've used different tools. They prefer different tools. Data engineers are maybe more interested in JVM languages like Java or Scala, whereas some data scientists might have come from, like, a statistics background and work with Mathematica or R, and other ones maybe are, like, scripting with Python. So there's this whole range of tools and whole range of backgrounds, and this is really the situation we're in and will be in for quite some time. And I would say that no one really knows the tools that they're gonna be using even a year from now. So building up your pipelines in a language agnostic way is hugely important, and also in the sense of, like, bridging this gap between data engineers and data scientists.
Again, data engineers, we love to think about how something will scale and might get very frustrated sometimes with scientists who build incredibly inefficient code. One of the nice things about Pachyderm is you're utilizing these very simple file IO operations to access your data. You can write something simple, whether it be in R or whether it be in Java or whether it be in Python, and insert it into Pachyderm and instantly be able to distribute that and scale it across your data, and also instantly be able to have your data scientists' Python script interacting in this distributed, scalable way with your data engineers' parsing and other pipelining stages that are maybe written in Java or Scala.
So I think this is hugely important. Other frameworks, for example, like Airflow or Luigi, they might be tied to specific languages. A lot of times you see things tied to Python, which is kind of limiting, especially, as I say, for data engineers that really like JVM languages. And it really doesn't promote the kind of autonomy for individual data scientists and data engineers to be able to say, like, this is the best tool for this situation, I'm gonna choose it, and I will be able to deploy it in a consistent way. That's kind of what the Pachyderm philosophy is: choose the best tools for these different stages, and we'll deploy them in a very consistent way that will also be reproducible across your data.
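As a small illustration of the plain file IO model being described here, this is roughly what a single pipeline stage could look like in Go: Pachyderm mounts each input repo under /pfs inside the container and versions whatever the stage writes to /pfs/out. The repo name, record fields, and line-delimited JSON input are assumptions made for the example; the same stage could just as easily be a Python or R script.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// A hypothetical record shape for the line-delimited JSON input.
type record struct {
	User  string `json:"user"`
	Event string `json:"event"`
}

func main() {
	counts := map[string]int{}

	// Walk every file Pachyderm has placed under the (hypothetical)
	// "logs" input repo mount.
	err := filepath.Walk("/pfs/logs", func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()

		// Treat the input as line-delimited JSON, one record per line.
		scanner := bufio.NewScanner(f)
		for scanner.Scan() {
			var r record
			if err := json.Unmarshal(scanner.Bytes(), &r); err != nil {
				continue // skip malformed lines
			}
			counts[r.Event]++
		}
		return scanner.Err()
	})
	if err != nil {
		log.Fatal(err)
	}

	// Anything written to /pfs/out becomes the versioned output commit
	// of this stage.
	out, err := os.Create("/pfs/out/event_counts.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	for event, n := range counts {
		fmt.Fprintf(out, "%s\t%d\n", event, n)
	}
}
```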
[00:21:10] Tobias Macey:
So a couple of questions off of that. The first one being, I'm wondering if there are any built in capabilities for being able to alert on any failures in the data pipeline so that they can be caught and remedied early on. And if there are failures, I'm assuming that it's easy to resume from where that failure occurred because of the fact that it does have that versioning step, so that each step of the pipeline, I'm assuming, is associated with a particular version marker in the file system itself. Yeah. That's
[00:21:42] Daniel Whitenack:
a great deduction and very much the way it is. So Pachyderm does have some failure capability built in, including notifications of failures, but also being able to get your logs from your containers that are running in Kubernetes for particular jobs that are running off of particular commits. But also, like you said, there's this feature that is maybe a little bit subtle when you start thinking about it, but very powerful when you grasp it. The fact is that, like, if you have this complicated pipeline with maybe 7 different stages that branches out these ways, and then there's 5 other stages over here and 2 others over here, what Pachyderm allows you to do is, on commits into the input of your pipeline, things will run and maybe something fails, but you don't have to rerun every single stage in order to recreate your result. Because things are versioned, all of those commits to the other data repositories, the successes, still exist.
So basically, what you can do is gain some efficiency on the next run. You know, once you fix your problem, maybe you fix your image and you update your pipeline, you can run just on that new data, on those stages that need to be run, in order to produce your result. In the
[00:23:07] Tobias Macey:
versioning of the file system, does it also record any sort of metadata as far as the time that the version was created, so that you can go back and see when a failure happens? Can you see the actual time of the last data commit that happened? Because I can see where that would be potentially useful in a number of situations.
[00:23:24] Daniel Whitenack:
Yeah. And this gets to another really great implication, I guess, of data provenance. And by data provenance, what I'm meaning is this data came out, or this result came out, of this pipeline, or maybe a stage of a pipeline failed. With things being versioned and with the provenance features of Pachyderm, what you can say is, this series of commits to this series of data repositories was processed by this series of pipelines, eventually leading to the point that you're interested in, whether that be a weird result that you get or whether that be a certain failure in a certain stage. So this also has some really great debugging functionality, because once something fails, you can look at the whole provenance of what went where, what was transformed where, in order to produce the input that is actually fed to the stage that failed.
And you can look back, and like you said, you can look at the different timestamps of those commits and the different commit IDs, and look at the state of that data wherever you'd like in any series of data repositories that you would like. As well, the pipelines and the jobs have time information about when they ran, when they finished running, and there's some inspecting
[00:24:49] Tobias Macey:
functionality that's built into Pachyderm there. And I can imagine that another thing that would be fairly important for being able to properly track the provenance and reproduce the results at various stages of development is having a good way of being able to reliably determine the exact container images that were being used for each iteration of the pipeline. So I'm wondering if there's any sort of integration with a Docker registry or any sort of mechanism for caching the actual container images that were used at various points in the pipeline? You're definitely right. I mean, you can version your data, but if you don't know
[00:25:27] Daniel Whitenack:
what images were paired with that, then you don't explicitly know what the complete story was. There's a couple of points here. The first point being that Pachyderm will work with, or can pull images from, any Docker registry that you're running, whether that be an internal private registry or Docker Hub or whatever it is; you can pull images from any of those places. Because of that, and because that interaction is just as you would expect when working with a registry, you can make use of tags in order to properly tag your images as far as which tagged image was used in which analysis. This information will be tracked. As well, there's some nice development functionality related to this, because it's very possible that during development, you pull an image in a Pachyderm pipeline and then it fails and then you have to correct it, so you have to change the image. And, basically, you might not want to keep updating the tag, or it might not be ready for you to tag your production container yet. But what you can do is utilize this update pipeline functionality in Pachyderm, which, under the hood, will pull the latest build of your Docker image, and it will basically create a unique tag for that image and store it in an internal Docker registry.
And each time you update, it will do that. So basically, during this development cycle where you're iterating on the images you're using, you can do that very quickly with this kind of internal tagging. And given that you're using Kubernetes as the orchestration mechanism for the containers, does it also support the rkt container image format? Right now, we only work with Docker containers, which I think covers most cases. There is some conversation going on around that, and I can also provide some links to that and to our Slack channel for more information on that after the podcast. Yeah. So I'll make sure that all that gets in the show notes.
[00:27:35] Tobias Macey:
And another question that I had about the pipelining capability and the intent of being able to move the analysis code that's generated by data scientists into the operational pipeline without having to reimplement all the algorithms, because I know that that can be a fairly large investment when you're using one language to do the analysis and running the actual production deployment of it in another language. But it also potentially runs into the situation where the analysis code itself, even though you don't have to rewrite it, might not be appropriately scalable to have the necessary performance characteristics. So I'm wondering what are some of the things to watch out for there, and what are some of the mitigating
[00:28:20] Daniel Whitenack:
capabilities that Pachyderm might provide? In this sense, the way that we distribute our analysis via containers and Kubernetes can be very powerful. Again, as I've kind of switched back and forth from data engineering to data science throughout the things that I've done, I've seen exactly what you've talked about. So this hard transition from data scientists having something they like on their laptops and transitioning that into something that can actually be used in production. A really nice feature of Pachyderm is that Pachyderm can really smartly distribute your analysis across your data, even if that analysis is kind of, quote, unquote, dumb analysis, or kind of simplistic, without a lot of thought as far as parallelization or anything like that. For example, a recent example I did, I was just using basic scikit-learn models, which don't necessarily scale super great. But I was able to do kind of this massive hyperparameter optimization with just these simple scikit-learn models, because I could put a simple scikit-learn model in a container, and then I could use Pachyderm to say, spin up a hundred of these containers and supply each one with one hundredth of my data, or you could say, like, supply each one with different parameters for the optimization.
And then at the end, we can have a container that reduces these results. And in that sense, your data scientists didn't have to worry about making some model that was, out of the box, easily distributed, but they were able to do what made sense in their mind and distribute that easily with Pachyderm. So that's one thing to keep in mind. And that, kind of paired with this language agnostic feature, allows your data engineers to write very efficient things in Java, Scala, or Go, or whatever it is, and kind of feed that into this distribution of data science tools.
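A hedged sketch of the fan-in side of the hyperparameter example Daniel describes: assuming each of the fanned-out training containers wrote a one-line "params<TAB>score" file into an upstream repo that Pachyderm mounts at /pfs/scores (both the repo name and the file format are hypothetical), a final reducing stage only needs ordinary file IO to pick the best run.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"path/filepath"
	"strconv"
	"strings"
)

func main() {
	bestParams, bestScore := "", -1.0

	// Each training container is assumed to have written a small result
	// file of the form "<params>\t<score>" into the upstream repo, which
	// Pachyderm mounts here at /pfs/scores.
	files, err := ioutil.ReadDir("/pfs/scores")
	if err != nil {
		log.Fatal(err)
	}
	for _, fi := range files {
		if fi.IsDir() {
			continue
		}
		data, err := ioutil.ReadFile(filepath.Join("/pfs/scores", fi.Name()))
		if err != nil {
			log.Fatal(err)
		}
		parts := strings.SplitN(strings.TrimSpace(string(data)), "\t", 2)
		if len(parts) != 2 {
			continue
		}
		score, err := strconv.ParseFloat(parts[1], 64)
		if err != nil {
			continue
		}
		if score > bestScore {
			bestParams, bestScore = parts[0], score
		}
	}

	// Whatever this stage writes to /pfs/out becomes its versioned output.
	result := fmt.Sprintf("%s\t%f\n", bestParams, bestScore)
	if err := ioutil.WriteFile("/pfs/out/best.txt", []byte(result), 0644); err != nil {
		log.Fatal(err)
	}
}
```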
[00:30:29] Tobias Macey:
And the integration of the data pipelining and dependency graph is definitely very useful, particularly given the level of integration with the file system and the versioning interface. So I'm wondering if that precludes any requirement for using external tools such as Luigi or Airflow. Basically, the answer is yes. I mean, again, we're very passionate that the
[00:30:51] Daniel Whitenack:
combination of data versioning with the pipelining piece is really what empowers these innovative, reproducible pipelines. And you can define this whole pipeline, whether it's a very simple ETL pipeline, maybe it just consists of one stage, all the way through a very, very complicated pipeline used for research or whatever it is; you can define all of the specification with Pachyderm and pull in whatever images you need and whatever languages you need. Pachyderm provides this way of doing the pipelining, as well as something that Airflow and Luigi kind of don't provide, which is a natural way for you to keep your pipeline in sync with your input data. Because the versioning is there, you know exactly which data has been updated.
And so your dependency scheduler should only process that new data. And in terms of the
[00:31:48] Tobias Macey:
analysis patterns and sort of algorithmic approaches, I noticed that the documentation mentions the MapReduce pattern. And I'm wondering what other sorts of approaches are supported, such as streaming or interfacing with tools like Apache Drill. MapReduce is definitely
[00:32:05] Daniel Whitenack:
possible in Pachyderm. I don't think we could be a proper reimagining of Hadoop without some MapReduce pipelines. Developers have flexibility here. So you could have MapReduce, you could have multiple maps, and you could have filters. You could have a single nonparallelized pipe where you're basically utilizing the data versioning capabilities, maybe to do some ETL. Some of our users are doing just that. They're processing log lines with a Pachyderm pipeline to parse values out, or format output, or maybe process web related events. Pachyderm is a natural fit for both streaming and batch processing. And once you start kind of thinking about this data versioning combined with analysis mentality, really the issue of whether something is streaming or batch to some degree disappears, because we have users that are doing, quote, unquote, streaming with Pachyderm because they're making very high frequency commits into data versioning. And those high frequency commits are triggering pipelines in a streaming fashion to process all of those commits, those events coming off of maybe a message queue, maybe Kafka, and process that, while other users are doing very, quote, unquote, batch things where they're maybe making a commit of a large amount of data daily and doing some aggregation and driving a dashboard or whatever it is. So, yeah, all of those scenarios are covered.
One of the challenges in kind of starting with Pachyderm is, because there are so many possibilities with those, sometimes it can be hard to think, what should I start with? But I would say, for those that are interested in these more straightforward ETL things, that's kind of where you should start with Pachyderm. And maybe for those looking to do distributed MapReduce, we have examples of that too,
[00:34:07] Tobias Macey:
and all of that's in our docs. And speaking of getting started, what does the deployment story look like for somebody who wants to start experimenting with Pachyderm, either locally on their laptop or in a production context? Locally, it's very easy.
[00:34:21] Daniel Whitenack:
Kubernetes now has this Minikube program. So, basically, you can use that to deploy Pachyderm within, I think, 3 commands now, and I think we're getting rid of one of those commands fairly soon. But, basically, you have to deploy Minikube, and then you run pachctl deploy to deploy your Pachyderm cluster within Minikube locally. And that's actually how I do a lot of development work too. I'll just have Minikube running on my laptop and develop my pipelines in that way. So that's another really good use of it. But then also we have deploy scripts as well and instructions for how to deploy on Google or Amazon and Azure.
Most of the time, if there are any issues, it might be in spinning up your Kubernetes cluster. And once you have your Kubernetes cluster up, deploying Pachyderm again is just like deploying a Kubernetes application. So if you have Kubernetes up in your respective cloud provider, or if you can get it up, then that next step to Pachyderm is very straightforward.
[00:35:30] Tobias Macey:
And what are some of the most interesting or unexpected uses of Pachyderm that you've seen? The one that I really like, coming
[00:35:38] Daniel Whitenack:
from a physics background especially, is our work with General Fusion. So General Fusion has probably been, I think, the most interesting and longest running use of Pachyderm. They're a company that is designing the world's first full scale demonstration of a fusion power plant. So it's pretty cool stuff. Some of the things that they liked about Pachyderm were, one, the language agnostic piece, which allowed them to not have to worry about rewriting their whole tech stack, and also this data versioning piece.
So they had a bunch of different research teams that were working on various things. Before Pachyderm, they would pull things down onto their laptops and work on their analyses. But inevitably, those analyses would fall out of sync with their latest calibration. So this has basically eliminated that kind of frustrating cycle and allowed them to collaborate in a more sane way. Another project that I'm pretty excited about is a project with one of the major newsrooms that we're working with right now, basically building a time machine for a comment thread.
So as we all know, comment threads on blog posts or Reddit or wherever it is can get a little bit hairy. So what we're trying to think about is building a way for moderators to basically have time snapshots of what a thread looked like at various stages in the past. So that's this data versioning piece, as you can imagine. But then on top of that, we can pair the different snapshots with different statistics and aggregations of metrics for those different times. So we could use NLP, for example, to get sentiment for various users at a certain time window and that sort of thing. Those are the ones that I'm definitely enjoying watching right now. And building something like Pachyderm is certainly a fairly time consuming process,
[00:37:41] Tobias Macey:
and I'm sure that you and everybody else who you work with isn't doing it for free. So I'm wondering what the business model behind Pachyderm looks like. Right now, I think we're 8 people in total.
[00:37:52] Daniel Whitenack:
I mean, everybody's an engineer, really. We've focused a lot of the time until now on building the core platform. But kind of the business model around that, and something we're already seeing, is that, along with other open source projects that are out there that are actually backed by some business entity, people love open source because they can dive right in and they get everything for free, of course. And we provide a great community around it, including a Slack team and all of that. And there's a lot of open discussion about it. But as well, a lot of enterprise users want a little bit more hands on support.
So providing that engineering support to them, and support in general, is one way to bring in those funds that are kind of powering the development.
[00:38:43] Tobias Macey:
So what are some of the areas that you're looking for help from the community? And are there any particular issues that the listeners can check out to get started with the project? We are really looking for two things right now. I mean, users and contributors. Those are the two big ones. Right? We're definitely
[00:38:59] Daniel Whitenack:
wanting users to come online, join our Slack team, which you can find on our website, pachyderm.io, and basically discuss with us what your use cases are. And like I said, we are very, very much wanting users right now. So if you come on there, our whole engineering team will be very quick to respond to you and help you get your toy model up, and then a prototype and maybe a POC, and help you all the way along that road. And that's really helpful for us, because along that road, we get feedback about kind of little kinks in workflows. So everyone has a slightly different workflow that they'd like to see, and figuring out trends in where people are getting stuck, that's really helpful to us at this point. So anyone who is even slightly interested, I would encourage you to get online and just try out a few things, give us some feedback, and that would be much, much appreciated.
On the contributor side, there are definitely a lot of open issues online. In the past, we have tagged quite a few as noob friendly, I think, for new contributors that want to start contributing. I'm not sure the current status of how many of those there are, so that would be one thing to look for. But if you're not finding those, some initial things that would be really helpful: we put up all this documentation, and just going through each of the examples and doing two things. So fixing any documentation errors is obviously helpful. But also, as you go through these different examples, everyone has a different setup on their laptop, or they're deploying to maybe a different cloud provider or whatever it is. So as you're going through those examples, you're also finding potential bugs or maybe issues with our deploy scripts. And those are things where you can hop on our Slack channel, see if anybody is experiencing the same thing, or look through the issues, see if it's been brought up, and then bring up those issues and maybe even address them as well. And what is the language or languages that Pachyderm is actually implemented in, so that people who are familiar with those languages can go ahead and check it out? Yeah. Pachyderm is written in Go. So that's primarily the language that you'll find throughout the repository.
And, if you're wanting to contribute, although we are in the process of doing some front end things as well, as far as the UI is concerned. If there's interest on that side as well, that's definitely
[00:41:36] Tobias Macey:
something that we're exploring. So are there any other topics that we should cover before we close out the show? No. I think that covers,
[00:41:43] Daniel Whitenack:
most things. It's a good discussion. Great.
[00:41:46] Tobias Macey:
So for anybody who wants to keep up to date with you and the project and, get in touch, what would be the best place for them to do that? You can find me online.
[00:41:55] Daniel Whitenack:
I'm dwhitena on Twitter. And also, regarding Pachyderm itself, I mentioned our website, pachyderm.io. Make sure and check that out. And on the website, there are links to our GitHub and our Twitter and our Slack. Even if you're just experimenting or playing around with the examples, it would be great if you joined our Slack channel and just had some basic discussions with us about Pachyderm. So that's a great way to get connected. And various of us are also on the Gophers Slack if you're a Go programmer or maybe a potential contributor and wanna talk to us there. So Alright. So as an aside,
[00:42:37] Tobias Macey:
this is the first interview I've recorded for this podcast. So I'm currently trying to figure out a good sort of closing question to ask all of the guests. Some of the ones I'm considering are: what's your favorite dataset? Or, if you had a free weekend, what sort of analyses would you do? Although, given that it's a data engineering focused podcast, I'm not sure that those kinds of questions would be the appropriate focus. I don't know if you have any suggestions. I think the free weekend one is a pretty good one. I mean, it could be phrased not only in terms of analysis, but in terms of your
[00:43:11] Daniel Whitenack:
side projects or what you're involved in with the main project. You know, what would you tackle on a free weekend that you can't get to this next week? Okay. That's a good one. I'm sure I'll probably go through a few permutations of it. But,
[00:43:27] Tobias Macey:
let's go with that one. Alright. Sounds good. Okay. So if you had a free weekend to spend working on anything, either related to Pachyderm or not, what do you think you would be spending it on? Well, there's a lot of things in the queue. But I think if I had to choose,
[00:43:43] Daniel Whitenack:
I would work on my side project, gophernotes, which is the Go kernel for Jupyter notebooks. And the one thing I'm wanting to implement in that, which I think is a major feature that needs to be there, is inline plotting in the notebooks.
[00:44:01] Tobias Macey:
And that's probably what I would work on. Well, I really appreciate you taking the time out of your day to join me and talk to me about the work that you guys are doing at Pachyderm. It's definitely a very interesting project, and it seems to be solving a lot of important issues for people in the data engineering and data management space, as well as people doing data analysis, potentially making it easier for them to manage their own pipelines. So it's certainly something I'll be keeping an eye on. And I'm sure that a number of the listeners will be interested in taking a look at it once they hear a bit more about it. So thank you again. Yes. And thank you for the opportunity. It was a great time.
Introduction and Podcast Information
Guest Introduction: Daniel Whitenack
Overview of Pachyderm
Origins and Inspiration Behind Pachyderm
Competing Projects and Unique Features
Importance of Data Provenance
Versioning and Storage Mechanisms
Use Cases and Data Persistence
Integration with Databases
Loading Data into Pachyderm
Language Agnostic Capabilities
Failure Handling and Debugging
Tracking Container Images
Scalability and Performance
Pipelining and Dependency Management
Supported Analysis Patterns
Getting Started with Pachyderm
Interesting Use Cases
Business Model
Community Involvement and Contributions
Technical Implementation
Closing Questions and Final Thoughts