Summary
Pandas is a powerful tool for cleaning, transforming, manipulating, and enriching data, among many other uses. As a result, it has become a standard tool for data engineers across a wide range of applications. Matt Harrison is a Python expert with a long history of working with data who now spends his time on consulting and training. He recently wrote a book on effective patterns for Pandas code, and in this episode he shares advice on how to write efficient data processing routines that will scale with your data volumes while remaining understandable and maintainable.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
- The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
- Your host is Tobias Macey and today I’m interviewing Matt Harrison about useful tips for using Pandas for data engineering projects
Interview
- Introduction
- How did you get involved in the area of data management?
- What are the main tasks that you have seen Pandas used for in a data engineering context?
- What are some of the common mistakes that can lead to poor performance when scaling to large data sets?
- What are some of the utility features that you have found most helpful for data processing?
- One of the interesting add-ons to Pandas is its integration with Arrow. What are some of the considerations for how and when to use the Arrow capabilities vs. out-of-the-box Pandas?
- Pandas is a tool that spans data processing and data science. What are some of the ways that data engineers should think about writing their code to make it accessible to data scientists for supporting collaboration across data workflows?
- Pandas is often used for transformation logic. What are some of the ways that engineers should approach the design of their code to make it understandable and maintainable?
- How can data engineers support testing their transformations?
- There are a number of projects that aim to scale Pandas logic across cores and clusters. What are some of the considerations for when to use one of these tools, and how to select the proper framework? (e.g. Dask, Modin, Ray, etc.)
- What are some anti-patterns that engineers should guard against when using Pandas for data processing?
- What are the most interesting, innovative, or unexpected ways that you have seen Pandas used for data processing?
- When is Pandas the wrong choice for data processing?
- What are some of the projects related to Pandas that you are keeping an eye on?
Contact Info
- @__mharrison__ on Twitter
- metasnake
- Effective Pandas Bundle (affiliate link with 20% discount code applied)
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Metasnake
- Snowflake Schema
- OLAP
- Panel Data
- NumPy
- Dask
- Parquet
- Arrow
- Feather
- Zen of Python
- Joel Grus’ I Don’t Like Notebooks presentation
- Pandas Method Chaining
- Effective Pandas Book (affiliate link with 20% discount code applied)
- pytest
- Great Expectations
- Hypothesis
- Papermill
- Jupytext
- Koalas
- Modin
- Spark
- Ray
- Spark Pandas API
- Vaex
- Rapids
- Terality
- H2O
- H2O DataTable
- Fugue
- Ibis
- Multi-process Pandas
- PandaPy
- Polars
- Google Colab
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
The only thing worse than having bad data is not knowing that you have it. With Bigeye's data observability platform, if there is an issue with your data or data pipelines, you'll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user friendly interface, and automated yet flexible alerting, you've got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey, and today I'm interviewing Matt Harrison about some useful tips for using Pandas in your data engineering projects. So, Matt, can you start by introducing yourself?
[00:01:43] Unknown:
Sure. Thanks, Tobias. So my name's Matt Harrison, and I have a computer science degree. I've worked in the data realm for almost 20 years. When I graduated with my degree, I went to work for a search company. Since then, I've also worked for a testing company, a storage company. For the past couple years, I've mostly been doing corporate training in the Python and data science space. So most of my career has been using Python, and it's all been data related. So that's my background here. What I spend most of my time doing these days is training people in Python and data science and also doing some consulting as well. And you mentioned that you kind of got your start working
[00:02:30] Unknown:
in search, and I'm wondering if you can talk a bit more about some of the sort of early days of your involvement in data management and how you're working in the search space kind of, I guess, introduced you to this overall problem space?
[00:02:45] Unknown:
I guess my search background, this is back in 2000, I worked for a company that, back in the good old days, Google had this yellow box that you would take and plug into your network, and it would, like, scan all of your documents and give you local search. And so the company I was working with was sort of making a competitor for that. That was actually how I got introduced to Python, writing tooling for that. Following that, I worked for a testing company doing testing management and testing stacks. And then I did business intelligence, which was, I guess, more of an introduction or fun with, you know, what at that point in time were large datasets and, you know, snowflake and star schemas, all that sort of fun stuff. To put a little bit more context to what we're talking about today, when I was doing business intelligence, basically, we wrote, I would say, an OLAP engine to manipulate data using Python.
A little bit after that, Pandas came out. And so there's a lot of overlap between what I was doing and what Pandas was doing. And so that's, I guess, some background. So I started using Pandas soon after that and wrote a couple of books on it. I've been teaching it to clients as well. So that's, I guess, a more complete
[00:04:04] Unknown:
story. You mentioned trying to build this OLAP engine and some of the capabilities of Pandas. And so I'm wondering if you can maybe dig into some of the main tasks or the core workflows that you've seen folks in the data engineering and data management space using pandas for? Maybe later on, we can dig into some of the ways that pandas could be applied to OLAP.
[00:04:26] Unknown:
In case your audience doesn't know, pandas is not named after the cute animal. Pandas is named after a data structure, the panel data structure, which is a data structure no one uses. You can think of it as a 3D data structure that doesn't really get any use, sort of like how Python is named after Monty Python. But Pandas, to be a little bit more precise, is used for manipulating structured or tabular data, something that you would find in a database or in Excel. And so from that point of view, we were using Python to manipulate tabular data and slice and dice it. And Pandas offers a lot of similar capability. But I guess the dirty secret, or the secret power, of Pandas, and this is probably a lot of the reason why Python is popular in the data science realm these days, is that pandas is built on top of technology called NumPy. And NumPy is a Python library that basically lets you manipulate numeric data but not as Python objects. So Python objects have overhead. In Python, everything is an object. And so if you make a bunch of Python numbers, there's a bunch of overhead to that. Contrast that with making, like, a C array of numbers: there's no overhead to it. It's just a buffer of numbers. And so NumPy basically gives you an interface to do something similar to what you'd do with C, but writing Python code.
And then I like to say that Pandas is some nice cleanup and makes it a little bit easier, nicer presentation than NumPy, and allows you to mix data types as well in that tabular or structured format.
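To make that object-overhead point concrete, here is a minimal illustrative sketch comparing a list of boxed Python integers with the equivalent NumPy buffer; the exact sizes will vary by platform:

```python
import sys
import numpy as np

# One million integers as Python objects vs. as a contiguous NumPy buffer.
nums = list(range(1_000_000))
arr = np.arange(1_000_000)  # int64 on most platforms

# Rough size: the list container plus every boxed int object it references.
list_bytes = sys.getsizeof(nums) + sum(sys.getsizeof(n) for n in nums)
array_bytes = arr.nbytes  # raw buffer: 1_000_000 * 8 bytes

print(f"Python list of ints: ~{list_bytes / 1e6:.1f} MB")
print(f"NumPy int64 array:   ~{array_bytes / 1e6:.1f} MB")
```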
[00:06:06] Unknown:
Given the sort of nice high level interface and the useful capabilities of being able to work with these tabular data structures, it's definitely a very natural fit for a significant portion of data processing workflows that are text oriented, working with tabular data from databases or CSV files, etcetera. And I'm curious, what are kind of the main workflows that you've seen folks use Pandas for, whether it's ETL jobs or data prep or data cleaning, some of the sort of core workflows that Pandas is being applied to?
[00:06:45] Unknown:
So Pandas can slice and dice. So if you're doing, you know, ETL, extract, transform, load, or whatever people are calling it these days, certainly, you can do a lot of that stuff with pandas. You can restructure your data, so you can pivot it if you want to. A lot of the machine learning that people are doing is being done with pandas. Not pandas directly, but, you know, to go along with the old story that 80% of machine learning is cleaning your data and then the 20% of it is complaining about cleaning your data, you can say that a lot of that 80% of cleaning the data could be done with Pandas if you're in the Python world and working with structured data. So what I've seen in my experience, again, sample size of one, but I've consulted for some larger companies and worked with a lot of large companies, is that a lot of people, if they're sort of on the Python bandwagon, and that seems to be a good number these days, use Pandas for anything that's tabular. And so that could be data analysts just sort of replicating something you would do in Excel. That could be data scientists getting data ready before throwing it into a model, and that could also be data engineering, pulling something out of a database or some other source and then doing some transform on it that maybe couldn't be done in SQL or whatever language their data source speaks, or if they need to apply additional functionality that the whole Python language and ecosystem gives them on top of that.
[00:08:20] Unknown:
And so you're mentioning being able to do some of the kind of tabular manipulation that maybe folks would do in SQL. And 1 of the kind of common paradigms that I've seen folks do is either to translate from Pandas to SQL or from SQL to Pandas or even trying to add a SQL layer on top of Pandas. And so I'm wondering what are some of the kind of concepts that do map cleanly between those 2 landscapes and some of the ways that folks might end up shooting themselves in the foot because they're trying to apply, you know, SQL based logic or ways of thinking about transformations from the SQL space into pandas and maybe missed opportunities to be able to take advantage of the features that Pandas provides?
[00:09:08] Unknown:
You hear that a lot. People are like, why would I use Pandas? I have SQL, or vice versa. And I think it's sort of an apples to oranges comparison, Tobias. Certainly, you could do a lot of things that you do in SQL with pandas. In fact, I have a course on pandas, and part of that course is like, here's a bunch of SQL, and we're going to rewrite that in Pandas. Actually, it's a SQL course, and the SQL course has Pandas equivalents for it. And then you show that to people, and they're like, oh, the SQL makes sense, but the Pandas doesn't, or people who are familiar with Pandas say the opposite. So I think one thing is just familiarity, but I think you should know the pros and cons of both. So oftentimes, if you have a database, a database is optimized for certain operations and can do them quickly. And the less you have to, like, move around lots of data, the better. So if you're doing a bunch of filtering in the database, that can often improve your performance quite a bit versus throwing all of your data into pandas and then doing some filtering and only keeping 1% of it there. That might be an expensive operation in pandas, but could be done relatively quickly in the database.
I think part of it is, you know, familiarity. Part of it is what you're optimized for. Part of it might be business reasons. I mean, certainly, you have come across cases where it's like, this is the database, and you can only read these parts of it, and your queries can only do these certain things. And so you might have business reasons to, you know, extract your data and then do some other manipulations. And if Pandas is your tool of choice, that might make sense there as well.
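As a rough illustration of how SQL concepts map onto Pandas, here is a small sketch with a hypothetical orders table; the table, column names, and values are made up:

```python
import pandas as pd

# In SQL the same question might read:
#   SELECT region, SUM(amount) AS total
#   FROM orders
#   WHERE status = 'shipped'
#   GROUP BY region;
orders = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "status": ["shipped", "shipped", "pending", "shipped"],
    "amount": [100.0, 250.0, 80.0, 40.0],
})

total_by_region = (
    orders[orders["status"] == "shipped"]         # WHERE
    .groupby("region", as_index=False)["amount"]  # GROUP BY
    .sum()                                        # SUM(amount)
    .rename(columns={"amount": "total"})          # AS total
)
print(total_by_region)
```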
[00:10:46] Unknown:
To the point of saying, I'm just gonna pull all my data into pandas and do a bunch of manipulations there, or in the case where you're using it in your ETL pipeline and so you haven't actually even loaded it into the database yet, what are some of the types of misuses or mistakes that folks might find themselves making when doing Pandas transformations or data manipulation? Things that seem like they'll work when they're maybe just proving out the concept, but as they start to try and scale up to larger datasets or do this in a repeatable fashion, will start to become problematic, either because it will be slow and unoptimized, or because what seems correct at a small sample size ends up compounding into a large error as you scale up the usage or use it over a longer period of time?
[00:11:42] Unknown:
That's a good question. I've got a couple takes on that. First one is probably just to make people aware that another way to describe pandas, and sometimes I've described it this way, is as an in-memory NoSQL database. The point there is that it's not really SQL, but you can basically do SQL-style slicing and dicing with it. But the other point is that it's in memory. So Pandas itself requires you to have your data in memory, which might be a little bit different from your database. Right? Your database generally has your data on the disk, and you can sort of query it from the disk, and it will pull some into memory and that sort of thing. Pandas doesn't work that way. In Pandas, you basically pull everything into memory. That's one thing to be aware of. Just because you've got, you know, a disk that has however many gigs or terabytes on it, unless you have that equivalent amount of memory, pulling that into Pandas is probably not gonna work.
So that's one thing. And another thing to be aware of is that, in general, pandas does not mutate. Rather, when you're doing manipulations, it tends to copy that data, so you do need some overhead for that as well. Generally, I recommend, like, if you're playing with a dataset in pandas, you probably want 3 to 10x the amount of memory for some overhead for manipulating that, which might be a little bit different. As far as, like, issues of scaling out, for going beyond one machine, I'll be honest with you, Tobias. I haven't had much experience with scale out systems. I've played a little bit with Dask and a little bit with Spark, but I haven't really used them in, I guess, happiness or in anger in either personal or consulting projects. So I can't really speak a lot to that per se, but I can speak to people trying to process pandas on computers with a larger amount of memory, which I have done, working and consulting with clients who have that setup.
We specifically don't want to set up a Dask or Spark cluster, but we'll go and rent a large machine on some cloud platform and then just run some Pandas code on that rather than trying to scale out. And that might be an effective way to do some of these things, just because you're running it on a single machine rather than, you know, a handful or tens or more machines. That might be a little bit more optimal. But some other things to watch out for with pandas that might get you, one is just what your data source is. Right? A lot of people have data in CSV, comma separated value files. And while pandas can probably read 99.99% of CSV files out there, CSV files in and of themselves are inherently problematic.
And so there's no type information in there. They're good in that you can read them, open them up, and look what's in there. But other than that, there's no type information. There's not really compression per se, although you can, like, zip them or whatnot. So if you're having, like, a lot of overhead sort of reading that data and then doing some type manipulation on that, you might wanna consider using an alternate format for that, especially if you're working in the big data realm. Something like Parquet might work. Arrow and the feather format might be another option there as well.
Another thing just to be aware of, again, because Pandas is putting everything in memory, is making sure that you're using the correct types, especially if you're reading from a CSV file. Pandas makes some guesses about some things. It tries to convert some things to numeric types, but it's gonna use 8 byte wide types for floats or integers, which may or may not be appropriate. And it's worth changing those types, and also changing the types used for string representation. By default, Pandas basically uses Python strings for string representation, which is nice in that Python strings handle Unicode and allow you to do a lot of things that way. The downside of that is it's not particularly fast, and there might be some memory overhead, especially if you have, like, low cardinality categoric data.
And so, again, by default, CSV files don't handle that, but you can. I mean, I've seen datasets where, by converting a few columns to categorical columns, you save 90% of the memory overhead. So if you think about it, you know, Pandas is a small memory tool, where by small memory, I mean a single machine tool. You know, being able to shrink that memory by a good amount can make it so you can actually load up a lot more data and process a lot more data in it as well. One more thing, I guess, on that note, you asked about, you know, what do I think people run into. And one thing I have seen is that Python has this thing called the Zen of Python, and the Zen of Python says there should be one, and preferably only one, way to do things. And that's, I guess, a good goal, but oftentimes not realistic or not the reality of what we see in either the Python world or especially in pandas. Pandas often has 2 or 3 or 4 ways to do things. And so a lot of times, you will see advice on how to do something in pandas or people saying this is the one true way to do things in pandas.
And sometimes it's good, but sometimes that advice might be size dependent, where depending on how big your data is, certain operations might be faster or slower. Like, an operation that's fast with a small amount of data might end up slower at scale than an operation that's slower with a small amount of data, or vice versa. So one thing that you would want to be aware of is, you know, if you're profiling your data and you're saying, we're using this because it's fast, I would recommend that you profile it with a dataset that's a similar size to what you're actually using, not, like, 1% of that. Because oftentimes, if you're using a very small dataset,
[00:17:26] Unknown:
your timings might be off. As a brief side note, I always find it interesting how people generally misquote the Zen of Python, that there should be one obvious way of doing things. That "obvious" word is usually what gets left out. But I think I left it out. Yes. That's okay. I'll allow it this time. And then to your point of being careful and judicious about the typing information that you're specifying as you're loading these datasets in, and the impact that it can have on the actual in memory representation and the amount of overhead that it allows for in your processing: for datasets that are going to increase over time, where, you know, maybe this operation works just fine today, but 5 years down the road, when your dataset keeps getting larger, it's just going to start exploding and you have to keep getting larger and larger machines, that's a useful optimization tip.
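Here is a minimal sketch of that dtype and categorical point; the column names are hypothetical, and the exact savings depend on your data:

```python
import numpy as np
import pandas as pd

# A low-cardinality string column plus a float column, similar to what read_csv
# would give you with default type inference.
df = pd.DataFrame({
    "state": np.random.choice(["UT", "CA", "NY", "TX"], size=1_000_000),
    "value": np.random.rand(1_000_000),
})
before = df.memory_usage(deep=True).sum()

# Convert the repetitive strings to a categorical and downcast the float.
slim = df.assign(
    state=df["state"].astype("category"),
    value=df["value"].astype("float32"),
)
after = slim.memory_usage(deep=True).sum()
print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")

# When reading a CSV you can declare the types up front instead
# (file name and columns are hypothetical):
# df = pd.read_csv("data.csv", dtype={"state": "category", "value": "float32"})
```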
And then to the point of the processing overhead, I'm also wondering if there are any sort of utilities built into pandas or specific functions or development styles that allow for reducing the amount of impact that you get because of the fact that you're doing these copy operations rather than doing mutations in place, and ways that you can maybe tell Pandas to do, like, a copy on write so that you don't actually increase the memory footprint until you perform some operation, or being able to do some sort of lazy evaluation on it to reduce the amount of kind of explosion in memory space as you do these different manipulations on it? There is a bug in Pandas, like, that sort of deals with this copy on write. So
[00:19:23] Unknown:
as of last time I checked, which was a few months ago, it was, like, sort of, hopefully we get to this in the future sort of thing. So I will say this, Tobias. Like, there is a common misunderstanding here: a lot of the operations in pandas have an inplace parameter where you can say inplace equals true, and people think that if they do that, they'll save memory. Generally, that is not the case. In fact, there's another bug in Pandas to deprecate inplace and remove it completely, because at this point, it's sort of misleading. People think they are saving memory when they aren't saving memory. So I think, again, it's probably judicious to, you know, look at what your memory usage is today, and then, if you're planning on growing a lot, to sort of figure out where you will need to scale. Again, using types correctly can help bring that down. But as of now, to play in the pandas realm, you have to sort of accept that you're gonna have some memory overhead to do that. And I think we're gonna get to this later, but it might be the case that some of these up and coming competitors to Pandas might provide a solution there.
Or, you know, in the future, Pandas might have some sort of, like you said, copy on write or query optimizer similar to, like, what a database has, where it can do things intelligently. I mean, I think there are some cases right now where pandas probably could do things a little bit more intelligently, but due to the flexibility of pandas and the way people use it, they sort of have to do it the copy way
[00:20:55] Unknown:
instead. Today's episode is sponsored by Prophecy.io, the low code data engineering platform for the cloud. Prophecy provides an easy to use visual interface to design and deploy data pipelines on Apache Spark and Apache Airflow. Now all the data users can use software engineering best practices: git, tests, and continuous deployment with a simple to use visual designer. How does it work? You visually design the pipelines, and Prophecy generates clean Spark code with tests and stores it in version control; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage.
Finally, if you have existing workflows in AB Initio, Informatica, or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy, making them run productively on Spark. Learn more at dataengineeringpodcast.com/prophecy. You also mentioned the Arrow integration for Pandas. And for folks who aren't aware, I think I've done some shows about this in the past, but Arrow is an in memory columnar sort of tabular representation of data that allows for in memory interchange and interoperability between different programming languages without having to flush it to disk so that you can do, for instance, integration between Java and Python or Python and R, etcetera.
And along with that, it has also been adding in capabilities to handle a lot of the IO manipulation of fetching data and pushing data to and from different storage systems and storage formats. And I'm wondering what you see as the kind of indicators of when it makes sense to bring Arrow into the equation when you're doing Pandas operations, and some of the maybe complexities that that brings in, either from an installation perspective or integration, or some of the ways that you need to think about how to actually tell Pandas when to use Arrow and when not to?
[00:22:59] Unknown:
Yeah. So full disclosure, I'm sort of, I guess, a Python person. I haven't used R really ever. I have programmed in Java and some Lisps and some other languages. But as far as my data work, I mean, it's mostly Python and, I guess, some Excel and some database work there. So I can't really speak too much to using Arrow in my work. My exposure to it is more limited to the Feather format. And so Arrow is this in-memory representation. Feather is a persistence representation to sort of serialize that to disk or whatnot. And pandas can leverage that pretty easily.
You need to install, I believe, PyArrow, the package. And if you install the PyArrow package, pandas already has the methods to read and write from that; it just exposes the methods, but the implementation isn't there unless you have the library, so you just need to do a pip install of that library. And the benefit of that is, like what you said, this Arrow format and this Feather representation is basically a way to, you know, shuttle data around to different systems. And, again, I haven't seen my clients needing that in practice. But, I mean, sample size of one; presumably, there are people who, you know, are doing something in Java and then maybe something in R and then maybe something in Python. And so that would make sense there. My understanding is that, basically, there's a serialization mechanism from the arrow package that you pass a pandas data frame into, and then it represents it as this arrow object.
And then, you know, if you need to load that in R or Java, you sort of get that for free. But as far as, like, serialization, that can be a huge boon. We talked about how CSV files are particularly problematic in that they don't have type information. They're not necessarily compressed. And so Pandas has to work pretty hard to parse those, and it can take a long time. In fact, there's a lot of optimization that goes into read CSV. And, I mean, I just saw some other blog post the other day talking about, like, oh, you can do these options or, you know, alternate read CSV implementations, those sorts of things. But one thing that you can do, once you've basically got your types set up straight, is persist your data to something like Parquet, or to something like Arrow using this Feather format.
And then when you read it and write it, generally, it just works and you get the type information. You don't lose that, and it's fast as well. So my experience has been, yeah, use that if, you know, serialization is a problem. And I could imagine, you know, you'd get similar benefits if you're in an environment where you've got data scientists, data engineers using multiple languages. And they already have, you know, some blob of memory representing that data, and they don't want to make a copy of it. They just wanna start using it in Java or R. There will still be a copy of it. Right? Because my understanding is that you'll have your pandas dataset, and then you're making a new copy of that using the Arrow format. So there is some overhead there, at least at that mechanism. And Arrow has some subset of Pandas functionality. It's not quite as featureful as Pandas to
[00:26:19] Unknown:
manipulate that data. Yeah. The majority of my experience with Arrow has largely been with being able to translate between Pandas data frames and parquet files and being able to handle some of the sort of tabular representations. And it also has capabilities to make it easy to do partitioned table sets of parquet files, so if you're dealing with multiple Pandas data frames over a longer time period, then you can use that partitioned dataset to automatically translate from, you know, your CSV files that you get from the FTP server every day into an incrementally updated set of parquet files that live in your data lake so that you can process them with, you know, Presto or Athena or etcetera.
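For illustration, a minimal sketch of persisting a frame to Feather and to a partitioned Parquet dataset; it assumes pyarrow is installed, and the file paths and column names are hypothetical:

```python
import pandas as pd

df = pd.read_csv("events.csv", parse_dates=["day"])  # hypothetical source

# Feather (Arrow on disk): fast round trips that keep the dtypes intact,
# handy for intermediate results between pipeline steps.
df.to_feather("events.feather")
df = pd.read_feather("events.feather")

# Parquet: compressed, typed, and partitionable, which suits data lake
# layouts queried by engines like Presto or Athena.
df.to_parquet("events_parquet/", partition_cols=["day"])
```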
[00:27:08] Unknown:
Sounds like that certainly could be useful, and I could see, you know, me 15 years ago doing reporting, that would certainly have been useful at that time. The other thing that you were mentioning
[00:27:19] Unknown:
was the fact that in the data ecosystem, data engineers are often interfacing with data analysts and data scientists. And so a lot of the logic and code that we're writing is going to need to be read and understood across different teams. And I'm curious what you have found to be some of the useful practices or stylistic elements for being able to write the data transformation and manipulation code, keeping in mind the fact that you are either going to be handing off these datasets to data scientists and data analysts, or those data scientists and data analysts will need to be able to read and understand and possibly contribute to and manipulate the code that you're writing and being able to clearly express the business logic that you're actually trying to achieve beyond just shuttling bits around for the fun of it? Yeah. That's a great question. And, you know,
[00:28:17] Unknown:
you hear it said all the time that the most important skill of a data scientist, or especially, like, a developer these days, especially in sort of work from home situations, is the ability to communicate. And I'd build on top of that, to your question: collaboration and communication are hard and difficult. And when you involve not just, you know, human language, but also computer language, it can be problematic as well. Because what I've seen, to advise on this, is that a lot of people, you know, I train a lot of people in Pandas, and a lot of those people who are using Pandas more from maybe a data science or analyst side, not necessarily the data engineer side, tend to want to use something like Pandas as a tool for doing their job.
And they don't want to necessarily approach this as, hey, this is programming, and there are best practices that have been gleaned over the years from programming, and I should learn those and start speaking that language or whatnot. People are gonna call me a programmer, and I really don't want to program. I just want to do my analysis or whatnot, and Pandas and Python and some of these other tools make it really easy and handy for me to do that and get access to my data. And so that's one thing. It's just sort of the mindset of a lot of people, that I don't wanna be a programmer. Right? Another one that people need to be aware of, and this is maybe Python ecosystem specific, is, I guess, the Jupyter phenomenon.
So there's a talk from a couple years ago by a fellow named Joel Grus, and it's called, like, I Don't Like Notebooks. I think that was his talk. And, basically, Jupyter, if you're not familiar with it, is a notebook environment for interacting with code. It can interact with various what they call kernels, but, generally, most people that I've seen who are interacting with it are using Python on the back end. And so his talk was like, these notebooks just tend to get completely out of hand. And sort of the running joke is that at the end of every day, you will take your notebook, and then you will go to the file menu and say save as and just take the name of it and tack on today's date, as sort of a poor man's version control.
Another thing about notebooks is that there's no inherent order to them. I mean, they do have an order, but you can run things out of order. And so that was one of Joel's complaints. It's like, if you give me a notebook, there's no guarantee that I can even run it, even though it works for you. So, just to be clear, I happen to like Jupyter. I use it almost every day. I train people with it. I have clients who use it. So I like it. And I also like, you know, Python. I like programming. I like Pandas. However, all of these have the capability to be misused, which I think is sort of getting to your point in maybe a long winded way. So some best practices, specifically with Pandas, but also somewhat related to the Jupyter side: in my course of consulting and using Pandas and teaching Pandas, I've seen a lot of Pandas code.
And my experience is that a lot of that is bad because of bad software practices, but also people being sloppy, people putting things out of order. And one way to sort of help with that, I found, is this thing called chaining. And so my latest book on pandas, Effective Pandas, sort of goes over this in great detail. But the point is that most Pandas operations will return a Pandas DataFrame or a Pandas Series. And so because you have these operations, you can chain them on top of each other. So, generally, what I will do, Tobias, is I will take my raw data after I've loaded it from a CSV file or from SQL or, you know, from parquet or whatnot. And then I will chain these operations to start manipulating my data. And what I'll do, first of all, when I have fresh raw data, is I'll make a chain that will basically clean it up. And, you know, if there's some things that need to be cleaned up or renamed or tweaked, or if I've got missing data, I'll do all that in sort of one fell swoop.
The advantage of using a chain like that is I can sort of debug it as I go, and so I kinda make sure that it works as I go. But it's also forcing me to think about it as a recipe. These are the steps that I did to my data. And then after I have this chain, it's very trivial to take it and put it into a function. So after I've got my chain, I put it into a function. The next thing I do is I take the code that loads my data and then this function that sort of cleans it up, does the janitorial work, and put it at the very top of my notebook. So the next time when I come to my notebook, I just run those 2 cells. They're at the top, and I know that my data works, and I'm good to go. And that addresses a lot of Joel's points about why notebooks don't work and people misusing them by not doing software engineering. It tends to help with a lot of that because now you're not doing global state. You're using functions, and then that will lead you to be able to test it as well. So that's something that sounds simple, but for a lot of people, you know, getting to that point can be a little bit of a challenge, or it can be a reframing of how they're looking at their code.
But I found it to be very successful. Generally, I would sort of compare it to, like, whitespace in Python. A lot of people who are new to Python, they see the whitespace thing, and they're like, I don't like that whitespace. And then after, like, a day of using it, most of those concerns just sort of go away. And I think it's similar to this. A lot of people haven't seen code written in that style, but after they do, and after they've experienced it and see sort of these benefits of, like, I can sort of debug it as I'm going along, and then it makes my notebooks very clean, and then I can easily put it into a function. And now I can take this function and put it in a library if I want to and share it with engineers or with data scientists or analysts. Or if I want to start doing tests on it, I can put it into some continuous integration system and start doing tests on it. All of those things are possible, and they're sort of enabled by basically adopting this chaining style.
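For illustration, here is a minimal sketch of that chaining style (not code from the book); the file name and column names are hypothetical:

```python
import pandas as pd

def tweak_events(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean the raw data in one chained 'recipe'."""
    return (
        raw
        .rename(columns=str.lower)                         # normalize column names
        .assign(
            day=lambda df_: pd.to_datetime(df_["day"]),    # fix types
            amount=lambda df_: df_["amount"].fillna(0.0),  # fill missing values
        )
        .query("amount >= 0")                              # drop bad rows
        .reset_index(drop=True)
    )

# Top of the notebook: load once, clean once, and every cell below
# works off the cleaned frame.
raw = pd.read_csv("events.csv")  # hypothetical source
events = tweak_events(raw)
```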
That is one thing that I found that helps a lot. You know, sort of extra mile stuff on top of that goes back to this notion of, like, people don't wanna be software engineers. I think that probably doesn't relate super much to your crowd because it's more of the data engineering side, which is sort of software engineers applied to data. But to this crowd, I would say that once you have code written in this style, in addition to looking like a recipe and being easy to debug, one of the things that, you know, data engineers get a lot is, like, they have some result or answer and the boss comes back to them and says, I want you to tell me what happened. So they have a sort of data provenance issue. And if you have this sort of chain of operations all in one place, it makes it very easy to trace through your code. And when your boss asks you, what's going on there? You can tell them, this is what happened to the data as it went through, and that's why we got the result we did. In addition, you can start, like I said, leveraging some software engineering best practices. So it's pretty easy to leverage something like pytest to start putting some tests around your pandas code at that point. If you want to go further than that, there are other libraries as well, libraries like Great Expectations.
Great Expectations allows you to basically wrap pandas, and then you have some ways to say, I expect this column to be strings and I expect them to be only these 5 types of strings, or I expect the mean value of this column to be less than this, or I expect that the percent of missing values is less than this. So you can put all these assertions on your data, and then you can start running tests on that with your data and then make sure that your data is what it looks like. You can also use things like Hypothesis. I've done this. Hypothesis, for those who aren't familiar with it, is a testing library in Python for doing property based testing.
So the notion there is you have some invariant in your code. An invariant is something that should hold true. And so, similar to some of these Great Expectations things, you might say, you know, after I've loaded my data, I should have data. Right? I shouldn't have filtered out all my data. And so you can start doing tests using Hypothesis and Pandas. And Hypothesis also has this notion of generating data for you. So you can basically write a little bit of code that will generate a data frame based on some sort of schema. And then it basically tries to do some fuzz testing or whatnot, where it generates different types to make this property fail, or see if it can fail. So that can be useful as well to test your Pandas code and make sure that you weren't missing maybe some corner case that your data hasn't shown you but might pop up in the future.
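As a sketch of what those invariant checks might look like with pytest, using the hypothetical tweak_events() cleaning function from the chaining example above (Great Expectations or Hypothesis would layer onto the same idea):

```python
# test_tweak.py, run with: pytest test_tweak.py
import pandas as pd
from my_pipeline import tweak_events  # hypothetical module holding the cleaning function

def test_tweak_events_invariants():
    raw = pd.DataFrame({
        "Day": ["2022-01-01", "2022-01-02"],
        "Amount": [10.0, None],
    })
    out = tweak_events(raw)

    # Invariants we expect to hold on any cleaned frame.
    assert not out.empty                                     # we didn't filter everything out
    assert pd.api.types.is_datetime64_any_dtype(out["day"])  # dates were parsed
    assert out["amount"].isna().sum() == 0                   # no missing amounts remain
    assert (out["amount"] >= 0).all()                        # negative rows were dropped
```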
So those are some, I guess, extra bonus best practices for the software engineering side. But the idea is to sort of move this from, like, this is just a notebook with random code spewed all over the place, to something more structured, and taking it more as, like, this is code, and let's apply some software engineering best practices to it as well. Yeah. It's definitely useful
[00:37:35] Unknown:
advice, particularly for the case where a data engineer is handed a notebook from a data scientist or an analyst who said, I did some exploratory analysis, now we wanna productionize this. And you say, okay, well, what's actually happening here? But being able to take the root data frame and say, okay, well, these are the operations that are happening on it, so now I'm going to build up this chain as I go and validate that what I'm getting at the end of it is the same as what happens in this, you know, notebook with dozens of cells that don't necessarily all make sense together. And then another interesting entry in the sort of ecosystem of Jupyter and repeatability and data engineering is the project called Papermill, for being able to actually convert these notebooks into deployable and repeatable artifacts, where you can actually parameterize it and say at the top of the notebook, these are the parameters; based on that, it's gonna feed through the notebook, it's always gonna execute top to bottom, and then the output is going to be stored in S3, etcetera, so that you can have that be an actual artifact that's used in your pipeline definition, as this kind of interchange format between data engineers and data scientists.
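A minimal Papermill sketch of that pattern; the notebook paths and parameter names are hypothetical, and the input notebook needs a cell tagged "parameters" to receive the injected values:

```python
import papermill as pm

# Execute the notebook top to bottom with injected parameters and keep
# the executed copy as a pipeline artifact.
pm.execute_notebook(
    "daily_report.ipynb",
    "output/daily_report_2022-01-15.ipynb",
    parameters={"run_date": "2022-01-15", "output_uri": "s3://my-bucket/reports/"},
)
```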
[00:38:46] Unknown:
Yeah. Maybe one more point if people are using notebooks as a form of collaboration, or maybe two more, two things that I like. One is there's a plug in called Jupytext. Jupyter itself is stored as JSON, and JSON doesn't really work well with Git for doing, like, source control. So if you're trying to apply software engineering best practices to this, it just makes it somewhat problematic to do, like, diffs, because they don't really line up. This tool, Jupytext, basically converts back and forth between your notebook and your code, and you can serialize it to basically a pure Python file, which Git handles fine. And so I've actually used this a lot, and it makes it really easy to make diffs. And you can look at the differences in those diffs, and it looks like a normal, like, Python file with a diff in it, so you can pick out those changes. So if you're adopting sort of these best practices of software engineering, you might wanna check that out. Another one is if you are collaborating with other people in notebooks: I tend to use notebooks in two ways. One is sort of this loosey goosey exploratory data analysis style where you might have a bunch of cells that you've sort of played around with, and you might have run them in arbitrary order, which is fine for yourself. But once you want to collaborate with others, it's not fine. And that's the first style, this loose style. The other one is if I'm collaborating, I want to make sure that whatever I hand off to my colleague, I can run it from the start to the end and not have an exception. Right? And that you run it in order. If you can't do that, then you probably shouldn't be using that notebook for collaborating.
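For illustration, a small sketch using the Jupytext Python API to pair a notebook with a diff-friendly script; the Jupytext CLI can do the same pairing, and the file names here are hypothetical:

```python
import jupytext

# Convert the notebook to a plain .py file (percent format) so Git diffs stay readable.
nb = jupytext.read("analysis.ipynb")
jupytext.write(nb, "analysis.py", fmt="py:percent")

# ...and round-trip the script back into a notebook when needed.
nb2 = jupytext.read("analysis.py")
jupytext.write(nb2, "analysis_roundtrip.ipynb")
```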
[00:40:12] Unknown:
Yeah. There's definitely an interesting and constantly growing ecosystem around Jupyter and its use in the sort of data manipulation, data engineering, data science landscape. So definitely interesting to see how that continues to evolve. And so you mentioned some of the useful principles for being able to create these repeatable and understandable workflows for using pandas for analysis, and ways that you can structure it to support testing and validation, because data quality and data provenance are definitely core issues in building data engineering tools and workflows, so that you can understand, like you said, when the boss comes along and says, well, why am I getting this number? You can actually go through and say, you know, point to the line, this is why, this is what's happening, and then you can have the conversation about whether that's accurate or not or, you know, what you might need to change.
And then the other thing that we've touched on briefly is this question of scaling from a single machine to saying, okay, now I need to process, you know, terabytes or petabytes of data. And so then you start to jump into these tool chains of Dask or Modin or Ray or, you know, the Koalas library for being able to use the Pandas API on Spark data frames. And recognizing that you don't have a lot of direct experience with some of these tool chains, I'm curious what you have seen as some of the useful sort of attributes to look for as you're making that evaluation or figuring out, like, when to make this jump from, I have this Pandas workflow, now I need to take this sort of same API but apply it to these scale out multicore architectures.
[00:41:55] Unknown:
Yeah. As I said before, like, I don't have a ton of hands on experience with these, but let me sort of speak to this at a high level and then sort of describe what I see going on in the Python and data world here. I do hear a common complaint. People are like, why would I use Pandas? It's, like, a single machine, and I'm a big data person. It doesn't make sense to use that. You've already mentioned some of those alternate platforms. And so there are various alternate platforms. Like I said, I don't have a lot of experience with them. Let me just sort of talk about some of them, and then we can make maybe a meta commentary after doing that. So we've got Spark. Spark originally had a Scala interface and a Python interface, and they had this data frame interface.
And the data frame was kind of pandas esque. It had a lot of pandas operations. In fact, you know, when I was using Spark, I had, you know, here's my pandas operations, here's my Spark operations, and I'd say, like, 80% of them were very similar. Right? Eventually, though, apparently, the demand was that we don't want 80%; we want to shoot for higher than that. So they came out with this other project called Koalas, which was basically a Pandas implementation on top of Spark, but, like, instead of 80%, shooting for a much higher number than that. As of a recent version of Spark, that has now been pushed into Spark proper. That was a Databricks extension. Now that is part of Spark, and it's now called, I believe, the Pandas API.
So the merging of that is now in there. So Spark natively speaks the Pandas API. So that's sort of a key thing to take away, because as I go through these other things, you'll see that recurring. And then we have Dask, and Dask is a scheduler. And you can think of it as sort of an alternate implementation of Spark, but it has some things that basically allow you to scale Pandas operations out to multiple machines. And it has its own library, but where it has holes in its library, it basically falls back to calling pandas for those operations. So Pandas sort of fills in the gaps. So close competition there to Spark and the Pandas API is Dask, which basically has a Pandas API as well. You have Ray, which is a scheduler, and then you have another library called Modin.
Modin is basically a Pandas API that you can plug a scheduler into. So you can plug in Ray as a back end for that, or you can plug in Dask as a back end for that as well. So, basically, another way to scale out your data. And let me just go through some other ones as well. So there's one called Vaex. I'm not sure how to pronounce that, "vex" or whatever. But, basically, that is a Pandas API for what I would call small data. In my terms, small data is what fits on a single machine. Whereas pandas is what fits in a single machine in memory, this Vaex library is a Pandas API, but it does memory mapping as well. So similar to a database, where a database holds the data on a disk and can pull it in as necessary, that's what this library does. So it basically allows you to use not just your memory but the disk as well to store your data.
There's another library called RAPIDS. Well, there are subprojects of RAPIDS, and the subproject that we're probably interested in here is called cuDF, which is an API for, I'm gonna call it, small data, and it is a pandas API that leverages the GPU. So the idea there is instead of saying import pandas as pd, you would import the cuDF library as pd. And, basically, hopefully, 99% of your code doesn't change, but now it's running on the GPU instead of the CPU. So various optimizations can be done with that. There's another one I just heard about recently, Terality.
I'll call this a serverless, auto scalable pandas service. So, you know, they are a software as a service, and, basically, you throw your Pandas code on that and it scales it out and does its magic, but it's speaking pandas. There's a library called DataTable. This is by H2O, which is a company that's doing machine learning. It's basically similar to Vaex: a small data, memory mapped library, basically using the Pandas API. There's Fugue and Ibis. Both of those are basically Pandas APIs for Spark, so they're leveraging Spark's back end, but with a Pandas API on top. So in addition to Koalas and the Spark Pandas API, there's two more APIs that you can run with Spark as the back end. There's a library called multiprocess pandas, which basically is just some additions to pandas. When you're writing apply code in pandas, apply is notoriously slow because it's going through pandas rather than taking advantage of this NumPy architecture that we talked about.
And so that's a way to leverage multiprocessing on your apply code to speed that up. So that would be sort of a bonus library that you might want to consider if you're using apply a lot. There's another library called PandaPy, which is basically an optimized small data library. They claim it's faster than Pandas, with the Pandas API. And another one, Polars, is a Pandas API implemented in Rust for small data as well. I think I've got, like, 15 or so different libraries here. But the common theme with these is that, at this point in time, Tobias, pandas is a library, but pandas is also an API.
And so I think, from my point of view, it looks like, at least in the Python world, for the next little while, unless there's some revelation, like a SQL++ that everyone adopts, the way to manipulate structured data if you're using Python is with the Pandas API. Whether you're using Pandas directly or Dask or Spark, knowing the Pandas API is going to be a useful tool.
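To make the "Pandas as an API" point concrete, here is a rough sketch using Dask; the bucket path and column names are hypothetical, and Modin or the Spark Pandas API would look very similar:

```python
import dask.dataframe as dd

# Looks like pandas, but lazily builds a task graph over a partitioned dataset.
ddf = dd.read_parquet("s3://my-bucket/events/")  # hypothetical path

daily_totals = (
    ddf[ddf["status"] == "shipped"]
    .groupby("region")["amount"]
    .sum()
)

# Nothing has executed yet; compute() runs the graph across cores or a
# cluster and hands back a regular pandas object.
result = daily_totals.compute()
```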
[00:48:04] Unknown:
Yeah. I've heard of a majority of those, but there are still a few libraries in there that I had yet to come across. It's definitely interesting to see sort of how far this has spread, and to your point of the sort of, like, SQL++, it'll be interesting to see if Pandas starts to creep its way into database engines as well, as a possible augmentation or alternative or maybe eventually replacement of SQL.
[00:48:30] Unknown:
Yeah. And, I mean, I think there could be cross pollination to what we were talking about earlier before where you have, like, a copy on write or you have a query planner for pandas as well. Right? As long as you're writing this code, what's to stop someone from not just, you know, throwing it on the GPU, but also making sure that you do predicate push down and those sort of operations in the right way to make it run even faster. Absolutely.
[00:48:53] Unknown:
And in your experience of using Pandas for your own work, working with your clients, teaching people how to use Pandas effectively, what are some of the antipatterns that engineers should be thinking about and guarding against when they're building their data processing logic?
[00:49:09] Unknown:
We talked about a few of those when we talked about Jupyter. Right? Just, like, sloppy Jupyter, writing things out of order, that sort of thing. I personally consider not chaining to be an anti-pattern. It might be hard to chain, but I truly believe that if you do chain, it's like a restriction that forces you to write better code that looks like a recipe. Another common thing that I see a lot is people calling apply. I mentioned this a little bit previously, like that multiprocess pandas optimizes apply. But a lot of people, you'll see, you know, blogs or whatnot saying, oh, just use apply, and it will allow you to run arbitrary Python code on each cell and do whatever calculation you need, which is true. It does allow you to do that, but you need to realize that when you're doing apply, what it does is it takes a single value from a cell. If it's a number, it pulls it out from this NumPy buffer that's in your memory, converts it to a Python number, which has a lot of overhead just converting it, puts it into the function, pulls it out, and sticks it back in. So there's certainly an overhead for doing that. And, generally, if you're doing apply with numeric data, my advice to clients or students is that your spidey sense should go off and say, this is a slow path. Consider whether you should be doing this in an alternate way. And, generally, there are better ways to do that, which will run 10 to 50 times faster than using apply. So that's one anti-pattern.
Another one would be, you know, not using the right types, again, especially for strings. It turns out that if you're using apply with strings, I'm fine with, like, my students or clients doing that, because when you're using strings in pandas, you're sort of in Python world already, so you're already on the slow path and apply is not really a huge hit there. However, you can leverage those multiprocessing libraries to make that run even faster. Or alternatively, use the right types: if you are doing applies on categorical data, but it's left as a string instead of a categorical, you're gonna have that run a lot slower, because if it's a categorical and you only have 4 unique values, it only has to run 4 times instead of however many rows you have. So using the correct types with apply can be useful there. I think those are the big ones. I mean, another one is just that a lot of people who I come across are like, I don't want to be a programmer, and so I just see a lot of, like, general bad code. But, you know, from the data engineering side, hopefully, I wouldn't see so much of that; it's more so just not knowing the API and not realizing that if you do things a certain way in Python, you sort of stick to that fast path, and if you do other operations, you jump back to the slow path in Python.
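A minimal sketch of that apply anti-pattern on numeric data versus the vectorized equivalent; the column name is hypothetical and exact speedups will vary:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.rand(1_000_000) * 100})

# Anti-pattern: apply boxes every value into a Python object and calls the
# function once per element.
slow = df["price"].apply(lambda p: p * 1.08)

# Preferred: the vectorized version stays inside NumPy and is typically an
# order of magnitude (or more) faster.
fast = df["price"] * 1.08

assert np.allclose(slow, fast)
```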
[00:51:47] Unknown:
What are some of the most interesting or innovative or unexpected ways that you've seen Pandas used in data processing or data manipulation workflows?
[00:51:56] Unknown:
I mean, for me, the reason I like being involved with data is to get insight and value. Making a pipeline in and of itself is maybe boring or not particularly interesting, but it can lead to value or insight that is useful. I would say some of the most interesting things for me are when clients start being able to leverage these tools to slice and dice, and that enables them to reach some insights. One example is a client I had who started slicing and dicing their server utilization and made a really cool visualization that very clearly illustrated when their clusters had a lot of usage and when they didn't, and that it was cyclical, tied to a weekly cadence. By leveraging Pandas and some of these tools, with very little code they were able to make a really compelling graph that surfaces those insights: oh, okay, every weekend the server usage goes up, and then it comes back down. Had they not had exposure to these tools and started to think about their data and organize it in a way that lets them visualize it very easily, they probably wouldn't have had those sorts of insights. So it's about enabling people to have insights and to start thinking about the things that humans are good at versus the things that computers are good at. A lot of people say, I just want a spreadsheet. Really? They don't actually want a spreadsheet. They say that because if you have the spreadsheet, then you can be the spreadsheet person, and it's hard for someone else to take that over. That's job security if you're the spreadsheet person.
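A sketch of the kind of little-code analysis described above might look like the following. The file name, column names, and aggregation are placeholders, not the client's actual workflow.

```python
import pandas as pd

# Hypothetical server-utilization log with a timestamp and a CPU column.
usage = pd.read_csv("server_usage.csv", parse_dates=["timestamp"])

days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Average utilization by day of week, then plot it to expose the
# weekly cycle (plotting requires matplotlib to be installed).
weekly = (
    usage
    .assign(day=lambda d: d["timestamp"].dt.day_name())
    .groupby("day")["cpu_pct"]
    .mean()
    .reindex(days)
)
weekly.plot.bar(title="Average CPU utilization by day of week")
```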
But humans are not good at scanning large numbers of spreadsheet rows, and they're not good at filtering data by hand. So leverage a computer for what a computer is fast at: a computer is super fast at filtering and slicing data. Then leverage a human for things like, okay, let's make a visualization, or let's look at a correlation and understand the relationship between these. Those are things humans can be good at and where you can get insights. So using computers for what they're good at and combining those powers with human capability to augment human insight is probably the most beneficial thing I've seen. And for folks who are starting
[00:54:20] Unknown:
to design their data pipelines and figure out what their manipulations look like, and they're starting to think that Pandas is the right choice. What are some of the cases where that might be misguided and they'd be better off leaning on something like Spark or other data processing engines, or pushing those operations into the data warehouse engine, or things like that?
[00:54:43] Unknown:
Yeah. I mean, I think one is just that, right now, if you're going to use Pandas, you do need some minimal Python experience. Pandas leverages things like index operations and slicing, and the way that Pandas and NumPy do that can be a little bit weird even to people who are familiar with Python. So if you're not familiar with the syntax, there's going to be some overhead there. And, as I think we've referred to a lot, if you have a large dataset, you might want to consider whether Pandas, the library, is the correct choice. It could still be the case that Pandas, the API, is the right choice. I have seen clients say, okay, I do have large data, but I still want to use Pandas. I'm going to start off using Pandas, the library, with a subsample of that data and then move to something else as needed.
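For that start-on-a-subsample approach, a minimal sketch might look like the following; the file name, sample sizes, and chunk size are all placeholders you would tune to your own data.

```python
import pandas as pd

# Quickest option: prototype against just the first N rows of the file.
sample = pd.read_csv("events.csv", nrows=100_000)

# Alternatively, stream the file in chunks and sample each chunk, so
# the prototype sees rows from across the whole file, not just the head.
pieces = [
    chunk.sample(frac=0.01, random_state=0)
    for chunk in pd.read_csv("events.csv", chunksize=1_000_000)
]
sample = pd.concat(pieces, ignore_index=True)
```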
Another place where Pandas would not be the right choice is unstructured data. Typically, deep learning works on audio, text, video, that sort of thing. You can stick that sort of data in Pandas, but that's not really what Pandas is for. So if it's not tabular, trying to leverage Pandas for it is, I would think, not an effective use of your tools.
[00:56:01] Unknown:
As you continue to work with your clients and teach people how to use Pandas effectively, what are some of the projects or parts of the broader ecosystem around Pandas and its use for data manipulation that you're keeping an eye on or that you recommend folks take a closer look at? Yeah. I mean, I think I've mentioned some of them.
[00:56:24] Unknown:
cuDF, from the RAPIDS project, which takes Pandas and puts it on the GPU, I think is going to be interesting. Some of my clients are interested in that. Being able to change an import and, instead of using the CPU, have it run 50 to 100 or maybe more times faster just by putting it on the GPU could be pretty compelling. Again, I'm keeping my eye on this whole notion of Pandas as the API, not just as the library. We're seeing that with Dask and Spark. That gives me hope that someone who's investing in Pandas today is not just making a short-term investment in a library to replace Excel, though I think it's an excellent replacement for Excel, but in something that can scale out as your data gets bigger. Hopefully, the API can help you handle that as well with minimal changes.
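A sketch of the import-swap idea is below. The file and column names are made up; cuDF mirrors a large part of the Pandas API, so for simple workloads like this the only change is often the import, though more involved code may need adjustments.

```python
# CPU version with Pandas.
import pandas as pd

df = pd.read_csv("trips.csv")
by_vendor = df.groupby("vendor_id")["fare"].mean()

# GPU version with cuDF: same read, groupby, and aggregation calls,
# but executed on the GPU (requires a CUDA-capable GPU and RAPIDS).
import cudf

gdf = cudf.read_csv("trips.csv")
by_vendor_gpu = gdf.groupby("vendor_id")["fare"].mean()
```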
[00:57:23] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:39] Unknown:
I would say one of the problems I run into is teaching. When I'm teaching, getting everything set up and installed can be a pain, and I imagine a lot of people are nodding their heads in podcast land as they're listening to that. So the notion of a hosted environment or hosted notebooks: Colab is a good start on that, but there are drawbacks, and I know that there are some other vendors who are starting to offer those sorts of things. But I think as things migrate to the cloud, making that really easy, and then also being able to switch projects very easily, is where I see the biggest gap.
Like, Colab right now is nice in that you can just go there, but it's not nice if you need a different environment; it's sort of a pain, and you still have to go through the setup. And at that point it's like, oh, well, I can just do this on my local computer, so why bother with something that's going to kick me off if I don't move my mouse for an hour?
[00:58:36] Unknown:
Well, thank you very much for taking the time today to join me and share your experience of using Pandas and how to apply it effectively to data management and data processing, and some of the ways that data engineers can think about collaborating with their analyst and data scientist peers. It's definitely always great to catch up with you and explore some of the specifics of Python and Pandas and data management. So I appreciate all the time and energy you put into that, and I hope you enjoy the rest of your day. Can I offer your listeners a discount to my latest Pandas book? Is that okay? Absolutely.
[00:59:14] Unknown:
I'll offer a discount on my Effective Pandas book.
[00:59:17] Unknown:
Alright. Yeah. So we'll put a link to that in the show notes. For folks who've listened all the way to the end, there's the bonus and the reward for sticking with it. Awesome. Thanks, Tobias. It's been a pleasure. Likewise. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Matt Harrison's Background in Data Engineering
Introduction to Pandas and Its Capabilities
Common Workflows and Use Cases for Pandas
Challenges and Misuses in Pandas
Arrow Integration and Serialization Formats
Best Practices for Writing Pandas Code
Scaling Pandas Workflows
Antipatterns in Pandas Usage
Interesting Use Cases and Insights from Pandas
When Not to Use Pandas
Future of Pandas and Related Ecosystem
Conclusion and Contact Information