Summary
Python has grown to be one of the top languages used for all aspects of data, from collection and cleaning to analysis and machine learning. Along with that growth has come an explosion of tools and engines that help power these workflows, which introduces a great deal of complexity when scaling from single machines and exploratory development to massively parallel distributed computation. In answer to that challenge, the Fugue project offers an interface that automatically translates across Pandas, Spark, and Dask execution environments without requiring you to modify your logic. In this episode core contributor Kevin Kho explains how slight differences in the underlying engines can lead to big problems, how Fugue works to hide those differences from the developer, and how you can start using it in your own work today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
- Every data project starts with collecting the information that will provide answers to your questions or inputs to your models. The web is the largest trove of information on the planet and Oxylabs helps you unlock its potential. With the Oxylabs scraper APIs you can extract data from even javascript heavy websites. Combined with their residential proxies you can be sure that you’ll have reliable and high quality data whenever you need it. Go to dataengineeringpodcast.com/oxylabs today and use code DEP25 to get your special discount on residential proxies.
- Your host is Tobias Macey and today I’m interviewing Kevin Kho about Fugue, a library that offers a unified interface for distributed computing that lets users execute Python, pandas, and SQL code on Spark and Dask without rewrites
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Fugue is and the story behind it?
- What are the core goals of the Fugue project?
- Who are the target users for Fugue and how does that influence the feature priorities and API design?
- How does Fugue compare to projects such as Modin, etc. for abstracting over the execution engine?
- What are some of the sharp edges that contribute to the engineering effort required to migrate from a single machine to Spark or Dask?
- What are some of the determining factors that will influence the decision of whether to use Pandas, Spark, or Dask?
- Can you describe how Fugue is implemented?
- How have the design and goals of the project changed or evolved since you started working on it?
- How do you ensure the consistency of logic across execution engines?
- Can you describe the workflow of integrating Fugue into an existing or greenfield project?
- How have you approached the work of automating logic optimization across execution contexts?
- What are some of the risks or error conditions that you have to guard against?
- How do you manage validation of those optimizations, particularly as the different engines release new versions or capabilities?
- What are the most interesting, innovative, or unexpected ways that you have seen Fugue used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Fugue?
- When is Fugue the wrong choice?
- What do you have planned for the future of Fugue?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Fugue
- Fugue Tutorials
- Prefect
- Bodo
- Pandas
- DuckDB
- Koalas
- Dask
- Spark
- Modin
- Fugue SQL
- Flink
- PyCaret
- ANTLR
- OmniSci
- Ibis
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Kevin Kho about Fugue, a library that offers a unified interface for distributed computing that lets users execute Python, Pandas, and SQL code on Spark and Dask without rewrites. So, Kevin, can you start by introducing yourself? My name is Kevin,
[00:02:12] Unknown:
and I work for Prefect as an open source community engineer. Prefect is a workflow orchestration company, and my day job is helping our users adopt workflow orchestration. So I primarily help them through Slack and GitHub. And then on the side, I maintain Fugue, which we'll be discussing today. And do you remember how you first got involved in the area of data? Yeah. So I was a data scientist for 4 years before joining Prefect, and I think it's just natural that as a data scientist, you know, the more, like, technical data scientists get pushed further and further back into the stack. So at my last job, we were, like, full stack data scientists and responsible for, you know, the data engineering of the pipelines that we needed in order to perform our data science work.
And just in general, as we went deeper into the data warehouse, we had to use Spark more because of the volume of data that we were ingesting. So while I was at that job, I watched the Spark + AI Summit, and there I met Han, who was presenting about Fugue at that time. And then that's when I started, you know, working on open source, getting involved in Fugue, which we'll discuss later. And then I ended up joining Prefect because somebody that I presented Fugue to, you know, told me, hey. You're probably a good fit for this role at Prefect. So I ended up talking to them. I wasn't, you know, looking to change jobs, but then I found that Prefect was a place where I could grow, like, learn to grow my own open source project. So it's kinda interesting that, you know, my day job is all about open source workflow orchestration, and my side hustle, the Fugue project, is all about open source compute.
[00:03:52] Unknown:
Given the fact that Fugue is targeted to abstract across these different execution engines to allow people to go from, you know, single machine to distributed systems, I'm wondering if you can talk to some of the core goals behind the project and maybe some of the story, as much as you're aware of anyway, about how it came to be and what it was that was lacking in the ecosystem that made this a worthwhile endeavor?
[00:04:20] Unknown:
Yeah. So there was this previous guest on this podcast from Bodo.ai. His name was Ehsan, their CTO. He framed this problem really well. Right? He talked about this performance productivity gap with high performance computing, and he was saying that in order to optimize the performance of programs with your high performance computing cluster, you really have to go deep, you know, into C++ or deep into these, like, compilers to be able to optimize performance. And in the same way, we find that in distributed compute, there's the same problem where if you want to get your code, like, as performant as possible, it takes a significant amount of effort, specialized effort, in order to optimize every little bit, and we felt that this was a barrier to adopting distributed compute.
A lot of times, if you're, like, new to distributed compute, it's very easy to just, you know, try Spark out, write some code, and then, you know, put that into production. But if you're not familiar with a lot of the concepts that are required to fine tune your program, it's very easy to fall into these pitfalls of having inefficient execution. Right? So Fugue is meant to bridge that gap from, you know, local execution and bring your code without any rewrites to distributed execution. The core goals of Fugue are, basically, that nobody should need to have specialized knowledge in order to harness distributed compute.
We want to minimize the effort for big data projects. And when we talk about effort, we talk about execution, development, and maintenance of these big data projects because you can have really good performing code. But if it's at the expense of, you know, very long development and very high maintenance, then it's not worth it. Whereas, you can have, you know, code that's not performing as great, but at the same time, you can churn a lot of products out faster. So Fugue is meant to minimize the total effort across all 3 of these things: execution, development, and maintenance. So our goal is to provide guardrails around your code so that when you bring it to a distributed setting, at the baseline level, it should perform above average compared to if an experienced, you know, Spark developer was writing that code.
And we do this by enforcing best practices. Like, for example, we enforce schema. Because, like, with Dask, for example, if you don't provide schema, there's a risk of rerunning, you know, code twice to infer that schema. So it easily doubles the computation. That can be easily avoided if you just provided the schema beforehand. Second is that there's a lot of boilerplate code when it comes to distributed compute. A lot of times, like, if you want to bring, like, a pandas function over to Spark, you often have to write maybe 2 or 3 helper functions just to bring it over. And at that point, you introduce a lot of boilerplate code into your code base, and it becomes harder to just focus on the logic.
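To ground that schema point, here is a minimal raw Dask sketch (the function and column names are invented): declaring the output schema up front via meta spares Dask from having to infer it.

```python
import pandas as pd
import dask.dataframe as dd

def add_double(df: pd.DataFrame) -> pd.DataFrame:
    # hypothetical transformation, for illustration only
    return df.assign(x2=df["x"] * 2)

ddf = dd.from_pandas(pd.DataFrame({"x": [1, 2, 3, 4]}), npartitions=2)

# Without meta=, Dask has to infer the output schema itself, which can
# mean extra execution; declaring it up front avoids that.
out = ddf.map_partitions(add_double, meta={"x": "int64", "x2": "int64"})
print(out.compute())
```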
And along with adding all of this boilerplate code, you're writing more code that you then have to unit test. Right? So aside from my original function, I now have to write tests for those 2 to 3 helper functions that are bringing that code over to Spark. And because there's such a high effort to maintain a good code base, we find that people end up not following best practices and trying to just, you know, get code out in production because it's very hard to adhere to best practices when you have so much boilerplate code. So with Fugue, we believe that the code should adapt to the user. You already have your logic defined, and we want to be responsible for bringing that to Spark for you.
We believe that the code should adapt to the user because currently with Pandas and Spark and any compute framework in general, we find that your logic and the execution of that is coupled together. So if you write your code in pandas, it describes both the logic and the execution. Same thing with Spark. You normally have, you know, your transformations alongside your partitioning strategy and how it's gonna execute distributedly. We want to simplify this by decoupling logic and execution. So can I define my code in native Python or in Pandas or even in SQL and then worry about the execution later? And then we find there's a lot of benefits when you're able to decouple it this way.
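As a hedged sketch of that decoupling (using Fugue's transform function; the column names here are made up), the same logic runs on pandas, Spark, or Dask just by swapping the engine argument:

```python
import pandas as pd
from fugue import transform

def add_total(df: pd.DataFrame) -> pd.DataFrame:
    # pure logic: nothing here is specific to pandas, Spark, or Dask
    return df.assign(total=df["price"] * df["qty"])

df = pd.DataFrame({"price": [1.0, 2.0], "qty": [3, 4]})

# engine=None (the default) executes on plain pandas
local_result = transform(df, add_total, schema="*,total:double")

# the same call, unchanged, can target a cluster later:
# transform(df, add_total, schema="*,total:double", engine=spark_session)
# transform(df, add_total, schema="*,total:double", engine="dask")
```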
[00:08:44] Unknown:
As far as the target users for Fugue, I'm wondering if you can talk to the different personas that you either envision as you're designing some of the features or working through the implementation details, or actual concrete examples of the types of roles that are using Fugue in your experience of working with the community, and some of the ways that those users and personas influence the way that you think about the feature priorities and the design of the APIs and interfaces, and what the kind of core capabilities and core
[00:09:22] Unknown:
foundational principles of the framework actually are? Yeah. So before I talk about, like, who the target user is, I think it's helpful to describe what the journey is for an average person that starts their distributed compute journey. So for example, at a, you know, small to medium sized company, in a small data science team, I write all my code in pandas, and I have these pipelines already in production. But then now the data is starting to get too big for pandas to properly handle. For those who are not as familiar, pandas is single core. There's no, like, multiprocessing, and it's also confined to 1 machine. Right? So when the data becomes too big for a single machine, that's when you have to scale out to a distributed framework like Spark or Dask that can take advantage of a cluster and distribute that data over several machines that, you know, work on it in parallel.
So for that data scientist, you know, with a relatively small company without a lot of resources, the very tempting thing to do is just to scale vertically, meaning that I have, you know, a machine that has maybe 16 gigs of RAM. I'm running out of RAM. I'll just bump it up to 32 gigs, and then I'll keep running that same pipeline. But, of course, this isn't efficient. Right? Because a lot of times, maybe you only need 20 gigs of RAM for a single step, and then otherwise, the utilization is low all throughout. Right? So scaling vertically tends to be very expensive.
It's not a good use of resources compared to if I could scale horizontally, where I could introduce more machines and divide my data across several machines. Now I can have auto scaling enabled, which Spark and Dask provide. And I can have my cluster scale up for only the steps that need a lot of resources, and then I can scale down when I don't need as many resources. So now we have this data scientist who's in a dilemma. Hey. Do I scale vertically where I can just bump up the resources? Or do I scale horizontally? But that would involve rewriting my code in a way that makes it distributable. Right? So now I have a pandas based code base, and now I have to convert that to either Spark or Dask.
So on the Fugue side, we want to make this transition, you know, seamless. But for the, like, very basic use case, maybe you just have 1 step that's very intensive that you want to parallelize, and that's why Fugue exposes the transform function. And the transform function takes your Python code as is and then brings it to distributed execution for just one function. But then if you want, like, a full workflow, then Fugue also has an API for that. So if we treat your program as a set of logical steps, Fugue creates a DAG out of this logic. And if you treat it as just logic without execution, then we have this execution agnostic DAG. Right? So we just have a set of logical steps that's not necessarily coupled to Spark or coupled to Pandas. Right?
And at this lower level, because you committed more of your code into Fugue, Fugue provides other benefits. For example, we can do some compile time checking to see if the schema is accurate throughout that DAG and see if, like, there's some downstream operation that will not have a column that it needs. In that situation, we can raise a compile time error, preventing expensive mistakes using your production clusters. And then even if you're experienced with Spark or Dask already, there's still a lot of edge cases that are hard to develop for. For example, if I have, like, 10 items that I want to distribute over to a cluster and each of these 10 items takes 1 hour to process, what you'll find is that, hey. I'm gonna divide these 10 items, send them to the workers. You'll find that it's very common that one of those workers or one of those partitions has 2 items and another will have 0. And this is just because of the default hashing that happens.
Although, to be fair, at larger scales with more partitions, it kinda gets averaged out, and you kinda have even partitions. But there are cases when, like, for example, you want to train 10 machine learning models. Now what happens is that that partition that ends up having 2, you know, items to train can easily double the total execution time of your program. So this is something that we solved on the Fugue level that allows you to just specify, hey. I want even partitioning, and we guarantee that on our end. Something that's also used a lot that takes a lot of effort to code on your own is checkpointing.
So in Spark, what happens is if your lineage gets too long, where you have a lot of steps in your program, you eventually run into errors because this lineage is too long. So the best practice around it is to truncate your lineage by explicitly checkpointing and then fetching out, you know, that file later and loading it back in. So on the Fugue level, we have added more checkpoint options. For example, do you need this to be a permanent checkpoint or just a temporary checkpoint to break the lineage? We also added, for example, this concept of a deterministic checkpoint. So if you're, like, in a Jupyter notebook and you're just iterating, now we can load that checkpoint in if the code that produced that checkpoint remained the same. But if something changed there, then that determinism is lost. So that's when we decide to rerun the code and generate a new data frame. Right? So even for experienced Spark users, there's a lot of cases where you have to write a nontrivial amount of code in order to get things to work as efficiently as possible.
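As a sketch of what that even-partitioning guarantee looks like at the API level (the grouping column, the training function, and the Spark session are all assumptions):

```python
from fugue import transform

# Train one model per group; algo="even" asks Fugue to spread the groups
# evenly across workers instead of relying on default hashing.
results = transform(
    df,
    train_model,  # hypothetical per-partition training function
    schema="model_id:str,score:double",
    partition={"by": "model_id", "algo": "even"},
    engine=spark_session,  # assumed pre-existing SparkSession
)
```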
So Fugue provides a lot of that also for these already experienced Spark users. And what we find is that the average Spark user actually doesn't use Spark as effectively as possible. Again, like, it's very easy to just put code out there. It's running, you know, okay. But with certain tweaks, it could run much, much faster. So from our metrics, what we've seen is that in general, for people that use Fugue, their costs have dropped 50%, and wall time of Spark jobs also drops by, like, 80%. Right? And this is already, you know, coming from a big company that has a lot of technical people. So we think that for the average Spark user, they'll probably see a lot of gains moving their code to Fugue because Fugue puts those guardrails and the best practices around it to execute it effectively.
The second part of your question is how does this influence priorities and API design? So one of the things that we encountered was that we first introduced Fugue as the DAG, and this did not have that, you know, one-liner transform function. So now it was an all or nothing thing where I have to totally buy in to this DAG, to the Fugue DAG, in order to get all of the benefits. What we found was that it was too constraining. People didn't really, you know, like that. They just wanted something, you know, hey. Bring this function to Spark. Bring this to Dask. And because of that, we exposed those, like, very high level APIs of just, like, one-liners in order to bring, you know, one pandas function over to Spark or one pandas function over to Dask. The second thing to mention is that we also have a lot of people who don't have a cluster. And because they don't have a cluster, there's really nothing to distribute to. Like, you can't, you know, use Spark or use Dask because you don't know how to obtain your own cluster.
That's why on our end on Fugue, we've been putting a lot of recent development into optimizing the local experience. For example, we added DuckDB as a back end. DuckDB is an in memory OLAP database that now allows you to, like, query a CSV or a parquet file, perform some SQL query on it, and get back processed results, which are now smaller, and then you can process those in pandas.
[00:17:21] Unknown:
Yeah. DuckDB is definitely an interesting project that I've been keeping an eye on, and my understanding is that it's effectively SQLite for analytical use cases and column oriented aggregations.
[00:17:33] Unknown:
Exactly. Yeah. So for us, because we have a SQL interface as well. Right? We were inspired by Spark SQL, and we have a SQL interface for both Dask and Pandas. But what we're finding is that the SQL interface that we have for Pandas, which is something that we wrote ourselves by mapping SQL commands over to pandas code, is much slower than DuckDB. And I'm talking by, like, a magnitude of, like, 100 times, just because DuckDB is very optimized for that, like, local querying of, like, pandas data frames or querying from a file. And that's why we introduced it as a back end. So now you can use that on top of your pandas data frames.
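A minimal sketch of that DuckDB back end in action (the file and column names are invented, and it assumes Fugue's DuckDB extra is installed):

```python
from fugue_sql import fsql

# DuckDB does the heavy lifting on the raw file; only the much smaller
# aggregated result needs to come back as a pandas data frame.
result = fsql("""
df = LOAD "events.parquet"
SELECT user_id, COUNT(*) AS n FROM df GROUP BY user_id
PRINT
""").run("duckdb")
```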
[00:18:15] Unknown:
Every data project starts with collecting the information that will provide answers to your questions or inputs to your models. The web is the largest trove of information on the planet, and Oxylabs helps you unlock its potential. With the Oxylabs scraper APIs, you can extract data from even JavaScript heavy websites. Combined with their residential proxies, you can be sure that you'll have reliable and high quality data whenever you need it. Go to dataengineeringpodcast.com/oxylabs today and use code DEP25 to get your special discount on residential proxies.
You've talked a little bit about some of the sharp edges or pitfalls that engineers run into when they're trying to move from a local development single machine environment to executing across a Spark or a Dask cluster. I'm wondering if there are any other sort of general categories of problems or points of confusion that people run into when they are trying to make that transition. And at the point where they do decide that I need to move this off of my laptop onto a cluster, how they think about what the decision points are for whether they want to use Spark or Dask in the event that they don't already have the existing infrastructure.
[00:19:32] Unknown:
First off, a lot of people, when they see Fugue, they think that, oh, it's just some kind of mapping, that, hey. You know, I have some abstraction layer, and I just decide if it's gonna, you know, go to the pandas code path or the Spark code path. And underneath the hood, like, it just maps it. Right? But what they don't realize is that a lot of Fugue is all about maintaining the consistency so that you get the same results across these execution engines. Because Spark and Pandas, they generate a lot of different results for a lot of, you know, similar operations. For example, do nulls join with nulls, or are nulls dropped from the join? Right? In pandas, they actually join together, while in Spark, they don't.
Same thing with group by. Are nulls kept in a group by operation, or are they dropped in the group by operation? So for this one specifically, for Spark, nulls are kept in the group by, and in pandas, nulls are dropped from the group by. Right? But this keeping or dropping of nulls, it's still understandable. You know, you just have to write a bit of extra code. But the one that really stands out to me is sorting. Because in sorting, in pandas, you decide, hey, are nulls at the bottom or are nulls at the top of the sort? Right? Whereas in Spark, which is Java based, it treats nulls as the biggest value.
So if you're descending, nulls are at the top. If you're ascending, nulls are at the bottom. So for sorting, we actually have 2 completely different systems for both Pandas and Spark. And now it becomes Fugue's question, and for any framework that, you know, does some sort of, like, migrating of Pandas code to Spark, it becomes a question of which one do you want to be consistent with. And on the Fugue side, we chose to be consistent with SQL and Spark because if the code executes well in a distributed environment, then it can go backward to the local setting, whereas it's not necessarily true the other way around.
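The group-by case is easy to demonstrate with plain pandas; by default the null key silently disappears, while Spark SQL keeps NULL as its own group:

```python
import pandas as pd

df = pd.DataFrame({"k": ["a", "a", None], "v": [1, 2, 3]})

# pandas drops the None group unless you opt out
print(df.groupby("k")["v"].sum())                 # only group "a"
print(df.groupby("k", dropna=False)["v"].sum())   # None kept as a group

# Spark keeps NULL as a group by default (PySpark sketch):
# spark.createDataFrame(df).groupBy("k").sum("v").show()
```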
So consistency is a big issue. And then the other one is also mindset related. A lot of people expect some functionality to exist in a distributed setting, and they may not be aware of how expensive that is. Right? So, for example, median is the most basic example, where a median locally in pandas is trivial to execute. But if you bring the median operation to Spark or Dask and you're looking for a global median, it becomes way harder because now data is spread across multiple machines, and you need some kind of movement of that data in order to be able to get a median value. Right? So same thing with transpose.
Right? Transpose is a very common operation in pandas, but it's something that doesn't translate well to a distributed setting. So if you look at Koalas, for example, which aims to be a pandas interface for Spark, I think in their documentation, they explicitly say that, hey. There is a limit of, like, this doesn't perform well after 10,000, you know, columns or 10,000 rows. Right? So mindset is also definitely a big deal. And then aside from that, there's just a bunch of concepts that people are not exposed to in the local setting, but they have to be cognizant of in the distributed setting. So these are things like lazy evaluation, persisting, you know, managing your partitions and the shuffling of data around, which can easily cause bottlenecks if you don't do things effectively.
Right? And then the second part of your question is when would somebody choose Spark versus Dask? So on the Fugue level, because we're an abstraction layer for these 2, we hope that our users don't really have to make this choice of I'm gonna choose Spark or I'm gonna choose Dask. Instead, it should be about, hey. You know, my company already has a Spark cluster, so I'm just gonna use that. Or, hey. My company already adopts Dask and uses a Dask cluster, so I'm just gonna use that. But we can still answer this, you know, question as fairly as possible given that by doing the Fugue project, we interact a lot with Spark and Dask. So in general, it feels like Spark as a framework is more optimized because it's more mature. The optimizer is really developed, whereas in Dask, I think we're still just about to start getting that optimizer into the data frames. The second thing is that, you know, Fugue is all about data frames and being an abstraction for data frames. But you may have, you know, certain use cases that don't fit into the data frame setting necessarily, and Dask has other structures that deal with this well. For example, you have the Dask bag, which is like the JSON kind of dictionary that you can deal with, or it has, you know, the Dask array, which builds on top of the NumPy array, which is like a distributed array. Right? So there are these other data structures. And, like, with Dask also, you can do stuff like, and Prefect does this, where you can just submit tasks. Like, hey. Run this function somewhere on the cluster. Right? Spark doesn't have as much freedom for this. Spark is more constrained to that data frame and, because of that, more optimized and more performant in that data frame setting.
If you don't have any cluster yet, though, like in your workplace, it's significantly easier, in my opinion, at least, to spin up a Dask cluster. Because of the Dask ecosystem, I can just spin up one on, you know, AWS Fargate, or I can spin one up on Kubernetes. But deploying Spark on Kubernetes myself, that's not something, like, I would even, like, you know, personally consider. It's quite challenging. So if you don't have anything, Dask is much easier to get a cluster for and scale your code. And then the last piece of comparing Spark and Dask is local execution. Right?
So because Spark runs on Java, you know, you need to instantiate that Java environment. There's a lot of overhead to doing it. And it's kinda painful, actually, because even if I'm developing locally and I'm running a script and I tweak my script and I try to run it again, it easily takes, you know, 20, 30 seconds to spin up that Java environment. Whereas in Dask, because it's all Python based and it builds on top of, you know, pandas and NumPy and the PyData stack, it's a lot more seamless to just easily test code, and it just runs automatically.
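For example (a minimal sketch), a local Dask cluster is two lines, and the same Client API extends to Fargate or Kubernetes deployments:

```python
from dask.distributed import Client

# Starts a local scheduler plus workers in-process; no JVM startup cost.
client = Client()
print(client.dashboard_link)  # live view of tasks while you iterate
```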
[00:25:53] Unknown:
In terms of kind of not necessarily prior art, but similar art in this ecosystem, the thing that comes most readily to mind is Modin, and you also mentioned Koalas, which is the pandas API on top of Spark. But Modin is a little bit more in line with what Fugue is working on, where it provides different execution across either Dask or Ray. It also has some different interfaces beyond just the Pandas API where it has a SQL layer. And I'm wondering if you can give your thoughts on the comparison between the goals and utility of Modin as it compares to what you're building with Fugue.
[00:26:32] Unknown:
Yeah. Definitely. So this is a question we get a lot just because people think, oh, you know, porting Python Pandas code to Spark or Dask, you must be like Modin, or you must be like Koalas. Right? They put us in the same bucket, but, actually, I don't think we are. So for Modin specifically, we've been looking to add Modin as a back end. So for those who are unfamiliar, Modin is a pandas interface, but with a Dask back end. They have OmniSci as a back end, and I think they have Ray as a back end now. Right? So on the Fugue level, we don't have Ray yet as an execution engine. And if Modin, like, perfected that path of having an interface for Ray, we could easily add Modin as a back end to Fugue. Right? So I think we're still one level of abstraction above Modin, where we could use them as a back end if it made sense to. But in general, Modin and Koalas are pandas interfaces for distributed compute, and Fugue intentionally decides to not be a pandas based interface.
And then this becomes sort of a philosophical discussion now of is a pandas interface translatable to distributed compute? And we think it's not, and there's a couple of reasons for this. Number 1, especially when we're talking about Spark, we brought up consistency earlier. So if I asked you, hey. Is Koalas consistent with Pandas, or is it consistent with Spark? Would you know the answer readily? And the thing is, not really. Right? Because how am I going to use Koalas? Right? A lot of their, like, use cases, for example, involve having a Koalas data frame, and then maybe I'll convert it to Spark, and then maybe I'll run it through Spark ML or something. Right?
Or maybe I need to convert it to Spark in order to be able to use Spark SQL on it because I can't readily use Spark SQL on Koalas. But then if you read the documentation, Koalas is actually consistent with Pandas, right, in terms of how it handles nulls. And because of this discrepancy, it's a bit awkward in the middle, kinda not quite Pandas, but not quite Spark. I personally think that design decisions become really hard, like, when you're working on a framework, if you want to be more oriented with pandas or with Spark, especially if you're, like, a pandas interface for distributed compute.
The second thing about the pandas interface is that, in general, when you use pandas, it's very tied to the index. You do a lot of operations that set index, reset index. You group by. It makes a new index. Right? And if you're working with pandas, the index is very, very ingrained in what you do. So when Modin created their project and when they were making their API, they decided to prioritize based on the order of how often the functions were used in pandas, and then they would knock those out so that they could, you know, get the most used functions in first. Index operations are in the top 5 of that, and that just shows how tied things are to the index. So why does pandas have an index, and why does Spark not have an index?
And the answer is because the index is sort of a global ordering of all of your data. And in a local setting, that's very fine. That's no problem at all. In fact, it's very efficient for some operations because now I can easily say, hey. Get me the item at index 20 or get me the item at this date time index, and it's just fetched easily. But now if I have my data across multiple machines, maintaining a global order for that, especially as I do group by operations, it's certainly doable, but it's very expensive, and it's very questionable if it's even needed. And that's why we think Spark chose not to have an index, because if I needed, you know, so and so item at index 20, in Spark, it would just translate to a filter operation where I can say, hey. Get me the item that fulfills this criteria.
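In code, that contrast looks roughly like this (the column names are invented):

```python
import pandas as pd

pdf = pd.DataFrame({"id": range(100), "v": range(100)}).set_index("id")
row = pdf.loc[20]  # cheap local lookup through the global index

# Spark has no global index, so the same intent becomes a filter
# (PySpark sketch, assuming an equivalent sdf exists):
# sdf.filter(sdf["id"] == 20).collect()
```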
And this is also why we think that with Koalas specifically, there are a lot of cases where it's like, yes, it's a pandas interface to Spark, but there's a lot of cases where performance doesn't really hit the bar that you'd expect. Right? There's a lot of cases, and you'll see this on Stack Overflow, where people perform a group by apply operation, and all of a sudden, it's even running slower than native pandas. And we think it has to do with that distributed index where you're maintaining that global order. So, definitely, the index is questionable, and we intentionally chose not to use the pandas interface because it didn't necessarily translate well. The second thing with the pandas interface is if a user just sticks to pandas, they're not exposed to things like checkpointing, persisting, and, you know, in distributed compute there's also broadcast.
Pandas just doesn't have the grammar to necessarily apply these operations. And then the question becomes, hey. Do I add these operations into the pandas grammar? Right? But if you do that, it kinda isn't necessarily a pandas interface anymore. Right? And it kinda deviates from it. There's a lot of times when pandas has something that doesn't make sense, and you have to make a choice. Hey. Do I deviate away from this operation? And then at that point, you kinda lose parity. Right? The other thing I wanna discuss is that it seems to be a bit magical in a lot of cases where a lot of times these frameworks will just say, hey. You know, you just need to change the import statement, and it's all gonna distribute magically.
This statement assumes that there is, like, a 1 to 1 parity between the pandas interface and that framework's interface. And, of course, like, it's never gonna have 100% parity because pandas is developing, pandas is changing, and then, you know, they're always gonna be 1 step ahead. So now there's a question. If there's no API parity, like, how does it fail on the distributed side? And I can tell you for Modin specifically what their design decision was: if something fails, it defaults back to pandas. And what does defaulting back to pandas mean? It means collecting everything back into a pandas data frame, performing that operation, and then spitting it back out, right, as a distributed data frame. And if you think about that, that's such an expensive way to do things that is a result of not really respecting the fact that I can bring this to Spark or Dask with these certain considerations. And we think at Fugue that if you just devote a bit more time and rearchitect your solution in a different way, you'll be able to get much better performance and be very mindful of the distributed system components that you're interacting with.
[00:33:41] Unknown:
In terms of being able to take those distributed concepts and kind of hide away some of that complexity and be able to do push down operations to optimize across the specific context of those execution engines, I'm wondering if you can talk to the implementation details of the Fugue project and some of the ways that you've thought about the architecture to allow for optimizing in these different execution engines and still being able to provide a consistent interface to the end user as they're developing, and being able to maybe have some progressive reveal of complexity where at this top level, it's just, you know, just pass it through this transform function. We'll do everything for you. Or if there's something very bespoke or custom that you need to do, being able to have some sort of pass through capability to say, you know, I want to be able to call into this utility of Spark ML to be able to build this model in this, you know, broader DAG, being able to actually maintain that capability, but to have that mapping back up to the Fugue API?
[00:34:48] Unknown:
Yeah. Yeah. This is a very loaded question, but I'll try my best to answer. So on the first level with Fugue's architecture, again, like, it's pretty much what you'd expect in some ways where, like, hey. If I naively made some kind of abstraction layer, then, hey. When this join is called, go to the pandas join. If we're using a Spark back end, just use the Spark join. Right? And I've talked to a lot of, like, engineers who kind of made their own solution to this, and they made their own abstraction layer. But then what they get bitten by is the inconsistencies that we mentioned earlier.
So for Fugue, our guarantee is consistency, and we write that extra code so that you get the same results across different execution engines. Right? So that's the first level. So first is for each of those operations, we go down to, you know, Pandas or we go down to Spark or Dask, but we do it in a way that it's consistent across all of those engines. So this means that we have an execution engine spec that we've come up with, and this execution engine is kind of like the contract for what we need to be able to fulfill in order to add an execution engine. So, for example, if Ray came in and we made an execution engine for Ray, we would need to be able to perform all of these operations, like how do I partition, how do I run a function inside a partition, how do I join, merge, etcetera.
And this is, like, the contract for what has to be fulfilled. Now once we have this execution engine contract and we're true to it, now we can add engines and evaluate, hey. Can they actually fulfill the spec of the execution engine? So from the unit testing perspective, we have a unified test suite for all of our execution engines, and we strongly maintain our 100% test coverage. And that's a very important thing for us because it takes a lot of effort to do so. But besides the abstraction, Fugue is guaranteeing consistency. That is something that we have to uphold.
So we have that unified testing suite, an extensive unified testing suite, that guarantees that. Now once we have this execution engine, the second component of Fugue is that DAG. Right? You can just bring one function to Spark or to Dask, but, really, what Fugue has is an abstraction layer that has that full workflow as well, where you can bring all of your logic into the DAG, and then the DAG brings that to the underlying execution engine. So for example, if my logic were something like, hey. Load this. Perform some transformation.
Load this from another place. Join them and spit a file out. That is a logical graph that I can then say, hey. Run this on Pandas. Run this on Spark, or run this on Dask. So we have our own data frame that's engine agnostic. And this data frame, you perform operations on it. And as you perform operations on it, it's compiling that DAG, similar to Spark or Dask. Right? Because when you have Spark operations and Dask operations, they're evaluated lazily. So we have our own, like, data frame that's evaluated lazily. And once that computation graph is created, you say, hey. Run that graph on Dask. Then we go down to the Dask execution engine, and step by step, we say, oh, let's use the Dask load, the Dask transform, the Dask join, and the Dask save. Right? Same thing with Spark. If that execution engine went to Spark, you say, oh, use the Spark load, use the Spark transform, join, and then save. So it's the same thing.
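A hedged sketch of that engine-agnostic DAG (the paths, the join, and the transformer are all assumptions):

```python
import pandas as pd
from fugue import FugueWorkflow

def clean_rows(df: pd.DataFrame) -> pd.DataFrame:
    # hypothetical cleanup step
    return df.dropna()

with FugueWorkflow() as dag:  # no engine given, so this runs on pandas
    a = dag.load("orders.parquet")
    b = dag.load("customers.parquet")
    joined = a.join(b, how="inner")  # joins on the common columns
    joined.transform(clean_rows, schema="*").save("cleaned.parquet")

# The same graph runs distributed by supplying an engine instead,
# e.g. FugueWorkflow(spark_session) (API details vary by version).
```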
We have both of these for your standard operations, but then now we have to also provide stuff around execution. How do I partition? How do I broadcast? How do I checkpoint? The distributed computing operations that I talked about earlier. Now we add them also as operations on our, you know, distributed engine agnostic data frame. So these are the components that make up Fugue. And then after that, it's all about, like, you know, adding the cherry on top of, like, how can we make it even easier for people to use? So, for example, we have this DAG now. We do schema validation. We raise errors if ever, you know, you're doing operations on columns that don't exist at a certain point.
Or, for example, when I said we take in both native Python code and pandas code, we can also operate independent of the pandas framework, where we have functions that are defined on, like, lists of dictionaries or lists of lists. We can also take in these data frame types, and we can also operate on them. So now you can describe your logic in whatever grammar you want, and then we can worry about bringing that to Spark or Dask for you. So now that we have these 2 components, we have a Python interface in that DAG. We can build the SQL layer on top of it. Right? So we have a SQL interface in Fugue SQL, and the SQL interface is a first class interface that shares all of the features of that Python interface.
So in order to match that Python interface, we added keywords to SQL. So, for example, you can use SQL to say, load this and then do some transform using this Python function, join, and you can do that all in SQL. So when we designed Fugue SQL, it was meant to be a full end to end interface for compute workflows built on top of that abstraction layer with 1 to 1 parity.
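A sketch of that mixed SQL-plus-Python style (the function, schema hint, and file names are invented, and it assumes fsql can resolve the function from the surrounding scope):

```python
import pandas as pd
from fugue_sql import fsql

# schema: *,is_big:bool
def flag_big(df: pd.DataFrame) -> pd.DataFrame:
    # hypothetical Python logic invoked from inside the SQL
    return df.assign(is_big=df["amount"] > 100)

fsql("""
df = LOAD "transactions.parquet"
enriched = TRANSFORM df USING flag_big
SELECT user_id, SUM(amount) AS total FROM enriched WHERE is_big GROUP BY user_id
SAVE OVERWRITE "summary.parquet"
""").run()  # .run(spark_session) would execute the same workflow on Spark
```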
[00:40:12] Unknown:
As far as the execution engines, you mentioned some of the differences between Dask and Spark and some of the ways you might optimize the execution across them for those specific contexts. And I'm curious, what would be involved in being able to expand to other execution engines where maybe you decide that we care more about being able to perform streaming compute, and so we want to be able to execute on top of Flink, or we want to, you know, move from a data frame to a concrete table implementation. And so we wanna be able to push down some of this logic into, you know, Snowflake or Presto and an actual, you know, database or query engine, and just some of the ways that you think about what are the extensions that make sense to invest in and what are the points where we say that is out of scope for what we're trying to do with Fugue, and so we're not actually going to try and, you know, go after that target.
[00:41:14] Unknown:
The things you mentioned are definitely things that we want to add in the future. Number 1, streaming, and number 2, like, going deeper into the database. So, for example, we can use the same, like, SQL interface query. Like, when we said we want Fugue SQL to be an end to end interface, we can envision it querying from the database, loading it in some data frame, whether that be Pandas or Spark, and then, you know, doing your operations in Pandas and Spark that can't be done with traditional ANSI SQL. So for that, we've been thinking about it. But the thing is, if we want to have a unified interface for everything, like, for us, like, we want to be able to run on Pandas, Spark, and Dask without, you know, changing the underlying functions or changing the underlying code. We just want to specify the execution engine.
The thing with databases is they have their own, like, custom functions. Like, they have different ways to express the same thing. Right? And because they have different ways to express the same thing, there is some design that needs to happen around mapping to make sure that, hey. If you use this Fugue SQL keyword or something, it will map to this function in MySQL or this function in Postgres or this function in Oracle, because they all have different ways, especially, like, around datetime, stuff like that. So that's certainly been in our head. But on a more, like, realistic, like, short term level, for example, something that's really nice about Fugue SQL right now is that you can have an end to end workflow in DuckDB, and then you can translate it to Spark SQL.
And a lot of the times, it will run as is without code change, and that's sort of amazing. But that has to do with the fact that DuckDB's SQL interface matches Spark SQL's interface very well with, like, good parity. So now you have to find those edge cases where it doesn't map well, and then you have to make some kind of syntax for it so that we can eventually resolve it when you declare your back end without having to change the code that you already wrote. On the streaming side, we're definitely looking into adding Flink. Streaming will definitely take a different form, especially just because when you go from batch jobs to streaming, of course, latency has to go down, and the overhead with a batch job is a lot more generous compared to the overhead with streaming. So we definitely have to pare it down. It's certainly something we've been thinking about, how we can incorporate streaming frameworks as well, and that's something in the future plans.
[00:43:47] Unknown:
The only thing worse than having bad data is not knowing that you have it. With Bigeye's data observability platform, if there is an issue with your data or data pipelines, you'll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user friendly interface, and automated yet flexible alerting, you've got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
For people who are interested in getting started with Fugue, can you talk through the workflow of incorporating it into an existing project and how that might differ from the way that you would approach it if you're building something out greenfield?
[00:44:38] Unknown:
This is a good question because a lot of times, like, I think it varies depending on where you are, you know, how much code you've written in Spark already. Right? So let me first say what attracted me to start contributing to Fugue and getting involved with the project. So first of all, I worked in a payroll company as a data scientist, and payroll has a lot of these, like, custom business logic rules. Hey. If this person is this, use this. If not, you know, use this other thing. And then there's a lot of nested logic in these if elses.
But then what we had to do was we had certain projects that called for Spark because of the size of data, and we had other projects that called for pandas. And then we had to reimplement the same logic twice. We had the Pandas implementation and the Spark implementation. And because we have these 2 sets of code that maybe live in different places that we have to maintain, it's very easy for them to fall out of sync. So I was very attracted to Fugue because I felt, hey, we can write our code just once, and then we can, again, specify, hey. Run this on Pandas. Run this on Spark. So at that level, like, if you have something written already, like, in Pandas, it's very easy to port it to Spark. But the challenge is if you have something written in Spark, it's a bit harder to bring it to Pandas because it's very coupled to the Spark framework, and that's exactly why Fugue wants to prevent locking into these frameworks. We want you to define your code, like, in the most native way possible, and we'll be responsible for the rest.
If you have a significant amount of code that is already locked in to Spark, it becomes very hard to use Fugue for new projects. But on the other hand, if you have a lot of Pandas code or Python code and you are going in that big data direction and you need to port it over, then it becomes very easy for Fugue to take over that responsibility of porting it over. If I were doing a greenfield deployment, I'd really look at that end to end workflow DAG type of setup because then it's very easy to prototype in pandas. You know, you just use your native Python or Pandas execution engine, and then when it's ready to put in production, you use Spark. It helps in local development because we talked about how much less code you have to test. We talked about how you don't necessarily need a cluster to run your code, and then when you're ready, you can bring it to production.
You reduce a lot of expensive mistakes as well because instead of, you know, testing your code on the Spark cluster and then finding out after an hour that it failed, on a smaller sample with the Pandas engine, you can run your test, you find out something failed, and you tweak it before you actually put it in production with the Spark cluster.
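That workflow can be as simple as parameterizing the engine (a sketch with invented names):

```python
import pandas as pd
from fugue import transform

def business_logic(df: pd.DataFrame) -> pd.DataFrame:
    # hypothetical transformation under test
    return df[df["amount"] > 0]

def run_pipeline(df, engine=None):
    # engine=None runs on plain pandas, so unit tests need no cluster
    return transform(df, business_logic, schema="*", engine=engine)

# In a test: run_pipeline(small_sample_df)
# In production, the identical code: run_pipeline(full_df, engine=spark_session)
```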
[00:47:20] Unknown:
In terms of the capabilities of Fugue at the surface level, it's very easy to see, okay, I've got this pandas logic. I'm just going to pass this through a transform function, and now I can run this on Dask or Spark. But what are some of the more nuanced or less visible capabilities of the Fugue project and Fugue framework that you think are worth calling out or that folks should take the time to investigate and understand more fully?
[00:47:49] Unknown:
A lot of times, we'll hear something along the lines of, I really need Spark's functionality because it does so and so. Right? I'd like to point out that Fugue does not lock away, like, access to the underlying framework. So if you need a specific piece of Dask code or you need a specific piece of Spark code, Fugue is aware of the execution engine it's running on. So you can do some sort of if else behavior on that execution engine and then use that Spark code or use that Dask code. So if you already have a lot of Spark code written, then you can decide I'm just gonna write the pandas equivalent of it. And then now I can do that if else depending on my execution engine. And now I have this function that if I'm on Spark, it resolves to the native Spark code. And if I'm on Pandas, it resolves to the native Pandas code.
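One way this shows up in Fugue's API (a hedged sketch; the class path and helpers are assumptions): a processor function can receive the active ExecutionEngine as its first argument, which makes that if/else explicit:

```python
import pandas as pd
from fugue import FugueWorkflow, ExecutionEngine, DataFrame

def my_step(engine: ExecutionEngine, df: DataFrame) -> DataFrame:
    from fugue_spark import SparkExecutionEngine  # assumed class path
    if isinstance(engine, SparkExecutionEngine):
        sdf = engine.to_df(df).native  # underlying PySpark DataFrame
        # ... Spark-native code path here ...
        return engine.to_df(sdf)
    pdf = df.as_pandas()
    # ... pandas-equivalent code path here ...
    return engine.to_df(pdf)

with FugueWorkflow() as dag:
    dag.load("data.parquet").process(my_step).show()
```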
[00:48:38] Unknown:
In your experience of working with Fugue, using it for your own purposes, and interacting with the community? What are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:48:50] Unknown:
I'll talk about unexpected because this one took us by surprise. Like, we did not foresee, like, this kind of use case. So we were talking to someone who is in the Internet of Things space. So what they do is they have a lot of edge devices that have limited compute, but they want to be able to process the data the same way on the edge and on the cloud, where they have more resources. So they're thinking, hey. I can use Fugue, and then I can, you know, have that same code exist on both the edge device and on the cloud, on the cloud using a Spark back end or Dask back end and on the edge using the pandas back end. And now I don't have to rewrite anything. I just have one code base for my IoT stack. Right? That's certainly interesting.
The second thing is that I wanna bring up a collaboration we had with PyCaret. PyCaret is a low code data science framework that data scientists can use to build end to end machine learning pipelines in very few lines of code. So PyCaret primarily runs on pandas, but what we did for them was we can say, hey. If you add a Fugue back end to your code, then we can point, like, this piece of code to Spark, Dask, or Pandas. Right? So now instead of just, you know, training sequentially, what they do is they have an AutoML with, like, 15 models, and they train all of these in parallel. Right? Now we can say train them in parallel over Spark, or we train them in parallel over Dask by setting Fugue as the back end. So for a lot of open source projects, if they integrate Fugue into their stack, I think they can see that, hey. I have a pandas based library now, but maybe if I use Fugue as a back end, I'll be able to apply it to the Spark engine as well or to the Dask engine as well. So, definitely, we have a lot of open source collaborations in the works where we can port their library over to Spark or Dask for them without major rewrites on their end, because we solved it already at the abstraction layer level.
[00:50:49] Unknown:
In your experience of working with Fugue, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:50:57] Unknown:
Yeah. So on a technical level, I would say the SQL parser was definitely challenging because that's something, like, you're definitely not used to. And then, like, now we use ANTLR specifically, and, like, there's not a lot of resources on it. So now you have to figure out, like, how to build your vocabulary and then map those words over to the respective Python code. But on a nontechnical level, I think this is very understandable now that I'm saying it, but we certainly got a lot of pushback because data professionals tend to be strongly opinionated. So, for example, we have a LinkedIn post that demos Fugue SQL.
We get both sides of, like, hey. You know, this looks great. I can't wait to use Fugue SQL and that SQL interface on my pandas data frame. And we get the other side where it's, like, people, like, not quite cursing us, but as close as you can get without cursing. Like, why would you ever want to use this if pandas is very performant already? Right? So, also, on the pandas side, what we see is that a lot of people kinda really love their tool, and, really, it takes quite a bit of convincing in order to expose them to, like, other things. Like, for example, pandas being the interface for distributed compute, sometimes people just want to stick with that. So I think this is pretty expected now that we are talking about it, but, definitely, it came as an initial surprise for us, and we thought, hey. You know, people would see the value of Fugue immediately, but definitely, we've learned to become better communicators about that over time as well. In terms of people who are interested in being able to
[00:52:38] Unknown:
In terms of people who are interested in being able to abstract their logic from the execution context, what are the cases where Fugue might be the wrong choice, and they're better served with Modin or Koalas, or using one of these other implementations of the pandas API for different execution contexts, or just writing their own abstraction layer to translate arbitrary Python code into some other runtime?
[00:53:04] Unknown:
With Modin specifically, I guess we don't have a Ray back end yet. If Modin has one and you need a Ray back end because you have a Ray cluster, then definitely, by all means, go for it. Same thing with OmniSci, which they have as a back end. But if that back end really gets developed, at some point we would maybe add Modin as a back end, or add Ray as a back end ourselves with Ray-native code rather than going through Modin. So in that sense, I think it's just a matter of time: if something is really picking up, then we would just add it as an execution engine on our end. But to their credit, if you switch your import and your program works perfectly, then maybe it doesn't make sense for you to rewrite it and bring it to Fugue. I honestly find that a lot of times there is at least something that's not working as expected, though. And if that's the case, then maybe you should consider using Fugue, where things are more explicit and more in your control, rather than it being sort of magical that this code is just ported. Because it wasn't necessarily one to one; a lot of times, compromises have to be made. But there is one case where Fugue is definitely not the option for you. I've seen conference talks where people say something along the lines of, we had to edit the Spark code ourselves, or we had to create this extension, or change the memory buffer, or whatever. These kinds of things, where you have to edit the framework to fit your use case, mean you're probably using Spark for some heavy duty use case that is maybe at the petabyte level of data.
You're a big company, you have a lot of tech resources, and you can do this thing. If it means that much to you to perform these optimizations, then Fugue is definitely not for you, because we perform some optimizations, but only up to a level that's reasonable for most practitioners. We would say most people are around that level anyway: your use case can probably be satisfied by Fugue unless you're at a really big company or running a really heavy duty data engineering pipeline. Otherwise, we think that Fugue can probably solve your use case, maybe with a bit of tuning or tweaking.
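The "switch your import" test described above is a one-line change, sketched here with a hypothetical input file and column:

    # import pandas as pd        # before: single-machine pandas
    import modin.pandas as pd    # after: same API, Ray or Dask underneath

    df = pd.read_csv("data.csv")              # hypothetical file
    print(df.groupby("category").size())      # unchanged pandas code

If everything in a program still behaves identically after that swap, a rewrite may not be worth it; the point above is that, in practice, something usually does not.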
[00:55:16] Unknown:
As you continue to work on the project and evolve it and, you know, keep it up to speed with these different underlying execution engines, what are some of the things you have planned for the near to medium term or any projects you're excited to dig into?
[00:55:29] Unknown:
Yeah. So we definitely want to add to the application layer of Fugue, because we have this abstraction layer, and what's nice about it is that we can now make a library that's compatible with pandas, Spark, and Dask. We already have one in Fugue Tune, for example, which is a hyperparameter tuning framework for machine learning. And now we can say, hey, run these model trainings on Spark, or on pandas as a back end. Right? Now we want to explore other parts of the stack as well. So, for example, data validation: can we make a data validation framework that is immediately compatible across pandas, Spark, and Dask once you add a validation rule? Or, for feature engineering, how can I use this DAG to represent a lot of transformations?
And then how can I further optimize that feature engineering DAG so that it performs better on a distributed engine such as Spark or Dask? We're also exploring Ray on our end as a back end; in a bit of time, maybe when some of that stuff matures, it will certainly be able to be brought in as an execution engine. And then, like we mentioned earlier, we're looking at that unified SQL interface: how can we go further into the database and actually perform operations there? Actually, on this front, we already added Ibis as a back end.
So you can use Fugue Ibis to use Python to query those database tables. But how do we do this on the SQL interface as well?
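The idea behind that back end is easiest to see in plain Ibis, which builds deferred expressions in Python and compiles them to SQL that runs inside the database; Fugue Ibis wraps this style of querying. The connection string and table here are hypothetical:

    import ibis

    con = ibis.sqlite.connect("warehouse.db")   # hypothetical database
    orders = con.table("orders")                # hypothetical table

    # A deferred expression: nothing runs until .execute() is called,
    # at which point it is compiled to SQL and executed in the database.
    expr = orders.group_by("customer_id").aggregate(total=orders.amount.sum())
    print(expr.execute())  # result comes back as a pandas DataFrame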
[00:57:05] Unknown:
Are there any other aspects of the work that you're doing on Fugue, either at the technical level, or the use cases that it enables, or your work with the community, or areas of contribution that you're looking for help with, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:57:21] Unknown:
We're definitely looking for collaborations with other open source projects. If you have a project that works well on pandas already, we can definitely bring that to Spark or Dask with you. So that's something we're looking into. And if you're a company that's maybe thinking of migrating to Spark or Dask for your pipelines but don't quite know where to start, we're always happy to chat about that.
[00:57:40] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:57] Unknown:
It's not that I see a gap; it's more that I see a very interesting intersection. My day job with Prefect is macro level workflow orchestration. So we deal with full on data pipelines, and there's a very macro level DAG that tracks the state of these operations, of these jobs, and then reports some kind of state, and there's monitoring on that level. But then there's also micro level orchestration, and I think this is what Spark, Dask, and Ray have, where they have a DAG themselves, and their DAG also does retries on operations.
Their DAG also does some sort of optimization. Right? So we have macro level workflows, and we have micro level workflows, which are more compute based. And what I've been thinking about is that the intersection of these tools is pretty interesting, in that a lot of the macro level guys are going in the micro direction, and a lot of the micro level guys are going in the macro direction. We have Ray, which recently came out with Ray Workflows, for example. And then Prefect has a very, very granular task definition. I think the intersection of these is pretty interesting.
And I haven't quite been able to figure out how this industry will turn out, but this is something we think about for Fugue as well, because we are an abstraction layer for the micro level. Is there a way to unify that with the macro level as well? So, no real answers, just some thoughts there.
[00:59:36] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on Fugue. It's definitely a very interesting project, and I'm excited to see more work being done to elevate the logical component of the work that we're trying to do so that it can be abstracted away from the specific execution context and implementation details of where the code is running. So I appreciate all of the time that you and your collaborators are putting into that, and I hope you enjoy the rest of your day. Yeah. Thank you for having me. It's really an honor to be on this podcast, just because of the previous companies and guests that you've had, and I've definitely been a fan. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Overview
Interview with Kevin Kho: Fugue and Distributed Computing
Core Goals and Challenges of Fugue
Target Users and Use Cases for Fugue
Common Pitfalls in Distributed Computing
Comparison with Similar Projects
Implementation Details and Architecture of Fugue
Getting Started with Fugue
Advanced Capabilities and Use Cases
Future Plans and Collaborations
Community Contributions and Final Thoughts