Summary
Python has become the de facto language for working with data. That has brought with it a number of challenges having to do with the speed and scalability of working with large volumes of information. There have been many projects and strategies for overcoming these challenges, each with their own set of tradeoffs. In this episode Ehsan Totoni explains how he built the Bodo project to bring the speed and processing power of HPC techniques to the Python data ecosystem without requiring any re-work.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world's first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don't want to miss it!
- Your host is Tobias Macey and today I'm interviewing Ehsan Totoni about Bodo, a system for automatically optimizing and parallelizing Python code for massively parallel data processing and analytics
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Bodo is and the story behind it?
- What are the techniques/technologies that teams might use to optimize or scale out their data processing workflows?
- Why have you focused your efforts on the Python language and toolchain?
- Do you see any potential for expanding into other language communities?
- What are the shortcomings of projects such as Dask and Ray for scaling out Python data projects?
- Many people are familiar with the principle of HPC architectures, but can you share an overview of the current state of the art for HPC?
- What are the tradeoffs of HPC vs scale-out distributed systems?
- Can you describe the technical implementation of the Bodo platform?
- What are the aspects of the Python language and package ecosystem that have complicated the work of building an optimizing compiler?
- How do you handle compiled extensions? (e.g. C/C++/Fortran)
- What are some of the assumptions/expectations that you had when first approaching this project that have been challenged as you progressed through its implementation?
- How do you handle data distribution for scale out computation?
- What are some software architecture/programming patterns that act as bottlenecks/optimization cliffs for parallelization?
- What are some of the educational challenges that you have run into while working with potential and current customers?
- What are the most interesting, innovative, or unexpected ways that you have seen Bodo used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bodo?
- When is Bodo the wrong choice?
- What do you have planned for the future of Bodo?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Bodo
- High Performance Computing (HPC)
- University of Illinois, Urbana-Champaign
- Julia Language
- Pandas
- NumPy
- Dask
- Ray
- Numba
- LLVM
- SPMD
- MPI
- Elastic Fabric Adapter
- Iceberg Table Format
- IPython Parallel
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster.
With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Ehsan Totoni about Bodo, a system for automatically optimizing and parallelizing Python code for massively parallel data processing and analytics. So, Ehsan, can you start by introducing yourself? Thanks for having me, Tobias. I'm Ehsan Totoni, cofounder and CTO of Bodo. And do you remember how you first got involved in the area of data management?
[00:02:16] Unknown:
Interesting story. I was actually working on high performance computing. And along the way, I realized that probably the biggest application of high performance computing is data management and data processing, because the amount of data is growing, machine learning is growing, and at the compute level, these are all high performance computing problems to solve. And
[00:02:39] Unknown:
so that brings us to what you're building at Bodo because I know that some of the principles that you're building into the platform are based on your work with HPC. And so I'm wondering if you can just discuss a bit about what it is that you're building there and some of the story behind how you ended up at this specific juncture of problems and why you've decided to focus your time and effort on this problem domain.
[00:02:59] Unknown:
Definitely. I did my PhD at University of Illinois, Urbana-Champaign in high performance computing, sort of a hub for HPC. A lot of interesting problems being solved, but it was all low level code. MPI with Fortran and C++, many years in development for a lot of the codes, a lot of HPC experts developing those codes. But you look across the hall, and all the data scientists and domain experts write code in scripting languages. It used to be, you know, a lot of MATLAB. Now Python is more dominant. So I wanted to solve this problem, which is called the productivity-performance gap, where, you know, for a programmer, it's simple to write scripting code, but it runs on a single core, and it's very slow and inefficient. But if you want performance, you have to write low level parallel code that can scale to millions of cores these days on supercomputers, but only a few people in the world can write those codes. There's a huge gap in this space.
So I joined Intel Labs to solve this problem, and it hadn't really been solved for a long time, because the solution would be to have a compiler that could automatically take this high level code and generate that performance-expert-tuned code automatically, and that's extremely difficult from a computer science point of view. But I was fortunate that my project was successful after 4 years of research, and I came out of Intel to build that, because something like Bodo didn't exist. Datasets are growing. Moore's Law is slowing. So single core performance isn't improving as much anymore, since really 2004, 2005. Things like Hadoop and Spark came along to help developers work with larger datasets.
But from our point of view, those systems are suboptimal because they are not following parallel computing principles. And we think the Bodo solution can be orders of magnitude more efficient and faster than these other systems. So because of that, I joined forces with my co-founder, Behzad Nasre. He has an interesting story too, because at that time, he wasn't at Intel anymore, but he had had a career at Intel. And along the way, he was involved in Intel's very large investment in Hadoop and Cloudera, which went sideways. So he knew all the challenges of big data. And, you know, Spark replaced Hadoop, but he saw the same complaints that he heard about Hadoop being raised for Spark as well. So he really believed that there should be a better solution for processing large datasets.
[00:05:50] Unknown:
And Bodo, we believe, is the right answer. You mentioned that you're focusing on Python as the target for this compiler oriented approach to being able to parallelize and optimize the actual computation. And I'm wondering what it is about the language and the toolchain that drew your focus as that being the logical target for your initial implementation.
[00:06:15] Unknown:
Actually, the research project didn't start in Python. At Intel Labs, we started with Julia, and we were working with the MIT team behind Julia. And things were going well. But Julia as a language didn't really take off, and Python took over. So from a practical point of view, if you look at the data from all different sources, you know, on Stack Overflow, some of the other surveys and other data out there, Python is the dominant programming language in the world in general, on a lot of metrics, and in particular for data applications. So that's a practical choice, and we believe Python is a great language for processing data. APIs like Pandas and NumPy are very mature, and they present the right level of abstraction for developers and programmers to work with data.
From our point of view, our core competency is building this technology, which we call an inferential compiler, and it could be done for other languages as well. But today, Python and SQL are our targets. We have a SQL solution as well, which is in beta and will come out very soon. We think Python and SQL are the answer for a lot of the data problems in practice today. And for a long time, they will be dominant.
[00:07:35] Unknown:
In terms of the Python ecosystem specifically, there are some projects that are focusing on trying to be able to handle the scale out computation of these workloads that are inherently single core because of the nature of the CPython virtual machine. And so there are projects such as Ray and Dask that aim to act as a sort of horizontal scale out with a scheduler for being able to chunk up the computation and do sort of a map reduce style workflow. And there are other projects that are focused on compilation and efficiency and speed gains of the Python code, such as Numba. And then there are things like the, you know, NumPy library that actually have a lot of optimized C and Fortran code for being able to handle some of these vectorized arrays for the data processing.
And I'm wondering what you see as the shortcomings of those different elements of the Python ecosystem and some of the benefits that you're looking to provide with Bodo as it compares to what exists in that community?
[00:08:34] Unknown:
So let's dig into what this technology really is and how it's different than all the other options that you mentioned. So Bodo is a new type of compiler we call an inferential compiler. In a regular compiler, the code is maybe written in C or C++, and it's translated to binary. That's it. Maybe some optimizations. But in Bodo, the compiler really understands the program at a very high level to automatically parallelize and optimize it. It's as if an HPC expert rewrote the code for you in MPI and C++, except that it's done automatically and transparently for you. So we believe this is the holy grail of scaling compute, where the code is simple Python with standard APIs. It's NumPy. It's not Pandas-like, it's actual Pandas. You know, Bodo provides a JIT decorator for functions. If you remove the decorators or disable them, it's standard Python and Pandas. Your code works.
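To make that decorator model concrete, here is a minimal sketch (the file path and column names are hypothetical): a function marked with Bodo's JIT decorator is compiled and parallelized, and deleting the decorator leaves ordinary Pandas code behind.

```python
import bodo
import pandas as pd

@bodo.jit  # remove this decorator and the function is plain Python/Pandas
def store_totals(path):
    # Standard Pandas calls; Bodo compiles and parallelizes them
    df = pd.read_parquet(path)
    df = df[df["amount"] > 100.0]
    return df.groupby("store")["amount"].sum()

print(store_totals("sales.parquet"))
```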
So the code is simple Python with Pandas and NumPy and scikit-learn, but the execution is HPC. You know, it's native code that's generated. We actually use Numba and LLVM under the hood, plus some other C++ libraries that we have developed. And it's as if you rewrote the code, you know, using a lot of performance and HPC optimizations. So that's where we are. In terms of comparison, the best comparison is, I would start with Spark and PySpark and these other tools. We can discuss how they are similar to Spark and how, in a sense, they are built in the same mold as Spark. So without Bodo, the way you would scale computation is you create a library which exposes some high level APIs. It started with the MapReduce library, then MapReduce inside Hadoop, then Spark, and now Dask and Ray and others. So you create a library.
The program is actually running on a single core. So it's not running in parallel. But when it reaches those API calls, the library breaks up the call into tasks. Spark, let's say, goes across the cluster on the executors, runs those tasks, and comes back to the driver. So it's a driver-executor paradigm using task scheduling. You pay a lot of overhead per task. We call this waves of tiny tasks. And with this approach, you're essentially trying to approximate parallelism, because the program was running on a single core, on a driver. It wasn't really parallel. And if you have a library and you don't have a compiler, that's all you can do. Because without looking at the whole program, how do you know it's safe to parallelize it? And how would you even parallelize it ahead of time? The language is sequential.
Right? What Bodo does is that the compiler sees the code end to end and creates actual parallel code, which, architecturally, we call single program multiple data, SPMD, and it follows parallel computing principles. And we believe this is the scientifically correct way to run parallel compute. We use MPI under the hood for message passing and parallelism. So each process in the MPI world, called a rank, owns a chunk of the data and doesn't need to go to a master. There is no scheduling. Right? So each process owns a chunk of the data, knows what to do with it and what to communicate and when to communicate according to the parallel algorithm, and there are no extra inefficiencies.
So these processes run in a bulk synchronous manner. And when they get to a communication, it's called a collective operation, and they do it very efficiently. So as an example of the kind of communication that is typically problematic, and how Bodo solves it: shuffling data is a very common operation in data processing. If you are joining tables, doing a group by, things like that, keys with the same value need to end up on the same process, so you shuffle the data based on keys, and systems like Spark do it under the hood. But because the MapReduce architecture with this driver-executor model is not built for parallelism, this communication is very difficult.
Asynchronous tasks cannot really directly exchange data, so they write shuffle files, and they put a lot of pressure on the OS and network and JVM and all the machinery to write all of these files and then schedule reducers to combine them. But in this parallel computing world, the processes get to the communication phase at the same time, in what is called an all-to-all collective, and they exchange messages directly. And in a lot of settings, optimized cloud settings, for example, they actually write into each other's memory, remote direct memory access, RDMA, and it's much more efficient than this master or driver-executor paradigm with Spark.
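To make the SPMD pattern concrete, here is an illustrative hand-written sketch of a key shuffle using an all-to-all collective with mpi4py. This is not Bodo's internal code (which is generated automatically); the data and the key-to-rank mapping are hypothetical.

```python
# Run with: mpiexec -n 4 python shuffle_sketch.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank owns a chunk of (key, value) rows; rows with the same key
# must end up on the same rank, as in a distributed groupby or join.
local_rows = [(k, rank * 100 + k) for k in range(8)]

# Bucket rows by destination rank: key % size decides the owner.
outgoing = [[] for _ in range(size)]
for key, value in local_rows:
    outgoing[key % size].append((key, value))

# All-to-all collective: every rank exchanges buckets with every other
# rank directly, with no driver, scheduler, or shuffle files in between.
incoming = comm.alltoall(outgoing)
my_rows = [row for bucket in incoming for row in bucket]
print(f"rank {rank} now owns keys {sorted({k for k, _ in my_rows})}")
```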
So there's potentially orders of magnitude improvement just because of the parallel architecture, depending on the workload. That's the key difference of this inferential compiler, parallel computing paradigm versus the previous approach, which was a driver-executor library. So the new Python libraries, Dask and Ray, are really built in the old approach: they are libraries that run on a single core, with some scheduler that breaks up the compute into tasks, goes across the cluster, and comes back. Except that in our testing, because they use Python and Python is slower than Java, they're slower than Spark. So, you know, they offer some simplicity compared to Spark, but they are slower. That's their trade off. However, speed really matters, especially in the cloud. If you're a 100 times slower than efficient code, you are paying a 100 times more money.
So your $10,000 bill becomes $1,000,000, and I don't think any organization is okay with that. So there are various aspects. If I were to summarize, the code in Bodo stays standard Python, so you don't have vendor lock-in. You can take it anywhere, and you can disable it and run it in regular Python. But these other approaches, Spark, Dask, Ray, expose new APIs that the developer needs to learn and adapt to, and you get locked in because you rewrote your code for Dask. We have had issues with some customers where, hey, I've already written my code in Dask. I can't take it anywhere else. Even though, you know, on the surface, it tries to expose Pandas-like APIs, in real applications and complicated settings, their code becomes really Dask code, not really Pandas code. So you need to learn the new APIs, and you need to understand the parallelism.
Because in these approaches, you need to know what data has to be distributed and what data should be replicated on all the cores. And if you don't do that properly, you have incorrect results. Sometimes we have had customers where, hey, I used Dask, but my results are incorrect. So Bodo automatically parallelizes code, which means that by looking at the whole program and understanding it, it will decide: okay, this data structure is kind of on the map side of MapReduce, if you want to go with that terminology, and we have to distribute it. But this other data structure, which may be a data frame, a parallelizable thing, is on the reduce side, and, semantically, it's like a scalar that has to be replicated on all the cores. So Bodo does the parallelism automatically for you, versus these others where you need to be really careful using them. Otherwise, you will get incorrect results. And if the program, structurally, has some issues, Bodo will tell you: hey, this cannot be parallelized. You have to fix this issue. So this compiler intelligence helps in that aspect as well. So on the simplicity side, there are all these implications.
On the performance side, you can't really beat an efficient parallel computing architecture. You can build all the libraries and optimize for years. You know, billions of dollars have been spent on Spark, but at the same time, you see all these academic papers asking, why is Spark so much slower than MPI? Everyone knows that, but there's nothing they can do because it's an architectural issue. So you can't really ignore those efficiency issues. And by the way, beyond cost, for a lot of organizations, we have a Fortune 10 kind of large tech company as our largest customer, and they care about being green as well, reducing the carbon footprint in their data centers. And with Bodo, they see that these orders of magnitude improvements translate into efficiency on the carbon footprint as well, which is related to cost as well. So these are some of the benefits, but something that we didn't expect at all is the benefit of Bodo for reliability of the system. The marketing line says that MPI is not reliable and Spark is reliable.
But in practice, because of all these layers of these libraries, there are so many failure points. You know, your OS runs out of open files because you have too many shuffle files, or your JVM runs out of some resource, or the library itself makes a mistake, or there are too many tasks, and they fail all the time. But with Bodo, we have had deployments where, over a year, there was no failure, because there's nothing to fail. It's native code running efficiently. There are none of these layers. There are no shuffle files. And you fail only if your infrastructure goes down or some hardware fails, which is very rare these days. So all those reliability issues of these libraries go away, and the infrastructure is much more streamlined in general as a result.
So that's on the parallel compute side. But you mentioned other tools for accelerating Python. Yes, there are approaches for accelerating Python, and there have been a lot of workarounds, but they may not be comparable, because we are talking about data processing. A lot of them target sequential Python in general, sometimes with smaller improvements of 10, 20%, which is not what we are talking about. We are talking about thousands of times faster execution of Python code here, targeted at data processing, not general Python code. That's one aspect. And the other aspect is, if there are tools that make sequential Python faster, we can use them as well, and we do use them. Bodo is based on Numba, and we contribute to Numba as well. We are developers of Numba, and that's applicable to the kind of thing we are doing.
Other projects may not be as relevant based on the target use case.
[00:20:31] Unknown:
Then on the other side, there's also the HPC aspect, which you've touched on as far as, you know, MPI and being able to parallelize these algorithms and spread them out across multiple cores and multiple instances. And most of the time, when I think about HPC, I think about, you know, some massive supercomputer that has some number of thousands or millions of cores with optimized network interconnects, where everything is operating as a single system, versus when you're running in the cloud, where you're talking about scaling out multiple instances and the only real communication synchronization is over the network. And I'm wondering if you could just discuss some of the current state of the art of HPC and some of the trade offs that exist between the, you know, supercomputer environment and scale-out distributed systems that are running on heterogeneous cloud architectures?
[00:21:23] Unknown:
So, traditionally, the HPC world is targeting scientific computing applications, which are, a lot of the time, physical simulations, for example, weather forecasting or climate forecasting or some sort of fluid dynamics, these sorts of things. And they have been writing low level MPI Fortran code in, maybe, big national laboratories and large universities to solve those problems. In the enterprise world, the time frames don't really fit writing low level code. You know, you have a data problem. You need to solve it today. Your marketing and sales team wants the data right now, not 3 months from now. So it's not really practical to go through all that complexity of the code.
Because of those reasons, anytime you say the words high performance computing: you say high performance, that's good. You say compute, that's fine. But if you put them together, a lot of people get scared. From users to enterprises to investors, they just don't like these 3 words together, because it has never worked for the enterprise. It's just too complicated. Bodo, we believe, for the first time, democratizes high performance computing for regular developers, because you don't need to write low level code. Your code is Python, high level Python. You don't even need to learn new APIs. So we believe we are bringing HPC to the enterprise for the first time. And as a result, we believe we are democratizing machine learning and AI, because guess what? The big tech companies are actually using HPC for their AI. They're actually using MPI and all the right techniques internally, and the rest of the world should be able to take advantage of the same techniques. So that's on the compute side and the programmability side. From an algorithmic point of view, the data applications are really parallel compute applications and fit the HPC paradigm very well. So if a problem has been solved decades ago, why not just use it? I mentioned shuffle. You know, exchanging data, this so-called all-to-all operation, was solved, like, 30, 40 years ago. And in the MPI implementation, everyone has optimized it, including cloud providers, by the way. Some of the largest supercomputers in the world are in the cloud, and they have all the benchmarks and everything. So why not reuse all the decades of development in that ecosystem for the same problems? All these data problems are essentially parallel algorithms, some subset of scientific computing, that need to be solved.
So the algorithms are HPC, and there is no distinction. Any other solution is a workaround for them. On the infrastructure side and the machine side, yes, HPC machines, historically, maybe 20, 30 years ago, were different. But today, the cloud really provides the same high performance compute capabilities. Let's take AWS as an example. The instances are really high performance Intel processors that are used in supercomputers too, so there is no difference. For the network between them, today, AWS provides an option called Elastic Fabric Adapter, EFA, that provides RDMA and high performance networking between the nodes. And cloud providers, all of them, provide placement of these nodes together.
So you have a cluster. You can bring up a cluster that's an HPC cluster in the cloud. There is no hardware difference, and you can take advantage of the same sort of performance and scalability in your cloud environment as well. And we have benchmarks on AWS. Recently, we actually published a customer benchmark on 4,500 cores on AWS, scaling very well. So you have a supercomputer; that's not a small number of cores that you can use to scale your application. So developers, I think, can think bigger now and think about larger problems to solve with the availability of cloud services with these kinds of capabilities.
[00:25:56] Unknown:
And then in terms of the technical implementation of the Bodo platform, can you dig a bit more into some of the compiler aspects and some of the ways that the inferential compiler might be different than something like Numba or, you know, just the GCC compiler that a lot of people have encountered?
[00:26:14] Unknown:
So Bodo is what we call an inferential compiler, which means that it understands operations at a very high level. Now your GCC compiler, if you are writing a C loop to process data, doesn't know what you are doing. It sees loops and translates them to binary code. And, yes, there are low level optimizations to allocate registers and get rid of unused scalar variables and things like that, but at a high level, it doesn't know what's going on in the code. With Bodo, the compiler starts at a very high level, which means that the compiler understands data frames, understands data frame joins, group bys, user defined functions with data frame apply transformations.
So it understands all of those things, and it optimizes the program and parallelizes the program based on those inferred properties. So it's as if the compiler has the same knowledge as the programmer, in some way. And this is made possible because the APIs in Pandas and NumPy are just so mature that they keep the semantics in place. The data frames and joins, all of those aspects are already in the program. They are not lost in loops. The other compilers operate at the loop level, but Bodo operates at the high level API level. So that's the major difference. So you are not just getting a maybe slightly better sequential program out. You're getting a much better program, both on the sequential side, running on a single core, and you can run it on as many cores as you want. That's the key difference.
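As a hedged sketch of the kind of program this works on (the file paths and column names are hypothetical), every call below is a named data frame operation whose semantics an inferential compiler can see and pick a parallel algorithm for, rather than an opaque loop:

```python
import bodo
import pandas as pd

@bodo.jit
def regional_totals(orders_path, customers_path):
    # The compiler sees a read, a join, a group by, and an apply by name,
    # so it can parallelize each of them instead of guessing from loops.
    orders = pd.read_parquet(orders_path)
    customers = pd.read_parquet(customers_path)
    joined = orders.merge(customers, on="customer_id")
    totals = joined.groupby("region")["amount"].sum()
    return totals.apply(lambda x: x * 1.08)  # a user-defined adjustment
```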
So on the platform side, the difference in the platform implementation is that, on the compute side, as I mentioned, we are taking a parallel computing approach, which is this single program, multiple data architecture, with processes that can work independently. The libraries, from Spark to Dask and Ray, are taking a distributed computing approach. In distributed computing, you are assuming you have a heterogeneous environment where the client is, let's say, a mobile phone that goes over the Internet and connects to some server, and you have to deal with all the issues of that connection, with TCP and so on and so forth. That's the underlying assumption of distributed computing.
And, you know, the distance and reliability issues of that connection and the heterogeneity are baked in. But in parallel computing, the assumption is that cores are close together, and they have a nice, high performance, reliable connection. And also that they are homogeneous, meaning that the compute tasks that are running are the same, with maybe slight load balancing issues and things like that, but they are the same. So based on that, the communication between them is much more optimized, and the protocol doesn't have to be TCP. TCP was designed for the Internet, for long distances, not for a cloud with a placement group and nodes together with a very nice connection, a 100 Gb network. TCP is not great for those.
So the underlying assumptions are very different for parallel compute. Our SaaS service, for example, is a distributed computing environment, where clusters are one part of it. So there is a connection. The notebooks, jobs, management, governance, all sorts of things need distributed computing. But the compute layer itself, the way you process the data, has to be parallel computing, and that's a principle that we follow in the Bodo platform.
[00:30:17] Unknown:
There are a number of interesting things to dig into there, but one of the things that you mentioned earlier is the retry capabilities, where if an individual task within the parallel computation fails, I'm wondering how you address being able to identify that and schedule a retry, or what the process is in this parallel compute environment for being able to handle these intermittent or incidental failures. Or if, for instance, one of the nodes in the parallel cluster happens to have a hardware failure in the middle of a task execution, what the solution is for being able to be resilient to those types of errors.
[00:30:56] Unknown:
Definitely. So with Bodo, the software failures go away, and those are much more frequent, because you're running bare metal binaries. So this changes the equation for reliability quite a bit. The main approach to reliability for parallel compute is checkpointing. You go back to some sort of checkpoint, because there's communication between processes, and you have to store the last valid state. You have the same option here as well. That doesn't change. What changes is that you don't need to use this very often. And my recommendation for a lot of users has been: for the SLA that you care about for your application, you don't need to worry about it, because hardware failures are very rare. You know, for a server, sometimes I've seen numbers like 10 to 100 years for each individual server. And in an enterprise, you know, the cluster is typically up to 32 nodes.
You know, it's just not very plausible for failures to happen. However, if you go to 1,000-node scale, statistics take over and you do really need to take checkpoints. Or if your application is really long running, if you are doing a deep learning training workload on a large number of nodes and it's running for a long time, for weeks, then you definitely need to checkpoint. So these trade offs have to be considered. But for a lot of practical applications, the fact that the code is running on bare metal makes it much simpler. And the fact that the code is faster.
So if you run for a fraction of the time, the chances of something failing during that time are lower. So these are the benefits. But in terms of actually orchestrating the restart, Bodo has a clean engine design. It's very simple Python processes running; we are not bundling any middleware in the core engine. With Spark and others, there are task schedulers. So not only do they try to schedule the actual compute tasks, they want to schedule jobs and bundle a lot of layers together, but we believe that's not the right choice. The compute parallelism has to be separate from the other system aspects.
So your Airflow or Kubernetes infrastructure has to kind of take care of all the middleware, including restarts and things like that. They do a great job, and we are not trying to replace those.
[00:33:47] Unknown:
Struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world's first end to end, fully automated data observability platform. In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing the time to detection and resolution from weeks or days to just minutes.
Start trusting your data with Monte Carlo today. Visit dataengineeringpodcast.com/impact today to save your spot at Impact, the data observability summit, a half day virtual event featuring the first U.S. Chief Data Scientist, the founder of the data mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 people who RSVP will be entered to win an Oculus Quest 2. And then for the actual distribution of the data to the different tasks that are executing across this parallel cluster, I'm wondering what you are building as far as some of the integration or optimization layers for people to be able to fetch data from whatever storage layer they might be using, whether that's files in S3 or a data warehouse or, you know, multiple different databases, and then being able to chunk that out and distribute it across the parallel execution, and just some of the planning that needs to go into the application design to be able to take advantage of some of the efficiencies that Bodo is trying to build in for the parallelization and data distribution aspects?
[00:35:33] Unknown:
That's a great point. Data has to come to the cluster efficiently for efficient compute to happen. If your data takes forever to load, then it doesn't matter how fast the compute is. For Bodo, we are completely storage agnostic, and we want to support all different kinds of storage, from object storage and data lakes, to the new lakehouse paradigm with Iceberg, to data warehouses. So we have optimized connectors for all of these options. And the way they work is that different cores know what chunk of data they need to load, and they load it as efficiently as possible.
On S3, of course, there's a lot of parallelism, and it's a very scalable system, with typically Parquet files and Parquet pieces that we load efficiently. For data warehouses, some of them today may not be as easy, so we are building connectors so that the programming API is still Pandas and nothing changes for the user. But under the hood, all the complexity is managed by Bodo. And in some cases, we are actually partnering with Snowflake, and they are providing features for us to be able to load data from Snowflake more efficiently than a regular SQL connector can provide. So, for example, if you load up to a terabyte of data, each core can kind of run a query to the warehouse and say, give me my chunk of data. Up to a terabyte and a few hundred cores, that scales.
But beyond that, it can overwhelm the data warehouse. So we are co-designing this connector with Snowflake so it can load massive amounts of data, maybe up to petabytes, without some of those scalability issues. So, in general, I think Bodo enables a higher scale for all of these storage systems, and they need to adapt and provide abstractions to load larger amounts of data than was previously possible for these systems. So that's one aspect. The other aspect is that some operations, such as filters, need to be pushed down so you load less data from the storage system, which we are working on as well. And we have efficient filter pushdown on partitioned Parquet files, for example.
And for data warehouses and SQL stores, we are developing those as well. So that's a huge area by itself in terms of loading data and distributing it efficiently to all these cores, as we are talking about thousands of cores, potentially, in this case.
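As a hedged illustration of the filter pushdown idea (the bucket path, partition column, and column names are hypothetical), when a filter directly follows the read inside a compiled function, the compiler can fold it into the scan so only matching partitions are ever read:

```python
import bodo
import pandas as pd

@bodo.jit
def recent_event_counts(path):
    # Dataset assumed to be Parquet partitioned by "year" on S3. Because
    # the whole function is compiled, the filter below can be pushed into
    # the read so non-matching partitions are never loaded.
    df = pd.read_parquet(path)
    df = df[df["year"] == 2021]
    return df.groupby("event_type").size()

counts = recent_event_counts("s3://my-bucket/events")
```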
[00:38:22] Unknown:
Another aspect of what you were discussing is being able to do things like building and training neural networks and being able to optimize the execution of that, where a lot of the focus recently has been on leveraging GPUs for being able to run across these, you know, hundreds and thousands of cores, but with a smaller memory footprint per core. And then other approaches are things like being able to do automatic network pruning to reduce the number of nodes in the network and reduce the reliance on GPUs. And I'm wondering what you've seen as some of the potential for this parallel compute approach to be able to efficiently train these neural networks without necessarily having to rely on GPUs, which have become relatively scarce and difficult to acquire in the, you know, times that we're in now with these issues of supply chain management?
[00:39:12] Unknown:
Definitely. So with Bodo, we are making it easy to scale to abundant CPU resources in the cloud. So you can scale to thousands of cores easily, and, you know, you just have a knob for the number of cores. It works on one core all the way to 10,000 cores or more. We have tests at those scales as well. So you can use Bodo whether it's for deep learning or other machine learning algorithms or just data processing, at any scale that you want. One key aspect of those deep learning systems is that they provide HPC features. Otherwise, training neural networks is so expensive that without HPC, it's gonna take forever, and it's not practical. So Bodo, being MPI and in the HPC ecosystem, can integrate with those tools and systems very well.
So you can load and process your data with Bodo, and we have some features for managing those GPUs if you need to, integrating with all the other GPU based tools that are built for this purpose. So we naturally integrate with other tools in this area. For things like Spark, it took years to build features to integrate with MPI and other systems out there for efficient processing of neural networks and other machine learning algorithms.
[00:40:36] Unknown:
As far as the actual Python aspects of being able to infer the intent of the program and be able to recreate this optimized algorithm running in a native binary, what are some of the aspects of the language and its ecosystem that have made that a complicated endeavor, and some of the potential software architecture and programming patterns that you might run into that could act as bottlenecks or optimization cliffs for being able to build out these algorithms automatically?
[00:41:08] Unknown:
So in general, Python is a great language, and its adoption by everybody from, kind of, school kids all the way to experts in various areas shows how easy it is to use Python. This simplicity, a lot of it comes from the high level abstractions of Pandas and others, and that has made Bodo possible. Without those high level abstractions, it wouldn't be possible to build Bodo. The main aspect for Bodo is data types: the compiler needs to know the data types for all variables to be able to understand the program and do all the other analysis and inference and so on and so forth.
There are some corner cases where this type inference is not as easy, which makes it more difficult for the compiler. And we have built a lot of features to infer those situations automatically, because we don't want the user to change code. But sometimes, in some corner cases, we have to throw an error: hey, please change this line, because the compiler doesn't understand the data type for it. So the concept of data types is very important. But, fortunately, there is a lot of consciousness and a large movement in the Python community to annotate types and make types a key part of Python programming, and to gradually add types, because data types are very important for testing and validation of programs as well. By the way, what Bodo provides at a high level to the Python community, we believe, is that we want to make Python the default platform for building data applications.
Today, in the enterprise, code is written in Python and Pandas by experts, but then some other team rewrites it for performance in something else, whether it's Spark, or even completely changing the language to Scala or SQL. We don't want this rewrite to happen, because it's expensive, takes a lot of time, and validating this rewritten code is very difficult. And by the way, even then, that code is not HPC yet and is not as efficient. So we want the Python code to go from development and prototyping all the way to production without any of these extra steps.
So one aspect is obviously performance and scalability. The other aspect is testing and checking and type checking ahead of time. With the Bodo compiler, because it's end to end compilation, all the data types are checked. If there's some issue, you will see it at testing and compilation time, not in a long running production job failing, you know, during production. So that's, I think, key, and I think it is improving in the Python community. There's this consciousness of the fact that the APIs have to be typeable. There are some corner cases that need to be ironed out, but it's not a large issue. I think a lot of the APIs and abstractions are already in that space.
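As an illustrative (hypothetical) example of the kind of corner case that defeats static type inference, a variable whose type depends on a runtime branch cannot be given a single static type, so a type-inferring compiler like Bodo's has to reject the function or ask for a rewrite (the exact error reported may vary):

```python
import bodo

@bodo.jit
def describe(n):
    # x is an int on one branch and a string on the other, so no single
    # static type exists for it; keeping one type per variable fixes this.
    if n > 0:
        x = n
    else:
        x = "empty"
    return x
```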
It's just minor tweaking in some corners, and we are giving feedback to various projects such as Pandas to do that. So that's one aspect. In terms of software architectures, unfortunately, because this driver-executor model has become the default paradigm, a lot of infrastructure projects have adopted it, and there's a lot of inefficiency in the data platform architecture that comes with it. So it's hard for a Bodo based solution to plug in and get at all the inefficiencies. We have ways to integrate Bodo almost anywhere we have tried. But to take full advantage of Bodo's capabilities, we believe the programmers and data engineers really need to think about the fact that you don't have those driver-executor limitations, and you can really think parallel.
And you design your architecture around a truly parallel system. So that's a way of thinking that we think needs to advance for taking full advantage of Bodo in the data infrastructure.
[00:45:30] Unknown:
As far as the assumptions and expectations that you had when you first began working on this problem and exploring how to turn it into a viable business model, what were some of the assumptions that have been challenged or had to be reevaluated as you have progressed through the implementation of the technology, building the business around it, and working with customers?
[00:45:54] Unknown:
That's a great question. Initially, when we started the early research, it was all about machine learning and more complicated machine learning algorithms. But as we made progress, when we met our first customer, a large tech company, we really thought that the data processing and cleaning, the earlier phases of the data pipeline, were ironed out, and that it's all about complicated algorithms at the end. But we realized that it's all about the early stages of the pipeline: ETL, data prep, data processing, cleaning, and then later, featurization and things like that. The algorithm at the end is the easy part. It was really unexpected for us. So the first POC we did actually was: hey, we have this complicated pipeline. It's SQL and Python. Can you do it in pure, simple Python code? And that's what we delivered.
And, you know, with standard Pandas doing all that data processing. Along the way, we realized that we had thought the SQL problem and the data processing problem were solved. There are SQL systems. They can solve the problem. Yes, maybe Python is not there, but SQL is there. But we realized that there's a lot of pain working with SQL systems as well. Too many knobs to tune, difficult to deploy, doesn't give you the right performance for data processing. That's why we created BodoSQL, enabling those developers who prefer SQL over Python. So there's this split. We don't think we have to force anybody to choose Python or SQL. They can choose the language they want, but they need to be able to take advantage of the parallel computing capabilities of Bodo. In terms of the implementation, it was very unexpected for us that Bodo would work so well out of the box for SQL code as well, because we haven't added a lot of the SQL type database optimizations.
We are adding them one by one, it's on the roadmap, and it's making good progress there as well. But we realized the parallel computing aspect of it is so important for SQL type data transformations as well. We even realized that some of the database-like optimizations hurt performance if the parallel computing aspect is done right and is efficient. So there are a lot of unexpected things that we are learning along the way in this process. And we are good students. We are learning as we go from our customers and users and the community in general. In terms of working with your customers, because this is
[00:48:41] Unknown:
a different approach than a lot of people have been dealing with, particularly in the data ecosystem where things like Spark, as you mentioned, and data warehouses and, you know, these sophisticated data pipelining tools are the norm, I'm wondering what you have seen in some of the education that you've had to do and some of the challenges that you faced in working with customers and helping them to understand the specifics of the approach that you're offering, some of the benefits that it can provide, and how it might integrate with their existing data operations and data infrastructure?
[00:49:15] Unknown:
The biggest challenge, I would say, has been the mindset when you say the word Python. When you say the word Python, a lot of customers think about small data. They just automatically go there, because Python doesn't scale. We have to really explain: hey, we are not here for the small datasets. Give us the big data. Give us the harder problems at larger scale. We have solved the scalability problem of Python, and you can use Python in production. Python can be scalable and efficient. You don't have to write your code in Spark and Scala and Java and C++. That has been the biggest, I would say, educational gap that we needed to tackle.
The fact that Python is now a first class language, and you can use it in production for those kinds of harder problems. Yes, Python is in production for a lot of things, but the mindset is still in the space of: hey, I have to rewrite my Python if the problem is too big or too hard. So that has been the biggest educational gap to tackle. And as I mentioned, on the software architecture side, really thinking parallel and thinking bigger. Hey, now you can scale to more cores. With Spark, you were maybe doing a 100 cores and it was falling over, so you couldn't think bigger. You can run it on a 1,000 cores now, and you will get, you know, efficient parallelism, and you can run on bigger datasets.
So completely opposite problems, actually, in terms of mindset to solve.
[00:50:51] Unknown:
As you have been working with customers and working on the technology, what are some of the most interesting or innovative or unexpected ways that you've seen Bodo used?
[00:51:00] Unknown:
You know, I'm actually learning from data engineers at our customers on a daily basis, because they solve so many interesting problems in unexpected ways all the time. When, you know, some of these good data engineers understand Bodo, we saw that they designed the whole application differently. Some of the applications are actually very interesting. For example, visualizing large datasets, a kind of interactive visualization application to, let's say, visualize a supply chain. A hard problem. When the data engineer designing it saw Bodo, they said, okay, I don't need this component, that component, that component. And I was surprised. Hey, are you sure you don't want the data warehouse?
Are you sure you don't want this kind of graph database? Are you sure you don't want this other component? So the fact that, given Bodo's power, they could simplify their architecture is very impressive to me. And the fact that they're replacing so many different things. You know, we started thinking in terms of replacing Spark and the fact that Spark has inefficiencies, but data engineers are figuring out ways of using Bodo for a lot of different applications that we didn't expect, including SQL deployments. I really didn't think someone would try to replace their, I don't know, complex deployment with all the scheduling and interactive things with Bodo, but they benchmark it: hey, this is faster.
It doesn't have all those knobs. Let's try to replace that. And we are really, as a company, catching up on developing the tools and features around Bodo to make these interesting applications possible. So we are playing a catch-up game with our users, who understand the power of Bodo more than we do.
[00:52:54] Unknown:
And in your experience of building the technology and building the company, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:53:03] Unknown:
Where do I start? A lot of lessons in terms of the dynamics in the enterprise and the fact that developer needs are sometimes being neglected. This gap between simplicity and scaling may not be very well understood in the enterprise. And we need to communicate this developer need through the management chain and explain why developers really matter. Their productivity translates to your business value, so you need to care about their productivity. And, also, by the way, you are paying a lot of money for your cloud infrastructure, or your data center if you are running things on prem, that you could save if you build applications the right way, the way your developers tell you they need to be built. So developers need to have a bigger voice in terms of the needs of the enterprise, and their productivity has to be prioritized more, I think. We have a good trend going on in terms of developers having more say, but I think there needs to be even more attention on developers in the enterprise.
So that has been a challenge, kind of explaining all these various aspects to various personas in the enterprise of what the solution should be and communicating that. What makes this even more complicated is the fact that there are so many different tools, so many different options out there, that a lot of people just throw their hands up. Hey, I don't know. I'll take whatever AWS gives me, or I'll take whatever the Azure reps tell me to use. It's just so overwhelming. So getting out of that mindset, and, hey, you can really have a simple architecture and pick the right tools, and you shouldn't give up, is also another challenge
[00:55:05] Unknown:
that we see along the way. For people who are interested in the potential for Bodo to be able to accelerate their code and simplify their systems architecture, what are some of the cases where Bodo might be the wrong choice, or instances where you've spoken with a customer and realized that the problem that they were trying to solve is not one that Bodo is well suited for? Definitely.
[00:55:27] Unknown:
So we have an inferential compiler for Python, but, really, the target is data processing applications. So when people hear the word Python, you know, we are not targeting something like a web server written in Python or some other things. That's clearly not a use case for Bodo. The other issue that I also mentioned was the fact that when you say Python, a lot of customers think small data and maybe some of those data science applications. We are saying, no, we are here for big data. And what I tell customers is: hey, if Pandas works and the data is small, why change that?
With Bodo, you would need to learn something new in your environment; if regular Pandas works, just run Pandas. Bodo is for really large datasets, and we are here to simplify your work, not add extra work for no reason. If regular Pandas on 10 megabyte datasets works, why would you add something new just because it's shiny? Right? So that's another case we have seen. The other cases have been really trying to replace too many things with Bodo when we don't have some of the features yet. So we ask customers: hey, don't throw away your SQL deployment yet. We need to build some of the features for you first. So we ask them to be a little bit more patient with us on some of the use cases that need more tools around Bodo. And, in a lot of cases, it's just a matter of providing some examples.
So that's something we are working on. But, clearly, if the application is not large scale data processing and doesn't need the power of Bodo, that's where we tell customers: hey, this is not the right fit for you. As you continue to work on Bodo and work with your customers and explore the challenges
[00:57:16] Unknown:
that data scientists and data engineers and data professionals are facing. What are some of the things that you have planned for the near to medium term future? We have a lot of interesting things coming in the pipe.
[00:57:28] Unknown:
From the platform and cloud solution point of view, we are building what I call our platform 2.0, and we are really rethinking the experience in the cloud SaaS space for notebooks. Today, the way parallel notebooks are done, the notebook is actually running on a driver, and the parallelism happens through these APIs of Spark or something else. But we are really rethinking the Jupyter notebook experience for Python running in parallel. We are working with the IPython Parallel project on some of the features in open source that would enable our platform, and at the same time developing our platform to create this new truly parallel notebook experience for developers.
I'm pretty excited about that feature, and our new version of the platform coming out soon in general, which is, I think, very exciting. The other aspect is the BodoSQL solution that I'm very excited about, the first SQL solution that truly integrates with Python. Bodo can actually optimize across SQL and Python. Bodo can actually type check and throw errors if there are mistakes in the SQL code. You know, you pass a data frame to the SQL code and some column doesn't exist, or you get output from SQL code and some column doesn't exist. Today, with the existing SQL solutions, you run it in production or on a large dataset, and near the end, you may get an error. With Bodo, at compilation time, you can actually get those errors very easily and have more correct code. So that's an area that I'm excited about.
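As a hedged sketch of what that SQL and Python integration can look like (based on BodoSQL's context API; the table and column names are hypothetical and the exact API may differ by version), the SQL string and the surrounding Pandas code compile together, so a misspelled column fails at compile time rather than hours into a production run:

```python
import bodo
import bodosql
import pandas as pd

@bodo.jit
def top_regions(path):
    df = pd.read_parquet(path)
    # Register the data frame as a SQL table; the query below is checked
    # against its schema when the function is compiled.
    bc = bodosql.BodoSQLContext({"sales": df})
    totals = bc.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
    return totals.sort_values("total", ascending=False)
```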
We have a lot of optimizations coming out for Python and Bodo in general that are also interesting, and they challenge a lot of data processing and database ways of thinking from the past, where parallel computing wasn't considered as strongly as Bodo is focusing on it. And in general, we are focusing on user experience, you know, throwing the right errors and all the little things that developers expect from a good tool and platform, to make it as easy to adopt and delightful as possible. So there's a lot going on, and you'll hear a lot more about us hopefully soon.
[00:59:56] Unknown:
Are there any other aspects of the work that you're doing at Bodo or the overall space of HPC and MPI and just optimization of data processing workflows that we didn't discuss yet that you'd like to cover before we close out the show? I think, in general,
[01:00:11] Unknown:
having a parallel computing mindset in the enterprise is very important, to kind of think differently and think bigger. And the kinds of problems that can be solved can be really expanded with Bodo. I think that's an important aspect for enterprises to consider. In terms of technologies in this space, we touched on storage solutions and loading data and distributing data. I think the storage technology developers need to really think about good parallel abstractions to load and store data in a scalable manner on large numbers of cores. I think this area has been neglected, maybe because the previous compute solutions were not scalable enough. So having some sort of, what I call, collective API calls, where a lot of cores can load data at the same time very efficiently, is an area that needs to be considered.
And the tools in general in this space need to really rethink and simplify based on parallel computing architectures. And don't assume, hey, Spark is the only solution out there, and it has all these limitations, and we need to cater to those. No, you can use parallel Python with Bodo and have a simpler and more delightful experience for your users in general. So these are the things I'm looking forward to, and that I think need to be addressed in this area.
[01:01:48] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:02:03] Unknown:
Where do I start? As I mentioned, this parallel computing gap is something that translates into a lot of issues around it, because the compute engine is at the core, and its complexity propagates through the rest of the architecture. So I think there needs to be more thinking about this aspect and about simplifying the architecture. It's really about simplicity. I think there are a lot of solutions out there, but having too many tools can actually make it more complicated for users. So we have to really think about simplifying the environment as much as possible for data engineers, and have ready solutions and ready examples and whatever we can do to reduce the time to value
[01:02:54] Unknown:
for a data project in data teams. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Bodo. It's definitely a very interesting project and an interesting approach to a complicated problem. So I definitely look forward to seeing where it goes, and I definitely appreciate all of the time and energy you're putting into helping to solve some of these complex challenges that exist in the industry. So thank you again for your time, and I hope you enjoy the rest of your day. Thank you very much, Tobias. It was a great conversation.
[01:03:23] Unknown:
Really enjoyed speaking with you.
[01:03:30] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Ehsan Totoni and Bodo
Ehsan's Journey into Data Management
The Productivity Performance Gap in Programming
Founding Bodo and Its Vision
Choosing Python for Bodo's Compiler
Bodo's Inferential Compiler vs. Traditional Compilers
HPC and Cloud Computing
Technical Implementation of Bodo
Handling Failures in Parallel Computing
Data Distribution and Integration with Storage Systems
Training Neural Networks with Bodo
Challenges in Python and Software Architecture
Customer Assumptions and Realizations
Educating Customers and Integration Challenges
Innovative Uses of Bodo
Lessons Learned in Building Bodo
When Bodo is Not the Right Choice
Future Plans for Bodo
Final Thoughts on HPC and Data Processing