Summary
Python has become the de facto language for working with data. That has brought with it a number of challenges having to do with the speed and scalability of working with large volumes of information. There have been many projects and strategies for overcoming these challenges, each with their own set of tradeoffs. In this episode Ehsan Totoni explains how he built the Bodo project to bring the speed and processing power of HPC techniques to the Python data ecosystem without requiring any re-work.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world's first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, the founder of the Data Mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don't want to miss it!
- Your host is Tobias Macey and today I'm interviewing Ehsan Totoni about Bodo, a system for automatically optimizing and parallelizing Python code for massively parallel data processing and analytics
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Bodo is and the story behind it?
- What are the techniques/technologies that teams might use to optimize or scale out their data processing workflows?
- Why have you focused your efforts on the Python language and toolchain?
- Do you see any potential for expanding into other language communities?
- What are the shortcomings of projects such as Dask and Ray for scaling out Python data projects?
- Many people are familiar with the principle of HPC architectures, but can you share an overview of the current state of the art for HPC?
- What are the tradeoffs of HPC vs scale-out distributed systems?
- Can you describe the technical implementation of the Bodo platform?
- What are the aspects of the Python language and package ecosystem that have complicated the work of building an optimizing compiler?
- How do you handle compiled extensions? (e.g. C/C++/Fortran)
- What are some of the assumptions/expectations that you had when first approaching this project that have been challenged as you progressed through its implementation?
- How do you handle data distribution for scale out computation?
- What are some software architecture/programming patterns that act as bottlenecks/optimization cliffs for parallelization?
- What are some of the educational challenges that you have run into while working with potential and current customers?
- What are the most interesting, innovative, or unexpected ways that you have seen Bodo used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bodo?
- When is Bodo the wrong choice?
- What do you have planned for the future of Bodo?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Bodo
- High Performance Computing (HPC)
- University of Illinois, Urbana-Champaign
- Julia Language
- Pandas
- NumPy
- Dask
- Ray
- Numba
- LLVM
- SPMD
- MPI
- Elastic Fabric Adapter
- Iceberg Table Format
- IPython Parallel
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster.
With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Ehsan Totoni about Bodo, a system for automatically optimizing and parallelizing Python code for massively parallel data processing and analytics. So, Ehsan, can you start by introducing yourself? Thanks for having me, Tobias. I'm Ehsan Totoni, cofounder and CTO of Bodo. And do you remember how you first got involved in the area of data management?
[00:02:16] Unknown:
Interesting story. I was actually working on high performance computing. And along the way, I realized that probably the biggest application of high performance computing is data management and data processing, because the amount of data is growing, machine learning is growing, and at the compute level, these are all high performance computing problems to solve. And
[00:02:39] Unknown:
so that brings us to what you're building at Bodo because I know that some of the principles that you're building into the platform are based on your work with HPC. And so I'm wondering if you can just discuss a bit about what it is that you're building there and some of the story behind how you ended up at this specific juncture of problems and why you've decided to focus your time and effort on this problem domain.
[00:02:59] Unknown:
Definitely. I did my PhD at University of Illinois, Urbana-Champaign in high performance computing, sort of a hub for HPC. A lot of interesting problems being solved, but it was all low level code. MPI with Fortran and C++, many years in development for a lot of the codes, a lot of HPC experts developing those codes. But you look across the hall, and all the data scientists and domain experts write code in scripting languages. It used to be, you know, a lot of MATLAB. Now Python is more dominant. So I wanted to solve this problem, which is called the productivity-performance gap, where, you know, for a programmer, it's simple to write scripting code, but it runs on a single core, and it's very slow and inefficient. But if you want performance, you have to write low level parallel code that can scale to millions of cores these days on supercomputers, but only a few people in the world can write those codes. There's a huge gap in this space.
So I joined Intel Labs to solve this problem, and it hadn't really been solved for a long time, because the solution would be to have a compiler that could automatically take this high level code and generate that performance-expert-tuned code automatically, and that's extremely difficult from a computer science point of view. But I was fortunate that my project was successful after 4 years of research, and I came out of Intel to build that, because something like Bodo didn't exist. Datasets are growing. Moore's Law is slowing. So single core performance isn't improving as much anymore, since really 2004, 2005. Things like Hadoop and Spark came along to help developers work with larger datasets.
But from our point of view, those systems are suboptimal because they are not following parallel computing principles. And we think the Bodo solution can be orders of magnitude more efficient and faster than these other systems. So because of that, I joined forces with my co-founder, Behzad Nasre. He has an interesting story too, because at that time, he wasn't at Intel anymore, but he had had a career at Intel. And along the way, he was involved in Intel's very large investment in Hadoop and Cloudera, which went sideways. So he knew all the challenges of big data. And, you know, Spark replaced Hadoop, but he saw the same complaints that he heard about Hadoop being raised for Spark as well. So he really believed that there should be a better solution for processing large datasets.
[00:05:50] Unknown:
And Bodo, we believe, is the right answer. You mentioned that you're focusing on Python as the target for this compiler oriented approach to being able to parallelize and optimize the actual computation. And I'm wondering what it is about the language and the toolchain that drew your focus as that being the logical target for your initial implementation.
[00:06:15] Unknown:
Actually, the research project didn't start in Python. At Intel Labs, we started with Julia, and we were working with the MIT team behind Julia. And things were going well. But Julia as a language didn't really take off, and Python took over. So from a practical point of view, if you look at the data from all different sources, you know, on Stack Overflow, some of the other surveys and other data out there, Python is the dominant programming language in the world in general, on a lot of metrics, and in particular for data applications. So that's a practical choice, and we believe Python is a great language for processing data. APIs like Pandas and NumPy are very mature, and they present the right level of abstraction for developers and programmers to work with data.
From our point of view, our core competency is building this technology, which we call an inferential compiler, and it could be done for other languages as well. But today, Python and SQL are our targets. We have a SQL solution as well, which is in beta and will come out very soon. We think Python and SQL are the answer for a lot of the data problems in practice today. And for a long time, they will be dominant.
[00:07:35] Unknown:
In terms of the Python ecosystem specifically, there are some projects that are focusing on trying to be able to handle the scale out computation of these workloads that are inherently single core because of the nature of the CPython virtual machine. And so there are projects such as Ray and Dask that aim to act as a sort of horizontal scale out with a scheduler for being able to chunk up the computation and do sort of a map reduce style workflow. And there are other projects that are focused on compilation and efficiency and speed gains of the Python code, such as Numba. And then there are things like the, you know, NumPy library that actually have a lot of optimized C and Fortran code for being able to handle some of these vectorized arrays for the data processing.
And I'm wondering what you see as the shortcomings of those different elements of the Python ecosystem and some of the benefits that you're looking to provide with Bodo as it compares to what exists in that community?
[00:08:34] Unknown:
So let's dig into what this technology really is and how it's different than all the other options that you mentioned. So Bodo is a new type of compiler we call an inferential compiler. In a regular compiler, the code is maybe written in C or C++, and it's translated to binary. That's it. Maybe some optimizations. But in Bodo, the compiler really understands the program at a very high level to automatically parallelize and optimize it. It's as if an HPC expert rewrote the code for you in MPI and C++, except that it's done automatically and transparently for you. So we believe this is the holy grail of scaling compute, where the code is simple Python with standard APIs. It's NumPy. It's not Pandas-like, it's actual Pandas. You know, Bodo provides a JIT decorator for functions. If you remove the decorators or disable them, it's standard Python and Pandas. Your code works.
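To make that decorator model concrete, here is a minimal sketch (the file path and column names are hypothetical): a function marked with Bodo's JIT decorator is compiled and parallelized, and deleting the decorator leaves ordinary Pandas code behind.

```python
import bodo
import pandas as pd

@bodo.jit  # remove this decorator and the function is plain Python/Pandas
def store_totals(path):
    # Standard Pandas calls; Bodo compiles and parallelizes them
    df = pd.read_parquet(path)
    df = df[df["amount"] > 100.0]
    return df.groupby("store")["amount"].sum()

print(store_totals("sales.parquet"))
```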
So the code is simple Python with Pandas and NumPy and scikit-learn, but the execution is HPC. You know, it's native code that's generated. We actually use Numba and LLVM under the hood, plus some other C++ libraries that we have developed. And it's as if you rewrote the code, you know, using a lot of performance and HPC optimizations. So that's where we are. In terms of comparison, the best comparison is, I would start with Spark and PySpark and these other tools. We can discuss how they are similar to Spark and how, in a sense, they are built in the same mold as Spark. So without Bodo, the way you would scale computation is you create a library which exposes some high level APIs. It started with the MapReduce library, then MapReduce inside Hadoop, then Spark, and now Dask and Ray and others. So you create a library.
The program is actually running on a single core. So it's not running in parallel. But when it reaches those API calls, the library breaks up the call into tasks. Spark, let's say, goes across the cluster on the executors, runs those tasks, and comes back to the driver. So it's a driver-executor paradigm using task scheduling. You pay a lot of overhead per task. We call this waves of tiny tasks. And with this approach, you're essentially trying to approximate parallelism, because the program was running on a single core, on a driver. It wasn't really parallel. And if you have a library and you don't have a compiler, that's all you can do. Because without looking at the whole program, how do you know it's safe to parallelize it? And how would you even parallelize it ahead of time? The language is sequential.
Right? What Bodo does is that the compiler sees the code end to end and creates actual parallel code, which, architecturally, we call single program multiple data, SPMD, and it follows parallel computing principles. And we believe this is the scientifically correct way to run parallel compute. We use MPI under the hood for message passing and parallelism. So each process in the MPI world, called a rank, owns a chunk of the data and doesn't need to go to a master. There is no scheduling. Right? So each process owns a chunk of the data, knows what to do with it and what to communicate and when to communicate according to the parallel algorithm, and there are no extra inefficiencies.
So these processes run in a bulk synchronous manner. And when they get to a communication, it's called a collective operation, and they do it very efficiently. So as an example of the kind of communication that is typically problematic, and how Bodo solves it: shuffling data is a very common operation in data processing. If you are joining tables, doing a group by, things like that, keys with the same value need to end up on the same process, so you shuffle the data based on keys, and systems like Spark do it under the hood. But because the MapReduce architecture with this driver-executor model is not built for parallelism, this communication is very difficult.
Asynchronous tasks cannot really directly exchange data, so they write shuffle files, and they put a lot of pressure on the OS and network and JVM and all the machinery to write all of these files and then schedule reducers to combine them. But in this parallel computing world, the processes get to the communication phase at the same time, in what is called an all-to-all collective, and they exchange messages directly. And in a lot of settings, optimized cloud settings, for example, they actually write into each other's memory, remote direct memory access, RDMA, and it's much more efficient than this master or driver-executor paradigm with Spark.
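To make the SPMD pattern concrete, here is an illustrative hand-written sketch of a key shuffle using an all-to-all collective with mpi4py. This is not Bodo's internal code (which is generated automatically); the data and the key-to-rank mapping are hypothetical.

```python
# Run with: mpiexec -n 4 python shuffle_sketch.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank owns a chunk of (key, value) rows; rows with the same key
# must end up on the same rank, as in a distributed groupby or join.
local_rows = [(k, rank * 100 + k) for k in range(8)]

# Bucket rows by destination rank: key % size decides the owner.
outgoing = [[] for _ in range(size)]
for key, value in local_rows:
    outgoing[key % size].append((key, value))

# All-to-all collective: every rank exchanges buckets with every other
# rank directly, with no driver, scheduler, or shuffle files in between.
incoming = comm.alltoall(outgoing)
my_rows = [row for bucket in incoming for row in bucket]
print(f"rank {rank} now owns keys {sorted({k for k, _ in my_rows})}")
```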
So there's potentially orders of magnitude improvement just because of the parallel architecture, depending on the workload. That's the key difference of this inferential compiler, parallel computing paradigm versus the previous approach, which was a driver-executor library. So the new Python libraries, Dask and Ray, are really built in the old approach: they are libraries that run on a single core, with some scheduler that breaks up the compute into tasks, goes across the cluster, and comes back. Except that in our testing, because they use Python and Python is slower than Java, they're slower than Spark. So, you know, they offer some simplicity compared to Spark, but they are slower. That's their trade off. However, speed really matters, especially in the cloud. If you're a 100 times slower than efficient code, you are paying a 100 times more money.
So your $10,000 bill becomes $1,000,000, and I don't think any organization is okay with that. So there are various aspects. If I were to summarize, the code in Bodo stays standard Python, so you don't have vendor lock-in. You can take it anywhere, and you can disable it and run it in regular Python. But these other approaches, Spark, Dask, Ray, expose new APIs that the developer needs to learn and adapt to, and you get locked in because you rewrote your code for Dask. We have had issues with some customers where, hey, I've already written my code in Dask. I can't take it anywhere else. Even though, you know, on the surface, it tries to expose Pandas-like APIs, in real applications and complicated settings, their code becomes really Dask code, not really Pandas code. So you need to learn the new APIs, and you need to understand the parallelism.
Because in these approaches, you need to know what data has to be distributed and what data should be replicated on all the cores. And if you don't do that properly, you have incorrect results. Sometimes we have had customers where, hey, I used Dask, but my results are incorrect. So Bodo automatically parallelizes code, which means that by looking at the whole program and understanding it, it will decide: okay, this data structure is kind of on the map side of MapReduce, if you want to go with that terminology, and we have to distribute it. But this other data structure, which may be a data frame, a parallelizable thing, is on the reduce side, and, semantically, it's like a scalar that has to be replicated on all the cores. So Bodo does the parallelism automatically for you, versus these others where you need to be really careful using them. Otherwise, you will get incorrect results. And if the program, structurally, has some issues, Bodo will tell you: hey, this cannot be parallelized. You have to fix this issue. So this compiler intelligence helps in that aspect as well. So on the simplicity side, there are all these implications.
On the performance side, you can't really beat an efficient parallel computing architecture. You can build all the libraries and optimize for years. You know, billions of dollars have been spent on Spark, but at the same time, you see all these academic papers asking, why is Spark so much slower than MPI? Everyone knows that, but there's nothing they can do because it's an architectural issue. So you can't really ignore those efficiency issues. And by the way, beyond cost, for a lot of organizations, we have a Fortune 10 kind of large tech company as our largest customer, and they care about being green as well, reducing the carbon footprint in their data centers. And with Bodo, they see that these orders of magnitude improvements translate into efficiency on the carbon footprint as well, which is related to cost as well. So these are some of the benefits, but something that we didn't expect at all is the benefit of Bodo for reliability of the system. The marketing line says that MPI is not reliable and Spark is reliable.
But in practice, because of all these layers of these libraries, there are so many failure points. You know, your OS runs out of open files because you have too many shuffle files, or your JVM runs out of some resource, or the library itself makes a mistake, or there are too many tasks, and they fail all the time. But with Bodo, we have had deployments where, over a year, there was no failure, because there's nothing to fail. It's native code running efficiently. There are none of these layers. There are no shuffle files. And you fail only if your infrastructure goes down or some hardware fails, which is very rare these days. So all those reliability issues of these libraries go away, and the infrastructure is much more streamlined in general as a result.
So that's on the parallel compute side. But you mentioned other tools for accelerating Python. Yes, there are approaches for accelerating Python, and there have been a lot of workarounds, but they may not be comparable, because we are talking about data processing. A lot of them target sequential Python in general, sometimes with smaller improvements of 10, 20%, which is not what we are talking about. We are talking about thousands of times faster execution of Python code here, targeted at data processing, not general Python code. That's one aspect. And the other aspect is, if there are tools that make sequential Python faster, we can use them as well, and we do use them. Bodo is based on Numba, and we contribute to Numba as well. We are developers of Numba, and that's applicable to the kind of thing we are doing.
Other projects may not be as relevant based on the target use case.
[00:20:31] Unknown:
Then on the other side, there's also the HPC aspect, which you've touched on as far as, you know, MPI and being able to parallelize these algorithms and spread them out across multiple cores and multiple instances. And most of the time, when I think about HPC, I think about, you know, some massive supercomputer that has some number of thousands or millions of cores with optimized network interconnects, where everything is operating as a single system, versus when you're running in the cloud, where you're talking about scaling out multiple instances and the only real communication synchronization is over the network. And I'm wondering if you could just discuss some of the current state of the art of HPC and some of the trade offs that exist between the, you know, supercomputer environment and scale-out distributed systems that are running on heterogeneous cloud architectures?
[00:21:23] Unknown:
So, traditionally, the HPC world is targeting scientific computing applications, which are, a lot of the time, physical simulations, for example, weather forecasting or climate forecasting or some sort of fluid dynamics, these sorts of things. And they have been writing low level MPI Fortran code in, maybe, big national laboratories and large universities to solve those problems. In the enterprise world, the time frames don't really fit writing low level code. You know, you have a data problem. You need to solve it today. Your marketing and sales team wants the data right now, not 3 months from now. So it's not really practical to go through all that complexity of the code.
Because of those reasons, anytime you say the words high performance computing: you say high performance, that's good. You say compute, that's fine. But if you put them together, a lot of people get scared. From users to enterprises to investors, they just don't like these 3 words together, because it has never worked for the enterprise. It's just too complicated. Bodo, we believe, for the first time, democratizes high performance computing for regular developers, because you don't need to write low level code. Your code is Python, high level Python. You don't even need to learn new APIs. So we believe we are bringing HPC to the enterprise for the first time. And as a result, we believe we are democratizing machine learning and AI, because guess what? The big tech companies are actually using HPC for their AI. They're actually using MPI and all the right techniques internally, and the rest of the world should be able to take advantage of the same techniques. So that's on the compute side and the programmability side. From an algorithmic point of view, the data applications are really parallel compute applications and fit the HPC paradigm very well. So if a problem has been solved decades ago, why not just use it? I mentioned shuffle. You know, exchanging data, this so-called all-to-all operation, was solved, like, 30, 40 years ago. And in the MPI implementation, everyone has optimized it, including cloud providers, by the way. Some of the largest supercomputers in the world are in the cloud, and they have all the benchmarks and everything. So why not reuse all the decades of development in that ecosystem for the same problems? All these data problems are essentially parallel algorithms, some subset of scientific computing, that need to be solved.
So the algorithms are HPC, and there is no distinction. Any other solution is a workaround for them. On the infrastructure side and the machine side, yes, HPC machines, historically, maybe 20, 30 years ago, were different. But today, the cloud really provides the same high performance compute capabilities. Let's take AWS as an example. The instances are really high performance Intel processors that are used in supercomputers too, so there is no difference. For the network between them, today, AWS provides an option called Elastic Fabric Adapter, EFA, that provides RDMA and high performance networking between the nodes. And cloud providers, all of them, provide placement of these nodes together.
So you have a cluster. You can bring up a cluster that's an HPC cluster in the cloud. There is no hardware difference, and you can take advantage of the same sort of performance and scalability in your cloud environment as well. And we have benchmarks on AWS. Recently, we actually published a customer benchmark on 4,500 cores on AWS, scaling very well. So you have a supercomputer; that's not a small number of cores that you can use to scale your application. So developers, I think, can think bigger now and think about larger problems to solve with the availability of cloud services with these kinds of capabilities.
[00:25:56] Unknown:
And then in terms of the technical implementation of the Bodo platform, can you dig a bit more into some of the compiler aspects and some of the ways that the inferential compiler might be different than something like Numba or, you know, just the GCC compiler that a lot of people have encountered?
[00:26:14] Unknown:
So Bodo is what we call an inferential compiler, which means that it understands operations at a very high level. Now your GCC compiler, if you are writing a C loop to process data, doesn't know what you are doing. It sees loops and translates them to binary code. And, yes, there are low level optimizations to allocate registers and get rid of unused scalar variables and things like that, but at a high level, it doesn't know what's going on in the code. With Bodo, the compiler starts at a very high level, which means that the compiler understands data frames, understands data frame joins, group bys, user defined functions with data frame apply transformations.
So it understands all of those things, and it optimizes the program and parallelizes the program based on those inferred properties. So it's as if the compiler has the same knowledge as the programmer, in some way. And this is made possible because the APIs in Pandas and NumPy are just so mature that they keep the semantics in place. The data frames and joins, all of those aspects are already in the program. They are not lost in loops. The other compilers operate at the loop level, but Bodo operates at the high level API level. So that's the major difference. So you are not just getting a maybe slightly better sequential program out. You're getting a much better program, both on the sequential side, running on a single core, and you can run it on as many cores as you want. That's the key difference.
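As a hedged sketch of the kind of program this works on (the file paths and column names are hypothetical), every call below is a named data frame operation whose semantics an inferential compiler can see and pick a parallel algorithm for, rather than an opaque loop:

```python
import bodo
import pandas as pd

@bodo.jit
def regional_totals(orders_path, customers_path):
    # The compiler sees a read, a join, a group by, and an apply by name,
    # so it can parallelize each of them instead of guessing from loops.
    orders = pd.read_parquet(orders_path)
    customers = pd.read_parquet(customers_path)
    joined = orders.merge(customers, on="customer_id")
    totals = joined.groupby("region")["amount"].sum()
    return totals.apply(lambda x: x * 1.08)  # a user-defined adjustment
```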
So on the platform side, the difference in the platform implementation is that, on the compute side, as I mentioned, we are taking a parallel computing approach, which is this single program, multiple data architecture, with processes that can work independently. The libraries, from Spark to Dask and Ray, are taking a distributed computing approach. In distributed computing, you are assuming you have a heterogeneous environment where the client is, let's say, a mobile phone that goes over the Internet and connects to some server, and you have to deal with all the issues of that connection, with TCP and so on and so forth. That's the underlying assumption of distributed computing.
And, you know, the distance and reliability issues of that connection and the heterogeneity are baked in. But in parallel computing, the assumption is that cores are close together, and they have a nice, high performance, reliable connection. And also that they are homogeneous, meaning that the compute tasks that are running are the same, with maybe slight load balancing issues and things like that, but they are the same. So based on that, the communication between them is much more optimized, and the protocol doesn't have to be TCP. TCP was designed for the Internet, for long distances, not for a cloud with a placement group and nodes together with a very nice connection, a 100 Gb network. TCP is not great for those.
So the underlying assumptions are very different for parallel compute. Our SaaS service, for example, is a distributed computing environment, where clusters are one part of it. So there is a connection. The notebooks, jobs, management, governance, all sorts of things need distributed computing. But the compute layer itself, the way you process the data, has to be parallel computing, and that's a principle that we follow in the Bodo platform.
[00:30:17] Unknown:
There are a number of interesting things to dig into there, but one of the things that you mentioned earlier is the retry capabilities, where if an individual task within the parallel computation fails, I'm wondering how you address being able to identify that and schedule a retry, or what the process is in this parallel compute environment for being able to handle these intermittent or incidental failures. Or if, for instance, one of the nodes in the parallel cluster happens to have a hardware failure in the middle of a task execution, what the solution is for being able to be resilient to those types of errors.
[00:30:56] Unknown:
Definitely. So with Bodo, the software failures go away, and those are much more frequent, because you're running bare metal binaries. So this changes the equation for reliability quite a bit. The main approach to reliability for parallel compute is checkpointing. You go back to some sort of checkpoint, because there's communication between processes, and you have to store the last valid state. You have the same option here as well. That doesn't change. What changes is that you don't need to use this very often. And my recommendation for a lot of users has been: for the SLA that you care about for your application, you don't need to worry about it, because hardware failures are very rare. You know, for a server, sometimes I've seen numbers like 10 to 100 years for each individual server. And in an enterprise, you know, the cluster is typically up to 32 nodes.
You know, it's just not very plausible for failures to happen. However, if you go to 1,000-node scale, statistics take over and you do really need to take checkpoints. Or if your application is really long running, if you are doing a deep learning training workload on a large number of nodes and it's running for a long time, for weeks, then you definitely need to checkpoint. So these trade offs have to be considered. But for a lot of practical applications, the fact that the code is running on bare metal makes it much simpler. And the fact that the code is faster.
So if you run for a fraction of the time, the chances of something failing during that time are lower. So these are the benefits. But in terms of actually orchestrating the restart, Bodo has a clean engine design. It's very simple Python processes running; we are not bundling any middleware in the core engine. With Spark and others, there are task schedulers. So not only do they try to schedule the actual compute tasks, they want to schedule jobs and bundle a lot of layers together, but we believe that's not the right choice. The compute parallelism has to be separate from the other system aspects.
So your Airflow or Kubernetes infrastructure has to kind of take care of all the middleware, including restarts and things like that. They do a great job, and we are not trying to replace those.
[00:33:47] Unknown:
Struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world's first end to end, fully automated data observability platform. In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing the time to detection and resolution from weeks or days to just minutes.
Start trusting your data with Monte Carlo today. Visit dataengineeringpodcast.com/impact today to save your spot at Impact, the data observability summit, a half day virtual event featuring the first U.S. Chief Data Scientist, the founder of the data mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 people who RSVP will be entered to win an Oculus Quest 2. And then for the actual distribution of the data to the different tasks that are executing across this parallel cluster, I'm wondering what you are building as far as some of the integration or optimization layers for people to be able to fetch data from whatever storage layer they might be using, whether that's files in S3 or a data warehouse or, you know, multiple different databases, and then being able to chunk that out and distribute it across the parallel execution, and just some of the planning that needs to go into the application design to be able to take advantage of some of the efficiencies that Bodo is trying to build in for the parallelization and data distribution aspects?
[00:35:33] Unknown:
That's a great point. Data has to come to the cluster efficiently for efficient compute to happen. If your data takes forever to load, then it doesn't matter how fast the compute is. For Bodo, we are completely storage agnostic, and we want to support all different kinds of storage, from object storage and data lakes, to the new lakehouse paradigm with Iceberg, to data warehouses. So we have optimized connectors for all of these options. And the way they work is that different cores know what chunk of data they need to load, and they load it as efficiently as possible.
On S3, of course, there's a lot of parallelism, and it's a very scalable system, with typically Parquet files and Parquet pieces that we load efficiently. For data warehouses, some of them today may not be as easy, so we are building connectors so that the programming API is still Pandas and nothing changes for the user. But under the hood, all the complexity is managed by Bodo. And in some cases, we are actually partnering with Snowflake, and they are providing features for us to be able to load data from Snowflake more efficiently than a regular SQL connector can provide. So, for example, if you load up to a terabyte of data, each core can kind of run a query to the warehouse and say, give me my chunk of data. Up to a terabyte and a few hundred cores, that scales.
But beyond that, it can overwhelm the data warehouse. So we are co-designing this connector with Snowflake so it can load massive amounts of data, maybe up to petabytes, without some of those scalability issues. So, in general, I think Bodo enables a higher scale for all of these storage systems, and they need to adapt and provide abstractions to load larger amounts of data than was previously possible for these systems. So that's one aspect. The other aspect is that some operations, such as filters, need to be pushed down so you load less data from the storage system, which we are working on as well. And we have efficient filter pushdown on partitioned Parquet files, for example.
And for data warehouses and SQL stores, we are developing those as well. So that's a huge area by itself in terms of loading data and distributing it efficiently to all these cores, as we are talking about thousands of cores, potentially, in this case.
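As a hedged illustration of the filter pushdown idea (the bucket path, partition column, and column names are hypothetical), when a filter directly follows the read inside a compiled function, the compiler can fold it into the scan so only matching partitions are ever read:

```python
import bodo
import pandas as pd

@bodo.jit
def recent_event_counts(path):
    # Dataset assumed to be Parquet partitioned by "year" on S3. Because
    # the whole function is compiled, the filter below can be pushed into
    # the read so non-matching partitions are never loaded.
    df = pd.read_parquet(path)
    df = df[df["year"] == 2021]
    return df.groupby("event_type").size()

counts = recent_event_counts("s3://my-bucket/events")
```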
[00:38:22] Unknown:
Another aspect of what you were discussing is being able to do things like building and training neural networks and being able to optimize the execution of that, where a lot of the focus recently has been on leveraging GPUs for being able to run across these, you know, hundreds and thousands of cores, but with a smaller memory footprint per core. And then other approaches are things like being able to do automatic network pruning to reduce the number of nodes in the network and reduce the reliance on GPUs. And I'm wondering what you've seen as some of the potential for this parallel compute approach to be able to efficiently train these neural networks without necessarily having to rely on GPUs, which have become relatively scarce and difficult to acquire in the, you know, times that we're in now with these issues of supply chain management?
[00:39:12] Unknown:
Definitely. So with Bodo, we are making it easy to scale to abundant CPU resources in the cloud. So you can scale to thousands of cores easily, and, you know, you just have a knob for the number of cores. It works on one core all the way to 10,000 cores or more. We have tests at those scales as well. So you can use Bodo whether it's for deep learning or other machine learning algorithms or just data processing, at any scale that you want. One key aspect of those deep learning systems is that they provide HPC features. Otherwise, training neural networks is so expensive that without HPC, it's gonna take forever, and it's not practical. So Bodo, being MPI and in the HPC ecosystem, can integrate with those tools and systems very well.
So you can load and process your data with Bodo, and we have some features for managing those GPUs if you need to, integrating with all the other GPU based tools that are built for this purpose. So we naturally integrate with other tools in this area. For things like Spark, it took years to build features to integrate with MPI and other systems out there for efficient processing of neural networks and other machine learning algorithms.
[00:40:36] Unknown:
As far as the actual Python aspects of being able to infer the intent of the program and be able to recreate this optimized algorithm running in a native binary, what are some of the aspects of the language and its ecosystem that have made that a complicated endeavor, and some of the potential software architecture and programming patterns that you might run into that could act as bottlenecks or optimization cliffs for being able to build out these algorithms automatically?
[00:41:08] Unknown:
So in general, Python is a great language, and its adoption by everybody from, kind of, school kids all the way to experts in various areas shows how easy it is to use Python. This simplicity, a lot of it comes from the high level abstractions of Pandas and others, and that has made Bodo possible. Without those high level abstractions, it wouldn't be possible to build Bodo. The main aspect for Bodo is data types: the compiler needs to know the data types for all variables to be able to understand the program and do all the other analysis and inference and so on and so forth.
There are some corner cases where this type inference is not as easy, which makes it more difficult for the compiler. And we have built a lot of features to infer those situations automatically, because we don't want the user to change code. But sometimes, in some corner cases, we have to throw an error: hey, please change this line, because the compiler doesn't understand the data type for it. So the concept of data types is very important. But, fortunately, there is a lot of consciousness and a large movement in the Python community to annotate types and make types a key part of Python programming, and to gradually add types, because data types are very important for testing and validation of programs as well. By the way, what Bodo provides at a high level to the Python community, we believe, is that we want to make Python the default platform for building data applications.
Today, in the enterprise, code is written in Python and Pandas by experts, but then some other team rewrites it for performance in something else, whether it's Spark, or even completely changing the language to Scala or SQL. We don't want this rewrite to happen, because it's expensive, takes a lot of time, and validating this rewritten code is very difficult. And by the way, even then, that code is not HPC yet and is not as efficient. So we want the Python code to go from development and prototyping all the way to production without any of these extra steps.
So one aspect is obviously performance and scalability. The other aspect is testing and checking and type checking ahead of time. With the Bodo compiler, because it's end to end compilation, all the data types are checked. If there's some issue, you will see it at testing and compilation time, not in a long running production job failing, you know, during production. So that's, I think, key, and I think it is improving in the Python community. There's this consciousness of the fact that the APIs have to be typeable. There are some corner cases that need to be ironed out, but it's not a large issue. I think a lot of the APIs and abstractions are already in that space.
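As an illustrative (hypothetical) example of the kind of corner case that defeats static type inference, a variable whose type depends on a runtime branch cannot be given a single static type, so a type-inferring compiler like Bodo's has to reject the function or ask for a rewrite (the exact error reported may vary):

```python
import bodo

@bodo.jit
def describe(n):
    # x is an int on one branch and a string on the other, so no single
    # static type exists for it; keeping one type per variable fixes this.
    if n > 0:
        x = n
    else:
        x = "empty"
    return x
```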
It's just minor tweaking in some corners, and we are giving feedback to various projects such as Pandas to do that. So that's one aspect. In terms of software architectures, unfortunately, because this driver-executor model has become the default paradigm, a lot of infrastructure projects have adopted it, and there's a lot of inefficiency in the data platform architecture that comes with it. So it's hard for a Bodo based solution to plug in and get at all the inefficiencies. We have ways to integrate Bodo almost anywhere we have tried. But to take full advantage of Bodo's capabilities, we believe the programmers and data engineers really need to think about the fact that you don't have those driver-executor limitations, and you can really think parallel.
And you design your architecture around a truly parallel system. So that's a way of thinking that we think needs to advance for taking full advantage of Bodo in the data infrastructure.
[00:45:30] Unknown:
As far as the assumptions and expectations that you had when you first began working on this problem and exploring how to turn it into a viable business model, what were some of the assumptions that have been challenged or had to be reevaluated as you have progressed through the implementation of the technology, building the business around it, and working with customers?
[00:45:54] Unknown:
That's a great question. Initially, when we started the early research, it was all about machine learning and more complicated machine learning algorithms. But as we made progress, when we met our first customer, a large tech company, we really thought that the data processing and cleaning, the earlier phases of the data pipeline, were ironed out, and that it's all about complicated algorithms at the end. But we realized that it's all about the early stages of the pipeline: ETL, data prep, data processing, cleaning, and then later, featurization and things like that. The algorithm at the end is the easy part. It was really unexpected for us. So the first POC we did actually was: hey, we have this complicated pipeline. It's SQL and Python. Can you do it in pure, simple Python code? And that's what we delivered.
And, you know, with standard Pandas doing all that data processing. Along the way, we realized that we had thought the SQL problem and the data processing problem were solved. There are SQL systems. They can solve the problem. Yes, maybe Python is not there, but SQL is there. But we realized that there's a lot of pain working with SQL systems as well. Too many knobs to tune, difficult to deploy, doesn't give you the right performance for data processing. That's why we created BodoSQL, enabling those developers who prefer SQL over Python. So there's this split. We don't think we have to force anybody to choose Python or SQL. They can choose the language they want, but they need to be able to take advantage of the parallel computing capabilities of Bodo. In terms of the implementation, it was very unexpected for us that Bodo would work so well out of the box for SQL code as well, because we haven't added a lot of the SQL type database optimizations.
We are adding them one by one, it's on the roadmap, and it's making good progress there as well. But we realized the parallel computing aspect of it is so important for SQL type data transformations as well. We even realized that some of the database-like optimizations hurt performance if the parallel computing aspect is done right and is efficient. So there are a lot of unexpected things that we are learning along the way in this process. And we are good students. We are learning as we go from our customers and users and the community in general. In terms of working with your customers, because this is
[00:48:41] Unknown:
a different approach than a lot of people have been dealing with, particularly in the data ecosystem where things like Spark, as you mentioned, and data warehouses and, you know, these sophisticated data pipelining tools are the norm, I'm wondering what you have seen in some of the education that you've had to do and some of the challenges that you faced in working with customers and helping them to understand the specifics of the approach that you're offering, some of the benefits that it can provide, and how it might integrate with their existing data operations and data infrastructure?
[00:49:15] Unknown:
The biggest challenge, I would say, has been the mindset when you say the word Python. When you say the word Python, a lot of customers think about small data. They just automatically go there, because Python doesn't scale. We have to really explain: hey, we are not here for the small datasets. Give us the big data. Give us the harder problems at larger scale. We have solved the scalability problem of Python, and you can use Python in production. Python can be scalable and efficient. You don't have to write your code in Spark and Scala and Java and C++. That has been the biggest, I would say, educational gap that we needed to tackle.
The fact that Python is now a first class language, and you can use it in production for those kinds of harder problems. Yes, Python is in production for a lot of things, but the mindset is still in the space of: hey, I have to rewrite my Python if the problem is too big or too hard. So that has been the biggest educational gap to tackle. And as I mentioned, on the software architecture side, really thinking parallel and thinking bigger. Hey, now you can scale to more cores. With Spark, you were maybe doing a 100 cores and it was falling over, so you couldn't think bigger. You can run it on a 1,000 cores now, and you will get, you know, efficient parallelism, and you can run on bigger datasets.
So completely opposite problems, actually, in terms of mindset to solve.
[00:50:51] Unknown:
As you have been working with customers and working on the technology, what are some of the most interesting or innovative or unexpected ways that you've seen Bodo used?
[00:51:00] Unknown:
You know, I'm actually learning from data engineers at our customers on a daily basis, because they solve so many interesting problems in unexpected ways all the time. When, you know, some of these good data engineers understand Bodo, we saw that they designed the whole application differently. Some of the applications are actually very interesting. For example, visualizing large datasets, a kind of interactive visualization application to, let's say, visualize a supply chain. A hard problem. When the data engineer designing it saw Bodo, they said, okay, I don't need this component, that component, that component. And I was surprised. Hey, are you sure you don't want the data warehouse?
Are you sure you don't want this kind of graph database? Are you sure you don't want this other component? So the fact that, given Bodo's power, they could simplify their architecture is very impressive to me. And the fact that they're replacing so many different things. You know, we started thinking in terms of replacing Spark and the fact that Spark has inefficiencies, but data engineers are figuring out ways of using Bodo for a lot of different applications that we didn't expect, including SQL deployments. I really didn't think someone would try to replace their, I don't know, complex deployment with all the scheduling and interactive things with Bodo, but they benchmark it: hey, this is faster.
It doesn't have all those knobs. Let's try to replace that. And we are really, as a company, catching up on developing the tools and features around Bodo to make these interesting applications possible. So we are playing a catch-up game with our users, who understand the power of Bodo more than we do.
[00:52:54] Unknown:
And in your experience of building the technology and building the company, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:53:03] Unknown:
Where do I start? A lot of lessons in terms of the dynamics in the enterprise and the fact that developer needs are sometimes being neglected. This gap between simplicity and scaling may not be very well understood in the enterprise. And we need to communicate this developer need through the management chain and explain why developers really matter. Their productivity translates to your business value, so you need to care about their productivity. And, also, by the way, you are paying a lot of money for your cloud infrastructure, or your data center if you are running things on prem, that you could save if you build applications the right way, the way your developers tell you they need to be built. So developers need to have a bigger voice in terms of the needs of the enterprise, and their productivity has to be prioritized more, I think. We have a good trend going on in terms of developers having more say, but I think there needs to be even more attention on developers in the enterprise.
So that has been a challenge, kind of explaining all these various aspects to various personas in the enterprise of what the solution should be and communicating that. What makes this even more complicated is the fact that there are so many different tools, so many different options out there, that a lot of people just throw their hands up. Hey, I don't know. I'll take whatever AWS gives me, or I'll take whatever the Azure reps tell me to use. It's just so overwhelming. So getting out of that mindset, and, hey, you can really have a simple architecture and pick the right tools, and you shouldn't give up, is also another challenge
[00:55:05] Unknown:
that we see along the way. For people who are interested in the potential for Bodo to be able to accelerate their code and simplify their systems architecture, what are some of the cases where Bodo might be the wrong choice, or instances where you've spoken with a customer and realized that the problem that they were trying to solve is not one that Bodo is well suited for? Definitely.
[00:55:27] Unknown:
So we have an inferential compiler for Python, but, really, the target is data processing applications. So when people hear the word Python, you know, we are not targeting something like a web server written in Python or some other things. That's clearly not a use case for Bodo. The other issue that I also mentioned was the fact that when you say Python, a lot of customers think small data and maybe some of those data science applications. We are saying, no, we are here for big data. And what I tell customers is: hey, if Pandas works and the data is small, why change that?
With Bodo, you would need to learn something new in your environment; if regular Pandas works, just run Pandas. Bodo is for really large datasets, and we are here to simplify your work, not add extra work for no reason. If regular Pandas on 10 megabyte datasets works, why would you add something new just because it's shiny? Right? So that's another case we have seen. The other cases have been really trying to replace too many things with Bodo when we don't have some of the features yet. So we ask customers: hey, don't throw away your SQL deployment yet. We need to build some of the features for you first. So we ask them to be a little bit more patient with us on some of the use cases that need more tools around Bodo. And, in a lot of cases, it's just a matter of providing some examples.
So that's something we are working on. But, clearly, if the application is not large scale data processing and doesn't need the power of Bodo, that's where we tell customers: hey, this is not the right fit for you. As you continue to work on Bodo and work with your customers and explore the challenges
[00:57:16] Unknown:
that data scientists and data engineers and data professionals are facing. What are some of the things that you have planned for the near to medium term future? We have a lot of interesting things coming in the pipe.
[00:57:28] Unknown:
From the platform and cloud solution point of view, we are building what I call our platform 2.0, and we are really rethinking the experience in the cloud SaaS space for notebooks. Today, the way parallel notebooks are done, the notebook is actually running on a driver, and the parallelism happens through these APIs of Spark or something else. But we are really rethinking the Jupyter notebook experience for Python running in parallel. We are working with the IPython Parallel project on some of the features in open source that would enable our platform, and at the same time developing our platform to create this new truly parallel notebook experience for developers.
I'm pretty excited about that feature, and our new version of the platform coming out soon in general, which is, I think, very exciting. The other aspect is the BodoSQL solution that I'm very excited about, the first SQL solution that truly integrates with Python. Bodo can actually optimize across SQL and Python. Bodo can actually type check and throw errors if there are mistakes in the SQL code. You know, you pass a data frame to the SQL code and some column doesn't exist, or you get output from SQL code and some column doesn't exist. Today, with the existing SQL solutions, you run it in production or on a large dataset, and near the end, you may get an error. With Bodo, at compilation time, you can actually get those errors very easily and have more correct code. So that's an area that I'm excited about.
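As a hedged sketch of what that SQL and Python integration can look like (based on BodoSQL's context API; the table and column names are hypothetical and the exact API may differ by version), the SQL string and the surrounding Pandas code compile together, so a misspelled column fails at compile time rather than hours into a production run:

```python
import bodo
import bodosql
import pandas as pd

@bodo.jit
def top_regions(path):
    df = pd.read_parquet(path)
    # Register the data frame as a SQL table; the query below is checked
    # against its schema when the function is compiled.
    bc = bodosql.BodoSQLContext({"sales": df})
    totals = bc.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
    return totals.sort_values("total", ascending=False)
```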
We have a lot of optimizations coming out for Python and Bodo in general that are also interesting, and they challenge a lot of data processing and database ways of thinking from the past, where parallel computing wasn't considered as strongly as Bodo is focusing on it. And in general, we are focusing on user experience, you know, throwing the right errors and all the little things that developers expect from a good tool and platform, to make it as easy to adopt and delightful as possible. So there's a lot going on, and you'll hear a lot more about us hopefully soon.
[00:59:56] Unknown:
Are there any other aspects of the work that you're doing at Bodo or the overall space of HPC and MPI and just optimization of data processing workflows that we didn't discuss yet that you'd like to cover before we close out the show? I think, in general,
[01:00:11] Unknown:
having a parallel computing mindset in the enterprise is very important, to kind of think differently and think bigger. And the kinds of problems that can be solved can be really expanded with Bodo. I think that's an important aspect for enterprises to consider. In terms of technologies in this space, we touched on storage solutions and loading data and distributing data. I think the storage technology developers need to really think about good parallel abstractions to load and store data in a scalable manner on large numbers of cores. I think this area has been neglected, maybe because the previous compute solutions were not scalable enough. So having some sort of, what I call, collective API calls, where a lot of cores can load data at the same time very efficiently, is an area that needs to be considered.
And the tools in general in this space need to really rethink and simplify based on parallel computing architectures. And don't assume, hey, Spark is the only solution out there, and it has all these limitations, and we need to cater to those. No, you can use parallel Python with Bodo and have a simpler and more delightful experience for your users in general. So these are the things I'm looking forward to, and that I think need to be addressed in this area.
[01:01:48] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:02:03] Unknown:
Where do I start? As I mentioned, this parallel computing gap is something that translates into a lot of issues around it, because the compute engine is at the core, and its complexity propagates through the rest of the architecture. So I think there needs to be more thinking about this aspect and about simplifying the architecture. It's really about simplicity. I think there are a lot of solutions out there, but having too many tools can actually make it more complicated for users. So we have to really think about simplifying the environment as much as possible for data engineers, and have ready solutions and ready examples and whatever we can do to reduce the time to value
[01:02:54] Unknown:
for a data project in data teams. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Bodo. It's definitely a very interesting project and an interesting approach to a complicated problem. So I definitely look forward to seeing where it goes, and I definitely appreciate all of the time and energy you're putting into helping to solve some of these complex challenges that exist in the industry. So thank you again for your time, and I hope you enjoy the rest of your day. Thank you very much, Tobias. It was a great conversation.
[01:03:23] Unknown:
Really enjoyed speaking with you.
[01:03:30] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Ehsan Totoni and Bodo
Ehsan's Journey into Data Management
The Productivity Performance Gap in Programming
Founding Bodo and Its Vision
Choosing Python for Bodo's Compiler
Bodo's Inferential Compiler vs. Traditional Compilers
HPC and Cloud Computing
Technical Implementation of Bodo
Handling Failures in Parallel Computing
Data Distribution and Integration with Storage Systems
Training Neural Networks with Bodo
Challenges in Python and Software Architecture
Customer Assumptions and Realizations
Educating Customers and Integration Challenges
Innovative Uses of Bodo
Lessons Learned in Building Bodo
When Bodo is Not the Right Choice
Future Plans for Bodo
Final Thoughts on HPC and Data Processing