Summary
Spark is one of the most well-known frameworks for data processing, whether for batch or streaming, ETL or ML, and at any scale. Because of its popularity it has been deployed on every kind of platform you can think of. In this episode Jean-Yves Stephan shares the work that he is doing at Data Mechanics to make it sing on Kubernetes. He explains how operating in a cloud-native context simplifies some aspects of running the system while complicating others, how it simplifies the development and experimentation cycle, and how you can get a head start using their pre-built Spark container. This is a great conversation for understanding how new ways of operating systems can have broader impacts on how they are being used.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started. The first 25 visitors will receive a Firebolt t-shirt.
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription.
- Your host is Tobias Macey and today I’m interviewing Jean-Yves Stephan about Data Mechanics, a cloud-native Spark platform for data engineers
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what you are building at Data Mechanics and the story behind it?
- What are the operational characteristics of Spark that make it difficult to run in a cloud-optimized environment?
- How do you handle retries, state redistribution, etc. when instances get pre-empted during the middle of a job execution?
- What are some of the tactics that you have found useful when designing jobs to make them more resilient to interruptions?
- What are the customizations that you have had to make to Spark itself?
- What are some of the supporting tools that you have built to allow for running Spark in a Kubernetes environment?
- How is the Data Mechanics platform implemented?
- How have the goals and design of the platform changed or evolved since you first began working on it?
- How does running Spark in a container/Kubernetes environment change the ways that you and your customers think about how and where to use it?
- How does it impact the development workflow for data engineers and data scientists?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while building the Data Mechanics product?
- When is Spark/Data Mechanics the wrong choice?
- What do you have planned for the future of the platform?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Data Mechanics
- Databricks
- Stanford
- Andrew Ng
- Mining Massive Datasets
- Spark
- Kubernetes
- Spot Instances
- Infiniband
- Data Mechanics Spark Container Image
- Delight – Spark monitoring utility
- Terraform
- Blue/Green Deployment
- Spark Operator for Kubernetes
- JupyterHub
- Jupyter Enterprise Gateway
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because the number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all of this collaboration chaos firsthand, and they started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan, that's a-t-l-a-n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l-i-n-o-d-e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Jean-Yves Stephan about Data Mechanics, a cloud-native Spark platform for data engineers. So, Jean-Yves, can you start by introducing yourself?
[00:02:07] Unknown:
Pleasure to be here. So, yeah, I'm Jean-Yves. I'm the cofounder of Data Mechanics. Prior to Data Mechanics, I was a software engineer at Databricks, where I led their Spark infrastructure team. So I've been working with Spark as an infrastructure provider for quite a few years now, and, yeah, I'm pretty passionate about it, so I hope I have some interesting stories to share with your audience.
[00:02:29] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:32] Unknown:
Yeah. So I studied engineering in France, then went to the US, to Stanford. At the time, machine learning was everyone's obsession. I remember a pretty popular machine learning class by Andrew Ng that had 100,000 students registered, but it was actually a separate class that interested me, Mining Massive Datasets, which was my introduction to distributed computing. I found that this area was a great mix of really interesting software engineering problems, algorithms, and architecture problems. And then I had the opportunity to join Databricks as a pretty early software engineer, just, you know, out of college, and that was an amazing experience, and that's how I got involved in that area.
[00:03:14] Unknown:
And so you mentioned that you had that experience of running Spark at Databricks, and now you're running it for other people in your company at Data Mechanics. I'm wondering if you can just start by giving a bit more of an overview about what it is that you're building there at Data Mechanics and some of the story behind what made you decide to set out on your own and run your own business to help provide this as a service to more people. Yeah. Of course. So Data Mechanics is a cloud-native Spark platform for data engineers.
[00:03:42] Unknown:
Our platform is deployed on a Kubernetes cluster that we create and manage for our customers inside their cloud account. So the contract with our users is: they develop Spark code, they submit it, and then we take care of scaling the infrastructure, tuning the configurations, collecting the logs, and making them available in a friendly user interface. We want to give a serverless experience to our Spark users. And, indeed, you know, prior to Data Mechanics, I was leading the Spark infrastructure team at Databricks. Databricks is an enterprise data platform. It's a great tool, and I'm pretty proud of what we built there, but I would say that for data engineers whose main job is to build and maintain large-scale ETL workloads in a stable and cost-effective way, it's not necessarily so great.
It's not so flexible. There are still a lot of tuning hurdles on you. It's hard to make it cost effective too. And then, yeah, I saw that Spark on Kubernetes would give us an opportunity to make Spark really 10x more developer friendly and cost effective. My cofounder, Julien, was a data engineer using Spark on a daily basis. He shared this passion about technical challenges and developer tools, and we always had this crazy dream of starting a company, and that's how we started it. In terms of your experience working at Databricks and running their
[00:05:13] Unknown:
platform, what were some of the lessons that you learned about what to do and also what things to avoid as you're trying to build out this business and, you know, provide Spark as a maintainable service that people are able to scale on their own? Yeah. So first,
[00:05:25] Unknown:
Spark has too many knobs. You have to decide on the memory that you want to give to your containers, how many CPUs, how many executors (so how many containers to run). For Spark configurations, there are hundreds of them, and maybe only, I don't know, a dozen that are really important, and they evolve with Spark versions. Then you have to think about shuffle settings and the number of partitions. So it's a big list, and most Spark users do not have a PhD in Spark. They just want to run their pipeline or their notebook and get some answers. So when they run into an issue, their typical answer is to throw more machines at the problem, hoping it will help, and it doesn't always help, and very often you end up burning a lot of money, burning too many resources. So that's one area of the problem, and we say it's related to infrastructure management. And I would say the second biggest problem is with the developer workflow, where today in the big data world, well, before Spark on Kubernetes, it's still about provisioning VMs and running scripts and downloading JARs and going straight from local development to running in production without necessarily unit testing. So I feel that the regular software engineering world has made lots of progress and the big data world didn't fully benefit from it, and that is another category of issues that we wanted to solve.
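To make that list of knobs concrete, here is a rough Python sketch of the core sizing settings mentioned above. The values and the helper function are purely illustrative, not recommendations from the episode; the configuration key names come from the standard Spark configuration reference.

```python
# Illustrative sketch of the core Spark sizing knobs discussed above.
# Values are examples only; the right settings depend on the workload.

def executors_per_node(node_cores: int, node_mem_gb: int,
                       cores_per_executor: int, mem_per_executor_gb: int) -> int:
    """How many executors fit on one node: the smaller of the
    CPU-bound and memory-bound counts."""
    by_cpu = node_cores // cores_per_executor
    by_mem = node_mem_gb // mem_per_executor_gb
    return min(by_cpu, by_mem)

# A few of the "really important" configurations the answer alludes to:
spark_conf = {
    "spark.executor.cores": "4",
    "spark.executor.memory": "8g",
    "spark.executor.instances": "10",
    "spark.sql.shuffle.partitions": "200",  # the default; often needs tuning
}

# A 16-core, 64 GB node with 4-core / 8 GB executors fits 4 of them.
print(executors_per_node(16, 64, 4, 8))
```

Note how the executor count is bounded by whichever resource runs out first: shrink the node's memory and the same CPU budget fits fewer executors, which is exactly the kind of interaction that makes these knobs hard to tune by hand.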
[00:06:46] Unknown:
As far as the actual operational characteristics of Spark, you mentioned that there are all kinds of different knobs and levers that you have to know which ones to turn and which ones to avoid. And I'm curious that as you bring that into the cloud native world and containerize it, what are some of the other operational challenges that come up, or what are some of the aspects of Spark that might make it resistant to being able to easily jam into a container and put onto a Kubernetes cluster that you had to overcome as you're building out this platform of data mechanics?
[00:07:16] Unknown:
Overall, you know, it's easier. But this being said, there are some specific operational things. For example, let's say that you want to fully utilize a node that has, I don't know, 4 CPUs, and so you're just going to tell Spark, oh, submit my application, and I want 4 cores. Well, actually, this will get stuck in a pending state. It will not get scheduled. Why? Because Kubernetes has taken a little portion of the CPU capacity, and because there may be some daemons running on the node, so you can't take 100%. That's a common problem for users, but in general, the move to containers makes things a lot easier. For example, the fact that the Spark distribution itself is included in the container image makes so many things a lot simpler. You get isolation, you get many things that maybe we'll cover next.
The other problem is just that not every data team wants to become a Kubernetes expert, and they aren't necessarily used to working with Docker, so sometimes they need to learn new technologies. But, overall, I was surprised when I started working with Spark on Kubernetes, even back then at Databricks when it was more of a prototype, how relatively easy it was to get started.
[00:08:29] Unknown:
The other aspect of running in the cloud is that, you know, if you're running in a data center, you can reasonably expect that the machines that you're running on are going to continue running, except in the case of, you know, a catastrophic disk failure or a power outage. Whereas if you're running in a cloud environment, you're much more likely to have your instance preempted or suffer from issues with noisy neighbors, or, you know, have network congestion because somebody else is trying to download all the videos of Netflix or something. And so in particular, if you're trying to optimize for cost, you might be running on spot instances that are going to be preemptible.
And how does that affect the ways that the consumers of Spark think about how to build their jobs and how do you, as an operator of Spark, manage things like retries, state redistribution,
[00:09:18] Unknown:
and things like that for the cases where that instance does get terminated, you know, from under you when you're in the middle of a large batch run? So I'm a huge fan of using spot nodes. There are really some great solutions for these problems. So to explain them: first, with Spark, you have two types of processes. You have the driver process; the driver is the brain of your application. It reads your code and decides how to split it up and run many tasks. And then you have the executors, and you may have, I don't know, 100 executors, and they actually do the real work; they actually run your tasks. The driver is a single point of failure, okay, it's the brain. If the driver dies in the middle of your job, your entire job fails. So it's important to put the driver on an on-demand node; that's what we do at Data Mechanics by default. And then the executors can be put on spot. If an executor is spot-killed, then there will be an automated retry, so the tasks that were running on it will actually get scheduled on another executor.
There is some lost work. I mean, you know, when you kill an executor in the middle of running a task, you lose the work it was doing. You can also lose some files, and that's what you'd call state redistribution: some files that were stored on the executor. With the latest release of Spark, Spark 3.1, there is a new feature that's really powerful. The cloud providers, before preempting a node, give you a notice. For some providers, it's 2 minutes. For others, it's only 30 seconds. But they give you a warning and say, oh, in 2 minutes, this spot node will die. And now Spark, with some integration, can listen to this warning, anticipate the spot kill, and move the shuffle files and the cached data from the executor that's going away to an executor that's going to stay. And so that means Spark is even more stable. It was already stable with spot, but now you can almost avoid any impact of the spot kill.
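The graceful-decommissioning behavior described here is switched on through configuration. Below is a sketch of the relevant settings, with key names taken from the upstream Spark configuration reference; verify them against the exact Spark version you deploy. Pinning the driver to on-demand nodes is done separately (for example via node selectors or pod templates) and is not shown.

```python
# Spark 3.1+ settings for reacting to a spot/preemption notice by
# migrating shuffle files and cached blocks off the dying executor.
# Check these key names against your Spark version's documentation.
decommission_conf = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    # Migrate shuffle files to surviving executors:
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    # Migrate cached RDD blocks as well:
    "spark.storage.decommission.rddBlocks.enabled": "true",
}

# Render as spark-submit flags:
for key, value in sorted(decommission_conf.items()):
    print(f"--conf {key}={value}")
```

With these enabled, the preemption notice window (30 seconds to 2 minutes, depending on the provider) is spent copying blocks instead of simply losing them, which is what makes the spot kill nearly free.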
[00:11:15] Unknown:
And so another aspect of running in the cloud is that if you're on premise, you might have a network that's optimized for being able to handle these high-throughput workloads, where you need to be able to maybe have a storage area network where all your data lives, and you might have InfiniBand for being able to run between different nodes and have high network throughput. Whereas in the cloud, you're just working with whatever the data center provider has, and, you know, you might have network congestion from other people running in the same racks and switches as you that you have no control over. And I'm curious whether, maybe because of networking considerations or other aspects of running within the cloud, you know, maybe because you're optimizing for object storage,
how does that influence the way that you think about designing your jobs or interacting with a Spark cluster, going from on premise to a cloud-native environment?
[00:12:08] Unknown:
So, first, with respect to the differences in infrastructure and networking between the cloud and on premise: it's true, you have to make these choices in the cloud. One of the things we can do, and try to do with our product, is help people make better choices in terms of which instance type would be best for your type of workload: more compute, more memory, what type of disks? As you said, the disks don't all have the same throughput, and throughput to the disks matters a lot in Spark, particularly when you do shuffle, where you need to write the data to disk before it gets sent to other nodes in the cluster. But you're right, in general, we do optimize for object storage performance, and so we have connectors and a set of configurations to get the best performance possible to S3 and so on. And, in general, I do like the cloud world, where you don't need to think too much about HDFS, you know, about coupling compute and storage together. You can separate the two pretty cleanly, and on average, you get really good performance on S3 and object storage in general.
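As one concrete illustration of the "set of configurations" for object storage mentioned above, these are the kinds of S3A connector settings that commonly get tuned. The key names are from the Hadoop S3A connector documentation; the values are arbitrary examples, not Data Mechanics' actual settings.

```python
# Example S3A tuning knobs, passed to Spark via the spark.hadoop.* prefix.
# Key names come from the Hadoop S3A connector docs; values are examples.
s3a_conf = {
    "spark.hadoop.fs.s3a.connection.maximum": "100",  # parallel connections
    "spark.hadoop.fs.s3a.threads.max": "64",          # upload thread pool
    "spark.hadoop.fs.s3a.fast.upload": "true",        # stream uploads in blocks
}

for key, value in sorted(s3a_conf.items()):
    print(f"{key}={value}")
```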
I mean, I don't have a lot of expertise, you know, working on premise. What I do know is that probably one of the biggest differences is the fact that, obviously, on premise you have a fixed-size cluster, and after a while you have to manage contention, you have to do priorities, queues, and so on, while in the cloud you just get more machines because you don't have a maximum size. That's one less complexity that you need to manage, but the drawback is obviously that anyone in your company can potentially decide to create a very large cluster, provision many machines, and then, like, waste a lot of money. And since, you know, when you don't really know how to troubleshoot your Spark applications, the most simple thing you can try out is, like, oh, should I use more memory, or should I use a bigger cluster? Well, people tend to overprovision their environments
[00:14:01] Unknown:
by a lot and waste a lot of money. So that's a big problem for people who migrate to the cloud. In terms of the customizations that you've had to make to Spark itself, is there any special work that you had to do to make a certain distribution, or, you know, any specific modifications to the code of Spark for being able to run it efficiently in the Data Mechanics platform, or any optimizations that you've made to maybe reduce the container size or make it easier to incorporate upstream or downstream dependencies, or just some of the overall interaction that you've had with Spark itself for building this platform on top of it?
[00:14:41] Unknown:
So we don't maintain a fork of Spark; we use open source Spark. When we see a bug, we make a PR and commit the fix to open source Spark. However, we do maintain a fleet of optimized Spark Docker images that our customers use. They contain the Spark distribution itself, but also Java, Python, Hadoop, and a lot of connectors to popular data sources: S3, GCS, Azure Data Lake, Snowflake, Delta Lake. So, you know, you can get started and not run into the problem where, oh, I don't have the right connector, and then you install it, and, oh, this connector conflicts with the version of Scala that I use, and, you know, the dependency nightmare that people sometimes run into. These images were a lot of work to build, test, and keep maintaining. We made them available to our customers, and now we've also published them on our Docker Hub for anyone to use. So even if you're not a Data Mechanics customer, but you just want a good Spark Docker image to read from S3 and Snowflake, you can grab it there. A second area of optimization we did in terms of Spark performance is not within Spark itself, but more about tuning all these knobs that we talked about: the amount of memory, the type of instance, the number of partitions, and so on. In our platform, we have what we call an auto-tuning optimization algorithm: whenever you run a pipeline on a schedule, we're going to look at the logs of the runs from yesterday and the day before.
And based on these logs, we're going to say, oh, we have a memory problem, we need more memory. Oh, we're overprovisioning the cluster, we should use fewer executors. Oh, the number of partitions is too small, and the parallelism isn't great because of that. And so we have this algorithm that's not really within Spark but is more one abstraction over Spark, which will tune these configurations for you and reap a lot of the low-hanging fruit of the mistakes that you could have made otherwise.
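A toy version of that rule-based tuning loop might look like the sketch below. All metric names, thresholds, and rules are invented for illustration; Data Mechanics' actual algorithm is not public.

```python
# Toy rule-based auto-tuner: inspect metrics from recent scheduled runs
# and nudge the configuration. Field names and thresholds are made up.

def recommend(metrics: dict, conf: dict) -> dict:
    new_conf = dict(conf)
    if metrics["peak_memory_fraction"] > 0.9:       # memory pressure
        new_conf["executor_memory_gb"] = conf["executor_memory_gb"] * 2
    if metrics["avg_cpu_utilization"] < 0.3:        # over-provisioned cluster
        new_conf["executor_instances"] = max(1, conf["executor_instances"] // 2)
    if metrics["avg_task_seconds"] < 1.0:           # tasks too small
        new_conf["shuffle_partitions"] = max(20, conf["shuffle_partitions"] // 2)
    return new_conf

conf = {"executor_memory_gb": 8, "executor_instances": 10, "shuffle_partitions": 200}
metrics = {"peak_memory_fraction": 0.95, "avg_cpu_utilization": 0.2,
           "avg_task_seconds": 0.5}
print(recommend(metrics, conf))
# → {'executor_memory_gb': 16, 'executor_instances': 5, 'shuffle_partitions': 100}
```

Each rule only fires when its metric crosses a threshold, so a healthy run leaves the configuration untouched; the real system presumably also damps oscillation between runs, which this sketch ignores.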
[00:16:39] Unknown:
And as far as things like that algorithm or being able to auto tune the cluster, you know, what are some of the other supporting tools that you've had to build to be able to work with Spark and manage it, particularly in customers' environments where you don't necessarily have full control over the types of hardware that they're using or the types of workloads that they're trying to run alongside it?
[00:17:01] Unknown:
The main tool that I want to talk about is Delight. We just publicly launched it. It's a free, cross-platform monitoring and observability tool, and it's a great complement to the Spark UI. With the Spark UI, it's very hard to understand why your Spark application is slow or why it's failing. It shows a lot of information, a lot of big tables, but it's hard to filter out the noise and find the right information. It also doesn't show you system metrics like CPU, memory, disk usage, and so on. With Delight, we give you a very simple graph where you can see the breakdown of your CPU usage over time, so it says, oh, I have a first phase of my Spark application where I do a lot of compute, and then I do a lot of I/O, and then I do a lot of shuffle. And on the same graph, just underneath, you can see on the same timeline the list of your Spark jobs and stages. And so you can correlate: oh, this first part of my app is CPU intensive, and that's because that's when I'm, you know, training this machine learning model. And then this other part is I/O intensive because that's when I actually save it, dump it somewhere.
So with this tool, we're really trying to help people troubleshoot the performance of their app. We also have memory metrics to let them know, oh, is it safe to use a smaller node that has less memory? Because right now people are in the dark, and you just do trial and error, but it's a slow iteration cycle. So I think that's the main thing. And it's also a great complement to our platform value prop, because with auto-tuning, you know, we improve the performance, but we're not super smart; we apply some rules. Right? But here it's more about giving intelligence back to the user, who knows their code and can use this information to understand, oh, my data is not partitioned the right way. So we will be trying to add as much intelligence as possible in this UI, which is available for free, not just for our customers but for anyone who uses Spark.
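The kind of timeline breakdown described above can be sketched as a simple aggregation of per-task time into phases. The phase labels and the numbers are hypothetical, chosen to mirror the compute/I/O/shuffle example from the conversation.

```python
# Aggregate task time into a breakdown like the one Delight draws:
# what fraction of the application's time went to each kind of work.

def time_breakdown(tasks):
    total = sum(t["seconds"] for t in tasks)
    buckets = {}
    for t in tasks:
        buckets[t["phase"]] = buckets.get(t["phase"], 0) + t["seconds"]
    return {phase: secs / total for phase, secs in buckets.items()}

tasks = [
    {"phase": "compute", "seconds": 60},  # e.g. training the ML model
    {"phase": "io", "seconds": 30},       # e.g. dumping the result somewhere
    {"phase": "shuffle", "seconds": 10},
]
print(time_breakdown(tasks))  # → {'compute': 0.6, 'io': 0.3, 'shuffle': 0.1}
```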
[00:19:05] Unknown:
In terms of the rest of the platform, can you dig a bit deeper into what it is that you're providing to your customers when they say, I'm going to use data mechanics to run my spark cluster? What is the actual process of getting it set up and integrated into their environment? And what are the sort of moving pieces under the hood that you're responsible for that you're managing?
[00:19:25] Unknown:
So we're deployed on a Kubernetes cluster inside our customer's cloud account, and we create and manage this cluster for them. So the customer doesn't need to have any expertise with Kubernetes. What they do is give us an IAM role, a set of permissions on the cloud account where we're going to deploy. If they use Amazon, we're going to create and manage an EKS cluster for them. If they use GCP, it's GKE. If they use Azure, it's AKS. The Kubernetes cluster is long-running, and there is one service that's there all the time. It's called the gateway, and it's basically the entry point to launch Spark applications, through the API or by connecting a Jupyter notebook. The gateway also serves a web dashboard, so for the customer, you know, the gateway is basically the Data Mechanics platform.
This being said, we also have centralized infrastructure, you know, in our own cloud account, where we do authentication, where we store some logs, and where we have a centralized database. And since you're asking how it's implemented, I can also explain the technologies at a really high level. All our back-end services are written in Python. On the front end, we use Redux and TypeScript. And in general, our infrastructure is managed with Kubernetes, Docker, and Terraform.
[00:20:41] Unknown:
In terms of the actual deployment, what are some of the complexities that you run into as far as being able to manage these clusters in your customer environments? And because of the fact that you're using Terraform for deploying them, how much access do you give to the end user for being able to, you know, go into the console and, you know, modify some of the settings of the cluster? How do you handle reconciling that with the state that Terraform thinks it has as you're building out these clusters and just some of the overall strategy that you have for being able to make the actual infrastructure provisioning composable and maintainable as you add more and more users and maybe introduce new capabilities?
[00:21:21] Unknown:
First, what are the biggest challenges with respect to deployments? I would say it's actually the ability to, you know, continuously upgrade our services, because many of our customers use the platform 24/7. In the early days, it wasn't like that. We could say, oh, we have a window where no one uses the platform, so it can be down, but we're not there anymore. So when we do a software update, we need to make sure all our services stay up and there is absolutely no downtime for the customer, and we do this by using, you know, a blue/green process. And there are lots of moving pieces, you know. Under the hood, we incorporate the Spark Operator, which is an open source project, and making it work with blue/green was really hectic. The Kubernetes cluster itself, you know, has a major version that we sometimes need to upgrade, and that has dependencies on a lot of other software, so we need to test all of this and keep it highly available.
Now, I think the other part of your question was, how much control does the end user have over the infrastructure? And you're right. Today, in our product, we don't yet support a way to, say, edit our master Terraform script, so it's a bit more custom. The good news is that these things don't come up on a daily basis. But, basically, when the customer makes a change, we ask them to let us know, and then we incorporate it in our Terraform. We have some plans to integrate this in our product, and, in fact, to even let the customer deploy the platform in a self-service way. I think it will still take, I don't know, at least a quarter to build, but once we're there, we wouldn't have this shared responsibility or gray area where maybe, you know, you need to communicate to make sure changes are propagated throughout the stack.
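The zero-downtime upgrade flow mentioned above follows the classic blue/green pattern: bring the new version up alongside the old one, check its health, and only then move traffic. This is a generic sketch of that decision, not Data Mechanics' actual deployment code.

```python
# Generic blue/green rollout decision: traffic only moves to the new
# ("green") deployment if it passes health checks; otherwise the old
# ("blue") deployment keeps serving, so there is never downtime.

def blue_green_switch(live: str, candidate: str, is_healthy) -> str:
    """Return the deployment that should receive traffic after a rollout."""
    if is_healthy(candidate):
        return candidate  # cut over; keep `live` around for fast rollback
    return live           # failed checks: stay on the current version

print(blue_green_switch("blue", "green", lambda name: True))   # → green
print(blue_green_switch("blue", "green", lambda name: False))  # → blue
```

The subtlety the episode hints at is that stateful pieces, like the Spark Operator watching in-flight applications, don't fit this stateless switch cleanly, which is why making it "work with blue/green was really hectic."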
[00:23:10] Unknown:
Patrick is a diligent data engineer, probably the best on his team. Yesterday, when trying to optimize the performance of a query running over 20,000,000,000 rows, he was so eager to succeed that he read the entire database documentation. He changed the syntax. He changed the schema. He gave it his everything and reduced the response time from 20 minutes down to 5. Today is not a good day. Sarah from business intelligence says 5 minutes is way too long. John, the CFO, is constantly Slacking every living being trying to figure out what caused the business intelligence expenses to grow so high yesterday.
Want to become the liberator of data? Firebolt's cloud data warehouse can run complex queries over terabytes and petabytes of data in sub-seconds with minimum resources. No more waiting, no more huge expenses, and your hard work finally pays off. Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt today to get started, and the first 25 visitors will receive a free Firebolt t-shirt. As you have gone through the process of building out the platform and onboarding customers, you've gone through this growth phase: from when you were able to say, okay, I've got this window of time, I could just shut everything down and rebuild it, to now everything's 24/7, and we have to, you know, manage blue/green deployments and figure out how to maintain four or five nines of uptime and maintain our SLAs. How have the overall goals and design of the actual Data Mechanics platform shifted or evolved since you first began working on it? Yes. So the overall, really high-level goal of the platform, the mission, has stayed the same,
[00:24:47] Unknown:
and the high-level design, the fact that we have a data plane, you know, in the customer's account, and a control plane, you know, centralized in our cloud account, this has stayed the same. But still, you know, we are constantly learning and taking feedback. One major decision that required significant investment and some architectural change was the decision to open up Delight to the world. You know, Delight, this monitoring tool, initially we thought we were just going to build it for our customers. But then we realized, oh, you know what? The way it's implemented, anyone using Spark could just attach the Delight agent as a JAR, and then we could process the metrics the same way we do for our customers and provide value to anyone using Spark. And, obviously, there is some marketing value to that. If everyone starts using Delight, as I hope they will, then they'll know of us just like they know of the other Spark platforms.
But, yeah, this required re-architecting the platform a lot: making some decisions about which parts of the code would be shared between our real customers and the free Delight users, and which parts of the infrastructure would be shared in terms of databases, storage, and so on. And it's an investment that took us, I would say, 6 months in total, including 3 months that were really infrastructure changes.
[00:26:02] Unknown:
In terms of the actual impact of having the Data Mechanics platform up and running, and being able to containerize the deployment and have this container that you can run locally for iterating on a job, how does that change the ways that people think about how and where to use Spark? How does it change the calculus of when it makes sense to go with Spark versus just doing a one-off Python script or something? And what is the overall impact that it has on the ability of a data team to go from idea to production?
[00:26:35] Unknown:
So in terms of how it changes the way you use Spark, I think that's one of the great benefits of containerization and Kubernetes: people start developing with Spark more like they would develop traditional software engineering code. It's a lot easier to develop locally, run your application locally on a Docker image that you control, and maybe develop from an IDE; that's something we see a lot more. In terms of where to use Spark, it's the same use cases, I mean, Spark itself hasn't really changed, but there is a lower barrier to entry. Maybe you already use Kubernetes in some part of your stack, so why not just try the Spark Operator and get started? It's not, unfortunately, that easy yet. I think if you want to get started with Spark on Kubernetes, fully open source, there is still a lot to build right now, but it is definitely a much lower barrier to entry than having to become a Hadoop/YARN expert. And also, yeah, you can mix and match Spark and non-Spark apps on the same cluster and manage them with the same tooling. So really, the barrier to entry is much lower.
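To make the "same image locally and on the cluster" workflow concrete, here is a hedged sketch of what an open-source Spark-on-Kubernetes submission looks like. The `--conf` keys are standard Spark-on-Kubernetes settings; the API-server URL, namespace, and image name are made-up placeholders, not Data Mechanics specifics.

```python
# Sketch: assembling a spark-submit command targeting a Kubernetes cluster.
# The master URL, namespace, and image below are hypothetical placeholders.
def k8s_spark_submit(app: str,
                     image: str = "my-registry/my-spark-app:latest",
                     namespace: str = "spark-jobs",
                     executors: int = 4) -> list[str]:
    return [
        "spark-submit",
        "--master", "k8s://https://my-cluster.example.com:6443",
        "--deploy-mode", "cluster",
        "--conf", f"spark.kubernetes.namespace={namespace}",
        # The same Docker image is used locally for development and here
        # for the driver and executors on the cluster.
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.executor.instances={executors}",
        app,
    ]
```

The point of the pattern is that the image passed to `spark.kubernetes.container.image` is the one you iterated on locally with `docker run`, so "it works on my machine" and "it works on the cluster" converge.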
[00:27:45] Unknown:
In your tagline, you also mention that Data Mechanics is a Spark platform built for data engineers. I'm curious how that factors into the ways that you think about what features to build and how to present them, and what the customer profile is that you're working from as you iterate on your product.
[00:28:02] Unknown:
We focus right now on data engineers. Why? Because right now, our product is a really amazing serverless Spark back end with a really nice observability layer, but it's not yet a development environment. If we were trying to sell to, for example, data scientists, they would say, oh, but we want, I don't know, hosted collaborative notebooks, and we want to manage our machine learning life cycle, and so on; a very broad product. And, also, our vision with Spark on Kubernetes was that we could tune the configurations and make pipelines more performant, and that has a lot more value to people who run very large big data ETL pipelines or streaming. So that's why we focus on data engineers.
It forces us to focus on some really hard problems, problems like how to write an optimization algorithm that works whether you're running a moderately sized pipeline, say 100 gigabytes, or a pipeline that processes 10 or 100 terabytes, where you get into the problems of scale. It's a slightly different product. You know, we're still a startup. The only way we can compete with the very big players in the market is by focusing on a subsegment of the market and providing a 10x better developer experience. So that's what we're trying to do for data engineers. I think as we mature,
[00:29:24] Unknown:
we also want to be a great platform for other profiles. And even today, we have data scientists who use our platform and like it a lot. But I would say our focus is on data engineering. Yeah. As you mentioned, one of the things that, as you expand to more members of the data team, they'll start asking for is integrating with a notebook environment or being able to handle the full machine learning life cycle. And I know that there are a number of tools, both proprietary and open source, that offer those capabilities. So for people who want to start with Data Mechanics and then integrate with, say, the JupyterHub they already have set up for their team, what is involved in actually tying those things together? And is there any sort of barrier they might run into trying to get it all running in the same Kubernetes cluster or managing those connections, particularly as the product iterates and evolves?
[00:30:13] Unknown:
Yeah. We've invested a lot in notebooks and in making it easy to connect a Jupyter notebook, either one you run locally or one you run through a JupyterHub. This works out of the box and is very easy to set up: you basically just give Jupyter the URL of the Data Mechanics gateway, and under the hood, to provide this mechanism, we use an open source project called Jupyter Enterprise Gateway. So, actually, we did invest in making the developer workflow of data scientists, or any notebook users, better. It's not only data scientists who use notebooks. I mean, I use notebooks; data engineers use notebooks when they need to explore data quickly, when they need to produce a report, or when they want to develop interactively for larger-scale pipelines. And once they're ready to go to production, they definitely move their code to their IDE and then submit it as a batch application; it's Dockerized.
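Concretely, pointing a local Jupyter at a remote gateway is standard Jupyter Enterprise Gateway client configuration; here is a hedged sketch, with a placeholder URL and token rather than a real Data Mechanics endpoint.

```python
# jupyter_notebook_config.py (or pass --gateway-url on the command line).
# The URL and token below are hypothetical placeholders for your gateway.
c.GatewayClient.url = "https://spark-gateway.example.com"
c.GatewayClient.auth_token = "YOUR_TOKEN"  # if the gateway requires one
```

With this in place, kernels started from the local Jupyter UI actually run remotely, next to the Spark cluster, which is what makes the "just give Jupyter the URL" experience possible.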
But, yeah, we support notebooks. However, for the machine learning model life cycle, for example, we don't have any solution out of the box. Still, we're building a product on top of many open source pieces, so very often it's actually kind of easy to integrate, and during our customers' POCs we can help them integrate. But, yeah, we had to make some product choices, and that means focusing on the stuff that our customers don't want to manage themselves, which is the Spark infrastructure, and not building a full-fledged data science product, because there are some great products that already do that.
[00:31:41] Unknown:
And as you've been building out the platform, working with your customers, and helping them to understand the capabilities of Spark, how it relates to running inside of Kubernetes, and how it impacts their overall development cycle, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:32:00] Unknown:
I already ran into some of this at Databricks, but in terms of really challenging technical lessons: some of our biggest customers run into the limits of the instances that Amazon or GCP can provide, or the scalability limits of how big a Kubernetes cluster can get, or how many Spark applications you can submit at the same time without overloading the Kubernetes API, and so on. And this is really an area where we are learning many things as we scale our services. If we want to talk more about a lesson learned in terms of product management: initially, when we started this project, I mean, Spark on Kubernetes was kind of hyped, but we didn't know if it was going to be a success or not. Now I'm certain it's going to be a success, and all the big platforms are going to adopt it.
It's still unexpected to see how quickly some people are able to get started who are not comfortable at all with Kubernetes, who don't necessarily know Docker very well, once they get a little bit of hand-holding and a starter project. So, yeah, that's, I guess, a good learning: things can change for the better, and they're changing fast. Yeah. Definitely very fast.
[00:33:17] Unknown:
And for people who are considering trying to build a Spark job or integrate it into their data workflow and might be thinking about using Data Mechanics, what are the cases where either Spark or Data Mechanics might be the wrong choice, and they'd be better suited either using a different processing engine or a different distribution or deployment method?
[00:33:39] Unknown:
So Spark is great as a programming framework that lets you write data pipelines that process very large volumes of data. If your datasets are in the 10-gigabyte range, or, I don't know, 20, 30, 50, anything less than 100 gigabytes, maybe you can skip the complexity of a distributed computing framework and just scale your Python-based applications. Another case that is maybe not great for Spark is if you have a requirement for very, very low latency. Spark streaming can get you some low latency, and if you want to use Spark as a data warehouse tool and the data is partitioned in the right way, you can get answers from your data in a few seconds. But in general it hasn't been designed to provide very low latency to serve data in an application where you want a modern web response time. So these are some caveats with respect to Spark.
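As an illustration of the "just scale your Python applications" alternative for sub-100-gigabyte data, here is a minimal stdlib-only sketch that aggregates a CSV on a single machine, streaming row by row so the file never has to fit in memory. The file layout and column names are made up for the example.

```python
# Sketch: single-machine aggregation with no cluster required. Streams the
# file, so memory use stays flat regardless of file size. The "region" and
# "revenue" columns are hypothetical.
import csv
from collections import defaultdict

def total_revenue_by_region(path: str) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["region"]] += float(row["revenue"])
    return dict(totals)
```

For a one-off job like this, the whole "cluster, scheduler, shuffle" machinery Spark brings along is overhead rather than help; that is the trade-off being described here.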
Now, for Data Mechanics, if you know you need Spark, when is it the wrong choice? I would say, if you have a mostly mathematical or statistical background, but you're not very comfortable calling an API or building a Docker image, and you don't want to get involved in any of that, but would rather have a mostly UI-based platform where you can develop everything from the UI, maybe trigger your applications and schedule them with drag and drop and so on, then that's not what our product provides. But you don't need to be a Kubernetes expert at all. You just need to be willing to get to know Docker and call an API, and you'll see it is simpler than you might think.
[00:35:23] Unknown:
And as you continue to iterate on the Data Mechanics platform, what do you have planned for the near to medium term, whether in terms of upgraded capabilities, improved features, or new products that you're planning to release?
[00:35:36] Unknown:
If we're talking about this quarter, the main thing we're investing in right now is improving the development workflow. Today we give our new users starter projects where they get starter code and scripts to call our APIs, and now we want to give them a CLI, which makes things even simpler: with a single command you can build the Docker image, push it, and run it. We also want to make it simple to create a shell and work with Spark interactively even if you're not in a notebook, just a simple PySpark shell. Then we also want to keep investing on the automated tuning side, where we want to automatically remediate out-of-memory errors: maybe in the middle of the night your application crashed with a memory issue; well, it's going to be retried by your scheduler, and on the second attempt, automatically, it has more memory, and it passes.
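The auto-remediation idea described above can be sketched as a simple policy: if the last attempt failed with an out-of-memory error, the next submission gets a larger executor memory setting, up to some cap. This is an editor's illustration of the concept, not Data Mechanics' actual algorithm, and all names are hypothetical.

```python
# Hedged sketch of OOM auto-remediation: bump executor memory on retry.
# The multiplier and cap are arbitrary illustrative choices.
def next_executor_memory_gb(previous_gb: float, failed_with_oom: bool,
                            factor: float = 1.5, cap_gb: float = 64.0) -> float:
    """Return the executor memory to request on the next attempt."""
    if not failed_with_oom:
        return previous_gb  # unrelated failure: don't change resources
    return min(previous_gb * factor, cap_gb)
```

The interesting part of the real feature is that this decision happens inside the platform, so an ordinary scheduler retry (Airflow, cron, etc.) picks up the new setting without any change on the user's side.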
That's for the short term. If we're talking more medium term, we probably want to explore adding some scheduling capability to the platform. Right now we have connectors to Airflow, Azure Data Factory, Amazon Step Functions, Azure Logic Apps, many different schedulers, but some customers don't have any scheduler. And we think that in the long term, it's a strategic area to invest in, because scheduling gives you a high-level view of all your pipelines, and once you have this information you can make a lot of powerful optimizations, such as: instead of telling me when to run the pipeline, tell me by what time it should be finished, and I, the platform, will decide when is the best time to run it based on overall capacity, or maybe spot prices and spot availability, and so on. So this is something that we want to invest in more in the medium term.
[00:37:22] Unknown:
Are there any other aspects of the work that you're doing at Data Mechanics or the overall Spark ecosystem that we didn't discuss yet that you'd like to cover before we close out the show? Honestly, I think we covered a lot.
[00:37:34] Unknown:
I think just my overall message: if you're a data engineer and you know a little bit of Spark, well, first, try out our Delight project. We hope you will be delighted, and it will give you insights into the performance of your Spark applications. And if you're curious to learn more about Spark on Kubernetes, we wrote some blog posts about the pros and cons of Spark on Kubernetes, and I'm happy to talk more, give you a demo, learn more about your use case, and whether it can be useful to you or not. And, also, we are hiring, so check out our website for up-to-date info on that.
[00:38:11] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:38:25] Unknown:
I've heard of some projects that want to solve this, but I don't think they're mainstream yet: I think people have a hard time following changes to their data; basically, having version control for your datasets. For example, with some of our customers, we migrate them from another platform, maybe we make some changes to their code, and then they say, oh, well, I want to make sure the application is correct and the data is all the same, and that's kind of manual today. So, yes, having some kind of version control to make sure your data is still correct, that would be an interesting problem to solve.
[00:39:00] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Data Mechanics. It's definitely a very interesting product and one that is filling a much-needed area of the market. So I appreciate the time and energy you've put into it, and I hope you enjoy the rest of your day.
[00:39:13] Unknown:
Yeah. Thanks so much for having me on the show, Tobias. I hope I had some interesting stories to share, and, yeah, best wishes to you, to this great podcast, and your future next steps. Thank you.
[00:39:29] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Jean Yves' Background and Journey into Data Management
Overview of Data Mechanics and Its Mission
Challenges and Solutions in Spark Infrastructure Management
Operational Challenges of Running Spark on Kubernetes
Handling Cloud Environment Challenges
Customizations and Optimizations for Spark
Supporting Tools and Delight Monitoring Tool
Customer Integration and Platform Management
Platform Evolution and Delight Tool Launch
Impact of Containerization on Spark Usage
Focusing on Data Engineers and Future Expansion
Lessons Learned and Product Management
When to Use Spark and Data Mechanics
Future Plans for Data Mechanics
Final Thoughts and Closing Remarks