Summary
Spark is one of the most well-known frameworks for data processing, whether for batch or streaming, ETL or ML, and at any scale. Because of its popularity it has been deployed on every kind of platform you can think of. In this episode Jean-Yves Stephan shares the work that he is doing at Data Mechanics to make it sing on Kubernetes. He explains how operating in a cloud-native context simplifies some aspects of running the system while complicating others, how it simplifies the development and experimentation cycle, and how you can get a head start using their pre-built Spark container. This is a great conversation for understanding how new ways of operating systems can have broader impacts on how they are being used.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started. The first 25 visitors will receive a Firebolt t-shirt.
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription.
- Your host is Tobias Macey and today I’m interviewing Jean-Yves Stephan about Data Mechanics, a cloud-native Spark platform for data engineers
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what you are building at Data Mechanics and the story behind it?
- What are the operational characteristics of Spark that make it difficult to run in a cloud-optimized environment?
- How do you handle retries, state redistribution, etc. when instances get pre-empted during the middle of a job execution?
- What are some of the tactics that you have found useful when designing jobs to make them more resilient to interruptions?
- What are the customizations that you have had to make to Spark itself?
- What are some of the supporting tools that you have built to allow for running Spark in a Kubernetes environment?
- How is the Data Mechanics platform implemented?
- How have the goals and design of the platform changed or evolved since you first began working on it?
- How does running Spark in a container/Kubernetes environment change the ways that you and your customers think about how and where to use it?
- How does it impact the development workflow for data engineers and data scientists?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while building the Data Mechanics product?
- When is Spark/Data Mechanics the wrong choice?
- What do you have planned for the future of the platform?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Data Mechanics
- Databricks
- Stanford
- Andrew Ng
- Mining Massive Datasets
- Spark
- Kubernetes
- Spot Instances
- Infiniband
- Data Mechanics Spark Container Image
- Delight – Spark monitoring utility
- Terraform
- Blue/Green Deployment
- Spark Operator for Kubernetes
- JupyterHub
- Jupyter Enterprise Gateway
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because the number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all of this collaboration chaos firsthand, and they started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan, that's a-t-l-a-n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l-i-n-o-d-e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Jean-Yves Stephan about Data Mechanics, a cloud-native Spark platform for data engineers. So, Jean-Yves, can you start by introducing yourself?
[00:02:07] Unknown:
Pleasure to be here. So, yeah, I'm Jean-Yves. I'm the cofounder of Data Mechanics. Prior to Data Mechanics, I was a software engineer at Databricks, where I led their Spark infrastructure team. So I've been working with Spark as an infrastructure provider for quite a few years now, and, yeah, I'm pretty passionate about it, so I hope I have some interesting stories to share with your audience.
[00:02:29] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:32] Unknown:
Yeah. So I studied engineering in France, then went to the US, to Stanford. At the time, machine learning was everyone's obsession. I remember a pretty popular machine learning class by Andrew Ng that had 100,000 students registered, but it was actually a separate class that interested me, Mining Massive Datasets, which was my introduction to distributed computing. I found that this area was a great mix of really interesting software engineering problems, algorithms, and architecture problems. And then I had the opportunity to join Databricks as a pretty early software engineer, just, you know, out of college, and that was an amazing experience, and that's how I got involved in that area.
[00:03:14] Unknown:
And so you mentioned that you had that experience of running Spark at Databricks, and now you're running it for other people in your company at Data Mechanics. I'm wondering if you can just start by giving a bit more of an overview about what it is that you're building there at Data Mechanics and some of the story behind what made you decide to set out on your own and run your own business to help provide this as a service to more people. Yeah. Of course. So Data Mechanics is a cloud-native Spark platform for data engineers.
[00:03:42] Unknown:
Our platform is deployed on a Kubernetes cluster that we create and manage for our customers inside their cloud account. So the contract with our users is: they develop Spark code, they submit it, and then we take care of scaling the infrastructure, tuning the configurations, collecting the logs, and making them available in a friendly user interface. We want to give a serverless experience to our Spark users. And, indeed, you know, prior to Data Mechanics, I was leading the Spark infrastructure team at Databricks. Databricks is an enterprise data platform. It's a great tool, and I'm pretty proud of what we built there, but I would say that for data engineers whose main job is to build and maintain large-scale ETL workloads in a stable and cost-effective way, it's not necessarily so great.
It's not so flexible. There are still a lot of tuning hurdles on you. It's hard to make it cost effective too. And then, yeah, I saw that Spark on Kubernetes would give us an opportunity to make Spark really 10x more developer friendly and cost effective. My cofounder, Julien, was a data engineer using Spark on a daily basis. He shared this passion about technical challenges and developer tools, and we always had this crazy dream of starting a company, and that's how we started it. In terms of your experience working at Databricks and running their
[00:05:13] Unknown:
platform, what were some of the lessons that you learned about what to do and also what things to avoid as you're trying to build out this business and, you know, provide Spark as a maintainable service that people are able to scale on their own? Yeah. So first,
[00:05:25] Unknown:
Spark has too many knobs. You have to decide on the memory that you want to give to your containers, how many CPUs, how many executors (so how many containers to run). For Spark configurations, there are hundreds of them, and maybe only, I don't know, a dozen that are really important, and they evolve with Spark versions. Then you have to think about shuffle settings and the number of partitions. So it's a big list, and most Spark users do not have a PhD in Spark. They just want to run their pipeline or their notebook and get some answers. So when they run into an issue, their typical answer is to throw more machines at the problem, hoping it will help, and it doesn't always help, and very often you end up burning a lot of money, burning too many resources. So that's one area of the problem, and we say it's related to infrastructure management. And I would say the second biggest problem is with the developer workflow, where today in the big data world, well, before Spark on Kubernetes, it's still about provisioning VMs and running scripts and downloading JARs and going straight from local development to running in production without necessarily unit testing. So I feel that the regular software engineering world has made lots of progress and the big data world didn't fully benefit from it, and that is another category of issues that we wanted to solve.
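To make that list of knobs concrete, here is a rough Python sketch of the core sizing settings mentioned above. The values and the helper function are purely illustrative, not recommendations from the episode; the configuration key names come from the standard Spark configuration reference.

```python
# Illustrative sketch of the core Spark sizing knobs discussed above.
# Values are examples only; the right settings depend on the workload.

def executors_per_node(node_cores: int, node_mem_gb: int,
                       cores_per_executor: int, mem_per_executor_gb: int) -> int:
    """How many executors fit on one node: the smaller of the
    CPU-bound and memory-bound counts."""
    by_cpu = node_cores // cores_per_executor
    by_mem = node_mem_gb // mem_per_executor_gb
    return min(by_cpu, by_mem)

# A few of the "really important" configurations the answer alludes to:
spark_conf = {
    "spark.executor.cores": "4",
    "spark.executor.memory": "8g",
    "spark.executor.instances": "10",
    "spark.sql.shuffle.partitions": "200",  # the default; often needs tuning
}

# A 16-core, 64 GB node with 4-core / 8 GB executors fits 4 of them.
print(executors_per_node(16, 64, 4, 8))
```

Note how the executor count is bounded by whichever resource runs out first: shrink the node's memory and the same CPU budget fits fewer executors, which is exactly the kind of interaction that makes these knobs hard to tune by hand.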
[00:06:46] Unknown:
As far as the actual operational characteristics of Spark, you mentioned that there are all kinds of different knobs and levers that you have to know which ones to turn and which ones to avoid. And I'm curious that as you bring that into the cloud native world and containerize it, what are some of the other operational challenges that come up, or what are some of the aspects of Spark that might make it resistant to being able to easily jam into a container and put onto a Kubernetes cluster that you had to overcome as you're building out this platform of data mechanics?
[00:07:16] Unknown:
Overall, you know, it's easier. But this being said, there are some specific operational things. For example, let's say that you want to fully utilize a node that has, I don't know, 4 CPUs, and so you're just going to tell Spark, oh, submit my application, and I want 4 cores. Well, actually, this will get stuck in a pending state. It will not get scheduled. Why? Because Kubernetes has taken a little portion of the CPU capacity, and because there may be some daemons running on the node, so you can't take 100%. That's a common problem for users, but in general, the move to containers makes things a lot easier. For example, the fact that the Spark distribution itself is included in the container image makes so many things a lot simpler. You get isolation, you get many things that maybe we'll cover next.
The other problem is just that not every data team wants to become a Kubernetes expert, and they aren't necessarily used to working with Docker, so sometimes they need to learn new technologies. But, overall, I was surprised when I started working with Spark on Kubernetes, even back then at Databricks when it was more of a prototype, how relatively easy it was to get started.
[00:08:29] Unknown:
The other aspect of running in the cloud is that, you know, if you're running in a data center, you can reasonably expect that the machines that you're running on are going to continue running, except in the case of, you know, a catastrophic disk failure or a power outage. Whereas if you're running in a cloud environment, you're much more likely to have your instance preempted or suffer from issues with noisy neighbors, or, you know, have network congestion because somebody else is trying to download all the videos of Netflix or something. And so in particular, if you're trying to optimize for cost, you might be running on spot instances that are going to be preemptible.
And how does that affect the ways that the consumers of Spark think about how to build their jobs and how do you, as an operator of Spark, manage things like retries, state redistribution,
[00:09:18] Unknown:
and things like that for the cases where that instance does get terminated, you know, from under you when you're in the middle of a large batch run? So I'm a huge fan of using spot nodes. There are really some great solutions for these problems. So to explain them: first, with Spark, you have two types of processes. You have the driver process; the driver is the brain of your application. It reads your code and decides how to split it up and run many tasks. And then you have the executors, and you may have, I don't know, 100 executors, and they actually do the real work; they actually run your tasks. The driver is a single point of failure, okay, it's the brain. If the driver dies in the middle of your job, your entire job fails. So it's important to put the driver on an on-demand node; that's what we do at Data Mechanics by default. And then the executors can be put on spot. If an executor is spot-killed, then there will be an automated retry, so the tasks that were running on it will actually get scheduled on another executor.
There is some lost work. I mean, you know, when you kill an executor in the middle of running a task, you lose the work it was doing. You can also lose some files, and that's what you'd call state redistribution: some files that were stored on the executor. With the latest release of Spark, Spark 3.1, there is a new feature that's really powerful. The cloud providers, before preempting a node, give you a notice. For some providers, it's 2 minutes. For others, it's only 30 seconds. But they give you a warning and say, oh, in 2 minutes, this spot node will die. And now Spark, with some integration, can listen to this warning, anticipate the spot kill, and move the shuffle files and the cached data from the executor that's going away to an executor that's going to stay. And so that means Spark is even more stable. It was already stable with spot, but now you can almost avoid any impact of the spot kill.
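The graceful-decommissioning behavior described here is switched on through configuration. Below is a sketch of the relevant settings, with key names taken from the upstream Spark configuration reference; verify them against the exact Spark version you deploy. Pinning the driver to on-demand nodes is done separately (for example via node selectors or pod templates) and is not shown.

```python
# Spark 3.1+ settings for reacting to a spot/preemption notice by
# migrating shuffle files and cached blocks off the dying executor.
# Check these key names against your Spark version's documentation.
decommission_conf = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    # Migrate shuffle files to surviving executors:
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    # Migrate cached RDD blocks as well:
    "spark.storage.decommission.rddBlocks.enabled": "true",
}

# Render as spark-submit flags:
for key, value in sorted(decommission_conf.items()):
    print(f"--conf {key}={value}")
```

With these enabled, the preemption notice window (30 seconds to 2 minutes, depending on the provider) is spent copying blocks instead of simply losing them, which is what makes the spot kill nearly free.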
[00:11:15] Unknown:
And so another aspect of running in the cloud is that if you're on premise, you might have a network that's optimized for being able to handle these high-throughput workloads, where you need to be able to maybe have a storage area network where all your data lives, and you might have InfiniBand for being able to run between different nodes and have high network throughput. Whereas in the cloud, you're just working with whatever the data center provider has, and, you know, you might have network congestion from other people running in the same racks and switches as you that you have no control over. And I'm curious whether, maybe because of networking considerations or other aspects of running within the cloud, you know, maybe because you're optimizing for object storage,
how does that influence the way that you think about designing your jobs or interacting with a Spark cluster, going from on premise to a cloud-native environment?
[00:12:08] Unknown:
So, first, with respect to the differences in infrastructure and networking between the cloud and on premise: it's true, you have to make these choices in the cloud. One of the things we can do, and try to do with our product, is help people make better choices in terms of which instance type would be best for your type of workload: more compute, more memory, what type of disks? As you said, the disks don't all have the same throughput, and throughput to the disks matters a lot in Spark, particularly when you do shuffle, where you need to write the data to disk before it gets sent to other nodes in the cluster. But you're right, in general, we do optimize for object storage performance, and so we have connectors and a set of configurations to get the best performance possible to S3 and so on. And, in general, I do like the cloud world, where you don't need to think too much about HDFS, you know, about coupling compute and storage together. You can separate the two pretty cleanly, and on average, you get really good performance on S3 and object storage in general.
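As one concrete illustration of the "set of configurations" for object storage mentioned above, these are the kinds of S3A connector settings that commonly get tuned. The key names are from the Hadoop S3A connector documentation; the values are arbitrary examples, not Data Mechanics' actual settings.

```python
# Example S3A tuning knobs, passed to Spark via the spark.hadoop.* prefix.
# Key names come from the Hadoop S3A connector docs; values are examples.
s3a_conf = {
    "spark.hadoop.fs.s3a.connection.maximum": "100",  # parallel connections
    "spark.hadoop.fs.s3a.threads.max": "64",          # upload thread pool
    "spark.hadoop.fs.s3a.fast.upload": "true",        # stream uploads in blocks
}

for key, value in sorted(s3a_conf.items()):
    print(f"{key}={value}")
```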
I mean, I don't have a lot of expertise, you know, working on premise. What I do know is that probably one of the biggest differences is the fact that, obviously, on premise you have a fixed-size cluster, and after a while you have to manage contention, you have to do priorities, queues, and so on, while in the cloud you just get more machines because you don't have a maximum size. That's one less complexity that you need to manage, but the drawback is obviously that anyone in your company can potentially decide to create a very large cluster, provision many machines, and then, like, waste a lot of money. And since, you know, when you don't really know how to troubleshoot your Spark applications, the most simple thing you can try out is, like, oh, should I use more memory, or should I use a bigger cluster? Well, people tend to overprovision their environments
[00:14:01] Unknown:
by a lot and waste a lot of money. So that's a big problem for people who migrate to the cloud. In terms of the customizations that you've had to make to Spark itself, is there any special work that you had to do to make a certain distribution, or, you know, any specific modifications to the code of Spark for being able to run it efficiently in the Data Mechanics platform, or any optimizations that you've made to maybe reduce the container size or make it easier to incorporate upstream or downstream dependencies, or just some of the overall interaction that you've had with Spark itself for building this platform on top of it?
[00:14:41] Unknown:
So we don't maintain a fork of Spark; we use open source Spark. When we see a bug, we make a PR and commit the fix to open source Spark. However, we do maintain a fleet of optimized Spark Docker images that our customers use. They contain the Spark distribution itself, but also Java, Python, Hadoop, and a lot of connectors to popular data sources: S3, GCS, Azure Data Lake, Snowflake, Delta Lake. So, you know, you can get started and not run into the problem where, oh, I don't have the right connector, and then you install it, and, oh, this connector conflicts with the version of Scala that I use, and, you know, the dependency nightmare that people sometimes run into. These images were a lot of work to build, test, and keep maintaining. We made them available to our customers, and now we've also published them on our Docker Hub for anyone to use. So even if you're not a Data Mechanics customer, but you just want a good Spark Docker image to read from S3 and Snowflake, you can grab it there. A second area of optimization we did in terms of Spark performance is not within Spark itself, but more about tuning all these knobs that we talked about: the amount of memory, the type of instance, the number of partitions, and so on. In our platform, we have what we call an auto-tuning optimization algorithm: whenever you run a pipeline on a schedule, we're going to look at the logs of the runs from yesterday and the day before.
And based on these logs, we're going to say, oh, we have a memory problem, we need more memory. Oh, we're overprovisioning the cluster, we should use fewer executors. Oh, the number of partitions is too small, and the parallelism isn't great because of that. And so we have this algorithm that's not really within Spark but is more one abstraction over Spark, which will tune these configurations for you and reap a lot of the low-hanging fruit of the mistakes that you could have made otherwise.
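A toy version of that rule-based tuning loop might look like the sketch below. All metric names, thresholds, and rules are invented for illustration; Data Mechanics' actual algorithm is not public.

```python
# Toy rule-based auto-tuner: inspect metrics from recent scheduled runs
# and nudge the configuration. Field names and thresholds are made up.

def recommend(metrics: dict, conf: dict) -> dict:
    new_conf = dict(conf)
    if metrics["peak_memory_fraction"] > 0.9:       # memory pressure
        new_conf["executor_memory_gb"] = conf["executor_memory_gb"] * 2
    if metrics["avg_cpu_utilization"] < 0.3:        # over-provisioned cluster
        new_conf["executor_instances"] = max(1, conf["executor_instances"] // 2)
    if metrics["avg_task_seconds"] < 1.0:           # tasks too small
        new_conf["shuffle_partitions"] = max(20, conf["shuffle_partitions"] // 2)
    return new_conf

conf = {"executor_memory_gb": 8, "executor_instances": 10, "shuffle_partitions": 200}
metrics = {"peak_memory_fraction": 0.95, "avg_cpu_utilization": 0.2,
           "avg_task_seconds": 0.5}
print(recommend(metrics, conf))
# → {'executor_memory_gb': 16, 'executor_instances': 5, 'shuffle_partitions': 100}
```

Each rule only fires when its metric crosses a threshold, so a healthy run leaves the configuration untouched; the real system presumably also damps oscillation between runs, which this sketch ignores.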
[00:16:39] Unknown:
And as far as things like that algorithm or being able to auto tune the cluster, you know, what are some of the other supporting tools that you've had to build to be able to work with Spark and manage it, particularly in customers' environments where you don't necessarily have full control over the types of hardware that they're using or the types of workloads that they're trying to run alongside it?
[00:17:01] Unknown:
The main tool that I want to talk about is Delight. We just publicly launched it. It's a free, cross-platform monitoring and observability tool, and it's a great complement to the Spark UI. With the Spark UI, it's very hard to understand why your Spark application is slow or why it's failing. It shows a lot of information, a lot of big tables, but it's hard to filter out the noise and find the right information. It also doesn't show you system metrics like CPU, memory, disk usage, and so on. With Delight, we give you a very simple graph where you can see the breakdown of your CPU usage over time, so it says, oh, I have a first phase of my Spark application where I do a lot of compute, and then I do a lot of I/O, and then I do a lot of shuffle. And on the same graph, just underneath, you can see on the same timeline the list of your Spark jobs and stages. And so you can correlate: oh, this first part of my app is CPU intensive, and that's because that's when I'm, you know, training this machine learning model. And then this other part is I/O intensive because that's when I actually save it, dump it somewhere.
So with this tool, we're really trying to help people troubleshoot the performance of their app. We also have memory metrics to let them know, oh, is it safe to use a smaller node that has less memory? Because right now people are in the dark, and you just do trial and error, but it's a slow iteration cycle. So I think that's the main thing. And it's also a great complement to our platform value prop, because with auto-tuning, you know, we improve the performance, but we're not super smart; we apply some rules. Right? But here it's more about giving intelligence back to the user, who knows their code and can use this information to understand, oh, my data is not partitioned the right way. So we will be trying to add as much intelligence as possible in this UI, which is available for free, not just for our customers but for anyone who uses Spark.
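The kind of timeline breakdown described above can be sketched as a simple aggregation of per-task time into phases. The phase labels and the numbers are hypothetical, chosen to mirror the compute/I/O/shuffle example from the conversation.

```python
# Aggregate task time into a breakdown like the one Delight draws:
# what fraction of the application's time went to each kind of work.

def time_breakdown(tasks):
    total = sum(t["seconds"] for t in tasks)
    buckets = {}
    for t in tasks:
        buckets[t["phase"]] = buckets.get(t["phase"], 0) + t["seconds"]
    return {phase: secs / total for phase, secs in buckets.items()}

tasks = [
    {"phase": "compute", "seconds": 60},  # e.g. training the ML model
    {"phase": "io", "seconds": 30},       # e.g. dumping the result somewhere
    {"phase": "shuffle", "seconds": 10},
]
print(time_breakdown(tasks))  # → {'compute': 0.6, 'io': 0.3, 'shuffle': 0.1}
```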
[00:19:05] Unknown:
In terms of the rest of the platform, can you dig a bit deeper into what it is that you're providing to your customers when they say, I'm going to use data mechanics to run my spark cluster? What is the actual process of getting it set up and integrated into their environment? And what are the sort of moving pieces under the hood that you're responsible for that you're managing?
[00:19:25] Unknown:
So we're deployed on a Kubernetes cluster inside our customer's cloud account, and we create and manage this cluster for them. So the customer doesn't need to have any expertise with Kubernetes. What they do is give us an IAM role, a set of permissions on the cloud account where we're going to deploy. If they use Amazon, we're going to create and manage an EKS cluster for them. If they use GCP, it's GKE. If they use Azure, it's AKS. The Kubernetes cluster is long-running, and there is one service that's there all the time. It's called the gateway, and it's basically the entry point to launch Spark applications, through the API or by connecting a Jupyter notebook. The gateway also serves a web dashboard, so for the customer, you know, the gateway is basically the Data Mechanics platform.
This being said, we also have centralized infrastructure, you know, in our own cloud account, where we do authentication, where we store some logs, and where we have a centralized database. And since you're asking how it's implemented, I can also explain the technologies at a really high level. All our back-end services are written in Python. On the front end, we use Redux and TypeScript. And in general, our infrastructure is managed with Kubernetes, Docker, and Terraform.
[00:20:41] Unknown:
In terms of the actual deployment, what are some of the complexities that you run into as far as being able to manage these clusters in your customer environments? And because of the fact that you're using Terraform for deploying them, how much access do you give to the end user for being able to, you know, go into the console and, you know, modify some of the settings of the cluster? How do you handle reconciling that with the state that Terraform thinks it has as you're building out these clusters and just some of the overall strategy that you have for being able to make the actual infrastructure provisioning composable and maintainable as you add more and more users and maybe introduce new capabilities?
[00:21:21] Unknown:
First, what are the biggest challenges with respect to deployments? I would say it's actually the ability to, you know, continuously upgrade our services, because many of our customers use the platform 24/7. In the early days, it wasn't like that. We could say, oh, we have a window where no one uses the platform, so it can be down, but we're not there anymore. So when we do a software update, we need to make sure all our services stay up and there is absolutely no downtime for the customer, and we do this by using, you know, a blue/green process. And there are lots of moving pieces, you know. Under the hood, we incorporate the Spark Operator, which is an open source project, and making it work with blue/green was really hectic. The Kubernetes cluster itself, you know, has a major version that we sometimes need to upgrade, and that has dependencies on a lot of other software, so we need to test all of this and keep it highly available.
Now, I think the other part of your question was, how much control does the end user have over the infrastructure? And you're right. Today, in our product, we don't yet support a way to, say, edit our master Terraform script, so it's a bit more custom. The good news is that these things don't come up on a daily basis. But, basically, when the customer makes a change, we ask them to let us know, and then we incorporate it in our Terraform. We have some plans to integrate this in our product, and, in fact, to even let the customer deploy the platform in a self-service way. I think it will still take, I don't know, at least a quarter to build, but once we're there, we wouldn't have this shared responsibility or gray area where maybe, you know, you need to communicate to make sure changes are propagated throughout the stack.
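The zero-downtime upgrade flow mentioned above follows the classic blue/green pattern: bring the new version up alongside the old one, check its health, and only then move traffic. This is a generic sketch of that decision, not Data Mechanics' actual deployment code.

```python
# Generic blue/green rollout decision: traffic only moves to the new
# ("green") deployment if it passes health checks; otherwise the old
# ("blue") deployment keeps serving, so there is never downtime.

def blue_green_switch(live: str, candidate: str, is_healthy) -> str:
    """Return the deployment that should receive traffic after a rollout."""
    if is_healthy(candidate):
        return candidate  # cut over; keep `live` around for fast rollback
    return live           # failed checks: stay on the current version

print(blue_green_switch("blue", "green", lambda name: True))   # → green
print(blue_green_switch("blue", "green", lambda name: False))  # → blue
```

The subtlety the episode hints at is that stateful pieces, like the Spark Operator watching in-flight applications, don't fit this stateless switch cleanly, which is why making it "work with blue/green was really hectic."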
[00:23:10] Unknown:
Patrick is a diligent data engineer, probably the best on his team. Yesterday, when trying to optimize the performance of a query running over 20,000,000,000 rows, he was so eager to succeed that he read the entire database documentation. He changed the syntax. He changed the schema. He gave it his everything and reduced the response time from 20 minutes down to 5. Today is not a good day. Sarah from business intelligence says 5 minutes is way too long. John, the CFO, is constantly Slacking every living being trying to figure out what caused the business intelligence expenses to grow so high yesterday.
Want to become the liberator of data? Firebolt's cloud data warehouse can run complex queries over terabytes and petabytes of data in sub-seconds with minimum resources. No more waiting, no more huge expenses, and your hard work finally pays off. Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt today to get started, and the first 25 visitors will receive a free Firebolt t-shirt. As you have gone through the process of building out the platform and onboarding customers, you've gone through this growth phase: from when you were able to say, okay, I've got this window of time, I could just shut everything down and rebuild it, to now everything's 24/7, and we have to, you know, manage blue/green deployments and figure out how to maintain four or five nines of uptime and maintain our SLAs. How have the overall goals and design of the actual Data Mechanics platform shifted or evolved since you first began working on it? Yes. So the overall, really high-level goal of the platform, the mission, has stayed the same,
[00:24:47] Unknown:
and the high-level design, the fact that we have a data plane, you know, in the customer's account, and a control plane, you know, centralized in our cloud account, this has stayed the same. But still, you know, we are constantly learning and taking feedback. One major decision that required significant investment and some architectural change was the decision to open up Delight to the world. You know, Delight, this monitoring tool, initially we thought we were just going to build it for our customers. But then we realized, oh, you know what? The way it's implemented, anyone using Spark could just attach the Delight agent as a JAR, and then we could process the metrics the same way we do for our customers and provide value to anyone using Spark. And, obviously, there is some marketing value to that. If everyone starts using Delight, as I hope they will, then they'll know of us just like they know of the other Spark platforms.
But, yeah, this required re-architecting the platform a lot: making some decisions about which parts of the code would be shared between our real customers and the free Delight users, and which parts of the infrastructure would be shared in terms of databases, storage, and so on. And it's an investment that took us, I would say, 6 months in total, including 3 months that were really infrastructure changes.
[00:26:02] Unknown:
In terms of the actual impact of having the Data Mechanics platform up and running, and being able to containerize the deployment and have this container that you can run locally for iterating on a job, how does that change the ways that people think about how and where to use Spark? How does it change the calculus of when it makes sense to go with Spark versus just doing a one-off Python script or something? And what is the overall impact that it has on the ability of a data team to go from idea to production?
[00:26:35] Unknown:
So in terms of how it changes the way you use Spark, I think that's one of the great benefits of containerization and Kubernetes: people start developing with Spark more like they would develop traditional software engineering code. It's a lot easier to develop locally, run your application locally on a Docker image that you control, and maybe develop from an IDE; that's something we see a lot more. In terms of where to use Spark, it's the same use cases, I mean, Spark itself hasn't really changed, but there is a lower barrier to entry. Maybe you already use Kubernetes in some part of your stack, so why not just try the Spark Operator and get started? It's not, unfortunately, that easy yet. I think if you want to get started with Spark on Kubernetes, fully open source, there is still a lot to build right now, but it is definitely a much lower barrier to entry than having to become a Hadoop/YARN expert. And also, yeah, you can mix and match Spark and non-Spark apps on the same cluster and manage them with the same tooling. So really, the barrier to entry is much lower.
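To make the "same image locally and on the cluster" workflow concrete, here is a hedged sketch of what an open-source Spark-on-Kubernetes submission looks like. The `--conf` keys are standard Spark-on-Kubernetes settings; the API-server URL, namespace, and image name are made-up placeholders, not Data Mechanics specifics.

```python
# Sketch: assembling a spark-submit command targeting a Kubernetes cluster.
# The master URL, namespace, and image below are hypothetical placeholders.
def k8s_spark_submit(app: str,
                     image: str = "my-registry/my-spark-app:latest",
                     namespace: str = "spark-jobs",
                     executors: int = 4) -> list[str]:
    return [
        "spark-submit",
        "--master", "k8s://https://my-cluster.example.com:6443",
        "--deploy-mode", "cluster",
        "--conf", f"spark.kubernetes.namespace={namespace}",
        # The same Docker image is used locally for development and here
        # for the driver and executors on the cluster.
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.executor.instances={executors}",
        app,
    ]
```

The point of the pattern is that the image passed to `spark.kubernetes.container.image` is the one you iterated on locally with `docker run`, so "it works on my machine" and "it works on the cluster" converge.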
[00:27:45] Unknown:
In your tagline, you also mention that Data Mechanics is a Spark platform built for data engineers. I'm curious how that factors into the ways that you think about what features to build and how to present them, and what the customer profile is that you're working from as you iterate on your product.
[00:28:02] Unknown:
We focus right now on data engineers. Why? Because right now, our product is a really amazing serverless Spark back end with a really nice observability layer, but it's not yet a development environment. If we were trying to sell to, for example, data scientists, they would say, oh, but we want, I don't know, hosted collaborative notebooks, and we want to manage our machine learning life cycle, and so on; a very broad product. And, also, our vision with Spark on Kubernetes was that we could tune the configurations and make pipelines more performant, and that has a lot more value to people who run very large big data ETL pipelines or streaming. So that's why we focus on data engineers.
It forces us to focus on some really hard problems, problems like how to write an optimization algorithm that works whether you're running a moderately sized pipeline, say 100 gigabytes, or a pipeline that processes 10 or 100 terabytes, where you get into the problems of scale. It's a slightly different product. You know, we're still a startup. The only way we can compete with the very big players in the market is by focusing on a subsegment of the market and providing a 10x better developer experience. So that's what we're trying to do for data engineers. I think as we mature,
[00:29:24] Unknown:
we also want to be a great platform for other profiles. And even today, we have data scientists who use our platform and like it a lot. But I would say our focus is on data engineering. Yeah. As you mentioned, one of the things that, as you expand to more members of the data team, they'll start asking for is integrating with a notebook environment or being able to handle the full machine learning life cycle. And I know that there are a number of tools, both proprietary and open source, that offer those capabilities. So for people who want to start with Data Mechanics and then integrate with, say, the JupyterHub they already have set up for their team, what is involved in actually tying those things together? And is there any sort of barrier they might run into trying to get it all running in the same Kubernetes cluster or managing those connections, particularly as the product iterates and evolves?
[00:30:13] Unknown:
Yeah. We've invested a lot in notebooks and in making it easy to connect a Jupyter notebook, either one you run locally or one you run through a JupyterHub. This works out of the box and is very easy to set up: you basically just give Jupyter the URL of the Data Mechanics gateway, and under the hood, to provide this mechanism, we use an open source project called Jupyter Enterprise Gateway. So, actually, we did invest in making the developer workflow of data scientists, or any notebook users, better. It's not only data scientists who use notebooks. I mean, I use notebooks; data engineers use notebooks when they need to explore data quickly, when they need to produce a report, or when they want to develop interactively for larger-scale pipelines. And once they're ready to go to production, they definitely move their code to their IDE and then submit it as a batch application; it's Dockerized.
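Concretely, pointing a local Jupyter at a remote gateway is standard Jupyter Enterprise Gateway client configuration; here is a hedged sketch, with a placeholder URL and token rather than a real Data Mechanics endpoint.

```python
# jupyter_notebook_config.py (or pass --gateway-url on the command line).
# The URL and token below are hypothetical placeholders for your gateway.
c.GatewayClient.url = "https://spark-gateway.example.com"
c.GatewayClient.auth_token = "YOUR_TOKEN"  # if the gateway requires one
```

With this in place, kernels started from the local Jupyter UI actually run remotely, next to the Spark cluster, which is what makes the "just give Jupyter the URL" experience possible.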
But, yeah, we support notebooks. However, for the machine learning model life cycle, for example, we don't have any solution out of the box. Still, we're building a product on top of many open source pieces, so very often it's actually kind of easy to integrate, and during our customers' POCs we can help them integrate. But, yeah, we had to make some product choices, and that means focusing on the stuff that our customers don't want to manage themselves, which is the Spark infrastructure, and not building a full-fledged data science product, because there are some great products that already do that.
[00:31:41] Unknown:
And as you've been building out the platform, working with your customers, and helping them to understand the capabilities of Spark, how it relates to running inside of Kubernetes, and how it impacts their overall development cycle, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:32:00] Unknown:
I already ran into some of this at Databricks, but in terms of really challenging technical lessons: some of our biggest customers run into the limits of the instances that Amazon or GCP can provide, or the scalability limits of how big a Kubernetes cluster can get, or how many Spark applications you can submit at the same time without overloading the Kubernetes API, and so on. And this is really an area where we are learning many things as we scale our services. If we want to talk more about a lesson learned in terms of product management: initially, when we started this project, I mean, Spark on Kubernetes was kind of hyped, but we didn't know if it was going to be a success or not. Now I'm certain it's going to be a success, and all the big platforms are going to adopt it.
It's still unexpected to see how quickly some people are able to get started who are not comfortable at all with Kubernetes, who don't necessarily know Docker very well, once they get a little bit of hand-holding and a starter project. So, yeah, that's, I guess, a good learning: things can change for the better, and they're changing fast. Yeah. Definitely very fast.
[00:33:17] Unknown:
And for people who are considering trying to build a Spark job or integrate it into their data workflow and might be thinking about using Data Mechanics, what are the cases where either Spark or Data Mechanics might be the wrong choice, and they'd be better suited either using a different processing engine or a different distribution or deployment method?
[00:33:39] Unknown:
So Spark is great as a programming framework that lets you write data pipelines that process very large volumes of data. If your datasets are in the 10-gigabyte range, or, I don't know, 20, 30, 50, anything less than 100 gigabytes, maybe you can skip the complexity of a distributed computing framework and just scale your Python-based applications. Another case that is maybe not great for Spark is if you have a requirement for very, very low latency. Spark streaming can get you some low latency, and if you want to use Spark as a data warehouse tool and the data is partitioned in the right way, you can get answers from your data in a few seconds. But in general it hasn't been designed to provide very low latency to serve data in an application where you want a modern web response time. So these are some caveats with respect to Spark.
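As an illustration of the "just scale your Python applications" alternative for sub-100-gigabyte data, here is a minimal stdlib-only sketch that aggregates a CSV on a single machine, streaming row by row so the file never has to fit in memory. The file layout and column names are made up for the example.

```python
# Sketch: single-machine aggregation with no cluster required. Streams the
# file, so memory use stays flat regardless of file size. The "region" and
# "revenue" columns are hypothetical.
import csv
from collections import defaultdict

def total_revenue_by_region(path: str) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["region"]] += float(row["revenue"])
    return dict(totals)
```

For a one-off job like this, the whole "cluster, scheduler, shuffle" machinery Spark brings along is overhead rather than help; that is the trade-off being described here.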
Now, for Data Mechanics, if you know you need Spark, when is it the wrong choice? I would say, if you have a mostly mathematical or statistical background, but you're not very comfortable calling an API or building a Docker image, and you don't want to get involved in any of that, but would rather have a mostly UI-based platform where you can develop everything from the UI, maybe trigger your applications and schedule them with drag and drop and so on, then that's not what our product provides. But you don't need to be a Kubernetes expert at all. You just need to be willing to get to know Docker and call an API, and you'll see it is simpler than you might think.
[00:35:23] Unknown:
And as you continue to iterate on the Data Mechanics platform, what do you have planned for the near to medium term, whether in terms of upgraded capabilities, improved features, or new products that you're planning to release?
[00:35:36] Unknown:
If we're talking about this quarter, the main thing we're investing in right now is improving the development workflow. Today we give our new users starter projects where they get starter code and scripts to call our APIs, and now we want to give them a CLI, which makes things even simpler: with a single command you can build the Docker image, push it, and run it. We also want to make it simple to create a shell and work with Spark interactively even if you're not in a notebook, just a simple PySpark shell. Then we also want to keep investing on the automated tuning side, where we want to automatically remediate out-of-memory errors: maybe in the middle of the night your application crashed with a memory issue; well, it's going to be retried by your scheduler, and on the second attempt, automatically, it has more memory, and it passes.
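The auto-remediation idea described above can be sketched as a simple policy: if the last attempt failed with an out-of-memory error, the next submission gets a larger executor memory setting, up to some cap. This is an editor's illustration of the concept, not Data Mechanics' actual algorithm, and all names are hypothetical.

```python
# Hedged sketch of OOM auto-remediation: bump executor memory on retry.
# The multiplier and cap are arbitrary illustrative choices.
def next_executor_memory_gb(previous_gb: float, failed_with_oom: bool,
                            factor: float = 1.5, cap_gb: float = 64.0) -> float:
    """Return the executor memory to request on the next attempt."""
    if not failed_with_oom:
        return previous_gb  # unrelated failure: don't change resources
    return min(previous_gb * factor, cap_gb)
```

The interesting part of the real feature is that this decision happens inside the platform, so an ordinary scheduler retry (Airflow, cron, etc.) picks up the new setting without any change on the user's side.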
That's for the short term. If we're talking more medium term, we probably want to explore adding some scheduling capability to the platform. Right now we have connectors to Airflow, Azure Data Factory, Amazon Step Functions, Azure Logic Apps, many different schedulers, but some customers don't have any scheduler. And we think that in the long term, it's a strategic area to invest in, because scheduling gives you a high-level view of all your pipelines, and once you have this information you can make a lot of powerful optimizations, such as: instead of telling me when to run the pipeline, tell me by what time it should be finished, and I, the platform, will decide when is the best time to run it based on overall capacity, or maybe spot prices and spot availability, and so on. So this is something that we want to invest in more in the medium term.
[00:37:22] Unknown:
Are there any other aspects of the work that you're doing at Data Mechanics or the overall Spark ecosystem that we didn't discuss yet that you'd like to cover before we close out the show? Honestly, I think we covered a lot.
[00:37:34] Unknown:
I think just my overall message: if you're a data engineer and you know a little bit of Spark, well, first, try out our Delight project. We hope you will be delighted, and it will give you insights into the performance of your Spark applications. And if you're curious to learn more about Spark on Kubernetes, we wrote some blog posts about the pros and cons of Spark on Kubernetes, and I'm happy to talk more, give you a demo, learn more about your use case, and whether it can be useful to you or not. And, also, we are hiring, so check out our website for up-to-date info on that.
[00:38:11] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:38:25] Unknown:
I've heard of some projects that want to solve this, but I don't think they're mainstream yet: I think people have a hard time following changes to their data; basically, having version control for your datasets. For example, with some of our customers, we migrate them from another platform, maybe we make some changes to their code, and then they say, oh, well, I want to make sure the application is correct and the data is all the same, and that's kind of manual today. So, yes, having some kind of version control to make sure your data is still correct, that would be an interesting problem to solve.
[00:39:00] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Data Mechanics. It's definitely a very interesting product and one that is filling a much-needed area of the market. So I appreciate the time and energy you've put into it, and I hope you enjoy the rest of your day.
[00:39:13] Unknown:
Yeah. Thanks so much for having me on the show, Tobias. I hope I had some interesting stories to share, and, yeah, best wishes to you, to this great podcast, and your future next steps. Thank you.
[00:39:29] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Jean Yves' Background and Journey into Data Management
Overview of Data Mechanics and Its Mission
Challenges and Solutions in Spark Infrastructure Management
Operational Challenges of Running Spark on Kubernetes
Handling Cloud Environment Challenges
Customizations and Optimizations for Spark
Supporting Tools and Delight Monitoring Tool
Customer Integration and Platform Management
Platform Evolution and Delight Tool Launch
Impact of Containerization on Spark Usage
Focusing on Data Engineers and Future Expansion
Lessons Learned and Product Management
When to Use Spark and Data Mechanics
Future Plans for Data Mechanics
Final Thoughts and Closing Remarks