Summary
One of the most critical aspects of a software project is managing its data. Managing the operational concerns for your database can be complex and expensive, especially if you need to scale to large volumes of data, high traffic, or geographically distributed usage. PlanetScale is a serverless option for your MySQL workloads that lets you focus on your applications without having to worry about managing the database or fighting with differences between development and production. In this episode Nick van Wiggeren explains how the PlanetScale platform is implemented, their strategies for balancing maintenance and improvement of the underlying Vitess project with their business goals, and how you can start using it today to free up the time you spend on database administration.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
- Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.
- Your host is Tobias Macey and today I’m interviewing Nick van Wiggeren about PlanetScale, a serverless and globally distributed MySQL database as a service
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what PlanetScale is and the story behind it?
- What are the core problems that you are solving with the PlanetScale platform?
- How might an engineering team address those challenges in the absence of PlanetScale/Vitess?
- Can you describe how PlanetScale is implemented?
- What are some of the add-ons that you have had to build on top of Vitess to make PlanetScale possible?
- What are the impacts that a serverless database has on the way teams approach their application/platform design and development?
- What metrics are exposed to help users optimize their usage?
- What is your policy/philosophy for determining what capabilities to include in Vitess and what belongs in the PlanetScale platform?
- What are the most interesting, innovative, or unexpected ways that you have seen PlanetScale/Vitess used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on PlanetScale?
- When is PlanetScale the wrong choice?
- What do you have planned for the future of PlanetScale?
Contact Info
- @nickvanwig on Twitter
- nickvanw on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- PlanetScale
- Vitess
- CNCF == Cloud Native Computing Foundation
- Hadoop
- OLTP == Online Transactional Processing
- Galera
- Yugabyte DB
- CitusDB
- MariaDB SkySQL
- CockroachDB
- NewSQL
- AWS PrivateLink
- PlanetScale Connect
- Segment
- BigQuery
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Legacy CDPs charge you a premium to keep your data in a black box. RudderStack builds your CDP on top of your data warehouse, giving you a more secure and cost effective solution. Plus, it gives you more technical controls so you can fully unlock the power of your customer data. Visit rudderstack.com/legacy to take control of your customer data today. Your host is Tobias Macey, and today I'm interviewing Nick van Wiggeren about PlanetScale, a serverless and globally distributed MySQL database as a service. So, Nick, can you start by introducing yourself?
[00:01:21] Unknown:
Yeah. So my name is Nick van Wiggeren. I'm the VP of Engineering at PlanetScale. I've been with the company now for about 2 years, so about half of the total lifetime of the company, and I'm really happy to be here today. And do you remember how you first got started working in data? Yeah. You know, it was kind of an accident. I really wanted to be a back end engineer. That was what I always wanted to do, and it turns out that you can't really have a back end without a database. And so throughout my career, I got steadily, steadily more and more involved with kind of databases and how they work inside businesses and just kind of, you know, scaling businesses and things like that. And, ultimately, I ended up here at PlanetScale working for a database as a service company. Never really thought I would do that, but I've been doing infrastructure as a service for quite some time.
[00:02:06] Unknown:
In terms of the PlanetScale project, I'm wondering if you can give a bit of an overview about what it is that you're building there and some of the story behind the company and how and why you decided to get involved with it. Yeah. Great question. So to talk about the history of PlanetScale
[00:02:21] Unknown:
is to talk about the history of Vitess, which is an open source project that PlanetScale is built on top of. Vitess was created at YouTube a long, long time ago, before they were acquired by Google, as a kind of set of tools and a project to help scale MySQL. So you can imagine a decade ago, a lot of these NewSQL kind of data management solutions didn't exist. You know, it was created there to kind of just make MySQL work for summing and counting and displaying all the things that YouTube needed. Over time, YouTube gets acquired by Google. Google no longer needs Vitess, so they donate it to the CNCF.
PlanetScale gets founded about 4 years ago by the kind of creators of the Vitess project as a way to continue to evangelize, sell, and build a business based on the technology advancements that Vitess made. So PlanetScale was really born out of the desire to kind of get Vitess out there and build a product with Vitess. That's where it came from. We don't own Vitess. PlanetScale is not the owner of Vitess, but kind of the steward in many ways, and it's what we've built all of our business on top of.
[00:03:37] Unknown:
As far as the Vitess and PlanetScale projects, what are some of the core problems that you're aiming to solve with them?
[00:03:44] Unknown:
Yeah. So, fundamentally, I think the biggest challenge that Vitess aims to solve, and therefore PlanetScale aims to solve, is how difficult it is to distribute your data across multiple servers and have things be performant and easy to operate. So, you know, for a very long time, I think data sizes, datasets, and things like that were kind of content to live on a single server. They weren't so big, they weren't so unwieldy, and that was fine. You started to see some analytical use cases and things like that, you know, with Hadoop and kind of all of that explode that space a little bit and expand to, you know, Hadoop clusters and multi server clusters.
What Vitess and PlanetScale are trying to do is bring some of those same types of innovations, horizontal scalability, horizontal sharding, but to OLTP workloads. So high transaction throughput workloads where you're powering kind of core business workflows, but also then be able to scale them across multiple servers while not having to rewrite your application to do so. So I'd say our biggest kind of strength and the biggest reason that we're in business is really to help companies who are struggling to scale their databases kind of achieve the next level, whether that's surviving a Black Friday, whether that's doubling the size of their company and adding on millions of more users. That's really our strength and really where we see ourselves in the market.
[00:05:07] Unknown:
In the absence of something like a PlanetScale or Vitess, I'm curious what you have seen as some of the strategies and tactical implementations that engineering teams have fallen back on to be able to handle some of these questions of load and scalability, both in terms of operational and data volumes?
[00:05:28] Unknown:
That's a great question. So I've talked to so many customers and prospects, even people who have turned PlanetScale down. And oftentimes, what they're doing is building pieces or components of Vitess themselves, kind of homegrown, a homegrown sharded system, for example, where instead of using Vitess, your application knows there are 32 MySQL databases, and this customer lives on that MySQL database, and this customer lives on that MySQL database, or, you know, this customer lives on this data store. So companies are kind of organically creating the same model where once you start to need to split, you know, you kind of have to split. So now you've got 16 clusters, and each of those are managed by hand. You know, I was just talking with a customer last week who was saying that they were doing this for years, and they had no automation. They had no kind of process built around this. They would just stand them up when they needed. They'd enter the details into the code, kind of check it in, and then let it ride. And so a lot of times, what we're seeing is either companies are saying, look, like, we're outgrowing this. We can't handle the operational toil.
We wanna switch to something that can do it for us, or we just wanna get out of the business of this and have our engineers go work on something important, not recreating more and more components of a good horizontally sharded system internally.
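To make the homegrown approach above concrete, here is a minimal sketch of application-level shard routing, assuming an illustrative 32-shard layout, a CRC32 hash, and made-up host names; it is not any particular customer's implementation:

```python
# A minimal sketch of the homegrown routing described above: the application
# itself knows there are 32 MySQL databases and hashes each customer onto one.
# The host names, shard count, and hash choice are all illustrative.
import zlib

import pymysql  # pip install pymysql

SHARD_COUNT = 32
SHARD_HOSTS = [f"mysql-shard-{i:02d}.internal" for i in range(SHARD_COUNT)]

def shard_for(customer_id: str) -> int:
    # Stable hash so a given customer always lands on the same shard.
    return zlib.crc32(customer_id.encode()) % SHARD_COUNT

def connect_for(customer_id: str):
    host = SHARD_HOSTS[shard_for(customer_id)]
    return pymysql.connect(host=host, user="app", password="...", database="app")

# Every query now needs the shard key in hand, and any cross-customer query
# means fanning out to all 32 servers by hand -- the toil Vitess automates.
conn = connect_for("customer-1234")
```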
[00:06:46] Unknown:
Another interesting aspect of what you're building is taking a look at what the competitive landscape looks like, where in terms of just MySQL itself, there are things like Galera clusters for being able to handle some of the sharding and scaling. There's the work that the folks at MariaDB are doing with their SkySQL and being able to have a managed and, in some cases, serverless implementation. And then more broadly, just in the transactional database market, there are projects like CockroachDB, there's CitusDB, there is YugabyteDB.
So there are a lot of different approaches and attempts to be able to solve this problem of, I wanna be able to run my application anywhere in the world. I want that data to be available and performant, and I don't wanna have to think about it as an application engineer. And I'm wondering if you can give some nuance as to some of the different trade offs and optimizations that you and others are making and some of the ways that an application engineer who doesn't care about the database or doesn't wanna have to care about the database can say, okay. This is the project that's actually going to solve my problem that I have versus I actually wanna go with 1 of these other places.
[00:07:53] Unknown:
That's a great question. So just like you said, in the last 10 years or so, there's been a lot of really impressive and really excellent technology being rolled out, products being rolled out around solving exactly this problem. And I think you can bucket them into a couple of different kind of frameworks. Some, and I think that the kind of newest and maybe shiniest are companies that have rewritten their query engine and database from scratch. Right? So they may be MySQL compatible or Postgres compatible, but there's no Postgres in their code base. There's no MySQL in their code base. And I think this is really cool and really impressive. So they've basically built a database from scratch, and that means that they can do things that you could never dream of doing with a MySQL or never dream of doing with a Postgres because they've got brand new semantics, they've got brand new capabilities, but your application can still speak to them in kind of a known query language. So that's 1 bucket that I think is really great, and we're seeing these get more and more mature, which I think is the most important part.
You know, MySQL and Postgres have been around long enough to be in their thirties. They've got bad knees and bad backs. A lot of the growing pains that some of these newer databases have gone through, you know, MySQL and Postgres already have behind them. That's not to say that they're better or worse in any way. I think all of the competition and all of these products are fantastic. Over on the PlanetScale side, I think what we're actually trying to do is build on top of MySQL. So if you use a PlanetScale database, underneath the hood your query or queries are eventually getting to an actual MySQL.
And that has its own advantages and disadvantages as well. Right? You have to be able to be knowledgeable about MySQL. You have to be able to orchestrate MySQL. You have to be able to kind of really know what you're doing because, you know, MySQL again, it's been around for so long. There's a lot of knobs. But you have the advantage now of you know how it works. Your local development environment is straightforward. It's MySQL. If you have an application already running on MySQL, you can lift and shift it over knowing that at the end of the day, compatibility is gonna be less of an issue. And so to me, the question really is picking what fits best with your technology, what fits best with your community, and what fits best with your kind of expectations.
Do you need massive petabyte scale? Do you need incredibly low latency? Take a look at how all of these technologies are the same, but also subtly different, and map them to the right thing for the right business case. Where PlanetScale really excels, for example, is scaling out transactional throughput. We've got customers doing thousands of queries per second on a single kind of logical PlanetScale database, which might just be unachievable on certain other NewSQL databases. But those other NewSQL databases might have data distribution algorithms or things like that that, you know, PlanetScale can't quite achieve. So with all the marketing that goes on around these, I think the most important part is to be able to dig 1 layer deeper and figure out what matches what I need for my business.
[00:10:45] Unknown:
Another interesting aspect of that equation is the question of what does the local development experience look like where, for instance, with some of these serverless database engines, it can scale to infinity. You can, you know, pay pennies depending on what you're actually using. But then when you say, okay, well, I actually need to develop against my local machine, and then it needs to deploy to production, there are those potential inconsistencies between what you're seeing locally and what you're seeing in production, which adds some overhead in terms of debugging and validation and testing. How much of a factor does that play in the decision making that you see some of your customers considering as they're evaluating PlanetScale versus 1 of these other options?
[00:11:29] Unknown:
That's a great question, and I think it absolutely does play a big part. A lot of the reasons that people are adopting a new database, 1 of them is certainly scale, 1 of them is certainly uptime and availability, and that's probably paramount. Right? You wouldn't adopt a database that didn't stay running and stay up. But maybe number 2 is how productive can we be using that database? Companies now are buying products because they wanna keep their developers productive. They wanna keep their velocity high. They wanna be able to keep executing, and they don't want the database to get in their way. So they're evaluating how it fits in with their CICD workflows. They're evaluating how their developers know how to use it and how to operate it. Right? Are they gonna be stuck relearning an entirely new kind of set of concepts?
And they're looking at the local development experience. Can we ship safely and securely using this technology? Is a huge consideration. So for PlanetScale, we build on PlanetScale internally, and our website runs on PlanetScale. We actually just use vanilla MySQL, and the compatibility layer is is so good that they're kind of interoperable. So anywhere you run MySQL, in our CI, in our local laptops, we just test against MySQL, and we trust that when it goes to production, it'll run on PlanetScale as well. But that's not true for everything, and even that's not true for PlanetScale at times. Right? Our JavaScript serverless API, for example, is not available on local development. You can only use that on PlanetScale in production.
So we've worked to provide the ability to do dev branches, which are cheap, disposable, kind of planet scale clusters that are isolated and things like that. So we've also been working to give people the cloud dev environments they want, but everyone and every company has a slightly different workflow. So it makes it very difficult at times to be able to get everyone, And yet at the same time, everyone wants to make sure that their workflow fits in with their technology choices.
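To ground the parity he describes, here is a rough sketch of environment-driven connection configuration, so the same code hits a local or CI MySQL by default and a PlanetScale endpoint in production. The variable names and defaults are illustrative assumptions:

```python
# Sketch of local/production parity: the app speaks plain MySQL either way,
# so only the connection details change between a local or CI MySQL and a
# managed endpoint in production. All names here are illustrative.
import os

import pymysql

def connect():
    # Defaults point at a local/CI MySQL; production injects the real
    # endpoint and generated credentials through the environment.
    return pymysql.connect(
        host=os.environ.get("DB_HOST", "127.0.0.1"),
        user=os.environ.get("DB_USER", "root"),
        password=os.environ.get("DB_PASSWORD", ""),
        database=os.environ.get("DB_NAME", "app_test"),
        ssl={"ca": os.environ["DB_SSL_CA"]} if "DB_SSL_CA" in os.environ else None,
    )
```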
[00:13:25] Unknown:
Keying off of what you're just talking about with the kind of development clusters and being able to branch the, you know, the schema to be able to test and iterate on various change sets, I'm wondering if you can talk to some of the capabilities beyond just an out of the box MySQL database that somebody using PlanetScale can build on top of, and some of the ways that that helps both in terms of their development velocity as well as their kind of operational stability?
[00:13:54] Unknown:
You know, I think what PlanetScale has tried to provide with dev branches and these isolated clusters is a path to success for developers and a best practice for developers that they can use to get their database changes out to production as easily and safely as GitHub and serverless and all of us have done for their applications. So we've spent so much time collectively as an industry helping developers become more productive, and so much of that energy has been funneled onto the application side. You've got Kubernetes. You've got, you know, blue-green deploys. You've got serverless and all of this, and it's made it so that you can get a piece of code running around the world with production stability that auto scales, in a single day, in a single hour, kind of, if you're familiar with the tools.
And what PlanetScale is really wanting to do with development branches and online schema changes and schema rewind, revert, and all of that is do the same thing for databases. So when you go and you have a database change that you wanna do, whether it's adding a table or altering a table, you can immediately create a branch, have an isolated developer experience where you can add data, delete data. You can mess it up if you want. You can start over. When you wanna go get that change into production, it's as easy as a couple of API calls on the command line or a couple of clicks on the web UI, and you can get that change added to your database in a way that's 0 downtime, industry best practice without needing to understand all of the machinations.
If that change goes bad, you can revert it just like you can revert code and get your database back to the prior state without any data loss. So it's all just a march towards making sure that tooling is good for people who work with databases so they themselves don't have to become experts with databases, and they can leave that up to us.
[00:15:46] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. DataFold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold. Digging more into PlanetScale itself, I'm wondering if you can talk through how it's implemented and some of the architecture and some of the engineering challenges and capabilities that you've had to build on above and beyond what's available in Vitess?
[00:16:56] Unknown:
PlanetScale uses Vitess really heavily, so that's kind of the core technology of what our databases are built on top of. And then around that, there's a whole host of other, I think, innovations or at least interesting technologies. So we're all in on Kubernetes, for example. I think a lot of people are still afraid to run their database workloads on Kubernetes, but at the same time, we're doing it. We have thousands of databases, tens of thousands of databases running on Kubernetes all of the time, and we love it. We've gone really in on cloud technologies to allow us to achieve great scale and flexibility without having to worry about kind of buying hardware, without having to worry about that part of the scale component. So the combination, I think, of Vitess, of Kubernetes, and of all of these cloud technologies that we're built on means that we can kind of innovate and move very quickly and know that underneath, the technology is flexible enough that we can move in whatever direction we need and fulfill customers with kind of whatever we need to do.
[00:17:57] Unknown:
As far as the kind of operational challenges of being able to run something like PlanetScale for users and being able to provide this serverless interface where they don't have to care about the underlying hardware, what are some of the, I guess, escape hatches or extra visibility that you've had to build in to be able to give customers the kind of debugability that they would expect from running their own databases while being able to manage some of the security and multi tenancy concerns that are inherent in being able to run a service like that?
[00:18:31] Unknown:
This is another really good question because operating a multi tenant service in the cloud is difficult already. When you add on to it that you're hosting customers' databases, potentially containing some of their most important data about their customers, you have to be really, really, really sure that you're providing them the introspection and the security that they need in order to feel safe at night, kind of trusting PlanetScale with their data. So we've had to do a number of very opinionated things in PlanetScale to make sure that we're doing that right. A great small example of that is our database passwords are not user supplied and are never user supplied.
We generate both the username and the password for every piece of access the customers get so that we can make sure those passwords are cryptographically secure and follow a standard format so that we can automatically revoke them if they get checked into GitHub. Similarly, we require that all database connections happen over TLS. You cannot make a connection to a PlanetScale database over the public Internet, over AWS PrivateLink, or anything similar unless it is secured by TLS, because we want to make sure that no plain text data ever touches the public Internet. And on the debugability side, we've built and are working on building an entire Insights product that really is kind of a soup to nuts application performance monitoring system for the database.
So we show you information about every single query you've run, how long it takes on average, how many rows it scanned, how many rows it's returned, how many rows have been updated. And we're working on evolving that product constantly to make it actually a best in class analytics tool because we know that customers need to be able to poke and prod and debug their database in production, or they're gonna get stuck, or they're going to ask us, and then we are going to have to spend the time using our tools to do that, where what we really wanna be able to do is furnish the customer with that information. So we spend a lot of time thinking, how is a customer going to not succeed?
How is a customer going to run into issues in part of their whole database life cycle? And how can we give them the tools to understand that and move past it without having to contact us or without needing our help?
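Stepping back to the TLS requirement mentioned above, here is a hedged sketch of a client connection that refuses anything but a verified TLS session. The endpoint and credentials are placeholders; PlanetScale issues the username and password for you, and its generated passwords carry a standard prefix, which is what makes automatic revocation scanning possible:

```python
# Placeholders throughout: the platform generates the username/password pair,
# and the "pscale_pw_" prefix is the standard format that lets leaked
# credentials be detected and revoked automatically, per the discussion above.
import pymysql

conn = pymysql.connect(
    host="aws.connect.psdb.cloud",       # example endpoint shape, not a real DSN
    user="generated-username",
    password="pscale_pw_...",
    database="mydb",
    ssl_ca="/etc/ssl/certs/ca-certificates.crt",  # system CA bundle (Debian-style path)
    ssl_verify_identity=True,            # refuse non-TLS or mis-identified servers
)
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
```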
[00:20:41] Unknown:
As far as the kind of visibility and debugability aspect too, there are the kind of operational aspects of observability of making sure that the server is up and it has enough capacity, which is what you handle and the customer doesn't want to have to think about. But then there are also the elements of being able to debug the behavior of the database itself and how it's interacting with the application and being able to do things like view query plans and explains and understanding where is all the time being spent, you know, where am I violating foreign key constraints. And I'm curious how you are working to be able to provide some of that level of visibility and expose the useful metrics and insights that a customer needs to be able to not just actively investigate things, but be able to proactively understand when a problem is occurring or might occur so that they don't have to wait until everything breaks?
[00:21:34] Unknown:
That's exactly what it is. You know, if you think about the relationship that people have with kind of their database vendors, it's actually very interesting. The database vendor is kind of responsible for keeping the database online, responsible for giving the database a certain amount of resources, whether it's a pre agreed upon amount or an auto scaled amount. Basically, kind of making sure that the code is running in an environment where it is ready to respond to requests and do what it needs to do. But there's also a shared responsibility with the customer. Right? It's possible for anyone to create a bad query that's bad enough to impact other queries, maybe even take the database down.
In a sufficiently complicated system, that's nearly unavoidable. What customers need to be able to do is understand: why did my code change cause my database performance to get worse? What query is causing, you know, all of the CPU time or all of this disk time to be used instead of what I want it to be used on? And customers need to be able to get that information early so that they can debug it and fix the problem before it impacts things so badly that they kind of need to do a hard reset or something like that. So I think your point is really great. There's a certain amount of making sure the database is working that a vendor is responsible for doing, and there is a necessary certain amount of understanding. Now we want that to be less and less as time goes on. But for a complicated system like a database, you do have to know how to hold it. Otherwise, you may run into issues that the database itself may not be able to solve.
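On stock MySQL, the kind of per-query accounting being described can be approximated with performance_schema; a rough sketch, assuming the statement digest instrumentation is enabled and with placeholder connection details (PlanetScale's Insights productizes the same idea):

```python
# Rough sketch: surfacing the heaviest queries from MySQL's performance_schema,
# the raw material behind "which query is eating all the CPU/disk time".
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="...")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT digest_text,
               count_star            AS calls,
               sum_timer_wait / 1e12 AS total_seconds,  -- timers are in picoseconds
               sum_rows_examined     AS rows_examined
        FROM performance_schema.events_statements_summary_by_digest
        ORDER BY sum_timer_wait DESC
        LIMIT 10
        """
    )
    for digest, calls, seconds, examined in cur.fetchall():
        print(f"{seconds:9.2f}s {calls:8d} calls {examined:10d} rows  {(digest or '')[:70]}")
```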
[00:23:08] Unknown:
In particular, for something like a MySQL database, there's also the question of whether or not there is appropriate index coverage or if you're over indexing. And I'm curious what level of kind of handholding you offer to customers to be able to surface some of that information of, hey. I see you're querying this table a lot, and you're just doing a full table scan, and it's costing you a lot of time.
[00:23:30] Unknown:
You've just nailed it. So this is on our road map for 2023. With Insights, we kind of have the ability for knowledgeable customers to get that level of introspection. You can take a look and see, hey, why is this query scanning so many rows but only returning 1? But that level of knowledge even requires, you know, someone to understand what an index is, why you would need 1, how they're applied on queries. So we wanna take that to the next level and do pretty much exactly what you just said, which is be able to tell someone, you know, hey, there's an index that could cover a, b, and c that would really save you a lot of time for your busiest 3 queries. Or, hey, we noticed you're spending an inordinate amount of time on this when it just doesn't need to be that way. It can be a simple point query, a simple lookup query. Because we do see this a lot, and it's never the customer's fault. It's our fault because we're providing them the database. We're providing them the solution.
But the performance difference on even a medium sized table between a proper index and an improper index can be multiple orders of magnitude. We can take a customer who's scaling up and up and up and barely able to meet demand, add a single index, and then their database looks nearly idle. It really can have such a huge impact, and it's so important for us to be able to tell customers that. We save them money too because they're not doing table scans and they're not buying a bigger database, and it saves us the problem of their database, you know, being slow or looking slow.
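The diagnosis-and-fix loop he describes is visible in plain EXPLAIN output. A sketch with an invented schema; note that on PlanetScale itself the ALTER would normally ship through a branch and deploy request rather than being run directly:

```python
# Illustrative only: spotting a full table scan with EXPLAIN and fixing it
# with an index. The orders/customer_id schema is made up for the example.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="...", database="app")
with conn.cursor(pymysql.cursors.DictCursor) as cur:
    cur.execute("EXPLAIN SELECT * FROM orders WHERE customer_id = 42")
    plan = cur.fetchone()
    print(plan["type"], plan["rows"], plan["key"])  # type=ALL means a full scan

    if plan["type"] == "ALL":
        # Orders of magnitude cheaper once the lookup becomes an index seek.
        cur.execute("ALTER TABLE orders ADD INDEX idx_customer (customer_id)")

    cur.execute("EXPLAIN SELECT * FROM orders WHERE customer_id = 42")
    plan = cur.fetchone()
    print(plan["type"], plan["rows"], plan["key"])  # expect type=ref via idx_customer
```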
[00:24:52] Unknown:
Another interesting element of that problem is the question of, okay, I've identified that you need to apply this index. This is kind of a 2 pronged element. 1 part of it is, depending on the table, adding that index might actually be a very expensive operation and take a long time and decrease performance for a while. And so, 1, I'm curious about how you manage to mitigate some of that aspect and some of the ways that you're able to take advantage of the implicit scalability of Vitess and PlanetScale to be able to reduce the impact. And then the other aspect of it is, I see you're missing this index. You know, do you want me to apply it for you? And then how that factors into the development cycle of customers who are using something like a Django ORM or a Hibernate ORM to actually manage those migrations and some of the potential conflicts that might come up from that?
[00:25:44] Unknown:
Absolutely. So our migration system you know, we've spent a lot of time on our migration system. So in an ideal world, when we kind of warn people about this, when we say, hey, you know, here's the alter statement with the index in it that you need, we wanna make it potentially as easy as a single button, and then we'll go handle it in the background. So we can throttle the migration kind of forward and backward based on how busy their database is, and we almost always want to sacrifice how long the migration takes for serving kind of higher priority traffic. So our system in the background will adjust very frequently and is capable of saying, hey, we're taking up too much time on the box, just go a little slower. We almost always find that almost everyone's business is kind of cyclical in 1 way or another, whether it's time zone or day of the week or something like that. So we can use that extra time during those times when, you know, database load is low, and then speed up the migration a little bit more and make it go a little bit faster. So that's 1 thing that's really nice about kind of owning the whole system front to back is that we're able to do that for customers. And, again, they don't even have to think about it. They don't have to tell us, here's when my business is at the slowest.
Here's the Saturday that you can run it. No downtime, kind of no worry from their end. We just kind of soak up the excess capacity to get that migration done. But your point about, you know, ORMs and stuff like that is really good, and we haven't really had a great answer for that. We want to be able to tell people, here's the schema change needed. We may do something like, here's an example Rails migration or Django migration or Hibernate migration, you know, please drop it in. But even then, it becomes many more than a single click to get done. So I think this is a product thing that we're gonna have to think about as we go and develop this feature. It's a very good 1 because if you do it incorrectly now, every single developer is gonna have to go pull. They're gonna hit a conflict the next time they run a migration and, you know, we've all been there. Then someone has to apologize, get into Slack, and share the workaround. 1 person's on vacation, they're gonna come back a week later, and they're gonna hit the same problem, and you're gonna have the same issue. So making sure it fits in again with everyone's dev workflow is really, really important.
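The load-aware throttling described above can be sketched as a batch copy loop that backs off whenever the server looks busy. This is only an illustration of the idea; the real machinery lives in Vitess's online DDL system, and the thresholds, table names, and the Threads_running signal are choices made for the example:

```python
# Sketch of load-adaptive batch copying, in the spirit of Vitess/gh-ost style
# online schema changes. Everything here is illustrative, not PlanetScale's
# actual implementation.
import time

import pymysql

BATCH = 1000
MAX_THREADS_RUNNING = 25   # back off when the server is busier than this

conn = pymysql.connect(host="127.0.0.1", user="root", password="...",
                       database="app", autocommit=True)

def server_is_busy(cur) -> bool:
    cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_running'")
    _, value = cur.fetchone()
    return int(value) > MAX_THREADS_RUNNING

last_id = 0  # assumes a monotonic integer primary key
with conn.cursor() as cur:
    while True:
        if server_is_busy(cur):
            time.sleep(1.0)   # yield to higher-priority production traffic
            continue
        cur.execute(
            "INSERT INTO orders_new SELECT * FROM orders "
            "WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, BATCH),
        )
        if cur.rowcount == 0:
            break             # backfill complete; cutover happens separately
        cur.execute("SELECT MAX(id) FROM orders_new")
        (last_id,) = cur.fetchone()
```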
[00:27:51] Unknown:
To that notion as well, it's interesting to think about the impact that a serverless database experience has on the approach that the development team takes to designing and building their applications and how much of their time and kind of innovation goes into thinking about their interaction with the database and how much having that serverless platform frees them from having to consider that in the rest of their work that they do? Absolutely. You know, I think a really cool example of this is even something as straightforward as connection limits. So if you're deploying an application on Lambda or Cloudflare Workers or something like that,
[00:28:29] Unknown:
they'll helpfully scale you up to kind of wildly high levels if you get that traffic, and your credit card will swipe for it. But your MySQL instance or your Postgres instance might only allow for 1,000 connections. Now that may have been fine in a world where you are, you know, deploying applications on bare metal. You think, I need 500 more servers, and, like, okay, well, talk to me in 3 quarters. That's not the world we live in anymore. You want 500 more workers, Amazon will do that for you in the tick of a clock. So now all of a sudden you're getting into this world where you're running out of connections, you have to ration them out, and you've actually got a global scaling limitation just because of 1 single component of your database.
The serverless world hates that, and the serverless world is right to hate that. And so being able to no longer think about that means that you can all of a sudden scale kind of your whole architecture out, where you can break it down into multiple components that share 1 database. You can break it out into things that scale independently. And just not having to worry about 1 component of your overall stack scaling in 1 dimension can actually lead to innovation, and it can lead to faster processes and faster development life cycles in so many other places just once you take 1 of them off the board.
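On the application side, a common mitigation for the connection-limit squeeze described above is to hoist the connection out of the per-request path so warm serverless instances reuse it. A generic sketch, not tied to any particular provider, with illustrative names:

```python
# Generic sketch: one connection per warm instance instead of one per request,
# so 1,000 concurrent invocations don't mean 1,000 fresh connections each.
import os

import pymysql

_CONN = None  # module scope runs once per container, then survives invocations

def get_conn():
    global _CONN
    if _CONN is None or not _CONN.open:
        _CONN = pymysql.connect(
            host=os.environ["DB_HOST"],
            user=os.environ["DB_USER"],
            password=os.environ["DB_PASSWORD"],
            database=os.environ["DB_NAME"],
        )
    return _CONN

def handler(event, context):
    conn = get_conn()             # reused on warm starts, recreated on cold ones
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM orders")
        (count,) = cur.fetchone()
    return {"orders": count}
```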
[00:29:44] Unknown:
Build data pipelines, not DAGs. That's the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high cardinality joins, aggregations, upserts, and window operations.
Output can be streamed into a data lake for query engines like Presto, Trino, or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake and run unlimited transformation pipelines for free. That way, data engineers and data users can process to their heart's content without worrying about their cloud bill. Go to dataengineeringpodcast.com/upsolver today to get a 30 day trial with unlimited data and see for yourself how to untangle your DAGs. Another aspect of what you're building is the question of how something like PlanetScale, which is engineered and designed for this transactional workload, integrates with and becomes a component of a broader analytical data stack and some of the ways that you are looking to simplify that initial step of going from here's my transactional environment to here's my analytical environment and helping to reduce the friction in that data integration step.
[00:31:19] Unknown:
Totally. So earlier in the year, we built a product called PlanetScale Connect, and that was kind of our first strike at making the, you know, ELT, ETL pipeline make sense. Frankly, it has a long way to go, and that's not because of anything that, you know, any specific person has or hasn't done. It's just a very fragmented space with a lot of opinions. There's a limitless number of sinks. There's a limitless number of tools to get you from point a to point b. And so I think our strategy is still kind of forming on how much do we wanna be involved in the direct process versus how much do we wanna be supplying the existing market leaders with the tools to connect to PlanetScale and to read from PlanetScale and really figure out what users are doing and what users aren't doing. You know? We never wanna make our users go out and buy another piece of software to be successful on PlanetScale, but we also don't wanna stop our users that already buy that software from being able to leverage it with PlanetScale as well. And I think this is a microcosm of the data world at large that makes it so difficult is there's so much data. There are so many databases and so many companies that, you know, need to do more and more with that data that the space is so fragmented.
It's impossible to come in and hit 80% of the market with 1 solution. You're hitting 5% here, 10% there. You're hitting shared customers over here, and you're always missing someone who says, if only you worked with this or if only you supported that. No matter how many times you give them what they want, there's another 1 through the door with the exact same problem. And this is something that I hate because I hate letting our users down. There's just not enough hours in the day to give everyone exactly what they want. I'm interested in your perspective on that kind of data integration challenge and the ways that the ecosystem
[00:33:07] Unknown:
is evolving. You know, as you mentioned, there's the question of ETL versus ELT, 1 of the big kind of noisome aspects of data integration is this question of change data capture where you say, okay. We don't want to just do a batch pull every time every n number of minutes or hours. You just wanna be able to get changes as they propagate, but then there's the question of, okay. Well, now I need to be able to look at the binary logs of what the database is doing to be able to then figure out what it's actually semantically trying to do of, oh, did this transaction even get committed? Okay. This got deleted. You know, how do I figure out how to represent that historical aspect? And, you know, then there's the question of, well, you shouldn't even be going into the database. You should be looking at the application level and the semantics and just pulling from the application semantics and, you know, having the application push a feed of events. And so I'm just curious what your thoughts and experience has been of working with your customers and working in this ecosystem about some of the most successful and maintainable strategies for people who are trying to wrap their heads around this constantly changing and shifting morass of possibilities?
[00:34:15] Unknown:
Absolutely. So before I even talk about our customers, I mean, I'll talk about ourselves and how we handle this. We're an example of the problem. We use Segment for certain things. We use BigQuery for certain things. Some of the BigQuery usage is more of a change data capture style, live. Some of it's more like a nightly dump and examine, you know. And we use all these different tools to get data to all different places, over to the finance team, over to the marketing team, over to the, you know, the rest of the sales team and the engineering team. And it's really hard, but I think the strategy that I've seen be most successful with the customers I've worked with is when they have a strategy and they have something to unify all of those down into 1.
Very rarely do we find that a specific tool can't do something. It is that 1 tool does A better, 1 tool does B better, so companies buy both A and B. Or it's that team A prefers this, team B prefers that, so companies buy both A and B. And where I've seen folks be the most successful is when they've settled on a strategy for a few different tools. We're gonna do ELT and, you know, we're all gonna dump it into this 1 data lake. We're all gonna dump it into this 1 product, and then that's the playground for the rest of the business. Or we're gonna use the best in class tool and have multiple ELT tools, but they're all gonna land in just BigQuery, or they're all gonna land in just our internal data warehouse.
And I think that it really requires a business discipline and business acumen to understand what are all of the uses of our data? How do we catalog them? How do we understand them? And then how do we shape that to really understand where they're coming from, where they're going, and why they're doing it. I would bet a lot of the companies that we work with, a lot of companies in general, have data pipelines they don't even know about, and have, at the same time, core business functions being powered by scripts on people's laptops, dumping data every now and then. And so I think the hardest part as a leader, the hardest part about someone coming up with kind of how to wrangle this, is to pick a strategy and stick to it. You have to stare it down, and you have to really execute against that strategy, make as few exceptions as possible, and see how far it takes you. Because otherwise, it really is the wild west, and then you'll end up with teams A, B, and C not only having their own processes and their own software, but you lack interoperability between them. So if you need the sum of what A and B produce, all you've added is D, a whole other pipeline to connect the 2 of them, and it, you know, it cycles and cycles and cycles and cycles.
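To ground the change data capture mechanics raised a few exchanges back, this is roughly what tailing MySQL's row-based binlog looks like with the open source mysql-replication package for Python. The connection settings are placeholders, and it assumes binlog_format=ROW on the server; only committed transactions reach the binlog, which is why CDC readers key off it rather than polling tables:

```python
# Minimal binlog-tailing sketch using pymysqlreplication
# (pip install mysql-replication). All connection details are placeholders.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

stream = BinLogStreamReader(
    connection_settings={"host": "127.0.0.1", "port": 3306,
                         "user": "repl", "passwd": "..."},
    server_id=100,                      # must be unique among replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    resume_stream=True,
    blocking=True,
)

for event in stream:
    for row in event.rows:
        if isinstance(event, WriteRowsEvent):
            print("insert", event.table, row["values"])
        elif isinstance(event, UpdateRowsEvent):
            # Updates carry both images, so history can be reconstructed.
            print("update", event.table, row["before_values"], "->", row["after_values"])
        else:
            print("delete", event.table, row["values"])
```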
[00:36:43] Unknown:
Another aspect that's always interesting to dig into for companies that are building on top of an open source core is the question of governance of that open source project and how they think about the divide between what is the engineering effort that I'm putting into the open source layer and releasing to the public versus the effort that I'm putting into building my product and making sure that my customers and business are successful, and how much of this do I want to turn into a kind of competitive moat versus how much do I want to help everybody, including my potential competitors?
[00:37:17] Unknown:
Fantastic question. I mean, open core or businesses built on open source are so, so, so difficult. Right? Because just like you said, you're sometimes competing with yourself. At the same time, that competition is sometimes your top of funnel. At the same time, that competition is sometimes good enough that people never sign up even though, you know, it would serve them, and sometimes it's bad enough that people never get into the top of funnel at all and choose a whole different solution because they've had a bad experience. So for us, I think our strategy is balance.
We wanna make sure the open source project we're built on is healthy, it has users that use it, it's actively developed, it gets new commits, and it moves forward and stays kind of relevant in the industry, while at the same time, we wanna build on top of it and build kind of unique product experiences that may be powered by that open source project, but are a layer above. So a lot of our migration tooling, our revert tooling, and things like that, they're all open source components that exist in Vitess. And what we've done is we've combined them in an opinionated way, and we've built amazing, beautiful workflows on top of it. Nothing that somebody couldn't do on their own, but the goal is to save 10, a 100, a 1000 people from going out and doing it, and have them just use PlanetScale instead. So by making Vitess faster, by giving Vitess more features, we can build more on top of it as PlanetScale while still making the open source project better. But it is a really difficult balancing act because you wanna keep some stuff in reserve.
You wanna have some things where you say, no, if you want this, you have to come to PlanetScale. But you don't wanna make it look, like, capricious. You don't wanna make it look like you're hiding stuff or keeping stuff away or hindering the open source project. So you have to really figure out what that balance is and really make sure that you're not stepping too far in either direction, maintaining a healthy business while also maintaining a healthy community.
[00:39:13] Unknown:
In your experience of working at PlanetScale and on Vitess and working with your customers who are trying to reduce the amount of time they have to spend thinking about their database and they just wanna get their problems solved, what are some of the most interesting or innovative or unexpected ways that you have seen your platform used?
[00:39:31] Unknown:
You know, I think it really comes down to customers being able to use the framework that we created in ways that we could have never imagined. So people doing things with deploy requests and schema changes that we thought, well, we're actually basic users of this now. We're not even the most advanced use case for this feature. Customers using features of Vitess to move data around or shuttle data around to power business use cases that we would have never thought of. So I think any sufficiently large company, any sufficiently advanced product actually ends up learning as much from their users as they try and teach other users, if that makes sense. Right? You'll always find someone with a specific type of way of thinking about their problem that when they see how your solution maps on it, they'll be able to marry the 2 and find something that's way more advanced and way more cool than you could ever dream of. So I really think it's fun to watch people look at what we've provided and then look at what they're doing and either replace pieces of their workflow that they were doing more manually or just create kind of brand new workflows that would have never been in the cards had they not seen PlanetScale, but now are easy to do or possible because PlanetScale exists.
[00:40:46] Unknown:
In your work at PlanetScale, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:40:53] Unknown:
I'm challenged all the time at PlanetScale. We all are. I think 1 of the biggest learning lessons that I've had is just that customers need databases to do so many things in so many different ways that it's incredibly difficult to do them all. Some customers have really, what you would imagine, straightforward database use cases. Some customers need their databases to stretch, to be global, to power kind of unimaginable use cases. And what we've learned is that tailoring to all of them at once is nearly impossible. So as we look to the future and as we look to take those learnings, I think we were taken aback by just how wide the spread of database use cases is and how many different things people are doing. We're trying to line things up and get kind of tighter with our verticals, get tighter with our use cases a little bit, and then do better for those more tight use cases so that we can find customers, so we can surprise and delight and kind of figure out where we can compete really well without worrying about use cases that may map poorly or may be more difficult, so that we're not wasting customers' time and our time trying to support them. So I should have known this, but in the back of my head, I didn't. I was shocked and surprised just purely by every customer being so different from every customer before. You learn the rhythm of your top 10 customers, your top 15 customers.
The Venn diagram of how they're using their databases may as well be 2 circles, 1 on the moon and 1 on the sun, because they're just so disparate.
[00:42:19] Unknown:
For people who are looking to something like PlanetScale to reduce the operational overhead of being able to manage their databases and being able to increase their scalability, whether that's for data volumes or geo replication, what are the cases where PlanetScale is the wrong choice and they're better served either using 1 of the other vendors out there or building something in house?
[00:42:41] Unknown:
That's a fantastic question. And I was just talking with a prospect this morning who was kind of telling us about how they do things. And where we landed was PlanetScale is not the right choice for them because they've built enough expertise in house that they don't need a managed solution. You know? Now there are some companies who have built that expertise that don't want it anymore, and they would rather kind of give that to PlanetScale. But if you're a business where that data, that workflow, and those databases are core knowledge and a core competency of your business, don't buy a managed service and don't give that up, because you may be able to use it to move faster than your competition. So I really think that the core of it all is be ready to use a managed service if you're going to go buy a managed service. And that's true for PlanetScale, and that's true for any other database as a service. If you want to tinker, if you want to maintain, if you want to spend hours of your week kind of getting really down into the database internals, go run the open source project or go pick the product up off the shelf that you can tune and tweak. If you're a database user who just wants a database and you don't wanna think about it, PlanetScale and the managed solutions are perfect for you. But you really have to make sure that your business and your mentality fits in with the world view of that product
[00:43:56] Unknown:
because otherwise, you have an impedance mismatch that's gonna cause you to waste your time, and it's gonna cause you to have picked the wrong product. As you build out the PlanetScale platform and you continue to work with your customers and expand functionality, what are some of the things you have planned for the near to medium term or any particular projects or problem areas that you're excited to explore?
[00:44:16] Unknown:
I think the things that we're looking to do in the near term are really up the planet and up the scale. So we wanna make sure that we're offering more features that help PlanetScale become worldwide, more features to help customers scale. So we're gonna invest a ton in that Insights feature I was talking about earlier to make sure that it's better and better and better and better. We already have customers who wake up every morning, check their Insights dashboard, pick a slow query, make it better, and then do that again the next day. But we wanna give them more and more insight and more and more options to make their database performance better. We wanna give people more opportunities and more capability to scale their workloads worldwide. And, really, we just wanna make PlanetScale the best solution for what it is good at. So a lot of removing compromises, increasing stability, and increasing quality. We're really happy with the core database technology that we're built on, and we think that there's room just to be the best database without having 10,000 other features on top. So we're going to invest heavily in becoming the best database.
[00:45:21] Unknown:
Are there any other aspects of the work that you're doing at PlanetScale or this overall space of serverless databases or some of the global replication capabilities that we didn't discuss yet that you'd like to cover before we close out the show?
[00:45:34] Unknown:
I think not a ton. You know, I think the big thing to think about as you think of global databases is what you actually need out of your database when it comes to data locality. So our CEO, Sam Lambert, actually just tweeted yesterday or today, you know, people are willing to drop a 9 or 2 of availability to have global writes, for example. And I think that's a really good way to think about the problem. What do you want out of your data? Where do you want it, and when do you need it performing? Every time that you distribute your data more, every time that you splinter things around and make things more global, you also make things more fragile.
No solution provides kind of all 3 parts of the triangle. And so as we think about building features for data locality and for global data, we're always thinking what's actually useful for users. And that's why we started with global read replicas, for example, because we think the most important part is for global reads to be as fast as possible. And so just, you know, as we're working on this and as we're talking through this, we really wanna be in the position of building the best thing for customers to actually go and succeed and to actually go power their business, not just the best thing on a spec sheet or the best thing in a technical comparison.
We're too pragmatic for that. We wanna be the best thing for the real world.
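As a technical footnote on read replicas, Vitess itself lets a plain MySQL session target replica tablets with an `@replica` suffix on the keyspace name. How PlanetScale surfaces replica routing for its global read replicas is a product detail that may differ, so treat this as a Vitess-level sketch with illustrative names:

```python
# Vitess-level sketch: reads can be directed at replica tablets by selecting
# "<keyspace>@replica" through vtgate. Host, keyspace, and schema are invented;
# PlanetScale's own replica routing may be exposed differently.
import pymysql

conn = pymysql.connect(host="vtgate.internal", port=3306, user="app", password="...")
with conn.cursor() as cur:
    cur.execute("USE `commerce@replica`")   # route subsequent reads to replicas
    cur.execute("SELECT COUNT(*) FROM orders")
    print(cur.fetchone())

    cur.execute("USE `commerce@primary`")   # back to the primary for writes
    cur.execute("UPDATE orders SET note = 'x' WHERE id = 1")
    conn.commit()
```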
[00:46:49] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:47:06] Unknown:
Yeah. Great question. And I think it really comes back to that management, that inventorying, and that meta component of data. I think if you look at any vertical, whether it's Segment, Census, BigQuery, whether it's Airbyte, Fivetran, you have these best in class tools that can do amazing things. And what I think you're lacking is the ability to get a view from 10,000 feet at what the flows look like and why we're doing this. I can build great charts in Mode based on BigQuery, based on all of these different things, but no one knows how to find them. I, as an engineering leader, can look around and know that we have data in all of these places and utterly, completely fail to understand how to turn it into action.
So the thing that I think is missing more than anything is the hardest thing to do, which is to take each of these different components that companies are executing in and figuring out how to help businesses become more strategic in using them. Once you get someone actually actioning data, once you get someone actually using data, they're never gonna stop. But getting folks to start kind of lowering the activation energy and then helping the business plan all around it will help all of these tools take off individually, but it's sort of a collective action problem that I don't think anyone has solved yet.
[00:48:17] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at PlanetScale. It's definitely a very interesting product and an interesting project, and it's great to see the ecosystem of solutions expanding for people who want to be able to just get their applications up and running and get it working. And it's great to hear the level of effort that you're putting into making that data integration and analytical flow an easier step for people as well. So appreciate that, and thank you again for the time and energy you and your team are putting into it, and I hope you enjoy the rest of your day. Thank you so much for your time. Have a great weekend. It's been a pleasure.
[00:49:00] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Background
Overview of PlanetScale and Vitess
Core Problems Solved by PlanetScale
Strategies for Load and Scalability
Competitive Landscape and Trade-offs
Development Experience and Productivity
Capabilities Beyond MySQL
Implementation and Architecture of PlanetScale
Operational Challenges and Debugability
Visibility and Debugging Database Behavior
Indexing and Performance Optimization
Impact of Serverless Database Experience
Integration with Analytical Data Stack
Data Integration Strategies
Governance of Open Source Projects
Innovative Uses of PlanetScale
Lessons Learned at PlanetScale
When PlanetScale is the Wrong Choice
Future Plans and Exciting Projects
Final Thoughts and Data Locality