Summary
A large fraction of data engineering work involves moving data from one storage location to another in order to support different access and query patterns. SingleStore aims to cut down on the number of database engines that you need to run so that you can reduce the amount of copying that is required. By supporting fast, in-memory row-based queries and a columnar on-disk representation, it lets your transactional and analytical workloads run in the same database. In this episode SVP of engineering Shireesh Thota describes the impact that SingleStore can have on your overall system architecture and the benefits of using a cloud-native database engine for your next application.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Shireesh Thota about SingleStore (formerly MemSQL), the industry’s first modern relational database for multi-cloud, hybrid and on-premises workloads
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what SingleStore is and the story behind it?
- The database market has gotten very crowded, with different areas of specialization and nuance being the differentiating factors. What are the core sets of workloads that SingleStore is aimed at addressing?
- What are some of the capabilities that it offers to reduce the need to incorporate multiple data stores for application and analytical architectures?
- What are some of the most valuable lessons that you learned in your time at Microsoft that are applicable to SingleStore’s product focus and direction?
- Nikita Shamgunov joined the show in October of 2018 to talk about what was then MemSQL. What are the notable changes in the engine and business that have occurred in the intervening time?
- What are the macroscopic trends in data management and application development that are having the most impact on product direction?
- For engineering teams that are already invested in, or considering adoption of, the "modern data stack" paradigm, where does SingleStore fit in that architecture?
- What are the services or tools that might be replaced by an installation of SingleStore?
- What are the efficiencies or new capabilities that an engineering team might expect by adopting SingleStore?
- What are some of the features that are underappreciated/overlooked which you would like to call attention to?
- What are the most interesting, innovative, or unexpected ways that you have seen SingleStore used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on SingleStore?
- When is SingleStore the wrong choice?
- What do you have planned for the future of SingleStore?
Contact Info
- @ShireeshThota on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- MemSQL Interview With Nikita Shamgunov
- SingleStore
- MS SQL Server
- Azure Cosmos DB
- CitusDB
- Debezium
- PostgreSQL
- MySQL
- HTAP == Hybrid Transactional-Analytical Processing
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Shireesh Thota about SingleStore, formerly known as MemSQL, the industry's first modern relational database for multi-cloud, hybrid, and on-premises workloads. So, Shireesh, can you start by introducing yourself?
[00:02:11] Unknown:
Hi, this is Shireesh Thota. I am senior vice president of engineering at SingleStore. I worked at Microsoft for the past 15 years. I've worked on various database engines. I started as an intern at SQL Server when I was doing my master's, and then I came back to do my full-time employment with SQL Server. I was part of various releases of SQL Server starting from 2005 SP2, which is like a major sub-release of that version, all the way to 2012, and built various features for SQL Server. And then I was one of the founding members, like, early crew of Azure Cosmos DB. Azure Cosmos DB is a completely different kind of a database from SQL Server. It's a non-relational database. It was primarily focused on being an elastic, geographically distributed database. I did various pieces in Azure Cosmos DB, worked on the query engine and the inverted index, and then I went on to run the entire engineering for Cosmos DB for several years.
I have also, along the way, started running engineering for Postgres Citus, which is an acquisition that Microsoft made. And then recently, I've started working at SingleStore, so I'm here. And do you remember how you first got started working in data? As I said, I was an intern at SQL Server. So that happened almost accidentally. I was interviewing with various companies, various divisions, and I had an offer from Visual Studio to work on the debugger as well as SQL Server here at Microsoft, along with a bunch of other offers. But I just picked SQL Server; the problem space and the kind of interactions that I had felt right to me for some reason. And then I enjoyed the internship quite a lot. I worked on comparing large databases and figuring out what is the effective way to figure the delta and move the schema changes efficiently, etcetera.
I got hooked and came back and joined SQL Server. SQL Server is a fantastic place to get started if you're interested in, like, database careers. So, yeah, I sort of got that going and learned quite a lot there. In a sense, I didn't even work in any other department at Microsoft, even though Microsoft is such a large place. That's kinda really my beginning. And then I ventured into various aspects of data management effectively.
[00:04:19] Unknown:
Now in terms of the SingleStore product, I'm wondering if you can just describe a bit about what it is and maybe what it is about that product that interests you and makes you excited to be working there. And for the kind of detailed pieces, I'll refer people back to the interview I did with Nikita Shamgunov
[00:04:37] Unknown:
few years ago at this point, which I'll put in the show notes. Nikita has basically gone through some of the technical details of SingleStore. And in fact, you know, to answer that question, I probably have to refer to some of those pieces because that's precisely why I really got interested in SingleStore. The technology is just amazing. The novel approach of building a translytical, HTAP database, so to speak, is amazing. We basically have a system that can provide you the low latency for operational needs, as well as the low query latency and high concurrency for significantly complex analytical queries, in one place.
And the way it is done is just amazing. We take advantage of a rowstore in memory, and then we tier it into a columnstore on disk. And then we've now recently added what we call bottomless, which is tiering further into an object store. And so we kind of reimagined how you do log-structured storage across these storage tiers and then done a significant amount of engineering to provide both ends of the spectrum, both the transactions and the analytics. So technology is really one of the core reasons why I got attracted to SingleStore. Also, my experience working in databases over the course of, you know, several years has taught me that, you know, the market and the TAM is just amazing. Right? So you really don't go wrong working in the database market. And when you have an amazingly high-potential technology like SingleStore, you have a great combination. And so working with people, talking to them, etcetera, proves me right.
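The in-memory rowstore plus on-disk columnstore tiering he describes corresponds to table types in SingleStore's SQL dialect. A minimal sketch, with illustrative table and column names, and assuming a recent SingleStore version where a plain `CREATE TABLE` produces a columnstore ("universal storage") table by default:

```sql
-- Rowstore table: held in memory, suited to point reads and writes (OLTP).
CREATE ROWSTORE TABLE user_profiles (
    user_id BIGINT PRIMARY KEY,
    name    VARCHAR(100),
    email   VARCHAR(255)
);

-- Columnstore table: data is tiered to disk, and with bottomless,
-- further into object storage, suited to analytical scans.
CREATE TABLE page_events (
    event_time DATETIME(6),
    user_id    BIGINT,
    url        TEXT,
    SORT KEY (event_time),   -- physical ordering for range scans
    SHARD KEY (user_id)      -- distribution key across the cluster
);
```

The point of the design is that one engine serves both tables, so a query can join transactional profile data against the analytical event history without an ETL hop between systems.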
[00:06:06] Unknown:
In terms of the database market, it's gotten to be quite crowded, with dozens, if not hundreds, of entrants for different special-purpose use cases, as well as the sort of old stalwarts of Postgres and MySQL and Microsoft SQL Server and Oracle. And I'm wondering if you can just talk about the different areas of specialization that you see as being part of the broad categories, and what you see as the core sets of workloads that SingleStore is aimed at addressing?
[00:06:37] Unknown:
Yeah. You know, a little bit of history would help here. When you think about how this whole industry has evolved, it can be categorized roughly into 3 different phases. Right? In the first phase, we basically had the simple single-node databases, relational only, and they were basically catering to mostly operational needs. Then we were slowly building the models to do OLAP, with a very concrete modeling of extract, transform, and load into, again, a relational-based data warehouse. Right? Then we go to phase 2, where we basically want to handle the scale, which the single-node databases weren't able to keep up with. And, hence, the whole NoSQL movement began in tandem with the Hadoop and MapReduce kind of architecture.
And that kind of really expanded into various other flavors. I would also say the columnar formats came into work in that era, which kind of rethought how to do data warehouses in a slightly different way. And then, you know, some work on, like, the data lakes also began, Spark computation came into place, etcetera. I think we are in the third phase, which is an interesting and most important phase, where we need to do more convergence and not further explosion. What I see here is that the modern data stack, as we call it, is getting more complicated. There are the operational sources, and there are ingestions to do the E and L. And then you have the curated sources and query processing, and then you do transformations, and then you have sort of the analytics dashboards and whatnot.
We have these kinds of pieces, and they're all very good pieces. They're very good categorizations. But what ended up happening with this categorization is way too much specialization. When you have way too many specializations, then the problem of connecting these boxes, and the cost of doing the connections and making sure that the data governance and, you know, protection, etcetera, are right, that goes up significantly. There are great technologies in each of these categories, but the movement of the data and the complexity have gone up quite a bit, which is not necessarily a great thing even though, individually, every piece looks great. SingleStore has a distinct view here, where we want to simplify and converge many of these pieces. Right? When you have a database that can unify the operational needs and the analytical needs in one place, you kind of reduce the number of boxes.
A significantly large number of advantages come as a corollary. As I said, you know, governance, privacy, those things are definitely far better when you have data in one place. And it becomes a, you know, far lower TCO overall. Right? Less error-prone. So those are the characteristics that we are going after. I wish the market and the industry would embrace that overall, and I would consider SingleStore to be far ahead on that.
[00:09:17] Unknown:
Because of the fact that it does have a few different types of optimizations that you can take advantage of depending on how you want to execute your workload, I'm wondering if you can speak to some of the unique capabilities that it offers in the market versus the need to incorporate multiple different data stores, as you mentioned, or multiple specialized engines for different use cases, and some of the ways that you're able to leverage the ability for SingleStore to operate both as a transactional engine and an analytical engine.
[00:09:49] Unknown:
When you think about the story of SingleStore, and Nikita has touched upon it in his previous podcast, they started off basically wanting to do a scale-out transactional workload and the bet on RAM getting cheaper, etcetera. And as we evolved, as hardware trends have evolved, etcetera, we've added significantly important capabilities without losing the initial view. These capabilities help workloads which are looking for real-time insights to begin with. Right? The kinds of applications that need real-time insights, instant insights on fresh data, have only gone up. Right? When you think about the financial industry or, you know, even any industry that is faced with data that needs to be fresh to make a call, even, you know, ride-share apps. In the gig economy, quite a lot of applications require those real-time decisions.
SingleStore is in fact really a must-have in your toolkit to make those kinds of decisions faster. Because when you have more boxes, more databases, you're effectively losing precious time in moving the data across these boxes, and thereby your time to insight is always gonna be slower. Right? Let alone the cost of moving it, and TCO, governance, and all those important problems. And the fundamental business value of real-time insights is something that is an important and core capability of SingleStore. Now once you come in there, the other interesting capability of SingleStore is that, by virtue of how the modern data stack has started evolving and creeping, it simplifies it. It simplifies your overall spread of databases.
And true to our name, it's a single store. You know, our vision is to simplify. You don't need a relational single-node database, and then a document database, and maybe an Elasticsearch kind of system, then a cache on top of it, and then moving that data entirely into a data warehouse. All these kinds of complexities would be tackled nicely. So that's another core capability of SingleStore. Right? This is how we start, but we can expand and give you so much benefit with that other capability of simplifying. So I would just touch upon those as our core capabilities.
Now naturally, SingleStore being, you know, a general-purpose database that's available in all clouds, you would get the benefit of what you would expect from a cloud-native database. Those capabilities are, in fact, very nicely packed into the system.
[00:12:05] Unknown:
Given the fact that you have a long history of working on different database technologies from your time at Microsoft, I'm wondering what are some of the lessons that you're able to bring into your work at SingleStore, and some of the useful insights that you have gathered from your experiences on SQL Server and Cosmos DB that you're able to apply to the area that SingleStore is focused on and the types of customers and workloads that you're focused on serving.
[00:12:33] Unknown:
Microsoft is a magical place, and I definitely had great lessons working on various kinds of databases. SQL Server, Cosmos DB, and Postgres have each taught me distinct lessons, but the themes are the following. Right? I probably have both patterns and anti-patterns from working at Microsoft. On the pattern side, the lesson that really applies here is that, just like any other database, SingleStore is, of course, similar to those databases in the sense that it's a marathon. Right? When you wanna build a database, the results are not gonna come like you would expect from a sprint. Right? The execution needs to be like a 400-meter sprint, but building a database is a marathon. It does take a significant amount of effort and, you know, time to sort of see the results. And the other thing is to basically understand the workloads; understanding where you begin and how you optimize those workloads is distinctly a different kind of very deep challenge compared to other businesses. Right? When you are in the database world, there is this sort of prime directive in terms of making sure that the data, which is a fundamental asset of all these companies, needs to be protected and never be corrupted, etcetera, and is always available when you are running the service at the scale that the customers expect of you. It becomes a very distinct challenge. Right? Microsoft is a phenomenally great place because it runs hyperscale services, effectively; all of them were, and it was dealing with a lot of Fortune 500 customers. And so those lessons are sort of, like, really deeply built into me. On the anti-pattern side, when you come from a company like Microsoft into a SingleStore, well, Microsoft is, of course, a gigantic machine, which does not just do databases but really pretty much everything, every software out there.
And so there's a little bit of friction and focus challenges to bring in all the pieces and make the magic happen. Right? Approaching things in the same way wouldn't work in a company like SingleStore. Right? And that applies to the product as well: the strategy, the execution, a lot of aspects. Right? So those are the places where, you know, I tend to be more careful in terms of not applying those approaches.
[00:14:44] Unknown:
So looking back at when I did the interview with Nikita, it's been almost 4 years now, back in October of 2018. And I'm wondering if you can share some of the high-level changes, whether that's changes in focus, architectural updates, changes in capability,
[00:15:01] Unknown:
that have happened in that time from 2018 to where we are today? Yeah. Quite a lot has happened. I will focus on a few most important ones perhaps. The fundamental piece that has changed is that SingleStore is now a cloud-first company. We're not a cloud-only company; we are a cloud-first company. We've embraced all the major cloud providers. We basically started exploiting the capabilities of cloud, economies of scale, and, you know, how well we can horizontally scale the system, how seamlessly we can provide high availability, etcetera.
And along with that, it also is probably more appropriate to talk about bottomless. Bottomless is a feature that I alluded to prior, where we tier the data not just from in-memory to disk, but now we can go to the blob store, the object store. So you could pretty much put, like, you know, really a bottomless amount of data, and we scale seamlessly. These kinds of changes were possible because of the shift to the cloud. That's a big business change and a big technical change that we've done since 2018, for instance. A ton of optimizations and query improvements have gone into the system. We now have a system where we have compelling benchmarks on TPC-H and TPC-DS, as well as on TPC-C. By the way, you know, that's really very novel for a system to be good at all these benchmarks in one place. And that's possible because of so much more work that has happened in the query and storage areas. Right? So those are some of the top changes that have happened since then.
[00:16:30] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability.
Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. You mentioned the shift to the cloud. That's obviously a huge area of growth for most companies. I'm wondering what are some of the other macroscopic trends in data management and application development, and just the overall surrounding technical ecosystem, that have pushed it in that direction, and just some of the ways that those trends have impacted the product direction and areas of focus?
[00:17:58] Unknown:
Definitely, the scale aspect is one of the macroscopic trends that pushed us into the cloud. Right? When you are talking about elasticity, it's really hard to do it outside of the cloud. We want to be seamlessly elastic. You know, we wanna go from no amount of data to a significant amount of data, no throughput to a high amount of throughput, and really, you know, pay as you go in an effectively consumption-oriented fashion. Those are some of the important metrics, right? The high availability aspect is another very important aspect of modern-day mission-critical kinds of applications that pushed us into the cloud. But in addition to those, there are the application trends. If you look at how a significantly broad number of applications have been built on databases, application developers by and large want an integrated stack. The best and most popular one that people know of is the MEAN stack, for instance. Right? You know? And there are a bunch of others as well. It used to be LAMP; now it's MEAN. There are many other stacks that are popping up, but they want integration of databases and frameworks all the way to exposing something so seamless that you provide the backend as a service. Right? Where the applications are just working as single-page applications. It's smooth, seamless, without having to understand, sort of, like, the depths of the traditional relational algebra, effectively. Right? The other trend is consumption-oriented. That's right. We wanna build applications.
You want to go from a very little amount of money to potentially a mission-critical application. We wanna make it easy. And so the serverless paradigm is another one that is changing things, and SingleStore is definitely evolving along with it, effectively.
[00:19:35] Unknown:
And you mentioned recently the idea of the modern data stack. That's something that's gaining a lot of mindshare and airtime right now. I'm wondering if you can talk to, for engineering teams that are already invested in or starting to think about adopting that architectural pattern and that paradigm, where does SingleStore fit in that overall architecture?
[00:19:55] Unknown:
When you think about the modern data stack, again, you know, this is our definition, but roughly the definition is that you have operational sources, and you have, you know, the E and L with the storage and query in between, and then you have the dashboards and the insights on the far end. Right? SingleStore can simplify many of these pieces. To start with, it's an operational store. You know, that's where the roots of the company began; basically, it's a rowstore, transactional but scalable, system in memory. And then we've expanded it further, of course, but we can be seen as the operational source of truth. We can be used as the intermediate phase where you basically extract the data from various sources and land it in a storage and query format for the analytical needs, especially real-time analytics as well as broader analytics. We have many customers who do that. So we can play on the operations and the analytics pieces. We do the ingestion very seamlessly.
If you use SingleStore for both the operational transactions and the analytics, you don't even need the intermediate pipes anyway. But if you're coming in from a different source, we support seamless ingestion with a very native, you know, feature called Pipelines that works with Kafka and a bunch of other streaming data. So we can do that very nicely. And on the real-time insights end, of course, we're really, really good at it. We have great integration, an amazing partner ecosystem, but effectively, think of it as operations as well as the intermediate storage and query for analytics. Those are the areas where we can do a very good job, effectively.
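The native ingestion feature he refers to is SingleStore's `CREATE PIPELINE` statement. A minimal sketch of streaming from Kafka, assuming a reachable broker; the broker address, topic, and table names here are hypothetical:

```sql
-- Continuously load CSV-formatted messages from a Kafka topic.
CREATE PIPELINE clicks_pipeline AS
    LOAD DATA KAFKA 'kafka-broker.example.com:9092/clicks'
    INTO TABLE clicks
    FIELDS TERMINATED BY ',';

-- Pipelines are paused on creation; start the background ingestion.
START PIPELINE clicks_pipeline;
```

Because the pipeline runs inside the engine, data lands in a queryable table with no separate ingestion service to operate, which is the "fewer boxes" argument in practice.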
[00:21:23] Unknown:
In terms of the simplifications that an engineering team might expect when they're adopting SingleStore, where maybe they have gotten used to this idea of, I've got my operational store, and now I need to do the extract and load into my analytical store: what are some of the operational, application, and system architecture considerations that they need to be thinking about as they move to SingleStore, and some of the types of capabilities or efficiencies that they might expect by moving in that direction?
[00:21:55] Unknown:
Yeah. So, fundamentally, SingleStore is a relational engine. Right? It's a relational foundation, but it is also a multi-model system. We have various kinds of type support. We can do spatial, JSON, time series, and whatnot. But we are a relational system, and we take care of the art of converting an operational-friendly data model into an analytical-friendly data format. We took that on, and it's completely transparent in the sense that you just ingest the data as though you're ingesting into an operational store. And internally, we take care of transforming the data into columnar format in the right way and then supporting various data models.
So that translates into a significant amount of developer benefit, because the developer doesn't need to go and do all the modeling exercises, need not do the data engineering pipelines to move the data across, and doesn't have to figure out, hey, which part of the data needs to become columnarized and which parts shouldn't, how do I do the tiering, how do I do the, sort of, you know, translation of various pieces, etcetera. And you don't have to work with complex pieces to then go into a real-time-insights kind of platform. They're all in one place, which means a significant saving in dev time, let alone the benefits in production.
[00:23:11] Unknown:
And from the analytical perspective, if I recall, it also has some built-in capabilities for being able to do some of the extract and load as part of the engine runtime, so you don't necessarily have to bring in another tool like a Fivetran or an Airbyte to be able to feed data into it; it can actually go and do some of that retrieval of data into the storage for analytical purposes.
[00:23:35] Unknown:
That's true. It depends really on the volume and the requirements. We work with those tools very well. If you have dbt, for instance, we work with that connector, you're bringing in data, etcetera. But, fundamentally, we have key capabilities in the system, which we refer to as pipelines, which will help you move the data seamlessly. If you have change data capture from an operational source, you know, you can put in a Debezium kind of approach there, and it works very well with our pipelines. And then we have very rich stored procedure support. And we are also adding, in fact, this is a new thing that's coming out, Wasm support, to be able to take WebAssembly runtimes natively in the engine and do the transformations locally. Right? You don't have to move the data out. So there's quite a lot of that support built into the system. While we work with the ecosystem, a lot of your needs probably don't even need those tools, because we can do quite a lot of the transformations locally. And, again, with the pipelines, the extraction and the loading part gets significantly more seamless.
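As a rough sketch of what consuming a change data capture feed involves, the toy below replays ordered insert/update/delete events to rebuild table state. The event shape here is invented for illustration; it is not Debezium's actual envelope or SingleStore's pipeline syntax, just the core idea of applying a change stream in order.

```python
# Toy CDC apply loop (hypothetical event shape, not Debezium's envelope):
# replaying ordered change events reproduces the source table's state.

change_events = [
    {"op": "insert", "key": 1, "row": {"name": "ada", "score": 10}},
    {"op": "insert", "key": 2, "row": {"name": "bob", "score": 7}},
    {"op": "update", "key": 1, "row": {"name": "ada", "score": 12}},
    {"op": "delete", "key": 2, "row": None},
]

def apply_events(events):
    """Apply insert/update/delete events, in order, to an empty table."""
    table = {}
    for event in events:
        if event["op"] in ("insert", "update"):
            table[event["key"]] = event["row"]
        elif event["op"] == "delete":
            table.pop(event["key"], None)
    return table

state = apply_events(change_events)
print(state)  # {1: {'name': 'ada', 'score': 12}}
```

A managed pipeline feature moves this loop, plus the transformation step, inside the database, which is why a separate ingestion service often isn't needed.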
[00:24:35] Unknown:
And so for people who are used to the idea of, I have my operational store and I've got my analytical store and I'm going to move data between them, there's usually some other aspect of mutation of the data model to make it more conducive to that analytical workload. But because your operational and your analytical workloads can coexist in SingleStore, possibly on the same sets of data, I'm wondering if you can talk to some of the ways that that influences how teams think about data modeling for their applications, so that it is also useful for analytical purposes beyond just the built in capabilities of treating it as a row store in memory and a column store on disk.
[00:25:18] Unknown:
So SingleStore effectively has 1 model: it's relational at the foundation, and, basically, you ingest the data in that 1 model. There's no operational model and analytical model. It's really just 1 thing. So if you think that your requirements are better suited with, let's say, a snowflake schema or something like that, you go ahead and do that. But our query engines are capable enough to provide the operational goals that you might have: you basically put the data in whatever schema you have, and we provide the right optimization based on whatever model you have. It works fairly well for the operational goals as well. You may need to make some adjustments based on the goals of your application. Maybe you are a lot more analytics heavy and a little less on the operational side, so you make those adjustments based on that. And naturally, SingleStore has some capabilities like materialized views, etcetera, and we're working on some of the gaps there. But those will also help to evolve subtle translations between the data models your application might need.
[00:26:16] Unknown:
And because of the fact that it is available as a cloud service, I'm wondering what you see as the ways that teams consume the product, where maybe they have 1 instance deployed for their different application use cases and another 1 deployed for their analytical use cases? Or is it more a matter of them having 1 widely scaled instance with different schema definitions for their different applications, and then they just use those view capabilities or some translations into analytical tables within that same schema space for powering their business intelligence dashboards and things like that?
[00:26:46] Unknown:
It's definitely a mix. You know, we have customers who use it as, you know, purely 1 place for all of the tables with various schemas and various needs. And there are some customers who would have those logical partitions. Right? It depends really on the kind of customer and the type of workload. When you think about the enterprise sensitive workloads, they might do some logical partitioning. But there are lots of SaaS companies who basically mix it all up: all of the data is in 1 place, and you can run all kinds of workloads in 1 place, heavy operations as well as your dashboards, all in 1 place. Right? And that's the beauty of it. You don't need to really change your developer experience or your IT skills to switch between these needs.
[00:27:38] Unknown:
So now your modern data stack is set up. How is everyone going to find the data they need and understand it? Select Star is a data discovery platform that automatically analyzes and documents your data. For every table in Select Star, you can find out who's using it in the company and how they're using it, all the way down to the SQL queries. Best of all, it's simple to set up and easy for both engineering and operations teams to use. With Select Star's data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial at dataengineeringpodcast.com/selectstar.
You'll also get a swag package when you continue on a paid plan. And so we've discussed a number of the different capabilities of the engine: being able to operate as a row and a column store, transactional and analytical use cases, and some abilities for doing data ingestion and building these pipelines. But what are some of the other features and capabilities that are either underappreciated or overlooked by people who are new to the system that you think are worth calling attention to, explaining some of the ways that teams who are used to maybe a Postgres or a MySQL can think about and take advantage of those additional kind of value add features?
[00:28:59] Unknown:
1 of the key things that I would like to focus the customers' attention on is the fact that this is really the only system that has fantastic TPC-H and TPC-DS kind of benchmarks and also a fairly respectable TPC-C benchmark. Right? And that is incredibly novel. Now, of course, that goes back to the fact that we are both an operational and an analytical database. But it's not just any vanilla HTAP database. It is so good that it can compete with the industry grade relational databases, the single node relational databases like Postgres and MySQL on 1 hand, and the industry grade data warehouses like Snowflake, BigQuery, etcetera, on the other. In fact, we beat many of them on these benchmarks.
That's a fact that sort of surprises a lot of our customers when we talk about it, but that is 1 of the overlooked aspects of SingleStore. It's not just a simple HTAP database. It's a lot more powerful. And there are a ton of other features. When you think about these kinds of capabilities, some customers may have the impression that they don't scale. Right? That you're contained to a specific amount of storage, etcetera. Not at all. It's very distributed. We have done tons of improvements, not just on the storage and compute, but also on the query processor. Right? We've invested significantly in our optimizer and execution layer altogether to take advantage of the economies of scale in the cloud and bring these HTAP capabilities to life at distributed scale. Those are a few features that are a little bit surprising for some folks to hear, but they are 100% true.
[00:30:31] Unknown:
In terms of the applications of the technology and your experience working with some of your customers and managing your team, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:30:43] Unknown:
Given the spread of our capabilities, the spectrum of capabilities that we offer, sometimes, you know, customers use it for very interesting workloads. Right? We see this happen. Of course, the common ones are financials and entertainment, the gig economy applications, etcetera. Those are the most common ones. But time and again, we see some interesting ones. Like, you know, there was a face recognition app that was built. The data was all bitmaps, and yet they figured out how our key capabilities were very relevant to them, in terms of how we do our vectorized query execution on compressed data. It worked out really well for them. So those kinds of applications do pop up once in a while. In your own experience of working at SingleStore, what are some of the most interesting or unexpected or challenging lessons that you've learned?
I think the interesting piece is really the engineering trade offs. When you think about achieving these significantly hard goals of trying to become a single store for transactions and analytics, you've gotta make the right engineering trade offs. Every trade off has an interesting history, and how we've evolved it and how we've executed it has always been interesting. It always surprises me whenever I learn more details about it. It is not magic, but it is quite a lot of hard work in terms of getting it done and perfected.
We are just quite ahead of that curve, and that's kind of super interesting. For me, personally, with the Microsoft pedigree, there are challenges in terms of big company versus small company, and a different kind of customer base versus a new era of customer base. So there are differences across the board between Microsoft customers and SingleStore customers, for instance. Right? You know, there have been a lot of challenges and surprises, pleasant ones, in that regard.
[00:32:24] Unknown:
And as far as those customer conversations, because of the fact that SingleStore is fairly unique in the marketplace as far as the set of capabilities that it provides, I'm curious what you have come across as some of the points of customer education that you've had to keep coming back to, or the typical sticking points that you run into as people start to, you know, push back against the idea that SingleStore is the right answer for their problem?
[00:32:50] Unknown:
I think customers have, let's say, in some cases, already settled into a certain kind of architecture. They have tooling to do the ETL. They already have tooling to cleanse the data in a certain way, curate it, and put it into a warehouse. And then they already have certain licenses for query engines, etcetera. You know? And when you come along and try to disrupt that, there's a bit of friction in terms of how you move. All of that is already there, and then, you know, you sort of bring this in. And there's some amount of adoption challenge in terms of how you move from all these pieces into 1 piece. Right? So we talk about the value we bring in incrementally and try to, sort of, help you take the entire stack and simplify it.
Those kinds of conversations end up with, what's the catch? Right? What exactly is it that we are giving up? We basically have a conversation which is rooted in the fact that it is as simple as really making the engineering trade offs, and you're not really losing much. Right? And that explanation takes a little bit of time. Right? It is sort of a sticking point that comes up over and over again.
[00:33:54] Unknown:
Given the fact that databases can be so sticky, because that's where all of your data lives, and doing a migration can be very difficult and potentially risky or error prone, I'm wondering what are some of the ways that SingleStore is able to simplify that process of saying, okay, you're coming from database X, it's going to be very easy, we'll, you know, just consume the write-ahead logs or the binlogs to replicate into our engine, and then you just cut over once everything is synced up and you're happy with it. I'm just wondering what you see as some of the ways that you're able to ease some of those concerns and simplify that migration process for people who have existing workloads and just want to be able to add in the new capabilities that SingleStore provides.
[00:34:38] Unknown:
Yeah. I mean, we're very committed to helping customers not feel like they're locked into a vendor. To start with, we are completely cloud neutral. You could basically go into any of the clouds. You could, in fact, have the data spread across 1 cloud and the other, etcetera. So we really don't want you to be locked into anything. We've embraced MySQL wire compatibility, which is obviously very open source friendly; you know, it has its flavor, but it makes it easy in terms of how you move data out of our system, into our system, etcetera. Your applications and drivers are largely compatible. So what happens inside the system is significantly different than how, let's say, MySQL does it or Postgres does it. But we have the interfaces and APIs to import and export in an agnostic, very open source friendly fashion, effectively.
And, you know, we're working on lots of features in terms of enabling data sharing and accessing the data easily, and, you know, we'll have more to speak about over the course of the next few months that will make it easy for customers to really just ingest and export effectively.
[00:35:42] Unknown:
For people who are looking at a choice of which database engine and which database technology to use, what are the cases where SingleStore is the wrong choice?
[00:35:52] Unknown:
I think really the only case is where you have a simple application and you're very comfortable with a relational database that's single node, like, you don't need to scale, etcetera. And you just have vanilla SQL kind of needs. You don't even have multi model kind of needs, no modern application needs, etcetera. Then you might be happier with what you have; a vanilla Postgres or MySQL might do. But most of these applications are evolving. Even those single node simple applications do want insights, etcetera. So things are changing quite a bit there. But, you know, that's kind of the starting point where it might not be needed. Right? And as I said, you know, the value addition is much more than just scaling or, sort of, combining transactions and analytics. It's more than that. So things are changing fast.
[00:36:36] Unknown:
As you continue to iterate on the product and try to grow adoption, what are some of the things you have planned for the near to medium term or areas of focus that you're excited to dig into?
[00:36:47] Unknown:
We are announcing a couple of features pretty soon. You know, I talked about Wasm, which is basically, you know, running a WebAssembly runtime natively inside the engine. We are doing workspaces, which are basically compute-to-compute isolation, meaning you can bring in different kinds of compute on the same data. You could, on demand, attach more compute and throw it away if you don't need it. And you share the data seamlessly that way, effectively. We basically are committed to building seamless, consumption oriented elasticity.
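The compute isolation idea he describes can be sketched in a few lines of Python. The names here are illustrative, not SingleStore's actual workspace API: the point is simply that several compute units can attach to a single copy of the data, and dropping one removes only compute, never the data.

```python
# Toy sketch of compute/storage separation (illustrative names, not
# SingleStore's API): several "workspaces" attach to the same shared
# data without copying it, and can be dropped when no longer needed.

shared_data = {"orders": [("o1", 40), ("o2", 60)]}  # one copy of the data

class Workspace:
    """An isolated compute unit that reads shared storage by reference."""

    def __init__(self, name, data):
        self.name = name
        self.data = data  # a reference to shared storage, not a copy

    def total(self, table):
        return sum(amount for _, amount in self.data[table])

# Attach two independent compute units to the same data.
reporting = Workspace("reporting", shared_data)
adhoc = Workspace("adhoc", shared_data)

assert reporting.data is adhoc.data  # same storage, zero copies
print(reporting.total("orders"))     # 100

# "Throw away" a workspace: only the compute goes away; the data stays.
del adhoc
```

This is the same separation-of-compute-and-storage pattern that lets a reporting workload and an ad hoc workload scale independently without interfering with each other.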
We are committed to building significantly higher availability, sort of like disaster recovery across regions, multi zone, etcetera. We're committed to building a compelling developer experience, right, and making sure that we push the boundaries on the cloud first. You know, the fiber of the company is to make sure that we push the vision of SingleStore, which is really having both transactional and analytical in 1 place. So we do more and more work on evolving our query engines and query optimizations, in terms of being really fast, because that's sort of the core tenet of the company. So those are always there. Those are foundations for us. But real time analytics, high availability, seamless elasticity, and a compelling developer experience are some of the common themes that you would hear in our engineering hallways.
[00:38:01] Unknown:
Are there any other aspects of the SingleStore product, or the capabilities that it provides, or the ways that it's being applied that we didn't discuss yet that you'd like to cover before we close out the show?
[00:38:12] Unknown:
I think, again, you know, as I talked about, Wasm and workspaces, for instance, are coming, and we've done a significant amount of work in terms of building a compelling, very performant query experience across multiple nodes. Right? These are capabilities that are really not available in the market at the scale and the perf that SingleStore offers. Those are a few things that I'd call out in the near term. And then as we go along and as we start building, I'd love to come back and talk about them in the near future.
[00:38:38] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:38:53] Unknown:
So I'll start with the gap, and then I'll talk a little bit about the challenges in the other direction. We are so focused on the modern data stack, across all these various boxes in terms of the sources, the analytical dashboards, and the whole ELT, ETL in between, that I think there's less focus on the governance and security aspects. I'm not saying it's completely lacking, but we need to do more on the governance of the data and security, which is becoming a significantly important topic. And it gets more complicated with the expansion of these pieces. Right? And that's an area where we need to be a lot more careful, in terms of really questioning every aspect of those pieces. With every box and arrow that you draw, you need to be conscious of the fact that your data is now exiting from 1 premise to the other, so you need to do more surface area threat modeling and all those pieces. And governance, lineage of the data, etcetera, becomes harder. So those are the negative corollaries of the evolution of the modern data stack. I think that's the area where I wish we focused more, and it plays into the vision of SingleStore, which is to simplify, so you get to do your governance and security and those pieces in 1 place and do it with simpler tools and more efficiently.
[00:40:06] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you've been doing at SingleStore. It's definitely a very interesting project and interesting engine. Definitely great that it is available on the market to help simplify the operational and analytical architectures for engineering teams. So I appreciate all of the time and energy that you're putting into helping drive that product forward, and I hope you enjoy the rest of your day. Thank you, Tobias.
[00:40:30] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Shireesh Thota: Background and Experience
SingleStore Product Overview
Database Market and Specialization
Unique Capabilities of SingleStore
Lessons from Microsoft and Application to SingleStore
Changes and Updates in SingleStore Since 2018
Macroscopic Trends in Data Management
SingleStore's Role in the Modern Data Stack
Data Modeling and Application Considerations
Customer Use Cases and Applications
Migration and Adoption Challenges
Future Plans and Focus Areas for SingleStore
Closing Remarks and Final Thoughts