Summary
One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented, and analytical, high volume, workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management.
Preamble
Contact Info
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented, and analytical, high volume, workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
- And the team at Metis Machine has shipped a proof-of-concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom built by your data science team. If you think that sounds awesome (and it is) then join the free webinar with Metis Machine on October 11th at 2 PM ET (11 AM PT). Metis Machine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau. Go to metismachine.com/webinars now to register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Nikita Shamgunov about MemSQL, a newSQL database built for simultaneous transactional and analytic workloads
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what MemSQL is and how the product and business first got started?
- What are the typical use cases for customers running MemSQL?
- What are the benefits of integrating the ingestion pipeline with the database engine?
- What are some typical ways that the ingest capability is leveraged by customers?
- How is MemSQL architected and how has the internal design evolved from when you first started working on it?
- Where does it fall on the axes of the CAP theorem?
- How much processing overhead is involved in the conversion from the column oriented data stored on disk to the row oriented data stored in memory?
- Can you describe the lifecycle of a write transaction?
- Can you discuss the techniques that are used in MemSQL to optimize for speed and overall system performance?
- How do you mitigate the impact of network latency throughout the cluster during query planning and execution?
- How much of the implementation of MemSQL is using custom built code vs. open source projects?
- What are some of the common difficulties that your customers encounter when building on top of or migrating to MemSQL?
- What have been some of the most challenging aspects of building and growing the technical and business implementation of MemSQL?
- When is MemSQL the wrong choice for a data platform?
- What do you have planned for the future of MemSQL?
Contact Info
- @nikitashamgunov on Twitter
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- MemSQL
- NewSQL
- Microsoft SQL Server
- St. Petersburg University of Fine Mechanics And Optics
- C
- C++
- In-Memory Database
- RAM (Random Access Memory)
- Flash Storage
- Oracle DB
- PostgreSQL
- Podcast Episode
- Kafka
- Kinesis
- Wealth Management
- Data Warehouse
- ODBC
- S3
- HDFS
- Avro
- Parquet
- Data Serialization Podcast Episode
- Broadcast Join
- Shuffle Join
- CAP Theorem
- Apache Arrow
- LZ4
- S2 Geospatial Library
- Sybase
- SAP Hana
- Kubernetes
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
[00:00:11]
Tobias Macey:
Hello, and welcome to the data engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to data engineering podcast dot com slash linode today to get a $20 credit and launch a new server in under a minute. And you work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform for Metas Machine was built to give your data scientists the end to end support that they need throughout the machine learning life cycle. Scaphos maximizes interoperability with your existing tools and platforms and offers real time insights and the ability to be up and running with cloud based production scale infrastructure instantaneously.
Request a demo today at data engineering podcast dotcom/metis dashmachine to learn more about how MetisMachine is operationalizing data science. And the team at MetisMachine has shipped a proof of concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom built by your data science team. If you think that sounds awesome, and it is, then join the free webinar with MetisMachine on October 11th at 2 PM EST 11 AM Pacific. MetisMachine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau.
Go to metis machine.com/webinars now to register. Go to data engineering podcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at data engineering podcast.com/ chat. Your host is Tobias Macy. And today, I'm interviewing Nikita Shamganov about MemSQL, a new SQL database built for simultaneous transactional and analytical workloads. So, Nikita, could you start by introducing yourself? Yeah. So my name is Nikita Shamganoff.
[00:02:12] Nikita Shamgunov:
My role here is the CEO, and, have a long history here, at memSQL. And prior to that, I was at Facebook. And before that, I was a SQL Server, working on the kernel. I think it also it makes sense to to share my, academic history. So I actually was I grew up and got educated in Russia, and my passion then was doing programming competitions. So that carried me through my high school years, then college years, and then I went into, the academic track where, I went to Saint Petersburg University of Fine Mechanics and Optics and did my PhD in computer science. And right after the graduation, in fact, I got, an offer a a few months before the graduation. I got on the plane and joined Microsoft to start working on SQL Server. And so it sounds like that was your initial on ramp to getting involved in the area of data management?
Yeah. Absolutely. So I work on on on some of the aspects of that, during my PhD program. But, really, the, the trial by fire was certainly at SQL Server where not only I was exposed to, the area of data management, but in a broader way, started working on on systems. And I think there is a a a very specific set of skills that goes into engineering, hardcore, complex systems
[00:03:36] Tobias Macey:
written in c and c plus plus and built for performance and reliability. So certainly, SQL Server was a a very, very good training ground for me. Yeah. Getting thrown in at the deep end is usually 1 of the most effective ways to learn something very thoroughly because it's either, sink or swim, and that is a is a great inspiration for taking the time to fully internalize what's happening under the covers. Yeah. Absolutely.
[00:04:00] Nikita Shamgunov:
And the experience, like you said, was was really going into the deep end. A few weeks into the job, when I just arrived from Russia into the United States, I was spending most of my time in in the debugger, basically understanding, that SQL Server kernel, fixing bugs. And at the time, we were working on stabilizing 1 of the most ambitious releases of SQL Server, which is SQL Server 2, 005, 5 years in the making, which is really unheard of in in modern days. Like, nobody takes 5 years to ship a piece of software. But, you know, again, at the time, we were on this death march of driving the quality of of the product up after building, a lot of features that went into into that product. So that was that was a lot of fun, actually. And so
[00:04:46] Tobias Macey:
that was what set you on the road to ultimately working on MemSQL. So I'm wondering if you can describe a bit about what MemSQL is and the motivation
[00:04:58] Nikita Shamgunov:
for starting the project and building a business around it. Absolutely. So MemSQL started with with a few insights. And the insight number 1 is a several attempts of building a scalable transactional system inside Microsoft. I was not part of that effort, but, my my cofounder and chief architect, Adam, was. And and while we were there working on the project called Matrix DB, it was an incredibly frustrating experience that we actually could not ship that thing. And then when I joined Facebook, it became very apparent to me what the power and the value of such system is, and we went on, on a journey to build a new database. You know, we were very passionate about about building, databases and building systems. And I remember having a conversation back then, and the conversation literally was the following. Well, it's actually a lot easier to build an analytical system. But we thought that at the time the game was over. You know, history proved us wrong. Right? You know, since, there's a lot of cloud data warehouses came into existence. But we were looking at the market and seeing a lot of companies like Data Allegro, Vertica, Netezza, Astrodata, being acquired. And so we're like, okay. Well, it's much harder to build a transactional distributed system, so let's just do that. And then we went on on their journey having just a few insights. A, we believe that there was more of an opening to build a scalable transactional system.
2, we saw the value, of such system by looking some of the workloads inside Facebook. Most of such workloads that we were exposed to were in, around real time analytics. Basically, you need a transactional operational system to withstand withstand, high volume of inserts, updates, deletes that that hitting the system. But at the same time, the the the value is to, actually drive insights, drive analytics. And then the last insight was the unsuccessful, attempt of Microsoft of building a distributed system with a starting point, SQL Server. And, the competition for Microsoft at the time was Oracle RAC and on Oracle Exadata. And we and we kind of believed that there is value in building such a system. And we thought if we build it, then there will be a market for for the system because, obviously, Microsoft is trying very, very hard to build 1. So that's that's how we started. And then, the amount of technology that goes into a product like this is is is absolutely enormous. So you need to pave a path, and we made a bunch of assumptions to get to basically version 1.0 to not only build such a complex product, but being able to drive revenue on the way to the ultimate vision, of being fully scalable, elastic, transactional system that can support transactions and analytics and and, in 1 product. So that assumption, the first 1 was, let's build an in memory system. There was a lot of research at the time about building in memory systems. And also there's another assumption that that you can make that the cost of RAM is gonna go down so dramatically that everybody will be running, transactional systems in memory. That assumptions turn out to not be true.
And if you compare the the cost of RAM versus the cost of flash, for example, the the cost of flash, you know, dropped all the way down to the ground. Versus the cost of RAM, you know, RAM got cheaper, but not orders and orders of magnitude cheaper compared to flash. But yet, in memory was actually a very good starting point for us because it allowed us to onboard various high value workloads. So you can drive revenue with relatively small number of customers solving exotic problems, problems that those institutions have and they're willing to put money and faith into a new a new startup to solve, to solve those problems.
[00:08:56] Tobias Macey:
And how much of your product road map ended up getting driven by those early customers with the exotic problems? And how much of it was a matter of going and doing customer research with people who had more standard workloads to try and figure out what was the best direction for driving the company? This is a very good question. So the beauty of the situation that where where we ended up, early on is the exotic angle of those problems were only around performance,
[00:09:27] Nikita Shamgunov:
but not around the surface area. So the road map very nicely aligned with expanding the surface area. And the surface area of a transactional SQL systems is massive. Not only you need to build your storage features and clustering features, but you also need to provide very, very good SQL support, which is a combination of just a very broad language, SQL, but also, investments into query optimization and query execution and really understanding the universe of all possible workloads and doing a very good job workload by workload and on shipping new releases of of MemSQL. So to answer your question, the exotic part is just performance and scalability. And then the general purpose part of having a very good support of SQL, having an extremely reliable product, and I can talk for a long time about quality of, databases and systems in general. So so we found ourselves actually in in a in a in a decent spot where we're not torn expanding the exotic nature of the workloads versus,
[00:10:33] Tobias Macey:
building general purpose features. Those kinda align very nicely for us. And now that you have iterated and put out a number of additional releases from that 1.0 milestone. What are some of the typical use cases that you see for customers who are running MemSQL and people who are in conversations with you to add it into their infrastructure?
[00:10:57] Nikita Shamgunov:
So you can you can parse this question and answer it in terms of tiers. So at the very, very high level, we support general purpose database workloads that require scale. We give you unlimited storage and unlimited compute. And in 1 product, in an elegant way, we can scale your your applications, which is a very stark contrast to your existing, existing single node systems or legacy systems, the Oracles, the SQL Servers, the Postgreses of the world, where the amount of storage and compute is limited. Now this is too general. So what is the the next iteration of answering that question? Well, the second 1 is, like, let's look at the very, specific workloads, but still wear a a technology hat. The first 1 is real time analytics. You have a data coming at you either through an application or through a message bus such as Kafka or in the cloud Kinesis, and you need to provide, insights into that data in real time. It's supremely important to know what's happening with that data within maybe milliseconds or seconds or or minutes, but, you know, in a short period of time. So, and we call this workload real time dashboards. And real time dashboards are important in financial markets. They're important in health care. They're important in packaged goods and, in in web 2.0. So real time dashboards is something that MQL does, incredibly well. So this is 1 workload. The second workload is there's a web point, 2 0 workload that requires scale because of concurrency, and that's where giving infinite compute allows you to build your system with much fewer moving pieces. You don't need to cache.
You don't need to shard your database. You can just deploy MaplesQL, and it will drive arbitrary level of concurrency for your, web 2 put 2 0 or mobile application. So we have a number of workloads like this as well. And specifically, the moment you add reporting into an application like this, you just have a much easier way of burning compute. So, for example, in the financial markets, there's a number of use cases such as portfolio analysis, or wealth management, where you log in and and you see your assets and and reports. And reports, January reports burns, a nontrivial amount of compute. So if you want to deliver great user experience and lower latencies, you actually need a scalable system. So, so that's another kind of class of applications where, there's a lot of concurrency on the other end. And then finally, there is a set of applications that are closer to data warehousing and data marts where you just need to to produce a report and you need to produce, that report quickly. And in those workloads, the advantage over data warehouses is really that you can do it on top of the same data that powers your applications. So you don't need to extract your data, put it into the data warehouse, but you, but you do it on the data that sits in the database. So the advantage is really simplicity. So that that's more from the technology standpoint. And now finally, there are use cases by industry.
That that's the third layer. And we're mostly, the most successful in financial services and media. But we but recently, for example, we also see health care coming in, more and more into the platform. So and the specific use cases by industry are different. So for example, in the financial services, we see wealth management, portfolio management, risk analysis taking a substantial portion of the workloads that we support. Versus in media, it's all about advertising and streaming, Understanding the quality of service, understanding the performance of your ads, and doing that in real time, and and and and really driving decisions that will change the parameters of your advertising campaigns based on what's actually happening. And finally, across many, many industries, but particularly in media, it's real time reporting of the performance of your, of your campaigns or things like pricing and, and billing. So there there are companies that have massive big data workloads.
[00:15:18] Tobias Macey:
And in order for them to provide billing that's up to this second, you require something like memSQL and nothing else will work. Yeah. And what you were describing of being able to build a transactional application on top of a database and then be able to build reports on that same data. If you're not using a system that can support both of those workloads in tandem, then you would generally be doing something like change data capture, piping through maybe a Kafka topic that then lands into a data warehouse to be able to have some transformation that rolls it into an aggregated report, which automatically adds latency because of all the additional steps. Whereas being able to run across the same records as they're being written reduces the overall amount of time and effort in terms of building the reports and in terms of supporting the data life cycles.
And I also know that MEM SQL is built to be multi tenant, so I'm wondering if the typical case is that it will 1 deployment of MEM SQL will be used for backing multiple different applications, and then reports are built to be able to combine the data across those applications
[00:16:24] Nikita Shamgunov:
in that engine, or if it's something where an additional step is necessary to be able to aggregate the data across those different applications because of some boundaries in place because of the multi tenancy? Yeah. So this is this is a a loaded question. So let me answer 1 by 1. So first of all, you're absolutely right. And in many cases, in order to provide reporting on top of the operational data, you need to set up an architecture that looks like having changed data capture, piping it through an ETL tool, or in in your case, to Kafka, pushing into a data warehouse.
Data warehouse is not transactional, so we need to be very diligent into making sure that the data that goes into the data warehouse is in fact a a real copy of the data produced by the operational system. You can't just apply a change data capture log to a data warehouse, on a high high throughput, high performance transactional system. And the the second question was, around multitenancy, and we see a mix, actually. We see there's a there's a class of workloads where we which which we call tier ones. And tier 1 workloads are they have very, very strict reliability and performance requirements.
They usually have to run 24 by 7. The uptime has to be 5 nines. And, the latency and then especially p 90 latencies have to be very, very consistent. And so what ends up happening, the tier 1 workloads are typically run-in separate clusters of MIM SQL because, they're so important to the business that that, any disruption that will affect those p 9 latencies or any disruption that will, be affected by bringing new applications into the platform cannot be tolerated. And then for tier 2 workloads, that's precisely right. People just bring a lot a lot of applications into the platform and drive a lot of ROI through consolidation.
Because Minty was scalable, so you can bring, and gives you so much compute and storage. So you can just bring a lot of applications into the single platform. Now on from the product standpoint, that motion, which we we're we're very, very excited about. Right? Because, what we see in a typical enterprise is a problem that we call proliferation of data stores. A very typical conversation with something like a big bank or a very large, enterprise company or a very large, you know, world class technology company, when you talk to their DevOps people, they say, look. It looks like our business team is falling in love with your technology, but we are a DevOps team. And we're currently supporting, you know, 40 plus databases.
If we bring you in, something needs to go. And this is a great conversation because we typically see several, not just 1 technologies, especially legacy technologies disappear once people onboard,
[00:19:25] Tobias Macey:
Mentee. And in addition to being able to run both the transactional and the analytical loads on the single engine, there are also additional features built into memSQL itself, such as an ingestion pipeline for being able to consume data from external sources and pull it in for being able to build those reports.
[00:19:46] Nikita Shamgunov:
And I'm wondering what the benefits are of having that built into the database engine versus using some of the preexisting platforms and tooling such as the ETL systems or streaming data systems such as Kafka or Spark? Yeah. This is a great question. And you you can either, put a purist hat when you start looking at this, from the product standpoint and say, look. We're we're building a perfect engine, and we're gonna focus on the engine and then hope that the ecosystem will build tools around that engine to to use it efficiently. And then there's reality. And reality is such that, the market and the set of technologies are moving so fast that the customer is confused. There's too much stuff, and that stuff is not necessarily working very well. I guess there are clusters of technologies that are working very well together, some around clouds, some around the Hadoop ecosystem, some around kinda old school legacy databases.
But the observation here is that in the past, we had a well defined interchange protocol. It's called ODBC. And once you implement ODBC, you automatically compatible with a set of tools that matter. And in the past, those tools were mostly, you know, databases, data warehouses, ETL tools, caches. Now if you look at the data landscape now, you realize that a lot of data sources have become distributed. We're living in the world of, distributed primary storage, such as enterprise storage, secondary storage such as object stores, which is distributed message buses like Kafka and Kinesis. And there is no such thing as a distributed ODBC, and, which means that that you have a choice. Right? You either wait or you bring that ingest ingestion capabilities as a first class citizen, which for the customer means simplicity.
And if you have data coming from Kafka or you have data, in s 3 or HDFS, you can literally type 1 command in MIM SQL, and this data is flowing in MIM SQL in real time with arbitrary scale. It is very, very hard to achieve the same level of integration and performance by relying on a third party ecosystem, which, you know, frankly, is very different from customer to customer. Some people want Informatica. Some people want Spark. Some people want NetProduced jobs. So that's really why
[00:22:12] Tobias Macey:
we pulled in data ingest as a first class citizen into the engine. And what are some of the typical ways that you see your customers using that ingest capability built into the engine where they don't necessarily have to bolt on these additional systems to be able to provide additional data sources to the primary storage system that they're using within MemSQL?
[00:22:37] Nikita Shamgunov:
Yeah. Great question. So the number 1 is simply load data from SVO and GFS. This is not particularly groundbreaking, but it's incredibly easy to do, and incredibly performant to do it in The second 1 is extending that capability to supporting Kafka. So now you can just point at Kafka and say, hey, Move that data into a table in memSQL, and we'll do it, and we'll do it with great performance. You know, we have customers that are moving that data into memSQL tables multiple gigabytes a second. And then the the the third tier to actually grabbing data from a streaming data source such as Kafka and moving into a stored procedure.
So now there's control flow, and you can grab a data stream and then transform it and then transform it and write it back into MIM SQL or write it back into s 3 or HDFS. And that supports a large variety of streaming workloads, workloads like, oh, I wanna scrape Amazon prices, which are constantly changing. And I wanna match those prices, or I wanna make sure I beat those prices for the most traded products that are in Amazon. And with Mansequel, it's trivial to implement that application. Well, as long as you as you figure out how to scrape the prices and put them into Kafka. So now once they're in Kafka, you can grab that data stream, move it into a stored procedure, run a bunch of SQL commands. And in inter transactional systems, you can have running accounts or you can, you know, store all the historical data, and you can crash reports over that historical data with very high performance. And so, fundamentally, there in order to build such a high value application, you need just a few moving parts. And on that ingestion path,
[00:24:28] Tobias Macey:
do you automatically introspect the schema of the data records that are flowing in, or do you require some integration with, for instance, the Kafka schema registry for being able to determine the shape and types of data that are coming in? And then lastly, curious what the serialization formats are that you support for that ingestion workload.
[00:24:50] Nikita Shamgunov:
So this is this is evolving. Right now, we support CSV, TSV format, Avro, and JSON. And in the future, we'll be adding, Parquet, which seems to be, you know, 1 of the most requested features. And in terms of the schema, MemSQL is a relational database. So data is structured, but it also supports semi structured data. We support a JSON data type. And when we store JSON in MemSQL, we automatically parse it and shred it, and store it in in in a column store. So what it gives, the user, it gives you the user the ability to have as much structure as the user needs and wants. But at the same time, it doesn't sacrifice performance. So we don't we don't to using schema to a user, but there are certain advantages in using schema mostly around discoverability and compatibility with BI tools, where if you schematize your data and then you attach a BI tool, you you clearly see what's there, and then you can, easily, craft visualizations, and it simplifies building apps as well. As for Kafka schema registry, that is something that is on the road map, but not currently supported with the SQL
[00:26:07] Tobias Macey:
65. And moving on to the structure of the database itself, I'm wondering if you can discuss how it is architected internally and how that has evolved over the years that you've been developing it from that first 1 release.
[00:26:22] Nikita Shamgunov:
Yeah. Definitely. So a lot of things stayed starting 1 0 release, and a lot of things evolved. 1 of the things that we quickly realized, and back to my question, back to my comment around the RAM prices is that RAM is not enough. And, we made a decision back then to invest into very dense on disk data structures that would support, reporting as well as, as operational workloads. And, we had a choice back then what kind of data structures to introduce, and we landed on on column stores. In hindsight, I think this was a very, very good decision, but it took us years to get it into a place where we're really happy. And I think it's in a good spot right now, but wait for another year or so. And, it's just gonna be magical. So, the architecture is is a distributed architecture, and and the hard parts are transactionality.
And the other hard part is, query processing. Because query processing, in order to be efficient, requires to to as you, run and execute a specific query, it requires you to move data around the cluster. And then as opposed to a single node system where 1 of the hardest components is a query optimizer, in a distributed system, it's still 1 of the hardest components, but the number of choices is much greater than it is in a single node system. Where in on a single node system, you think about join order. It's kinda 1 of the key components of a query optimizer. Then you think about rewrites of, you know, decorrelating subqueries and whatnot. In a distributed system, the numb the number of choices for your joins is greater because you introduce the notion of a broadcast join, where you take a smaller table and and send it to to all the nodes in the cluster to perform a join or, a shuffle join where you move data in the system before you join, 2 tables. As well as you can say, oh, the dataset is relatively small, and I'm just gonna bring all that data into 1 node and perform the join there. So all of that are choices, and the query optimizer working in the in tandem with the query execution system have to have to make those choices, cost them, and at the end of the day, deliver on wide variety of workloads starting from very simple ones, you know, kinda key value workloads to mixed workloads to mostly analytical and reporting workloads.
So so the architecture, of the system has to be very that, you know, you can you can build a a, and then storage query optimization and query execution has to work in tandem in in such a cluster. And given that it is distributed
[00:29:13] Tobias Macey:
by nature, I'm wondering where it falls along the axis of the cap theorem since there are some gradations along there that you can optimize for? Absolutely.
[00:29:25] Nikita Shamgunov:
Well, it's a CP system. Right? And the and we sacrifice, a little bit of unavailability or availability to, deliver on consistency and partition tolerance. In in practicality, there's a trade off. Right? So do you do you push complexity on the application developer and deliver on the a part of the cap theorem, or, you build a consistent system? And it has to be partition tolerance, of course, because, you know, the greatest sin of a database is to lose data or produce wrong results. So that's really a choice. And in practice, and we know that by running tier 1 workloads for some of the most demanding customers, a 100% availability is impossible. Even though, even if the architecture allows that, there are certain things that you cannot control, that that you cannot build a 100% available system. So it's actually a very natural choice to build a CP system because, by doing things like adding more resiliency on the hardware side, adding more resiliency in the software, adding auto healing features.
You actually drive the the total availability of the platform way above the 5 nines. And going beyond that is is pretty much impractical even if you have a perfect architecture and have an AD system.
[00:30:50] Tobias Macey:
And as you mentioned, the disk storage is column oriented. And when reading through the documentation, I noticed that the in memory representation is row oriented. So I'm wondering how much overhead is introduced by converting between those representations, between the on disk and in memory representations,
[00:31:10] Nikita Shamgunov:
and the benefits of having those different views of the underlying data? This is a very technical question, and and I'm gonna give a very, very technical answer as well. So first of all, when when you run a lot of updates, so you run a lot of inserts, into a column store table, it does make sense to have a row store representation for that highly volatile part of the data. And then on the right code path, there is certain amount of overhead of flushing that that row store data into a column store system. Surprisingly, the bottleneck there is IO, not CPU. And the reason to that is there's, you you know, the CPUs are actually very performance and and they're fantastic compression algorithms developed that are very, very cheap on the CPU. So this is this is part 1. Probably, there's gonna be, about 5 parts to, in answering that question. The second part is, okay. Well, we run a query against a column store table, and we're sending the results back. And then we use an immediate computation to produce the results of that query. If it's a reporting query, the amount of data you process is drastically more than the amount of data you send back. And in this case, column stores are the perfect representation because we keep that that vectorized representation throughout the query execution pipeline. So each, database operator is using, running computations on top of what we internally call vector projections.
And you if if you wanna see, have the analogy in the open source world, the project that's called Apache Arrow comes to mind. So all the internal query processing runs on vectors. And then at the very end, we convert that representation to a row store representation because the client expects data coming row by row. There's 1 scenario in which this doesn't work that well, and that scenario is when you're trying to pull out a lot of data out of the system. And the the typical way, where people use it is in data science. They have Python tools, and all they want is to run a simple query maybe with a with a predicate and whatnot. And all they want is to get data out of the system and move it into a Python library or Spark. With Spark, we can do it in parallel. So the problem there is not as pronounced.
But in Python, we need to grab a lot of data in the columnstore format and move it to a Python library that's also processing data into columnstore format. And in the, in the middle, there is a client that consumes data row by row. So this is 1 scenario where I feel like we could do a better job, and there isn't real overhead converting from row store, column store representation to row store representation. Then there is another workload where you run high volume of updates. And in this case, our, our caching systems and and our row store representation is a perfect representation. We have customers running 10, 000, 000 updates a second against our row store representation.
And, but we've basically given you an unlimited capacity, for workloads like this. That's again where row store works super, super well. So, having those 2 representation as, with row store and column store allows us to cover a wide range of workloads. And depending on the workload, you can you can get the best of both worlds. And given this
[00:34:46] Tobias Macey:
duality of the representation for the data, I'm wondering how that translates into any special approach necessary for data modeling and building the schema for the data store or if because of the fact that it's largely SQL oriented and relational in nature, you would just use the traditional techniques and mechanisms that anybody familiar with legacy and traditional
[00:35:13] Nikita Shamgunov:
Siebel databases would be familiar with? Yeah. It's a it's a very good question. So we currently give you a few choices. We currently give you the ability to create a column store table that will reside on disk and be cached in memory, and we give you the option of building in a memory row store table. We're working very hard into merging the 2 over time, and the reason for us to do it is just simplicity. So by giving you that choice, we're increasing the the cognitive overhead, for a developer to understand her workload, 1 notch deeper to make those choices. And then we also give you a knob to define how you distribute data against the across the cluster. And, it's up to you if you wanna use that knob or not.
So barring those, is your standard SQL. And then it's standard SQL, it's query optimization and query execution that makes sure that, you will get great out of the box experience with the product. Are we done? Well, in the future, I I believe databases will will get into a place where there are no choices. Right? You just create tables, and there's no capacity. You don't think about capacity. It's just a cloud API, which is SQL. We, and you can create table and run queries. And what you pay for is, is basically latencies. In a few latencies and and the the volumes of of data stored and processed. We're still far away from that future, but we're absolutely driving towards that future.
[00:36:48] Tobias Macey:
And in terms of the write path for data from the read path, obviously, you have it stored in disk, and then you do that translation to the in memory representation. But for writing the data, I'm curious what the overall life cycle is given that it is a in memory capable database, and you do have that transaction log for persistence and just, how it flows from somebody issuing that request through to it actually being fully persisted on disk?
[00:37:17] Nikita Shamgunov:
Yeah. So, let's say you run, an insert statement, into a table. So when you do that, the data moves into 1 of the nodes, in the database, gets replicated to another node. And once that happens, you get an acknowledgment on the client that, the write went through. With that, you will have a transaction log on every node registering that record. And now that record is appended to the transaction log. So so the data is not going anywhere. So now it's in the database. And then internally, that record will be represented as an in memory record in in that table, and then there is a background process that goes. And once the amount of data in the in memory representations exceeds a certain threshold, it will perform a conversion of the in memory representation to a non disk representation and flush it on disk. So that's, that's really the life cycle. And
[00:38:17] Tobias Macey:
1 of the main aspects that you promote for MemSQL is the speed and performance of the overall system. And I'm wondering, what are some of the techniques that you use for being able to optimize for that overall system speed and performance across the cluster? That's that's a good question. There are a lot of them. So let me
[00:38:38] Nikita Shamgunov:
bucket them into several categories, and I will also give you an idea of what the impact is, of each bucket. So the first bucket is in memory representation and, cogeneration. And in that bucket, if your data is in a row store format in in memory, it's it's trivial to go and navigate to, an individual row. So that that takes very little overhead. But in the context of a of a SQL query that extracts either just 1 record or perform certain computation, 1 of the surprising aspects is that you burn compute on evaluating scalar expressions. And so that's where query compilation and we compile queries into machine code drives a lot of value.
So that's, I would say, bucket number 1. The second bucket is network optimizations. When you move data across the cluster, if you do it in a naive way, you will burn a lot of compute on serializing and deserializing data before and after you move it across the network. So that code path has to be highly optimized. And I would say this is an example of just systems approach to building, to building a database. And the system, approach is understanding where the bottlenecks are and implementing them carefully and efficiently. So in order to to drive value in that bucket, you're just looking a lot at a, at a lot of profile data, by running multiple workloads across the system and, understanding what the highest offenders, offender is and and fixing that. And finally, there is a bucket, around compression and, vectorization, and that applies mostly to column stores and reporting. And that's the most, you know, well architecturally designed and and beautiful system that performs computations on top of compressed and encoded data. And that specifically is, applicable for larger, report inquiries, for product at evaluation, and, for evaluating joints.
So 1 of the recent benchmarks that we published was, you know, trillion rows, a second benchmark. That's only possible by having a very careful implement implementation of operations on compressed and encoded data. And, we obviously didn't just do it for that specific query. We built a a general purpose system that uses, those techniques,
[00:41:12] Tobias Macey:
and drive performance to to the variety of of workloads. I have a question here. I'm not sure if it is an interesting 1 or not about the sort of ratio of open source versus proprietary code used within the project.
[00:41:24] Nikita Shamgunov:
Great. So we're very, very practical. The majority of MEM memSQL. And how much of the total implementation
[00:41:31] Tobias Macey:
of memSQL is done using custom built code versus open source projects that are leveraged within the engine? I never ran the count. Certainly, we,
[00:41:41] Nikita Shamgunov:
we use open source software when it makes sense. The majority of of MIM SQL, that would guess, about 95 to 97%, is built in house. However, there are particular pieces of open source code, that we use in the engine. Specifically from Postgres, we use a parser and binder. So everybody understands what a parser is. You know, you get a new SQL query, and and you need to understand what's in the query. So we need to parse it. The binder is another component that is adjacent to parser that takes the query and and binds it to your metadata. So if you reference a column in your SQL query, you need to know what column that is in in your metadata catalog. We also use a few libraries in our column store representation, things like lz4 for compression purposes.
And we use a geospatial library from Google. I believe it's called s 2, for geospatial capabilities. So that's about it. The rest of the, of the engine, you know, core optimization, core execution, storage, replication, you know, all the SQL supports and built ins are built in house.
[00:42:52] Tobias Macey:
And 1 thing that we didn't touch on yet is the fact that the interface for MemSQL uses the MySQL wire protocol, and I'm wondering what you see as being the benefit of choosing that dialect for the communication layer. And if you have any thoughts of the level of difficulty or practicality of implementing additional wire protocols for people who want to be able to use memSQL as a drop in replacement for other database engines? This is a great question. And people often confuse wire protocol and syntax.
[00:43:26] Nikita Shamgunov:
So wire protocol is, is actually very, very straightforward. You connect to a database. You have a handshake. And then the protocol is I send you a string, and you give me results back. And then there's syntax. And the syntax is, well, what's actually in the string? And the compatibility certainly has 2 parts to it. 1 is the compatibility around the wire protocol, where my my my strong opinion is doesn't really matter what they what the protocol is. And you wanna choose something that's the most, ubiquitous and the easiest to install.
And when we started, we had the option of MySQL or Postgres to be the wire protocol. And we chose MySQL because MySQL, at the time, was the the, the most popular database, relational database in the world. It actually still is, but posers is is is growing rapidly. 2018, we would have a lot more, pressure or or pull to choose post responder protocol. Then there is database compatibility or app compatibility. And a 100% compatibility is only possible if you'll be using the the full top part of the database. So something like AWS Aurora would, would be compatible with MySQL simply because they run the top part of MySQL. And by the top part, I mean everything but storage. Now if we were to choose a a Postgres, wire protocol or any other protocol, that still will be 2 parts. Right? What do you connect to MIM SQL with?
And second is, is it actually compatible with with a 100% post risk applications? So and that's where you have a trade off. You either say, I'm going to innovate in on query processing, and I and for that, I need to build it, from first principles and from ground up, or I will be driving towards app compatibility and no surprises for an application developer. And in this world, you will have to reuse the full code, and building a distributed query processor is actually very, very tricky and and a hard proposition. So we choose we chose the former. We chose to innovate on query processing, and that's why we're not a 100% compatible with any other database.
We support SQL. But if you were want to move, your application from MySQL or Postgres or Oracle, you will have to put some work. Not much, but that work will be required. And for somebody who is interested
[00:46:06] Tobias Macey:
in moving to using memSQL, I'm wondering what's involved in actually getting an instance, set set up and deployed and integrated with their application and environment.
[00:46:17] Nikita Shamgunov:
The setting up and deploying is trivial. We we do a very good job, and once you commit to certain sizes of the cluster, it's basically a push button experience, and then SQL will be running on that cluster. With the, MemSQL has no dependencies, so it runs as a as a Linux binary and, stitches itself, to a cluster. Moving an application really depends of how much this application has ties with, with the database. We're working with some large customers moving data from Sybase and moving data from HANA. And the complexity of that workload, usually depends on the complexity of the stored procedures. In the past, we didn't support stored procedures, and we couldn't even touch those workloads.
But now that we do, there is a migration path to moving the story procedures because, our story procedure, language is, though familiar, still modern, and those legacy engines that there are certain choices that they made to build in your language that just don't make sense in the modern world anymore. So we as we look forward for to providing a great developer experience, building new applications is is very, very nice. Migrating legacy applications will incur some pain mostly around stored procedures, but those are very well understood pains and, with some effort. And if you need with some help with professional services on MQL, it's a it's a very easy proposition.
[00:47:46] Tobias Macey:
And what have you found to be some of the common points of confusion or common difficulties that your customers encounter when they're either migrating to MIM SQL or building on top of it with a new project? Well, it's it's basically understanding
[00:48:01] Nikita Shamgunov:
the distributed nature of the system. Once it clicks, it becomes, we we start to have a lot of fans on the other side of the, of the enterprise. We call them champions. So getting around, and making a leap from a single node system to a distributed system is the hardest. And once you're there and once you realize the value, then not only we have kinda passionate fans in the enterprises,
[00:48:29] Tobias Macey:
they tend to get promoted too. And going back a bit to earlier, you were saying how MemSQL provides, limitless compute and storage. And I'm wondering what the capacity is in terms of auto scaling natively to the database versus how much of that is necessary to be implemented as another layer that the operations team manages? And then also, I'm curious what the licensing model is in terms of the way that the in memory capability plays into the overall system. Great question. So,
[00:49:03] Nikita Shamgunov:
so MIM SQL is scalable but not fully elastic yet. So there there are limits to how far you can expand the cluster from the starting point. And, and, typically, you can, if you commit to a cluster of size, let's say, 5 nodes, you will be able to comfortably grow it to about 20 nodes, but not to a 1000 nodes. And, but that's basically a feature, and that feature is on the road map. And, few customers actually want to have that level of elasticity, and, that's why the feature is on the road map and and will, in the future, make Memseeker fully elastic. They
[00:49:43] Tobias Macey:
just curious about the licensing model as far as in memory versus on disk and, how that plays in. Yeah. And the licensing mode model is per node. So it's, it's very very straightforward.
[00:49:55] Nikita Shamgunov:
So you basically take the number of nodes and multiply it by a number. And what have you found to be some of the most challenging
[00:50:02] Tobias Macey:
or unexpected aspects of building and growing the technical and business aspects of MemSQL?
[00:50:09] Nikita Shamgunov:
There are a few. 1 is, we we didn't realize what it we didn't know viscerally what it takes to support tier 1 workloads and what kind of pressure it puts on the quality of the technology and the software as well as the capability of the the organization, which is around support, customer success, and kinda 3 60 customer experience that you need to have to make your tier 1 customer successful. So it was a journey, and, that journey involved, waking up at 4 AM in the morning and, giving, CIOs and CTOs on the other end my my personal phone number, and, building out multiple levels of, support, in house escalation pass, you know, remote offices around the globe to have a a very quick, basically instant response to our most demanding customers and and most demanding kind of 1 of the world's most demanding applications.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And you work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning life cycle. Skafos maximizes interoperability with your existing tools and platforms and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously.
Request a demo today at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. And the team at Metis Machine has shipped a proof-of-concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom built by your data science team. If you think that sounds awesome, and it is, then join the free webinar with Metis Machine on October 11th at 2 PM ET (11 AM PT). Metis Machine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau.
Go to metismachine.com/webinars now to register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at dataengineeringpodcast.com/chat. Your host is Tobias Macey, and today I'm interviewing Nikita Shamgunov about MemSQL, a NewSQL database built for simultaneous transactional and analytical workloads. So, Nikita, could you start by introducing yourself? Yeah. So my name is Nikita Shamgunov.
[00:02:12] Nikita Shamgunov:
My role here is the CEO, and I have a long history here at MemSQL. Prior to that, I was at Facebook, and before that I was on SQL Server, working on the kernel. I think it also makes sense to share my academic history. I grew up and got educated in Russia, and my passion then was doing programming competitions. That carried me through my high school years, then my college years, and then I went into the academic track, where I went to the Saint Petersburg University of Fine Mechanics and Optics and did my PhD in computer science. And right after graduation, in fact, I got an offer a few months before graduation, I got on a plane and joined Microsoft to start working on SQL Server. And so it sounds like that was your initial on-ramp to getting involved in the area of data management?
Yeah. Absolutely. So I worked on some aspects of that during my PhD program. But really, the trial by fire was certainly at SQL Server, where not only was I exposed to the area of data management, but in a broader way I started working on systems. And I think there is a very specific set of skills that goes into engineering hardcore, complex systems
[00:03:36] Tobias Macey:
written in C and C++ and built for performance and reliability. So certainly, SQL Server was a very, very good training ground for me. Yeah. Getting thrown in at the deep end is usually one of the most effective ways to learn something very thoroughly, because it's either sink or swim, and that is a great inspiration for taking the time to fully internalize what's happening under the covers. Yeah. Absolutely.
[00:04:00] Nikita Shamgunov:
And the experience, like you said, was really going into the deep end. A few weeks into the job, when I had just arrived from Russia into the United States, I was spending most of my time in the debugger, basically understanding the SQL Server kernel and fixing bugs. At the time, we were working on stabilizing one of the most ambitious releases of SQL Server, which was SQL Server 2005, five years in the making, which is really unheard of in modern days. Like, nobody takes five years to ship a piece of software. But, you know, at the time, we were on this death march of driving the quality of the product up after building a lot of features that went into that product. So that was a lot of fun, actually. And so
[00:04:46] Tobias Macey:
that was what set you on the road to ultimately working on MemSQL. So I'm wondering if you can describe a bit about what MemSQL is and the motivation
[00:04:58] Nikita Shamgunov:
for starting the project and building a business around it. Absolutely. So MemSQL started with a few insights. Insight number one came from several attempts at building a scalable transactional system inside Microsoft. I was not part of that effort, but my cofounder and chief architect, Adam, was. And while we were there working on the project called Matrix DB, it was an incredibly frustrating experience that we actually could not ship that thing. Then, when I joined Facebook, it became very apparent to me what the power and the value of such a system is, and we went on a journey to build a new database. You know, we were very passionate about building databases and building systems. And I remember having a conversation back then, and the conversation literally was the following: well, it's actually a lot easier to build an analytical system, but we thought that at the time the game was over. History proved us wrong. Right? Since then, a lot of cloud data warehouses have come into existence. But we were looking at the market and seeing a lot of companies like DATAllegro, Vertica, Netezza, and Aster Data being acquired. And so we're like, okay, well, it's much harder to build a transactional distributed system, so let's just do that. And then we went on that journey having just a few insights. One, we believed that there was more of an opening to build a scalable transactional system.
Two, we saw the value of such a system by looking at some of the workloads inside Facebook. Most of the workloads that we were exposed to were around real time analytics. Basically, you need a transactional, operational system to withstand a high volume of inserts, updates, and deletes hitting the system, but at the same time, the value is to actually drive insights, drive analytics. And then the last insight was the unsuccessful attempt by Microsoft at building a distributed system with SQL Server as the starting point. The competition for Microsoft at the time was Oracle RAC and Oracle Exadata. We believed that there is value in building such a system, and we thought that if we built it, there would be a market for the system, because, obviously, Microsoft was trying very, very hard to build one. So that's how we started. Then, the amount of technology that goes into a product like this is absolutely enormous. You need to pave a path, and we made a bunch of assumptions to get to version 1.0, to not only build such a complex product but be able to drive revenue on the way to the ultimate vision of a fully scalable, elastic, transactional system that can support transactions and analytics in one product. The first assumption was: let's build an in-memory system. There was a lot of research at the time about building in-memory systems. And there's another assumption that you can make, that the cost of RAM is going to go down so dramatically that everybody will be running transactional systems in memory. That assumption turned out not to be true.
And if you compare the the cost of RAM versus the cost of flash, for example, the the cost of flash, you know, dropped all the way down to the ground. Versus the cost of RAM, you know, RAM got cheaper, but not orders and orders of magnitude cheaper compared to flash. But yet, in memory was actually a very good starting point for us because it allowed us to onboard various high value workloads. So you can drive revenue with relatively small number of customers solving exotic problems, problems that those institutions have and they're willing to put money and faith into a new a new startup to solve, to solve those problems.
[00:08:56] Tobias Macey:
And how much of your product road map ended up getting driven by those early customers with the exotic problems? And how much of it was a matter of going and doing customer research with people who had more standard workloads to try and figure out what was the best direction for driving the company? This is a very good question. The beauty of the situation where we ended up early on is that the exotic angle of those problems was only around performance,
[00:09:27] Nikita Shamgunov:
but not around the surface area. So the road map very nicely aligned with expanding the surface area. And the surface area of a transactional SQL system is massive. Not only do you need to build your storage features and clustering features, but you also need to provide very, very good SQL support, which is a combination of a very broad language, SQL, plus investments into query optimization and query execution, really understanding the universe of all possible workloads, doing a very good job workload by workload, and shipping new releases of MemSQL. So to answer your question, the exotic part is just performance and scalability, and then there's the general purpose part of having very good support for SQL and having an extremely reliable product, and I can talk for a long time about the quality of databases and systems in general. So we found ourselves in a decent spot where we're not torn between expanding the exotic nature of the workloads versus
[00:10:33] Tobias Macey:
building general purpose features; those kind of align very nicely for us. And now that you have iterated and put out a number of additional releases since that 1.0 milestone, what are some of the typical use cases that you see for customers who are running MemSQL and people who are in conversations with you to add it into their infrastructure?
[00:10:57] Nikita Shamgunov:
So you can parse this question and answer it in terms of tiers. At the very, very high level, we support general purpose database workloads that require scale. We give you unlimited storage and unlimited compute, and in one product, in an elegant way, we can scale your applications, which is a very stark contrast to your existing single node systems or legacy systems, the Oracles, the SQL Servers, the Postgreses of the world, where the amount of storage and compute is limited. Now, this is too general, so what is the next iteration of answering that question? The second level is, let's look at very specific workloads, but still wearing a technology hat. The first one is real time analytics. You have data coming at you either through an application or through a message bus such as Kafka, or Kinesis in the cloud, and you need to provide insights into that data in real time. It's supremely important to know what's happening with that data within maybe milliseconds or seconds or minutes, but, you know, in a short period of time. We call this workload real time dashboards. Real time dashboards are important in financial markets, they're important in health care, they're important in packaged goods and in web 2.0. So real time dashboards are something that MemSQL does incredibly well. That's one workload. The second workload is a web 2.0 workload that requires scale because of concurrency, and that's where giving you infinite compute allows you to build your system with much fewer moving pieces. You don't need to cache.
You don't need to shard your database. You can just deploy MemSQL, and it will drive an arbitrary level of concurrency for your web 2.0 or mobile application. So we have a number of workloads like this as well. Specifically, the moment you add reporting into an application like this, you have a much easier way of burning compute. For example, in the financial markets, there are a number of use cases such as portfolio analysis or wealth management, where you log in and you see your assets and reports, and generating reports burns a nontrivial amount of compute. So if you want to deliver a great user experience and lower latencies, you actually need a scalable system. That's another class of applications, where there's a lot of concurrency on the other end. And then finally, there is a set of applications that are closer to data warehousing and data marts, where you just need to produce a report and you need to produce that report quickly. In those workloads, the advantage over data warehouses is really that you can do it on top of the same data that powers your applications. You don't need to extract your data and put it into the data warehouse; you do it on the data that sits in the database. So the advantage is really simplicity. That's more from the technology standpoint. And now, finally, there are use cases by industry.
That's the third layer. We're most successful in financial services and media, but recently, for example, we also see health care coming more and more onto the platform. The specific use cases by industry are different. For example, in financial services, we see wealth management, portfolio management, and risk analysis taking a substantial portion of the workloads that we support. In media, it's all about advertising and streaming: understanding the quality of service, understanding the performance of your ads, doing that in real time, and really driving decisions that will change the parameters of your advertising campaigns based on what's actually happening. And finally, across many, many industries, but particularly in media, it's real time reporting of the performance of your campaigns, or things like pricing and billing. There are companies that have massive big data workloads, and in order for them to provide billing that's up to the second, you require something like MemSQL; nothing else will work.
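To make the real time dashboard pattern concrete, here is a minimal sketch of the kind of aggregate query such a dashboard might run directly against the live, constantly updated table. The table and column names are hypothetical, not taken from the episode.

```sql
-- Hypothetical real-time dashboard query over live ad impressions,
-- aggregating the last five minutes of data as it is being written.
SELECT campaign_id,
       COUNT(*)  AS impressions,
       SUM(cost) AS spend
FROM impressions
WHERE event_time > NOW() - INTERVAL 5 MINUTE
GROUP BY campaign_id
ORDER BY spend DESC
LIMIT 20;
```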
[00:15:18] Tobias Macey:
Yeah. And what you were describing, of being able to build a transactional application on top of a database and then be able to build reports on that same data: if you're not using a system that can support both of those workloads in tandem, then you would generally be doing something like change data capture, piping through maybe a Kafka topic that then lands into a data warehouse, to be able to have some transformation that rolls it into an aggregated report, which automatically adds latency because of all the additional steps. Whereas being able to run across the same records as they're being written reduces the overall amount of time and effort in terms of building the reports and in terms of supporting the data life cycles.
And I also know that MemSQL is built to be multi tenant, so I'm wondering if the typical case is that one deployment of MemSQL will be used for backing multiple different applications, and then reports are built to be able to combine the data across those applications
[00:16:24] Nikita Shamgunov:
in that engine, or if it's something where an additional step is necessary to be able to aggregate the data across those different applications because of some boundaries in place due to the multi tenancy? Yeah. So this is a loaded question, so let me answer it one part at a time. First of all, you're absolutely right: in many cases, in order to provide reporting on top of the operational data, you need to set up an architecture that looks like having change data capture, piping it through an ETL tool or, in your example, Kafka, and pushing it into a data warehouse.
The data warehouse is not transactional, so we need to be very diligent in making sure that the data that goes into the data warehouse is in fact a real copy of the data produced by the operational system. You can't just apply a change data capture log from a high throughput, high performance transactional system to a data warehouse. The second question was around multitenancy, and we see a mix, actually. There's a class of workloads which we call tier ones, and tier 1 workloads have very, very strict reliability and performance requirements.
They usually have to run 24 by 7, the uptime has to be five nines, and the latencies, especially the p90 latencies, have to be very, very consistent. So what ends up happening is that tier 1 workloads are typically run in separate clusters of MemSQL, because they're so important to the business that any disruption that would affect those p90 latencies, or any disruption caused by bringing new applications onto the platform, cannot be tolerated. And then for tier 2 workloads, that's precisely right: people bring a lot of applications onto the platform and drive a lot of ROI through consolidation.
Because MemSQL is scalable and gives you so much compute and storage, you can just bring a lot of applications onto a single platform. From the product standpoint, that's a motion we're very, very excited about, because what we see in a typical enterprise is a problem that we call proliferation of data stores. In a very typical conversation with something like a big bank or a very large enterprise company or a very large, you know, world class technology company, when you talk to their DevOps people, they say, look, it looks like our business team is falling in love with your technology, but we are a DevOps team, and we're currently supporting, you know, 40-plus databases.
If we bring you in, something needs to go. And this is a great conversation because we typically see several, not just 1 technologies, especially legacy technologies disappear once people onboard,
[00:19:25] Tobias Macey:
MemSQL. And in addition to being able to run both the transactional and the analytical loads on a single engine, there are also additional features built into MemSQL itself, such as an ingestion pipeline for being able to consume data from external sources and pull it in for building those reports.
[00:19:46] Nikita Shamgunov:
And I'm wondering what the benefits are of having that built into the database engine versus using some of the preexisting platforms and tooling such as the ETL systems or streaming data systems such as Kafka or Spark? Yeah. This is a great question. And you you can either, put a purist hat when you start looking at this, from the product standpoint and say, look. We're we're building a perfect engine, and we're gonna focus on the engine and then hope that the ecosystem will build tools around that engine to to use it efficiently. And then there's reality. And reality is such that, the market and the set of technologies are moving so fast that the customer is confused. There's too much stuff, and that stuff is not necessarily working very well. I guess there are clusters of technologies that are working very well together, some around clouds, some around the Hadoop ecosystem, some around kinda old school legacy databases.
But the observation here is that in the past, we had a well defined interchange protocol called ODBC, and once you implement ODBC, you're automatically compatible with the set of tools that matter. In the past, those tools were mostly, you know, databases, data warehouses, ETL tools, and caches. If you look at the data landscape now, you realize that a lot of data sources have become distributed. We're living in a world of distributed primary storage such as enterprise storage, secondary storage such as object stores, and distributed message buses like Kafka and Kinesis. There is no such thing as a distributed ODBC, which means that you have a choice: you either wait, or you bring those ingestion capabilities in as a first class citizen, which for the customer means simplicity.
If you have data coming from Kafka, or you have data in S3 or HDFS, you can literally type one command in MemSQL, and this data is flowing into MemSQL in real time with arbitrary scale. It is very, very hard to achieve the same level of integration and performance by relying on a third party ecosystem, which, frankly, is very different from customer to customer. Some people want Informatica, some people want Spark, some people want MapReduce jobs. So that's really why we pulled in data ingest as a first class citizen in the engine.
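As a rough illustration of that "one command", here is a minimal sketch of a MemSQL pipeline pulling a Kafka topic into a table. The broker address, topic, table, and format options are hypothetical assumptions, and the exact syntax accepted depends on the MemSQL version.

```sql
-- Target table for the incoming events (hypothetical schema)
CREATE TABLE clicks (
  user_id BIGINT,
  url     TEXT,
  ts      DATETIME
);

-- Declare a pipeline that continuously loads the Kafka topic into the table,
-- assuming comma-separated records on the topic
CREATE PIPELINE clicks_from_kafka AS
  LOAD DATA KAFKA 'kafka-broker.example.com/clicks'
  INTO TABLE clicks
  FIELDS TERMINATED BY ',';

START PIPELINE clicks_from_kafka;
```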
[00:22:12] Tobias Macey:
And what are some of the typical ways that you see your customers using that ingest capability built into the engine, where they don't necessarily have to bolt on these additional systems to be able to provide additional data sources to the primary storage system that they're using within MemSQL?
[00:22:37] Nikita Shamgunov:
Yeah. Great question. Number one is simply loading data from S3 and HDFS. This is not particularly groundbreaking, but it's incredibly easy and incredibly performant to do in MemSQL. The second one is extending that capability to supporting Kafka. So now you can just point at Kafka and say, hey, move that data into a table in MemSQL, and we'll do it, and we'll do it with great performance. You know, we have customers that are moving data into MemSQL tables at multiple gigabytes a second. And then the third tier is actually grabbing data from a streaming data source such as Kafka and moving it into a stored procedure.
So now there's control flow, and you can grab a data stream, transform it, and write it back into MemSQL or write it back into S3 or HDFS. That supports a large variety of streaming workloads, workloads like: I wanna scrape Amazon prices, which are constantly changing, and I wanna match those prices, or I wanna make sure I beat those prices for the most traded products on Amazon. With MemSQL, it's trivial to implement that application, as long as you figure out how to scrape the prices and put them into Kafka. Once they're in Kafka, you can grab that data stream, move it into a stored procedure, and run a bunch of SQL commands. And since it's a transactional system, you can keep running counts, or you can store all the historical data and craft reports over that historical data with very high performance. So, fundamentally, in order to build such a high value application, you need just a few moving parts.
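A hedged sketch of that third tier, assuming the MemSQL 6.5-style pipelines-into-stored-procedures feature described here: the topic, table, and transformation are invented for illustration, and the format clauses for parsing the Kafka records are omitted.

```sql
-- History table the procedure writes into (hypothetical)
CREATE TABLE price_history (
  sku     TEXT,
  price   DECIMAL(10, 2),
  seen_at DATETIME
);

DELIMITER //
-- The pipeline hands each batch of records to this procedure
-- as a QUERY-typed parameter that can be used like a relation
CREATE OR REPLACE PROCEDURE record_prices(batch QUERY(sku TEXT, price DECIMAL(10, 2))) AS
BEGIN
  INSERT INTO price_history (sku, price, seen_at)
  SELECT sku, price, NOW() FROM batch;
END //
DELIMITER ;

CREATE PIPELINE prices_from_kafka AS
  LOAD DATA KAFKA 'kafka-broker.example.com/prices'
  INTO PROCEDURE record_prices;

START PIPELINE prices_from_kafka;
```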
[00:24:28] Tobias Macey:
And on that ingestion path, do you automatically introspect the schema of the data records that are flowing in, or do you require some integration with, for instance, the Kafka schema registry for being able to determine the shape and types of data that are coming in? And then lastly, I'm curious what the serialization formats are that you support for that ingestion workload.
[00:24:50] Nikita Shamgunov:
So this is evolving. Right now, we support CSV and TSV formats, Avro, and JSON, and in the future we'll be adding Parquet, which seems to be one of the most requested features. In terms of the schema, MemSQL is a relational database, so data is structured, but it also supports semi-structured data: we support a JSON data type, and when we store JSON in MemSQL, we automatically parse it, shred it, and store it in a column store. What that gives the user is the ability to have as much structure as they need and want, but at the same time it doesn't sacrifice performance. So we don't force the user to use a schema, but there are certain advantages to using a schema, mostly around discoverability and compatibility with BI tools: if you schematize your data and then you attach a BI tool, you clearly see what's there, you can easily craft visualizations, and it simplifies building apps as well. As for the Kafka schema registry, that is something that is on the road map, but not currently supported in MemSQL 6.5.
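A minimal sketch of how that semi-structured support looks in practice, assuming MemSQL's JSON column type and its :: extraction operators; the table and fields are hypothetical.

```sql
-- Columnstore table with a semi-structured JSON payload (hypothetical)
CREATE TABLE readings (
  device_id BIGINT,
  payload   JSON NOT NULL,
  KEY (device_id) USING CLUSTERED COLUMNSTORE
);

INSERT INTO readings VALUES (1, '{"temp": 21.4, "unit": "C"}');

-- Pull typed fields back out of the JSON payload
SELECT device_id,
       payload::$unit AS unit,   -- extract as text
       payload::%temp AS temp    -- extract as double
FROM readings;
```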
[00:26:07] Tobias Macey:
And moving on to the structure of the database itself, I'm wondering if you can discuss how it is architected internally and how that has evolved over the years that you've been developing it since that first 1.0 release.
[00:26:22] Nikita Shamgunov:
Yeah. Definitely. So a lot of things have stayed the same since the 1.0 release, and a lot of things have evolved. One of the things that we quickly realized, back to my comment around RAM prices, is that RAM is not enough. We made a decision back then to invest in very dense on-disk data structures that would support reporting as well as operational workloads. We had a choice back then of what kind of data structures to introduce, and we landed on column stores. In hindsight, I think this was a very, very good decision, but it took us years to get it to a place where we're really happy. I think it's in a good spot right now, but wait for another year or so, and it's just gonna be magical. So the architecture is a distributed architecture, and one of the hard parts is transactionality.
The other hard part is query processing, because query processing, in order to be efficient, requires you to move data around the cluster as you execute a specific query. And as opposed to a single node system, where one of the hardest components is the query optimizer, in a distributed system it's still one of the hardest components, but the number of choices is much greater than it is in a single node system. On a single node system, you think about join order; it's kind of one of the key components of a query optimizer. Then you think about rewrites, you know, decorrelating subqueries and whatnot. In a distributed system, the number of choices for your joins is greater, because you introduce the notion of a broadcast join, where you take a smaller table and send it to all the nodes in the cluster to perform a join, or a shuffle join, where you move data in the system before you join two tables. You can also say, oh, the dataset is relatively small, and I'm just gonna bring all that data onto one node and perform the join there. All of those are choices, and the query optimizer, working in tandem with the query execution system, has to make those choices, cost them, and, at the end of the day, deliver on a wide variety of workloads, starting from very simple ones, you know, kind of key value workloads, to mixed workloads, to mostly analytical and reporting workloads.
So the architecture of the system has to be designed so that storage, query optimization, and query execution all work in tandem in such a cluster.
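As a rough illustration of those distributed join choices, here is a hedged sketch using hypothetical tables. Whether the optimizer picks a colocated, broadcast, or shuffle join depends on the shard keys, table sizes, and MemSQL version, and the real EXPLAIN output will vary.

```sql
-- Both tables are sharded on the join key, so this join can be performed
-- locally on each node; change the shard keys and the optimizer may choose
-- a broadcast or shuffle (repartition) strategy instead.
CREATE TABLE customers (
  id     BIGINT,
  region TEXT,
  SHARD KEY (id)
);

CREATE TABLE orders (
  id          BIGINT,
  customer_id BIGINT,
  total       DECIMAL(12, 2),
  SHARD KEY (customer_id)
);

-- Ask the optimizer how it plans to distribute the join
EXPLAIN
SELECT c.region, SUM(o.total) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.region;
```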
[00:29:13] Tobias Macey:
And given that it is distributed by nature, I'm wondering where it falls along the axes of the CAP theorem, since there are some gradations along there that you can optimize for?
[00:29:25] Nikita Shamgunov:
Absolutely. Well, it's a CP system. Right? We sacrifice a little bit of availability to deliver on consistency and partition tolerance. In practice, there's a trade off: do you push complexity onto the application developer and deliver on the A part of the CAP theorem, or do you build a consistent system? It has to be partition tolerant, of course, because, you know, the greatest sin of a database is to lose data or produce wrong results. So that's really the choice. And in practice, and we know that by running tier 1 workloads for some of the most demanding customers, 100% availability is impossible. Even if the architecture allows it, there are certain things that you cannot control, so you cannot build a 100% available system. It's actually a very natural choice to build a CP system, because by doing things like adding more resiliency on the hardware side, adding more resiliency in the software, and adding auto healing features,
you actually drive the total availability of the platform way above the five nines. And going beyond that is pretty much impractical, even if you have a perfect architecture and an AP system.
[00:30:50] Tobias Macey:
And as you mentioned, the disk storage is column oriented. And when reading through the documentation, I noticed that the in memory representation is row oriented. So I'm wondering how much overhead is introduced by converting between those representations, between the on disk and in memory representations,
[00:31:10] Nikita Shamgunov:
and the benefits of having those different views of the underlying data? This is a very technical question, and I'm gonna give a very, very technical answer as well. First of all, when you run a lot of updates and a lot of inserts into a column store table, it does make sense to have a row store representation for that highly volatile part of the data. Then, on the write code path, there is a certain amount of overhead in flushing that row store data into the column store. Surprisingly, the bottleneck there is IO, not CPU, and the reason for that is that CPUs are actually very performant and there are fantastic compression algorithms that are very, very cheap on the CPU. So this is part one; there are probably going to be about five parts to answering that question. The second part is: okay, we run a query against a column store table, and we're sending the results back, and we use intermediate computations to produce the results of that query. If it's a reporting query, the amount of data you process is drastically more than the amount of data you send back, and in this case column stores are the perfect representation, because we keep that vectorized representation throughout the query execution pipeline. Each database operator is running computations on top of what we internally call vector projections.
If you want an analogy in the open source world, the project called Apache Arrow comes to mind. So all the internal query processing runs on vectors, and then at the very end, we convert that representation to a row store representation, because the client expects data coming row by row. There's one scenario in which this doesn't work that well, and that scenario is when you're trying to pull a lot of data out of the system. The typical place people hit it is in data science: they have Python tools, and all they want is to run a simple query, maybe with a predicate and whatnot, and get data out of the system and move it into a Python library or Spark. With Spark, we can do it in parallel, so the problem there is not as pronounced.
But in Python, we need to grab a lot of data in the columnstore format and move it to a Python library that's also processing data in a columnar format, and in the middle there is a client that consumes data row by row. So this is one scenario where I feel like we could do a better job, and there is real overhead converting from the column store representation to a row store representation. Then there is another workload where you run a high volume of updates, and in this case our caching systems and our row store representation are a perfect fit. We have customers running 10,000,000 updates a second against our row store.
And we've basically given you unlimited capacity for workloads like this; that's again where the row store works super, super well. So having those two representations, row store and column store, allows us to cover a wide range of workloads, and depending on the workload, you can get the best of both worlds. And given this
[00:34:46] Tobias Macey:
duality of the representation for the data, I'm wondering how that translates into any special approach necessary for data modeling and building the schema for the data store or if because of the fact that it's largely SQL oriented and relational in nature, you would just use the traditional techniques and mechanisms that anybody familiar with legacy and traditional
[00:35:13] Nikita Shamgunov:
SQL databases would be familiar with? Yeah. It's a very good question. So we currently give you a few choices. We give you the ability to create a column store table that will reside on disk and be cached in memory, and we give you the option of building an in-memory row store table. We're working very hard on merging the two over time, and the reason for us to do it is just simplicity: by giving you that choice, we're increasing the cognitive overhead for a developer, who has to understand her workload one notch deeper to make those choices. And then we also give you a knob to define how you distribute data across the cluster, and it's up to you whether you want to use that knob or not.
Beyond those, it's your standard SQL, and it's query optimization and query execution that make sure you get a great out-of-the-box experience with the product. Are we done? Well, in the future, I believe databases will get to a place where there are no choices: you just create tables, and you don't think about capacity. It's just a cloud API, which is SQL; you create tables and run queries, and what you pay for is basically latencies and the volumes of data stored and processed. We're still far away from that future, but we're absolutely driving towards it.
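To make those choices concrete, here is a minimal sketch of the two table types and the distribution knob as they looked in MemSQL of this era; the schemas are hypothetical.

```sql
-- In-memory row store table: good for high-rate point reads and writes
CREATE TABLE sessions_live (
  session_id BIGINT PRIMARY KEY,
  user_id    BIGINT,
  started_at DATETIME
);

-- Disk-backed column store table: good for reporting over large history,
-- with an explicit SHARD KEY controlling how rows spread across the cluster
CREATE TABLE events_history (
  user_id    BIGINT,
  event_time DATETIME,
  event_type TEXT,
  SHARD KEY (user_id),
  KEY (event_time) USING CLUSTERED COLUMNSTORE
);
```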
[00:36:48] Tobias Macey:
And in terms of the write path for the data, as distinct from the read path: obviously, you have it stored on disk, and then you do that translation to the in-memory representation. But for writing the data, I'm curious what the overall life cycle is, given that it is an in-memory-capable database and you do have that transaction log for persistence, and how it flows from somebody issuing that request through to it actually being fully persisted on disk?
[00:37:17] Nikita Shamgunov:
Yeah. So let's say you run an insert statement into a table. When you do that, the data moves onto one of the nodes in the database and gets replicated to another node, and once that happens, you get an acknowledgment on the client that the write went through. With that, you will have a transaction log on every node registering that record, and that record is appended to the transaction log, so the data is not going anywhere; now it's in the database. Then, internally, that record will be represented as an in-memory record in that table, and there is a background process that, once the amount of data in the in-memory representation exceeds a certain threshold, will perform a conversion of the in-memory representation to an on-disk representation and flush it to disk. So that's really the life cycle. And
[00:38:17] Tobias Macey:
1 of the main aspects that you promote for MemSQL is the speed and performance of the overall system. And I'm wondering, what are some of the techniques that you use for being able to optimize for that overall system speed and performance across the cluster? That's that's a good question. There are a lot of them. So let me
[00:38:38] Nikita Shamgunov:
bucket them into several categories, and I will also give you an idea of what the impact of each bucket is. The first bucket is the in-memory representation and code generation. In that bucket, if your data is in a row store format in memory, it's trivial to go and navigate to an individual row, so that takes very little overhead. But in the context of a SQL query that extracts just one record or performs a certain computation, one of the surprising aspects is that you burn compute on evaluating scalar expressions. And that's where query compilation, where we compile queries into machine code, drives a lot of value.
That's, I would say, bucket number one. The second bucket is network optimizations. When you move data across the cluster, if you do it in a naive way, you will burn a lot of compute on serializing and deserializing data before and after you move it across the network, so that code path has to be highly optimized. I would say this is an example of a systems approach to building a database, and the systems approach is understanding where the bottlenecks are and implementing those paths carefully and efficiently. In order to drive value in that bucket, you're just looking at a lot of profile data, running multiple workloads across the system, understanding what the highest offender is, and fixing that. And finally, there is a bucket around compression and vectorization, and that applies mostly to column stores and reporting. That's the most architecturally well designed and beautiful part of the system, which performs computations on top of compressed and encoded data. That is specifically applicable for larger reporting queries, for predicate evaluation, and for evaluating joins.
So 1 of the recent benchmarks that we published was, you know, trillion rows, a second benchmark. That's only possible by having a very careful implement implementation of operations on compressed and encoded data. And, we obviously didn't just do it for that specific query. We built a a general purpose system that uses, those techniques,
[00:41:12] Tobias Macey:
and drive performance to to the variety of of workloads. I have a question here. I'm not sure if it is an interesting 1 or not about the sort of ratio of open source versus proprietary code used within the project.
[00:41:24] Nikita Shamgunov:
Great. So we're very, very practical. The majority of MemSQL... And how much of the total implementation
[00:41:31] Tobias Macey:
of memSQL is done using custom built code versus open source projects that are leveraged within the engine? I never ran the count. Certainly, we,
[00:41:41] Nikita Shamgunov:
we use open source software when it makes sense. The majority of MemSQL, I would guess about 95 to 97%, is built in house. However, there are particular pieces of open source code that we use in the engine. Specifically, from Postgres we use a parser and a binder. Everybody understands what a parser is: you get a new SQL query, and you need to understand what's in the query, so we need to parse it. The binder is another component, adjacent to the parser, that takes the query and binds it to your metadata, so if you reference a column in your SQL query, you need to know which column that is in your metadata catalog. We also use a few libraries in our column store representation, things like LZ4 for compression purposes.
And we use a geospatial library from Google, I believe it's called S2, for geospatial capabilities. So that's about it. The rest of the engine, query optimization, query execution, storage, replication, and all the SQL support and built-ins, is built in house.
[00:42:52] Tobias Macey:
And 1 thing that we didn't touch on yet is the fact that the interface for MemSQL uses the MySQL wire protocol, and I'm wondering what you see as being the benefit of choosing that dialect for the communication layer. And if you have any thoughts of the level of difficulty or practicality of implementing additional wire protocols for people who want to be able to use memSQL as a drop in replacement for other database engines? This is a great question. And people often confuse wire protocol and syntax.
[00:43:26] Nikita Shamgunov:
So the wire protocol is actually very, very straightforward. You connect to a database, you have a handshake, and then the protocol is: I send you a string, and you give me results back. And then there's syntax, and the syntax is, well, what's actually in the string? So compatibility certainly has two parts to it. One is compatibility around the wire protocol, where my strong opinion is that it doesn't really matter what the protocol is; you want to choose something that's the most ubiquitous and the easiest to install.
When we started, we had the option of MySQL or Postgres as the wire protocol, and we chose MySQL because MySQL, at the time, was the most popular relational database in the world. It actually still is, but Postgres is growing rapidly; in 2018, we would have a lot more pull to choose the Postgres wire protocol. Then there is database compatibility, or app compatibility, and 100% compatibility is only possible if you are using the full top part of the database. So something like AWS Aurora would be compatible with MySQL simply because they run the top part of MySQL, and by the top part, I mean everything but storage. Now, if we were to choose a Postgres wire protocol or any other protocol, there would still be two parts. Right? What do you connect to MemSQL with?
And second, is it actually compatible with 100% of Postgres applications? That's where you have a trade off. You either say, I'm going to innovate on query processing, and for that I need to build it from first principles and from the ground up, or, I will be driving towards app compatibility and no surprises for an application developer, and in that world you have to reuse the full code, and building a distributed query processor is actually a very, very tricky and hard proposition. So we chose the former. We chose to innovate on query processing, and that's why we're not 100% compatible with any other database.
We support SQL, but if you want to move your application from MySQL or Postgres or Oracle, you will have to put in some work. Not much, but that work will be required. And for somebody who is interested
[00:46:06] Tobias Macey:
in moving to using MemSQL, I'm wondering what's involved in actually getting an instance set up and deployed and integrated with their application and environment.
[00:46:17] Nikita Shamgunov:
Setting up and deploying is trivial. We do a very good job there; once you commit to a certain size of cluster, it's basically a push-button experience, and MemSQL will be running on that cluster. MemSQL has no dependencies, so it runs as a Linux binary and stitches itself into a cluster. Moving an application really depends on how much that application has ties to the database. We're working with some large customers moving data from Sybase and from HANA, and the complexity of that work usually depends on the complexity of the stored procedures. In the past, we didn't support stored procedures, and we couldn't even touch those workloads.
But now that we do, there is a migration path for moving the stored procedures, because our stored procedure language is familiar yet modern, whereas those legacy engines made certain choices in their languages that just don't make sense in the modern world anymore. So as we look forward to providing a great developer experience, building new applications is very, very nice. Migrating legacy applications will incur some pain, mostly around stored procedures, but those are very well understood pains that can be worked through with some effort, and if you need help from professional services at MemSQL, it's a very easy proposition.
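For a sense of what a migrated procedure might look like, here is a minimal sketch in MemSQL's procedural SQL dialect; the table, procedure, and business logic are hypothetical, and the exact syntax accepted depends on the MemSQL version.

```sql
-- Table the procedure writes into (hypothetical)
CREATE TABLE trades (
  account_id BIGINT,
  symbol     TEXT,
  qty        INT,
  traded_at  DATETIME
);

DELIMITER //
-- A legacy Sybase or HANA procedure might be rewritten into something like this
CREATE OR REPLACE PROCEDURE record_trade(_account_id BIGINT, _symbol TEXT, _qty INT) AS
DECLARE
  _now DATETIME = NOW();
BEGIN
  INSERT INTO trades (account_id, symbol, qty, traded_at)
  VALUES (_account_id, _symbol, _qty, _now);
END //
DELIMITER ;

CALL record_trade(42, 'ACME', 100);
```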
[00:47:46] Tobias Macey:
And what have you found to be some of the common points of confusion or common difficulties that your customers encounter when they're either migrating to MemSQL or building on top of it with a new project? Well, it's basically understanding
[00:48:01] Nikita Shamgunov:
the distributed nature of the system. Once it clicks, it becomes, we we start to have a lot of fans on the other side of the, of the enterprise. We call them champions. So getting around, and making a leap from a single node system to a distributed system is the hardest. And once you're there and once you realize the value, then not only we have kinda passionate fans in the enterprises,
[00:48:29] Tobias Macey:
they tend to get promoted too. And going back a bit to earlier, you were saying how MemSQL provides, limitless compute and storage. And I'm wondering what the capacity is in terms of auto scaling natively to the database versus how much of that is necessary to be implemented as another layer that the operations team manages? And then also, I'm curious what the licensing model is in terms of the way that the in memory capability plays into the overall system. Great question. So,
[00:49:03] Nikita Shamgunov:
So MemSQL is scalable but not fully elastic yet. There are limits to how far you can expand the cluster from the starting point. Typically, if you commit to a cluster of, let's say, five nodes, you will be able to comfortably grow it to about 20 nodes, but not to a thousand nodes. But that's basically a feature, and that feature is on the road map; a few customers actually want to have that level of elasticity, and that's why it's on the road map and will, in the future, make MemSQL fully elastic.
[00:49:43] Tobias Macey:
just curious about the licensing model as far as in memory versus on disk and, how that plays in. Yeah. And the licensing mode model is per node. So it's, it's very very straightforward.
[00:49:55] Nikita Shamgunov:
So you basically take the number of nodes and multiply it by a number. And what have you found to be some of the most challenging
[00:50:02] Tobias Macey:
or unexpected aspects of building and growing the technical and business aspects of MemSQL?
[00:50:09] Nikita Shamgunov:
There are a few. One is, we didn't know viscerally what it takes to support tier 1 workloads and what kind of pressure it puts on the quality of the technology and the software, as well as on the capability of the organization: the support, customer success, and 360-degree customer experience that you need to have to make your tier 1 customers successful. So it was a journey, and that journey involved waking up at 4 AM, giving CIOs and CTOs on the other end my personal phone number, and building out multiple levels of support, in-house escalation paths, and remote offices around the globe, to have a very quick, basically instant response to our most demanding customers and some of the world's most demanding applications.
So that's that path, and anybody who is building a strong operational database with tier 1 capabilities will have to go and walk it. The path is not easy, but we are well underway. And in a way, that was a bit of a surprise to me, I have to say. And given the strengths and flexibility of MemSQL,
[00:51:43] Tobias Macey:
it's, easy to fall into the trap of thinking that it will solve everyone's problems. But when is it the wrong choice for a particular data workload or a data platform implementation?
[00:51:54] Nikita Shamgunov:
At this point in time, MemSQL works very well when you have a scale problem. There are workloads for which MemSQL is overkill, and you're better off with some sort of commodity database such as Postgres. This will change over time, where MemSQL will take over those workloads as well, simply because it will provide a better developer experience, simplicity, and cost. But at this time, you know, those very simple departmental databases where you don't expect much from the database itself are better served by something like Postgres.
[00:52:29] Tobias Macey:
And what are some of the features or improvements that you have planned for the future of MemSQL, both the product and the company?
[00:52:38] Nikita Shamgunov:
Great question. So we want to build a great database company. And the way we think about features is, you know, what are the things we need to build to keep our existing customers happy and keep advancing MemSQL inside those enterprises? MemSQL is an enterprise company. And what are the features we add to capture new workloads and new markets? The end state is where, like I said, you have a cloud service, you have a SQL API to it, and that SQL API has endless capacity, and you can pay for better latencies and better throughput. So on the way towards that vision, we're investing in managed services and Kubernetes integration, and we're constantly strengthening the engine itself: making it more elastic, making it more robust, adding the ability to run across multiple data centers, and just driving quality and developer experience. And are there any other aspects
[00:53:45] Tobias Macey:
of MemSQL, the project, or the data management space in general that we didn't cover yet, which you think we should discuss before we close out the show? Well, I think there's there's a lot of things in data managements that are happening in the intersection of things,
[00:54:02] Nikita Shamgunov:
on various boundaries. One is the cloud to on-premises boundary, where I think Kubernetes is one of those technologies that will start supporting data infrastructure, not just stateless apps, and we're looking forward to that. There's another boundary in data management between databases and data warehouses, where we believe that over time we will live in more of a converged world where you just have, you know, a SQL API to your data. And there's another boundary between data science and data management tools, and the requirements at that boundary keep rising, especially with AI and ML workloads.
And I think we're just at the beginning of that world, and a lot of innovation and a lot of changes are gonna happen around that boundary. AI and ML workloads are rising exponentially, but they need to store data somewhere, and there is no default choice for that just yet. So that's an exciting world, and we're absolutely gonna be part of it. And for anybody who wants
[00:55:11] Tobias Macey:
to get in touch with you and follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Well, I think it's it's that ability to seamlessly scale your workloads.
[00:55:31] Nikita Shamgunov:
So I remember using MySQL for the first time in my life, back in the nineties, and it felt like magic. It felt like there was a technology that, you know, I could throw my data into, and it's safe, it's secure, and it's fast. Fast forward almost 20 years, and we need similar tools for exponentially larger datasets, where when you run those commands and operate on those datasets, it feels completely fluid and effortless, and we're just not there. Once you start operating and managing terabytes and petabytes of data, that fluidity is gone, but there's a tremendous amount of value in supporting such workloads.
So I think there is a gap there, and and and we as a technology industry are are are are not quite there, and we're working on it. Alright. Well, thank you very much for taking the time today to discuss the work that you've been doing with MemSQL.
[00:56:35] Tobias Macey:
It's definitely a very impressive product and 1 that seems to be, driving a lot of value for a lot of companies. So thank you for that, and, I hope you enjoy the rest of your day. Thank you,
Introduction to Nikita Shamgunov and MemSQL
Founding Insights and Early Challenges
Use Cases and Customer Applications
Ingestion Capabilities and Data Integration
Architectural Evolution and Technical Details
Data Modeling and Schema Design
Wire Protocol and Compatibility
Challenges and Future Directions