Summary
One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented, and analytical, high volume, workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management.
Preamble
Contact Info
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented, and analytical, high volume, workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
- And the team at Metis Machine has shipped a proof-of-concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom built by your data science team. If you think that sounds awesome (and it is) then join the free webinar with Metis Machine on October 11th at 2 PM ET (11 AM PT). Metis Machine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau. Go to metismachine.com/webinars now to register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Nikita Shamgunov about MemSQL, a newSQL database built for simultaneous transactional and analytic workloads
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what MemSQL is and how the product and business first got started?
- What are the typical use cases for customers running MemSQL?
- What are the benefits of integrating the ingestion pipeline with the database engine?
- What are some typical ways that the ingest capability is leveraged by customers?
- How is MemSQL architected and how has the internal design evolved from when you first started working on it?
- Where does it fall on the axes of the CAP theorem?
- How much processing overhead is involved in the conversion from the column oriented data stored on disk to the row oriented data stored in memory?
- Can you describe the lifecycle of a write transaction?
- Can you discuss the techniques that are used in MemSQL to optimize for speed and overall system performance?
- How do you mitigate the impact of network latency throughout the cluster during query planning and execution?
- How much of the implementation of MemSQL is using custom built code vs. open source projects?
- What are some of the common difficulties that your customers encounter when building on top of or migrating to MemSQL?
- What have been some of the most challenging aspects of building and growing the technical and business implementation of MemSQL?
- When is MemSQL the wrong choice for a data platform?
- What do you have planned for the future of MemSQL?
Contact Info
- @nikitashamgunov on Twitter
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- MemSQL
- NewSQL
- Microsoft SQL Server
- St. Petersburg University of Fine Mechanics And Optics
- C
- C++
- In-Memory Database
- RAM (Random Access Memory)
- Flash Storage
- Oracle DB
- PostgreSQL
- Podcast Episode
- Kafka
- Kinesis
- Wealth Management
- Data Warehouse
- ODBC
- S3
- HDFS
- Avro
- Parquet
- Data Serialization Podcast Episode
- Broadcast Join
- Shuffle Join
- CAP Theorem
- Apache Arrow
- LZ4
- S2 Geospatial Library
- Sybase
- SAP Hana
- Kubernetes
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
[00:00:11]
Tobias Macey:
Hello, and welcome to the data engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to data engineering podcast dot com slash linode today to get a $20 credit and launch a new server in under a minute. And you work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform for Metas Machine was built to give your data scientists the end to end support that they need throughout the machine learning life cycle. Scaphos maximizes interoperability with your existing tools and platforms and offers real time insights and the ability to be up and running with cloud based production scale infrastructure instantaneously.
Request a demo today at data engineering podcast dotcom/metis dashmachine to learn more about how MetisMachine is operationalizing data science. And the team at MetisMachine has shipped a proof of concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom built by your data science team. If you think that sounds awesome, and it is, then join the free webinar with MetisMachine on October 11th at 2 PM EST 11 AM Pacific. MetisMachine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau.
Go to metis machine.com/webinars now to register. Go to data engineering podcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at data engineering podcast.com/ chat. Your host is Tobias Macy. And today, I'm interviewing Nikita Shamganov about MemSQL, a new SQL database built for simultaneous transactional and analytical workloads. So, Nikita, could you start by introducing yourself? Yeah. So my name is Nikita Shamganoff.
[00:02:12] Nikita Shamgunov:
My role here is the CEO, and, have a long history here, at memSQL. And prior to that, I was at Facebook. And before that, I was a SQL Server, working on the kernel. I think it also it makes sense to to share my, academic history. So I actually was I grew up and got educated in Russia, and my passion then was doing programming competitions. So that carried me through my high school years, then college years, and then I went into, the academic track where, I went to Saint Petersburg University of Fine Mechanics and Optics and did my PhD in computer science. And right after the graduation, in fact, I got, an offer a a few months before the graduation. I got on the plane and joined Microsoft to start working on SQL Server. And so it sounds like that was your initial on ramp to getting involved in the area of data management?
Yeah. Absolutely. So I work on on on some of the aspects of that, during my PhD program. But, really, the, the trial by fire was certainly at SQL Server where not only I was exposed to, the area of data management, but in a broader way, started working on on systems. And I think there is a a a very specific set of skills that goes into engineering, hardcore, complex systems
[00:03:36] Tobias Macey:
written in c and c plus plus and built for performance and reliability. So certainly, SQL Server was a a very, very good training ground for me. Yeah. Getting thrown in at the deep end is usually 1 of the most effective ways to learn something very thoroughly because it's either, sink or swim, and that is a is a great inspiration for taking the time to fully internalize what's happening under the covers. Yeah. Absolutely.
[00:04:00] Nikita Shamgunov:
And the experience, like you said, was was really going into the deep end. A few weeks into the job, when I just arrived from Russia into the United States, I was spending most of my time in in the debugger, basically understanding, that SQL Server kernel, fixing bugs. And at the time, we were working on stabilizing 1 of the most ambitious releases of SQL Server, which is SQL Server 2, 005, 5 years in the making, which is really unheard of in in modern days. Like, nobody takes 5 years to ship a piece of software. But, you know, again, at the time, we were on this death march of driving the quality of of the product up after building, a lot of features that went into into that product. So that was that was a lot of fun, actually. And so
[00:04:46] Tobias Macey:
that was what set you on the road to ultimately working on MemSQL. So I'm wondering if you can describe a bit about what MemSQL is and the motivation
[00:04:58] Nikita Shamgunov:
for starting the project and building a business around it. Absolutely. So MemSQL started with with a few insights. And the insight number 1 is a several attempts of building a scalable transactional system inside Microsoft. I was not part of that effort, but, my my cofounder and chief architect, Adam, was. And and while we were there working on the project called Matrix DB, it was an incredibly frustrating experience that we actually could not ship that thing. And then when I joined Facebook, it became very apparent to me what the power and the value of such system is, and we went on, on a journey to build a new database. You know, we were very passionate about about building, databases and building systems. And I remember having a conversation back then, and the conversation literally was the following. Well, it's actually a lot easier to build an analytical system. But we thought that at the time the game was over. You know, history proved us wrong. Right? You know, since, there's a lot of cloud data warehouses came into existence. But we were looking at the market and seeing a lot of companies like Data Allegro, Vertica, Netezza, Astrodata, being acquired. And so we're like, okay. Well, it's much harder to build a transactional distributed system, so let's just do that. And then we went on on their journey having just a few insights. A, we believe that there was more of an opening to build a scalable transactional system.
2, we saw the value, of such system by looking some of the workloads inside Facebook. Most of such workloads that we were exposed to were in, around real time analytics. Basically, you need a transactional operational system to withstand withstand, high volume of inserts, updates, deletes that that hitting the system. But at the same time, the the the value is to, actually drive insights, drive analytics. And then the last insight was the unsuccessful, attempt of Microsoft of building a distributed system with a starting point, SQL Server. And, the competition for Microsoft at the time was Oracle RAC and on Oracle Exadata. And we and we kind of believed that there is value in building such a system. And we thought if we build it, then there will be a market for for the system because, obviously, Microsoft is trying very, very hard to build 1. So that's that's how we started. And then, the amount of technology that goes into a product like this is is is absolutely enormous. So you need to pave a path, and we made a bunch of assumptions to get to basically version 1.0 to not only build such a complex product, but being able to drive revenue on the way to the ultimate vision, of being fully scalable, elastic, transactional system that can support transactions and analytics and and, in 1 product. So that assumption, the first 1 was, let's build an in memory system. There was a lot of research at the time about building in memory systems. And also there's another assumption that that you can make that the cost of RAM is gonna go down so dramatically that everybody will be running, transactional systems in memory. That assumptions turn out to not be true.
And if you compare the the cost of RAM versus the cost of flash, for example, the the cost of flash, you know, dropped all the way down to the ground. Versus the cost of RAM, you know, RAM got cheaper, but not orders and orders of magnitude cheaper compared to flash. But yet, in memory was actually a very good starting point for us because it allowed us to onboard various high value workloads. So you can drive revenue with relatively small number of customers solving exotic problems, problems that those institutions have and they're willing to put money and faith into a new a new startup to solve, to solve those problems.
[00:08:56] Tobias Macey:
And how much of your product road map ended up getting driven by those early customers with the exotic problems? And how much of it was a matter of going and doing customer research with people who had more standard workloads to try and figure out what was the best direction for driving the company? This is a very good question. So the beauty of the situation that where where we ended up, early on is the exotic angle of those problems were only around performance,
[00:09:27] Nikita Shamgunov:
but not around the surface area. So the road map very nicely aligned with expanding the surface area. And the surface area of a transactional SQL systems is massive. Not only you need to build your storage features and clustering features, but you also need to provide very, very good SQL support, which is a combination of just a very broad language, SQL, but also, investments into query optimization and query execution and really understanding the universe of all possible workloads and doing a very good job workload by workload and on shipping new releases of of MemSQL. So to answer your question, the exotic part is just performance and scalability. And then the general purpose part of having a very good support of SQL, having an extremely reliable product, and I can talk for a long time about quality of, databases and systems in general. So so we found ourselves actually in in a in a in a decent spot where we're not torn expanding the exotic nature of the workloads versus,
[00:10:33] Tobias Macey:
building general purpose features. Those kinda align very nicely for us. And now that you have iterated and put out a number of additional releases from that 1.0 milestone. What are some of the typical use cases that you see for customers who are running MemSQL and people who are in conversations with you to add it into their infrastructure?
[00:10:57] Nikita Shamgunov:
So you can you can parse this question and answer it in terms of tiers. So at the very, very high level, we support general purpose database workloads that require scale. We give you unlimited storage and unlimited compute. And in 1 product, in an elegant way, we can scale your your applications, which is a very stark contrast to your existing, existing single node systems or legacy systems, the Oracles, the SQL Servers, the Postgreses of the world, where the amount of storage and compute is limited. Now this is too general. So what is the the next iteration of answering that question? Well, the second 1 is, like, let's look at the very, specific workloads, but still wear a a technology hat. The first 1 is real time analytics. You have a data coming at you either through an application or through a message bus such as Kafka or in the cloud Kinesis, and you need to provide, insights into that data in real time. It's supremely important to know what's happening with that data within maybe milliseconds or seconds or or minutes, but, you know, in a short period of time. So, and we call this workload real time dashboards. And real time dashboards are important in financial markets. They're important in health care. They're important in packaged goods and, in in web 2.0. So real time dashboards is something that MQL does, incredibly well. So this is 1 workload. The second workload is there's a web point, 2 0 workload that requires scale because of concurrency, and that's where giving infinite compute allows you to build your system with much fewer moving pieces. You don't need to cache.
You don't need to shard your database. You can just deploy MaplesQL, and it will drive arbitrary level of concurrency for your, web 2 put 2 0 or mobile application. So we have a number of workloads like this as well. And specifically, the moment you add reporting into an application like this, you just have a much easier way of burning compute. So, for example, in the financial markets, there's a number of use cases such as portfolio analysis, or wealth management, where you log in and and you see your assets and and reports. And reports, January reports burns, a nontrivial amount of compute. So if you want to deliver great user experience and lower latencies, you actually need a scalable system. So, so that's another kind of class of applications where, there's a lot of concurrency on the other end. And then finally, there is a set of applications that are closer to data warehousing and data marts where you just need to to produce a report and you need to produce, that report quickly. And in those workloads, the advantage over data warehouses is really that you can do it on top of the same data that powers your applications. So you don't need to extract your data, put it into the data warehouse, but you, but you do it on the data that sits in the database. So the advantage is really simplicity. So that that's more from the technology standpoint. And now finally, there are use cases by industry.
That that's the third layer. And we're mostly, the most successful in financial services and media. But we but recently, for example, we also see health care coming in, more and more into the platform. So and the specific use cases by industry are different. So for example, in the financial services, we see wealth management, portfolio management, risk analysis taking a substantial portion of the workloads that we support. Versus in media, it's all about advertising and streaming, Understanding the quality of service, understanding the performance of your ads, and doing that in real time, and and and and really driving decisions that will change the parameters of your advertising campaigns based on what's actually happening. And finally, across many, many industries, but particularly in media, it's real time reporting of the performance of your, of your campaigns or things like pricing and, and billing. So there there are companies that have massive big data workloads.
[00:15:18] Tobias Macey:
And in order for them to provide billing that's up to this second, you require something like memSQL and nothing else will work. Yeah. And what you were describing of being able to build a transactional application on top of a database and then be able to build reports on that same data. If you're not using a system that can support both of those workloads in tandem, then you would generally be doing something like change data capture, piping through maybe a Kafka topic that then lands into a data warehouse to be able to have some transformation that rolls it into an aggregated report, which automatically adds latency because of all the additional steps. Whereas being able to run across the same records as they're being written reduces the overall amount of time and effort in terms of building the reports and in terms of supporting the data life cycles.
And I also know that MEM SQL is built to be multi tenant, so I'm wondering if the typical case is that it will 1 deployment of MEM SQL will be used for backing multiple different applications, and then reports are built to be able to combine the data across those applications
[00:16:24] Nikita Shamgunov:
in that engine, or if it's something where an additional step is necessary to be able to aggregate the data across those different applications because of some boundaries in place because of the multi tenancy? Yeah. So this is this is a a loaded question. So let me answer 1 by 1. So first of all, you're absolutely right. And in many cases, in order to provide reporting on top of the operational data, you need to set up an architecture that looks like having changed data capture, piping it through an ETL tool, or in in your case, to Kafka, pushing into a data warehouse.
Data warehouse is not transactional, so we need to be very diligent into making sure that the data that goes into the data warehouse is in fact a a real copy of the data produced by the operational system. You can't just apply a change data capture log to a data warehouse, on a high high throughput, high performance transactional system. And the the second question was, around multitenancy, and we see a mix, actually. We see there's a there's a class of workloads where we which which we call tier ones. And tier 1 workloads are they have very, very strict reliability and performance requirements.
They usually have to run 24 by 7. The uptime has to be 5 nines. And, the latency and then especially p 90 latencies have to be very, very consistent. And so what ends up happening, the tier 1 workloads are typically run-in separate clusters of MIM SQL because, they're so important to the business that that, any disruption that will affect those p 9 latencies or any disruption that will, be affected by bringing new applications into the platform cannot be tolerated. And then for tier 2 workloads, that's precisely right. People just bring a lot a lot of applications into the platform and drive a lot of ROI through consolidation.
Because Minty was scalable, so you can bring, and gives you so much compute and storage. So you can just bring a lot of applications into the single platform. Now on from the product standpoint, that motion, which we we're we're very, very excited about. Right? Because, what we see in a typical enterprise is a problem that we call proliferation of data stores. A very typical conversation with something like a big bank or a very large, enterprise company or a very large, you know, world class technology company, when you talk to their DevOps people, they say, look. It looks like our business team is falling in love with your technology, but we are a DevOps team. And we're currently supporting, you know, 40 plus databases.
If we bring you in, something needs to go. And this is a great conversation because we typically see several, not just 1 technologies, especially legacy technologies disappear once people onboard,
[00:19:25] Tobias Macey:
Mentee. And in addition to being able to run both the transactional and the analytical loads on the single engine, there are also additional features built into memSQL itself, such as an ingestion pipeline for being able to consume data from external sources and pull it in for being able to build those reports.
[00:19:46] Nikita Shamgunov:
And I'm wondering what the benefits are of having that built into the database engine versus using some of the preexisting platforms and tooling such as the ETL systems or streaming data systems such as Kafka or Spark? Yeah. This is a great question. And you you can either, put a purist hat when you start looking at this, from the product standpoint and say, look. We're we're building a perfect engine, and we're gonna focus on the engine and then hope that the ecosystem will build tools around that engine to to use it efficiently. And then there's reality. And reality is such that, the market and the set of technologies are moving so fast that the customer is confused. There's too much stuff, and that stuff is not necessarily working very well. I guess there are clusters of technologies that are working very well together, some around clouds, some around the Hadoop ecosystem, some around kinda old school legacy databases.
But the observation here is that in the past, we had a well defined interchange protocol. It's called ODBC. And once you implement ODBC, you automatically compatible with a set of tools that matter. And in the past, those tools were mostly, you know, databases, data warehouses, ETL tools, caches. Now if you look at the data landscape now, you realize that a lot of data sources have become distributed. We're living in the world of, distributed primary storage, such as enterprise storage, secondary storage such as object stores, which is distributed message buses like Kafka and Kinesis. And there is no such thing as a distributed ODBC, and, which means that that you have a choice. Right? You either wait or you bring that ingest ingestion capabilities as a first class citizen, which for the customer means simplicity.
And if you have data coming from Kafka or you have data, in s 3 or HDFS, you can literally type 1 command in MIM SQL, and this data is flowing in MIM SQL in real time with arbitrary scale. It is very, very hard to achieve the same level of integration and performance by relying on a third party ecosystem, which, you know, frankly, is very different from customer to customer. Some people want Informatica. Some people want Spark. Some people want NetProduced jobs. So that's really why
[00:22:12] Tobias Macey:
we pulled in data ingest as a first class citizen into the engine. And what are some of the typical ways that you see your customers using that ingest capability built into the engine where they don't necessarily have to bolt on these additional systems to be able to provide additional data sources to the primary storage system that they're using within MemSQL?
[00:22:37] Nikita Shamgunov:
Yeah. Great question. So the number 1 is simply load data from SVO and GFS. This is not particularly groundbreaking, but it's incredibly easy to do, and incredibly performant to do it in The second 1 is extending that capability to supporting Kafka. So now you can just point at Kafka and say, hey, Move that data into a table in memSQL, and we'll do it, and we'll do it with great performance. You know, we have customers that are moving that data into memSQL tables multiple gigabytes a second. And then the the the third tier to actually grabbing data from a streaming data source such as Kafka and moving into a stored procedure.
So now there's control flow, and you can grab a data stream and then transform it and then transform it and write it back into MIM SQL or write it back into s 3 or HDFS. And that supports a large variety of streaming workloads, workloads like, oh, I wanna scrape Amazon prices, which are constantly changing. And I wanna match those prices, or I wanna make sure I beat those prices for the most traded products that are in Amazon. And with Mansequel, it's trivial to implement that application. Well, as long as you as you figure out how to scrape the prices and put them into Kafka. So now once they're in Kafka, you can grab that data stream, move it into a stored procedure, run a bunch of SQL commands. And in inter transactional systems, you can have running accounts or you can, you know, store all the historical data, and you can crash reports over that historical data with very high performance. And so, fundamentally, there in order to build such a high value application, you need just a few moving parts. And on that ingestion path,
[00:24:28] Tobias Macey:
do you automatically introspect the schema of the data records that are flowing in, or do you require some integration with, for instance, the Kafka schema registry for being able to determine the shape and types of data that are coming in? And then lastly, curious what the serialization formats are that you support for that ingestion workload.
[00:24:50] Nikita Shamgunov:
So this is this is evolving. Right now, we support CSV, TSV format, Avro, and JSON. And in the future, we'll be adding, Parquet, which seems to be, you know, 1 of the most requested features. And in terms of the schema, MemSQL is a relational database. So data is structured, but it also supports semi structured data. We support a JSON data type. And when we store JSON in MemSQL, we automatically parse it and shred it, and store it in in in a column store. So what it gives, the user, it gives you the user the ability to have as much structure as the user needs and wants. But at the same time, it doesn't sacrifice performance. So we don't we don't to using schema to a user, but there are certain advantages in using schema mostly around discoverability and compatibility with BI tools, where if you schematize your data and then you attach a BI tool, you you clearly see what's there, and then you can, easily, craft visualizations, and it simplifies building apps as well. As for Kafka schema registry, that is something that is on the road map, but not currently supported with the SQL
[00:26:07] Tobias Macey:
65. And moving on to the structure of the database itself, I'm wondering if you can discuss how it is architected internally and how that has evolved over the years that you've been developing it from that first 1 release.
[00:26:22] Nikita Shamgunov:
Yeah. Definitely. So a lot of things stayed starting 1 0 release, and a lot of things evolved. 1 of the things that we quickly realized, and back to my question, back to my comment around the RAM prices is that RAM is not enough. And, we made a decision back then to invest into very dense on disk data structures that would support, reporting as well as, as operational workloads. And, we had a choice back then what kind of data structures to introduce, and we landed on on column stores. In hindsight, I think this was a very, very good decision, but it took us years to get it into a place where we're really happy. And I think it's in a good spot right now, but wait for another year or so. And, it's just gonna be magical. So, the architecture is is a distributed architecture, and and the hard parts are transactionality.
And the other hard part is, query processing. Because query processing, in order to be efficient, requires to to as you, run and execute a specific query, it requires you to move data around the cluster. And then as opposed to a single node system where 1 of the hardest components is a query optimizer, in a distributed system, it's still 1 of the hardest components, but the number of choices is much greater than it is in a single node system. Where in on a single node system, you think about join order. It's kinda 1 of the key components of a query optimizer. Then you think about rewrites of, you know, decorrelating subqueries and whatnot. In a distributed system, the numb the number of choices for your joins is greater because you introduce the notion of a broadcast join, where you take a smaller table and and send it to to all the nodes in the cluster to perform a join or, a shuffle join where you move data in the system before you join, 2 tables. As well as you can say, oh, the dataset is relatively small, and I'm just gonna bring all that data into 1 node and perform the join there. So all of that are choices, and the query optimizer working in the in tandem with the query execution system have to have to make those choices, cost them, and at the end of the day, deliver on wide variety of workloads starting from very simple ones, you know, kinda key value workloads to mixed workloads to mostly analytical and reporting workloads.
So so the architecture, of the system has to be very that, you know, you can you can build a a, and then storage query optimization and query execution has to work in tandem in in such a cluster. And given that it is distributed
[00:29:13] Tobias Macey:
by nature, I'm wondering where it falls along the axis of the cap theorem since there are some gradations along there that you can optimize for? Absolutely.
[00:29:25] Nikita Shamgunov:
Well, it's a CP system. Right? And the and we sacrifice, a little bit of unavailability or availability to, deliver on consistency and partition tolerance. In in practicality, there's a trade off. Right? So do you do you push complexity on the application developer and deliver on the a part of the cap theorem, or, you build a consistent system? And it has to be partition tolerance, of course, because, you know, the greatest sin of a database is to lose data or produce wrong results. So that's really a choice. And in practice, and we know that by running tier 1 workloads for some of the most demanding customers, a 100% availability is impossible. Even though, even if the architecture allows that, there are certain things that you cannot control, that that you cannot build a 100% available system. So it's actually a very natural choice to build a CP system because, by doing things like adding more resiliency on the hardware side, adding more resiliency in the software, adding auto healing features.
You actually drive the the total availability of the platform way above the 5 nines. And going beyond that is is pretty much impractical even if you have a perfect architecture and have an AD system.
[00:30:50] Tobias Macey:
And as you mentioned, the disk storage is column oriented. And when reading through the documentation, I noticed that the in memory representation is row oriented. So I'm wondering how much overhead is introduced by converting between those representations, between the on disk and in memory representations,
[00:31:10] Nikita Shamgunov:
and the benefits of having those different views of the underlying data? This is a very technical question, and and I'm gonna give a very, very technical answer as well. So first of all, when when you run a lot of updates, so you run a lot of inserts, into a column store table, it does make sense to have a row store representation for that highly volatile part of the data. And then on the right code path, there is certain amount of overhead of flushing that that row store data into a column store system. Surprisingly, the bottleneck there is IO, not CPU. And the reason to that is there's, you you know, the CPUs are actually very performance and and they're fantastic compression algorithms developed that are very, very cheap on the CPU. So this is this is part 1. Probably, there's gonna be, about 5 parts to, in answering that question. The second part is, okay. Well, we run a query against a column store table, and we're sending the results back. And then we use an immediate computation to produce the results of that query. If it's a reporting query, the amount of data you process is drastically more than the amount of data you send back. And in this case, column stores are the perfect representation because we keep that that vectorized representation throughout the query execution pipeline. So each, database operator is using, running computations on top of what we internally call vector projections.
And you if if you wanna see, have the analogy in the open source world, the project that's called Apache Arrow comes to mind. So all the internal query processing runs on vectors. And then at the very end, we convert that representation to a row store representation because the client expects data coming row by row. There's 1 scenario in which this doesn't work that well, and that scenario is when you're trying to pull out a lot of data out of the system. And the the typical way, where people use it is in data science. They have Python tools, and all they want is to run a simple query maybe with a with a predicate and whatnot. And all they want is to get data out of the system and move it into a Python library or Spark. With Spark, we can do it in parallel. So the problem there is not as pronounced.
But in Python, we need to grab a lot of data in the columnstore format and move it to a Python library that's also processing data into columnstore format. And in the, in the middle, there is a client that consumes data row by row. So this is 1 scenario where I feel like we could do a better job, and there isn't real overhead converting from row store, column store representation to row store representation. Then there is another workload where you run high volume of updates. And in this case, our, our caching systems and and our row store representation is a perfect representation. We have customers running 10, 000, 000 updates a second against our row store representation.
And, but we've basically given you an unlimited capacity, for workloads like this. That's again where row store works super, super well. So, having those 2 representation as, with row store and column store allows us to cover a wide range of workloads. And depending on the workload, you can you can get the best of both worlds. And given this
[00:34:46] Tobias Macey:
duality of the representation for the data, I'm wondering how that translates into any special approach necessary for data modeling and building the schema for the data store or if because of the fact that it's largely SQL oriented and relational in nature, you would just use the traditional techniques and mechanisms that anybody familiar with legacy and traditional
[00:35:13] Nikita Shamgunov:
Siebel databases would be familiar with? Yeah. It's a it's a very good question. So we currently give you a few choices. We currently give you the ability to create a column store table that will reside on disk and be cached in memory, and we give you the option of building in a memory row store table. We're working very hard into merging the 2 over time, and the reason for us to do it is just simplicity. So by giving you that choice, we're increasing the the cognitive overhead, for a developer to understand her workload, 1 notch deeper to make those choices. And then we also give you a knob to define how you distribute data against the across the cluster. And, it's up to you if you wanna use that knob or not.
So barring those, is your standard SQL. And then it's standard SQL, it's query optimization and query execution that makes sure that, you will get great out of the box experience with the product. Are we done? Well, in the future, I I believe databases will will get into a place where there are no choices. Right? You just create tables, and there's no capacity. You don't think about capacity. It's just a cloud API, which is SQL. We, and you can create table and run queries. And what you pay for is, is basically latencies. In a few latencies and and the the volumes of of data stored and processed. We're still far away from that future, but we're absolutely driving towards that future.
[00:36:48] Tobias Macey:
And in terms of the write path for data from the read path, obviously, you have it stored in disk, and then you do that translation to the in memory representation. But for writing the data, I'm curious what the overall life cycle is given that it is a in memory capable database, and you do have that transaction log for persistence and just, how it flows from somebody issuing that request through to it actually being fully persisted on disk?
[00:37:17] Nikita Shamgunov:
Yeah. So, let's say you run, an insert statement, into a table. So when you do that, the data moves into 1 of the nodes, in the database, gets replicated to another node. And once that happens, you get an acknowledgment on the client that, the write went through. With that, you will have a transaction log on every node registering that record. And now that record is appended to the transaction log. So so the data is not going anywhere. So now it's in the database. And then internally, that record will be represented as an in memory record in in that table, and then there is a background process that goes. And once the amount of data in the in memory representations exceeds a certain threshold, it will perform a conversion of the in memory representation to a non disk representation and flush it on disk. So that's, that's really the life cycle. And
[00:38:17] Tobias Macey:
1 of the main aspects that you promote for MemSQL is the speed and performance of the overall system. And I'm wondering, what are some of the techniques that you use for being able to optimize for that overall system speed and performance across the cluster? That's that's a good question. There are a lot of them. So let me
[00:38:38] Nikita Shamgunov:
bucket them into several categories, and I will also give you an idea of what the impact is, of each bucket. So the first bucket is in memory representation and, cogeneration. And in that bucket, if your data is in a row store format in in memory, it's it's trivial to go and navigate to, an individual row. So that that takes very little overhead. But in the context of a of a SQL query that extracts either just 1 record or perform certain computation, 1 of the surprising aspects is that you burn compute on evaluating scalar expressions. And so that's where query compilation and we compile queries into machine code drives a lot of value.
So that's, I would say, bucket number 1. The second bucket is network optimizations. When you move data across the cluster, if you do it in a naive way, you will burn a lot of compute on serializing and deserializing data before and after you move it across the network. So that code path has to be highly optimized. And I would say this is an example of just systems approach to building, to building a database. And the system, approach is understanding where the bottlenecks are and implementing them carefully and efficiently. So in order to to drive value in that bucket, you're just looking a lot at a, at a lot of profile data, by running multiple workloads across the system and, understanding what the highest offenders, offender is and and fixing that. And finally, there is a bucket, around compression and, vectorization, and that applies mostly to column stores and reporting. And that's the most, you know, well architecturally designed and and beautiful system that performs computations on top of compressed and encoded data. And that specifically is, applicable for larger, report inquiries, for product at evaluation, and, for evaluating joints.
So 1 of the recent benchmarks that we published was, you know, trillion rows, a second benchmark. That's only possible by having a very careful implement implementation of operations on compressed and encoded data. And, we obviously didn't just do it for that specific query. We built a a general purpose system that uses, those techniques,
[00:41:12] Tobias Macey:
and drive performance to to the variety of of workloads. I have a question here. I'm not sure if it is an interesting 1 or not about the sort of ratio of open source versus proprietary code used within the project.
[00:41:24] Nikita Shamgunov:
Great. So we're very, very practical. The majority of MEM memSQL. And how much of the total implementation
[00:41:31] Tobias Macey:
of memSQL is done using custom built code versus open source projects that are leveraged within the engine? I never ran the count. Certainly, we,
[00:41:41] Nikita Shamgunov:
we use open source software when it makes sense. The majority of of MIM SQL, that would guess, about 95 to 97%, is built in house. However, there are particular pieces of open source code, that we use in the engine. Specifically from Postgres, we use a parser and binder. So everybody understands what a parser is. You know, you get a new SQL query, and and you need to understand what's in the query. So we need to parse it. The binder is another component that is adjacent to parser that takes the query and and binds it to your metadata. So if you reference a column in your SQL query, you need to know what column that is in in your metadata catalog. We also use a few libraries in our column store representation, things like lz4 for compression purposes.
And we use a geospatial library from Google. I believe it's called s 2, for geospatial capabilities. So that's about it. The rest of the, of the engine, you know, core optimization, core execution, storage, replication, you know, all the SQL supports and built ins are built in house.
[00:42:52] Tobias Macey:
And 1 thing that we didn't touch on yet is the fact that the interface for MemSQL uses the MySQL wire protocol, and I'm wondering what you see as being the benefit of choosing that dialect for the communication layer. And if you have any thoughts of the level of difficulty or practicality of implementing additional wire protocols for people who want to be able to use memSQL as a drop in replacement for other database engines? This is a great question. And people often confuse wire protocol and syntax.
[00:43:26] Nikita Shamgunov:
So wire protocol is, is actually very, very straightforward. You connect to a database. You have a handshake. And then the protocol is I send you a string, and you give me results back. And then there's syntax. And the syntax is, well, what's actually in the string? And the compatibility certainly has 2 parts to it. 1 is the compatibility around the wire protocol, where my my my strong opinion is doesn't really matter what they what the protocol is. And you wanna choose something that's the most, ubiquitous and the easiest to install.
And when we started, we had the option of MySQL or Postgres to be the wire protocol. And we chose MySQL because MySQL, at the time, was the the, the most popular database, relational database in the world. It actually still is, but posers is is is growing rapidly. 2018, we would have a lot more, pressure or or pull to choose post responder protocol. Then there is database compatibility or app compatibility. And a 100% compatibility is only possible if you'll be using the the full top part of the database. So something like AWS Aurora would, would be compatible with MySQL simply because they run the top part of MySQL. And by the top part, I mean everything but storage. Now if we were to choose a a Postgres, wire protocol or any other protocol, that still will be 2 parts. Right? What do you connect to MIM SQL with?
And second is, is it actually compatible with with a 100% post risk applications? So and that's where you have a trade off. You either say, I'm going to innovate in on query processing, and I and for that, I need to build it, from first principles and from ground up, or I will be driving towards app compatibility and no surprises for an application developer. And in this world, you will have to reuse the full code, and building a distributed query processor is actually very, very tricky and and a hard proposition. So we choose we chose the former. We chose to innovate on query processing, and that's why we're not a 100% compatible with any other database.
We support SQL. But if you were want to move, your application from MySQL or Postgres or Oracle, you will have to put some work. Not much, but that work will be required. And for somebody who is interested
[00:46:06] Tobias Macey:
in moving to using memSQL, I'm wondering what's involved in actually getting an instance, set set up and deployed and integrated with their application and environment.
[00:46:17] Nikita Shamgunov:
The setting up and deploying is trivial. We we do a very good job, and once you commit to certain sizes of the cluster, it's basically a push button experience, and then SQL will be running on that cluster. With the, MemSQL has no dependencies, so it runs as a as a Linux binary and, stitches itself, to a cluster. Moving an application really depends of how much this application has ties with, with the database. We're working with some large customers moving data from Sybase and moving data from HANA. And the complexity of that workload, usually depends on the complexity of the stored procedures. In the past, we didn't support stored procedures, and we couldn't even touch those workloads.
But now that we do, there is a migration path to moving the story procedures because, our story procedure, language is, though familiar, still modern, and those legacy engines that there are certain choices that they made to build in your language that just don't make sense in the modern world anymore. So we as we look forward for to providing a great developer experience, building new applications is is very, very nice. Migrating legacy applications will incur some pain mostly around stored procedures, but those are very well understood pains and, with some effort. And if you need with some help with professional services on MQL, it's a it's a very easy proposition.
[00:47:46] Tobias Macey:
And what have you found to be some of the common points of confusion or common difficulties that your customers encounter when they're either migrating to MIM SQL or building on top of it with a new project? Well, it's it's basically understanding
[00:48:01] Nikita Shamgunov:
the distributed nature of the system. Once it clicks, it becomes, we we start to have a lot of fans on the other side of the, of the enterprise. We call them champions. So getting around, and making a leap from a single node system to a distributed system is the hardest. And once you're there and once you realize the value, then not only we have kinda passionate fans in the enterprises,
[00:48:29] Tobias Macey:
they tend to get promoted too. And going back a bit to earlier, you were saying how MemSQL provides, limitless compute and storage. And I'm wondering what the capacity is in terms of auto scaling natively to the database versus how much of that is necessary to be implemented as another layer that the operations team manages? And then also, I'm curious what the licensing model is in terms of the way that the in memory capability plays into the overall system. Great question. So,
[00:49:03] Nikita Shamgunov:
so MIM SQL is scalable but not fully elastic yet. So there there are limits to how far you can expand the cluster from the starting point. And, and, typically, you can, if you commit to a cluster of size, let's say, 5 nodes, you will be able to comfortably grow it to about 20 nodes, but not to a 1000 nodes. And, but that's basically a feature, and that feature is on the road map. And, few customers actually want to have that level of elasticity, and, that's why the feature is on the road map and and will, in the future, make Memseeker fully elastic. They
[00:49:43] Tobias Macey:
just curious about the licensing model as far as in memory versus on disk and, how that plays in. Yeah. And the licensing mode model is per node. So it's, it's very very straightforward.
[00:49:55] Nikita Shamgunov:
So you basically take the number of nodes and multiply it by a number. And what have you found to be some of the most challenging
[00:50:02] Tobias Macey:
or unexpected aspects of building and growing the technical and business aspects of MemSQL?
[00:50:09] Nikita Shamgunov:
There are a few. 1 is, we we didn't realize what it we didn't know viscerally what it takes to support tier 1 workloads and what kind of pressure it puts on the quality of the technology and the software as well as the capability of the the organization, which is around support, customer success, and kinda 3 60 customer experience that you need to have to make your tier 1 customer successful. So it was a journey, and, that journey involved, waking up at 4 AM in the morning and, giving, CIOs and CTOs on the other end my my personal phone number, and, building out multiple levels of, support, in house escalation pass, you know, remote offices around the globe to have a a very quick, basically instant response to our most demanding customers and and most demanding kind of 1 of the world's most demanding applications.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And you work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning life cycle. Skafos maximizes interoperability with your existing tools and platforms and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously.
Request a demo today at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. And the team at Metis Machine has shipped a proof-of-concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom built by your data science team. If you think that sounds awesome, and it is, then join the free webinar with Metis Machine on October 11th at 2 PM ET (11 AM PT). Metis Machine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau.
Go to metismachine.com/webinars now to register. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at dataengineeringpodcast.com/chat. Your host is Tobias Macey, and today I'm interviewing Nikita Shamgunov about MemSQL, a NewSQL database built for simultaneous transactional and analytical workloads. So, Nikita, could you start by introducing yourself? Yeah. So my name is Nikita Shamgunov.
[00:02:12] Nikita Shamgunov:
My role here is the CEO, and I have a long history here at MemSQL. Prior to that, I was at Facebook, and before that I was on SQL Server, working on the kernel. I think it also makes sense to share my academic history. I grew up and got educated in Russia, and my passion then was doing programming competitions. That carried me through my high school years, then my college years, and then I went into the academic track, where I went to the Saint Petersburg University of Fine Mechanics and Optics and did my PhD in computer science. And right after graduation, in fact, I got an offer a few months before graduation, I got on a plane and joined Microsoft to start working on SQL Server. And so it sounds like that was your initial on-ramp to getting involved in the area of data management?
Yeah. Absolutely. So I worked on some aspects of that during my PhD program. But really, the trial by fire was certainly at SQL Server, where not only was I exposed to the area of data management, but in a broader way I started working on systems. And I think there is a very specific set of skills that goes into engineering hardcore, complex systems
[00:03:36] Tobias Macey:
written in C and C++ and built for performance and reliability. So certainly, SQL Server was a very, very good training ground for me. Yeah. Getting thrown in at the deep end is usually one of the most effective ways to learn something very thoroughly, because it's either sink or swim, and that is a great inspiration for taking the time to fully internalize what's happening under the covers. Yeah. Absolutely.
[00:04:00] Nikita Shamgunov:
And the experience, like you said, was really going into the deep end. A few weeks into the job, when I had just arrived from Russia into the United States, I was spending most of my time in the debugger, basically understanding the SQL Server kernel and fixing bugs. At the time, we were working on stabilizing one of the most ambitious releases of SQL Server, which was SQL Server 2005, five years in the making, which is really unheard of in modern days. Like, nobody takes five years to ship a piece of software. But, you know, at the time, we were on this death march of driving the quality of the product up after building a lot of features that went into that product. So that was a lot of fun, actually. And so
[00:04:46] Tobias Macey:
that was what set you on the road to ultimately working on MemSQL. So I'm wondering if you can describe a bit about what MemSQL is and the motivation
[00:04:58] Nikita Shamgunov:
for starting the project and building a business around it. Absolutely. So MemSQL started with a few insights. Insight number one came from several attempts at building a scalable transactional system inside Microsoft. I was not part of that effort, but my cofounder and chief architect, Adam, was. And while we were there working on the project called Matrix DB, it was an incredibly frustrating experience that we actually could not ship that thing. Then, when I joined Facebook, it became very apparent to me what the power and the value of such a system is, and we went on a journey to build a new database. You know, we were very passionate about building databases and building systems. And I remember having a conversation back then, and the conversation literally was the following: well, it's actually a lot easier to build an analytical system, but we thought that at the time the game was over. History proved us wrong. Right? Since then, a lot of cloud data warehouses have come into existence. But we were looking at the market and seeing a lot of companies like DATAllegro, Vertica, Netezza, and Aster Data being acquired. And so we're like, okay, well, it's much harder to build a transactional distributed system, so let's just do that. And then we went on that journey having just a few insights. One, we believed that there was more of an opening to build a scalable transactional system.
Two, we saw the value of such a system by looking at some of the workloads inside Facebook. Most of the workloads that we were exposed to were around real time analytics. Basically, you need a transactional, operational system to withstand a high volume of inserts, updates, and deletes hitting the system, but at the same time, the value is to actually drive insights, drive analytics. And then the last insight was the unsuccessful attempt by Microsoft at building a distributed system with SQL Server as the starting point. The competition for Microsoft at the time was Oracle RAC and Oracle Exadata. We believed that there is value in building such a system, and we thought that if we built it, there would be a market for the system, because, obviously, Microsoft was trying very, very hard to build one. So that's how we started. Then, the amount of technology that goes into a product like this is absolutely enormous. You need to pave a path, and we made a bunch of assumptions to get to version 1.0, to not only build such a complex product but be able to drive revenue on the way to the ultimate vision of a fully scalable, elastic, transactional system that can support transactions and analytics in one product. The first assumption was: let's build an in-memory system. There was a lot of research at the time about building in-memory systems. And there's another assumption that you can make, that the cost of RAM is going to go down so dramatically that everybody will be running transactional systems in memory. That assumption turned out not to be true.
And if you compare the the cost of RAM versus the cost of flash, for example, the the cost of flash, you know, dropped all the way down to the ground. Versus the cost of RAM, you know, RAM got cheaper, but not orders and orders of magnitude cheaper compared to flash. But yet, in memory was actually a very good starting point for us because it allowed us to onboard various high value workloads. So you can drive revenue with relatively small number of customers solving exotic problems, problems that those institutions have and they're willing to put money and faith into a new a new startup to solve, to solve those problems.
[00:08:56] Tobias Macey:
And how much of your product road map ended up getting driven by those early customers with the exotic problems? And how much of it was a matter of going and doing customer research with people who had more standard workloads to try and figure out what was the best direction for driving the company? This is a very good question. The beauty of the situation where we ended up early on is that the exotic angle of those problems was only around performance,
[00:09:27] Nikita Shamgunov:
but not around the surface area. So the road map very nicely aligned with expanding the surface area. And the surface area of a transactional SQL system is massive. Not only do you need to build your storage features and clustering features, but you also need to provide very, very good SQL support, which is a combination of a very broad language, SQL, plus investments into query optimization and query execution, really understanding the universe of all possible workloads, doing a very good job workload by workload, and shipping new releases of MemSQL. So to answer your question, the exotic part is just performance and scalability, and then there's the general purpose part of having very good support for SQL and having an extremely reliable product, and I can talk for a long time about the quality of databases and systems in general. So we found ourselves in a decent spot where we're not torn between expanding the exotic nature of the workloads versus
[00:10:33] Tobias Macey:
building general purpose features; those kind of align very nicely for us. And now that you have iterated and put out a number of additional releases since that 1.0 milestone, what are some of the typical use cases that you see for customers who are running MemSQL and people who are in conversations with you to add it into their infrastructure?
[00:10:57] Nikita Shamgunov:
So you can parse this question and answer it in terms of tiers. At the very, very high level, we support general purpose database workloads that require scale. We give you unlimited storage and unlimited compute, and in one product, in an elegant way, we can scale your applications, which is a very stark contrast to your existing single node systems or legacy systems, the Oracles, the SQL Servers, the Postgreses of the world, where the amount of storage and compute is limited. Now, this is too general, so what is the next iteration of answering that question? The second level is, let's look at very specific workloads, but still wearing a technology hat. The first one is real time analytics. You have data coming at you either through an application or through a message bus such as Kafka, or Kinesis in the cloud, and you need to provide insights into that data in real time. It's supremely important to know what's happening with that data within maybe milliseconds or seconds or minutes, but, you know, in a short period of time. We call this workload real time dashboards. Real time dashboards are important in financial markets, they're important in health care, they're important in packaged goods and in web 2.0. So real time dashboards are something that MemSQL does incredibly well. That's one workload. The second workload is a web 2.0 workload that requires scale because of concurrency, and that's where giving you infinite compute allows you to build your system with much fewer moving pieces. You don't need to cache.
You don't need to shard your database. You can just deploy MemSQL, and it will drive an arbitrary level of concurrency for your web 2.0 or mobile application. So we have a number of workloads like this as well. Specifically, the moment you add reporting into an application like this, you have a much easier way of burning compute. For example, in the financial markets, there are a number of use cases such as portfolio analysis or wealth management, where you log in and you see your assets and reports, and generating reports burns a nontrivial amount of compute. So if you want to deliver a great user experience and lower latencies, you actually need a scalable system. That's another class of applications, where there's a lot of concurrency on the other end. And then finally, there is a set of applications that are closer to data warehousing and data marts, where you just need to produce a report and you need to produce that report quickly. In those workloads, the advantage over data warehouses is really that you can do it on top of the same data that powers your applications. You don't need to extract your data and put it into the data warehouse; you do it on the data that sits in the database. So the advantage is really simplicity. That's more from the technology standpoint. And now, finally, there are use cases by industry.
That's the third layer. We're most successful in financial services and media, but recently, for example, we also see health care coming more and more onto the platform. The specific use cases by industry are different. For example, in financial services, we see wealth management, portfolio management, and risk analysis taking a substantial portion of the workloads that we support. In media, it's all about advertising and streaming: understanding the quality of service, understanding the performance of your ads, doing that in real time, and really driving decisions that will change the parameters of your advertising campaigns based on what's actually happening. And finally, across many, many industries, but particularly in media, it's real time reporting of the performance of your campaigns, or things like pricing and billing. There are companies that have massive big data workloads, and in order for them to provide billing that's up to the second, you require something like MemSQL; nothing else will work.
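To make the real time dashboard pattern concrete, here is a minimal sketch of the kind of aggregate query such a dashboard might run directly against the live, constantly updated table. The table and column names are hypothetical, not taken from the episode.

```sql
-- Hypothetical real-time dashboard query over live ad impressions,
-- aggregating the last five minutes of data as it is being written.
SELECT campaign_id,
       COUNT(*)  AS impressions,
       SUM(cost) AS spend
FROM impressions
WHERE event_time > NOW() - INTERVAL 5 MINUTE
GROUP BY campaign_id
ORDER BY spend DESC
LIMIT 20;
```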
[00:15:18] Tobias Macey:
Yeah. And what you were describing, of being able to build a transactional application on top of a database and then be able to build reports on that same data: if you're not using a system that can support both of those workloads in tandem, then you would generally be doing something like change data capture, piping through maybe a Kafka topic that then lands into a data warehouse, to be able to have some transformation that rolls it into an aggregated report, which automatically adds latency because of all the additional steps. Whereas being able to run across the same records as they're being written reduces the overall amount of time and effort in terms of building the reports and in terms of supporting the data life cycles.
And I also know that MemSQL is built to be multi tenant, so I'm wondering if the typical case is that one deployment of MemSQL will be used for backing multiple different applications, and then reports are built to be able to combine the data across those applications
[00:16:24] Nikita Shamgunov:
in that engine, or if it's something where an additional step is necessary to be able to aggregate the data across those different applications because of some boundaries in place due to the multi tenancy? Yeah. So this is a loaded question, so let me answer it one part at a time. First of all, you're absolutely right: in many cases, in order to provide reporting on top of the operational data, you need to set up an architecture that looks like having change data capture, piping it through an ETL tool or, in your example, Kafka, and pushing it into a data warehouse.
The data warehouse is not transactional, so we need to be very diligent in making sure that the data that goes into the data warehouse is in fact a real copy of the data produced by the operational system. You can't just apply a change data capture log from a high throughput, high performance transactional system to a data warehouse. The second question was around multitenancy, and we see a mix, actually. There's a class of workloads which we call tier ones, and tier 1 workloads have very, very strict reliability and performance requirements.
They usually have to run 24 by 7, the uptime has to be five nines, and the latencies, especially the p90 latencies, have to be very, very consistent. So what ends up happening is that tier 1 workloads are typically run in separate clusters of MemSQL, because they're so important to the business that any disruption that would affect those p90 latencies, or any disruption caused by bringing new applications onto the platform, cannot be tolerated. And then for tier 2 workloads, that's precisely right: people bring a lot of applications onto the platform and drive a lot of ROI through consolidation.
Because MemSQL is scalable and gives you so much compute and storage, you can just bring a lot of applications onto a single platform. From the product standpoint, that's a motion we're very, very excited about, because what we see in a typical enterprise is a problem that we call proliferation of data stores. In a very typical conversation with something like a big bank or a very large enterprise company or a very large, you know, world class technology company, when you talk to their DevOps people, they say, look, it looks like our business team is falling in love with your technology, but we are a DevOps team, and we're currently supporting, you know, 40-plus databases.
If we bring you in, something needs to go. And this is a great conversation because we typically see several, not just 1 technologies, especially legacy technologies disappear once people onboard,
[00:19:25] Tobias Macey:
MemSQL. And in addition to being able to run both the transactional and the analytical loads on a single engine, there are also additional features built into MemSQL itself, such as an ingestion pipeline for being able to consume data from external sources and pull it in for building those reports.
[00:19:46] Nikita Shamgunov:
And I'm wondering what the benefits are of having that built into the database engine versus using some of the preexisting platforms and tooling such as the ETL systems or streaming data systems such as Kafka or Spark? Yeah. This is a great question. And you you can either, put a purist hat when you start looking at this, from the product standpoint and say, look. We're we're building a perfect engine, and we're gonna focus on the engine and then hope that the ecosystem will build tools around that engine to to use it efficiently. And then there's reality. And reality is such that, the market and the set of technologies are moving so fast that the customer is confused. There's too much stuff, and that stuff is not necessarily working very well. I guess there are clusters of technologies that are working very well together, some around clouds, some around the Hadoop ecosystem, some around kinda old school legacy databases.
But the observation here is that in the past, we had a well defined interchange protocol called ODBC, and once you implement ODBC, you're automatically compatible with the set of tools that matter. In the past, those tools were mostly, you know, databases, data warehouses, ETL tools, and caches. If you look at the data landscape now, you realize that a lot of data sources have become distributed. We're living in a world of distributed primary storage such as enterprise storage, secondary storage such as object stores, and distributed message buses like Kafka and Kinesis. There is no such thing as a distributed ODBC, which means that you have a choice: you either wait, or you bring those ingestion capabilities in as a first class citizen, which for the customer means simplicity.
If you have data coming from Kafka, or you have data in S3 or HDFS, you can literally type one command in MemSQL, and this data is flowing into MemSQL in real time with arbitrary scale. It is very, very hard to achieve the same level of integration and performance by relying on a third party ecosystem, which, frankly, is very different from customer to customer. Some people want Informatica, some people want Spark, some people want MapReduce jobs. So that's really why we pulled in data ingest as a first class citizen in the engine.
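As a rough illustration of that "one command", here is a minimal sketch of a MemSQL pipeline pulling a Kafka topic into a table. The broker address, topic, table, and format options are hypothetical assumptions, and the exact syntax accepted depends on the MemSQL version.

```sql
-- Target table for the incoming events (hypothetical schema)
CREATE TABLE clicks (
  user_id BIGINT,
  url     TEXT,
  ts      DATETIME
);

-- Declare a pipeline that continuously loads the Kafka topic into the table,
-- assuming comma-separated records on the topic
CREATE PIPELINE clicks_from_kafka AS
  LOAD DATA KAFKA 'kafka-broker.example.com/clicks'
  INTO TABLE clicks
  FIELDS TERMINATED BY ',';

START PIPELINE clicks_from_kafka;
```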
[00:22:12] Tobias Macey:
And what are some of the typical ways that you see your customers using that ingest capability built into the engine, where they don't necessarily have to bolt on these additional systems to be able to provide additional data sources to the primary storage system that they're using within MemSQL?
[00:22:37] Nikita Shamgunov:
Yeah. Great question. Number one is simply loading data from S3 and HDFS. This is not particularly groundbreaking, but it's incredibly easy and incredibly performant to do in MemSQL. The second one is extending that capability to supporting Kafka. So now you can just point at Kafka and say, hey, move that data into a table in MemSQL, and we'll do it, and we'll do it with great performance. You know, we have customers that are moving data into MemSQL tables at multiple gigabytes a second. And then the third tier is actually grabbing data from a streaming data source such as Kafka and moving it into a stored procedure.
So now there's control flow, and you can grab a data stream, transform it, and write it back into MemSQL or write it back into S3 or HDFS. That supports a large variety of streaming workloads, workloads like: I wanna scrape Amazon prices, which are constantly changing, and I wanna match those prices, or I wanna make sure I beat those prices for the most traded products on Amazon. With MemSQL, it's trivial to implement that application, as long as you figure out how to scrape the prices and put them into Kafka. Once they're in Kafka, you can grab that data stream, move it into a stored procedure, and run a bunch of SQL commands. And since it's a transactional system, you can keep running counts, or you can store all the historical data and craft reports over that historical data with very high performance. So, fundamentally, in order to build such a high value application, you need just a few moving parts.
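A hedged sketch of that third tier, assuming the MemSQL 6.5-style pipelines-into-stored-procedures feature described here: the topic, table, and transformation are invented for illustration, and the format clauses for parsing the Kafka records are omitted.

```sql
-- History table the procedure writes into (hypothetical)
CREATE TABLE price_history (
  sku     TEXT,
  price   DECIMAL(10, 2),
  seen_at DATETIME
);

DELIMITER //
-- The pipeline hands each batch of records to this procedure
-- as a QUERY-typed parameter that can be used like a relation
CREATE OR REPLACE PROCEDURE record_prices(batch QUERY(sku TEXT, price DECIMAL(10, 2))) AS
BEGIN
  INSERT INTO price_history (sku, price, seen_at)
  SELECT sku, price, NOW() FROM batch;
END //
DELIMITER ;

CREATE PIPELINE prices_from_kafka AS
  LOAD DATA KAFKA 'kafka-broker.example.com/prices'
  INTO PROCEDURE record_prices;

START PIPELINE prices_from_kafka;
```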
[00:24:28] Tobias Macey:
And on that ingestion path, do you automatically introspect the schema of the data records that are flowing in, or do you require some integration with, for instance, the Kafka schema registry for being able to determine the shape and types of data that are coming in? And then lastly, I'm curious what the serialization formats are that you support for that ingestion workload.
[00:24:50] Nikita Shamgunov:
So this is evolving. Right now, we support CSV and TSV formats, Avro, and JSON, and in the future we'll be adding Parquet, which seems to be one of the most requested features. In terms of the schema, MemSQL is a relational database, so data is structured, but it also supports semi-structured data: we support a JSON data type, and when we store JSON in MemSQL, we automatically parse it, shred it, and store it in a column store. What that gives the user is the ability to have as much structure as they need and want, but at the same time it doesn't sacrifice performance. So we don't force the user to use a schema, but there are certain advantages to using a schema, mostly around discoverability and compatibility with BI tools: if you schematize your data and then you attach a BI tool, you clearly see what's there, you can easily craft visualizations, and it simplifies building apps as well. As for the Kafka schema registry, that is something that is on the road map, but not currently supported in MemSQL 6.5.
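A minimal sketch of how that semi-structured support looks in practice, assuming MemSQL's JSON column type and its :: extraction operators; the table and fields are hypothetical.

```sql
-- Columnstore table with a semi-structured JSON payload (hypothetical)
CREATE TABLE readings (
  device_id BIGINT,
  payload   JSON NOT NULL,
  KEY (device_id) USING CLUSTERED COLUMNSTORE
);

INSERT INTO readings VALUES (1, '{"temp": 21.4, "unit": "C"}');

-- Pull typed fields back out of the JSON payload
SELECT device_id,
       payload::$unit AS unit,   -- extract as text
       payload::%temp AS temp    -- extract as double
FROM readings;
```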
[00:26:07] Tobias Macey:
And moving on to the structure of the database itself, I'm wondering if you can discuss how it is architected internally and how that has evolved over the years that you've been developing it since that first 1.0 release.
[00:26:22] Nikita Shamgunov:
Yeah. Definitely. So a lot of things have stayed the same since the 1.0 release, and a lot of things have evolved. One of the things that we quickly realized, back to my comment around RAM prices, is that RAM is not enough. We made a decision back then to invest in very dense on-disk data structures that would support reporting as well as operational workloads. We had a choice back then of what kind of data structures to introduce, and we landed on column stores. In hindsight, I think this was a very, very good decision, but it took us years to get it to a place where we're really happy. I think it's in a good spot right now, but wait for another year or so, and it's just gonna be magical. So the architecture is a distributed architecture, and one of the hard parts is transactionality.
The other hard part is query processing, because query processing, in order to be efficient, requires you to move data around the cluster as you execute a specific query. And as opposed to a single node system, where one of the hardest components is the query optimizer, in a distributed system it's still one of the hardest components, but the number of choices is much greater than it is in a single node system. On a single node system, you think about join order; it's kind of one of the key components of a query optimizer. Then you think about rewrites, you know, decorrelating subqueries and whatnot. In a distributed system, the number of choices for your joins is greater, because you introduce the notion of a broadcast join, where you take a smaller table and send it to all the nodes in the cluster to perform a join, or a shuffle join, where you move data in the system before you join two tables. You can also say, oh, the dataset is relatively small, and I'm just gonna bring all that data onto one node and perform the join there. All of those are choices, and the query optimizer, working in tandem with the query execution system, has to make those choices, cost them, and, at the end of the day, deliver on a wide variety of workloads, starting from very simple ones, you know, kind of key value workloads, to mixed workloads, to mostly analytical and reporting workloads.
So the architecture of the system has to be designed so that storage, query optimization, and query execution all work in tandem in such a cluster.
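As a rough illustration of those distributed join choices, here is a hedged sketch using hypothetical tables. Whether the optimizer picks a colocated, broadcast, or shuffle join depends on the shard keys, table sizes, and MemSQL version, and the real EXPLAIN output will vary.

```sql
-- Both tables are sharded on the join key, so this join can be performed
-- locally on each node; change the shard keys and the optimizer may choose
-- a broadcast or shuffle (repartition) strategy instead.
CREATE TABLE customers (
  id     BIGINT,
  region TEXT,
  SHARD KEY (id)
);

CREATE TABLE orders (
  id          BIGINT,
  customer_id BIGINT,
  total       DECIMAL(12, 2),
  SHARD KEY (customer_id)
);

-- Ask the optimizer how it plans to distribute the join
EXPLAIN
SELECT c.region, SUM(o.total) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.region;
```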
[00:29:13] Tobias Macey:
And given that it is distributed by nature, I'm wondering where it falls along the axes of the CAP theorem, since there are some gradations along there that you can optimize for?
[00:29:25] Nikita Shamgunov:
Absolutely. Well, it's a CP system. Right? We sacrifice a little bit of availability to deliver on consistency and partition tolerance. In practice, there's a trade off: do you push complexity onto the application developer and deliver on the A part of the CAP theorem, or do you build a consistent system? It has to be partition tolerant, of course, because, you know, the greatest sin of a database is to lose data or produce wrong results. So that's really the choice. And in practice, and we know that by running tier 1 workloads for some of the most demanding customers, 100% availability is impossible. Even if the architecture allows it, there are certain things that you cannot control, so you cannot build a 100% available system. It's actually a very natural choice to build a CP system, because by doing things like adding more resiliency on the hardware side, adding more resiliency in the software, and adding auto healing features,
you actually drive the total availability of the platform way above the five nines. And going beyond that is pretty much impractical, even if you have a perfect architecture and an AP system.
[00:30:50] Tobias Macey:
And as you mentioned, the disk storage is column oriented. And when reading through the documentation, I noticed that the in memory representation is row oriented. So I'm wondering how much overhead is introduced by converting between those representations, between the on disk and in memory representations,
[00:31:10] Nikita Shamgunov:
and the benefits of having those different views of the underlying data? This is a very technical question, and I'm gonna give a very, very technical answer as well. First of all, when you run a lot of updates and a lot of inserts into a column store table, it does make sense to have a row store representation for that highly volatile part of the data. Then, on the write code path, there is a certain amount of overhead in flushing that row store data into the column store. Surprisingly, the bottleneck there is IO, not CPU, and the reason for that is that CPUs are actually very performant and there are fantastic compression algorithms that are very, very cheap on the CPU. So this is part one; there are probably going to be about five parts to answering that question. The second part is: okay, we run a query against a column store table, and we're sending the results back, and we use intermediate computations to produce the results of that query. If it's a reporting query, the amount of data you process is drastically more than the amount of data you send back, and in this case column stores are the perfect representation, because we keep that vectorized representation throughout the query execution pipeline. Each database operator is running computations on top of what we internally call vector projections.
If you want an analogy in the open source world, the project called Apache Arrow comes to mind. So all the internal query processing runs on vectors, and then at the very end, we convert that representation to a row store representation, because the client expects data coming row by row. There's one scenario in which this doesn't work that well, and that scenario is when you're trying to pull a lot of data out of the system. The typical place people hit it is in data science: they have Python tools, and all they want is to run a simple query, maybe with a predicate and whatnot, and get data out of the system and move it into a Python library or Spark. With Spark, we can do it in parallel, so the problem there is not as pronounced.
But in Python, we need to grab a lot of data in the columnstore format and move it to a Python library that's also processing data in a columnar format, and in the middle there is a client that consumes data row by row. So this is one scenario where I feel like we could do a better job, and there is real overhead converting from the column store representation to a row store representation. Then there is another workload where you run a high volume of updates, and in this case our caching systems and our row store representation are a perfect fit. We have customers running 10,000,000 updates a second against our row store.
And we've basically given you unlimited capacity for workloads like this; that's again where the row store works super, super well. So having those two representations, row store and column store, allows us to cover a wide range of workloads, and depending on the workload, you can get the best of both worlds. And given this
[00:34:46] Tobias Macey:
duality of the representation for the data, I'm wondering how that translates into any special approach necessary for data modeling and building the schema for the data store or if because of the fact that it's largely SQL oriented and relational in nature, you would just use the traditional techniques and mechanisms that anybody familiar with legacy and traditional
[00:35:13] Nikita Shamgunov:
SQL databases would be familiar with? Yeah. It's a very good question. So we currently give you a few choices. We give you the ability to create a column store table that will reside on disk and be cached in memory, and we give you the option of building an in-memory row store table. We're working very hard on merging the two over time, and the reason for us to do it is just simplicity: by giving you that choice, we're increasing the cognitive overhead for a developer, who has to understand her workload one notch deeper to make those choices. And then we also give you a knob to define how you distribute data across the cluster, and it's up to you whether you want to use that knob or not.
Beyond those, it's your standard SQL, and it's query optimization and query execution that make sure you get a great out-of-the-box experience with the product. Are we done? Well, in the future, I believe databases will get to a place where there are no choices: you just create tables, and you don't think about capacity. It's just a cloud API, which is SQL; you create tables and run queries, and what you pay for is basically latencies and the volumes of data stored and processed. We're still far away from that future, but we're absolutely driving towards it.
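To make those choices concrete, here is a minimal sketch of the two table types and the distribution knob as they looked in MemSQL of this era; the schemas are hypothetical.

```sql
-- In-memory row store table: good for high-rate point reads and writes
CREATE TABLE sessions_live (
  session_id BIGINT PRIMARY KEY,
  user_id    BIGINT,
  started_at DATETIME
);

-- Disk-backed column store table: good for reporting over large history,
-- with an explicit SHARD KEY controlling how rows spread across the cluster
CREATE TABLE events_history (
  user_id    BIGINT,
  event_time DATETIME,
  event_type TEXT,
  SHARD KEY (user_id),
  KEY (event_time) USING CLUSTERED COLUMNSTORE
);
```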
[00:36:48] Tobias Macey:
And in terms of the write path for the data, as distinct from the read path: obviously, you have it stored on disk, and then you do that translation to the in-memory representation. But for writing the data, I'm curious what the overall life cycle is, given that it is an in-memory-capable database and you do have that transaction log for persistence, and how it flows from somebody issuing that request through to it actually being fully persisted on disk?
[00:37:17] Nikita Shamgunov:
Yeah. So let's say you run an insert statement into a table. When you do that, the data moves onto one of the nodes in the database and gets replicated to another node, and once that happens, you get an acknowledgment on the client that the write went through. With that, you will have a transaction log on every node registering that record, and that record is appended to the transaction log, so the data is not going anywhere; now it's in the database. Then, internally, that record will be represented as an in-memory record in that table, and there is a background process that, once the amount of data in the in-memory representation exceeds a certain threshold, will perform a conversion of the in-memory representation to an on-disk representation and flush it to disk. So that's really the life cycle. And
[00:38:17] Tobias Macey:
1 of the main aspects that you promote for MemSQL is the speed and performance of the overall system. And I'm wondering, what are some of the techniques that you use for being able to optimize for that overall system speed and performance across the cluster? That's that's a good question. There are a lot of them. So let me
[00:38:38] Nikita Shamgunov:
bucket them into several categories, and I will also give you an idea of what the impact of each bucket is. The first bucket is the in-memory representation and code generation. In that bucket, if your data is in a row store format in memory, it's trivial to go and navigate to an individual row, so that takes very little overhead. But in the context of a SQL query that extracts just one record or performs a certain computation, one of the surprising aspects is that you burn compute on evaluating scalar expressions. And that's where query compilation, where we compile queries into machine code, drives a lot of value.
That's, I would say, bucket number one. The second bucket is network optimizations. When you move data across the cluster, if you do it in a naive way, you will burn a lot of compute on serializing and deserializing data before and after you move it across the network, so that code path has to be highly optimized. I would say this is an example of a systems approach to building a database, and the systems approach is understanding where the bottlenecks are and implementing those paths carefully and efficiently. In order to drive value in that bucket, you're just looking at a lot of profile data, running multiple workloads across the system, understanding what the highest offender is, and fixing that. And finally, there is a bucket around compression and vectorization, and that applies mostly to column stores and reporting. That's the most architecturally well designed and beautiful part of the system, which performs computations on top of compressed and encoded data. That is specifically applicable for larger reporting queries, for predicate evaluation, and for evaluating joins.
So 1 of the recent benchmarks that we published was, you know, trillion rows, a second benchmark. That's only possible by having a very careful implement implementation of operations on compressed and encoded data. And, we obviously didn't just do it for that specific query. We built a a general purpose system that uses, those techniques,
[00:41:12] Tobias Macey:
and drive performance to to the variety of of workloads. I have a question here. I'm not sure if it is an interesting 1 or not about the sort of ratio of open source versus proprietary code used within the project.
[00:41:24] Nikita Shamgunov:
Great. So we're very, very practical. The majority of MemSQL... And how much of the total implementation
[00:41:31] Tobias Macey:
of memSQL is done using custom built code versus open source projects that are leveraged within the engine? I never ran the count. Certainly, we,
[00:41:41] Nikita Shamgunov:
we use open source software when it makes sense. The majority of MemSQL, I would guess about 95 to 97%, is built in house. However, there are particular pieces of open source code that we use in the engine. Specifically, from Postgres we use a parser and a binder. Everybody understands what a parser is: you get a new SQL query, and you need to understand what's in the query, so we need to parse it. The binder is another component, adjacent to the parser, that takes the query and binds it to your metadata, so if you reference a column in your SQL query, you need to know which column that is in your metadata catalog. We also use a few libraries in our column store representation, things like LZ4 for compression purposes.
And we use a geospatial library from Google, I believe it's called S2, for geospatial capabilities. So that's about it. The rest of the engine, query optimization, query execution, storage, replication, and all the SQL support and built-ins, is built in house.
[00:42:52] Tobias Macey:
And 1 thing that we didn't touch on yet is the fact that the interface for MemSQL uses the MySQL wire protocol, and I'm wondering what you see as being the benefit of choosing that dialect for the communication layer. And if you have any thoughts of the level of difficulty or practicality of implementing additional wire protocols for people who want to be able to use memSQL as a drop in replacement for other database engines? This is a great question. And people often confuse wire protocol and syntax.
[00:43:26] Nikita Shamgunov:
So the wire protocol is actually very, very straightforward. You connect to a database, you have a handshake, and then the protocol is: I send you a string, and you give me results back. And then there's syntax, and the syntax is, well, what's actually in the string? So compatibility certainly has two parts to it. One is compatibility around the wire protocol, where my strong opinion is that it doesn't really matter what the protocol is; you want to choose something that's the most ubiquitous and the easiest to install.
When we started, we had the option of MySQL or Postgres as the wire protocol, and we chose MySQL because MySQL, at the time, was the most popular relational database in the world. It actually still is, but Postgres is growing rapidly; in 2018, we would have a lot more pull to choose the Postgres wire protocol. Then there is database compatibility, or app compatibility, and 100% compatibility is only possible if you are using the full top part of the database. So something like AWS Aurora would be compatible with MySQL simply because they run the top part of MySQL, and by the top part, I mean everything but storage. Now, if we were to choose a Postgres wire protocol or any other protocol, there would still be two parts. Right? What do you connect to MemSQL with?
And second, is it actually compatible with 100% of Postgres applications? That's where you have a trade off. You either say, I'm going to innovate on query processing, and for that I need to build it from first principles and from the ground up, or, I will be driving towards app compatibility and no surprises for an application developer, and in that world you have to reuse the full code, and building a distributed query processor is actually a very, very tricky and hard proposition. So we chose the former. We chose to innovate on query processing, and that's why we're not 100% compatible with any other database.
We support SQL, but if you want to move your application from MySQL or Postgres or Oracle, you will have to put in some work. Not much, but that work will be required. And for somebody who is interested
[00:46:06] Tobias Macey:
in moving to using MemSQL, I'm wondering what's involved in actually getting an instance set up and deployed and integrated with their application and environment.
[00:46:17] Nikita Shamgunov:
Setting up and deploying is trivial. We do a very good job there; once you commit to a certain size of cluster, it's basically a push-button experience, and MemSQL will be running on that cluster. MemSQL has no dependencies, so it runs as a Linux binary and stitches itself into a cluster. Moving an application really depends on how much that application has ties to the database. We're working with some large customers moving data from Sybase and from HANA, and the complexity of that work usually depends on the complexity of the stored procedures. In the past, we didn't support stored procedures, and we couldn't even touch those workloads.
But now that we do, there is a migration path for moving the stored procedures, because our stored procedure language is familiar yet modern, whereas those legacy engines made certain choices in their languages that just don't make sense in the modern world anymore. So as we look forward to providing a great developer experience, building new applications is very, very nice. Migrating legacy applications will incur some pain, mostly around stored procedures, but those are very well understood pains that can be worked through with some effort, and if you need help from professional services at MemSQL, it's a very easy proposition.
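For a sense of what a migrated procedure might look like, here is a minimal sketch in MemSQL's procedural SQL dialect; the table, procedure, and business logic are hypothetical, and the exact syntax accepted depends on the MemSQL version.

```sql
-- Table the procedure writes into (hypothetical)
CREATE TABLE trades (
  account_id BIGINT,
  symbol     TEXT,
  qty        INT,
  traded_at  DATETIME
);

DELIMITER //
-- A legacy Sybase or HANA procedure might be rewritten into something like this
CREATE OR REPLACE PROCEDURE record_trade(_account_id BIGINT, _symbol TEXT, _qty INT) AS
DECLARE
  _now DATETIME = NOW();
BEGIN
  INSERT INTO trades (account_id, symbol, qty, traded_at)
  VALUES (_account_id, _symbol, _qty, _now);
END //
DELIMITER ;

CALL record_trade(42, 'ACME', 100);
```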
[00:47:46] Tobias Macey:
And what have you found to be some of the common points of confusion or common difficulties that your customers encounter when they're either migrating to MemSQL or building on top of it with a new project? Well, it's basically understanding
[00:48:01] Nikita Shamgunov:
the distributed nature of the system. Once it clicks, it becomes, we we start to have a lot of fans on the other side of the, of the enterprise. We call them champions. So getting around, and making a leap from a single node system to a distributed system is the hardest. And once you're there and once you realize the value, then not only we have kinda passionate fans in the enterprises,
[00:48:29] Tobias Macey:
they tend to get promoted too. And going back a bit to earlier, you were saying how MemSQL provides, limitless compute and storage. And I'm wondering what the capacity is in terms of auto scaling natively to the database versus how much of that is necessary to be implemented as another layer that the operations team manages? And then also, I'm curious what the licensing model is in terms of the way that the in memory capability plays into the overall system. Great question. So,
[00:49:03] Nikita Shamgunov:
So MemSQL is scalable but not fully elastic yet. There are limits to how far you can expand the cluster from the starting point. Typically, if you commit to a cluster of, let's say, five nodes, you will be able to comfortably grow it to about 20 nodes, but not to a thousand nodes. But that's basically a feature, and that feature is on the road map; a few customers actually want to have that level of elasticity, and that's why it's on the road map and will, in the future, make MemSQL fully elastic.
[00:49:43] Tobias Macey:
just curious about the licensing model as far as in memory versus on disk and, how that plays in. Yeah. And the licensing mode model is per node. So it's, it's very very straightforward.
[00:49:55] Nikita Shamgunov:
So you basically take the number of nodes and multiply it by a number. And what have you found to be some of the most challenging
[00:50:02] Tobias Macey:
or unexpected aspects of building and growing the technical and business aspects of MemSQL?
[00:50:09] Nikita Shamgunov:
There are a few. One is, we didn't know viscerally what it takes to support tier 1 workloads and what kind of pressure it puts on the quality of the technology and the software, as well as on the capability of the organization: the support, customer success, and 360-degree customer experience that you need to have to make your tier 1 customers successful. So it was a journey, and that journey involved waking up at 4 AM, giving CIOs and CTOs on the other end my personal phone number, and building out multiple levels of support, in-house escalation paths, and remote offices around the globe, to have a very quick, basically instant response to our most demanding customers and some of the world's most demanding applications.
So that's that path, and anybody who is building a strong operational database with tier 1 capabilities will have to go and walk it. The path is not easy, but we are well underway. And in a way, that was a bit of a surprise to me, I have to say. And given the strengths and flexibility of MemSQL,
[00:51:43] Tobias Macey:
it's, easy to fall into the trap of thinking that it will solve everyone's problems. But when is it the wrong choice for a particular data workload or a data platform implementation?
[00:51:54] Nikita Shamgunov:
At this point in time, MemSQL works very well when you have a scale problem. There are workloads for which MemSQL is overkill, and you're better off with some sort of commodity database such as Postgres. This will change over time, where MemSQL will take over those workloads as well, simply because it will provide a better developer experience, simplicity, and cost. But at this time, you know, those very simple departmental databases where you don't expect much from the database itself are better served by something like Postgres.
[00:52:29] Tobias Macey:
And what are some of the features or improvements that you have planned for the future of MemSQL, both the product and the company?
[00:52:38] Nikita Shamgunov:
Great question. So we want to build a great database company. And the way we think about features is, you know, what are the things we need to build to keep our existing customers happy and keep advancing MemSQL inside those enterprises? MemSQL is an enterprise company. And what are the features we add to capture new workloads and new markets? The end state is where, like I said, you have a cloud service, you have a SQL API to it, and that SQL API has endless capacity, and you can pay for better latencies and better throughput. So on the way towards that vision, we're investing in managed services and Kubernetes integration, and we're constantly strengthening the engine itself: making it more elastic, making it more robust, adding the ability to run across multiple data centers, and just driving quality and developer experience. And are there any other aspects
[00:53:45] Tobias Macey:
of MemSQL, the project, or the data management space in general that we didn't cover yet, which you think we should discuss before we close out the show? Well, I think there's there's a lot of things in data managements that are happening in the intersection of things,
[00:54:02] Nikita Shamgunov:
on various boundaries. One is the cloud to on-premises boundary, where I think Kubernetes is one of those technologies that will start supporting data infrastructure, not just stateless apps, and we're looking forward to that. There's another boundary in data management between databases and data warehouses, where we believe that over time we will live in more of a converged world where you just have, you know, a SQL API to your data. And there's another boundary between data science and data management tools, and the requirements at that boundary keep rising, especially with AI and ML workloads.
And I think we're just at the beginning of that world, and a lot of innovation and a lot of changes are gonna happen around that boundary. AI and ML workloads are rising exponentially, but they need to store data somewhere, and there is no default choice for that just yet. So that's an exciting world, and we're absolutely gonna be part of it. And for anybody who wants
[00:55:11] Tobias Macey:
to get in touch with you and follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Well, I think it's it's that ability to seamlessly scale your workloads.
[00:55:31] Nikita Shamgunov:
So I remember using MySQL for the first time in my life, back in the nineties, and it felt like magic. It felt like there was a technology that, you know, I could throw my data into, and it's safe, it's secure, and it's fast. Fast forward almost 20 years, and we need similar tools for exponentially larger datasets, where when you run those commands and operate on those datasets, it feels completely fluid and effortless, and we're just not there. Once you start operating and managing terabytes and petabytes of data, that fluidity is gone, but there's a tremendous amount of value in supporting such workloads.
So I think there is a gap there, and and and we as a technology industry are are are are not quite there, and we're working on it. Alright. Well, thank you very much for taking the time today to discuss the work that you've been doing with MemSQL.
[00:56:35] Tobias Macey:
It's definitely a very impressive product and 1 that seems to be, driving a lot of value for a lot of companies. So thank you for that, and, I hope you enjoy the rest of your day. Thank you,
Introduction to Nikita Shamgunov and MemSQL
Founding Insights and Early Challenges
Use Cases and Customer Applications
Ingestion Capabilities and Data Integration
Architectural Evolution and Technical Details
Data Modeling and Schema Design
Wire Protocol and Compatibility
Challenges and Future Directions