Summary
Aerospike is a database engine designed to provide millisecond response times for queries across terabytes or petabytes of data. In this episode Chief Strategy Officer Lenley Hensarling explains how the ability to process these large volumes of information in real time allows businesses to unlock entirely new capabilities. He also discusses the technical implementation that allows for such extreme performance and how the data model contributes to the scalability of the system. If you need to deal with massive data, at high velocities, in milliseconds, then Aerospike is definitely worth learning about.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription.
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold’s proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- Your host is Tobias Macey and today I’m interviewing Lenley Hensarling about Aerospike and building real-time data platforms
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Aerospike is and the story behind it?
- What are the use cases that it is uniquely well suited for?
- What are the use cases that you and the Aerospike team are focusing on and how does that influence your focus on priorities of feature development and user experience?
- What are the driving factors for building a real-time data platform?
- How is Aerospike being incorporated in application and data architectures?
- Can you describe how the Aerospike engine is architected?
- How have the design and architecture changed or evolved since it was first created?
- How have market forces influenced the product priorities and focus?
- What are the challenges that end users face when determining how to model their data given a key/value storage interface?
- What are the abstraction layers that you and/or your users build to manage relational or hierarchical data architectures?
- What are the operational characteristics of the Aerospike system? (e.g. deployment, scaling, CP vs AP, upgrades, clustering, etc.)
- What are the most interesting, innovative, or unexpected ways that you have seen Aerospike used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Aerospike?
- When is Aerospike the wrong choice?
- What do you have planned for the future of Aerospike?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Aerospike
- EnterpriseDB
- "Nobody Expects The Spanish Inquisition"
- ARM CPU Architectures
- AWS Graviton Processors
- The Datacenter Is The Computer (Affiliate link)
- Jepsen Tests
- Cloud Native Computing Foundation
- Prometheus
- Grafana
- OpenTelemetry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Lenley Hensarling about Aerospike and building real time data platforms. So, Lenley, can you start by introducing yourself? Yeah. I am Lenley Hensarling.
[00:02:06] Unknown:
I'm the chief strategy officer here at Aerospike, and I also manage the product management group. I was brought in to sort of bring a more business perspective to a very technical company. I've worked with the CEO, John Dillon, for a number of years sort of through other things, working with investors. I work now with Srinivasan, who is 1 of the founders, the remaining founder who's here and is 1 of your typical database PhDs from the University of Wisconsin. And we're really, you know, lucky to have him and that deep technical background. Do you remember how you first got involved in the area of data management? Well, you know, it goes back to university probably. I was at the University of Texas at Austin as they spun up their computer science program. There was a guy, Avi Silberschatz, there. If you Google him, he's 1 of the guys, like Ullman, who really was deep into formal theory about databases. And so I got interested in it there. I got interested in some of the thornier problems of, you know, federated database and, you know, multi database and distributed database and such. And worked in file systems over the years and in operating systems. And it became 1 of those itches that you keep coming back to scratch. And so I recently, before Aerospike, was at EnterpriseDB, a Postgres company.
And, you know, Postgres is a wonderfully put together database. The difference is that here, we've got what I would call leading edge technology in the, you know, ultra distributed, you know, real time, do stuff at hyperscale in a single cluster. And so, you know, that attracted me.
[00:03:53] Unknown:
And so that brings us to the Aerospike project. As you mentioned, you came in to help sort of bring a business perspective to the project, and it started off as just a, you know, very heavily technological company and very focused on the specifics of the database engine and how to make it fast and how to make it scale. And I'm just wondering if you can give a bit of an overview about what the Aerospike project is and some of the story behind where it started and how it got to where it is today.
[00:04:20] Unknown:
Big data was coming up. Well, we've been in existence about 10 years, 11 now, I guess. And big data really evolved during that time. And, you know, we really focused on 2 things, and our 2 founders, Brian Bulkowski and Srinivasan, were really focused on how do we get sub millisecond latency, but not just that. To do that at scale, and by scale, they really focused on large datasets, meaning, you know, tens of terabytes, hundreds of terabytes. We've recently released a benchmark with Amazon and Intel for a petabyte benchmark. Right? And to do that at high throughput as well. So how can you do that for, you know, hundreds of thousands of connections to a cluster, you know, concurrently, things like that.
So they found traction in the ad tech market, which was evolving over this same period. You know, The Trade Desk is a customer of ours, and there's a great quote by their CTO. When we asked him a question at 1 of our customer advisory boards, he said, you know, the ability to process more and more data within bounded time windows is gonna result in business models that we haven't yet contemplated. You know? And I wrote that down in a notebook, and I've remembered it for the last 2 years because it's a great quote, but it's really meaningful in terms of what drives our focus.
And we're seeing that happen now in financial services, in IoT, and other places where the more data you can apply against decisioning in a given bounded time window, the better result you can get or the better decision. Right? So that's something that we focused on and are expanding our understanding of where that's applicable, I guess. And there's real technology behind it. You know, a lot of times I look at companies that have evolved in a sort of slow evolution and adoption of technology for other purposes and things. But Brian and Srini really set out to build a different solution and to take advantage of SSDs in a different way. You know, people say, hey, great, solid state storage, you know, they're like drives.
They said they're not like drives, they're like memory. Okay? And they wrote drivers that bypass the operating system and went directly to the SSDs so that we manage the SSD not as block storage but as if it were DRAM. And we get near DRAM speeds by handling the data in the SSD. And we have something called hybrid memory, which is: put the indexes in RAM and put the data in the SSDs, but make that like it's an expanded data space, and really address things like that. And that's, you know, at the single node level, but it goes on from there because they also focused on really doing the distribution and how the clients interact with the clustered environment very differently.
Instead of a quorum, it's a roster based system. And the clients are what I would call a first class citizen in that distributed model. So they know how to go in a single hop, based on the digest, to the data that you're looking for. And that's the fundamental, you know, underpinning of being able to do many things faster: to get to that data very directly. And now, you know, we're taking that same type of approach into secondary indexes. And to scale those, we've expanded into having, you know, all flash solutions, so that if you have indexes that are super large and you have a dataset that's super large, it all goes into flash. The result is that it's possible to support bigger workloads in a cost effective manner. Okay? Which is 1 of the reasons that we'll win in many competitions: because the projection of more data onto a single node means that you don't need as many nodes to cover a given data space, if you will.
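The single-hop lookup described here can be sketched in miniature. This is a toy model, not Aerospike's actual client code: Aerospike is documented to derive a RIPEMD-160 digest from the key and map it to one of 4,096 partitions, but the hash used below (SHA-1, for portability), the node names, and the map layout are all illustrative assumptions.

```python
import hashlib

N_PARTITIONS = 4096  # Aerospike's documented fixed partition count

def digest(set_name: str, user_key: str) -> bytes:
    # The real system hashes set name + key with RIPEMD-160; SHA-1 is a
    # stand-in here since RIPEMD-160 availability in hashlib varies.
    return hashlib.sha1(f"{set_name}:{user_key}".encode()).digest()

def partition_id(d: bytes) -> int:
    # A partition id is derived from low-order bits of the digest.
    return int.from_bytes(d[:2], "little") % N_PARTITIONS

# The client keeps a full partition map (partition id -> owning node),
# refreshed from the cluster roster. With the map in hand, a read is a
# single network hop: no coordinator, no quorum round-trip.
partition_map = {pid: f"node-{pid % 3}" for pid in range(N_PARTITIONS)}

def route(set_name: str, user_key: str) -> str:
    return partition_map[partition_id(digest(set_name, user_key))]

print(route("profiles", "user:1234"))  # goes straight to the owning node
```

The key point is that routing is a pure client-side computation over the digest plus a cached map, which is what makes the single hop possible.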
And that makes a huge difference in terms of how many nodes you have to manage and the cost of managing all of that, but also gives you an ability to scale up as well as scale out. So we really have focused. And, you know, 1 joke I tell is that we're the last group of true system software programmers, focused on really wringing everything they can out of the CPU, the way the buses are constructed, the way the network cards are constructed, and SSDs. And we've taken that same mindset to supporting Optane, you know, or persistent memory as well. So that focus on really exploiting the technology underneath us, and doing that in a way that is cost effective and efficient, is sort of the combination of things that really allows us to do that. In terms of the
[00:09:37] Unknown:
use cases that you and your customers are primarily focusing on, you mentioned that earlier on in its lifespan, it was very popular in ad tech because of their need to very quickly, you know, determine what they wanted to bid for a given advertisement spot, because of the need to be in the hot path for a search request. And now that the cloud has taken a much larger share of the sort of compute capacity and the workloads that people are running, how have the Aerospike product and the customers that you're working with shifted the kind of primary use cases that it's being applied to? You know, Tobias, that's a great question. I like to say we sort of, you know,
[00:10:20] Unknown:
made our chops, if you will, in the ad tech space. But the ad tech space is characterized, and it's evolved to where, you know, I guess, I don't know, 5 years ago, they'd say, we've got a lot of data sources we're applying to this. It's like, you know, in the tens. Now, you know, we have customers that say we're adding hundreds of data sources a month, you know, and putting them into 1 picture of what a user profile is. Okay? Then what's happened is everybody thinks of things as profiles. That's, in some sense, the foundation of IoT, the foundation of, you know, digital marketing, the foundation of really AI to some extent because, essentially, you're creating a profile in real time based on a data stream.
Okay. We capture that data stream, hold it, but then we also take that data that's captured from many, many different sources and put it into large datasets. Okay? And then our customers will... I think the best way I've heard this said was from 1 of our customers. When I first joined Aerospike, I asked him, like, why did you buy the product? Right? I went around to a bunch of customers and said, why did you buy that product? Their first reaction is always, well, it's your product. Why would you ask that question? And I said, because, like, I'm not buying it. You are. And I need to know why you bought it. But he said, oh, it's simple, actually. You know, in the past, we were able to use, you know, hours of information captured from the stream for a given, you know, profile we wanna construct. We wanna match that against, you know, days of information.
He said, now, we take weeks of information and match it against months of information that informs that model through, you know, machine learning and etcetera. Right? And we can match all of that within the same SLA in terms of a time bound window, which is 20 milliseconds or 40 milliseconds. So that model right there applies to any number of things. Right? Every time somebody's trying to figure out what to put in front of someone at the bottom of the screen in ecommerce, you know, the more data that they can apply to that, both data that characterizes the user coming in, but also data that characterizes different cohort groups that they match them up with. And if they can do all of that database matchup and access all that data, but in a very bounded time window, that's become where we excel.
The other thing that's happened is primarily in financial services. Right? So that model I just discussed is used a lot in fraud, if you will, and in identity management. And that's key to financial services that's online. But we've also been taken up by a number of large brokerages and banks to do more real time transactional things. So I'll give you a great example. We've got a customer, can't be named, but, you know, a brokerage. And they used to compute the picture from all the data of their margin business once a day.
Okay? And that was done by the risk management compliance people, and they traded against that risk profile, if you will, both for individuals and for the institution at large. Now, with us, they initially reduced that to a real time picture that they would recompute every 30 minutes. Now they can recompute that in single digit minutes and are trying to get it under a minute. And what that does is compress the risk window of what the unknown is, you know, what they're operating against. And the world changes a lot. You know, if the last 2 years hasn't taught us that... you know, my joke is nobody ever expects the Spanish inquisition, you know.
And that's really what they're fighting against all the time to get a more accurate real time picture of what's going on. But the business benefits of things like that is that I can say yes more often and quicker as somebody's positions change. You know, if all of a sudden I've sold off a lot and I have a lot more cash, and I wanna do a margin trade on something else, that's a very different risk profile than it was when I was holding a bunch of stuff where the price might be fluctuating. And so being able to do things like that in real time is a big change for many, many different types of companies.
You know, you can project this into logistics, IoT, anything, really. And this drive towards becoming real time and applying data to decisioning and having a more real time picture, a more up to date picture, is something that people are doing in any number of businesses across any number of industries, I'd say. You know, we're talking to automotive suppliers, and we're talking to automotive companies, etcetera, right, that are looking at data this way and trying to understand, how can I operate on it in, by some definition, real time? And they're not talking about small amounts of data. Right? Because you wanna be able to, say, project this across a big enough cohort group, right, to get some accuracy, to be able to say, what's my next best action? And that can be true for cohort groups that are people, but also things. You know, we talk about the Internet of Things, and so you've got all this telemetry coming in. Telemetry about the weather, about the grid, about fluctuations in power that are happening, all kinds of things. And everything is instrumented these days, including you and I. Right? I always joke about that. You know, I hold up my phone and say, you know, we're all instrumented.
And, you know, if you have an Apple Watch, you're even more instrumented. But it's amazing the amount of data that's captured by the devices that our devices are near. Okay? So it's that type of, you know, ability to have insights based on massive amounts of data, and combinations of data, that people are continuously innovating on. They say, you know what? If I had this data and this data, and I can correlate that that's somehow, you know, connected, then I can know these things. And so that's going on continuously and changing how we approach business and marketing and everything.
[00:16:52] Unknown:
In terms of the sort of technical architectures that Aerospike is being incorporated into, I'm wondering if you can give some of the typical ways that it is put into the overall data life cycle and sort of where it fits in the overall infrastructure of a given company, as far as how they're storing the data, how they're accessing the data, and what the sort of usage patterns look like for an Aerospike deployment?
[00:17:16] Unknown:
So that's a great question, actually. And, you know, we sort of have 2 major threads of work now. We continue to evolve the database. You know, I mentioned that we're adding secondary indexes, or we're revamping our secondary indexes, to make them even faster and more scalable and allow people to do more things with the database. But we've also made a lot of investment in what I would call the data fabric. Right? And making our database be a first class citizen in the data mesh, the data fabric. And that means really having optimized Spark connectors, Kafka connectors, Pulsar connectors.
We're about to come out with a new thing, which we call ConnectX, and it's sort of based on a change notification mechanism: be able to push out in a very neutral format and support multiple formats, you know, Avro, Arrow, you know, whatever you want, and be able to push that into a data pipeline. Okay? Because there's not 1 fixed point of data anymore. Another great quote from 1 of our customers was that, you know, we used to all look at data and getting value from data by saying, let's have a massive data warehouse or let's have a data lake, you know, and then look back at that static data and try and derive insights from it. Now it's completely swapped. We've got data that's essentially in motion all the time, that creates pools, if you will, where you aggregate some of it, but none of it is constant. It's always changing. It's being updated in real time. And so that means we have to lean into, you know, great support for, you know, Kafka, Flink, you know, Pulsar, you know, etcetera. Right? And so we put a lot of effort into that. People compose architectures then that have many, many different clusters using our database at the edge for real time ingestion of different incoming datasets.
They aggregate that back in, you know, warm stores that are still in real time. It may not be sub millisecond that the access and the computations happen on that, but it's massive amounts up to many petabytes that they can then pull from that dataset in low latency, which might be described by, you know, single digit millisecond for access to 1 piece of data. But within 20, 30 milliseconds because of parallelization, they can pull significant amounts of data in those time windows, make another decision, and that may get passed back on to another point that, you know, might be in Google Bigtable. It might go into, you know, some other data store. It might be, you know, accessed through or go into Snowflake or something like that. But then after that machine learning and the development of the features, as they say, those features are then pushed back to the edge for real time decisioning, so we wind up being a feature store as well.
I'll give you 1 example. A customer of ours uses us pretty prominently for very massive graphs that are used in different ways. And those graphs are developed through ML, sometimes in our database, sometimes in other databases. But once they understand what the graph is, so that they can operate on it in real time, they have to provision that graph. So we're talking about something that in our database would be represented by billions of records with, you know, literally thousands of vertices per record that they can then compute on in these, you know, low millisecond time windows.
And so it's this, you know, sort of feedback loop of many different levels of processing and decisioning and creating models, but then reprovisioning that back to the edge. And we figure in at the points where you have to be able to ingest data really fast, access and make decisions on it fast, but also supply that data back upstream, and then take it coming from upstream back downstream to the edge. You know, that's probably a little convoluted, but I think people that deal in data pipelines will understand; you know, that model will be fairly familiar to them.
[00:21:42] Unknown:
So digging deeper into the specifics of the Aerospike engine, I'm wondering if you can talk to some of the ways that it's architected and the data model that it is designed around and some of the ways that that data model maps into the performance capabilities.
[00:21:58] Unknown:
You know, the data model, I have to say, it's sort of funny because it took us a while to realize that we were a document database as well as other things. Right? We're a key value model, but the value can be a compound document, if you will. Okay? And so what we support is, you know, bins. Think, you know, columns, if you will. But within a bin... well, actually, there's a namespace per database. Right? A cluster can have multiple databases, so we call them namespaces. And then within that namespace, there are bins. Within those bins, we support a map list structure that can be hierarchically nested. Right?
So quite often, our ability to access a significant piece of information, right, that, you know, has a lot to it, in 1 quick read is part of the game that we play. Right? And so those maps and lists have APIs that can go directly to them. The indexing that we have can also help you navigate this as well. So that data structure also is, in some sense... well, not in some sense. It is a superset of something like JSON. So, recently, 1 of the things we've done is put a JSON API on top of it so that you could, in Java, manage JSON like you would with some other databases that support JSON. Now there's a little performance hit to that, but the wonderful thing about us is that if you need, you know, sub millisecond capability, you go directly to our APIs, navigate the bin map list architecture directly, and you can get that, you know, sub millisecond access to the piece of data you want. Okay?
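To make the namespace/bin/map-list shape concrete, here is a toy sketch in plain Python with invented field names. It mimics the idea that a map/list operation can address a nested sub-element by path, rather than fetching and parsing the whole record; the real client APIs differ, this only illustrates the data shape.

```python
# A toy record: the top level is a set of named bins, and a bin's value
# can be a nested map/list document (a superset of what JSON expresses).
record = {
    "name": "alice",                        # simple scalar bin
    "profile": {                            # bin holding a nested map
        "segments": ["sports", "travel"],   # list nested inside the map
        "scores": {"sports": 0.92, "travel": 0.41},
    },
}

def get_path(rec, path):
    """Walk a list of map keys / list indexes down to a nested value,
    the way a map/list operation addresses a sub-element directly."""
    node = rec
    for step in path:
        node = node[step]
    return node

print(get_path(record, ["profile", "segments", 0]))       # sports
print(get_path(record, ["profile", "scores", "travel"]))  # 0.41
```

The win described in the interview is that the server can resolve such a path itself, so the client reads just the sub-element it needs in one round trip.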
And that is kind of the basic model. Now the other thing we deal with is something that every NoSQL database really does. Right? You get people coming from the relational world, and they have a mindset of, look, I'm gonna do joins, and I'm gonna construct a piece of data. I refer to this, and everybody else does, I think, as the denormalization, you know, path. And so it's a different mindset, and people have to learn it. We tend to spend a lot of time with our customers and clients, particularly the ones, you know, that are maybe in financial services. So you've got a group of programmers. They've written to an Oracle database or a DB2 database for a long time, and now they want to get fast and real time.
And then we look at their data models and say, no, you don't have to do that. You construct it into these bins with the map list architecture, and you can do everything you want. And we can update this data or rewrite the records so fast, you don't have to worry about that construction. The thing is that this schema is extensible, if you will. So, you know, from a data model standpoint, that's a lot of it. You know, I mentioned that by compressing many things into 1 record, if you will... I referenced the whole idea of a graph being represented with us so that a record might have thousands of vertices in it that can be navigated. Because what we do is, as I'm saying, swizzle all that into memory. Right? And then you're accessing it very quickly and can navigate that graph, you know, super fast.
And so that's based upon, you know, some of what I mentioned before about the way we handle SSDs as if they were memory. But the way we do that too is that we have a very, very parallelized access model on a per node basis. Okay? And that's based upon really understanding the chipsets we're dealing with and what parallelism can really be effected that way. And then that gives us this ability to project down, on a per node basis, to a greater piece of the data. Now we do have also a partitioning model that spreads all this data across nodes. And that partitioning model, you know, is something that the user doesn't have to necessarily understand. Right?
Because we are constantly distributing those partitions based on access and load balancing behind the scenes so that there are no hot spots, if you will. Okay? And this leads to that ability. We also support a number of applications in some very, very large social media companies. Right? And they like us because we can give this predictable performance across huge numbers of, I guess, users coming in in parallel, and be able to get access to all the data that's available in ways that, as what people are looking for changes based upon what's happening in the world, we rebalance everything without them having to do anything.
Okay? And so that's another thing. Those load balancing algorithms lead to, you know, not only low latency. You know, a lot of people say, we can support sub millisecond or small numbers of milliseconds latency. But when you look at a graph of their performance, there's a lot of jitter, you know. On average, it's 5 milliseconds, but there are spikes all over the place. Right? And those spikes matter. So what we've done in the data layout, in the massive parallelization, in the distribution of the data, and in the load balancing has gotten the variation down to a really, really, really thin variance, if you will.
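The point about jitter can be made concrete with a toy comparison (the numbers are invented for illustration): two latency distributions can have nearly identical means while their tails differ by orders of magnitude, which is why a tight variance matters as much as a low average.

```python
import statistics

# 100 latency samples each, in milliseconds (illustrative values only).
steady = [1.0] * 99 + [1.5]    # low jitter: worst case close to average
spiky = [0.5] * 99 + [50.0]    # similar mean, but an occasional huge spike

def p99(samples):
    # The value that 99% of the samples fall at or below.
    s = sorted(samples)
    return s[min(len(s) - 1, int(len(s) * 0.99))]

print(statistics.mean(steady), p99(steady))  # mean ~1.005, p99 = 1.5
print(statistics.mean(spiky), p99(spiky))    # mean ~0.995, p99 = 50.0
```

An SLA stated as "respond within 20 ms" is a statement about the tail, not the mean, so the spiky distribution fails it even though its average looks better.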
[00:28:01] Unknown:
And that matters a lot for the types of workloads that we're dealing with. You mentioned that you are working very close to the level of the instruction set for the CPUs and the disk architectures that you're working with. And I'm wondering how the sort of rapid iteration at the hardware level, and the increasing adoption of things like ARM architectures, has influenced the ways that you're building the database, and any challenges that that has posed as far as being able to run across potentially heterogeneous physical architectures?
[00:28:36] Unknown:
That's a really great question. And it is, you know, I'll sort of say, hey, it's the burden we signed up for, if you will. What that really means is that our engineers and our architects are constantly looking at these things. You know, you brought up ARM. Clearly, you know, Graviton is gonna be a big deal for AWS. Right? And what I will say is we're gonna be on Intel a little bit longer than some other people because the way we're going to Graviton is not like, hey, there's a compiler that works on it, and we're done. What we do is look at both how the memory handling is done in the chip, how the threading models work, you know, in the chip, understand what the compilers are doing with that, and optimize our code for it. So we've already done ports to ARM.
We understand the differences. We're still working out, you know, some of the things of, like, gee, how do we really wanna use that instruction set to do very specific things? This is a really telling thing about the difference, you know, of Aerospike versus some of the other databases. They're happy to say, you know, we ported it. Hey, it ran. We're good. And it's 40% faster. Well, we don't wanna be just 40% faster based upon the, you know, compiler and the chip. We want to really be fully optimized for that architecture. And, you know, there are things happening in storage as well. You know, that's 1 of the things that we're working on and sort of projecting into the future. You know, right now, we're at, like, you know, 100 gig networks. Right? The level of parallelization that you can get with the network cards... you know, when we talk about the cloud, I always like to think about the bumper sticker that was around, I guess, about 5 years ago. You know, there is no cloud. It's just someone else's computer.
And if you're really into optimizing, you have to remember that you're running on an actual device that's subject to the laws of physics. Okay? And we actually think of it that way. And so when we look at, you know, Graviton and the Intel architecture, Intel slash AMD, right, if you want, we actually think deeply about those things. And when we talk about it, can we do a model that works across the two of them? Can we have, you know, dynamic compilation choices and do things so that we can install something that will be optimized for both? Or will we have to have, you know, slightly different versions?
We will always choose to optimize for that speed and for that scale because we think that's where things are going. That the ability to access more data to give you a higher fidelity model to make a decision in real time is what's gonna, you know, propel the future.
[00:31:37] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values.
DataFold integrates with all major data warehouses as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold. Talking more about the cloud, another interesting element of that is that you are dealing with virtualization layers. And I'm wondering how some of the recent innovations, thinking of things like the Nitro runtime for the AWS environments and just some of the evolution of how virtualization interacts with the physical hardware and the kernels running on those instances, either gives you access or constrains your capabilities of being able to eke out the appropriate level of performance on the underlying systems that people are running in these cloud environments?
[00:33:17] Unknown:
At some point, you know, when I just think about it with the sort of classical computer science lens on, right, I tend to think it's like, look, and I forget who the author was, and I have to apologize for it. But, you know, there's a great article called, you know, the data center is the computer. And I think we all remember it. You can look it up that way. But, you know, that is where things are going, that sort of serverless virtual world. But what constrains that now for high performance is really what the predictable network latency is that's provided within a given zone in a data center that the cloud providers have. And I mentioned before, you know, they're at 100 gig. When they get to 400 gigabits, you know, it's like they're gonna be there in, I don't know, a couple of years.
Then the network becomes more like a bus, if you will, on a chip. And we can really start to think of the data center as the computer. We are already, you know, very, very distributed, you know, in terms of how we look at the world and how we think about breaking down computations on data. You know, we were talking about programming models earlier. There's a lot of demand to allow people to program in SQL against a distributed data store that can scale and be elastic. So one way you can do that is to sort of make every node support SQL.
The other thing is to look at things like Presto, Trino, Starburst, you know, whatever you wanna call it. Right? That's also a distributed architecture that can go against the distributed storage model, if you will, like the one we have, and offer that massive parallelization and distributed nature to scale out as you ramp things. So I think that that's something that's happening now. I think that the storage models right now for the cloud vendors have been optimized for two things, and there's a space in the middle is what I'll say. The two things they're optimized for are, you know, microservices that have an ephemeral storage requirement, that while that service is running, which may be for short bursts, they've got that data store underneath them as SSDs.
But if that node fails, it's not a big deal. You know, you notice this in your interactions online sometimes. It's like, what? It lost some of my context, but I'm still going, and then I have to, you know, reenter something. Well, no big whoop. But if you're talking about a database with massive amounts of data, and, you know, I neglected to say that we support strong consistency in our transactions in this distributed world we've constructed. Right? We passed the Jepsen test. But if you think about a strongly consistent transaction that you want to make sure you capture, well, you know, if the storage underneath you goes away, then that's a really complex thing.
So they have persistent data storage in the back end that's, you know, network attached storage, but the latency there is higher than someone might want. And when we look at how we use hardware in the cloud, some of the vendors have various instance types that have durable SSDs and sizable SSDs, you know, storage optimized instances, etcetera. And some of them have models where we support ephemeral, and then we have this network attached storage and high compute instances. But there's a space in there where you need large attached storage to be able to do real time things without having to have literally thousands and thousands of nodes.
You know, I talked about our efficiency and the cost implications on that. And the numbers are things like this. We will replace, you know, say, open source Cassandra, and it might have 4,000 nodes and be very complex and hard to manage and keep the whole thing up. Right? And we'll replace that with, you know, 150, 200 nodes. And it's because of this ability to have an expanded data space, leveraging, you know, large SSDs that give us terabytes and terabytes per node. So I think, you know, you bring up, you know, one of the things where I said earlier, there's a bit of an impedance mismatch. But I think that the cloud vendors said, you know what?
The first place we went, and this was much like when Linux entered the playing field, it's like, hey. We're gonna run all the web servers. We're gonna run all the microservice applications. And all that back office stuff is still on mainframes or Superdomes or, you know, whatever. And I think now they're starting to say, we want our fair share of the large mission critical transactional workloads, but the modern ones, the ones that are looking for real time performance. And so we're starting to have those conversations. What's that gonna require?
And I think it's two things. Right? One, in the near term, have the instance types that will support real time database workloads. And the second thing is that the data center is evolving based upon progress in networking technology to allow that virtualization that you're referring to, but still do it in a way that will meet the needs of real time workloads.
[00:39:15] Unknown:
Continuing on the subject of the database architecture and running it, I'm wondering if you can speak to the operational characteristics of the system. You mentioned that it is a scale up and scale out architecture. It does intelligent data distribution so that you don't have to do a lot of partition rebalancing, and you don't have to do sort of preemptive partitioning, that the engine will handle that itself. But I'm wondering if you could just speak to some more of the aspects of getting it up and running, you know, managing the clustering, upgrades, you know, where it lies in the CP versus AP continuum, and just some more of the sort of aspects of actually running this as an operator and as an end user of the system? You know, it is a distributed system, and so that presents
[00:39:57] Unknown:
a different model that people aren't as used to is what I'd say. So we have made a number of investments recently in, quote, unquote, making it more manageable. One of those things is Kubernetes. Right? It provides a model for that, with the Kubernetes operator and just the notion of control planes. One of the things we've also done is been working a lot more on observability. Right? So, you know, observability and management. And the observability side is that we have been adding more and more points, and we're adding more intelligence around how that telemetry needs to be interpreted, because it's not a simple, like, hey, here's the computer.
You know, it has, you know, 32 cores on it, and that's what it is. It is a sea of compute and storage, if you will. And so how that's presented to customers. Now a lot of that, we automate because of the design. But there's also this constant conversation going on between all of these nodes and, you know, thousands and thousands and thousands of clients. And those clients are not simple clients. They're really middleware. Right? They're big applications accessing it. So how do you really understand all of that data that's in motion? Okay. And how do you manage it? Because the other thing that we have to realize, particularly about the cloud, is that, you know, you're also sitting in a sea of compute and storage.
But, you know, it's not a sea. It's the global ocean, if you will, if you think about cloud providers. And at any given moment, you know what? There are gonna be some hardware things that are just going down. It's just about, you know, the time to live for those pieces of hardware. And we've spent a lot of time looking at what happens when nodes disappear. We handle it in the background. Data has to be moved around. You know, we spend a lot of time looking at what happens in the cloud, working with customers, and we're trying to automate and build this into sizers and provisioning mechanisms in our Kubernetes operator and in the levels below that. Because what one has to know is how much headroom do I have to have if I'm gonna scale up, because I can add nodes.
But before I get there, there's gonna be a lot of moving data around to rebalance the new cluster, if you will, with the new capabilities. And so we're providing people pictures of that. When we joined the Cloud Native Computing Foundation, CNCF, we invested heavily in Prometheus and Grafana dashboards for this. We are also beginning to dig into OpenTelemetry so that we can provide that same information back out. And this is sort of less for the digital native population and more for the existing enterprises. But we wanna be able to tie into whatever observability and management tooling that customers have. Because the other thing we're cognizant of, and one of the reasons we joined CNCF, is that us building dashboards and providing our telemetry in a form that we consume, it's not about that.
People want everything to be instrumented, everything in that data pipeline, and the applications as well as our database. So we're investing a lot and really making sure we fit into the world of the enterprise and the world of, you know, what I call, you know, new tech companies, whether they be neobanks or adtech or IoT oriented companies or health care oriented companies that are doing things in real time. You mentioned that the clients are, you know, these rich middleware components
[00:44:09] Unknown:
of the overall interaction pattern. And I'm wondering how that manifests in terms of when you're going through an upgrade cycle of upgrading the servers that they're communicating with. How do you manage the compatibility both forwards and backwards between the server nodes as you're going through an upgrade path, as well as the clients? Because there's a lot of moving pieces there, and being able to make sure that at every step of the way, where you upgrade one instance and one client, everything is still able to communicate
[00:44:39] Unknown:
without having a breakage in the overall data flow. Yep. This is one of the central problems. Right? And so on the server side, we support, you know, rolling upgrades within the cluster that just sort of happen automatically, if you will. Right? One of the things we do on the client side is we are very, very cognizant of making sure that we're backwards compatible to older clients as much as we can be. Right? The other thing is that we will allow mixing of different generations of clients and be smart about that to the extent we can. Because the upgrading of the clients is something that, you know, also has to be factored into things.
Now, where we have new capabilities and people want to take advantage of them with new iterations of their applications, they can upgrade those clients into that next generation that has those capabilities and tie into it. But the old clients, they'll still work, but they won't have access to those new capabilities. And that's been a design center for us, if you will. The other thing that we're looking into more and more, because, you know, customers are running bigger and bigger, more complex distributed applications against us, is that we have this roster of all the clients.
And then we've gotta decide, are we gonna get into the business of managing the upgrade of clients through that roster? Or are we gonna just say, hey, you can query that roster, know where those clients are, and deal with that yourself. And that's kind of where we are now.
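The compatibility model described above, where old clients keep working but only upgraded clients see new capabilities, boils down to capability negotiation. A minimal sketch of the idea in Python, with made-up feature names; this is an illustration of the pattern, not Aerospike's actual wire protocol:

```python
# Hypothetical sketch of client/server capability negotiation during a
# rolling upgrade: the usable feature set is the intersection of what
# each side advertises, so mixed client generations keep working.
# Feature names here are invented for illustration.

SERVER_FEATURES = {"basic-kv", "expressions", "secondary-index", "json-cdt"}

def negotiate(client_features: set) -> set:
    """Return the features both client and server can safely use."""
    return client_features & SERVER_FEATURES

# An older client that predates expression support still works,
# just without access to the newer capabilities.
old_client = {"basic-kv"}
new_client = {"basic-kv", "expressions", "json-cdt"}

print(sorted(negotiate(old_client)))
print(sorted(negotiate(new_client)))
```

Under this scheme a mixed fleet degrades gracefully: each client simply operates at the intersection of what it and the server both understand.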
[00:46:31] Unknown:
But this is something that is coming to the fore. Right? You mentioned too as far as abstraction layers over the data model where it's primarily key value, but you have this capability of sticking richer structures in as the value. And for people who are trying to use it as a document store or for people, as you mentioned, who might be coming from a relational point of view, what are some of the abstraction layers and systems that you and your customers are building to be able to manage these relational or hierarchical data architectures?
[00:47:04] Unknown:
The area where we have sort of a rich interaction with customers in the open source space is just this area. Right? So we've worked with some customers who built Spring Data support, you know, libraries. And so we've taken those up and are, you know, investing in them and supporting them ourselves. I think I mentioned that we've added JSON. We've also, you know, built a Java object mapper that really works at, you know, programming time: annotate the code, and we will generate the calls to our system so that the programmer doesn't have to think about it. So, you know, POJO or whatever you wanna call it. Right? So we have good support for that.
Well, we also have released a beta of a Redis compatible set of interfaces, if you will. Because, you know, we have a lot of business where people thought all they needed was a cache. That's what I'll say. Right? And there's this progression. The first reaction when somebody that's an enterprise starts to go digital is they say, gee, the systems we have can't keep up with it. Let's put a cache in front of it. And then they go, that's probably not enough. Let's put, you know, Cassandra in as sort of a cache that is gonna capture some data, and then we'll operate on that data as if it were another database. And then that hits a wall. So we see that over and over again.
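The cache-in-front-of-the-database progression described here is commonly implemented as the cache-aside pattern. A minimal, self-contained sketch, with a plain dict standing in for the cache tier and a hypothetical `fetch_from_db` standing in for the slow backing store:

```python
# Minimal cache-aside sketch: the pattern enterprises often reach for
# first when the backing store can't keep up. The dict stands in for
# the cache tier; fetch_from_db is a hypothetical slow system of record.

backing_store = {"user:1": {"name": "Ada"}}   # stand-in for the slow DB
cache = {}

def fetch_from_db(key):
    return backing_store[key]        # imagine this takes milliseconds

def get(key):
    if key in cache:                 # cache hit: fast path
        return cache[key]
    value = fetch_from_db(key)       # cache miss: go to the database
    cache[key] = value               # populate for next time
    return value

get("user:1")                        # miss, populates the cache
get("user:1")                        # hit, served from the cache
```

The wall he mentions shows up when the cached data starts being treated as another database: invalidation, consistency, and durability concerns that the pattern itself does not solve.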
But to ease that transition, we've built this library that, you know, will ease the need to port your application completely. But, you know, it has a cost to it, as does, you know, any of these models where we layer things on. The other area, you know, you mentioned relational, for folks who are sophisticated about that. There's all this data there now, and we don't need the real time, but you support read, write, mixed workloads really well, and we can scale the cluster out. We'd like it to support SQL. So our path to that has been Presto. You know, I still call it Presto. There's a little war about Trino versus whatever, but, you know, it's Presto.
We've written a connector there, and that has this, what I consider, really nice model of being able to distribute out and parallelize if you have multiple queries going on. You know, they can spawn workers. We've written workers that handle a lot of push down into our system, and so it performs very well. But it's not real time by our standards, though some customers say, yeah, that's more real time than the relational database. Right? And so there are just these grades of things, and I think we're gonna see that.
We look at this as another layer of middleware, if you will, but it is this distributed world. And that's kind of our approach to all these things. Right? We do some things in the server. I should mention that we've invested a lot in providing a sort of rich expressions capability. So that if you want to filter what's coming back to you or what's being pushed out by our change notification system, you can write an expression that's executed on the server and gives you that, you know, data locality thing. And so we're expanding this expressive capability that a programmer can leverage in many different ways on the server or in new client, you know, libraries and models.
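The server-side expressions described above are a form of predicate push-down: the filter runs where the data lives, and only matching records cross the network. A generic sketch of the idea using plain Python predicates and made-up record data, not Aerospike's actual expression API:

```python
# Sketch of server-side filter expressions (predicate push-down):
# the predicate is evaluated next to the data, so only matching
# records are shipped back to the client. Records and field names
# here are invented for illustration.

records = [
    {"key": "txn:1", "amount": 120, "region": "EU"},
    {"key": "txn:2", "amount": 30,  "region": "US"},
    {"key": "txn:3", "amount": 500, "region": "EU"},
]

def server_side_scan(predicate):
    """Run the filter on the 'server' and return only matches."""
    return [r for r in records if predicate(r)]

# The client ships the expression, not the data:
big_eu = server_side_scan(lambda r: r["region"] == "EU" and r["amount"] > 100)
print([r["key"] for r in big_eu])
```

The payoff is exactly the data locality he mentions: the cost of the filter stays on the server, and network traffic is proportional to the result set rather than the dataset.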
[00:50:52] Unknown:
In your experience of working with the Aerospike team and with your customers and experimenting with the system and seeing the ways that it's being deployed, what are some of the most interesting or innovative or unexpected ways that you've seen the Aerospike system used? Yeah. Sometimes we're surprised is what I'll say. We hadn't anticipated
[00:51:09] Unknown:
that use of it. The biggest thing, I think, that we've seen is that people have started to use it in ways more like a traditional, you know, operational system of record. Right? And that leads to new demands on us, demands in terms of security, in terms of management, etcetera. Right? And that use of it as a, you know, replacement, if you will, for traditional systems is, I think, the thing that's driving the next wave of our revenue as well. Right? I'll tell you, it goes like this. So we put strong consistency in there, and we thought of it as, you know, this is, like, gotta be used for some certain use cases. But now they're standing up, you know, large clusters to handle transactional stuff in real time.
And that ability, you know, to do that leads to new capabilities for payment systems. Okay? And we had a customer that was building a payment system for the European banking system. And we have something called RackAware so that we can split things across zones. Right? They came to us and said, well, what would you have to do to be able to split that across data centers and be able to handle real time hot standby, you know, that we could have, or to provide immediate low latency read capability at different sites, for the penalty of the speed of light, in doing strongly consistent transactions across things. And we hadn't contemplated this type of thing, because we were thinking sub millisecond means real time.
They were like, you know, do you have any idea? It's only 150 milliseconds. That's incredible, to do a strongly consistent transaction that shows up, you know, in 3 sites geographically distributed. Okay? And they're like, this is amazing. And we're like, it's kinda slow. But then we realized it's not kinda slow for that use case. Okay? And it gives you a measure of resilience. That's incredible. Right? And now we have people talking to us about doing things like, and this is where, you know, I always have to say, these days, customers lead you, if you will. You know, they say, hey. We think you can do this with your system. And then we go, yeah. We'll have to do this, you know, to make that possible or manageable.
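The 150 millisecond figure for a strongly consistent commit across three geographically distributed sites is roughly what physics allows. A back-of-the-envelope check, with assumed (not measured) numbers for inter-site distance and the number of commit round trips:

```python
# Back-of-the-envelope check of the ~150 ms cross-site commit figure.
# Assumed numbers, illustrative only: ~6,000 km between sites, light
# in fiber at ~200,000 km/s, and two round trips to reach commit.

distance_km = 6_000           # assumed inter-site fiber distance
fiber_speed_km_s = 200_000    # light travels at roughly 2/3 c in glass
round_trips = 2               # e.g. a prepare phase plus a commit phase

one_way_ms = distance_km / fiber_speed_km_s * 1_000
total_ms = one_way_ms * 2 * round_trips

print(f"{total_ms:.0f} ms")   # prints 120 ms, before queuing and processing
```

With queuing, processing, and less direct fiber paths on top of that floor, a real-world result in the 150 millisecond range is about as good as the speed of light permits.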
But they're doing this, and they're even talking to us now about, you know, we need redundancy in these transactions, not only within data centers, within one cloud provider, but we're neutral to the cloud provider, so we can give you the ability to split that across cloud providers even. Right? There are penalties for this, because they have ingress and egress costs. But given a workload that demands that, it can be, you know, something that people are willing to pay for. And, you know, we're very efficient in terms of the amount of data being, you know, shoveled around. So we see that, and that's something that really has been driven, you know, by customers going, look, there's some capabilities inherent in your architecture.
We think it can be exploited differently and more aggressively than your market, you know, if you will. And this led to some work we had to do for sure. But it was really driven by these new workloads that are truly global,
[00:54:49] Unknown:
if you will. In your own experience of working with the technology and the business of Aerospike and working with your customers, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process? Yeah. You alluded to it. We discussed it in some depth. Right?
[00:55:05] Unknown:
But it's that the cloud, right, was constructed for a first wave of workloads, if you will. And there are some data from various analysts that show that while every large enterprise is in the cloud, right, what's in the cloud? Only a small percentage of their transactional workloads, their core mission critical transactional workloads, have migrated to the cloud, much less than the cloud providers would like. Now because of all these systems of engagement and new applications at the edge, there's been huge, huge growth. And the reason people think there's headroom from an investment standpoint in all the cloud providers is you ain't seen nothing yet. Right? And I think the work now is really figuring out how to handle real time transactional workloads and large real time datasets.
You know, if you think about HBase, right, and you think about Hadoop, like, those things are dead. What replaces them is something that allows real time access to massive amounts of data for decisioning, you know, driven by AI/ML. And those new workloads, right, are gonna run in the cloud, and they run in the cloud today, but not very well in real time and not in a transactional sense, if you will. And I think that we're working on that. We have a lot of thoughts around that. As we have customers who have made that transition, you know, on premise, and they start having the conversations about moving those workloads, massive workloads, up to the cloud, that's led to more discussions with cloud vendors about that space. And I think, to me, you know, I'm still excited about technology and about databases and what, really, you know, information at massive levels means.
That makes this just a super fun, challenging space to deal with. And I'll sort of sum things up this way. Our chief revenue officer and I had a discussion with a new customer who's really a purveyor of, you know, derivation of information across massive ingestion. And, you know, we were having this conversation. We asked, like, how many, you know, different streams of data do you have coming in in real time into this cluster? And, you know, they said, 150,000. Okay? And our CRO said, did you say 50,000? That's incredible. And they said, no, no, 150,000. And I said, yeah, I heard it that way. And the CRO said, we can do that?
You know, and the customer said, yeah, it's the only reason we bought you. Because, you know, the other things we looked at couldn't handle that ingestion. And I think that that's a great representation of a difference in perspective. It's not only the data within your enterprise. It's all the data in the world that's available that applies to your problem. That's the way people are thinking about
[00:58:16] Unknown:
applications and databases and data technology, if you will. For people who are interested in being able to accelerate the pace of interaction with their data or be able to massively scale out their capacity, what are the cases where Aerospike is the wrong choice?
[00:58:34] Unknown:
That's a good question. And there are clearly places where that's true. You know, what I would say is that for big datasets, if you're really looking for an analytics store, we're not a columnar store. Okay? Now we may add that capability built on some of the techniques that we talked about here, and we talk about that. But we're really focused on operational transactional stores and the ability to derive insight from them in real time, but not really the data warehouse kind of thing. You know what? Snowflake's done that pretty well. Billy Bosworth has a new company, you know, focused on Arrow and some other things that, you know, is attacking that space. That's them. That's not us. Okay? Might become us sometime in the future, but that's not us today.
We're just really focused on solving this set of problems. Now I'd say that's really the dividing line for us. The other thing I'd say is that, you know what, if you think that, you know, 2 terabytes is a large data set, there are probably cheaper, simpler ways to solve your problem than us. We talk about aspirational scale. We talk to a lot of new tech startup companies, and they're like, we wanna buy 500 gigabytes, and it seems kind of expensive. And then we say, what are you gonna be at in 18 months? And they say, 100 terabytes. And we say, we'll be the most cost effective solution for you. Okay?
And in enterprises, right, they're getting smart about this because they've been burned multiple times by having to replatform three, four times as they scale up. And now they're saying, we need to start here. But there are a lot of applications that aren't gonna be that big. And, you know, I would personally say use Mongo, you know, use Couchbase. Those are great solutions. They've spent a lot in sort of making it easier to program to. They're a bit ahead of us there. But, you know, if you need the scale, if you need the low latency, if you need the throughput, you know, we're probably not the only game in town, but, you know, the only game that I really understand in town, I guess.
And as you continue to iterate on the product and the platform for the Aerospike system, what are some of the things you have planned for the near to medium term? Near to medium term, one of the things we're really looking at is, like, I guess, two different vectors. One is security, and I'll get to that. But the other thing is I mentioned we were doing a lot of work in secondary indexes. And so we'll take that to where we'll be able to index into those nested, you know, rich data structures within a value in the key value store. And that really opens up a lot of new applications that can be done really, really quickly, because our secondary indexes are now, you know, pretty much as fast as our primary hash to get, you know, from the key to the value. And now being able to do that across multiple vectors, you know, is gonna open up a huge amount of new capabilities.
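Indexing into nested values within a key-value record can be pictured as an inverted map from a nested field's value back to the primary keys that contain it. A hypothetical sketch of the idea, with invented record shapes, not Aerospike's actual index implementation:

```python
# Sketch of a secondary index into nested values: an inverted map from
# a nested field's value back to the primary keys that contain it.
# Record shapes and field names are invented for illustration.

from collections import defaultdict

records = {
    "order:1": {"customer": {"id": 42, "tier": "gold"},   "total": 99},
    "order:2": {"customer": {"id": 7,  "tier": "silver"}, "total": 15},
    "order:3": {"customer": {"id": 42, "tier": "gold"},   "total": 250},
}

def build_index(path):
    """Index records on a nested field, e.g. ('customer', 'id')."""
    index = defaultdict(list)
    for key, rec in records.items():
        value = rec
        for step in path:          # walk down the nested structure
            value = value[step]
        index[value].append(key)
    return index

by_customer = build_index(("customer", "id"))
print(by_customer[42])             # both of customer 42's orders
```

Once such an index exists, a lookup on the nested field is a single map access rather than a scan, which is the "as fast as the primary hash" property described above.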
On the security front, as we move into more and more enterprise workloads, we've also gotten into the federal space. There's this demand for us solving the problem of, hey, in this extensible data structure that you have, this nested thing, how fine a granularity can we get to without impacting the scale and the throughput and the low latency? And what are the trade offs? And how can we provide more constrained pieces? The other thing, because we scale so much, right, is that we have people using us in a shared service manner or in a multi tenant manner across, you know, hundreds and hundreds of similar workloads, but they want to contain those. So we're doing a lot of work around being able to fence off different applications from each other in a fairly elastic and pliable way still, right, but with quotas and limits of different kinds. And every time we think we have it solved, a customer comes up and says, no. But I'm running this combination of things, and here's what's missing. So we continue to invest heavily in that. Are there any other aspects of the work that you're doing at Aerospike or the overall
[01:03:00] Unknown:
potential and use cases for real time systems that we didn't discuss yet that you'd like to cover before we close out the show?
[01:03:07] Unknown:
I'm really seeing new demands for this notion of real time. Every time you think you know how fast the pace of business is going to be, people are driving it faster. You know, we talked a lot about, you know, the financial services space, you know, program trading, machine trading, and such. Right? Well, that's filtering out into every aspect of the world almost. And the need to be able to have those insights as fast as you can is only growing. Right? And so I think that's gonna mean more competition for us, I'm sure. Right? But it also means more market for us, and I think that it keeps being fascinating to me how people are discovering new data streams, if you will. If you look at all these supply chain problems we have, because I spent a fair amount of my life in that space too, it's fascinating how broken it is right now. And it's because it was optimized within a single supply chain for efficiency, but not taking in all the data you would need to be able to manage the trade offs between resiliency and efficiency. Right?
But with these models we're talking about, we have the bandwidth and the capability in the cloud to solve these problems. And I think that, you know, that's gonna have to be done in this, you know, hyper connected world that we live in that's subject to, you know, disruption in ways we don't expect.
[01:04:49] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. You know, I think the biggest thing is that, you know, we touched on it with the cloud vendors
[01:05:09] Unknown:
that I don't think they understand low latency. I think they've divided it up more based on the first set of workloads that they had, and they have a lot of workloads coming at them really fast that they're gonna have to fill in some new instance types. And I think that, you know, as I said, we're starting to work with those vendors more closely. We're not alone in that, I'll say. Right? And I think that that's gonna be 1 of the key things. And it's also in this notion of data movement. You know, I think that you're gonna see a lot more focus on really getting the pipes between data centers to expand, to be more parallelized because the amount of data changing hands is something that's just amazing to me. You know, people talk about the monetization of data. I think more about, you know, what's the available set of data that you can apply to any problem?
And it grows every day. Well, probably every minute,
[01:06:15] Unknown:
you know. Well, thank you very much for taking the time today to join me and share the work that you and your team are doing at Aerospike. It's definitely a very interesting set of technologies and an interesting platform and something that I've been keeping an eye on for a number of years. So I appreciate having the opportunity to speak with you and learn more about what you're working on. Definitely excited to see where things like Aerospike and the overall movement towards more real time access to scalable datasets is going. So thank you for all the time and effort that you and your team are putting into that, and I hope you enjoy the rest of your day. Thank you. It's great to be here. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Welcome
Interview with Lenley Hensarling
Overview of Aerospike Project
Use Cases and Market Evolution
Technical Architecture and Data Model
Hardware and Performance Optimization
Operational Characteristics and Management
Abstraction Layers and Client Compatibility
Challenges and Lessons Learned
Future Plans and Security Enhancements
Closing Remarks and Contact Information