Summary
Aerospike is a database engine designed to provide millisecond response times for queries across terabytes or petabytes of data. In this episode Chief Strategy Officer Lenley Hensarling explains how the ability to process these large volumes of information in real time allows businesses to unlock entirely new capabilities. He also discusses the technical implementation that allows for such extreme performance and how the data model contributes to the scalability of the system. If you need to deal with massive data, at high velocities, in milliseconds, then Aerospike is definitely worth learning about.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription.
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold’s proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- Your host is Tobias Macey and today I’m interviewing Lenley Hensarling about Aerospike and building real-time data platforms
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Aerospike is and the story behind it?
- What are the use cases that it is uniquely well suited for?
- What are the use cases that you and the Aerospike team are focusing on and how does that influence your focus on priorities of feature development and user experience?
- What are the driving factors for building a real-time data platform?
- How is Aerospike being incorporated in application and data architectures?
- Can you describe how the Aerospike engine is architected?
- How have the design and architecture changed or evolved since it was first created?
- How have market forces influenced the product priorities and focus?
- What are the challenges that end users face when determining how to model their data given a key/value storage interface?
- What are the abstraction layers that you and/or your users build to manage relational or hierarchical data architectures?
- What are the operational characteristics of the Aerospike system? (e.g. deployment, scaling, CP vs AP, upgrades, clustering, etc.)
- What are the most interesting, innovative, or unexpected ways that you have seen Aerospike used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Aerospike?
- When is Aerospike the wrong choice?
- What do you have planned for the future of Aerospike?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Aerospike
- EnterpriseDB
- "Nobody Expects The Spanish Inquisition"
- ARM CPU Architectures
- AWS Graviton Processors
- The Datacenter Is The Computer (Affiliate link)
- Jepsen Tests
- Cloud Native Computing Foundation
- Prometheus
- Grafana
- OpenTelemetry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Lenley Hensarling about Aerospike and building real time data platforms. So, Lenley, can you start by introducing yourself? Yeah. I am Lenley Hensarling.
[00:02:06] Unknown:
I'm the chief strategy officer here at Aerospike, and I also manage the product management group. I was brought in to sort of bring a more business perspective to a very technical company. I've worked with the CEO, John Dillon, for a number of years sort of through other things, working with investors. I work now with Srinivasan, who is 1 of the founders, the remaining founder who's here and is 1 of your typical database PhDs from the University of Wisconsin. And we're really, you know, lucky to have him and that deep technical background. Do you remember how you first got involved in the area of data management? Well, you know, it goes back to university probably. I was at the University of Texas at Austin as they spun up their computer science program. There was a guy, Avi Silberschatz, there. If you Google him, he's 1 of the guys, like Ullman, who really was deep into formal theory about databases. And so I got interested in it there. I got interested in some of the thornier problems of, you know, federated database and, you know, multi database and distributed database and such. And worked in file systems over the years and in operating systems. And it became 1 of those itches that you keep coming back to scratch. And so I recently, before Aerospike, was at EnterpriseDB, a Postgres company.
And, you know, Postgres is a wonderfully put together database. The difference is that here, we've got what I would call leading edge technology in the, you know, ultra distributed, you know, real time, do stuff at hyperscale in a single cluster. And so, you know, that attracted me.
[00:03:53] Unknown:
And so that brings us to the Aerospike project. As you mentioned, you came in to help sort of bring a business perspective to the project, and it started off as just a, you know, very heavily technological company and very focused on the specifics of the database engine and how to make it fast and how to make it scale. And I'm just wondering if you can give a bit of an overview about what the Aerospike project is and some of the story behind where it started and how it got to where it is today.
[00:04:20] Unknown:
Big data was coming up. Well, we've been in existence about 10 years, 11 now, I guess. And big data really evolved during that time. And, you know, we really focused on 2 things, and our 2 founders, Brian Bulkowski and Srinivasan, were really focused on how do we get sub millisecond latency, but not just that. To do that at scale, and by scale, they really focused on large datasets, meaning, you know, tens of terabytes, hundreds of terabytes. We've recently released a benchmark with Amazon and Intel for a petabyte benchmark. Right? And to do that at high throughput as well. So how can you do that for, you know, hundreds of thousands of connections to a cluster, you know, concurrently, things like that.
So they found traction in the ad tech market, which was evolving over this same period. You know, The Trade Desk is a customer of ours, and there's a great quote by their CTO. When we asked him a question at 1 of our customer advisory boards, he said, you know, the ability to process more and more data within bounded time windows is gonna result in business models that we haven't yet contemplated. You know? And I wrote that down in a notebook, and I've remembered it for the last 2 years because it's a great quote, but it's really meaningful in terms of what drives our focus.
And we're seeing that happen now in financial services, in IoT, and other places where the more data you can apply against decisioning in a given bounded time window, the better result you can get or the better decision. Right? So that's something that we focused on and are expanding our understanding of where that's applicable, I guess. And there's real technology behind it. You know, a lot of times I look at companies that have evolved in a sort of slow evolution and adoption of technology for other purposes and things. But Brian and Srini really set out to build a different solution and to take advantage of SSDs in a different way. You know, people say, hey, great, solid state storage, you know, they're like drives.
They said they're not like drives, they're like memory. Okay? And they wrote drivers that bypass the operating system and went directly to the SSDs so that we manage the SSD not as block storage but as if it were DRAM. And we get near DRAM speeds by handling the data in the SSD. And we have something called hybrid memory, which is: put the indexes in RAM and put the data in the SSDs, but make that like it's an expanded data space, and really address things like that. And that's, you know, at the single node level, but it goes on from there because they also focused on really doing the distribution and how the clients interact with the clustered environment very differently.
Instead of a quorum, it's a roster based system. And the clients are what I would call a first class citizen in that distributed model. So they know how to go in a single hop, based on the digest, to the data that you're looking for. And that's the fundamental, you know, underpinning of being able to do many things faster: to get to that data very directly. And now, you know, we're taking that same type of approach into secondary indexes. And to scale those, we've expanded into having, you know, all flash solutions, so that if you have indexes that are super large and you have a dataset that's super large, it all goes into flash. The result is that it's possible to support bigger workloads in a cost effective manner. Okay? Which is 1 of the reasons that we'll win in many competitions: because the projection of more data onto a single node means that you don't need as many nodes to cover a given data space, if you will.
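The single-hop lookup described here can be sketched in miniature. This is a toy model, not Aerospike's actual client code: Aerospike is documented to derive a RIPEMD-160 digest from the key and map it to one of 4,096 partitions, but the hash used below (SHA-1, for portability), the node names, and the map layout are all illustrative assumptions.

```python
import hashlib

N_PARTITIONS = 4096  # Aerospike's documented fixed partition count

def digest(set_name: str, user_key: str) -> bytes:
    # The real system hashes set name + key with RIPEMD-160; SHA-1 is a
    # stand-in here since RIPEMD-160 availability in hashlib varies.
    return hashlib.sha1(f"{set_name}:{user_key}".encode()).digest()

def partition_id(d: bytes) -> int:
    # A partition id is derived from low-order bits of the digest.
    return int.from_bytes(d[:2], "little") % N_PARTITIONS

# The client keeps a full partition map (partition id -> owning node),
# refreshed from the cluster roster. With the map in hand, a read is a
# single network hop: no coordinator, no quorum round-trip.
partition_map = {pid: f"node-{pid % 3}" for pid in range(N_PARTITIONS)}

def route(set_name: str, user_key: str) -> str:
    return partition_map[partition_id(digest(set_name, user_key))]

print(route("profiles", "user:1234"))  # goes straight to the owning node
```

The key point is that routing is a pure client-side computation over the digest plus a cached map, which is what makes the single hop possible.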
And that makes a huge difference in terms of how many nodes you have to manage and the cost of managing all of that, but also gives you an ability to scale up as well as scale out. So we really have focused. And, you know, 1 joke I tell is that we're the last group of true system software programmers, focused on really wringing everything they can out of the CPU, the way the buses are constructed, the way the network cards are constructed, and SSDs. And we've taken that same mindset to supporting Optane, you know, or persistent memory as well. So that focus on really exploiting the technology underneath us, and doing that in a way that is cost effective and efficient, is sort of the combination of things that really allows us to do that. In terms of the
[00:09:37] Unknown:
use cases that you and your customers are primarily focusing on, you mentioned that earlier on in its lifespan, it was very popular in ad tech because of their need to very quickly, you know, determine what they wanted to bid for a given advertisement spot, because of the need to be in the hot path for a search request. And now that the cloud has taken a much larger share of the sort of compute capacity and the workloads that people are running, how have the Aerospike product and the customers that you're working with shifted the kind of primary use cases that it's being applied to? You know, Tobias, that's a great question. I like to say we sort of, you know,
[00:10:20] Unknown:
made our chops, if you will, in the ad tech space. But the ad tech space is characterized, and it's evolved to where, you know, I guess, I don't know, 5 years ago, they'd say, we've got a lot of data sources we're applying to this. It's like, you know, in the tens. Now, you know, we have customers that say we're adding hundreds of data sources a month, you know, and putting them into 1 picture of what a user profile is. Okay? Then what's happened is everybody thinks of things as profiles. That's, in some sense, the foundation of IoT, the foundation of, you know, digital marketing, the foundation of really AI to some extent because, essentially, you're creating a profile in real time based on a data stream.
Okay. We capture that data stream, hold it, but then we also take that data that's captured from many, many different sources and put it into large datasets. Okay? And then our customers will... I think the best way I've heard this said was from 1 of our customers. When I first joined Aerospike, I asked him, like, why did you buy the product? Right? I went around to a bunch of customers and said, why did you buy that product? Their first reaction is always, well, it's your product. Why would you ask that question? And I said, because, like, I'm not buying it. You are. And I need to know why you bought it. But he said, oh, it's simple, actually. You know, in the past, we were able to use, you know, hours of information captured from the stream for a given, you know, profile we wanna construct. We wanna match that against, you know, days of information.
He said, now, we take weeks of information and match it against months of information that informs that model through, you know, machine learning and etcetera. Right? And we can match all of that within the same SLA in terms of a time bound window, which is 20 milliseconds or 40 milliseconds. So that model right there applies to any number of things. Right? Every time somebody's trying to figure out what to put in front of someone at the bottom of the screen in ecommerce, you know, the more data that they can apply to that, both data that characterizes the user coming in, but also data that characterizes different cohort groups that they match them up with. And if they can do all of that database matchup and access all that data, but in a very bounded time window, that's become where we excel.
The other thing that's happened is primarily in financial services. Right? So that model I just discussed is used a lot in fraud, if you will, and in identity management. And that's key to financial services that's online. But we've also been taken up by a number of large brokerages and banks to do more real time transactional things. So I'll give you a great example. We've got a customer, can't be named, but, you know, a brokerage. And they used to compute the picture from all the data of their margin business once a day.
Okay? And that was done by the risk management compliance people, and they traded against that risk profile, if you will, both for individuals and for the institution at large. Now, with us, they initially reduced that to a real time picture that they would recompute every 30 minutes. Now they can recompute that in single digit minutes and are trying to get it under a minute. And what that does is compress the risk window of what the unknown is, you know, what they're operating against. And the world changes a lot. You know, if the last 2 years hasn't taught us that... you know, my joke is nobody ever expects the Spanish inquisition, you know.
And that's really what they're fighting against all the time to get a more accurate real time picture of what's going on. But the business benefits of things like that is that I can say yes more often and quicker as somebody's positions change. You know, if all of a sudden I've sold off a lot and I have a lot more cash, and I wanna do a margin trade on something else, that's a very different risk profile than it was when I was holding a bunch of stuff where the price might be fluctuating. And so being able to do things like that in real time is a big change for many, many different types of companies.
You know, you can project this into logistics, IoT, anything, really. And this drive towards becoming real time and applying data to decisioning and having a more real time picture, a more up to date picture, is something that people are doing in any number of businesses across any number of industries, I'd say. You know, we're talking to automotive suppliers, and we're talking to automotive companies, etcetera, right, that are looking at data this way and trying to understand, how can I operate on it in, by some definition, real time? And they're not talking about small amounts of data. Right? Because you wanna be able to, say, project this across a big enough cohort group, right, to get some accuracy, to be able to say, what's my next best action? And that can be true for cohort groups that are people, but also things. You know, we talk about the Internet of Things, and so you've got all this telemetry coming in. Telemetry about the weather, about the grid, about fluctuations in power that are happening, all kinds of things. And everything is instrumented these days, including you and I. Right? I always joke about that. You know, I hold up my phone and say, you know, we're all instrumented.
And, you know, if you have an Apple Watch, you're even more instrumented. But it's amazing the amount of data that's captured by the devices that our devices are near. Okay? So it's that type of, you know, ability to have insights based on massive amounts of data, and combinations of data, that people are continuously innovating on. They say, you know what? If I had this data and this data, and I can correlate that that's somehow, you know, connected, then I can know these things. And so that's going on continuously and changing how we approach business and marketing and everything.
[00:16:52] Unknown:
In terms of the sort of technical architectures that Aerospike is being incorporated into, I'm wondering if you can give some of the typical ways that it is put into the overall data life cycle and sort of where it fits in the overall infrastructure of a given company, as far as how they're storing the data, how they're accessing the data, and what the sort of usage patterns look like for an Aerospike deployment?
[00:17:16] Unknown:
So that's a great question, actually. And, you know, we sort of have 2 major threads of work now. We continue to evolve the database. You know, I mentioned that we're adding secondary indexes, or we're revamping our secondary indexes, to make them even faster and more scalable and allow people to do more things with the database. But we've also made a lot of investment in what I would call the data fabric. Right? And making our database be a first class citizen in the data mesh, the data fabric. And that means really having optimized Spark connectors, Kafka connectors, Pulsar connectors.
We're about to come out with a new thing, which we call ConnectX, and it's sort of based on a change notification mechanism: be able to push out in a very neutral format and support multiple formats, you know, Avro, Arrow, you know, whatever you want, and be able to push that into a data pipeline. Okay? Because there's not 1 fixed point of data anymore. Another great quote from 1 of our customers was that, you know, we used to all look at data and getting value from data by saying, let's have a massive data warehouse or let's have a data lake, you know, and then look back at that static data and try and derive insights from it. Now it's completely swapped. We've got data that's essentially in motion all the time, that creates pools, if you will, where you aggregate some of it, but none of it is constant. It's always changing. It's being updated in real time. And so that means we have to lean into, you know, great support for, you know, Kafka, Flink, you know, Pulsar, you know, etcetera. Right? And so we put a lot of effort into that. People compose architectures then that have many, many different clusters using our database at the edge for real time ingestion of different incoming datasets.
They aggregate that back in, you know, warm stores that are still in real time. It may not be sub millisecond that the access and the computations happen on that, but it's massive amounts up to many petabytes that they can then pull from that dataset in low latency, which might be described by, you know, single digit millisecond for access to 1 piece of data. But within 20, 30 milliseconds because of parallelization, they can pull significant amounts of data in those time windows, make another decision, and that may get passed back on to another point that, you know, might be in Google Bigtable. It might go into, you know, some other data store. It might be, you know, accessed through or go into Snowflake or something like that. But then after that machine learning and the development of the features, as they say, those features are then pushed back to the edge for real time decisioning, so we wind up being a feature store as well.
I'll give you 1 example. A customer of ours uses us pretty prominently for very massive graphs that are used in different ways. And those graphs are developed through ML, sometimes in our database, sometimes in other databases. But once they understand what the graph is, so that they can operate on it in real time, they have to provision that graph. So we're talking about something that in our database would be represented by billions of records with, you know, literally thousands of vertices per record that they can then compute on in these, you know, low millisecond time windows.
And so it's this, you know, sort of feedback loop of many different levels of processing and decisioning and creating models, but then reprovisioning that back to the edge. And we figure in at the points where you have to be able to ingest data really fast, access and make decisions on it fast, but also supply that data back upstream, and then take it coming from upstream back downstream to the edge. You know, that's probably a little convoluted, but I think people that deal in data pipelines will understand; you know, that model will be fairly familiar to them.
[00:21:42] Unknown:
So digging deeper into the specifics of the Aerospike engine, I'm wondering if you can talk to some of the ways that it's architected and the data model that it is designed around and some of the ways that that data model maps into the performance capabilities.
[00:21:58] Unknown:
You know, the data model, I have to say, it's sort of funny because it took us a while to realize that we were a document database as well as other things. Right? We're a key value model, but the value can be a compound document, if you will. Okay? And so what we support is, you know, bins. Think, you know, columns, if you will. But within a bin... well, actually, there's a namespace per database. Right? A cluster can have multiple databases, so we call them namespaces. And then within that namespace, there are bins. Within those bins, we support a map list structure that can be hierarchically nested. Right?
So quite often, our ability to access a significant piece of information, right, that, you know, has a lot to it, in 1 quick read is part of the game that we play. Right? And so those maps and lists have APIs that can go directly to them. The indexing that we have can also help you navigate this as well. So that data structure also is, in some sense... well, not in some sense. It is a superset of something like JSON. So, recently, 1 of the things we've done is put a JSON API on top of it so that you could, in Java, manage JSON like you would with some other databases that support JSON. Now there's a little performance hit to that, but the wonderful thing about us is that if you need, you know, sub millisecond capability, you go directly to our APIs, navigate the bin map list architecture directly, and you can get that, you know, sub millisecond access to the piece of data you want. Okay?
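To make the namespace/bin/map-list shape concrete, here is a toy sketch in plain Python with invented field names. It mimics the idea that a map/list operation can address a nested sub-element by path, rather than fetching and parsing the whole record; the real client APIs differ, this only illustrates the data shape.

```python
# A toy record: the top level is a set of named bins, and a bin's value
# can be a nested map/list document (a superset of what JSON expresses).
record = {
    "name": "alice",                        # simple scalar bin
    "profile": {                            # bin holding a nested map
        "segments": ["sports", "travel"],   # list nested inside the map
        "scores": {"sports": 0.92, "travel": 0.41},
    },
}

def get_path(rec, path):
    """Walk a list of map keys / list indexes down to a nested value,
    the way a map/list operation addresses a sub-element directly."""
    node = rec
    for step in path:
        node = node[step]
    return node

print(get_path(record, ["profile", "segments", 0]))       # sports
print(get_path(record, ["profile", "scores", "travel"]))  # 0.41
```

The win described in the interview is that the server can resolve such a path itself, so the client reads just the sub-element it needs in one round trip.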
And that is kind of the basic model. Now the other thing we deal with is something that every NoSQL database really does. Right? You get people coming from the relational world, and they have a mindset of, look, I'm gonna do joins, and I'm gonna construct a piece of data. I refer to this, and everybody else does, I think, as the denormalization, you know, path. And so it's a different mindset, and people have to learn it. We tend to spend a lot of time with our customers and clients, particularly the ones, you know, that are maybe in financial services. So you've got a group of programmers. They've written to an Oracle database or a DB2 database for a long time, and now they want to get fast and real time.
And then we look at their data models and say, no, you don't have to do that. You construct it into these bins with the map list architecture, and you can do everything you want. And we can update this data or rewrite the records so fast, you don't have to worry about that construction. The thing is that this schema is extensible, if you will. So, you know, from a data model standpoint, that's a lot of it. You know, I mentioned that by compressing many things into 1 record, if you will... I referenced the whole idea of a graph being represented with us so that a record might have thousands of vertices in it that can be navigated. Because what we do is, as I'm saying, swizzle all that into memory. Right? And then you're accessing it very quickly and can navigate that graph, you know, super fast.
And so that's based upon, you know, some of what I mentioned before about the way we handle SSDs as if they were memory. But the way we do that too is that we have a very, very parallelized access model on a per node basis. Okay? And that's based upon really understanding the chipsets we're dealing with and what parallelism can really be effected that way. And then that gives us this ability to project down, on a per node basis, to a greater piece of the data. Now we do have also a partitioning model that spreads all this data across nodes. And that partitioning model, you know, is something that the user doesn't have to necessarily understand. Right?
Because we are constantly distributing those partitions based on access and load balancing behind the scenes so that there are no hot spots, if you will. Okay? And this leads to that ability. We also support a number of applications in some very, very large social media companies. Right? And they like us because we can give this predictable performance across huge numbers of, I guess, users coming in in parallel, and be able to get access to all the data that's available in ways that, as what people are looking for changes based upon what's happening in the world, we rebalance everything without them having to do anything.
Okay? And so that's another thing. Those load balancing algorithms lead to, you know, not only low latency. You know, a lot of people say, we can support sub millisecond or small numbers of milliseconds latency. But when you look at a graph of their performance, there's a lot of jitter, you know. On average, it's 5 milliseconds, but there are spikes all over the place. Right? And those spikes matter. So what we've done in the data layout, in the massive parallelization, in the distribution of the data, and in the load balancing has gotten the variation down to a really, really, really thin variance, if you will.
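The point about jitter can be made concrete with a toy comparison (the numbers are invented for illustration): two latency distributions can have nearly identical means while their tails differ by orders of magnitude, which is why a tight variance matters as much as a low average.

```python
import statistics

# 100 latency samples each, in milliseconds (illustrative values only).
steady = [1.0] * 99 + [1.5]    # low jitter: worst case close to average
spiky = [0.5] * 99 + [50.0]    # similar mean, but an occasional huge spike

def p99(samples):
    # The value that 99% of the samples fall at or below.
    s = sorted(samples)
    return s[min(len(s) - 1, int(len(s) * 0.99))]

print(statistics.mean(steady), p99(steady))  # mean ~1.005, p99 = 1.5
print(statistics.mean(spiky), p99(spiky))    # mean ~0.995, p99 = 50.0
```

An SLA stated as "respond within 20 ms" is a statement about the tail, not the mean, so the spiky distribution fails it even though its average looks better.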
[00:28:01] Unknown:
And that matters a lot for the types of workloads that we're dealing with. You mentioned that you are working very close to the level of the instruction set for the CPUs and the disk architectures that you're working with. And I'm wondering how the sort of rapid iteration at the hardware level, and the increasing adoption of things like ARM architectures, has influenced the ways that you're building the database, and any challenges that that has posed as far as being able to run across potentially heterogeneous physical architectures?
[00:28:36] Unknown:
That's a really great question. And it is, you know, I'll sort of say, hey, it's the burden we signed up for, if you will. What that really means is that our engineers and our architects are constantly looking at these things. You know, you brought up ARM. Clearly, you know, Graviton is gonna be a big deal for AWS. Right? And what I will say is we're gonna be on Intel a little bit longer than some other people because the way we're going to Graviton is not like, hey, there's a compiler that works on it, and we're done. What we do is look at both how the memory handling is done in the chip, how the threading models work, you know, in the chip, understand what the compilers are doing with that, and optimize our code for it. So we've already done ports to ARM.
We understand the differences. We're still working out, you know, some of the things of, like, gee, how do we really wanna use that instruction set to do very specific things? This is a really telling thing about the difference, you know, of Aerospike versus some of the other databases. They're happy to say, you know, we ported it. Hey, it ran. We're good. And it's 40% faster. Well, we don't wanna be just 40% faster based upon the, you know, compiler and the chip. We want to really be fully optimized for that architecture. And, you know, there are things happening in storage as well. You know, that's 1 of the things that we're working on and sort of projecting into the future. You know, right now, we're at, like, you know, 100 gig networks. Right? The level of parallelization that you can get with the network cards... you know, when we talk about the cloud, I always like to think about the bumper sticker that was around, I guess, about 5 years ago. You know, there is no cloud. It's just someone else's computer.
And if you're really into optimizing, you have to remember that you're running on an actual device that's subject to the laws of physics. Okay? And we actually think of it that way. And so when we look at, you know, Graviton and the Intel architecture, Intel slash AMD, right, if you want, we actually think deeply about those things. And when we talk about it, can we do a model that works across the two of them? Can we have, you know, dynamic compilation choices and do things so that we can install something that will be optimized for both? Or will we have to have, you know, slightly different versions?
We will always choose to optimize for that speed and for that scale because we think that's where things are going. That the ability to access more data to give you a higher fidelity model to make a decision in real time is what's gonna, you know, propel the future.
[00:31:37] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values.
DataFold integrates with all major data warehouses as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold. Talking more about the cloud, another interesting element of that is that you are dealing with virtualization layers. And I'm wondering how some of the recent innovations, thinking of things like the Nitro runtime for the AWS environments and just some of the evolution of how virtualization interacts with the physical hardware and the kernels running on those instances, either gives you access or constrains your capabilities of being able to eke out the appropriate level of performance on the underlying systems that people are running in these cloud environments?
[00:33:17] Unknown:
At some point, you know, when I just think about it with the sort of classical computer science lens on, right, I tend to think it's like, look, and I forget who the author was, and I have to apologize for it. But, you know, there's a great article called, you know, the data center is the computer. And I think we all remember it. You can look it up that way. But, you know, that is where things are going, that sort of serverless virtual world. But what constrains that now for high performance is really what the predictable network latency is that's provided within a given zone in a data center that the cloud providers have. And I mentioned before, you know, they're at 100 gig. When they get to 400 gigabits, you know, it's like they're gonna be there in, I don't know, a couple of years.
Then the network becomes more like a bus, if you will, on a chip. And we can really start to think of the data center as the computer. We are already, you know, very, very distributed, you know, in terms of how we look at the world and how we think about breaking down computations on data. You know, we were talking about programming models earlier. There's a lot of demand to allow people to program in SQL against a distributed data store that can scale and be elastic. So one way you can do that is to sort of make every node support SQL.
The other thing is to look at things like Presto, Trino, Starburst, you know, whatever you wanna call it. Right? That's also a distributed architecture that can go against the distributed storage model, if you will, like the one we have, and offer that massive parallelization and distributed nature to scale out as you ramp things. So I think that that's something that's happening now. I think that the storage models right now for the cloud vendors have been optimized for two things, and there's a space in the middle is what I'll say. The two things they're optimized for are, you know, microservices that have an ephemeral storage requirement, that while that service is running, which may be for short bursts, they've got that data store underneath them as SSDs.
But if that node fails, it's not a big deal. You know, you notice this in your interactions online sometimes. It's like, what? It lost some of my context, but I'm still going, and then I have to, you know, reenter something. Well, no big whoop. But if you're talking about a database with massive amounts of data, and, you know, I neglected to say that we support strong consistency in our transactions in this distributed world we've constructed. Right? We passed the Jepsen test. But if you think about a strongly consistent transaction that you want to make sure you capture, well, you know, if the storage underneath you goes away, then that's a really complex thing.
So they have persistent data storage in the back end that's, you know, network attached storage, but the latency there is higher than someone might want. And when we look at how we use hardware in the cloud, some of the vendors have various instance types that have durable SSDs and sizable SSDs, you know, storage optimized instances, etcetera. And some of them have models where we support ephemeral, and then we have this network attached storage and high compute instances. But there's a space in there where you need large attached storage to be able to do real time things without having to have literally thousands and thousands of nodes.
You know, I talked about our efficiency and the cost implications on that. And the numbers are things like this. We will replace, you know, say, open source Cassandra, and it might have 4,000 nodes and be very complex and hard to manage and keep the whole thing up. Right? And we'll replace that with, you know, 150, 200 nodes. And it's because of this ability to have an expanded data space, leveraging, you know, large SSDs that give us terabytes and terabytes per node. So I think, you know, you bring up, you know, one of the things where I said earlier, there's a bit of an impedance mismatch. But I think that the cloud vendors said, you know what?
The first place we went, and this was much like when Linux entered the playing field, it's like, hey. We're gonna run all the web servers. We're gonna run all the microservice applications. And all that back office stuff is still on mainframes or Superdomes or, you know, whatever. And I think now they're starting to say, we want our fair share of the large mission critical transactional workloads, but the modern ones, the ones that are looking for real time performance. And so we're starting to have those conversations. What's that gonna require?
And I think it's two things. Right? One, in the near term, have the instance types that will support real time database workloads. And the second thing is that the data center is evolving based upon progress in networking technology to allow that virtualization that you're referring to, but still do it in a way that will meet the needs of real time workloads.
[00:39:15] Unknown:
Continuing on the subject of the database architecture and running it, I'm wondering if you can speak to the operational characteristics of the system. You mentioned that it is a scale up and scale out architecture. It does intelligent data distribution so that you don't have to do a lot of partition rebalancing, and you don't have to do sort of preemptive partitioning, that the engine will handle that itself. But I'm wondering if you could just speak to some more of the aspects of getting it up and running, you know, managing the clustering, upgrades, you know, where it lies in the CP versus AP continuum, and just some more of the sort of aspects of actually running this as an operator and as an end user of the system? You know, it is a distributed system, and so that presents
[00:39:57] Unknown:
a different model that people aren't as used to is what I'd say. So we have made a number of investments recently in, quote, unquote, making it more manageable. One of those things is Kubernetes. Right? It provides a model for that, with the Kubernetes operator and just the notion of control planes. One of the things we've also done is been working a lot more on observability. Right? So, you know, observability and management. And the observability side is that we have been adding more and more points, and we're adding more intelligence around how that telemetry needs to be interpreted, because it's not a simple, like, hey, here's the computer.
You know, it has, you know, 32 cores on it, and that's what it is. It is a sea of compute and storage, if you will. And so how that's presented to customers. Now a lot of that, we automate because of the design. But there's also this constant conversation going on between all of these nodes and, you know, thousands and thousands and thousands of clients. And those clients are not simple clients. They're really middleware. Right? They're big applications accessing it. So how do you really understand all of that data that's in motion? Okay. And how do you manage it? Because the other thing that we have to realize, particularly about the cloud, is that, you know, you're also sitting in a sea of compute and storage.
But, you know, it's not a sea. It's the global ocean, if you will, if you think about cloud providers. And at any given moment, you know what? There are gonna be some hardware things that are just going down. It's just about, you know, the time to live for those pieces of hardware. And we've spent a lot of time looking at what happens when nodes disappear. We handle it in the background. Data has to be moved around. You know, we spend a lot of time looking at what happens in the cloud, working with customers, and we're trying to automate and build this into sizers and provisioning mechanisms in our Kubernetes operator and in the levels below that. Because what one has to know is how much headroom do I have to have if I'm gonna scale up, because I can add nodes.
But before I get there, there's gonna be a lot of moving data around to rebalance the new cluster, if you will, with the new capabilities. And so we're providing people pictures of that. When we joined the Cloud Native Computing Foundation, CNCF, we invested heavily in Prometheus and Grafana dashboards for this. We are also beginning to dig into OpenTelemetry so that we can provide that same information back out. And this is sort of less for the digital native population and more for the existing enterprises. But we wanna be able to tie into whatever observability and management tooling that customers have. Because the other thing we're cognizant of, and one of the reasons we joined CNCF, is that us building dashboards and providing our telemetry in a form that we consume, it's not about that.
People want everything to be instrumented, everything in that data pipeline, and the applications as well as our database. So we're investing a lot and really making sure we fit into the world of the enterprise and the world of, you know, what I call, you know, new tech companies, whether they be neobanks or adtech or IoT oriented companies or health care oriented companies that are doing things in real time. You mentioned that the clients are, you know, these rich middleware components
[00:44:09] Unknown:
of the overall interaction pattern. And I'm wondering how that manifests in terms of when you're going through an upgrade cycle of upgrading the servers that they're communicating with. How do you manage the compatibility both forwards and backwards between the server nodes as you're going through an upgrade path, as well as the clients? Because there's a lot of moving pieces there, and being able to make sure that at every step of the way, where you upgrade one instance and one client, everything is still able to communicate
[00:44:39] Unknown:
without having a breakage in the overall data flow. Yep. This is one of the central problems. Right? And so on the server side, we support, you know, rolling upgrades within the cluster that just sort of happen automatically, if you will. Right? One of the things we do on the client side is we are very, very cognizant of making sure that we're backwards compatible to older clients as much as we can be. Right? The other thing is that we will allow mixing of different generations of clients and be smart about that to the extent we can. Because the upgrading of the clients is something that, you know, also has to be factored into things.
Now, where we have new capabilities and people want to take advantage of them with new iterations of their applications, they can upgrade those clients into that next generation that has those capabilities and tie into it. But the old clients, they'll still work, but they won't have access to those new capabilities. And that's been a design center for us, if you will. The other thing that we're looking into more and more, because, you know, customers are running bigger and bigger, more complex distributed applications against us, is that we have this roster of all the clients.
And then we've gotta decide, are we gonna get into the business of managing the upgrade of clients through that roster? Or are we gonna just say, hey, you can query that roster, know where those clients are, and deal with that yourself. And that's kind of where we are now.
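The compatibility model described above, where old clients keep working but only upgraded clients see new capabilities, boils down to capability negotiation. A minimal sketch of the idea in Python, with made-up feature names; this is an illustration of the pattern, not Aerospike's actual wire protocol:

```python
# Hypothetical sketch of client/server capability negotiation during a
# rolling upgrade: the usable feature set is the intersection of what
# each side advertises, so mixed client generations keep working.
# Feature names here are invented for illustration.

SERVER_FEATURES = {"basic-kv", "expressions", "secondary-index", "json-cdt"}

def negotiate(client_features: set) -> set:
    """Return the features both client and server can safely use."""
    return client_features & SERVER_FEATURES

# An older client that predates expression support still works,
# just without access to the newer capabilities.
old_client = {"basic-kv"}
new_client = {"basic-kv", "expressions", "json-cdt"}

print(sorted(negotiate(old_client)))
print(sorted(negotiate(new_client)))
```

Under this scheme a mixed fleet degrades gracefully: each client simply operates at the intersection of what it and the server both understand.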
[00:46:31] Unknown:
But this is something that is coming to the fore. Right? You mentioned too as far as abstraction layers over the data model where it's primarily key value, but you have this capability of sticking richer structures in as the value. And for people who are trying to use it as a document store or for people, as you mentioned, who might be coming from a relational point of view, what are some of the abstraction layers and systems that you and your customers are building to be able to manage these relational or hierarchical data architectures?
[00:47:04] Unknown:
The area where we have sort of a rich interaction with customers in the open source space is just this area. Right? So we've worked with some customers who built Spring Data support, you know, libraries. And so we've taken those up and are, you know, investing in them and supporting them ourselves. I think I mentioned that we've added JSON. We've also, you know, built a Java object mapper that really works at, you know, programming time: annotate the code, and we will generate the calls to our system so that the programmer doesn't have to think about it. So, you know, POJO or whatever you wanna call it. Right? So we have good support for that.
Well, we also have released a beta of a Redis compatible set of interfaces, if you will. Because, you know, we have a lot of business where people thought all they needed was a cache. That's what I'll say. Right? And there's this progression. The first reaction when somebody that's an enterprise starts to go digital is they say, gee, the systems we have can't keep up with it. Let's put a cache in front of it. And then they go, that's probably not enough. Let's put, you know, Cassandra in as sort of a cache that is gonna capture some data, and then we'll operate on that data as if it were another database. And then that hits a wall. So we see that over and over again.
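The cache-in-front-of-the-database progression described here is commonly implemented as the cache-aside pattern. A minimal, self-contained sketch, with a plain dict standing in for the cache tier and a hypothetical `fetch_from_db` standing in for the slow backing store:

```python
# Minimal cache-aside sketch: the pattern enterprises often reach for
# first when the backing store can't keep up. The dict stands in for
# the cache tier; fetch_from_db is a hypothetical slow system of record.

backing_store = {"user:1": {"name": "Ada"}}   # stand-in for the slow DB
cache = {}

def fetch_from_db(key):
    return backing_store[key]        # imagine this takes milliseconds

def get(key):
    if key in cache:                 # cache hit: fast path
        return cache[key]
    value = fetch_from_db(key)       # cache miss: go to the database
    cache[key] = value               # populate for next time
    return value

get("user:1")                        # miss, populates the cache
get("user:1")                        # hit, served from the cache
```

The wall he mentions shows up when the cached data starts being treated as another database: invalidation, consistency, and durability concerns that the pattern itself does not solve.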
But to ease that transition, we've built this library that, you know, will ease the need to port your application completely. But, you know, it has a cost to it, as does, you know, any of these models where we layer things on. The other area, you know, you mentioned relational, for folks who are sophisticated about that. There's all this data there now, and we don't need the real time, but you support read, write, mixed workloads really well, and we can scale the cluster out. We'd like it to support SQL. So our path to that has been Presto. You know, I still call it Presto. There's a little war about Trino versus whatever, but, you know, it's Presto.
We've written a connector there, and that has this, what I consider, really nice model of being able to distribute out and parallelize if you have multiple queries going on. You know, they can spawn workers. We've written workers that handle a lot of push down into our system, and so it performs very well. But it's not real time by our standards, though some customers say, yeah, that's more real time than the relational database. Right? And so there are just these grades of things, and I think we're gonna see that.
We look at this as another layer of middleware, if you will, but it is this distributed world. And that's kind of our approach to all these things. Right? We do some things in the server. I should mention that we've invested a lot in providing a sort of rich expressions capability. So that if you want to filter what's coming back to you or what's being pushed out by our change notification system, you can write an expression that's executed on the server and gives you that, you know, data locality thing. And so we're expanding this expressive capability that a programmer can leverage in many different ways on the server or in new client, you know, libraries and models.
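The server-side expressions described above are a form of predicate push-down: the filter runs where the data lives, and only matching records cross the network. A generic sketch of the idea using plain Python predicates and made-up record data, not Aerospike's actual expression API:

```python
# Sketch of server-side filter expressions (predicate push-down):
# the predicate is evaluated next to the data, so only matching
# records are shipped back to the client. Records and field names
# here are invented for illustration.

records = [
    {"key": "txn:1", "amount": 120, "region": "EU"},
    {"key": "txn:2", "amount": 30,  "region": "US"},
    {"key": "txn:3", "amount": 500, "region": "EU"},
]

def server_side_scan(predicate):
    """Run the filter on the 'server' and return only matches."""
    return [r for r in records if predicate(r)]

# The client ships the expression, not the data:
big_eu = server_side_scan(lambda r: r["region"] == "EU" and r["amount"] > 100)
print([r["key"] for r in big_eu])
```

The payoff is exactly the data locality he mentions: the cost of the filter stays on the server, and network traffic is proportional to the result set rather than the dataset.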
[00:50:52] Unknown:
In your experience of working with the Aerospike team and with your customers and experimenting with the system and seeing the ways that it's being deployed, what are some of the most interesting or innovative or unexpected ways that you've seen the Aerospike system used? Yeah. Sometimes we're surprised is what I'll say. We hadn't anticipated
[00:51:09] Unknown:
that use of it. The biggest thing, I think, that we've seen is that people have started to use it in ways more like a traditional, you know, operational system of record. Right? And that leads to new demands on us, demands in terms of security, in terms of management, etcetera. Right? And that use of it as a, you know, replacement, if you will, for traditional systems is, I think, the thing that's driving the next wave of our revenue as well. Right? I'll tell you, it goes like this. So we put strong consistency in there, and we thought of it as, you know, this is, like, gotta be used for some certain use cases. But now they're standing up, you know, large clusters to handle transactional stuff in real time.
And that ability, you know, to do that leads to new capabilities for payment systems. Okay? And we had a customer that was building a payment system for the European banking system. And we have something called RackAware so that we can split things across zones. Right? They came to us and said, well, what would you have to do to be able to split that across data centers and be able to handle real time hot standby, you know, that we could have, or to provide immediate low latency read capability at different sites, for the penalty of the speed of light, in doing strongly consistent transactions across things. And we hadn't contemplated this type of thing, because we were thinking sub millisecond means real time.
They were like, you know, do you have any idea? It's only 150 milliseconds. That's incredible, to do a strongly consistent transaction that shows up, you know, in 3 sites geographically distributed. Okay? And they're like, this is amazing. And we're like, it's kinda slow. But then we realized it's not kinda slow for that use case. Okay? And it gives you a measure of resilience. That's incredible. Right? And now we have people talking to us about doing things like, and this is where, you know, I always have to say, these days, customers lead you, if you will. You know, they say, hey. We think you can do this with your system. And then we go, yeah. We'll have to do this, you know, to make that possible or manageable.
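The 150 millisecond figure for a strongly consistent commit across three geographically distributed sites is roughly what physics allows. A back-of-the-envelope check, with assumed (not measured) numbers for inter-site distance and the number of commit round trips:

```python
# Back-of-the-envelope check of the ~150 ms cross-site commit figure.
# Assumed numbers, illustrative only: ~6,000 km between sites, light
# in fiber at ~200,000 km/s, and two round trips to reach commit.

distance_km = 6_000           # assumed inter-site fiber distance
fiber_speed_km_s = 200_000    # light travels at roughly 2/3 c in glass
round_trips = 2               # e.g. a prepare phase plus a commit phase

one_way_ms = distance_km / fiber_speed_km_s * 1_000
total_ms = one_way_ms * 2 * round_trips

print(f"{total_ms:.0f} ms")   # prints 120 ms, before queuing and processing
```

With queuing, processing, and less direct fiber paths on top of that floor, a real-world result in the 150 millisecond range is about as good as the speed of light permits.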
But they're doing this, and they're even talking to us now about, you know, we need redundancy in these transactions, not only within data centers, within one cloud provider, but we're neutral to the cloud provider, so we can give you the ability to split that across cloud providers even. Right? There are penalties for this, because they have ingress and egress costs. But given a workload that demands that, it can be, you know, something that people are willing to pay for. And, you know, we're very efficient in terms of the amount of data being, you know, shoveled around. So we see that, and that's something that really has been driven, you know, by customers going, look, there's some capabilities inherent in your architecture.
We think it can be exploited differently and more aggressively than your market, you know, if you will. And this led to some work we had to do for sure. But it was really driven by these new workloads that are truly global,
[00:54:49] Unknown:
if you will. In your own experience of working with the technology and the business of Aerospike and working with your customers, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process? Yeah. You alluded to it. We discussed it in some depth. Right?
[00:55:05] Unknown:
But it's that the cloud, right, was constructed for a first wave of workloads, if you will. And there are some data from various analysts that show that while every large enterprise is in the cloud, right, what's in the cloud? Only a small percentage of their transactional workloads, their core mission critical transactional workloads, have migrated to the cloud, much less than the cloud providers would like. Now because of all these systems of engagement and new applications at the edge, there's been huge, huge growth. And the reason people think there's headroom from an investment standpoint in all the cloud providers is you ain't seen nothing yet. Right? And I think the work now is really figuring out how to handle real time transactional workloads and large real time datasets.
You know, if you think about HBase, right, and you think about Hadoop, like, those things are dead. What replaces them is something that allows real time access to massive amounts of data for decisioning, you know, driven by AI/ML. And those new workloads, right, are gonna run in the cloud, and they run in the cloud today, but not very well in real time and not in a transactional sense, if you will. And I think that we're working on that. We have a lot of thoughts around that. As we have customers who have made that transition, you know, on premise, and they start having the conversations about moving those workloads, massive workloads, up to the cloud, that's led to more discussions with cloud vendors about that space. And I think, to me, you know, I'm still excited about technology and about databases and what, really, you know, information at massive levels means.
That makes this just a super fun, challenging space to deal with. And I'll sort of sum things up this way. Our chief revenue officer and I had a discussion with a new customer who's really a purveyor of, you know, derivation of information across massive ingestion. And, you know, we were having this conversation. We asked, like, how many, you know, different streams of data do you have coming in in real time into this cluster? And, you know, they said, 150,000. Okay? And our CRO said, did you say 50,000? That's incredible. And they said, no, no, 150,000. And I said, yeah, I heard it that way. And the CRO said, we can do that?
You know, and the customer said, yeah, it's the only reason we bought you. Because, you know, the other things we looked at couldn't handle that ingestion. And I think that that's a great representation of a difference in perspective. It's not only the data within your enterprise. It's all the data in the world that's available that applies to your problem. That's the way people are thinking about
[00:58:16] Unknown:
applications and databases and data technology, if you will. For people who are interested in being able to accelerate the pace of interaction with their data or be able to massively scale out their capacity, what are the cases where Aerospike is the wrong choice?
[00:58:34] Unknown:
That's a good question. And there are clearly places where that's true. You know, what I would say is that for big datasets, if you're really looking for an analytics store, we're not a columnar store. Okay? Now we may add that capability built on some of the techniques that we talked about here, and we talk about that. But we're really focused on operational transactional stores and the ability to derive insight from them in real time, but not really the data warehouse kind of thing. You know what? Snowflake's done that pretty well. Billy Bosworth has a new company, you know, focused on Arrow and some other things that, you know, is attacking that space. That's them. That's not us. Okay? Might become us sometime in the future, but that's not us today.
We're just really focused on solving this set of problems. Now I'd say that's really the dividing line for us. The other thing I'd say is that, you know what, if you think that, you know, 2 terabytes is a large data set, there are probably cheaper, simpler ways to solve your problem than us. We talk about aspirational scale. We talk to a lot of new tech startup companies, and they're like, we wanna buy 500 gigabytes, and it seems kind of expensive. And then we say, what are you gonna be at in 18 months? And they say, 100 terabytes. And we say, we'll be the most cost effective solution for you. Okay?
And in enterprises, right, they're getting smart about this because they've been burned multiple times by having to replatform three, four times as they scale up. And now they're saying, we need to start here. But there are a lot of applications that aren't gonna be that big. And, you know, I would personally say use Mongo, you know, use Couchbase. Those are great solutions. They've spent a lot in sort of making it easier to program to. They're a bit ahead of us there. But, you know, if you need the scale, if you need the low latency, if you need the throughput, you know, we're probably not the only game in town, but, you know, the only game that I really understand in town, I guess.
And as you continue to iterate on the product and the platform for the Aerospike system, what are some of the things you have planned for the near to medium term? Near to medium term, one of the things we're really looking at is, like, I guess, two different vectors. One is security, and I'll get to that. But the other thing is I mentioned we were doing a lot of work in secondary indexes. And so we'll take that to where we'll be able to index into those nested, you know, rich data structures within a value in the key value store. And that really opens up a lot of new applications that can be done really, really quickly, because our secondary indexes are now, you know, pretty much as fast as our primary hash to get, you know, from the key to the value. And now being able to do that across multiple vectors, you know, is gonna open up a huge amount of new capabilities.
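Indexing into nested values within a key-value record can be pictured as an inverted map from a nested field's value back to the primary keys that contain it. A hypothetical sketch of the idea, with invented record shapes, not Aerospike's actual index implementation:

```python
# Sketch of a secondary index into nested values: an inverted map from
# a nested field's value back to the primary keys that contain it.
# Record shapes and field names are invented for illustration.

from collections import defaultdict

records = {
    "order:1": {"customer": {"id": 42, "tier": "gold"},   "total": 99},
    "order:2": {"customer": {"id": 7,  "tier": "silver"}, "total": 15},
    "order:3": {"customer": {"id": 42, "tier": "gold"},   "total": 250},
}

def build_index(path):
    """Index records on a nested field, e.g. ('customer', 'id')."""
    index = defaultdict(list)
    for key, rec in records.items():
        value = rec
        for step in path:          # walk down the nested structure
            value = value[step]
        index[value].append(key)
    return index

by_customer = build_index(("customer", "id"))
print(by_customer[42])             # both of customer 42's orders
```

Once such an index exists, a lookup on the nested field is a single map access rather than a scan, which is the "as fast as the primary hash" property described above.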
On the security front, as we move into more and more enterprise workloads, we've also gotten into the federal space. There's this demand for us solving the problem of, hey, in this extensible data structure that you have, this nested thing, how fine a granularity can we get to without impacting the scale and the throughput and the low latency? And what are the trade offs? And how can we provide more constrained pieces? The other thing, because we scale so much, right, is that we have people using us in a shared service manner or in a multi tenant manner across, you know, hundreds and hundreds of similar workloads, but they want to contain those. So we're doing a lot of work around being able to fence off different applications from each other in a fairly elastic and pliable way still, right, but with quotas and limits of different kinds. And every time we think we have it solved, a customer comes up and says, no. But I'm running this combination of things, and here's what's missing. So we continue to invest heavily in that. Are there any other aspects of the work that you're doing at Aerospike or the overall
[01:03:00] Unknown:
potential and use cases for real time systems that we didn't discuss yet that you'd like to cover before we close out the show?
[01:03:07] Unknown:
I'm really seeing new demands for this notion of real time. Every time you think you know how fast the pace of business is going to be, people are driving it faster. You know, we talked a lot about, you know, the financial services space, you know, program trading, machine trading, and such. Right? Well, that's filtering out into every aspect of the world almost. And the need to be able to have those insights as fast as you can is only growing. Right? And so I think that's gonna mean more competition for us, I'm sure. Right? But it also means more market for us, and I think that it keeps being fascinating to me how people are discovering new data streams, if you will. If you look at all these supply chain problems we have, because I spent a fair amount of my life in that space too, it's fascinating how broken it is right now. And it's because it was optimized within a single supply chain for efficiency, but not taking in all the data you would need to be able to manage the trade offs between resiliency and efficiency. Right?
But with these models we're talking about, we have the bandwidth and the capability in the cloud to solve these problems. And I think that, you know, that's gonna have to be done in this, you know, hyper connected world that we live in that's subject to, you know, disruption in ways we don't expect.
[01:04:49] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. You know, I think the biggest thing is that, you know, we touched on it with the cloud vendors
[01:05:09] Unknown:
that I don't think they understand low latency. I think they've divided it up more based on the first set of workloads that they had, and they have a lot of workloads coming at them really fast that they're gonna have to fill in some new instance types. And I think that, you know, as I said, we're starting to work with those vendors more closely. We're not alone in that, I'll say. Right? And I think that that's gonna be 1 of the key things. And it's also in this notion of data movement. You know, I think that you're gonna see a lot more focus on really getting the pipes between data centers to expand, to be more parallelized because the amount of data changing hands is something that's just amazing to me. You know, people talk about the monetization of data. I think more about, you know, what's the available set of data that you can apply to any problem?
And it grows every day. Well, probably every minute,
[01:06:15] Unknown:
you know. Well, thank you very much for taking the time today to join me and share the work that you and your team are doing at Aerospike. It's definitely a very interesting set of technologies and an interesting platform and something that I've been keeping an eye on for a number of years. So I appreciate having the opportunity to speak with you and learn more about what you're working on. Definitely excited to see where things like Aerospike and the overall movement towards more real time access to scalable datasets is going. So thank you for all the time and effort that you and your team are putting into that, and I hope you enjoy the rest of your day. Thank you. It's great to be here. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Welcome
Interview with Lenley Hensarling
Overview of Aerospike Project
Use Cases and Market Evolution
Technical Architecture and Data Model
Hardware and Performance Optimization
Operational Characteristics and Management
Abstraction Layers and Client Compatibility
Challenges and Lessons Learned
Future Plans and Security Enhancements
Closing Remarks and Contact Information