Summary
Kafka has become a de facto standard interface for building decoupled systems and working with streaming data. Despite its widespread popularity, there are numerous accounts of the difficulty that operators face in keeping it reliable and performant, or trying to scale an installation. To make the benefits of the Kafka ecosystem more accessible and reduce the operational burden, Alexander Gallego and his team at Vectorized created the Red Panda engine. In this episode he explains how they engineered a drop-in replacement for Kafka, replicating the numerous APIs, that can scale more easily and deliver consistently low latencies with a much lower hardware footprint. He also shares some of the areas of innovation that they have found to help foster the next wave of streaming applications while working within the constraints of the existing Kafka interfaces. This was a fascinating conversation with an energetic and enthusiastic engineer and founder about the challenges and opportunities in the realm of streaming data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- If you’re looking for a way to optimize your data engineering pipeline – with instant query performance – look no further than Qubz. Qubz is next-generation OLAP technology built for the scale of Big Data from UST Global, a renowned digital services provider. Qubz lets users and enterprises analyze data on the cloud and on-premise, with blazing speed, while eliminating the complex engineering required to operationalize analytics at scale. With an emphasis on visual data engineering, connectors for all major BI tools and data sources, Qubz allow users to query OLAP cubes with sub-second response times on hundreds of billions of rows. To learn more, and sign up for a free demo, visit dataengineeringpodcast.com/qubz.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Alexander Gallego about his work at Vectorized building Red Panda as a performance optimized, drop-in replacement for Kafka
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Red Panda is and what motivated you to create it?
- What are the limitations of Kafka that make something like Red Panda necessary?
- What are the current strengths of the Kafka ecosystem that make it a reasonable implementation target for Red Panda?
- How is Red Panda architected?
- How has the design or direction changed or evolved since you first began working on it?
- What are the challenges that you face in automatically optimizing the runtime to take advantage of the hardware that it is deployed on?
- How do cloud environments contribute to that complexity?
- How are you handling the compatibility layer for the Kafka API?
- What is your approach for managing versioning and ensuring that you maintain bug compatibility?
- Beyond performance, what other areas of innovation or improvement in the capabilities and experience do you see while adhering to the Kafka protocol?
- What are the opportunities for innovation in the streaming space that aren’t being explored yet?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Red Panda being used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building Red Panda and Vectorized?
- When is Red Panda the wrong choice?
- What do you have planned for the future of the product and business?
- What is your Hack The Planet diversity scholarship?
Contact Info
- @emaxerrno on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- @vectorizedio Company Twitter Account
- Concord, an alternative to Flink
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Their comprehensive data-level security, auditing, and de-identification features eliminate the need for time-consuming manual processes, and their focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms.
Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta, that's i m m u t a, and get a 14-day free trial. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode, that's l i n o d e, today and get a $60 credit to try out a Kubernetes cluster of your own. Don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Alexander Gallego about his work at Vectorized building Red Panda as a performance-optimized, drop-in replacement for Kafka. So, Alexander, can you start by introducing yourself? Hi, Tobias. Thanks for having me here.
[00:01:57] Unknown:
I am the founder and CEO of Vectorized, working on Red Panda, which is a drop-in replacement for Kafka for mission-critical systems. I've been working on streaming for the last 10 years. I actually started on the computational side of things. I was the CTO and founder of a company called Concord, which is similar to Apache Flink, but we wrote it in C++, and we sold that to Akamai in 2016. During my tenure at Concord, we basically couldn't find a storage system that could keep up with that system, and I've now moved to the storage side of things with Red Panda.
[00:02:32] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:36] Unknown:
I actually went to school for crypto, and I decided to drop out of grad school. Anyways, after college, I went to work for this ad tech company in New York City. And from a technical perspective, it was super cool. The idea was, like, let's compete with Google on ad tech for the premium publishers and advertisers. I was the first engineer there, so it was a really exciting opportunity for me as a recent college grad. And what we did there was build this real-time machine learning pipeline for ad bidding, which doesn't sound super cool today. But honestly, 10 years ago, it was really innovative, and we were basically printing money by doing that. You know, that's how I got started.
[00:03:18] Unknown:
You mentioned that Red Panda is this drop in replacement for Kafka with a focus on mission critical applications. So I'm wondering if you can give a bit more context as to what type of applications those might be and some of the ways that Red Panda facilitates that use case.
[00:03:34] Unknown:
Effectively, through a lot of market research, we focus on three tenets for the product. And then I'm gonna tell you a little bit more about the particular use cases that I think we're a really good fit for. Before I started this out as a full company, I interviewed something like 40 companies, and they spanned everything from IoT to big tech. And what we found is that the majority of the companies were actually unhappy running Kafka. What was really good about Kafka was actually the entire ecosystem. Like, the Kafka API sort of won, but the guarantees of Kafka, whether it was data safety or running Kafka at scale, a lot of those companies were having difficulties with. So around 50% of the companies in this survey were really having difficulties running Kafka. And for those that did manage to operate Kafka at scale, the usual suspects, you know, Netflix, Dropbox, etcetera, people with deep distributed systems expertise, they weren't getting enough performance. And so what you end up with is massively overprovisioned clusters.
And for some small percentage, they actually cared a lot about data loss. What's interesting is that across probably the last 100 enterprises we've talked to, the majority of enterprises actually run Kafka in an unsafe mode. So when we say mission-critical systems, it's really people that want to manage one system. They wanna have one fault domain. They don't wanna manage ZooKeeper. They don't wanna understand how to manage ZooKeeper snapshots or run TLS on the leader election port of ZooKeeper or restate the snapshot or understand how to tune the JVM. They really want this streamlined system that kind of gets out of the way. That's really the first tenet: operational simplicity. The second one was safety. And if you look at what happened in hardware over the last 10 years: a decade ago, a company called Backblaze published statistics about hardware, and 10 years ago, running a terabyte of SSD cost something like $2,500.
But now the cost of running a terabyte of SSD is more like $200. Right? So that was sort of the first bottleneck. And because hardware got better on many levels, on networking, on disk, and on CPU with the rise of many-core systems, we are now able to give stronger guarantees to our users about data safety. And so the main critical piece of our architecture that has had a profound impact for a lot of our customers, whether they're in oil or ad tech or whatever, is really that we're safe. We give them a linearizable Raft implementation to save their data, with a sound fault domain and a sound failure model. Like, as a developer, you actually understand exactly what it means to have two out of three replicas up and running. You understand that there'll be no gaps in the logs. So you have this really strong log completeness guarantee.
You know, if you contrast that with the default settings of Kafka: you can run Kafka safely, by the way, but no one we've talked to in the last 100 enterprises does it, because the performance impact of flushing every message to disk is too slow. So if you contrast that approach with Kafka's, we basically give users safety by default, and we allow them to run with low latency, high throughput, and safety. And so that's really what we mean when we say mission-critical systems.
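For context on the "safe mode" being discussed, these are the standard Kafka settings that trade throughput for durability. This fragment is illustrative only and mixes scopes (flush.* are topic/broker-level, acks is a producer setting, min.insync.replicas is broker/topic-level); it is not a configuration Red Panda requires:

```properties
# Topic/broker level: fsync the log after every single message.
# This is the safe-but-slow mode referenced above; the default
# instead defers flushing to the OS page cache.
flush.messages=1
flush.ms=1000

# Broker/topic level: how many in-sync replicas must persist a write.
min.insync.replicas=2

# Producer level: wait for acknowledgement from all in-sync replicas.
acks=all
```

With the defaults (no per-message flush), an unclean broker crash can lose recently acknowledged data, which is the safety gap the interview keeps returning to.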
[00:06:59] Unknown:
And in terms of the safety aspect, you mentioned that with Kafka you can flush everything to disk on every write, but because of the overhead of the JVM, it ends up being slow. I'm curious if, in your implementation, you're using something akin to a write-ahead log, such as what Postgres uses, for being able to honor that safety guarantee while also maintaining the overall performance and throughput of the underlying hardware?
[00:07:25] Unknown:
This is a deep technical topic, so let me give you the big highlights here. So it is actually not just the JVM. What the JVM adds to a system is really latency. And so there's two aspects to this question; it's understanding the difference between latency and throughput here. Garbage collection adds pauses, and we've seen this in production: we see systems, you know, regularly with two-second JVM latency pauses, and some systems with six-second latency pauses on the outliers. So that's the JVM.
But what I'm talking about is really throughput. And the reason most people don't run Kafka with the safety settings is because it affects not only their latency, but particularly their throughput. And so what happens when you issue an fsync on a particular file handle is that it injects a global barrier in the kernel. So if you write into a file, it actually injects a barrier on the file system itself, because you can't do anything else; you have to wait for the data to be flushed to the physical media. We have to flush the file, too, to guarantee data safety, but we do a lot of tricks around batch coalescing and debouncing the writes in our Raft implementation.
The second thing that we did there, at a technical level, is we don't use the page cache. Right? And so we eliminate a very large set of locks in the kernel. That's the first step; we'll talk about the networking in just a second. But when a message comes in, we're not consuming any resources from the page cache, which means the global locks and resources taken on a per-file-handle basis are just not taken. They're completely eliminated. The next thing is that we have this adaptive file allocation technique. A lot of the barriers, a lot of the performance limitations, come from synchronizing the metadata in the Linux kernel. And so by preallocating the file system space and coalescing the flushes and doing all of these low-level tricks, we're really able to drive probably 4x the throughput of a cluster while being safe, and deliver about 10x lower latency.
So hopefully, that disambiguates some of the performance things that we've seen with our Raft implementation.
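The batch-coalescing idea described above can be sketched in a few lines. This is a toy illustration, not Red Panda's actual C++ implementation: many appends are buffered and then amortized under a single write-plus-fsync, instead of paying the fsync barrier per record.

```python
import os
import tempfile

class CoalescingLog:
    """Toy append-only log that coalesces many appends under one fsync.

    Illustrative only: a real Raft-based log also tracks which waiting
    clients may be acknowledged after each flush completes.
    """

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.pending = []   # buffered records awaiting a flush
        self.flushes = 0    # how many fsyncs we actually issued

    def append(self, record: bytes):
        self.pending.append(record)

    def flush(self):
        # One write + one fsync covers every record batched so far,
        # instead of an fsync per record.
        if self.pending:
            os.write(self.fd, b"".join(self.pending))
            self.pending.clear()
            os.fsync(self.fd)
            self.flushes += 1

    def close(self):
        self.flush()
        os.close(self.fd)

# Usage: 1000 appends, debounced into a flush every 100 records.
path = os.path.join(tempfile.mkdtemp(), "log.bin")
log = CoalescingLog(path)
for i in range(1000):
    log.append(b"record-%d\n" % i)
    if i % 100 == 99:
        log.flush()
log.close()
print(log.flushes)  # 10 fsyncs for 1000 durable records
```

The data is just as durable once flushed, but the expensive kernel barrier is paid 10 times instead of 1000.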
[00:09:36] Unknown:
In addition to the challenges of running Kafka in a safe mode and the performance impacts that that brings, what are some of the other areas where Kafka has limitations that Red Panda addresses, in terms of simplifying the operability of the system beyond just the failure domains?
[00:09:53] Unknown:
One of the things that we're really proud of is our auto-tuner. Our team has a lot of deep technical expertise from big companies like Microsoft, Yandex, Red Hat, Akamai. And one of the things that you have to do at scale is basically automate yourself out of a job, almost. So one of the most challenging things in running any storage system, for the person running the system rather than the expert who wrote it, is to understand: honestly, what are the best settings for my computer? Like, how much memory do I allocate, in particular for the JVM? What are the garbage collection settings? How do I tune these things? So, of course, eliminating the JVM eliminates an entire category of problems. Right? So that makes it simple. The second thing that we do is we try to really deeply understand which kernel parameters affect the runtime performance, both throughput and latency, of Red Panda.
And instead of exposing this as a checklist that says, you know, run this file system, disable merges on the Linux I/O scheduler, enable your multi-queue networking, and all of these very deep technical things, we ship an auto-tuner. And so we guarantee that by the time Red Panda starts, the binary starts, which, by the way, is just a single file, the kernel has the optimal settings. I'll tell you three or four things that we do to optimize the experience for the person that's managing the storage system. So the first thing that we do is probe the hardware for whether the NIC can support multi-queue networking. And if it does, we enable multi-queue networking. The second thing that we do is detect if you're running an SSD.
And if you are, we probe the hardware, and we measure, with a target latency of 500 microseconds, how much data we can push to this NVMe device given that target latency. The third thing we do is tune a couple of other Linux kernel settings. One big one that has an impact: because we use DMA, because we don't use the page cache, because we talk effectively to the device controller, we tell the Linux I/O scheduler not to merge our I/O blocks. In fact, we say, don't do anything, because we, the application, understand much better how to align our memory buffers so that, you know, they're block-aligned like the file system expects. And so all of this is automated.
This kind of step forward in managing the system gives you a system that really gets out of the way. You run one binary. You do a systemctl start redpanda if you run systemd; it auto-tunes the machine and then it runs the thing. Of course, this happens super fast, which is very anticlimactic when I'm trying to show it to my wife. I'm like, look at all the things that we do, and it happens in, like, 100 milliseconds. But from a management perspective, that operational simplicity is really what the CEOs are buying. The technical person, the engineer in the room, they wanna hear all the details: how do we debounce the writes to Raft? How do we batch things? How do we improve latency?
How do we do all these low-level things, whether from the data management or the data structure perspective at the core? But what the CIO cares about is: okay, how hard is it to run this system? That's the first thing they care about. And secondly, what are the things you're improving over the status quo? Giving them this one binary is really, I think, the reason why we're getting so much traction with some of our customers. Not only do we improve on the safety guarantees of Kafka, with lower latency and higher throughput, but we also, from a management perspective, actually reduce the operational overhead of what it means to run a Kafka cluster at scale.
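To make the auto-tuner discussion concrete, here is a toy sketch of the kind of probing it describes. The function names and return shapes are hypothetical; only the sysfs path layout (`/sys/class/net/<nic>/queues/tx-*`) and the scheduler/nomerges knobs are standard Linux conventions, and the real tuner does far more:

```python
import glob
import os

def nic_supports_multiqueue(nic: str, sysfs_root: str = "/sys") -> bool:
    """Heuristic: a NIC supports multi-queue networking if it exposes
    more than one TX queue under sysfs, following the standard
    /sys/class/net/<nic>/queues/tx-* layout."""
    queues = glob.glob(os.path.join(sysfs_root, "class/net", nic, "queues/tx-*"))
    return len(queues) > 1

def io_scheduler_setting(device_is_nvme: bool) -> dict:
    """Settings an auto-tuner might choose: for NVMe devices whose
    application already issues block-aligned DMA writes, tell the
    kernel not to merge or reorder I/O (the 'none' scheduler plus
    nomerges=2, per the block-layer sysfs knobs)."""
    if device_is_nvme:
        return {"scheduler": "none", "nomerges": 2}
    return {"scheduler": "mq-deadline", "nomerges": 0}

print(io_scheduler_setting(True))  # {'scheduler': 'none', 'nomerges': 2}
```

A real tuner would write these values into `/sys/block/<dev>/queue/` before the server process starts, which is the "kernel has the optimal settings before the binary starts" guarantee described above.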
[00:13:37] Unknown:
You mentioned that there are some areas for improvement on the specifics of how Kafka is implemented. But at the same time, the overall ecosystem around Kafka is obviously large and vibrant enough for you to target it as being a drop in replacement. So I'm wondering if you can talk through the overall strengths of the surrounding ecosystem and why it is that you are choosing to be an easy replacement for Kafka rather than building your own protocols and trying to grow your own ecosystem in place of it?
[00:14:10] Unknown:
Yeah. We have to really give credit to Kafka here. What we figured out is that Kafka has won as the de facto API. We've talked to, by now, probably around a hundred enterprises, high-level executives, about running Kafka, and fundamentally, the Kafka API won. The Kafka API has become the de facto standard, and they have a huge ecosystem. There are millions of lines of code. You can take TensorFlow and push it into a topic, and then have that consumed by Spark ML and pushed back onto a topic, and eventually that makes it to Elasticsearch. And all of a sudden, with three or four little processes, you have a very sophisticated machine learning pipeline. Right? And so the API is really, in my opinion, the thing that won. Comparing the Kafka API to what SQL was for all of these databases is really how we're treating it. We said, okay, this protocol has won. Developers love it, and fundamentally, there are millions of lines of code written against it, and we wanna plug into that ecosystem. We wanna give them a better system for it. You can compare Red Panda and the Kafka API with SQL and the whole evolution there: you went from relational databases to NoSQL databases, like Cassandra. Right? So it was Postgres, and then everyone moved to Cassandra-type systems.
And then the last one is the NewSQL or CockroachDB-type system, which gives them, you know, distribution, geo-distribution, and so on. We really think that the Kafka API needs a new system that gives people the guarantees that they're used to. Right? There's this new stack of people that don't wanna run a JVM system, that want a system that integrates well with Kubernetes and all of these modern systems, but is easy to run, gives them better safety, gives them the things that they expect the queue to give them. But in terms of the Kafka API, it was hugely important.
And, honestly, from a business perspective, it was important that we leverage their ecosystem. There was no way otherwise; actually, the first couple of months of the company, I tried to boil the ocean. And clearly, it didn't work. I was like, look, I'm gonna invent this new API. We're gonna solve a bunch of things, like head-of-line blocking and all of these, you know, very low-level details. But I didn't take into account this super rich ecosystem. After talking with a bunch of enterprises, it became clear that what people wanted was not to touch their production application. They just wanted to point their application at a better system, and off you go at 10x the speed. That's really what they wanted. And they wanted a reduction in hardware.
They wanted to get back, for example, engineering capacity, because now you can run the same system with a part-time person rather than two full-time people. So we really thought that the Kafka API had this super rich ecosystem where we can give better guarantees without breaking any existing application.
[00:17:03] Unknown:
And so can you dig a bit more into how Red Panda itself is actually implemented and some of the ways that you are able to be a drop in replacement for Kafka?
[00:17:11] Unknown:
Kafka currently has around 50 different APIs. And each API, by the way, has up to 12 different versions. So let's look at fetch, for example, basically the consume API. There are 12 versions, and then you have an encoder and a decoder per version, per request and response. Right? So I send a message, and that has an encoder and a decoder, so two things. And then I receive a message, which has an encoder and a decoder, two more things. And for those four things, you have 12 particular versions. So there's no way we were going to write this manually. And so we spent the bulk of late last year, honestly, writing a parser combinator library where we could just code-generate the entire Kafka API surface. Kafka has these JSON files, and, it's like a variant of JSON, but it describes the types that the JVM type system expects.
So we can consume those files directly, and then we added an enhancement on top of that with some of the C++20 features to actually give stronger guarantees and prevent some abuse of the native API. That's the first level. And so if you look at fetch, for example, that would be 48 different scenarios that you have to take into account. So there's no way we would be able to keep up with Kafka; we wouldn't have a product if we had to manually write these 160-some APIs. A couple more deep technical things about how we guarantee compatibility: because the ecosystem is so rich, and we really are just targeting the API level, we say, talk to this TCP endpoint, and then we look like Kafka, we talk like Kafka, there's exactly no difference, and all of your applications work. Which means we can download all of these open source projects that are already interacting with Kafka, some of which rely on bug compatibility. In the case of the Sarama driver, for example, there are some particular bugs that we had to imitate, which, by the way, upstream does the same. And so we imported a large set of unit tests from the librdkafka tests, and we just pointed them at our IP. And because we speak Kafka, it simplifies this giant testing matrix: all we have to do is run these open source systems, and we put them on Amazon or Google, it really doesn't matter, and say, you know, test this and let us know where we failed. Those are two of the guarantees.
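The combinatorics described above can be sketched with a miniature, made-up stand-in for one of Kafka's protocol definition JSON files (the real message definitions live in the Kafka source tree and are considerably richer). The point is how quickly versions multiply into codec variants:

```python
import json

# Hypothetical miniature of a Kafka protocol definition file.
FETCH_SPEC = json.loads("""
{
  "name": "Fetch",
  "apiKey": 1,
  "validVersions": "0-11",
  "fields": [
    {"name": "ReplicaId", "type": "int32", "versions": "0+"},
    {"name": "MaxWaitMs", "type": "int32", "versions": "0+"}
  ]
}
""")

def generate_codecs(spec: dict) -> list:
    """Emit one encoder and one decoder name per version, per direction
    (request and response), mirroring the combinatorics that make
    hand-writing the API surface impractical."""
    lo, hi = (int(v) for v in spec["validVersions"].split("-"))
    names = []
    for version in range(lo, hi + 1):
        for direction in ("request", "response"):
            for op in ("encode", "decode"):
                names.append(f"{spec['name'].lower()}_{direction}_{op}_v{version}")
    return names

codecs = generate_codecs(FETCH_SPEC)
print(len(codecs))  # 12 versions x 2 directions x 2 ops = 48
```

Forty-eight generated codecs for a single API, times roughly 50 APIs, is the "160-some" surface that motivated code generation over hand-written parsers.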
And then on the safety level, you know, we hired Denis out of the Cosmos DB team to come in and run Jepsen. In fact, we actually extended Jepsen to be able to analyze longer histories, without going too far down the safety rabbit hole, which we can do next. So we give them safety and API compatibility, and all of the product features actually have an integration test. One of the most fun things, which we spent probably a couple of months writing last year, is this really interesting fault injection test, and we've been running it 24/7 since last year. Every 150 seconds, we inject a fatal failure: we delete, like, a Raft group, or we crash the file system, or we inject network partitions, we delete topics, we create topics, and we have to give the system, you know, time to recover, because at the same time, by the way, we're pushing a gigabyte of data per second. And so we can simulate a much larger fault scenario than most of our customers can. And we know, because we are the experts, what the difficult things are. And in addition to that, we run all of these, you know, fault exploration frameworks, these safety frameworks, these compatibility frameworks.

And that's how we are trying to guarantee safety and compatibility.
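A harness like the one described, injecting one fatal failure every 150 seconds while traffic flows, can be sketched as a deterministic fault schedule. The fault names and function shape are hypothetical, not Red Panda's actual test framework:

```python
import random

# Hypothetical fault catalog, drawn from the kinds of failures
# mentioned above.
FAULTS = [
    "delete_raft_group",
    "crash_file_system",
    "inject_network_partition",
    "delete_topic",
    "create_topic",
]

def fault_schedule(seed: int, duration_s: int, period_s: int = 150) -> list:
    """Pick one random fault per period. Seeding the RNG makes a
    long-running chaos schedule reproducible, so a failure found at
    hour 30 can be replayed exactly."""
    rng = random.Random(seed)
    return [(t, rng.choice(FAULTS)) for t in range(0, duration_s, period_s)]

# One hour of chaos: an injection every 150 seconds.
plan = fault_schedule(seed=42, duration_s=3600)
print(len(plan))  # 24 injections in an hour
```

The recovery check between injections (does the cluster catch back up while sustaining a gigabyte per second?) is the part that makes such a harness valuable, and the part this sketch omits.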
[00:20:48] Unknown:
Another interesting element of this is that while Kafka has become the de facto API for streaming storage systems, there are some other contenders, most notably being Pulsar, which has also admittedly added a protocol compatibility layer for Kafka. But then there's also the work being done for the open streaming standard to try and coalesce around a particular interface for streaming applications. And I'm wondering what your overall thoughts are on the viability of some of these other protocols and the potential areas for improvement in the Kafka APIs and ways that the overall ecosystem might be able to progress to a shared understanding of how best to interact with these types of systems?
[00:21:35] Unknown:
Let's tackle each of those in particular. And I think there's also a set of things we haven't yet talked about, which is what we see as the future of streaming. In a way, as a company, we're trying to position ourselves as the future of streaming for the new stack. So, talking about Pulsar: from a market distribution perspective, I think something like 90% of all the companies that we talk to use Kafka. There are some companies that use Pulsar. Maybe there's bias in the people that want to talk to us, but that's just, you know, the stats that we're seeing. There are very few people running NATS and all the other streaming systems, you know, RabbitMQ; very, very few, just in terms of the market. I was actually speaking with an executive from IBM the other day, and Kafka literally backs multibillion-dollar products for IBM. Like, that's how popular it is. And I think the reason is not necessarily the protocol, but the amount of connectors. Right? The fact that you can take TensorFlow and plug it straight into Elasticsearch, or plug it into a Cassandra-compatible database, or, you know, push it to DynamoDB.
It's kind of this Lego component for a lot of architectures, whether you're trying to use it to detect real-time fraud, or you're trying to use Kafka to deal with back pressure for databases that have a hard time scaling, like Elasticsearch. That is the thing that won. So to answer your question about streaming and the protocol: I think it's great. The great thing about open source and the market is that people will build the applications that they need, and they don't care whether it is a protocol or a standard or something; they're just trying to get work done. And the majority of people are getting work done with Kafka. In a way, it would be really easy for us to provide another parser mechanism or, like, you know, change the format from the Kafka format to JSON; it doesn't matter. But I think what matters is really this plug-and-play architecture where you get to leverage all of the Kafka connectors. And in fact, all of the new databases, whether it's OmniSci or ScyllaDB or CockroachDB, and MemSQL, too, actually have a transactional Kafka connector.
So the Kafka API is so popular that all of these modern database systems have actually built customized drivers to consume data from Kafka in a way that gives them higher and better guarantees, or different guarantees, than the standard connectors. So I'm not sure whether the open messaging standard is ultimately going to win. We're just trying to give users a better system for their existing code. It's really hard to tell. I think it's great that there are, you know, innovations in this space, and it would make, honestly, a lot of our jobs simpler if it wasn't the Kafka API, if it was, like, a standard that had, you know, well-understood semantics. Some of the things that we have to guarantee are really hard, because the bugs are not documented, of course, and you really kinda have to run an application, download a bunch of open source projects, to detect these edge conditions, which is why we leverage the entire open source ecosystem to test our code. But ultimately, I think the Kafka API is the one that won today. It would be very cool if the other messaging frameworks also take off.
[00:24:42] Unknown:
Going a little bit further down this path, the other aspect of the streaming ecosystem — and one of the things that Pulsar touts as an advantage — is that it also supports the more traditional messaging paradigm, similar to what RabbitMQ does, where it doesn't rely on this persisted append-only log for sending messages between systems. I'm wondering what your thoughts are on the utility of that being collocated within a system such as Kafka or Red Panda or Pulsar, versus running it as a dedicated system where you're relying on time-tested applications such as RabbitMQ?
[00:25:21] Unknown:
Request–response messaging, I think, is super useful. The answer to that in the Kafka ecosystem is a thing called Kafka REST, or a proxy. We actually have our own — written in C++, API compatible — where you send and receive messages. Under the hood, it proxies the protocol from REST, for this request–response pattern, to the Kafka protocol. The majority of the use cases we talk to are either large web 2.0 companies — like the largest analytical database in the world, the largest CDN, people doing fraud detection for the largest banks — or on the hedge fund side, where they're trying to do settlements. They care quite a bit about data loss, as you would imagine. Those are probably our focus. So I haven't really seen that much demand from a market perspective to run a lot of Kafka proxy stuff. In fact, I think the real driver for the Kafka proxy is that the Kafka driver — and when I say driver, I mean the client code — has a lot of knowledge. It is a very sophisticated piece of technology.
And the reason to use the proxy is really to bridge gaps. So when you connect to Kafka — whether it's via Go or C++ or Java, really any driver, but those are the most popular ones — the driver has to understand the internal IPs because it does leader election on the consumer groups. It has a lot of protocols embedded into the client, and so it needs to understand the internal cluster state. It actually fetches all the metadata from the controller and so on. And so the HTTP proxies, I think, are less used as a way to replace RabbitMQ in the request–response type of workload, and more to bridge external systems. So if you're trying to proxy your streaming writes to your infrastructure via HTTP, then you would put our C++ proxy in front of Kafka, and then anyone that talks HTTP can talk straight to this proxy without having to understand the mechanics or the internal client protocols.
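As a rough sketch of the HTTP bridging described here — the path and content type below follow the Confluent-style REST proxy conventions, and the proxy URL and topic names are made up for illustration; this is not Red Panda's exact API:

```python
import json

def build_produce_request(proxy_url: str, topic: str, records: list) -> dict:
    """Describe an HTTP produce call against a Kafka REST proxy.

    Path and content type follow Confluent-style REST proxy conventions;
    a real deployment may differ.
    """
    return {
        "method": "POST",
        "url": f"{proxy_url}/topics/{topic}",
        "headers": {"Content-Type": "application/vnd.kafka.json.v2+json"},
        "body": json.dumps({"records": [{"value": r} for r in records]}),
    }

# Any plain HTTP client can send this request -- no Kafka driver, and no
# knowledge of internal broker IPs or leader election, on the caller side.
req = build_produce_request("http://localhost:8082", "payments",
                            [{"amount": 42, "currency": "USD"}])
```

The point of the proxy is exactly this asymmetry: the caller only speaks HTTP, while the proxy holds all of the sophisticated client-side state.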
And for the majority of applications — whether it's your machine learning, your fraud detection, or your real-time ad bidding — most people just use that. And I think that Kafka has ultimately replaced RabbitMQ, at least in the market and what we've seen.
[00:27:41] Unknown:
And going back to the specifics of your implementation, what are some of the ways that the overall design or the product direction has changed or evolved since you first began working on it, and are there any initial assumptions that you had going into the project that have been updated or invalidated in the process?
[00:28:01] Unknown:
I would say the main thing that changed for us was our ability to add coprocessors. So let me give you a little bit of background here. For most APIs currently — let's use gRPC, to be efficient with our time — you define, inside a proto file, the data exchange format, the data structure you're going to exchange, and then you run it through an IDL generator like protoc, the gRPC compiler. It takes the schema and generates stubs for a multitude of languages — Python, Java, it doesn't matter. And that's roughly how a lot of APIs work: a known serialization format, a known endpoint, and a known way to interact with these endpoints.
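To make the contrast concrete, here is a hypothetical Python sketch of what schema-plus-codegen buys you — an agreed data shape and round-trip serialization, and nothing more; the `Person` type and fields are invented for illustration:

```python
import json
from dataclasses import dataclass, asdict

# What an IDL like protobuf gives you: an agreed-upon shape for the data...
@dataclass
class Person:
    first_name: str
    last_name: str
    ssn: str  # PII travels along untouched; serialization says nothing about it

def encode(p: Person) -> bytes:
    """Stand-in for generated serializer code."""
    return json.dumps(asdict(p)).encode()

def decode(raw: bytes) -> Person:
    """Stand-in for generated deserializer code."""
    return Person(**json.loads(raw))

wire = encode(Person("Ada", "Lovelace", "000-00-0000"))
restored = decode(wire)  # round-trips faithfully, but gives no data guarantees
```

The schema guarantees the shape of the data, not anything about its contents — which is the gap the coprocessor idea below is aimed at.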
What doesn't exist to this day in a way that is scalable — and I'll mention the specific details in a second — is a way to provide users with a data API that goes beyond serialization, that gives deep data guarantees. And so something that has evolved from the beginning of the product is our idea of coprocessors. So what's a coprocessor? Red Panda allows for inline WASM transforms. As you're sending the data, we can run it through a WASM transform — you can upload either JavaScript or WASM, we run it inside the V8 engine — and then we can materialize that transformation to a different topic. This may sound trivial, but I'm gonna walk you through an example: one of our customers, where we took five clusters of 20 nodes each and reduced them to one 20-node cluster.
And this goes to the idea of the data API. What they were doing is they were sending information — for the sake of this discussion, let's say it had three fields: first name, last name, and some PII information. Right? And the other clusters were literally materialized views of this original cluster. The way it's done today is you use Kafka Streams or Flink or Spark or some other job to consume data from one topic and push it to another topic. We think that that is not scalable. The reason it's not scalable is because you're consuming network resources from the clusters that you're consuming from and producing to. And so our answer to that — and we're still in the early stages of this, and we think it's the right approach; ultimately, the market will decide — is that giving people the ability to upload WASM transform scripts basically decouples the production of data from the consumption of data with a data API.
So one example here is you can push some person object with some PII information. And of course, the simple thing to do is to say, okay, I'm gonna remove all of the PII fields and push the records to a separate topic. So let's take change data capture: you have Debezium pushing your transactions out of a database, and now you can run an inline transform and guarantee that the consumers of a particular topic will never see PII information. But then you can even go beyond that. Right? You can say, I am going to transform the same three fields — first name, last name, and some PII information, let's take the Social Security number — and the output is gonna have the same three fields. So it's not just about removing or adding fields, or data enrichment — which, by the way, is useful — it's about changing the meaning of the data.
And so being able to upload inline transformations is, I think, what is fundamentally missing from being able to give people API-like guarantees — like you do for your microservices, but now for your data. You can upload a transform and say, I am going to divorce the producers of this data from the consumers and the versioning of my sources of data, and guarantee — whether it's for your machine learning workloads — a uniform contract, a uniform serialization format with a uniform set of fields, or I'm gonna enrich some data with a few extra things. This one-shot transformation is, I think, what is missing from being able to truly expose the idea of data APIs, which, again, goes beyond serialization. It's about deep, introspective guarantees on the data.
And what's useful about our particular implementation is that you get per-stream correctness guarantees, and because the transforms execute locally, you're not consuming network transfer. You're not consuming extra CPU across clusters. You're simply pushing the data through a materialization engine — the V8 engine — and saying, hey, run this function, and whatever I get back, I'm gonna write to the two topics. Right? So it's very cheap for us to do in a very scalable way, because the only thing that is increasing is disk. But from a business perspective, it now gives you the ability to guarantee GDPR compliance. You can say: all of the tuples that come through this stream are guaranteed to have the Social Security number removed for everyone in here, or the driver's license information removed, or the address masked, and so on.
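A toy model of the coprocessor idea described above — the real transforms are JavaScript/WASM executed in V8 inside Red Panda; this Python sketch, with invented field names, only illustrates the semantics of materializing a PII-stripped topic:

```python
# Records from an input topic are run through a user-supplied function, and
# the result is materialized to a derived topic.  Field names are made up.
PII_FIELDS = {"ssn", "address", "drivers_license"}

def strip_pii(record: dict) -> dict:
    """One-shot transform: drop PII fields before any consumer sees them."""
    return {k: v for k, v in record.items() if k not in PII_FIELDS}

def materialize(input_topic: list, transform) -> list:
    """Apply the transform inline, yielding the records of the derived topic."""
    return [transform(r) for r in input_topic]

source = [{"first_name": "Ada", "last_name": "Lovelace", "ssn": "000-00-0000"}]
safe_topic = materialize(source, strip_pii)
# Records in safe_topic carry no "ssn" key, so downstream consumers of the
# materialized topic can never observe that field.
```

Because the transform runs inside the broker rather than in an external Kafka Streams/Flink/Spark job, no second cluster has to consume and re-produce the data over the network.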
[00:33:08] Unknown:
Yeah, it's definitely a very interesting and useful approach to that problem. And that brings me to my next question, which is to explore some of the ways that you are able to innovate on the overall space beyond just performance, given that you are tying yourself to the specifics of the Kafka API, and some of the room for growth or other areas of the streaming ecosystem that are not being thoroughly explored at the current point in time.
[00:33:35] Unknown:
That's always so fun. I think the most fun, as a technologist, is building, so every time someone asks me this question, it's a lot of fun for me to talk about. We're solving real problems right now, and to some extent, we had to start with something. You know, I had Emacs and GCC on my computer, and I was like, okay, I'm gonna build the system. Eighteen months later, we have a system that is compatible with Kafka. And that's where we had to start. Right? We had to start with better guarantees, better performance — all of these things that we've been talking about through the podcast. I think the future, though, is much more interesting, in particular on the disaster recovery and compliance side, and the idea that we call shadow indexing.
And so shadow indexing is: you push your data to the Red Panda cluster, and before we delete a file locally from disk, we upload it to S3 or Google Cloud Storage or Azure Blob — it doesn't matter, some durable, very cheap storage. Right? So that's the first part. That ability is tiered storage — Kafka is working on a KIP for it, and Pulsar has something like that — but that's not enough. What we also upload is the shadow index, which is basically self-describing metadata about the state. What that allows us to do is have a true global Kafka cluster. How does that work? Let's say that you're running in US East and you wanna read from US West. US East pushes its data, working as a regular Red Panda cluster, and uploads that data to S3. When US West wants to read data that was written to US East, the usual way is MirrorMaker, which is how most of this is done today. Right? You have this thing that mirrors the topic, and it consumes resources at the Raft level, the disk level, the compression, the network — on both sides of the cluster.
Instead, we call the S3 API: hey, S3, CopyObject — an API that effectively allows you to move the object, or basically point the source at the other data center, let's say US East. Okay, but fundamentally, what is the shadow index? The shadow index only transfers a little bit of metadata — we're talking kilobytes for, like, a petabyte range of actual data. So we don't consume any cluster resources, and we have this shadow index, this metadata, that tells us: oh, if you wanna read the entire historical data of this topic, you can.
It is in US East; let me copy the data to the US West S3 — which is very cheap, like two cents per gigabyte — and only materialize what the client asks for. Why is this a game changer, I think, for the future of streaming? Fundamentally, what you're giving people is the unification of historical file system access and real-time access, through the exact same Kafka API. And so, of course, on US West, if you request historical data that was written in US East, it'll have to fetch it from S3 and materialize it on some scratch space of the file system, but it'll give you the data. Say you're a machine learning person in an organization — and by the way, this keeps the access controls and everything — you don't have to understand where the data comes from. Right? Red Panda will manage the hierarchy of data. It's not just uploading. It's the ability to query historical and real-time data with the exact same Kafka API.
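The cross-region read described above leans on S3's server-side CopyObject call. Here is a minimal sketch of the arguments that would be handed to boto3's `s3.copy_object` — the bucket names and object key are hypothetical, invented for illustration:

```python
def cross_region_copy_args(source_bucket: str, dest_bucket: str,
                           object_key: str) -> dict:
    """Arguments for an S3 server-side copy (boto3's s3.copy_object).

    The copy happens entirely inside S3, so neither Red Panda cluster
    spends CPU, disk, or network bandwidth moving the log segment.
    """
    return {
        "Bucket": dest_bucket,  # e.g. the reading region's bucket
        "Key": object_key,
        "CopySource": {"Bucket": source_bucket, "Key": object_key},
    }

# Hypothetical layout: one uploaded log segment for a topic partition.
args = cross_region_copy_args("redpanda-us-east", "redpanda-us-west",
                              "payments/0/segment-0001024.log")
# s3_client.copy_object(**args)   # only run this against a real bucket
```

The actual call is commented out because it needs real credentials and buckets; the point is that the data path is S3-to-S3, with only kilobytes of shadow-index metadata crossing between clusters.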
So all of that is offloaded onto the storage tier. And I think really that idea of leveraging the cloud provider's native services — truly becoming cloud native. "Cloud native" is really thrown around; my view of it is that you leverage the cost-saving infrastructure of Amazon, with S3, or of Azure, and all of these dynamic compute resources, to build a better system. That, to me, is what cloud native is. And so S3 allows us to give a global view of Kafka, with unlimited replicas in any cluster, super cheap — I mean, it's literally two cents per gigabyte of transfer, and that's what you're paying on your S3 bill — without the need to materialize the data locally, and without overloading that cluster. So from a system performance perspective, you could literally proxy reads of petabytes and petabytes of data with a very small memory footprint on the box. So that's a big one. That, coupled with our inline WASM transforms — these one-shot transformations — which, you know, don't solve all the use cases; if you're merging a ton of streams, there are other systems, and Flink is excellent, in my opinion.
They do a better job of merging streams and doing more computation. But this one-shot transformation, giving data API guarantees, combined with our idea of shadow indexing, is, I really think, the future of streaming. It's like: read it anywhere, it's scalable, it's cheap, you can just run it. It gives you safety guarantees, it gives you correctness guarantees — we give you a linearizable Raft implementation, low latency, and high throughput for every cluster.
[00:38:38] Unknown:
If you're looking for a way to optimize your data engineering pipeline with instant query performance, look no further than Qubz. Qubz is next-generation OLAP technology built for the scale of big data, from UST Global, a renowned digital services provider. Qubz lets users and enterprises analyze data on the cloud and on-premise with blazing speed, while eliminating the complex engineering required to operationalize analytics at scale. With an emphasis on visual data engineering, and connectors for all major business intelligence tools and data sources, Qubz allows users to query OLAP cubes with sub-second response times on hundreds of billions of rows.
To learn more and sign up for a free demo, visit dataengineeringpodcast.com/qubz today. For people who are interested in exploring Red Panda and deploying it into production or a test environment, what is actually involved in getting a cluster set up, joining the nodes together, and managing the auto-tuning that you mentioned? I'm also curious to understand some of the challenges or complexities that you face in being able to automatically discover those optimal settings, and how cloud environments complicate that problem.
[00:39:58] Unknown:
By the time this podcast is out on September 22nd, we're gonna have a free download. People can just go to our website — we'll put the link in the show notes — and try it themselves; they get their own kind of private repo. The auto-tuning and discovery happen as long as you're running systemd. This auto-tuning, because it works at the hardware level, literally runs at hardware speed, which, like I mentioned, is really anticlimactic when you're trying to show it to someone who doesn't understand the system. You're like, oh, look, I did all these cool things — and it's just a bunch of terminal lines. But anyway, that happens automatically, and then Red Panda starts. Now, the exciting part of this question is configuration. The reason we chose Raft — actually, the original protocol I wrote was chain replication, but when you add node removal and node addition and a bunch of other stuff, you end up with a thing that looks exactly like Raft, except that Raft gives you a proof on configuration changes. There are two types of configuration changes in Raft. I'm only gonna talk about the one that is gonna be available on September 22nd, which is called joint consensus.
And so the reason for choosing Raft — in addition to safety, and to having this log completeness guarantee, which says there are no gaps in the log or you can't write data — was sound configuration changes. When you add a node to the cluster, you give it basically three or four IPs, just for failure recovery, and it'll ask the current nodes: hey, who is the leader of the controller in the cluster? The controller is a very low-throughput group — it's really only used when you're creating a topic or adding a node; it's basically otherwise idle. So there's one node in the cluster that does this, and it uses the Raft leader election to move around in case there is a failure. Anyway, you ask the controller, and the controller initiates the joint consensus protocol, which is exactly how you add or remove nodes. Right? So you can expand or shrink the cluster like you would expect: you start up a node, it goes through these sound configuration changes — you add it, you make sure it has the controller metadata up to date, and so on — and then it can join the cluster and start voting. That's how you add a node. Ultimately, some of the open source systems — and we're probably going to open source a large part of this later this year — have offloaded a lot of this complexity onto the users. Instead, we unloaded the complexity of node discovery and configuration changes onto the system itself, and we took on the complexity of implementing the joint consensus algorithm to add and remove nodes. So it's really, I think, quite trivial: on any node, you say, install Red Panda.
It gives you the binary, you give it the seed IPs, and then it's up and running from that point.
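For a concrete picture of the "seed IPs" setup described here, a minimal node configuration might look like the fragment below — the field names follow the general shape of a Red Panda `redpanda.yaml` of that era, but treat them as illustrative rather than authoritative:

```yaml
# Illustrative redpanda.yaml fragment -- exact schema may differ by version.
redpanda:
  data_directory: /var/lib/redpanda/data
  node_id: 2                 # unique per node
  rpc_server:
    address: 10.0.0.12
    port: 33145
  seed_servers:              # the three or four IPs used to find the controller
    - host:
        address: 10.0.0.10
        port: 33145
    - host:
        address: 10.0.0.11
        port: 33145
```

On startup, the node asks the seed servers who the controller leader is, and the controller drives the joint-consensus configuration change that admits the node.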
[00:42:45] Unknown:
Digging into the business side of things, what is your overall approach to being able to grow this platform and expand its capabilities and ensure that it's sustainable from a technological and business perspective?
[00:43:01] Unknown:
The first thing that we had to do was really be API compatible. Basically, that opened up every enterprise in the world — it's amazing how many Kafka installations there are. I expect something like 120,000 total Kafka installations to date, so the market for us is really huge. We have seen demand in the high-volume space — pushing a couple of petabytes per week, let's say four, five, six gigabytes per second, huge volumes per cluster — and in the finance sector. Right? And the reason is that the one cares about safety, and the other cares about throughput and scalability. What we haven't really addressed is the middle of the market, the majority. What's good about where we started is that an oil company is always gonna have money to process, like, IoT devices; same thing with the largest players in the analytical space, and the largest CDN and telco providers. That's the big volume.
On the finance side, hedge funds really have no price sensitivity, so that's also a good market for us. But ultimately, this won't be a massively successful product, I think, unless we give it to the masses. And so on September 22nd, we're gonna have a free download trial where anyone can just try it. It's the entire product. They can test it; they can see if what we're saying is true. Hopefully, they find bugs — it's great when your customers find faults, and you fix them and make the system better. And then, towards the later part of the year, we want to make this available for a lot of people that don't have the money to pay. What that means is probably a community license of sorts.
My current thinking on this is similar to the CockroachDB license, where if you're a cloud provider, you have to pay. If you are a vendor of data services where what you're really selling is Kafka, plus or minus some proxies, you have to pay. But for basically all other users, it's free to use. I think that's how we move forward. So that's one. And the second one is, I think there's this largely unaddressed, huge developer ecosystem — the JavaScript and Python developers, and so on — because our system is so easy to run. No one complains, honestly, about running Nginx or running Node.js; it's a single binary that you run. To some extent, I think the reason Cockroach has seen a lot of growth is that the system is really easy to run — it was one binary. So we put a lot of effort into making it super, super easy to run — no tunables, auto-discoverable — where people get to just focus on their applications.
And so I think that's a large market where we're seeing some pickup in traction: a lot of Node developers wanna do real-time fraud detection. They have the same problems that some of the Java enterprise shops have, but maybe they don't have the internal sophistication to manage a JVM system, to understand ZooKeeper, snapshots, TLS, leader election — all of these really complicated things. They just want a system that gets out of the way, and that they can interact with — streams where they can simply upload another JavaScript or WASM script to get the data API concept I've been talking about. And so I think those are how we address maybe the three parts of the market. We already target basically the Fortune 2,000 with what we have today. They care about reducing the number of nodes from 300 to 3 — this is actually a real use case, and I can talk more about it later. They care about hardware reduction, cost, performance optimization, and getting back engineering capacity. On the finance side — the very high end — they have no price sensitivity; they care about safety and data correctness guarantees.
The majority of the market — most businesses running three, five, seven nodes — will benefit from our open source. So that's the strategy. And then hopefully, towards the end of the year, we're gonna announce our cloud. Our cloud is effectively what every data vendor gives you: you offload everything to the vendor. They offer you a mutual-TLS authenticated port, you just push to this port and consume from this port, and they take care of the management. That's what we're thinking in terms of business.
[00:47:04] Unknown:
In terms of the ways that you have seen some of your initial customers using Red Panda, what are some of the most interesting or innovative or unexpected use cases that you've seen?
[00:47:14] Unknown:
We have one incredibly intriguing use case. We've only heard of it from this one customer of ours, and I had never even thought of it — and I've been doing streaming for 10 years. They are fundamentally a database company. Right? They provide some services, but at its core, they're a cloud native database company. And they're actually thinking — because of our tiered storage, this idea we call shadow indexing — that Kafka, in terms of offset management, gives you a totally addressable log. Right? From offset zero all the way up to uint64 max, you can address every single record in a Kafka log. If you couple that with our shadow indexing idea — which means any cluster, for any topic, can go and read any batch in the history of the log, and we manage the data hierarchy, what is materialized on disk versus what's uploaded to durable storage like S3 — then what they're using us for is that they actually upload a megabyte of data for every record batch. Every record is a megabyte blob in their own format — we don't really know what format it is; they just send a megabyte. When they receive queries from their users — let's say somebody connects Tableau or something like that and runs a query — their query processors actually fetch arbitrary offsets in the lineage of this totally ordered log to fetch blobs, in a way that gives them hierarchical storage, with low latency and high throughput for their streaming writes. So data that's hot is gonna come back hot, and data that's cold is still gonna come back relatively fast, because we can fetch it from S3, decode it really fast, and return it to the user. So they get this totally addressable log with automatic data hierarchy management. That's probably the most technically interesting use case.
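A sketch of how a shadow index can make the log totally addressable — the segment names and base offsets below are hypothetical; the point is that kilobytes of metadata suffice to map any logical offset to the object that holds it:

```python
import bisect

# Hypothetical shadow-index metadata: for each uploaded segment, the base
# offset it starts at and the object key it lives under in S3.
SHADOW_INDEX = [
    (0,     "topic/0/segment-00000000.log"),
    (1_000, "topic/0/segment-00001000.log"),
    (2_500, "topic/0/segment-00002500.log"),
]

def locate(offset: int) -> str:
    """Map a logical offset in the totally addressable log to the object
    holding it: binary-search for the last segment whose base <= offset."""
    bases = [base for base, _ in SHADOW_INDEX]
    i = bisect.bisect_right(bases, offset) - 1
    if i < 0:
        raise KeyError(f"offset {offset} predates the log")
    return SHADOW_INDEX[i][1]

# A cold read of offset 1_700 fetches segment-00001000 from S3 and scans
# forward; a hot read of a recent offset is served straight from local disk.
```

The broker, not the client, decides whether a given offset is served from local disk or materialized from object storage — which is the automatic data hierarchy management described above.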
And for a business use case: we are talking to a cancer lab, and they basically wanna do DNA sequencing — because we can manage such large volumes of data with such a small amount of hardware, they want to retest cancer patients before they leave the wet lab. So, you know, you spit in a tube or urinate in a cup, and they analyze the sample and do some sequencing on it. If they push that data to Red Panda and run something like SparkML on it, they can crunch through gigabytes of data per second. And you can just tell your patient: hey, can you wait in the waiting room for the next two hours while I go through, let's say, a terabyte of data — and then you have the results. Being able to test a patient before they leave is maybe the most interesting business use case. So I think those are the two wildest use cases that we've seen with Red Panda.
[00:50:06] Unknown:
From talking to you and learning more about the system, Red Panda is definitely a very interesting and impressive piece of technology and something that I'm personally looking forward to experimenting with. But what are the cases where it's the wrong choice?
[00:50:19] Unknown:
Yeah. We don't support transactions yet. From customers' experience — the cool thing about being in the data space is that it really doesn't matter what our claims are or what our competitors' claims are; ultimately, the clients will tell you what worked and what didn't — a lot of them have scalability limitations running the current implementation of Kafka transactions. And I would say it's probably less than 10% of people that are running Kafka transactions. So we're not a good fit if you require Kafka transactions. Currently, we only solve the base API, which is deceptively simple to say — remember that Kafka has over 160 APIs, with versioning and so on.
So we cover the majority of them, but we don't cover transactions. We are basically thinking through how to improve the throughput of transactions. Right? Transactions are really not for low latency; they're for data safety. What people care about is pushing a bunch of transactions per second — they're not trying to optimize the last 500 microseconds or the last 200 microseconds. That's not what users are using transactions for, upstream. We did write a draft for transactions, but we didn't like the performance of it, or the design trade-offs, so that's still very much in active research — it's probably going to land in, I would say, Q1 next year. If Kafka transactions are fundamental to how you're building your use case, we're not a good fit there. For the other types of systems, I think we're really the future of streaming, with our coprocessors, with the WASM engine, and because we also give you the entire ecosystem. Right? We have the schema registry. We have an HTTP proxy.
We give you API compatibility, and we give you these two things — coprocessors and shadow indexing — which are the future of streaming, leveraging cloud native technologies. But if you're definitely in the transaction space, I think currently Kafka is probably your only option.
[00:52:22] Unknown:
Are there any other aspects of the overall streaming ecosystem or the Kafka space or the work that you're doing at Vectorized and Red Panda that we didn't discuss that you'd like to cover before we close out the show or any particular ways that the audience can engage with you to help you drive this product forward?
[00:52:39] Unknown:
Yeah, for sure. I really encourage people to try it. It's gonna be free to download, free to run — there are gonna be no limits right now. So download it, try it, and let us know what you think. We'd just love to see how we can improve. That's really the first thing. Ultimately, all of our features have come from users. When you start a company, you kinda have this grand idea of how things are gonna work out. For me, it was: I wanna build the fastest storage engine, I'm gonna solve all the problems. And people just wanted, you know, a better Kafka.
Now we're building the future with coprocessors and so on. I really think, especially if you are in the JavaScript community or the Python community, and you have always wanted to try out and build real-time streaming pipelines in a way that's easy for you to manage — it's a single binary — we really are calling for this community to help us create the system that they wish they had. We think that the ability to upload these coprocessors, giving these data API guarantees, is gonna be super useful for, effectively, TensorFlow shops, where you push data to a topic. They really are not latency sensitive or throughput sensitive; they just want convenience. And so this one binary is, I think, addressing the need for a simple system that gives them data API guarantees.
So I think that's really the future. And later this year, a big part of the product — probably the majority — is going to be open source and available for people to hack on. It is written in C++20, with coroutines and futures, so probably not a lot of contributors — but if you are looking to hack around this, stay tuned. The best way to reach out to me is Twitter, at vectorizedio, or my personal Twitter, emaxerrno — the largest errno in the Linux kernel. Feel free to reach out, and we're happy to start the conversation. One of the things that we're most excited about: I'm a Latino founder, and there aren't that many Black, Latino, and female engineers working on hard distributed systems.
We have created a scholarship called Hack the Planet, which we're super psyched about. If there are any listeners that are underrepresented in tech, definitely reach out — the application process is open. This came about, honestly, recently, through all of the recent events, which are very sad. We wanted to do something, and I think the best thing, in addition to donating to the organizations that we donate to, is really to donate our time. We have experts that have run some of the largest systems in the world — at Microsoft, at Red Hat, at Akamai — and we can have a significant impact in your life if you're listening to this podcast.
You get to keep the entire IP. You get to work on the project that you want — we don't want any of the IP. We wanna give you money to hack on something that is hard, that is in distributed systems, as long as you're part of this, and you get the mentoring. You get to talk to me, the CEO, once a month about anything, whether it's starting a business or running distributed systems. And you also get mentorship from all of our senior architects — Dennis, Noah, Meha, Ben. Mostly everyone that I work with has been in this business for, like, 10, 15, 20 years. We really hope that you take advantage of it. Definitely hit up vectorized.io/scholarship.
We want to help you. Yeah. And we'd love to be the change that we want to see today.
[00:56:09] Unknown:
For anybody who does want to get in touch or reach out about any of those things, I'll have your contact information in the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:56:23] Unknown:
It's a really fun question, to always look at the future. You know, the biggest piece missing to me is tooling that gives deep guarantees about the data. We have an idea and we have an opinion, which is our coprocessors, but, ultimately, people will decide if that's useful. You know? I think that the way we advance is we ask some of our users, we propose a system, we build a system that we think is the future. There are other alternatives that I can think of, but I think tooling that gives deep introspective guarantees about the data. So 1 example: whether you're a database or a streaming system, it ultimately doesn't matter. What we have to move away from is kind of this low bits level, like Vectorized. You know, that's for us, for very few people to really spend all of their time and expertise focusing on moving data around at the bits level. But from a business perspective, what I think is missing is, you know, building systems that give users higher guarantees. 1 example is if you put data on any system, that system guarantees data provenance, which is that only the right people have access to the data through the entire transformation of the data. Whether that system goes through Red Panda and moves through HBase and then back, it doesn't matter what the pipeline is. But, like, that kind of data provenance, you know, just keeping all the permissions around all of the systems, something that manages that, I think, is missing.
The second 1, I would say, is the ability to give users higher level guarantees. Like, as long as you put JSON in here, we'll make sure that no private information is leaked. But those things are guaranteed by the storage system. In a way, I really think that GDPR compliance is a contract that databases have to adhere to, or fundamentally, the storage system. Like, we have to be able to map some of these more complicated business metrics and goals that they're trying to achieve, which is, like, you know, protect the users' data and so on, and map it onto the storage systems themselves that own the actual data, you know, and be able to expose it to developers in a way that is easy to consume. I think that there's a lot of tooling that is missing there.
[00:58:30] Unknown:
Well, I appreciate you taking the time today to join me and discuss all the work that you've been doing with Red Panda and Vectorized and exploring and expanding the streaming space. It's definitely a very interesting product and 1 that I plan to take a closer look at myself. So I appreciate all the time and energy you've put into that, and hope you enjoy the rest of your day. Thanks, Tobias. Talk to you soon. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Alexander Gallego
Alexander's Background and Journey
Red Panda: A Kafka Replacement
Operational Simplicity and Auto Tuning
Red Panda's Implementation and Compatibility
Future of Streaming and Kafka Ecosystem
Coprocessors and Data APIs
Innovations and Shadow Indexing
Setting Up Red Panda
Business Strategy and Market Approach
Interesting Use Cases
Limitations and Future Plans
Engaging with the Community
Biggest Gaps in Data Management Tooling
Closing Remarks