Summary
Transactions are a necessary feature for ensuring that a set of actions are all performed as a single unit of work. In streaming systems this is necessary to ensure that a set of messages or transformations are all executed together across different queues. In this episode Denis Rystsov explains how he added support for transactions to the Redpanda streaming engine. He discusses the use cases for transactions, the different strategies, semantics, and guarantees that they might need to support, and how his implementation ended up improving the performance of bulk write operations. This is an interesting deep dive into the internals of a high performance streaming engine and the details that are involved in building distributed systems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Your host is Tobias Macey and today I’m interviewing Denis Rystsov about implementing transactions in the RedPanda streaming engine
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you quickly recap what RedPanda is and the goals of the project?
- What are the use cases for transactions in a pub/sub messaging system?
- What are the elements of streaming systems that make atomic transactions a complex problem?
- What was the motivation for starting down the path of adding transactions to the RedPanda engine?
- How did the constraint of supporting the Kafka API influence your implementation strategy for transaction semantics?
- Can you talk through the details of how you ended up implementing transactions in RedPanda?
- What are some of the roadblocks and complexities that you encountered while working through the implementation?
- How did you approach the validation and verification of the transactions?
- What other features or capabilities are you planning to work on next?
- What are the most interesting, innovative, or unexpected ways that you have seen transactions in RedPanda used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on transactions for RedPanda?
- When are transactions the wrong choice?
- What do you have planned for the future of transaction support in RedPanda?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Vectorized
- RedPanda
- RedPanda Transactions Post
- Yandex
- Cassandra
- MongoDB
- Riak
- Cosmos DB
- Jepsen
- Testing Shared Memories paper
- Journal of Systems Research
- Kafka
- Pulsar
- Seastar Framework
- CockroachDB
- TiDB
- Calvin Paper
- Polyjuice Paper
- Parallel Commit
- Chaos Testing
- Matchmaker Paxos Algorithm
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Denis Rystsov about implementing transactions in the Redpanda streaming engine. So, Denis, can you start by introducing yourself? Thank you for inviting me, Tobias.
[00:02:06] Unknown:
It's wonderful to be here. I listened to a lot of episodes before and never imagined that someday I'd be talking about my own work here. So, yeah, I'm Denis. Currently, I'm working at Vectorized, but before that I was at Amazon, and at Yandex back in Russia.
[00:02:23] Unknown:
Yeah. But let's just dive in and discuss the topic. And do you remember how you first got involved in data management?
[00:02:30] Unknown:
Oh, yeah. Sure. So I think it was in 2010. I had just graduated from university with a degree in applied math and computer science. And for listeners who are new to the industry, around that time there was a boom of big data and NoSQL solutions. My first job was at a tiny startup financed by triple-F funding: friends, family, and fools. And there I was investigating the differences between these NoSQL solutions. I think I was looking into Cassandra, MongoDB, and Riak. Eventually, we just stayed with Postgres for that particular project.
Yeah. So this is how it started. And shortly after that, I joined Yandex. Again, for people who don't know, it's a search engine which is more popular than Google in Russia. And the fun fact is that more countries have space programs than have a local search engine. And just like at Google, there is a lot of data management and data transfer there. I was leaning towards big data at that time and was working on infrastructure related to systems similar to Hadoop. It wasn't Hadoop, it was an internal thing with a focus on privacy, but it was basically the same idea. So that was the first time I used Apache Kafka. It was the very early days, probably the 0.7 or 0.8 version.
And, yeah, that's how I started. Then, with all of this, I migrated to the United States. I was working on Cosmos DB at Microsoft and was mostly responsible for consistency testing. It was something similar to what Kyle Kingsbury does with Jepsen, but it was internal and tailored to that particular system. The work itself was based on the Testing Shared Memories paper, and it's interesting that large-scale distributed systems and modern CPUs are very similar in their internals. With distributed systems, you have local nodes with fast data access and remote nodes with slower access. And exactly the same thing is happening inside of CPUs. You have multiple cores, each core has its own cache, and it's fast to access data from that cache, but it's slower to access data from a different core's cache or from memory.
It's all very similar, and people in the CPU world had this advantage of time, so they started working on these problems earlier and figured out how to test consistency. Because when you work in this environment, it's tempting to just keep a local copy of the data you're working with and modify that copy instead of going the slower path. But when you do this, your replicas of the data may get out of sync. So it's important to make sure that the data you see is up to date. People figured out how to write such systems and then started thinking about how to test them. And this paper describes that if you have some restrictions on the data operations, basically if you do all the write operations with a compare-and-set, then you can do this consistency testing in linear time. And it was kind of a revelation to me, because I had always assumed it was an NP-complete problem, so you couldn't test long histories.
And it was very important for Cosmos DB, because it's a large system, and some of the operations which potentially may affect the consistency of the database take a lot of time, so testing for just a minute wasn't enough to make sure that everything is correct. So when I looked at this paper, I realized that's what we needed and just applied this research to another domain. I think there is a legend about Steve Jobs and Bill Gates: they first observed the research on graphical user interfaces at Xerox PARC and then made commercial software based on it. And I think we should do the same in data management systems, because there's a huge amount of research happening right now, and there are a lot of things we can get inspired by, or just implement as-is and base our systems on, to get an advantage.
So I think there are a lot of opportunities, Tobias. And for now, I'm volunteering at the JSys academic journal, the Journal of Systems Research with diamond open access, and I helped organize the Hydra conference on concurrent and parallel computing, and I work at Vectorized on
[00:07:48] Unknown:
the Redpanda project. Yeah. It's definitely interesting to see how research in one domain can be applied in other areas without the original researchers having any conception of the potential applications or even really any knowledge of those other domains where it ends up getting used. It's definitely cool that you're able to take some of the research in, you know, CPU architectures and how to optimize that and apply it to these distributed systems problems. And then so in terms of Redpanda and the work that you're doing at Vectorized, I did have an interview a little while ago with Alexander, but I'm wondering if you can just quickly, for people who didn't listen to that, give a bit of an overview about what it is that you're building at Vectorized with the Redpanda project and some of the goals and maybe some of the ways that the project has changed or evolved in the past year. Definitely.
[00:08:36] Unknown:
So Redpanda is part of this category of streaming solutions. And when we talk about streaming in general, it solves the problem of transferring data from one system to another and processing it along the way. For example, when I said that we were using Kafka at Yandex a long time ago, it was used to ship logs from worker nodes to that internal MapReduce-like system. And there are a lot of other places where we need to connect different systems, and streaming solutions are really good at this. I've seen multiple pieces of advice to start with a database as the foundation for a message bus and then replace it with a specialized solution like Kafka, Redpanda, or Apache Pulsar when you start growing. Because using a database is kind of okay at small scale, but a database does way more than these streaming solutions, and when it does more, you eventually need to pay more. And I'm not talking about money, it's more about performance, latency, and these things. So if you try to measure the latency of a database and then compare it with any of the streaming solutions, you'll be surprised. At least I was surprised when I saw this data. With the database, there was more variance in the data and the chart looked like a jigsaw, while with streaming solutions it's just a straight line, and different streaming systems are only a little bit different in performance.
But, yeah, if we compare with a database, it's like comparing cars that you use to drive in the city with these jet-propelled monsters that they test on the salt flats. So, of course, it's not suitable for all of the use cases, but the things it's optimized for, it does really well. And Redpanda is one such system. I describe it as an alternative implementation of the Kafka protocol, but focused on low latency. And now, going into the future, it's growing as we add more features. One of the known things that we're working on is Wasm transformation, but there's much more. I wish I could talk about this, but I can't, because there are people who will be mad at me if I share the roadmap.
So Redpanda is kind of a reimplementation of the protocol, and compared to Apache Kafka, we use a different language: it's written in C++ versus Java. It uses a different architecture: it's based on the Seastar framework and it uses a thread-per-core approach. And this kind of rhymes with what we were talking about before, about CPUs and the distributed world. So even inside of this big distributed system, such as Redpanda, when we work at the lowest level, we try to stick with one core and not move data around, to get these performance gains at all of these layers. And, also, we use different protocols.
So we use Raft for replication. We use various transaction optimizations, including parallel commits for transactions. And these things that we're using just didn't exist at the time Kafka started as a project. It's all recent developments in this area, and we are looking around and thinking, okay, what can we do better to make our customers happy, and focusing on those pain points. One of the things that we are working on is Wasm transformation. I mentioned that streaming solutions are about moving data and transforming it along the way. So Wasm is aimed at these transformations, and transactions, which I was working on implementing over the last year, also aim to help with this idea of transformation.
[00:13:01] Unknown:
Yeah. So this is what Redpanda is right now. I think it's funny that the marketing people don't want you talking about the roadmap when it's usually the marketing people who try to promise new features that the engineers haven't built yet.
[00:13:14] Unknown:
Oh, I can promise new features for the next 10 years. My backlog is just too big.
[00:13:23] Unknown:
And so now digging into the recent work that you've been doing with adding transactional support to the Redpanda engine, I'm wondering if you can just talk through some of the primary use cases where that is beneficial in a streaming context for a pub/sub messaging system to be able to have these atomic transactions.
[00:13:43] Unknown:
Yeah. A streaming system is about moving and transforming data. When we look at this at a little bit lower level, it's like having a large set of lists in your system, where you can add messages to these lists, where you can query a list by index, and where you can iterate over them. And transformation-wise, you want to take messages from one list, do some modification, and then write them to another list. And before transactions, there were several problems with this. Basically, either you're okay with having duplicates in the system because you want to process all of the data that you have, or you're okay with losing some information in the system because you don't want to have duplicates.
And all of these anomalies, like data loss or data duplication, come from the fact that when we take a message from the queue, a fault may happen with the current system. And then when you start again, you're kind of at a crossroads: what do you want to do? Either you assume that the last message was processed and inserted, or you assume that it wasn't, and you process it again and insert the message into the system. But if, in the first case, the message actually wasn't inserted and you decided that it was, you have data loss. And in the second case, if the original message actually was processed and inserted, and then you reprocess and insert it again, you have data duplication. And working with these anomalies is really hard for programmers, because we need to think not only about the essential complexity of the problem we are working on and code the transformation, but also about this accidental complexity of having anomalies in the system. And it's not what people are trained for at university or in boot camps. We're used to the idea that if we have a variable and we change it, it has the changed value; nobody comes and plays with it while we are not watching. And transactions help to solve these anomalies. Well, actually, they don't solve them, they prevent them. Transactions help make this transformation step (taking a message from the source topic, processing it, and putting it in the target topic) atomic, so we don't need to think about these anomalies. If something goes wrong with the processor, it will be restarted by some orchestration program on top of it, and there is a guarantee that it will start where it left off and avoid all of these anomalies.
So this is the major use case for transactions.
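To make that concrete, here is a minimal sketch of that consume-transform-produce loop against the Kafka transactional API, which Redpanda implements, using the confluent-kafka Python client. The broker address, topic names, group id, transactional id, and the transform function are placeholder assumptions, and error handling is trimmed to the essentials.

```python
from confluent_kafka import Consumer, KafkaException, Producer

# All names below (brokers, topics, group id, transactional id) are placeholders.
BROKERS = "localhost:9092"

def transform(payload: bytes) -> bytes:
    # Stand-in for the real per-message transformation.
    return payload.upper()

consumer = Consumer({
    "bootstrap.servers": BROKERS,
    "group.id": "enricher",
    "isolation.level": "read_committed",   # only see data from committed transactions
    "enable.auto.commit": False,           # progress is committed inside the transaction
    "auto.offset.reset": "earliest",
})
producer = Producer({
    "bootstrap.servers": BROKERS,
    "transactional.id": "enricher-instance-1",  # names this processor instance
})

consumer.subscribe(["source-topic"])
producer.init_transactions()

while True:
    msgs = consumer.consume(num_messages=100, timeout=1.0)
    if not msgs:
        continue
    producer.begin_transaction()
    try:
        for msg in msgs:
            if msg.error():
                raise KafkaException(msg.error())
            producer.produce("target-topic", value=transform(msg.value()))
        # The consumed offsets ride in the same transaction as the produced
        # messages, so the read, the writes, and the progress marker commit
        # (or abort) together.
        producer.send_offsets_to_transaction(
            consumer.position(consumer.assignment()),
            consumer.consumer_group_metadata(),
        )
        producer.commit_transaction()
    except Exception:
        # Nothing from this batch, offsets included, was committed, so the
        # messages are redelivered after a restart or rebalance.
        producer.abort_transaction()
        raise
```

The important call is send_offsets_to_transaction: the consumer's progress is committed inside the same transaction as the produced messages, so the whole step either happens or doesn't, which is exactly the guarantee described above.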
[00:16:44] Unknown:
Because of the fact that you're working with a distributed system, you have to deal with the transactions being distributed as well, which adds a lot of complexity to it. And I'm wondering if you can talk through some of the sort of main challenges that exist in trying to apply distributed transaction semantics, particularly in a streaming context.
[00:17:04] Unknown:
The first problem is it's a distributed system, so we need to think about failures of different nodes. But this problem was kind of solved before: we have wonderful distributed databases like CockroachDB, TiDB, and MongoDB, and they all have transactions built in. But compare the transactions of OLTP databases with Kafka transactions and Redpanda transactions. OLTP transactions tend to be small: we have this website or system that we're working on, the user interacts with it and sends a request to the server, the server does its work, executes it within a transaction, and sends a response back, and these tend to be small. And some of the transactional systems even require you to provide the set of keys that you are going to modify before you start the transaction. The major example here is the Calvin paper, which is a well-known variant of this approach. It has amazing throughput, but it has this limitation of needing to know the data that you're going to modify in advance. Another example is Ocean Vista.
It uses a similar approach to Calvin, and there you also need to know these keys in advance. And with streaming, that's not true. With streaming transactions, we're talking about moving a lot of data, and we don't know what we're processing in advance; it's also at scale, at very high velocity. So it's a challenge to pick, out of all of the transaction protocols that are out there, the one which fits the streaming use case. I think scale was the major problem.
[00:18:59] Unknown:
And then as far as the actual impetus for starting the work of adding transactions to Red Panda, I'm wondering if there were any particular product requirements or customer needs that you were trying to fulfill or if it was just something that you were trying to gain parity with the Kafka APIs or just sort of what it is that set you down the path of actually adding transactional
[00:19:21] Unknown:
semantics to the Redpanda engine? You know, it was a customer need first. Even though it was part of the Kafka protocol, if nobody was using these features it wasn't really that important to support them at first. But we had this customer pressure: they wanted to use different tools which depend on transactional processing. So we were kind of forced to work on this area. In terms of the actual
[00:19:51] Unknown:
Kafka API, I know that the upstream engine does have support for transactions built into it, at least in some of the recent releases. And I'm wondering how that influenced your approach to how the actual API for the transactions was designed, and then, if you were trying to map the upstream API, how that constraint influenced your approach to actually doing the work of creating the transaction semantics within the Redpanda engine? I think it was a blessing for me.
[00:20:23] Unknown:
It's really hard to work on something when you don't have restrictions and you can just keep reworking the architecture without any constraints. You can always see, okay, I can modify this API and then it will get even faster, and then you're working and then you might realize, oh, okay, I can change this too. And it's really hard to work that way, at least for me. When I have restrictions, it kind of boosts my creativity and imagination, but it also keeps me attached to the ground. And if that wasn't the case, the domain of approaches I could choose from, it's not infinite, but it's quite a big one. When we talk about transactions, the first choice is: is it based on two-phase locking? So basically pessimistic locking, where we lock the data sources or the records that we're operating on in advance.
So we start the transaction, then we do the work, and then we commit. And we may lock the data, or we may take an optimistic approach. Both have pros and cons. And recently there was a very interesting paper called Polyjuice, which is about combining these approaches and building a system which picks its locking mechanism based on the workload it currently experiences. But if you don't build such a sophisticated model, you need to make this choice: either it's pessimistic or it's optimistic. And if it's pessimistic, you then need to work on a deadlock prevention mechanism, and there are, again, several branches which you can choose.
And with timestamp ordering, there are also choices about which road to go: it can be multi-version concurrency control, or maybe optimistic concurrency control. Or we can say all of these approaches have been known since the nineties, we want to try something new, and then we can go and try Calvin, we can go and try Granola, or we can go and say, okay, we think this isolation level is not important for our customers, let's choose the weakest one. Well, maybe not the weakest one, let's increase it just a little bit, and then we can use RAMP transactions. The scope of this work is just so big to choose from that without restrictions it's impossible. So I used these restrictions to guide my focus on what we could use for Redpanda.
[00:23:02] Unknown:
Struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world's first end-to-end, fully automated data observability platform. In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing the time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today. Visit dataengineeringpodcast.com/impact today to save your spot at Impact, the Data Observability Summit, a half-day virtual event featuring the first US Chief Data Scientist, the founder of the data mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data.
The first 50 people who RSVP will be entered to win an Oculus Quest 2. And so in terms of the isolation level and the capabilities and sort of the requirements there, I'm wondering if you can just talk through some of the actual technical implementation and some of the design decisions that you did make as far as how those transactions are manifested in Redpanda?
[00:24:25] Unknown:
So transactions in Redpanda, if we try to compare them with the database world, are read committed. You get an acknowledgment that your data was applied to the system before it actually has been replicated to all of the nodes. It doesn't mean that it's an eventually consistent system, which is just hard to work with; it's more like strong eventual consistency. You know that eventually your transaction will be applied, but if you go and try to read this data immediately after you get the confirmation, there is a risk of getting slightly stale data. And if we talk about the implementation of all of this, the optimizations that we did in Redpanda are based on two approaches.
The first one is at commit time. Well, it's not the first step, it's the last one. At the end of the transaction we commit, and here we use the parallel commit optimization, which was pioneered at CockroachDB. The basic idea is to reduce latency. Usually, when people talk about transactional commit in a distributed system, the first thing that pops up is the two-phase commit protocol, and the name gives you a hint that there are two rounds of communication. First, you need to go to each of your nodes and say: please persist the state that you have, and validate that you can commit. And if it can commit, each node makes this decision persistent on disk, and then you get the confirmation from everybody. Or not necessarily on disk: it can be replicated using the Paxos or Raft protocols. Either way, you get a confirmation from everybody.
And then, once everybody has said that it is okay to proceed with this transaction, you mark the transaction as committed on the master record representing this transactional object. And after this, you communicate again with everybody with the instruction: please commit the data that you reserved. This last part, with Redpanda and with Kafka, because of the isolation model that we chose, happens after the acknowledgment of the transaction, so we don't count it as a latency impact. But there are still two things which happen before we acknowledge the request.
First, this prepare communication; second, updating the master record. And what parallel commits say is: okay, let's do all of this in parallel. So the prepare message carries a bit of extra information: please confirm, do your work, and, by the way, here are all of the other participants in this transaction, or there might be a reference back to the master transactional record. And when this one phase is executed, we send an acknowledgment to the client, because we know that everybody decided to participate in the transaction. But if somebody else observes this incomplete record, then it's going to be that somebody else's responsibility to go and clean it up. Usually it's the same transaction coordinator in the end, but it reduces the number of network round trips and makes things faster. Though I don't think the major benefit is reducing the network communication, since network communication is still pretty fast these days; what is really important is reducing the number of sequential fsyncs to disk, and that is a major point for latency in data management systems.
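As an illustration of the round structure being described, here is a toy asyncio sketch contrasting the two sequential persistence rounds of classic two-phase commit with the single round of a parallel commit, where each prepare names the other participants (or points back at the staged record) so an observer can later finish or abort an incomplete commit on its own. This is purely a sketch under those assumptions, not Redpanda's actual code, and all names are made up.

```python
import asyncio
import random

async def rpc(target: str, action: str) -> str:
    """Stand-in for one replicated, fsynced request to a participant."""
    await asyncio.sleep(random.uniform(0.01, 0.05))  # network plus fsync latency
    return f"{target}:{action}:ok"

async def two_phase_commit(partitions: list[str]) -> str:
    # Round 1: every participant durably records its prepare and votes yes.
    await asyncio.gather(*(rpc(p, "prepare") for p in partitions))
    # Round 2: only after all votes does the coordinator durably mark COMMITTED,
    # and only then can the client be acknowledged.
    await rpc("tx-coordinator", "mark-committed")
    return "acknowledged after two sequential persistence rounds"

async def parallel_commit(partitions: list[str]) -> str:
    # Single round: the prepares and the coordinator's staged commit record are
    # written concurrently. Each prepare lists the full participant set, so a
    # reader who finds the record incomplete can resolve it without the client.
    await asyncio.gather(
        rpc("tx-coordinator", "stage-commit"),
        *(rpc(p, "prepare") for p in partitions),
    )
    return "acknowledged after one persistence round"

async def main() -> None:
    parts = ["topic-a/0", "topic-b/3"]
    print(await two_phase_commit(parts))
    print(await parallel_commit(parts))

asyncio.run(main())
```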
So that is the commit-time optimization, the last one we do in processing. And the first optimization we did: I mentioned that Kafka and Redpanda are about moving data, and when we're processing a transaction we need to process a lot of data, and we can't keep all of this information in memory.
So while we're processing this information, we store it to disk, and we can store a gigabyte; it doesn't matter how much data we want to process in one transaction, we need to save all of it to disk. And because of the network, we can't just say: this is my huge chunk of data, please take it from here, save it, and give me a confirmation. From the client side, we send this data in small TCP packets and process them in sequence, and we store each chunk of this huge piece of data sequentially, and we want to store it reliably. So we use fsyncs to get the confirmation that everything is secured.
And once we've saved all of this data, we finally go and commit. So what we did in Redpanda: we decided, okay, let's do this optimization, let's treat the transaction as a unit of work and optimize it as a whole instead of optimizing each individual record. So what we're doing is saving the data without waiting for confirmation from the disk, kind of optimistically. But then, when we're about to commit, we make sure that it was fully persisted to disk and fully replicated to all of the replicas. And if it wasn't, which does happen but is a rare situation,
then we just abort the transaction, and the user kind of expects that transactions might be aborted because of some internal things happening in the cluster. So it's no surprise to them, and they can just process this information again. But the gains that we get from it are really big, because if we want to persist N messages, then otherwise we would need to fsync N messages sequentially (it's not exactly N, I'm simplifying). And now we are not doing this; we just let the system figure out at its own pace when it needs to save this data, and in the end we check that it was saved, and it gives us a huge boost in performance.
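A minimal illustration of the shape of that optimization, in plain Python rather than Redpanda's C++ internals: one writer fsyncs every batch, the other writes optimistically and performs a single durability check when the transaction commits. File paths and batch handling are simplified assumptions.

```python
import os

def write_with_per_batch_fsync(path: str, batches: list[bytes]) -> None:
    """N batches lead to roughly N sequential fsyncs; the latency adds up per record."""
    with open(path, "ab") as f:
        for batch in batches:
            f.write(batch)
            f.flush()
            os.fsync(f.fileno())

def write_with_commit_time_check(path: str, batches: list[bytes]) -> None:
    """Write optimistically, then verify durability once at commit time.

    In the real engine the commit-time check also covers replication, and if it
    fails the whole transaction is simply aborted and the client retries.
    """
    with open(path, "ab") as f:
        for batch in batches:
            f.write(batch)          # let the system persist at its own pace
        f.flush()
        os.fsync(f.fileno())        # one durability check when committing
```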
So I think these are the two major optimizations that we did for transactions.
[00:31:02] Unknown:
Beyond just the constraints of needing to adhere to the Kafka API for transaction support, what were the other priorities that you were orienting around to determine how exactly to implement these transactions, and how did that sort of inform the work that you were doing? And wondering also sort of what types of roadblocks or challenges those constraints might have added to the overall effort? When I was kind of designing and working on this, I wanted
[00:31:33] Unknown:
to have some result fast. So I came up with several optimizations and I was focusing on them. And also, I wanted to have it done faster, but usually when you do something, you think, oh, okay, this is kind of suboptimal, I will come back to this part of the system later, but later usually never comes. So I was trying to achieve the maximum performance I could and then iterate on that, rather than copying what Kafka does and then optimizing it
[00:32:09] Unknown:
later. I understand that from when you first started on this till you have delivered it in its current form, it ended up taking on the order of about a year. And so I'm curious sort of how you were able to break up the steps so that you were able to continue shipping Red Panda, but you were also able to do your work without it either sort of having to deal with constantly rebasing the upstream changes against your branch and dealing with all those conflicts and just the overall sort of, like, software development methodology that you use to be able to deliver such a sort of substantial feature over such a long period of time without having it affect other work that was ongoing?
[00:32:45] Unknown:
There are multiple layers to this question. But first, I want to comment on this one year. Yeah, it was one year, but at the same time, it was close to 10 years, because, well, I mentioned that I was working at Yandex and I was touching Kafka at that time. And, of course, it didn't have transactions implemented in the system itself. And one of the tasks that I got at Yandex was to build a transactional system on top of Kafka to process these logs automatically. And, yeah, I was fresh out of college and I successfully failed at this part. But then I decided that, okay, I will overcompensate for this and dive into research. And now, 10 years later, I've finally succeeded at this part of implementing transactions on top of this Kafka thing. So I can finally make my old managers happy.
But you were asking about how to work on these big projects without affecting the other areas. And part of it is that I just dedicated some part of my time to working on the transactions and some other part of my time to working on something else. So that's the answer to how it didn't affect the other work that I was doing. But there is another layer to this question: when you're working on something, the surface area might be a big feature, and it may touch different areas inside of Redpanda and may affect the other operations.
But it wasn't this way in our case, because with all of these parts of transactions in the Kafka protocol, the majority of the complexity lives in the transaction coordinator. And the transaction coordinator is a completely separate topic; it's a separate entity in the system. Most of the work I've done wasn't mixing with the other development processes inside of Redpanda. And for the small part which does, yeah, we can use tests to validate that nothing is affected.
[00:34:57] Unknown:
And then as far as the actual transactional support that's in there now, I'm wondering how you handled validation of the transactions
[00:35:05] Unknown:
as you were iterating on the problem and how you decided when you were complete. Yeah. I can't say that it's complete now, because right now transactions in Redpanda, I don't want to be wrong, but I think they correspond to the Kafka 2.4 level. So there are still some KIPs, Kafka Improvement Proposals, that we need to support to achieve the complete, 100% feature. But, yeah, early on I was using unit tests to test what I was working on. And also we have chaos testing that we use. Partially it's based on the work I had previously done at Microsoft, so it's all connected.
And as part of this chaos testing, we use iptables to block network communication to see how it affects the system. Also, we restart some supporting services. Instead of a graceful shutdown, we use kill -9 to terminate the process. Also, we have a FUSE file system to inject errors at the file system level, and we also introduce delays there to see how it affects the performance and behavior of Redpanda. What I really want to continue doing with this FUSE project is to cache all of the writes as they go into the system, and when there is an error during fsync, just to forget all of this data to simulate a complete power loss.
It's hard to achieve this in a lightweight way, because it's not enough to just terminate the application: the operating system can finish all of the pending operations and you don't break things. And when you want to break things, it kind of gets in the way of doing it. Also, what helped us is the very basic isolation level of the transactions. It's read committed: it has some guarantees, but you don't need to think about the interaction with other transactions. If it was something stronger, you would need to use tools like Elle to check the levels of visibility, and that's a hard problem.
But gladly, the guarantees are pretty low, and we don't need to overcomplicate the testing approach.
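For readers unfamiliar with this style of fault injection, the sketch below shows the general shape of such a harness: partitioning a node with iptables and terminating a process without a graceful shutdown. It is only an illustration of the techniques mentioned above, not the actual Redpanda chaos-testing code, and the peer address and process id are placeholders you would supply.

```python
import subprocess

def isolate(peer_ip: str) -> None:
    """Simulate a network partition by dropping all traffic to and from one peer."""
    subprocess.run(["iptables", "-A", "INPUT", "-s", peer_ip, "-j", "DROP"], check=True)
    subprocess.run(["iptables", "-A", "OUTPUT", "-d", peer_ip, "-j", "DROP"], check=True)

def heal(peer_ip: str) -> None:
    """Remove the partition rules added by isolate()."""
    subprocess.run(["iptables", "-D", "INPUT", "-s", peer_ip, "-j", "DROP"], check=True)
    subprocess.run(["iptables", "-D", "OUTPUT", "-d", peer_ip, "-j", "DROP"], check=True)

def hard_kill(pid: int) -> None:
    """kill -9: the process gets no chance at a graceful shutdown, so crash
    recovery paths are exercised on the next start."""
    subprocess.run(["kill", "-9", str(pid)], check=True)
```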
[00:37:37] Unknown:
In terms of the actual usage of the transaction capabilities, what are some of the limitations as far as either the, you know, total number of records that you're able to process within transaction boundaries, the number of operations before you sort of explode the level of complexity beyond what can actually be reasonably done within a transaction or, you know, the sort of interfaces that are available such as being able to execute a transaction from within a WASM transform?
[00:38:04] Unknown:
The problem with the transactions comes from the additional operations that we need to do. If you want to have a linearizable system and you're okay with having duplicates in the system, it's faster to go without transactions, because for each transaction you write two additional message batches into the system: a prepare and a commit. So if you process just a single record, when you use transactions that single record becomes three records, so it becomes more expensive. But, yeah, this is the price which people need to pay for having this operation be atomic. And the usual recommendation is to fetch more data from the source topic and process it in larger batches within the transactional operations. Then this overhead of transaction processing still introduces additional overhead, but it's small compared to the workload people are processing.
Another limitation is the API itself. It's similar to the database world, but it's a little bit different, so people should understand what it is. One example is the transactional ID. It's very important, but it's not the identifier of a transaction; it's more like an identifier of the application, of the transactional processor which is executing these transactions. And even in the early proposals of transactions in Kafka, they called it the application ID instead of the transactional ID, and somewhere along the way they decided to change the name. I think these are the major limitations of the transactions.
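Since that naming trips people up, here is a small hedged example of what it means in practice with the standard Kafka transactional API (confluent-kafka Python client; the broker, topic, and id names are hypothetical): the transactional id names the processor instance rather than a transaction, and re-registering the same id fences the older incarnation so a zombie can no longer commit.

```python
from confluent_kafka import KafkaException, Producer

# The transactional.id identifies the *processor instance*, not one transaction:
# a restarted instance reuses the same id so the broker can fence its older self.
conf = {"bootstrap.servers": "localhost:9092", "transactional.id": "enricher-instance-1"}

old_instance = Producer(conf)
old_instance.init_transactions()
old_instance.begin_transaction()
old_instance.produce("target-topic", value=b"from the old instance")

# The orchestrator restarts the processor; the new incarnation registers the
# same transactional.id, which bumps the epoch and fences the old one.
new_instance = Producer(conf)
new_instance.init_transactions()

try:
    old_instance.commit_transaction()   # rejected: this producer is now fenced
except KafkaException as exc:
    print("old instance fenced:", exc)
```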
But besides the limitations, there were several surprises, pleasant surprises. When we implemented the transactions and started testing, doing the performance testing, we found out that it's faster to do a bulk import in Redpanda with transactions than without transactions. And it was a complete surprise, it was 100% unexpected, because I had this intuition that transactions are always slower than no transactions, and it wasn't the case here. I even thought that there might be some error in the data or in the testing procedure, so I needed to double-check it. And it was this way because of the fsync optimization that I mentioned before.
So we were just processing the whole batch as a whole instead of processing the individual messages, and that gives us this performance boost.
[00:40:43] Unknown:
And in terms of the ways that transactions are actually being used now that they're available in Redpanda, I'm wondering what are some of the most interesting or unexpected or innovative ways that you've seen them applied. I think it's too early to talk about this. It's very new,
[00:40:57] Unknown:
and we're waiting for the customers to come and use it. And in the process of actually
[00:41:03] Unknown:
going through the work of adding transactions to Red Panda, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?
[00:41:11] Unknown:
It was harder than I expected. Initially, I was planning to finish it within maybe 3 months, but it took almost a year. I had implemented replication and consensus protocols before, and I thought that transactions were going to be easier, but it wasn't the case. It was easier to understand why it works and what is happening inside, but it was a lot of work to implement it. And with the replication and consensus parts, it's the other way around: it's less work to implement, and the major complexity comes from realizing why the thing even works in the first place.
[00:41:53] Unknown:
And then in terms of the sort of near to medium term future of the work that you're doing at Red Panda and Vectorize, what are some of the things that you're excited to dig into? And if there are any sort of discoveries or ideas that you came up with in the process of building these transactions that are sort of the next logical step for you to explore?
[00:42:14] Unknown:
I still have some work to do on the transaction side. I really want to add more observability into the system and simplify the life of SREs managing Redpanda, a flock of Redpandas or a herd of red pandas. Yeah, I don't know, because they live together. And, yeah, this is related to transactions. And of course I want to achieve this feature parity with Kafka. But another thing I'm really excited about is the replication protocol. Right now, Redpanda uses Raft as its replication protocol, but it's more than 7 years old. Would you recommend somebody buy a 7-year-old iPhone instead of, what is the current version, 13? If both models fit the budget, of course not. And it's the same with replication protocols: it's a very hot area of research, and there have been several advances made since the Raft paper appeared 7 years ago. And recently, as part of this JSys work, I reviewed Matchmaker Paxos, and it looks really cool. It allows faster reconfigurations of the system.
And the faster you can do these maintenance kinds of operations, the less impact they have on the overall cost and, let's say, the less impact they have on the customers. So I'm very excited about this, but I'm not sure when I'll be able to do it, because there are so many things going on at Vectorized and everything looks very exciting. And I wish I could talk about this more.
[00:43:55] Unknown:
Are there any other aspects of the work that you're doing on Redpanda, as far as the transaction support or just the overall Vectorized platform, that we didn't discuss yet that you'd like to cover before we close out the show? No. Probably not really. Maybe you could ask me in a year.
[00:44:11] Unknown:
Sounds good. And I will be able to share something new with you. Alright.
[00:44:15] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think in terms of the gap, it's this gap between
[00:44:33] Unknown:
the industry and academia. They develop new ideas way faster than we implement them. So that's where we should look.
[00:44:43] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing on Redpanda and the recent work that you've done adding transaction support to it. It's definitely a very interesting project and an interesting challenge that you were able to, you know, come out successful with. So I appreciate all the time and effort you've put into that, and I hope you enjoy the rest of your day. Yeah. Thank you. It was a pleasure
[00:45:05] Unknown:
to be here. Same to you. Bye.
[00:45:12] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Denis Rystsov Begins
Denis Rystsov's Background and Career Journey
Overview of Redpanda and Vectorized
Adding Transactional Support to Redpanda
Challenges and Optimizations in Implementing Transactions
Validation and Testing of Transactions
Limitations and Performance of Transactions
Future Work and Exciting Developments
Closing Remarks and Final Thoughts