Summary
Transactions are a necessary feature for ensuring that a set of actions are all performed as a single unit of work. In streaming systems this is necessary to ensure that a set of messages or transformations are all executed together across different queues. In this episode Denis Rystsov explains how he added support for transactions to the Redpanda streaming engine. He discusses the use cases for transactions, the different strategies, semantics, and guarantees that they might need to support, and how his implementation ended up improving the performance of bulk write operations. This is an interesting deep dive into the internals of a high performance streaming engine and the details that are involved in building distributed systems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Your host is Tobias Macey and today I’m interviewing Denis Rystsov about implementing transactions in the RedPanda streaming engine
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you quickly recap what RedPanda is and the goals of the project?
- What are the use cases for transactions in a pub/sub messaging system?
- What are the elements of streaming systems that make atomic transactions a complex problem?
- What was the motivation for starting down the path of adding transactions to the RedPanda engine?
- How did the constraint of supporting the Kafka API influence your implementation strategy for transaction semantics?
- Can you talk through the details of how you ended up implementing transactions in RedPanda?
- What are some of the roadblocks and complexities that you encountered while working through the implementation?
- How did you approach the validation and verification of the transactions?
- What other features or capabilities are you planning to work on next?
- What are the most interesting, innovative, or unexpected ways that you have seen transactions in RedPanda used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on transactions for RedPanda?
- When are transactions the wrong choice?
- What do you have planned for the future of transaction support in RedPanda?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Vectorized
- RedPanda
- RedPanda Transactions Post
- Yandex
- Cassandra
- MongoDB
- Riak
- Cosmos DB
- Jepsen
- Testing Shared Memories paper
- Journal of Systems Research
- Kafka
- Pulsar
- Seastar Framework
- CockroachDB
- TiDB
- Calvin Paper
- Polyjuice Paper
- Parallel Commit
- Chaos Testing
- Matchmaker Paxos Algorithm
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Denis Rystsov about implementing transactions in the Redpanda streaming engine. So, Denis, can you start by introducing yourself? Thank you for inviting me, Tobias.
[00:02:06] Unknown:
It's wonderful to be here. I listened to a lot of episodes before and never imagined that someday I'd be talking about my own work here. So, yeah, I'm Denis. Currently, I'm working at Vectorized, but before that I was at Amazon, and at Yandex back in Russia.
[00:02:23] Unknown:
Yeah. But let's just dive in and discuss the topic. And do you remember how you first got involved in data management?
[00:02:30] Unknown:
Oh, yeah. Sure. So I think it was in 2010. I had just graduated from university with a degree in applied math and computer science. And for listeners who are new to the industry, around that time there was a boom of big data and NoSQL solutions. My first job was at a tiny startup financed by triple-F funding: friends, family, and fools. And there I was investigating the differences between these NoSQL solutions. I think I was looking into Cassandra, MongoDB, and Riak. Eventually, we just stayed with Postgres for that particular project.
Yeah. So this is how it started. And shortly after that, I joined Yandex. Again, for people who don't know, it's a search engine which is more popular than Google in Russia. And the fun fact is that more countries have space programs than have a local search engine. And just like at Google, there is a lot of data management and data transfer there. I was leaning towards big data at that time and was working on infrastructure related to systems similar to Hadoop. It wasn't Hadoop, it was an internal thing with a focus on privacy, but it was basically the same idea. So that was the first time I used Apache Kafka. It was the very early days, probably the 0.7 or 0.8 version.
And, yeah, that's how I started. Then, with all of this, I migrated to the United States. I was working on Cosmos DB at Microsoft and was mostly responsible for consistency testing. It was something similar to what Kyle Kingsbury does with Jepsen, but it was internal and tailored to that particular system. The work itself was based on the Testing Shared Memories paper, and it's interesting that large-scale distributed systems and modern CPUs are very similar in their internals. With distributed systems, you have local nodes with fast data access and remote nodes with slower access. And exactly the same thing is happening inside of CPUs. You have multiple cores, each core has its own cache, and it's fast to access data from that cache, but it's slower to access data from a different core's cache or from memory.
It's all very similar, and people in the CPU world had this advantage of time, so they started working on these problems earlier and figured out how to test consistency. Because when you work in this environment, it's tempting to just keep a local copy of the data you're working with and modify that copy instead of going the slower path. But when you do this, your replicas of the data may get out of sync. So it's important to make sure that the data you see is up to date. People figured out how to write such systems and then started thinking about how to test them. And this paper describes that if you have some restrictions on the data operations, basically if you do all the write operations with a compare-and-set, then you can do this consistency testing in linear time. And it was kind of a revelation to me, because I had always assumed it was an NP-complete problem, so you couldn't test long histories.
And it was very important for Cosmos DB, because it's a large system, and some of the operations which potentially may affect the consistency of the database take a lot of time, so testing for just a minute wasn't enough to make sure that everything is correct. So when I looked at this paper, I realized that's what we needed and just applied this research to another domain. I think there is a legend about Steve Jobs and Bill Gates: they first observed the research on graphical user interfaces at Xerox PARC and then made commercial software based on it. And I think we should do the same in data management systems, because there's a huge amount of research happening right now, and there are a lot of things we can get inspired by, or just implement as-is and base our systems on, to get an advantage.
So I think there are a lot of opportunities, Tobias. And for now, I'm volunteering at the JSys academic journal, the Journal of Systems Research with diamond open access, and I helped organize the Hydra conference on concurrent and parallel computing, and I work at Vectorized on
[00:07:48] Unknown:
the Redpanda project. Yeah. It's definitely interesting to see how research in one domain can be applied in other areas without the original researchers having any conception of the potential applications or even really any knowledge of those other domains where it ends up getting used. It's definitely cool that you're able to take some of the research in, you know, CPU architectures and how to optimize that and apply it to these distributed systems problems. And then so in terms of Redpanda and the work that you're doing at Vectorized, I did have an interview a little while ago with Alexander, but I'm wondering if you can just quickly, for people who didn't listen to that, give a bit of an overview about what it is that you're building at Vectorized with the Redpanda project and some of the goals and maybe some of the ways that the project has changed or evolved in the past year. Definitely.
[00:08:36] Unknown:
So Redpanda is part of this category of streaming solutions. And when we talk about streaming in general, it solves the problem of transferring data from one system to another and processing it along the way. For example, when I said that we were using Kafka at Yandex a long time ago, it was used to ship logs from worker nodes to that internal MapReduce-like system. And there are a lot of other places where we need to connect different systems, and streaming solutions are really good at this. I've seen multiple pieces of advice to start with a database as the foundation for a message bus and then replace it with a specialized solution like Kafka, Redpanda, or Apache Pulsar when you start growing. Because using a database is kind of okay at small scale, but a database does way more than these streaming solutions, and when it does more, you eventually need to pay more. And I'm not talking about money, it's more about performance, latency, and these things. So if you try to measure the latency of a database and then compare it with any of the streaming solutions, you'll be surprised. At least I was surprised when I saw this data. With the database, there was more variance in the data and the chart looked like a jigsaw, while with streaming solutions it's just a straight line, and different streaming systems are only a little bit different in performance.
But, yeah, if we compare with a database, it's like comparing cars that you use to drive in the city with these jet-propelled monsters that they test on the salt flats. So, of course, it's not suitable for all of the use cases, but the things it's optimized for, it does really well. And Redpanda is one such system. I describe it as an alternative implementation of the Kafka protocol, but focused on low latency. And now, going into the future, it's growing as we add more features. One of the known things that we're working on is Wasm transformation, but there's much more. I wish I could talk about this, but I can't, because there are people who will be mad at me if I share the roadmap.
So Redpanda is kind of a reimplementation of the protocol, and compared to Apache Kafka, we use a different language: it's written in C++ versus Java. It uses a different architecture: it's based on the Seastar framework and it uses a thread-per-core approach. And this kind of rhymes with what we were talking about before, about CPUs and the distributed world. So even inside of this big distributed system, such as Redpanda, when we work at the lowest level, we try to stick with one core and not move data around, to get these performance gains at all of these layers. And, also, we use different protocols.
So we use Raft for replication. We use various transaction optimizations, including parallel commits for transactions. And these things that we're using just didn't exist at the time Kafka started as a project. It's all recent developments in this area, and we are looking around and thinking, okay, what can we do better to make our customers happy, and focusing on those pain points. One of the things that we are working on is Wasm transformation. I mentioned that streaming solutions are about moving data and transforming it along the way. So Wasm is aimed at these transformations, and transactions, which I was working on implementing over the last year, also aim to help with this idea of transformation.
[00:13:01] Unknown:
Yeah. So this is what Redpanda is right now. I think it's funny that the marketing people don't want you talking about the roadmap when it's usually the marketing people who try to promise new features that the engineers haven't built yet.
[00:13:14] Unknown:
Oh, I can promise new features for the next 10 years. My backlog is just too big.
[00:13:23] Unknown:
And so now digging into the recent work that you've been doing with adding transactional support to the Redpanda engine, I'm wondering if you can just talk through some of the primary use cases where that is beneficial in a streaming context for a pub/sub messaging system to be able to have these atomic transactions.
[00:13:43] Unknown:
Yeah. A streaming system is about moving and transforming data. When we look at this at a little bit lower level, it's like having a large set of lists in your system, where you can add messages to these lists, where you can query a list by index, and where you can iterate over them. And transformation-wise, you want to take messages from one list, do some modification, and then write them to another list. And before transactions, there were several problems with this. Basically, either you're okay with having duplicates in the system because you want to process all of the data that you have, or you're okay with losing some information in the system because you don't want to have duplicates.
And all of these anomalies, like data loss or data duplication, come from the fact that when we take a message from the queue, a fault may happen with the current system. And then when you start again, you're kind of at a crossroads: what do you want to do? Either you assume that the last message was processed and inserted, or you assume that it wasn't, and you process it again and insert the message into the system. But if, in the first case, the message actually wasn't inserted and you decided that it was, you have data loss. And in the second case, if the original message actually was processed and inserted, and then you reprocess and insert it again, you have data duplication. And working with these anomalies is really hard for programmers, because we need to think not only about the essential complexity of the problem we are working on and code the transformation, but also about this accidental complexity of having anomalies in the system. And it's not what people are trained for at university or in boot camps. We're used to the idea that if we have a variable and we change it, it has the changed value; nobody comes and plays with it while we are not watching. And transactions help to solve these anomalies. Well, actually, they don't solve them, they prevent them. Transactions help make this transformation step (taking a message from the source topic, processing it, and putting it in the target topic) atomic, so we don't need to think about these anomalies. If something goes wrong with the processor, it will be restarted by some orchestration program on top of it, and there is a guarantee that it will start where it left off and avoid all of these anomalies.
So this is the major use case for transactions.
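To make that concrete, here is a minimal sketch of that consume-transform-produce loop against the Kafka transactional API, which Redpanda implements, using the confluent-kafka Python client. The broker address, topic names, group id, transactional id, and the transform function are placeholder assumptions, and error handling is trimmed to the essentials.

```python
from confluent_kafka import Consumer, KafkaException, Producer

# All names below (brokers, topics, group id, transactional id) are placeholders.
BROKERS = "localhost:9092"

def transform(payload: bytes) -> bytes:
    # Stand-in for the real per-message transformation.
    return payload.upper()

consumer = Consumer({
    "bootstrap.servers": BROKERS,
    "group.id": "enricher",
    "isolation.level": "read_committed",   # only see data from committed transactions
    "enable.auto.commit": False,           # progress is committed inside the transaction
    "auto.offset.reset": "earliest",
})
producer = Producer({
    "bootstrap.servers": BROKERS,
    "transactional.id": "enricher-instance-1",  # names this processor instance
})

consumer.subscribe(["source-topic"])
producer.init_transactions()

while True:
    msgs = consumer.consume(num_messages=100, timeout=1.0)
    if not msgs:
        continue
    producer.begin_transaction()
    try:
        for msg in msgs:
            if msg.error():
                raise KafkaException(msg.error())
            producer.produce("target-topic", value=transform(msg.value()))
        # The consumed offsets ride in the same transaction as the produced
        # messages, so the read, the writes, and the progress marker commit
        # (or abort) together.
        producer.send_offsets_to_transaction(
            consumer.position(consumer.assignment()),
            consumer.consumer_group_metadata(),
        )
        producer.commit_transaction()
    except Exception:
        # Nothing from this batch, offsets included, was committed, so the
        # messages are redelivered after a restart or rebalance.
        producer.abort_transaction()
        raise
```

The important call is send_offsets_to_transaction: the consumer's progress is committed inside the same transaction as the produced messages, so the whole step either happens or doesn't, which is exactly the guarantee described above.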
[00:16:44] Unknown:
Because of the fact that you're working with a distributed system, you have to deal with the transactions being distributed as well, which adds a lot of complexity to it. And I'm wondering if you can talk through some of the sort of main challenges that exist in trying to apply distributed transaction semantics, particularly in a streaming context.
[00:17:04] Unknown:
The first problem is it's a distributed system, so we need to think about failures of different nodes. But this problem was kind of solved before: we have wonderful distributed databases like CockroachDB, TiDB, and MongoDB, and they all have transactions built in. But compare the transactions of OLTP databases with Kafka transactions and Redpanda transactions. OLTP transactions tend to be small: we have this website or system that we're working on, the user interacts with it and sends a request to the server, the server does its work, executes it within a transaction, and sends a response back, and these tend to be small. And some of the transactional systems even require you to provide the set of keys that you are going to modify before you start the transaction. The major example here is the Calvin paper, which is a well-known variant of this approach. It has amazing throughput, but it has this limitation of needing to know the data that you're going to modify in advance. Another example is Ocean Vista.
It uses a similar approach to Calvin, and there you also need to know these keys in advance. And with streaming, that's not true. With streaming transactions, we're talking about moving a lot of data, and we don't know what we're processing in advance; it's also at scale, at very high velocity. So it's a challenge to pick, out of all of the transaction protocols that are out there, the one which fits the streaming use case. I think scale was the major problem.
[00:18:59] Unknown:
And then as far as the actual impetus for starting the work of adding transactions to Red Panda, I'm wondering if there were any particular product requirements or customer needs that you were trying to fulfill or if it was just something that you were trying to gain parity with the Kafka APIs or just sort of what it is that set you down the path of actually adding transactional
[00:19:21] Unknown:
semantics to the Redpanda engine? You know, it was a customer need first. Even though it was part of the Kafka protocol, if nobody was using these features it wasn't really that important to support them at first. But we had this customer pressure: they wanted to use different tools which depend on transactional processing. So we were kind of forced to work on this area. In terms of the actual
[00:19:51] Unknown:
Kafka API, I know that the upstream engine does have support for transactions built into it, at least in some of the recent releases. And I'm wondering how that influenced your approach to how the actual API for the transactions was designed, and then, if you were trying to map the upstream API, how that constraint influenced your approach to actually doing the work of creating the transaction semantics within the Redpanda engine? I think it was a blessing for me.
[00:20:23] Unknown:
It's really hard to work on something when you don't have restrictions and you can just keep reworking the architecture without any constraints. You can always see, okay, I can modify this API and then it will get even faster, and then you're working and then you might realize, oh, okay, I can change this too. And it's really hard to work that way, at least for me. When I have restrictions, it kind of boosts my creativity and imagination, but it also keeps me attached to the ground. And if that wasn't the case, the domain of approaches I could choose from, it's not infinite, but it's quite a big one. When we talk about transactions, the first choice is: is it based on two-phase locking? So basically pessimistic locking, where we lock the data sources or the records that we're operating on in advance.
So we start the transaction, then we do the work, and then we commit. And we may lock the data, or we may take an optimistic approach. Both have pros and cons. And recently there was a very interesting paper called Polyjuice, which is about combining these approaches and building a system which picks its locking mechanism based on the workload it currently experiences. But if you don't build such a sophisticated model, you need to make this choice: either it's pessimistic or it's optimistic. And if it's pessimistic, you then need to work on a deadlock prevention mechanism, and there are, again, several branches which you can choose.
And with timestamp ordering, there are also choices about which road to go: it can be multi-version concurrency control, or maybe optimistic concurrency control. Or we can say all of these approaches have been known since the nineties, we want to try something new, and then we can go and try Calvin, we can go and try Granola, or we can go and say, okay, we think this isolation level is not important for our customers, let's choose the weakest one. Well, maybe not the weakest one, let's increase it just a little bit, and then we can use RAMP transactions. The scope of this work is just so big to choose from that without restrictions it's impossible. So I used these restrictions to guide my focus on what we could use for Redpanda.
[00:23:02] Unknown:
Struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world's first end-to-end, fully automated data observability platform. In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing the time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today. Visit dataengineeringpodcast.com/impact today to save your spot at Impact, the Data Observability Summit, a half-day virtual event featuring the first US Chief Data Scientist, the founder of the data mesh, the creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data.
The first 50 people who RSVP will be entered to win an Oculus Quest 2. And so in terms of the isolation level and the capabilities and sort of the requirements there, I'm wondering if you can just talk through some of the actual technical implementation and some of the design decisions that you did make as far as how those transactions are manifested in Redpanda?
[00:24:25] Unknown:
So transactions in Redpanda, if we try to compare them with the database world, are read committed. You get an acknowledgment that your data was applied to the system before it actually has been replicated to all of the nodes. It doesn't mean that it's an eventually consistent system, which is just hard to work with; it's more like strong eventual consistency. You know that eventually your transaction will be applied, but if you go and try to read this data immediately after you get the confirmation, there is a risk of getting slightly stale data. And if we talk about the implementation of all of this, the optimizations that we did in Redpanda are based on two approaches.
The first one is at commit time. Well, it's not the first step, it's the last one. At the end of the transaction we commit, and here we use the parallel commit optimization, which was pioneered at CockroachDB. The basic idea is to reduce latency. Usually, when people talk about transactional commit in a distributed system, the first thing that pops up is the two-phase commit protocol, and the name gives you a hint that there are two rounds of communication. First, you need to go to each of your nodes and say: please persist the state that you have, and validate that you can commit. And if it can commit, each node makes this decision persistent on disk, and then you get the confirmation from everybody. Or not necessarily on disk: it can be replicated using the Paxos or Raft protocols. Either way, you get a confirmation from everybody.
And then, once everybody has said that it is okay to proceed with this transaction, you mark the transaction as committed on the master record representing this transactional object. And after this, you communicate again with everybody with the instruction: please commit the data that you reserved. This last part, with Redpanda and with Kafka, because of the isolation model that we chose, happens after the acknowledgment of the transaction, so we don't count it as a latency impact. But there are still two things which happen before we acknowledge the request.
First, this prepare communication; second, updating the master record. And what parallel commits say is: okay, let's do all of this in parallel. So the prepare message carries a bit of extra information: please confirm, do your work, and, by the way, here are all of the other participants in this transaction, or there might be a reference back to the master transactional record. And when this one phase is executed, we send an acknowledgment to the client, because we know that everybody decided to participate in the transaction. But if somebody else observes this incomplete record, then it's going to be that somebody else's responsibility to go and clean it up. Usually it's the same transaction coordinator in the end, but it reduces the number of network round trips and makes things faster. Though I don't think the major benefit is reducing the network communication, since network communication is still pretty fast these days; what is really important is reducing the number of sequential fsyncs to disk, and that is a major point for latency in data management systems.
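As an illustration of the round structure being described, here is a toy asyncio sketch contrasting the two sequential persistence rounds of classic two-phase commit with the single round of a parallel commit, where each prepare names the other participants (or points back at the staged record) so an observer can later finish or abort an incomplete commit on its own. This is purely a sketch under those assumptions, not Redpanda's actual code, and all names are made up.

```python
import asyncio
import random

async def rpc(target: str, action: str) -> str:
    """Stand-in for one replicated, fsynced request to a participant."""
    await asyncio.sleep(random.uniform(0.01, 0.05))  # network plus fsync latency
    return f"{target}:{action}:ok"

async def two_phase_commit(partitions: list[str]) -> str:
    # Round 1: every participant durably records its prepare and votes yes.
    await asyncio.gather(*(rpc(p, "prepare") for p in partitions))
    # Round 2: only after all votes does the coordinator durably mark COMMITTED,
    # and only then can the client be acknowledged.
    await rpc("tx-coordinator", "mark-committed")
    return "acknowledged after two sequential persistence rounds"

async def parallel_commit(partitions: list[str]) -> str:
    # Single round: the prepares and the coordinator's staged commit record are
    # written concurrently. Each prepare lists the full participant set, so a
    # reader who finds the record incomplete can resolve it without the client.
    await asyncio.gather(
        rpc("tx-coordinator", "stage-commit"),
        *(rpc(p, "prepare") for p in partitions),
    )
    return "acknowledged after one persistence round"

async def main() -> None:
    parts = ["topic-a/0", "topic-b/3"]
    print(await two_phase_commit(parts))
    print(await parallel_commit(parts))

asyncio.run(main())
```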
So that is the commit-time optimization, the last one we do in processing. And the first optimization we did: I mentioned that Kafka and Redpanda are about moving data, and when we're processing a transaction we need to process a lot of data, and we can't keep all of this information in memory.
So while we're processing this information, we store it to disk, and we can store a gigabyte; it doesn't matter how much data we want to process in one transaction, we need to save all of it to disk. And because of the network, we can't just say: this is my huge chunk of data, please take it from here, save it, and give me a confirmation. From the client side, we send this data in small TCP packets and process them in sequence, and we store each chunk of this huge piece of data sequentially, and we want to store it reliably. So we use fsyncs to get the confirmation that everything is secured.
And once we've saved all of this data, we finally go and commit. So what we did in Redpanda: we decided, okay, let's do this optimization, let's treat the transaction as a unit of work and optimize it as a whole instead of optimizing each individual record. So what we're doing is saving the data without waiting for confirmation from the disk, kind of optimistically. But then, when we're about to commit, we make sure that it was fully persisted to disk and fully replicated to all of the replicas. And if it wasn't, which does happen but is a rare situation,
then we just abort the transaction, and the user kind of expects that transactions might be aborted because of some internal things happening in the cluster. So it's no surprise to them, and they can just process this information again. But the gains that we get from it are really big, because if we want to persist N messages, then otherwise we would need to fsync N messages sequentially (it's not exactly N, I'm simplifying). And now we are not doing this; we just let the system figure out at its own pace when it needs to save this data, and in the end we check that it was saved, and it gives us a huge boost in performance.
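A minimal illustration of the shape of that optimization, in plain Python rather than Redpanda's C++ internals: one writer fsyncs every batch, the other writes optimistically and performs a single durability check when the transaction commits. File paths and batch handling are simplified assumptions.

```python
import os

def write_with_per_batch_fsync(path: str, batches: list[bytes]) -> None:
    """N batches lead to roughly N sequential fsyncs; the latency adds up per record."""
    with open(path, "ab") as f:
        for batch in batches:
            f.write(batch)
            f.flush()
            os.fsync(f.fileno())

def write_with_commit_time_check(path: str, batches: list[bytes]) -> None:
    """Write optimistically, then verify durability once at commit time.

    In the real engine the commit-time check also covers replication, and if it
    fails the whole transaction is simply aborted and the client retries.
    """
    with open(path, "ab") as f:
        for batch in batches:
            f.write(batch)          # let the system persist at its own pace
        f.flush()
        os.fsync(f.fileno())        # one durability check when committing
```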
So I think these are the two major optimizations that we did for transactions.
[00:31:02] Unknown:
Beyond just the constraints of needing to adhere to the Kafka API for transaction support, what were the other priorities that you were orienting around to determine how exactly to implement these transactions, and how did that sort of inform the work that you were doing? And wondering also sort of what types of roadblocks or challenges those constraints might have added to the overall effort? When I was kind of designing and working on this, I wanted
[00:31:33] Unknown:
to have some result fast. So I came up with several optimizations and I was focusing on them. And also, I wanted to have it done faster, but usually when you do something, you think, oh, okay, this is kind of suboptimal, I will come back to this part of the system later, but later usually never comes. So I was trying to achieve the maximum performance I could and then iterate on that, rather than copying what Kafka does and then optimizing it
[00:32:09] Unknown:
later. I understand that from when you first started on this till you have delivered it in its current form, it ended up taking on the order of about a year. And so I'm curious sort of how you were able to break up the steps so that you were able to continue shipping Red Panda, but you were also able to do your work without it either sort of having to deal with constantly rebasing the upstream changes against your branch and dealing with all those conflicts and just the overall sort of, like, software development methodology that you use to be able to deliver such a sort of substantial feature over such a long period of time without having it affect other work that was ongoing?
[00:32:45] Unknown:
There are multiple layers to this question. But first, I want to comment on this one year. Yeah, it was one year, but at the same time, it was close to 10 years, because, well, I mentioned that I was working at Yandex and I was touching Kafka at that time. And, of course, it didn't have transactions implemented in the system itself. And one of the tasks that I got at Yandex was to build a transactional system on top of Kafka to process these logs automatically. And, yeah, I was fresh out of college and I successfully failed at this part. But then I decided that, okay, I will overcompensate for this and dive into research. And now, 10 years later, I've finally succeeded at this part of implementing transactions on top of this Kafka thing. So I can finally make my old managers happy.
But you were asking about how to work on these big projects without affecting the other areas. And part of it is that I just dedicated some part of my time to working on the transactions and some other part of my time to working on something else. So that's the answer to how it didn't affect the other work that I was doing. But there is another layer to this question: when you're working on something, the surface area might be a big feature, and it may touch different areas inside of Redpanda and may affect the other operations.
But it wasn't this way in our case, because with all of these parts of transactions in the Kafka protocol, the majority of the complexity lives in the transaction coordinator. And the transaction coordinator is a completely separate topic; it's a separate entity in the system. Most of the work I've done wasn't mixing with the other development processes inside of Redpanda. And for the small part which does, yeah, we can use tests to validate that nothing is affected.
[00:34:57] Unknown:
And then as far as the actual transactional support that's in there now, I'm wondering how you handled validation of the transactions
[00:35:05] Unknown:
as you were iterating on the problem and how you decided when you were complete. Yeah. I can't say that it's complete now, because right now transactions in Redpanda, I don't want to be wrong, but I think they correspond to the Kafka 2.4 level. So there are still some KIPs, Kafka Improvement Proposals, that we need to support to achieve the complete, 100% feature. But, yeah, early on I was using unit tests to test what I was working on. And also we have chaos testing that we use. Partially it's based on the work I had previously done at Microsoft, so it's all connected.
And as part of this chaos testing, we use iptables to block network communication to see how it affects the system. Also, we restart some supporting services. Instead of a graceful shutdown, we use kill -9 to terminate the process. Also, we have a FUSE file system to inject errors at the file system level, and we also introduce delays there to see how it affects the performance and behavior of Redpanda. What I really want to continue doing with this FUSE project is to cache all of the writes as they go into the system, and when there is an error during fsync, just to forget all of this data to simulate a complete power loss.
It's hard to achieve this in a lightweight way, because it's not enough to just terminate the application: the operating system can finish all of the pending operations and you don't break things. And when you want to break things, it kind of gets in the way of doing it. Also, what helped us is the very basic isolation level of the transactions. It's read committed: it has some guarantees, but you don't need to think about the interaction with other transactions. If it was something stronger, you would need to use tools like Elle to check the levels of visibility, and that's a hard problem.
But gladly, the guarantees are pretty low, and we don't need to overcomplicate the testing approach.
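For readers unfamiliar with this style of fault injection, the sketch below shows the general shape of such a harness: partitioning a node with iptables and terminating a process without a graceful shutdown. It is only an illustration of the techniques mentioned above, not the actual Redpanda chaos-testing code, and the peer address and process id are placeholders you would supply.

```python
import subprocess

def isolate(peer_ip: str) -> None:
    """Simulate a network partition by dropping all traffic to and from one peer."""
    subprocess.run(["iptables", "-A", "INPUT", "-s", peer_ip, "-j", "DROP"], check=True)
    subprocess.run(["iptables", "-A", "OUTPUT", "-d", peer_ip, "-j", "DROP"], check=True)

def heal(peer_ip: str) -> None:
    """Remove the partition rules added by isolate()."""
    subprocess.run(["iptables", "-D", "INPUT", "-s", peer_ip, "-j", "DROP"], check=True)
    subprocess.run(["iptables", "-D", "OUTPUT", "-d", peer_ip, "-j", "DROP"], check=True)

def hard_kill(pid: int) -> None:
    """kill -9: the process gets no chance at a graceful shutdown, so crash
    recovery paths are exercised on the next start."""
    subprocess.run(["kill", "-9", str(pid)], check=True)
```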
[00:37:37] Unknown:
In terms of the actual usage of the transaction capabilities, what are some of the limitations as far as either the, you know, total number of records that you're able to process within transaction boundaries, the number of operations before you sort of explode the level of complexity beyond what can actually be reasonably done within a transaction or, you know, the sort of interfaces that are available such as being able to execute a transaction from within a WASM transform?
[00:38:04] Unknown:
The problem with the transactions comes from the additional operations that we need to do. If you want to have a linearizable system and you're okay with having duplicates in the system, it's faster to go without transactions, because for each transaction you write two additional message batches into the system: a prepare and a commit. So if you process just a single record, when you use transactions that single record becomes three records, so it becomes more expensive. But, yeah, this is the price which people need to pay for having this operation be atomic. And the usual recommendation is to fetch more data from the source topic and process it in larger batches within the transactional operations. Then this overhead of transaction processing still introduces additional overhead, but it's small compared to the workload people are processing.
Another limitation is the API itself. It's similar to the database world, but it's a little bit different, so people should understand what it is. One example is the transactional ID. It's very important, but it's not the identifier of a transaction; it's more like an identifier of the application, of the transactional processor which is executing these transactions. And even in the early proposals of transactions in Kafka, they called it the application ID instead of the transactional ID, and somewhere along the way they decided to change the name. I think these are the major limitations of the transactions.
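Since that naming trips people up, here is a small hedged example of what it means in practice with the standard Kafka transactional API (confluent-kafka Python client; the broker, topic, and id names are hypothetical): the transactional id names the processor instance rather than a transaction, and re-registering the same id fences the older incarnation so a zombie can no longer commit.

```python
from confluent_kafka import KafkaException, Producer

# The transactional.id identifies the *processor instance*, not one transaction:
# a restarted instance reuses the same id so the broker can fence its older self.
conf = {"bootstrap.servers": "localhost:9092", "transactional.id": "enricher-instance-1"}

old_instance = Producer(conf)
old_instance.init_transactions()
old_instance.begin_transaction()
old_instance.produce("target-topic", value=b"from the old instance")

# The orchestrator restarts the processor; the new incarnation registers the
# same transactional.id, which bumps the epoch and fences the old one.
new_instance = Producer(conf)
new_instance.init_transactions()

try:
    old_instance.commit_transaction()   # rejected: this producer is now fenced
except KafkaException as exc:
    print("old instance fenced:", exc)
```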
But besides the limitations, there were several surprises, pleasant surprises. When we implemented the transactions and started testing, doing the performance testing, we found out that it's faster to do a bulk import in Redpanda with transactions than without transactions. And it was a complete surprise, it was 100% unexpected, because I had this intuition that transactions are always slower than no transactions, and it wasn't the case here. I even thought that there might be some error in the data or in the testing procedure, so I needed to double-check it. And it was this way because of the fsync optimization that I mentioned before.
So we were just processing the whole batch as a whole instead of processing the individual messages, and that gives us this performance boost.
[00:40:43] Unknown:
And in terms of the ways that transactions are actually being used now that they're available in Redpanda, I'm wondering what are some of the most interesting or unexpected or innovative ways that you've seen them applied. I think it's too early to talk about this. It's very new,
[00:40:57] Unknown:
and we're waiting for the customers to come and use it. And in the process of actually
[00:41:03] Unknown:
going through the work of adding transactions to Red Panda, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?
[00:41:11] Unknown:
It was harder than I expected. Initially, I was planning to finish it within maybe 3 months, but it took almost a year. I had implemented replication and consensus protocols before, and I thought that transactions were going to be easier, but it wasn't the case. It was easier to understand why it works and what is happening inside, but it was a lot of work to implement it. And with the replication and consensus parts, it's the other way around: it's less work to implement, and the major complexity comes from realizing why the thing even works in the first place.
[00:41:53] Unknown:
And then in terms of the sort of near to medium term future of the work that you're doing at Red Panda and Vectorize, what are some of the things that you're excited to dig into? And if there are any sort of discoveries or ideas that you came up with in the process of building these transactions that are sort of the next logical step for you to explore?
[00:42:14] Unknown:
I still have some work to do on the transaction side. I really want to add more observability into the system and simplify the life of SREs managing Redpanda, a flock of Redpandas or a herd of red pandas. Yeah, I don't know, because they live together. And, yeah, this is related to transactions. And of course I want to achieve this feature parity with Kafka. But another thing I'm really excited about is the replication protocol. Right now, Redpanda uses Raft as its replication protocol, but it's more than 7 years old. Would you recommend somebody buy a 7-year-old iPhone instead of, what is the current version, 13? If both models fit the budget, of course not. And it's the same with replication protocols: it's a very hot area of research, and there have been several advances made since the Raft paper appeared 7 years ago. And recently, as part of this JSys work, I reviewed Matchmaker Paxos, and it looks really cool. It allows faster reconfigurations of the system.
And the faster you can do these maintenance kinds of operations, the less impact they have on the overall cost and, let's say, the less impact they have on the customers. So I'm very excited about this, but I'm not sure when I'll be able to do it, because there are so many things going on at Vectorized and everything looks very exciting. And I wish I could talk about this more.
[00:43:55] Unknown:
Are there any other aspects of the work that you're doing on Redpanda, as far as the transaction support or just the overall Vectorized platform, that we didn't discuss yet that you'd like to cover before we close out the show? No. Probably not really. Maybe you could ask me in a year.
[00:44:11] Unknown:
Sounds good. And I will be able to share something new with you. Alright.
[00:44:15] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think in terms of the gap, it's this gap between
[00:44:33] Unknown:
the industry and academia. They develop new ideas way faster than we implement them. So that's where we should look.
[00:44:43] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing on Redpanda and the recent work that you've done adding transaction support to it. It's definitely a very interesting project and an interesting challenge that you were able to, you know, come out successful with. So I appreciate all the time and effort you've put into that, and I hope you enjoy the rest of your day. Yeah. Thank you. It was a pleasure
[00:45:05] Unknown:
to be here. Same to you. Bye.
[00:45:12] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Denis Rystsov Begins
Denis Rystsov's Background and Career Journey
Overview of Redpanda and Vectorized
Adding Transactional Support to Redpanda
Challenges and Optimizations in Implementing Transactions
Validation and Testing of Transactions
Limitations and Performance of Transactions
Future Work and Exciting Developments
Closing Remarks and Final Thoughts