Summary
Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging. Elad Eldor has experienced these challenges first-hand, leading to his work writing the book "Kafka: Troubleshooting in Production". In this episode he highlights the sources of complexity that contribute to Kafka's operational difficulties, and some of the main ways to identify and mitigate potential sources of trouble.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Your host is Tobias Macey and today I'm interviewing Elad Eldor about operating Kafka in production and how to keep your clusters stable and performant
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe your experiences with Kafka?
- What are the operational challenges that you have had to overcome while working with Kafka?
- What motivated you to write a book about how to manage Kafka in production?
- There are many options now for persistent data queues. What are the factors to consider when determining whether Kafka is the right choice?
- In the case where Kafka is the appropriate tool, there are many ways to run it now. What are the considerations that teams need to work through when determining whether/where/how to operate a cluster?
- When provisioning a Kafka cluster, what are the requirements that need to be considered when determining the sizing?
- What are the axes along which size/scale need to be determined?
- The core promise of Kafka is that it is a durable store for continuous data. What are the mechanisms that are available for preventing data loss?
- Under what circumstances can data be lost?
- What are the different failure conditions that cluster operators need to be aware of?
- What are the monitoring strategies that are most helpful for identifying (proactively or reactively) those errors?
- In the event of these different cluster errors, what are the strategies for mitigating and recovering from those failures?
- When a cluster's usage expands beyond the original designed capacity, what are the options/procedures for expanding that capacity?
- When a cluster is underutilized, how can it be scaled down to reduce cost?
- What are the most interesting, innovative, or unexpected ways that you have seen Kafka used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with Kafka?
- When is Kafka the wrong choice?
- What are the changes that you would like to see in Kafka to make it easier to operate?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Kafka: Troubleshooting in Production book (affiliate link)
- IronSource
- Druid
- Trino
- Kafka
- Spark
- SRE == Site Reliability Engineer
- Presto
- Systems Performance by Brendan Gregg (affiliate link)
- HortonWorks
- RAID == Redundant Array of Inexpensive Disks
- JBOD == Just a Bunch Of Disks
- AWS MSK
- Confluent
- Aiven
- JStat
- Kafka Tiered Storage
- Brendan Gregg iostat utilization explanation
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades old batch computation model for an efficient incremental engine to get complex queries that are always up to date. With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products.
Whether it's real time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free. Your host is Tobias Macey, and today I'm interviewing Elad Eldor about operating Kafka in production and how to keep your clusters stable and performant. So, Elad, can you start by introducing yourself?
[00:01:35] Unknown:
Yeah, sure. Nice to meet you, first of all, and thank you for letting me be on the podcast. My name is Elad, and I'm a DataOps group manager at IronSource, which is an Israeli company that was merged with Unity. In my job, I primarily focus on stability, deployment, and cost reduction of big data analytics clusters on AWS, such as Druid, Trino, Spark, etcetera. I'm also involved with Kafka production issues and cost reduction, but mainly production and stability issues on Kafka. Prior to that, I was an SRE for a company called Cognite, where I was in charge of stability of big data clusters on prem, on Linux, mainly for Spark Streaming, Spark batch, and Presto on HDFS at various customer sites.
Prior to that, I was a Java back end engineer for about 10 years. And I recently published a book called Kafka: Troubleshooting in Production, which I think will be the main topic of what we will discuss today. It talks about how to handle production issues in Kafka clusters, both on prem and in the cloud on AWS.
[00:03:28] Unknown:
And do you remember how you first got started working in data and what it is about that space that keeps you interested?
[00:03:35] Unknown:
So I started working on big data about 7 years ago, when I developed my first Spark Streaming application, consuming from Kafka and persisting to HDFS on Linux on prem clusters. Before that, I just wrote back end applications that were reading from some database and writing into some database, but it wasn't formally big data. And 5 years ago, I understood that on big data clusters there are production issues that I hadn't seen before. So instead of being a developer, I became an SRE, solving stability issues in these clusters, which meant knowing not my code better, but the infrastructure, the cluster that my code, or some big data code, runs on, whether these are Presto clusters, Spark clusters, Spark batch, Spark Streaming, Kafka, HDFS. It was a whole new world for me.
And that's what brought me specifically to handle Kafka issues, which I've been doing in production since 2018, though I started a year before that. So maybe the question is more what brought me to not be a developer anymore, but to become an SRE slash DataOps of big data clusters, specifically focusing on cost reduction, whether on prem or in the cloud, and focusing on Kafka. So I'll elaborate on the Kafka part. I understood that Kafka is stuck in the middle of everything. If there is a problem with Kafka, your whole data pipeline is just stuck.
Producers can't write, consumers can't read. And so it was a combination of 2 things. One was the understanding that the best ROI for my time would be to focus on Kafka. The second was that I stumbled into a book called Systems Performance by Brendan Gregg, who was a lead performance engineer at Netflix and now works under Intel's CTO on distributed clusters. And this really opened up a world that I wasn't aware of before, of monitoring and detecting bottlenecks in Linux clusters. Not only understanding what the cluster's bottleneck is, but it also opened a way to reduce costs.
And what I saw when I stumbled into production issues in Kafka, Spark, Presto, Druid, or for that matter any applicative service running on a Linux cluster, is that understanding how to diagnose issues in a Linux cluster can really help you detect what the problem is in many cases, and also allows you to reduce costs on the cloud.
[00:07:29] Unknown:
In terms of the Kafka focus, you mentioned that you were working as a back end developer. You were interested in the use cases for Kafka, the production requirements around it, how to make sure that it was stable and performant. And I'm wondering if you can talk to, at least at a high level, some of the different environments in which you've had experience working with Kafka and some of the categories of operational challenge that you've had to deal with in that process?
[00:07:58] Unknown:
Sure. So I started with a Kafka on prem cluster, on the Hortonworks HDP distribution of Kafka, before they were acquired by Cloudera. So it was a free distribution of Kafka, and now I work with Kafka open source on the AWS cloud. In either case, it wasn't a managed service, so we had to diagnose the issues and solve them ourselves. Because both are practically Kafka open source, even though on prem it was part of HDP, I don't see a lot of difference there. The main issue was on prem versus cloud.
And this is why my book focuses mostly on cloud deployment, but also on on prem deployment. So I started from on prem, and in some parts there are big differences on prem, because you are in charge of the hardware failures, for example. Now, the most common hardware failure on prem, no matter which cluster it is, is disks. And in Kafka on prem, you usually need a lot of storage, so in order not to pay much, you use HDD disks and not SSD. Their failure ratio is pretty high, and you need to take care of it because it is not the cloud.
Another issue is that in on prem environments, the SRE is usually not on-site. You have tier 1, tier 2 support, which are usually unaware of signals that can say something is going to go wrong in the cluster. So as an SRE on prem, you reach the problems once the cluster halts sometimes. On the cloud, the cloud provider will tell you, okay, you don't need to handle disk failures. However, even on the cloud, a broker can halt because of disk deterioration, and you wouldn't know of it until you get into lags.
But you don't need to replace the disk; you can just replace the broker, which is much easier in the cloud versus on prem, where you need to get into the drawer of disks and replace the disk. Another big difference is the scaling. If you have more traffic, on the cloud you can just scale out or scale up, or spin up a new cluster pretty easily, while on prem it's very, very tough. Think of it: if you want to scale up, you need to have disks on-site, and sometimes you don't have these disks on-site. What happens when your customer is in another country? Another issue is, what happens when you need more RAM? On the cloud, you just spin up a cluster with instances that have more RAM.
But on prem, you need to check that you have enough slots to insert DIMM sticks, memory sticks, into each machine, and you need to do it manually. So I think the 2 main differences are the scaling option, which is much easier on the cloud, and handling hardware failures, which, when you are an SRE on prem, is on you or on the tier 1s that manage the issue, while in the cloud you just need to detect the issue and then replace the machine.
So these were the 2 main issues between on prem and the cloud, and not only for Kafka; for a Spark application it's pretty much the same. However, the benefit of deploying on prem is the fact that you don't pay for every hour; you just pay once. I think today there is a growing discussion about whether cloud based companies maybe need to go on prem for some of their clusters. The math behind this calculation sometimes, I think, favors on prem, even though you need to handle all of this: the scaling is very tough, and it's on you to detect hardware issues. But maybe the cost sometimes justifies it.
[00:13:40] Unknown:
Yeah. The on premise versus in cloud debate is definitely ongoing and always very nuanced. And to your point, yeah, on prem the cost over time is much lower because you own the hardware, so you don't have to pay continuous upkeep for it. But there's the opportunity cost of having to move slower and more deliberately, and perhaps reducing the number of chances or experiments that you take, because of the fact that there is so much lead time to bring in that hardware and scale up the cluster. And also, on the point of Kafka, that's an interesting aspect as well, where my understanding of the way that Kafka itself is designed, and some of the aspects of having to define up front the number of topics in order to accommodate a certain number of clients, seems as though it lends itself more readily to that fixed installment on prem environment versus the cloud environment where you are incentivized to elastically scale up and down. And I'm wondering what you see as some of the challenges of bringing Kafka into the cloud because of that potential for elasticity in the clusters.
[00:14:56] Unknown:
If it weren't for the money, you'd use the cloud, okay? The money is the only reason for going on prem, because it's indeed tough. But it's interesting that you mentioned that Kafka might sound like a good candidate to be deployed on prem. And by the way, there are companies where most of the clusters that host open source analytics or messaging tools are deployed on the cloud, but Kafka is deployed on prem. I know of 2 not small Israeli companies that have their Kafka deployed on prem.
And from my experience, because I was an SRE for various customers when I was working on prem, I saw several examples. And by the way, most of the clusters were over provisioned. This is one problem on prem: you over provision because it takes time to purchase more disks or DIMMs or brokers, so it's hard to say up front what the size of the cluster is going to be. But if I take one factor that makes it hard to provision a Kafka cluster up front, it is the storage, the retention.
Because on prem, you need to remember that people on the cloud, let's say people who were born to the cloud, who started working and work only in cloud environments, tend to forget that there are many on prem clusters. And many of these on prem clusters are deployed by enterprises, or by companies that are not really high-tech, and they have big on prem clusters. And these customers, the owners, let's say, are also the owners of the data center, and you provide them the data center.
And they have retention requirements that are driven not only by how much time it would take you to recover from lag, which is usually the case at cloud and Internet companies, where the retention is several hours; they want days or even a week. So this changes the whole picture, because when I started as an SRE, I saw clusters that were just a bunch of machines that sometimes cost tens of thousands of dollars, and the CPU utilization is ridiculous, like 10% user time. And the RAM usage, well, you never know what your RAM usage is, because, we might talk later about the page cache, but it's very hard to understand what the RAM usage in Kafka is.
But the disk usage was really high because of this retention. And then you come to a point, and I dedicated a chapter in the book only to this, of whether you use RAID 10 (RAID 1 plus RAID 0) or use JBOD. I can give a real example of a customer that had 17 brokers just in order to satisfy the retention requirements, and then the customer decided to double the amount of retention, because that's the order that he got; it was some government law enforcement body.
So you don't mess with them. They tell you, okay, I need to double my retention, so you need to satisfy this. The first reaction was, okay, let's double the amount of brokers, but then I convinced my managers that we just needed to switch from RAID 10 to JBOD. Now, RAID 10 gives you double the amount of replication, so if you have a replication factor of 3, you will get 6 copies per segment if you use RAID 10. But if you use JBOD, then you save half of the disks, and you don't need to add even one broker; you just use the same number of disks. And if you want even more storage, you can just add more disks. But then you run into the issue of, okay, I don't have enough slots in my drawer.
So you need to add another drawer to each broker. These are things that on the cloud you don't even think about; people on the cloud don't even know about them. But if I take one thing that makes provisioning Kafka on prem tough, it is the retention requirements. And managers of on prem clusters tend to be very sensitive about retention. They want 10 or 20 times the amount of retention that Internet companies have, for example.
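The RAID 10 versus JBOD trade-off described here is just multiplication, and can be sketched in a few lines. The 17-broker count comes from the anecdote, but the per-broker disk counts and sizes below are invented for illustration:

```python
# Back-of-envelope: usable retention capacity under RAID 10 vs JBOD.
# RAID 10 mirrors every disk, so each Kafka replica is stored twice on disk;
# with a Kafka replication factor of 3, that means 6 physical copies per segment.
# JBOD exposes the raw disks, so only Kafka's own replication applies.

def effective_capacity_tb(brokers: int, disks_per_broker: int, disk_tb: float,
                          replication_factor: int, raid10: bool) -> float:
    """Usable (logical) storage for one copy of the data, in TB."""
    raw_tb = brokers * disks_per_broker * disk_tb
    copies = replication_factor * (2 if raid10 else 1)
    return raw_tb / copies

# Hypothetical hardware (not from the episode): 17 brokers, 12 x 4 TB disks each.
raid = effective_capacity_tb(17, 12, 4.0, replication_factor=3, raid10=True)
jbod = effective_capacity_tb(17, 12, 4.0, replication_factor=3, raid10=False)
print(f"RAID 10: {raid:.0f} TB usable, JBOD: {jbod:.0f} TB usable")
```

With these made-up numbers, switching to JBOD doubles usable retention on identical hardware, which matches the story: the retention order doubled, yet no brokers had to be added.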
[00:20:39] Unknown:
And bringing us around to the book that you wrote, which you mentioned, what was your motivation for bringing this all together in the written form, and what are the overall goals that you have for the book and the people who are reading it?
[00:20:55] Unknown:
So I remember myself as a back end engineer trying to deploy my streaming application on Kafka, and something didn't work in the Kafka on dev. I remember going to the DevOps room, and, you know, everyone is afraid when they go to the DevOps room, because they're the critical part of the organization. And I asked them, okay, I don't know what's wrong with my Kafka. I had no clue, and I didn't get an answer because they had no time. And when I went out of the room, I decided that I would know Kafka from all the angles. That was really my initial motivation. But to be more serious, when I started handling production issues in Kafka and went over to customer sites, I saw that people were just clueless about what was wrong with Kafka. Everyone blamed the other side: the DevOps blamed consumers, consumers blamed producers, producers blamed Kafka.
And Kafka is in the middle of the data pipeline. So that was my motivation: the best ROI for my time was definitely Kafka and the Linux operating system, understanding the metrics and how to diagnose problems in Kafka using Linux metrics. This was why I wanted to learn Kafka from real production issues. My motivation for the book was that those support engineers and DevOps and developers who encounter issues in Kafka, which is not a managed service, so they manage it themselves, would have a cookbook, like recipes, for understanding how to handle production issues in Kafka. And this is why the book is split into 3 logical sections: the data section, the Linux OS section, and the Kafka metrics section.
Most of it is cloud based, but there are also 2 chapters dedicated specifically to on prem. On the cloud, it's just easier because you have monitoring in front of your eyes; you don't need to get logs from support engineers. But I saw that there was no book, nothing that even resembled a book, about real production issues in Kafka, and there is a real need for this because there are so many Kafka deployments out there, whether on prem or in the cloud.
And people are just, you know, I talked to one CTO of an Israeli startup who told me, I must use Kafka because that's the message bus today, but it's hell, okay? Managing it is so tough. And after he said it, I heard the same from different people, ops managers or tech leaders at some small companies, and understood that it's just a common problem. And then the idea for the book, I mean, it didn't come to me; I got an offer to write the book, and it took me some months because it's a lot of work. Then I decided to gather all the stuff that I had compiled during the 6 years of working with Kafka.
And that was my motivation. I'm not saying that people shouldn't use Amazon MSK or Confluent or Aiven, but I think that those who already manage Kafka should have a better guide than no guide, first of all. And also those who manage their own Kafka and say, okay, it's too tough, we need to pay a license and move to some managed service; I think that if they have the right guide, it can save them money. I've worked with open source for 20 years, and I see this as my small contribution to the open source community.
[00:26:01] Unknown:
To the point of saving money and whether to run your own Kafka cluster or use a managed service, there are a lot of considerations that go into all of that as well as the use cases of what you're going to be building on top of Kafka and what are the sizing and scaling requirements. I'm wondering for people who are in that position of deciding, do I want to use Kafka? How do I want to use Kafka as far as self managed or managed service? And what are the parameters along which I need to project what my cost is going to be? What are the different elements that they need to be considering and planning for as they start to evaluate and do those initial deployments?
[00:26:48] Unknown:
So, first of all, you need to understand your traffic, okay? Let's assume you are not a small company and you are at the decision point of whether you need to use Kafka, first of all, and then, if you use Kafka, whether you're going to use a cloud managed service or manage it yourself. Regarding alternatives to Kafka, I'm not aware of any; however, I never searched for alternatives, because Kafka is just everywhere. So let's say you chose Kafka and you have enough traffic that you can build even a minimal cluster of 3 brokers.
But let's assume you have more, and let's assume you have multiple Kafka clusters, one per team or service or group or whatever. In that case, first of all, you need to know your traffic: the number of topics per cluster, the number of partitions, and who your consumers and producers are per topic. So understand the producing rate and the consuming rate, and then you need to provision a cluster. Even before you know whether it's cloud based or not, you need to understand how much CPU you're going to require, how much RAM in order to support not the Kafka process itself, but the page cache, because you want data to be read from the page cache and not from the disk, and allocate enough disks, cheap disks, as cheaply as possible, to handle the retention requirement.
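The provisioning pass described here is essentially back-of-envelope arithmetic: write rate times retention times replication factor, plus free-disk headroom. A minimal sketch, where every workload number is a hypothetical placeholder rather than a recommendation:

```python
import math

# Rough Kafka broker-count estimate from traffic and retention requirements.
# All inputs describe a made-up workload, purely for illustration.

def size_cluster(write_mb_per_s: float, retention_hours: float,
                 replication_factor: int, disk_tb_per_broker: float,
                 headroom: float = 0.4) -> dict:
    """Estimate disk needs, keeping `headroom` fraction of each disk free."""
    stored_tb = write_mb_per_s * 3600 * retention_hours / 1_000_000  # one copy
    total_tb = stored_tb * replication_factor        # all replicas on disk
    usable_per_broker = disk_tb_per_broker * (1 - headroom)
    brokers = math.ceil(total_tb / usable_per_broker)
    return {"stored_tb": stored_tb, "total_tb": total_tb, "brokers": brokers}

# Hypothetical workload: 200 MB/s of producer traffic, 72 h retention, RF 3,
# 16 TB of disk per broker.
plan = size_cluster(200, 72, replication_factor=3, disk_tb_per_broker=16.0)
print(plan)
```

Note this sketch only covers the storage axis; as the discussion says, CPU and, especially, RAM for the page cache have to be estimated separately against the consuming rate.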
And now after you have this estimation and you know which how many brokers you have and what's the size of the broker, now it depends the decision whether it's cloud based or not. If you are cloud based, you you will usually pick cloud based. But again, some companies decide, okay, I will put my Kafka on prem because I know it's it cannot be spot instances. But most of the companies which are cloud based will will deploy it on the cloud, and they will deploy it probably on on demand instances if they're on AWS. Some if if they have a strong DevOps team, they will, deploy it on a on a Kubernetes, maybe. But, but I I don't have an experience with Kafka and Kubernetes, so I can't say anything about it. I also didn't write about it, of course.
But, so so and if you're on prem, you will deploy it on prem, of course. Now the question may of whether to deploy it on like, purchase a a license for, for Confluent or for Amazon MSK, or for Ivan. These are some of the many services. Or use it or or manage it yourself. It really depends on on the on how your ops team, whether it's DevOps or DataOps, know how to handle production issues. And let's say, you know, ops teams are are rare. Good ops teams, I mean, are are rare. And and I think that if you if, like, you need good mon an excellent monitoring of Kafka. And when I say an excellent time, I don't mean hundreds of metrics.
Okay? I mean, I mean, a minimal subs a minimal set of metrics that will show you where production issues can occur. And in my book, I have 2 chapters, 1 on producer metrics, 1 on consumer metrics, and is 3 chapters on CPU RAM and Disks for the broker themselves. So I think, like, half of the book talks about how to diagnose issues, and and from this, the the reader can understand what to monitor. So you need an excellent monitoring, but not too large. You need, like, specific monitoring, in order to deploy it yourself. Because if you are blind in Kafka, you will pay a big price. You will have downtime, and you will just lose data.
So you must have excellent monitoring and a team that knows how to manage a Kafka cluster. Otherwise, choose a managed, cloud-based service, because the knowledge of handling Kafka clusters is pretty rare. What happens at some customer sites, and I saw it on prem, is that Kafka just halts and they lose data because there's not enough knowledge. That's not the situation at my current company, of course. We have an excellent DevOps team, and its team leader was also the technical editor for my book. He had a big influence on the content and on what parts to focus on. His name is Oronon, and I am very grateful that he invested the time to read and edit the book.
But again, some companies have an excellent DevOps team, and some don't. If you're willing to lose data at some point, then use the open source. If not, then pay the license for managed Kafka.
[00:33:02] Unknown:
And on that note of data loss and cluster uptime and stability, what are some of the failure conditions that cluster operators need to think about that might lead to data loss, and some of the ways to mitigate and plan for it in order to reduce the time to resolution?
[00:34:20] Unknown:
In terms of preventing data loss, let's start with prevention and then go to what can cause data loss. First of all, you have replication. Now, replication goes hand in hand with the size of the cluster, because the higher the replication factor, the more copies each segment has on different brokers. However, you will need more storage, and more storage sometimes means more brokers, and even if not more brokers, it means more disks. So a higher replication factor will cost you more. You can also add another level of assurance and deploy RAID, let's say RAID 10, RAID 1 plus RAID 0, and this will double the amount of storage that you need.
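As a rough sketch of the storage math described here — replication multiplies the raw retained data, and RAID 10 doubles it again — a minimal, hypothetical calculator might look like this. The function name and the headroom figure are my assumptions for illustration, not numbers from the episode:

```python
def required_storage_gb(daily_ingest_gb: float,
                        retention_days: int,
                        replication_factor: int,
                        raid10: bool = False,
                        headroom: float = 0.3) -> float:
    """Estimate raw broker storage for a Kafka cluster.

    Each segment is stored replication_factor times across brokers;
    RAID 10 mirrors every disk, doubling the physical footprint.
    headroom leaves slack so retention never pushes disks to 100%.
    """
    raw = daily_ingest_gb * retention_days * replication_factor
    if raid10:
        raw *= 2  # RAID 10 halves usable capacity
    return raw * (1 + headroom)

# 100 GB/day ingest, 7 days retention, RF=3:
# 2100 GB raw, more once RAID and headroom are accounted for.
print(required_storage_gb(100, 7, 3))
print(required_storage_gb(100, 7, 3, raid10=True))
```

The point the sketch makes concrete: bumping the replication factor from 2 to 3, or layering RAID 10 on top of replication, is a storage (and therefore cost) multiplier, not a rounding error.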
So replication is one thing. Making sure that your producers receive acks, acknowledgments, from the broker is another thing, although it may affect latency. Then there's retention policy. It's a pretty simple policy, and you wouldn't believe the number of times consumers lose data because of it. You can define it by time, by size, or by both, in which case the effective threshold is the smaller of the two. And if you define it by size, and suddenly the topic's traffic increases even by a small percentage while your consumer lags, then you will lose data, or some consumers will.
So if the audience takes one thing away regarding retention, it's this: strongly consider configuring retention by time, not by size, because you really don't know the traffic at any point in time. Some producer can multiply the traffic by ten because there was some filter and a developer just removed it, and suddenly you get ten times the traffic. However — again, nothing is simple in Kafka — if you configure retention by time, then you might hit 100% storage. So there's this equilibrium between retention time and storage. Again, we're back to storage.
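The failure mode described here — size-based retention silently deleting data that a lagging consumer hasn't read yet, especially after a traffic spike — can be sketched as a back-of-the-envelope check. The numbers and the function name are hypothetical, chosen to illustrate the spike scenario:

```python
def will_lose_data(throughput_mb_s: float,
                   retention_size_mb: float,
                   consumer_lag_s: float) -> bool:
    """With size-based retention, the log only holds
    retention_size_mb / throughput_mb_s seconds of data.
    If the consumer is further behind than that window,
    Kafka deletes segments the consumer never read."""
    seconds_retained = retention_size_mb / throughput_mb_s
    return consumer_lag_s > seconds_retained

# 100,000 MB of retention at 10 MB/s holds ~10,000 s of data,
# so a consumer lagging 1 hour (3600 s) is still safe...
print(will_lose_data(10, 100_000, 3600))
# ...but a 10x traffic spike shrinks the window to ~1,000 s,
# and the same 1-hour lag now means data loss.
print(will_lose_data(100, 100_000, 3600))
```

This is exactly why the advice above favors time-based retention: the time window doesn't silently shrink when a producer's traffic multiplies.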
What about hardware failures? Monitor them. On prem, you can just check the status of the disks. I repeat again: disks, disks, disks. The disk is the cheapest part of Kafka but the most critical part, and the cause of many, many problems. On prem, there are tools to monitor the state of the disks, and here's a recommendation that I assume can save several clusters: if a disk becomes read-only, so you can read from it but not write, it usually means you need to replace it. Running fsck on the disk may help for several minutes, hours, or days, but after that it will become read-only again, producers won't be able to write to it, and it will create a whole mess. So monitor the disks.
On prem, you have tools to check the disk itself, and on the cloud you can check the disk utilization. The Linux tool iostat — running `iostat -x`, printing every second — can show you that if disk utilization is at 100% all the time, something is wrong with your disk. There are a lot of metrics, but another one that happens to show up around every production issue is CPU system time. There are four main CPU metrics: user time, system time, iowait (which is time waiting for disk or for network), and context switches. If the system time goes above, let's say, 10%, you should suspect something is not right in your cluster. If your context-switch overhead reaches more than 3 or 4%, check what's happening there.
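The thresholds mentioned here — system time above roughly 10%, context-switch overhead above roughly 3–4% — could be wired into a trivial check like the following. The function name and alert strings are mine, and the cutoffs are the speaker's rules of thumb, not hard limits:

```python
def cpu_warnings(user_pct: float, system_pct: float,
                 iowait_pct: float, ctx_switch_pct: float) -> list[str]:
    """Flag the CPU symptoms worth investigating on a Kafka broker,
    using the rough rule-of-thumb thresholds from the discussion."""
    warnings = []
    if system_pct > 10:
        warnings.append("high system time: suspect kernel or disk trouble")
    if ctx_switch_pct > 4:
        warnings.append("high context switches: too many I/O threads per disk?")
    if iowait_pct > user_pct:
        warnings.append("iowait dominates: broker is waiting on disk or network")
    return warnings

# A healthy broker mostly burns user time; 15% system time is suspicious.
print(cpu_warnings(user_pct=40, system_pct=15, iowait_pct=5, ctx_switch_pct=2))
```

A healthy Kafka broker, per the point made above, spends its CPU budget almost entirely in user time; the other three metrics rising is the cluster "talking" before it fails.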
A common cause of a high context-switch percentage, by the way, is simply having too many disk threads compared to the number of disks you have. And if you want to be very cautious, you can back up your data regularly, either by using Kafka's tiered storage feature, which I admit I haven't used so far, or by just adding another consumer that takes the data from the topic and persists it to some cheaper storage, like HDFS or S3, where storage is separate from compute.
So those are some ways to prevent data loss — or rather, what to do after you already have a problem. But the real question is how you prevent the problems. And from my experience with Kafka in production over the last 6 years, Kafka clusters talk; they hint before they cripple you. They tell you things through the disk utilization, which you can see using the iostat tool. They tell you through the top command: check your CPU system time, check the context switches, check the CPU wait time.
If you have GC issues in your Kafka process, it will tell you through spikes in user time during a full GC. You can use the jstat tool, which is part of the JDK tools, to check the frequency of full GCs. Again, about half the chapters of the book cover various cases I stumbled upon regarding what can lead to data loss. I had a colleague — I gave him a copy of the book — who asked me, "Okay, but you didn't talk about the issue of under-replicated partitions."
And I replied to him: well, every problem in Kafka can result in under-replicated partitions, or in partitions that don't appear in the list of in-sync replicas, the ISR. And there are tens of problems, which might not even be in Kafka itself — they could be in the operating system. For example, if you deploy another service, an antivirus or a firewall, and it scans the segments all the time, you will have high disk utilization. That's not Kafka's fault, but it will cripple your Kafka.
And I saw on-prem clusters where you shut down the firewall and it comes up again because of some policy. So you need to remember that you don't always run alone on the broker. Asking what can cause data loss is like asking what you should monitor in Kafka and how to deal with it — and that's the topic of the book in general.
[00:43:09] Unknown:
And given your experience both working as a backend engineer and operating Kafka clusters at scale, I'm wondering: if you were in the room today redesigning Kafka from the ground up, what are some aspects of the system design that you might choose to revisit or revitalize?
[00:43:34] Unknown:
I think the design, the log-based approach, makes a lot of sense. I don't have many changes. I can't think of any change other than one, which for me is a bug — I don't know if they declare it as a feature. If you have more than one disk, partitions are spread among the disks by the number of partitions per disk, not by the amount of storage per disk. And I don't think the Kafka community understands how this affects Kafka cluster owners deciding whether to go RAID or JBOD.
If it were fixed, the amount of cost saved on disks, and maybe on clusters, would be pretty big, because there are ops teams that say: okay, I don't want to go JBOD because of this — I don't want to handle the spread of the data among the disks. But other than this feature-slash-bug of spreading data by number of partitions rather than by storage, I must admit that I haven't been on the application side for a long time, because, to be honest, it interests me much less than the ops side.
So my focus is on the ops side, and this question comes down to something that could be asked about every big data cluster — but in Kafka, it's really hard to detect and understand the root cause of production issues. To be honest, I don't know why fixing a problem in Druid or Spark or Spark Streaming or Trino takes much less time than understanding why data was lost in Kafka or why the disk utilization is so high. I don't fully understand it, but if I compare the amount of time I invested in every Kafka production issue to other clusters, it's between five and ten times more. And this brings me back to the motivation for writing the book: Kafka uses the operating system in a way that no other open source project I know of does — except the CPU, by the way. On the CPU it's very simple: user time, and that's it, if you do things right.
But the usage of the page cache, the thrashing of the page cache when you have lags, and the amount of stress that comes down on the disks — I don't think those who created Kafka, or those who develop it and contribute to it, fully appreciate this. I think there's a split between those who develop and those who maintain, and there's not enough connection between the two sides. And maybe — I'm thinking out loud now — this is another reason the book is important both for developers and for ops teams: not only to understand Kafka, but for developers to understand the ops team, because so many problems are caused by one disk that goes rogue.
One disk. You can have a cluster with tens of disks, and one disk going bad can cause the whole cluster to halt. A healthy cluster should not suffer from such an issue. I remember getting a call about 5 years ago from a customer site that had 3 brokers, each with 6 HDD disks configured in RAID 10, and one disk went bad. But in RAID, you see only one logical disk, and the utilization in iostat shows the highest utilization among all the disks in the RAID, so you see 100%. So I had to guess that it was one faulty disk on one machine.
And I told the support engineer: go to the server room and check if the light is flashing on one of the disks. He told me yes, the light was on one of them. But imagine having to do that — I had to resort to guessing because of the lack of monitoring. So when I think about it, something needs to change. Not in how Kafka works the disks, because that's what Kafka does: it writes massive amounts of data to disk, and it tries not to read from disk, but at many customer sites and clusters, it does read from disk.
First of all, ops teams don't know that it reads from disk. Secondly, many clusters are deployed on RAID. So how can you help them understand that one disk got screwed? And in the cloud too, by the way: when one disk goes bad, the cloud provider won't tell you, because they can't really say, "this disk has been at 80% utilization for 30 minutes; we don't know if that's good or not, so we won't tell you anything." And then you get into high iowait, and you replace the broker. So the mitigation for these issues in Kafka is brute force.
And to figure out that you have a disk issue, you need to be a magician. The developers of Kafka just assume, okay, someone will handle it — it's not us, we just develop. I made a career shift that's not common, going from developer to ops, so I understand the frustration on both sides, and I think there should be more people who know both paradigms, because a lot of production issues originate from a lack of understanding, cooperation, or knowledge sharing between the two. If the Kafka community had better communication between the developers and the ops teams, I think it would be much easier to detect disk issues, which I bet cause at least a third of the problems in Kafka.
[00:51:26] Unknown:
Yeah, it's definitely always a challenge balancing the developer's "I just want to get something shipped and do something cool with some fancy new feature" with the operations team's "I just want you to stop crushing my machines so that I can sleep at night."
[00:51:40] Unknown:
Yes. In the Kafka community, I think there's a lack of ops people who would check these developments. And it's not even about new development. I'm asking the Kafka community: go to Kafka deployments and ask customers what percentage of production issues is caused by disks. And then — I don't know — maybe every Kafka tool needs better monitoring of disks. And if there is monitoring, how do you read it? Brendan Gregg has an excellent explanation of how to read the output of iostat. You have the utilization.
The service time column is obsolete; no one needs to look at it. Beyond utilization, you have throughput, which is read megabytes per second and write megabytes per second, and you have IOPS, which is reads per second and writes per second. And here is a fact that not many people know: disks saturate at around 60% utilization. Only 60%, which means that past that point, every increase in disk utilization makes the iowait worse and worse. It's not like CPU, where the recommendation on cloud is something like 75% CPU saturation.
Above that, your load average increases in a nonlinear way. So how should a Kafka ops team read the output of iostat? Well, it's simple: look at the disk utilization. I've seen ops teams tell me, "We reached 100% disk utilization — this is not good." No, this can be okay, because Kafka works in bursts. It writes a lot of data in a short amount of time, so you'll see high disk utilization caused by write megabytes per second and writes per second. This is fine. Then you should see it drop to zero, and then another burst of writes.
But if you have 100% utilization because of reads, it means you're reading from the disk, which means you have a problem somewhere. Maybe you have a consumer lag. Maybe a broker replicating the data is lagging behind, which means that broker is not in the ISR list for that partition. So just look at the output of `iostat -x`, printing every second. If you have several seconds of 100% utilization from writes, that's okay. But the same 100% utilization from reads is not okay.
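This rule of thumb — short bursts of 100% utilization from writes are normal, while 100% driven by reads signals lag — could be applied to parsed `iostat -x` samples roughly like this. The tuple format is a simplification I'm assuming for illustration; real iostat output has many more columns:

```python
def classify(samples: list[tuple[float, float, float]]) -> list[str]:
    """samples: (util_pct, read_mb_s, write_mb_s), one per second.

    Kafka writes in bursts, so write-driven 100% spikes are expected.
    Read-driven 100% suggests a lagging consumer or a lagging replica
    pulling cold data off the disk instead of the page cache."""
    verdicts = []
    for util, rd, wr in samples:
        if util < 100:
            verdicts.append("ok")
        elif wr >= rd:
            verdicts.append("write burst: normal")
        else:
            verdicts.append("read-driven: investigate lag")
    return verdicts

# idle second, then a normal write burst, then a suspicious read storm
print(classify([(20, 0, 5), (100, 1, 300), (100, 250, 2)]))
```

Something along these lines is what the alerting tool discussed next would automate: distinguishing healthy write bursts from the sustained, read-driven saturation that precedes data loss.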
Now, if the community took this and made it an alert or a monitoring tool, and ops teams learned how to read it, it would reduce the frequency of production issues — by more than 10%, for sure. Just this one feature: detect lags by looking at iostat.
[00:55:27] Unknown:
In your experience of working with Kafka clusters and helping customers and end users manage and ensure their uptime, what are some of the most interesting, unexpected, or challenging production problems that you've had the opportunity to diagnose?
[00:55:46] Unknown:
Okay, I'll tell the most bizarre production issue and the most interesting one. The most bizarre, which I already mentioned, was a cluster of 3 brokers with 18 disks, 6 per broker, configured in RAID 10, where one faulty disk crippled the whole cluster. Because it was RAID 10, only two disks were effectively up, and the disk utilization was already pretty high, so this problem just added to the party. I had to guess that one disk was faulty, and the support engineer confirmed that one disk really was faulty after he went to the server room. And of course its twin disk in the RAID mirror didn't function either.
And the disk utilization was already very high, so this brought the disk utilization to 100% on that broker. A third of the partition leaders were on that broker, so producers couldn't write to them and consumers didn't read from them. And once consumers can't read from even one partition, they just cannot function. It depends on the nature of your consumers, of course, but this was a streaming application that had to read from all partitions, so it got stuck. And even if it weren't an application that needs to read from all partitions, you'd have partial data. So that was a very bizarre problem.
It combined on prem, high disk utilization, and guessing that one disk was faulty. The most interesting problem I ran into took several weeks, I think. It involved brokers where, from a certain point in time, every broker added to the cluster after some other broker failed would eventually reach 100% disk utilization, and nobody managed to write to it or read from it. Of course, since nobody managed to write, there was nothing to consume. And every time we replaced that broker with another broker, the same phenomenon happened again.
And then, looking at iostat after a long, long time of trying to understand what was going on — it was like voodoo — we noticed that on some brokers the disks stayed at 100% utilization the whole time. But when we looked at the throughput, we saw it was half the throughput that put disks on other brokers at only one or two percent utilization. So not only did the healthy brokers reach 100% utilization for just a few seconds before dropping back down, these brokers with faulty disks stayed at 100% disk utilization constantly while pushing half the throughput.
So just correlating the disk utilization with the throughput, and with the amount of time the utilization stayed at 100%, led us to the understanding that these were simply faulty disks. But it took a lot of spreadsheets, trying to correlate every OS metric, until we found it out. And that was by far the most interesting production issue I've stumbled into in Kafka.
[01:00:20] Unknown:
And in your work of writing the book and consolidating all of the information and experience that you've had working with Kafka, I'm wondering if there are any insights that it helped you gain or any new knowledge that you were able to obtain in the process.
[01:00:39] Unknown:
Of course. Mainly, it helped me formulate the three legs that Kafka stands on — the three things you need to understand: the data part, the OS part, and the Kafka part. I was surprised to see that the Kafka part is only a third of the book, which shows how important the data part is — how the way the data is spread among the partitions matters so much for the health of the cluster. And I was also surprised by how many production issues originate from a problem with storage.
Also, I found several producer and consumer metrics that were new to me, because I had thought many issues could be fixed by tuning the linger and the batch size on the producer, and then I found several very important metrics on the consumer and producer side. This was also new to me; I ran across them in production issues I dealt with during the time I was writing the book. And I must say that a lot of things written in the book were not things that only I discovered.
I worked with several people, and we found the issues together. So in part it was just documenting what a team of ops people found out, including me, but also other people. To make this more specific, take the issue of storage. When I wrote the part on storage usage, it turned out to be more vast than I had thought. So, for example: running out of disk space due to retention configuration.
What happens when you configure both time-based and size-based retention? There is a way to lose data, and I was surprised to see how two simple configurations like these can cause data loss. So: retention policy and its effect on data loss, and explaining how to add storage to a cluster and how that differs between on prem and the cloud. When you're on prem, this is not only a technical decision — it's also a managerial or financial one, because you can't tell the owner of the data center, "Okay, I made a mistake. I don't need the 2-terabyte disks, I need 4-terabyte disks, so I need to throw away all the 2-terabyte disks and buy 4-terabyte ones." It will not go smoothly.
So understanding these aspects — this chapter became partly technical and partly about how, as a provider for an on-prem customer, you manage this issue. With DIMMs it's the same: "Okay, it was a mistake. I bought 16 gigabytes and I need 32 or 64." How do you get that decision approved? How do you mix DIMMs on prem? And there's also the effect of retention on data replay. Sometimes you need to replay the data because you made some wrong transformation, so ops teams need to understand that they need storage for replay as well.
Then data skew: how data skew can cause data loss even if you have a lot of storage. If you don't partition the data correctly, you will get data loss at some point, even in just one partition — and for certain consumers, that's like losing data in all partitions; you need to replay the data again. So the data aspects were also something I learned along the way. And that's only an example from one chapter, the chapter on storage usage. Not only did I learn during the writing of the book — I think that if I hadn't written it, I would have forgotten almost everything.
So the personal benefit for me is that I remember things — Kafka-related things that I knew and didn't forget, but also things I learned along the way. Investing 10 months of weekends in writing the book was absolutely beneficial for my technical knowledge.
[01:06:29] Unknown:
Are there any other aspects of the work that you've done with Kafka, your work on the book, or the overall Kafka operations ecosystem that we didn't discuss yet that you'd like to cover before we close out the show?
[01:06:43] Unknown:
I think we covered a lot of the technical stuff. We could discuss cost reduction — in short, I'd like to mention that there is a chapter on cost reduction in Kafka. I brought six real-world examples of clusters that I stumbled upon; for each example, I specify how much CPU, RAM, and disk the cluster has, and also the usage of each of these resources. And then I ask whether the cluster can be scaled down or scaled in.
And then I discuss other monitoring metrics, and by correlating the Kafka monitoring metrics with the operating system usage metrics, I give a recommendation on whether you can scale the cluster in or down. The cost of a Kafka cluster isn't big, I think, compared to other clusters in an organization. But for cloud-based deployments, I assume most are on on-demand instances — even with a reservation, it's still on-demand, not spot — so it's important, especially in today's market, to squeeze out every penny that you can save.
So the cost reduction part is something that can help reduce cost on Kafka. But there is a part I didn't talk about, which might be even bigger than the machines themselves: the data transfer between consumers and brokers. Because if there is no rack awareness in the cluster, consumers will read data only from leaders, and statistically most of those leaders won't be in the same AZ as the consumer. For some companies, configuring rack awareness can save hundreds of thousands of dollars per year.
But since I don't have experience with rack awareness, I didn't discuss it thoroughly. For those listening, checking whether you can configure rack awareness between your consumers and brokers can be beneficial. So check your data transfer costs, and maybe it's worthwhile for you to invest in deploying, testing, and validating that rack awareness works. Then you will read not from the leaders but from the closest replica, and save data transfer costs.
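As a sketch of the rack-awareness setup alluded to here (follower fetching, introduced in Kafka 2.4 via KIP-392): brokers advertise their AZ and enable the rack-aware replica selector, and each consumer declares its own AZ so fetches can go to a same-AZ replica. The addresses and names below are placeholders, and you should confirm the exact settings against the Kafka documentation for your version:

```python
# Broker-side settings (server.properties), one per broker:
broker_props = {
    "broker.rack": "us-east-1a",  # the AZ this broker runs in
    "replica.selector.class":
        "org.apache.kafka.common.replica.RackAwareReplicaSelector",
}

# Consumer-side settings: with client.rack set, fetches go to the
# closest in-sync replica instead of always hitting the partition
# leader, avoiding cross-AZ data transfer charges.
consumer_props = {
    "bootstrap.servers": "broker1:9092",  # placeholder address
    "group.id": "analytics",              # placeholder group
    "client.rack": "us-east-1a",          # must match the consumer's AZ
}

print(consumer_props["client.rack"])
```

The key operational detail: `client.rack` on the consumer must match the `broker.rack` of the brokers in its AZ, or the selector falls back to the leader and the cross-AZ bill comes back.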
[01:10:08] Unknown:
Yeah, that can definitely be a substantial cost when running in the cloud, and it's always one of those surprise gotchas when you're first getting up and running in a cloud environment.
[01:10:20] Unknown:
Not only when you're getting started — after years, you still see the data transfer costs. And this brings us back to what we started with, cloud versus on prem. It adds to the list of reasons why, for some clusters, it might be a good idea to consider on prem. I'm saying this because I came from on prem, so it's not foreign territory for me. Okay, so you don't have managed services, but from the financial perspective, it might be the right choice for some clusters to be deployed on prem.
[01:11:14] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?
[01:11:32] Unknown:
Oh — I usually work with analytics clusters, and I think the gap is a tool that would show, at any given point in time, the correlation between the traffic — whether it's incoming traffic or query load — and the usage of the cluster in terms of CPU, RAM, and disk, or even internal usage. For example, in a Druid cluster the bottleneck is sometimes the number of workers; for Trino clusters, the number of splits. If some tool showed the correlation between the load on the cluster and the real usage and cost, it would allow ops teams to better understand whether they can save cost on the cluster: whether they can scale it down, replace on-demand instances with spot, or replace reserved on-demand with plain on-demand and then auto-scale. So, a tool that shows the correlation between applicative usage and resource usage, and enables saving costs — because especially in today's economy, it's become pretty important to save costs.
[01:13:18] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your experiences of running and operating Kafka clusters, and the work that you've done on the book to make that easier for everybody else to do as well. It's definitely a very challenging and necessary task, and as you said, Kafka is very widely deployed, so I appreciate the time and energy you put into sharing your hard-won knowledge with everyone else, and I hope you enjoy the rest of your day.
[01:13:46] Unknown:
Cool. Thank you very much. Again, thank you for hosting me, and I hope the audience will gain something from this podcast.
[01:14:03] Unknown:
Thank you for listening. Don't forget to check out our other shows, podcast.init, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Elad Eldor and His Work
Journey into Data Engineering and Kafka
Challenges of Operating Kafka On-Prem vs. Cloud
Kafka in Production: Real-World Issues and Solutions
Preventing Data Loss in Kafka Clusters
Design Considerations for Kafka
Interesting and Challenging Kafka Production Problems
Insights Gained from Writing the Book
Cost Reduction Strategies for Kafka Clusters
Future of Data Management Tooling