Summary
Kafka has become a de facto standard interface for building decoupled systems and working with streaming data. Despite its widespread popularity, there are numerous accounts of the difficulty that operators face in keeping it reliable and performant, or trying to scale an installation. To make the benefits of the Kafka ecosystem more accessible and reduce the operational burden, Alexander Gallego and his team at Vectorized created the Red Panda engine. In this episode he explains how they engineered a drop-in replacement for Kafka, replicating the numerous APIs, that can scale more easily and deliver consistently low latencies with a much lower hardware footprint. He also shares some of the areas of innovation that they have found to help foster the next wave of streaming applications while working within the constraints of the existing Kafka interfaces. This was a fascinating conversation with an energetic and enthusiastic engineer and founder about the challenges and opportunities in the realm of streaming data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- If you’re looking for a way to optimize your data engineering pipeline – with instant query performance – look no further than Qubz. Qubz is next-generation OLAP technology built for the scale of Big Data from UST Global, a renowned digital services provider. Qubz lets users and enterprises analyze data on the cloud and on-premise, with blazing speed, while eliminating the complex engineering required to operationalize analytics at scale. With an emphasis on visual data engineering, connectors for all major BI tools and data sources, Qubz allow users to query OLAP cubes with sub-second response times on hundreds of billions of rows. To learn more, and sign up for a free demo, visit dataengineeringpodcast.com/qubz.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Alexander Gallego about his work at Vectorized building Red Panda as a performance optimized, drop-in replacement for Kafka
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Red Panda is and what motivated you to create it?
- What are the limitations of Kafka that make something like Red Panda necessary?
- What are the current strengths of the Kafka ecosystem that make it a reasonable implementation target for Red Panda?
- How is Red Panda architected?
- How has the design or direction changed or evolved since you first began working on it?
- What are the challenges that you face in automatically optimizing the runtime to take advantage of the hardware that it is deployed on?
- How do cloud environments contribute to that complexity?
- How are you handling the compatibility layer for the Kafka API?
- What is your approach for managing versioning and ensuring that you maintain bug compatibility?
- Beyond performance, what other areas of innovation or improvement in the capabilities and experience do you see while adhering to the Kafka protocol?
- What are the opportunities for innovation in the streaming space that aren’t being explored yet?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Red Panda being used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building Red Panda and Vectorized?
- When is Red Panda the wrong choice?
- What do you have planned for the future of the product and business?
- What is your Hack The Planet diversity scholarship?
Contact Info
- @emaxerrno on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- @vectorizedio Company Twitter Account
- Concord, an alternative to Flink
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Their comprehensive data-level security, auditing, and de-identification features eliminate the need for time-consuming manual processes, and their focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms.
Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta, that's i m m u t a, and get a 14-day free trial. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode, that's l i n o d e, today and get a $60 credit to try out a Kubernetes cluster of your own. Don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Alexander Gallego about his work at Vectorized building Red Panda as a performance-optimized, drop-in replacement for Kafka. So, Alexander, can you start by introducing yourself? Hi, Tobias. Thanks for having me here.
[00:01:57] Unknown:
I am the founder and CEO of Vectorized, working on Red Panda, which is a drop-in replacement for Kafka for mission-critical systems. I've been working on streaming for the last 10 years. I actually started on the computational side of things. I was the CTO and founder of a company called Concord, which is similar to Apache Flink, but we wrote it in C++, and we sold that to Akamai in 2016. During my tenure at Concord, we basically couldn't find a storage system that could keep up with that system, and I've now moved to the storage side of things with Red Panda.
[00:02:32] Unknown:
And do you remember how you first got involved in the area of data management?
[00:02:36] Unknown:
I actually went to school for crypto, and I decided to drop out of grad school. Anyways, after college, I went to work for this ad tech company in New York City. And from a technical perspective, it was super cool. The idea was, like, let's compete with Google on ad tech for the premium publishers and advertisers. I was the first engineer there, so it was a really exciting opportunity for me as a recent college grad. And what we did there was build this real-time machine learning pipeline for ad bidding, which doesn't sound super cool today. But honestly, 10 years ago, it was really innovative, and we were basically printing money by doing that. You know, that's how I got started.
[00:03:18] Unknown:
You mentioned that Red Panda is this drop in replacement for Kafka with a focus on mission critical applications. So I'm wondering if you can give a bit more context as to what type of applications those might be and some of the ways that Red Panda facilitates that use case.
[00:03:34] Unknown:
Effectively, through a lot of market research, we focus on three tenets for the product. And then I'm gonna tell you a little bit more about the particular use cases that I think we're a really good fit for. Before I started this out as a full company, I interviewed something like 40 companies, and they spanned everything from IoT to big tech. And what we found is that the majority of the companies were actually unhappy running Kafka. What was really good about Kafka was actually the entire ecosystem. Like, the Kafka API sort of won, but the guarantees of Kafka, whether it was data safety or running Kafka at scale, a lot of those companies were having difficulties with. So around 50% of the companies in this survey were really having difficulties running Kafka. And for those that did manage to operate Kafka at scale, the usual suspects, you know, Netflix, Dropbox, etcetera, people with deep distributed systems expertise, they weren't getting enough performance. And so what you end up with is massively overprovisioned clusters.
And for some small percentage, they actually cared a lot about data loss. What's interesting is that across probably the last 100 enterprises we've talked to, the majority of enterprises actually run Kafka in an unsafe mode. So when we say mission-critical systems, it's really people that want to manage one system. They wanna have one fault domain. They don't wanna manage ZooKeeper. They don't wanna understand how to manage ZooKeeper snapshots or run TLS on the leader election port of ZooKeeper or restate the snapshot or understand how to tune the JVM. They really want this streamlined system that kind of gets out of the way. That's really the first tenet: operational simplicity. The second one was safety. And if you look at what happened in hardware over the last 10 years: a decade ago, a company called Backblaze published statistics about hardware, and 10 years ago, running a terabyte of SSD cost something like $2,500.
But now the cost of running a terabyte of SSD is more like $200. Right? So that was sort of the first bottleneck. And because hardware got better on many levels, on networking, on disk, and on CPU with the rise of many-core systems, we are now able to give stronger guarantees to our users about data safety. And so the main critical piece of our architecture that has had a profound impact for a lot of our customers, whether they're in oil or ad tech or whatever, is really that we're safe. We give them a linearizable Raft implementation to save their data, with a sound fault domain and a sound failure model. Like, as a developer, you actually understand exactly what it means to have two out of three replicas up and running. You understand that there'll be no gaps in the logs. So you have this really strong log completeness guarantee.
You know, if you contrast that with the default settings of Kafka: you can run Kafka safely, by the way, but no one we've talked to in the last 100 enterprises does it, because the performance impact of flushing every message to disk is too slow. So if you contrast that approach with Kafka's, we basically give users safety by default, and we allow them to run with low latency, high throughput, and safety. And so that's really what we mean when we say mission-critical systems.
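For context on the "safe mode" being discussed, these are the standard Kafka settings that trade throughput for durability. This fragment is illustrative only and mixes scopes (flush.* are topic/broker-level, acks is a producer setting, min.insync.replicas is broker/topic-level); it is not a configuration Red Panda requires:

```properties
# Topic/broker level: fsync the log after every single message.
# This is the safe-but-slow mode referenced above; the default
# instead defers flushing to the OS page cache.
flush.messages=1
flush.ms=1000

# Broker/topic level: how many in-sync replicas must persist a write.
min.insync.replicas=2

# Producer level: wait for acknowledgement from all in-sync replicas.
acks=all
```

With the defaults (no per-message flush), an unclean broker crash can lose recently acknowledged data, which is the safety gap the interview keeps returning to.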
[00:06:59] Unknown:
And in terms of the safety aspect, you mentioned that with Kafka you can flush everything to disk on every write, but because of the overhead of the JVM, it ends up being slow. I'm curious if, in your implementation, you're using something akin to a write-ahead log, such as what Postgres uses, for being able to honor that safety guarantee while also maintaining the overall performance and throughput of the underlying hardware?
[00:07:25] Unknown:
This is a deep technical topic, so let me give you the big highlights here. So it is actually not just the JVM. What the JVM adds to a system is really latency. And so there's two aspects to this question; it's understanding the difference between latency and throughput here. Garbage collection adds pauses, and we've seen this in production: we see systems, you know, regularly with two-second JVM latency pauses, and some systems with six-second latency pauses on the outliers. So that's the JVM.
But what I'm talking about is really throughput. And the reason most people don't run Kafka with the safety settings is because it affects not only their latency, but particularly their throughput. And so what happens when you issue an fsync on a particular file handle is that it injects a global barrier in the kernel. So if you write into a file, it actually injects a barrier on the file system itself, because you can't do anything else; you have to wait for the data to be flushed to the physical media. We have to flush the file, too, to guarantee data safety, but we do a lot of tricks around batch coalescing and debouncing the writes in our Raft implementation.
The second thing that we did there, at a technical level, is we don't use the page cache. Right? And so we eliminate a very large set of locks in the kernel. That's the first step; we'll talk about the networking in just a second. But when a message comes in, we're not consuming any resources from the page cache, which means the global locks and resources taken on a per-file-handle basis are just not taken. They're completely eliminated. The next thing is that we have this adaptive file allocation technique. A lot of the barriers, a lot of the performance limitations, come from synchronizing the metadata in the Linux kernel. And so by preallocating the file system space and coalescing the flushes and doing all of these low-level tricks, we're really able to drive probably 4x the throughput of a cluster while being safe, and deliver about 10x lower latency.
So hopefully, that disambiguates some of the performance things that we've seen with our Raft implementation.
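The batch-coalescing idea described above can be sketched in a few lines. This is a toy illustration, not Red Panda's actual C++ implementation: many appends are buffered and then amortized under a single write-plus-fsync, instead of paying the fsync barrier per record.

```python
import os
import tempfile

class CoalescingLog:
    """Toy append-only log that coalesces many appends under one fsync.

    Illustrative only: a real Raft-based log also tracks which waiting
    clients may be acknowledged after each flush completes.
    """

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.pending = []   # buffered records awaiting a flush
        self.flushes = 0    # how many fsyncs we actually issued

    def append(self, record: bytes):
        self.pending.append(record)

    def flush(self):
        # One write + one fsync covers every record batched so far,
        # instead of an fsync per record.
        if self.pending:
            os.write(self.fd, b"".join(self.pending))
            self.pending.clear()
            os.fsync(self.fd)
            self.flushes += 1

    def close(self):
        self.flush()
        os.close(self.fd)

# Usage: 1000 appends, debounced into a flush every 100 records.
path = os.path.join(tempfile.mkdtemp(), "log.bin")
log = CoalescingLog(path)
for i in range(1000):
    log.append(b"record-%d\n" % i)
    if i % 100 == 99:
        log.flush()
log.close()
print(log.flushes)  # 10 fsyncs for 1000 durable records
```

The data is just as durable once flushed, but the expensive kernel barrier is paid 10 times instead of 1000.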
[00:09:36] Unknown:
In addition to the challenges of running Kafka in a safe mode and the performance impacts that that brings, what are some of the other areas where Kafka has limitations that Red Panda addresses, in terms of simplifying the operability of the system beyond just the failure domains?
[00:09:53] Unknown:
One of the things that we're really proud of is our auto-tuner. Our team has a lot of deep technical expertise from big companies like Microsoft, Yandex, Red Hat, Akamai. And one of the things that you have to do at scale is basically automate yourself out of a job, almost. So one of the most challenging things in running any storage system, for the person running the system rather than the expert who wrote it, is to understand: honestly, what are the best settings for my computer? Like, how much memory do I allocate, in particular for the JVM? What are the garbage collection settings? How do I tune these things? So, of course, eliminating the JVM eliminates an entire category of problems. Right? So that makes it simple. The second thing that we do is we try to really deeply understand which kernel parameters affect the runtime performance, both throughput and latency, of Red Panda.
And instead of exposing this as a checklist that says, you know, run this file system, disable merges on the Linux I/O scheduler, enable your multi-queue networking, and all of these very deep technical things, we ship an auto-tuner. And so we guarantee that by the time Red Panda starts, the binary starts, which, by the way, is just a single file, the kernel has the optimal settings. I'll tell you three or four things that we do to optimize the experience for the person that's managing the storage system. So the first thing that we do is probe the hardware for whether the NIC can support multi-queue networking. And if it does, we enable multi-queue networking. The second thing that we do is detect if you're running an SSD.
And if you are, we probe the hardware, and we measure, with a target latency of 500 microseconds, how much data we can push to this NVMe device given that target latency. The third thing we do is tune a couple of other Linux kernel settings. One big one that has an impact: because we use DMA, because we don't use the page cache, because we talk effectively to the device controller, we tell the Linux I/O scheduler not to merge our I/O blocks. In fact, we say, don't do anything, because we, the application, understand much better how to align our memory buffers so that, you know, they're block-aligned like the file system expects. And so all of this is automated.
This kind of step forward in managing the system gives you a system that really gets out of the way. You run one binary. You do a systemctl start redpanda if you run systemd; it auto-tunes the machine and then it runs the thing. Of course, this happens super fast, which is very anticlimactic when I'm trying to show it to my wife. I'm like, look at all the things that we do, and it happens in, like, 100 milliseconds. But from a management perspective, that operational simplicity is really what the CEOs are buying. The technical person, the engineer in the room, they wanna hear all the details: how do we debounce the writes to Raft? How do we batch things? How do we improve latency?
How do we do all these low-level things, whether from the data management or the data structure perspective at the core? But what the CIO cares about is: okay, how hard is it to run this system? That's the first thing they care about. And secondly, what are the things you're improving over the status quo? Giving them this one binary is really, I think, the reason why we're getting so much traction with some of our customers. Not only do we improve on the safety guarantees of Kafka, with lower latency and higher throughput, but we also, from a management perspective, actually reduce the operational overhead of what it means to run a Kafka cluster at scale.
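To make the auto-tuner discussion concrete, here is a toy sketch of the kind of probing it describes. The function names and return shapes are hypothetical; only the sysfs path layout (`/sys/class/net/<nic>/queues/tx-*`) and the scheduler/nomerges knobs are standard Linux conventions, and the real tuner does far more:

```python
import glob
import os

def nic_supports_multiqueue(nic: str, sysfs_root: str = "/sys") -> bool:
    """Heuristic: a NIC supports multi-queue networking if it exposes
    more than one TX queue under sysfs, following the standard
    /sys/class/net/<nic>/queues/tx-* layout."""
    queues = glob.glob(os.path.join(sysfs_root, "class/net", nic, "queues/tx-*"))
    return len(queues) > 1

def io_scheduler_setting(device_is_nvme: bool) -> dict:
    """Settings an auto-tuner might choose: for NVMe devices whose
    application already issues block-aligned DMA writes, tell the
    kernel not to merge or reorder I/O (the 'none' scheduler plus
    nomerges=2, per the block-layer sysfs knobs)."""
    if device_is_nvme:
        return {"scheduler": "none", "nomerges": 2}
    return {"scheduler": "mq-deadline", "nomerges": 0}

print(io_scheduler_setting(True))  # {'scheduler': 'none', 'nomerges': 2}
```

A real tuner would write these values into `/sys/block/<dev>/queue/` before the server process starts, which is the "kernel has the optimal settings before the binary starts" guarantee described above.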
[00:13:37] Unknown:
You mentioned that there are some areas for improvement on the specifics of how Kafka is implemented. But at the same time, the overall ecosystem around Kafka is obviously large and vibrant enough for you to target it as being a drop in replacement. So I'm wondering if you can talk through the overall strengths of the surrounding ecosystem and why it is that you are choosing to be an easy replacement for Kafka rather than building your own protocols and trying to grow your own ecosystem in place of it?
[00:14:10] Unknown:
Yeah. We have to really give credit to Kafka here. What we figured out is that Kafka has won as the de facto API. We've talked to, by now, probably around a hundred enterprises, high-level executives, about running Kafka, and fundamentally, the Kafka API won. The Kafka API has become the de facto standard, and they have a huge ecosystem. There are millions of lines of code. You can take TensorFlow and push it into a topic, and then have that consumed by Spark ML and pushed back onto a topic, and eventually that makes it to Elasticsearch. And all of a sudden, with three or four little processes, you have a very sophisticated machine learning pipeline. Right? And so the API is really, in my opinion, the thing that won. Comparing the Kafka API to what SQL was for all of these databases is really how we're treating it. We said, okay, this protocol has won. Developers love it, and fundamentally, there are millions of lines of code written against it, and we wanna plug into that ecosystem. We wanna give them a better system for it. You can compare Red Panda and the Kafka API with SQL and the whole evolution there: you went from relational databases to NoSQL databases, like Cassandra. Right? So it was Postgres, and then everyone moved to Cassandra-type systems.
And then the last one is the NewSQL or CockroachDB-type system, which gives them, you know, distribution, geo-distribution, and so on. We really think that the Kafka API needs a new system that gives people the guarantees that they're used to. Right? There's this new stack of people that don't wanna run a JVM system, that want a system that integrates well with Kubernetes and all of these modern systems, but is easy to run, gives them better safety, gives them the things that they expect the queue to give them. But in terms of the Kafka API, it was hugely important.
And, honestly, from a business perspective, it was important that we leverage their ecosystem. There was no way otherwise; actually, the first couple of months of the company, I tried to boil the ocean. And clearly, it didn't work. I was like, look, I'm gonna invent this new API. We're gonna solve a bunch of things, like head-of-line blocking and all of these, you know, very low-level details. But I didn't take into account this super rich ecosystem. After talking with a bunch of enterprises, it became clear that what people wanted was not to touch their production application. They just wanted to point their application at a better system, and off you go at 10x the speed. That's really what they wanted. And they wanted a reduction in hardware.
They wanted to get back, for example, engineering capacity, because now you can run the same system with a part-time person rather than two full-time people. So we really thought that the Kafka API had this super rich ecosystem where we can give better guarantees without breaking any existing application.
[00:17:03] Unknown:
And so can you dig a bit more into how Red Panda itself is actually implemented and some of the ways that you are able to be a drop in replacement for Kafka?
[00:17:11] Unknown:
Kafka currently has around 50 different APIs. And each API, by the way, has up to 12 different versions. So let's look at fetch, for example, basically the consume API. There are 12 versions, and then you have an encoder and a decoder per version, per request and response. Right? So I send a message, and that has an encoder and a decoder, so two things. And then I receive a message, which has an encoder and a decoder, two more things. And for those four things, you have 12 particular versions. So there's no way we were going to write this manually. And so we spent the bulk of late last year, honestly, writing a parser combinator library where we could just code-generate the entire Kafka API surface. Kafka has these JSON files, and, it's like a variant of JSON, but it describes the types that the JVM type system expects.
So we can consume those files directly, and then we added an enhancement on top of that with some of the C++20 features to actually give stronger guarantees and prevent some abuse of the native API. That's the first level. And so if you look at fetch, for example, that would be 48 different scenarios that you have to take into account. So there's no way we would be able to keep up with Kafka; we wouldn't have a product if we had to manually write these 160-some APIs. A couple more deep technical things about how we guarantee compatibility: because the ecosystem is so rich, and we really are just targeting the API level, we say, talk to this TCP endpoint, and then we look like Kafka, we talk like Kafka, there's exactly no difference, and all of your applications work. Which means we can download all of these open source projects that are already interacting with Kafka, some of which rely on bug compatibility. In the case of the Sarama driver, for example, there are some particular bugs that we had to imitate, which, by the way, upstream does the same. And so we imported a large set of unit tests from the librdkafka tests, and we just pointed them at our IP. And because we speak Kafka, it simplifies this giant testing matrix: all we have to do is run these open source systems, and we put them on Amazon or Google, it really doesn't matter, and say, you know, test this and let us know where we failed. Those are two of the guarantees.
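The combinatorics described above can be sketched with a miniature, made-up stand-in for one of Kafka's protocol definition JSON files (the real message definitions live in the Kafka source tree and are considerably richer). The point is how quickly versions multiply into codec variants:

```python
import json

# Hypothetical miniature of a Kafka protocol definition file.
FETCH_SPEC = json.loads("""
{
  "name": "Fetch",
  "apiKey": 1,
  "validVersions": "0-11",
  "fields": [
    {"name": "ReplicaId", "type": "int32", "versions": "0+"},
    {"name": "MaxWaitMs", "type": "int32", "versions": "0+"}
  ]
}
""")

def generate_codecs(spec: dict) -> list:
    """Emit one encoder and one decoder name per version, per direction
    (request and response), mirroring the combinatorics that make
    hand-writing the API surface impractical."""
    lo, hi = (int(v) for v in spec["validVersions"].split("-"))
    names = []
    for version in range(lo, hi + 1):
        for direction in ("request", "response"):
            for op in ("encode", "decode"):
                names.append(f"{spec['name'].lower()}_{direction}_{op}_v{version}")
    return names

codecs = generate_codecs(FETCH_SPEC)
print(len(codecs))  # 12 versions x 2 directions x 2 ops = 48
```

Forty-eight generated codecs for a single API, times roughly 50 APIs, is the "160-some" surface that motivated code generation over hand-written parsers.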
And then on the safety level, you know, we hired Denis out of the Cosmos DB team to come in and run Jepsen. In fact, we actually extended Jepsen to be able to analyze longer histories, without going too far down the safety rabbit hole, which we can do next. So we give them safety and API compatibility, and all of the product features actually have an integration test. One of the most fun things, which we spent probably a couple of months writing last year, is this really interesting fault injection test, and we've been running it 24/7 since last year. Every 150 seconds, we inject a fatal failure: we delete, like, a Raft group, or we crash the file system, or we inject network partitions, we delete topics, we create topics, and we have to give the system, you know, time to recover, because at the same time, by the way, we're pushing a gigabyte of data per second. And so we can simulate a much larger fault scenario than most of our customers can. And we know, because we are the experts, what the difficult things are. And in addition to that, we run all of these, you know, fault exploration frameworks, these safety frameworks, these compatibility frameworks.

And that's how we are trying to guarantee safety and compatibility.
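A harness like the one described, injecting one fatal failure every 150 seconds while traffic flows, can be sketched as a deterministic fault schedule. The fault names and function shape are hypothetical, not Red Panda's actual test framework:

```python
import random

# Hypothetical fault catalog, drawn from the kinds of failures
# mentioned above.
FAULTS = [
    "delete_raft_group",
    "crash_file_system",
    "inject_network_partition",
    "delete_topic",
    "create_topic",
]

def fault_schedule(seed: int, duration_s: int, period_s: int = 150) -> list:
    """Pick one random fault per period. Seeding the RNG makes a
    long-running chaos schedule reproducible, so a failure found at
    hour 30 can be replayed exactly."""
    rng = random.Random(seed)
    return [(t, rng.choice(FAULTS)) for t in range(0, duration_s, period_s)]

# One hour of chaos: an injection every 150 seconds.
plan = fault_schedule(seed=42, duration_s=3600)
print(len(plan))  # 24 injections in an hour
```

The recovery check between injections (does the cluster catch back up while sustaining a gigabyte per second?) is the part that makes such a harness valuable, and the part this sketch omits.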
[00:20:48] Unknown:
Another interesting element of this is that while Kafka has become the de facto API for streaming storage systems, there are some other contenders, most notably being Pulsar, which has also admittedly added a protocol compatibility layer for Kafka. But then there's also the work being done for the open streaming standard to try and coalesce around a particular interface for streaming applications. And I'm wondering what your overall thoughts are on the viability of some of these other protocols and the potential areas for improvement in the Kafka APIs and ways that the overall ecosystem might be able to progress to a shared understanding of how best to interact with these types of systems?
[00:21:35] Unknown:
Let's tackle each of those in particular. And I think there's also a set of things we haven't yet talked about, which is what we see as the future of streaming. In a way, as a company, we're trying to position ourselves as the future of streaming for the new stack. So, talking about Pulsar: from a market distribution perspective, I think something like 90% of all the companies that we talk to use Kafka. There are some companies that use Pulsar. Maybe there's bias in the people that want to talk to us, but that's just, you know, the stats that we're seeing. There are very few people running NATS and all the other streaming systems, you know, RabbitMQ; very, very few, just in terms of the market. I was actually speaking with an executive from IBM the other day, and Kafka literally backs multibillion-dollar products for IBM. Like, that's how popular it is. And I think the reason is not necessarily the protocol, but the amount of connectors. Right? The fact that you can take TensorFlow and plug it straight into Elasticsearch, or plug it into a Cassandra-compatible database, or, you know, push it to DynamoDB.
It's kind of this Lego component for a lot of architectures, whether you're trying to use it to detect real-time fraud, or you're trying to use Kafka to deal with back pressure for databases that have a hard time scaling, like Elasticsearch. That is the thing that won. So to answer your question about streaming and the protocol: I think it's great. The great thing about open source and the market is that people will build the applications that they need, and they don't care whether it is a protocol or a standard or something; they're just trying to get work done. And the majority of people are getting work done with Kafka. In a way, it would be really easy for us to provide another parser mechanism or, like, you know, change the format from the Kafka format to JSON; it doesn't matter. But I think what matters is really this plug-and-play architecture where you get to leverage all of the Kafka connectors. And in fact, all of the new databases, whether it's OmniSci or ScyllaDB or CockroachDB, and MemSQL, too, actually have a transactional Kafka connector.
So the Kafka API is so popular that all of these modern database systems have actually built customized drivers to consume data from Kafka in a way that gives them higher and better guarantees, or different guarantees, than the standard connectors. So I'm not sure whether the open messaging standard is ultimately going to win. We're just trying to give users a better system for their existing code. It's really hard to tell. I think it's great that there are, you know, innovations in this space, and it would make, honestly, a lot of our jobs simpler if it wasn't the Kafka API, if it was, like, a standard that had, you know, well-understood semantics. Some of the things that we have to guarantee are really hard, because the bugs are not documented, of course, and you really kinda have to run an application, download a bunch of open source projects, to detect these edge conditions, which is why we leverage the entire open source ecosystem to test our code. But ultimately, I think the Kafka API is the one that won today. It would be very cool if the other messaging frameworks also take off.
[00:24:42] Unknown:
Going a little bit further down this path, the other aspect of the streaming ecosystem — and one of the things that Pulsar touts as an advantage — is that it also supports the more traditional messaging paradigm, similar to what RabbitMQ does, where it doesn't rely on this persisted append-only log for sending messages between systems. I'm wondering what your thoughts are on the utility of that being collocated within a system such as Kafka or Red Panda or Pulsar, versus running it as a dedicated system where you're relying on time-tested applications such as RabbitMQ?
[00:25:21] Unknown:
Request–response messaging, I think, is super useful. The answer to that in the Kafka ecosystem is a thing called Kafka REST, or a proxy. We actually have our own — written in C++, API compatible — where you send and receive messages. Under the hood, it proxies the protocol from REST, for this request–response pattern, to the Kafka protocol. The majority of the use cases we talk to are either large web 2.0 companies — like the largest analytical database in the world, the largest CDN, people doing fraud detection for the largest banks — or on the hedge fund side, where they're trying to do settlements. They care quite a bit about data loss, as you would imagine. Those are probably our focus. So I haven't really seen that much demand from a market perspective to run a lot of Kafka proxy stuff. In fact, I think the real driver for the Kafka proxy is that the Kafka driver — and when I say driver, I mean the client code — has a lot of knowledge. It is a very sophisticated piece of technology.
And the reason to use the proxy is really to bridge gaps. So when you connect to Kafka — whether it's via Go or C++ or Java, really any driver, but those are the most popular ones — the driver has to understand the internal IPs because it does leader election on the consumer groups. It has a lot of protocols embedded into the client, and so it needs to understand the internal cluster state. It actually fetches all the metadata from the controller and so on. And so the HTTP proxies, I think, are less used as a way to replace RabbitMQ in the request–response type of workload, and more to bridge external systems. So if you're trying to proxy your streaming writes to your infrastructure via HTTP, then you would put our C++ proxy in front of Kafka, and then anyone that talks HTTP can talk straight to this proxy without having to understand the mechanics or the internal client protocols.
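As a rough sketch of the HTTP bridging described here — the path and content type below follow the Confluent-style REST proxy conventions, and the proxy URL and topic names are made up for illustration; this is not Red Panda's exact API:

```python
import json

def build_produce_request(proxy_url: str, topic: str, records: list) -> dict:
    """Describe an HTTP produce call against a Kafka REST proxy.

    Path and content type follow Confluent-style REST proxy conventions;
    a real deployment may differ.
    """
    return {
        "method": "POST",
        "url": f"{proxy_url}/topics/{topic}",
        "headers": {"Content-Type": "application/vnd.kafka.json.v2+json"},
        "body": json.dumps({"records": [{"value": r} for r in records]}),
    }

# Any plain HTTP client can send this request -- no Kafka driver, and no
# knowledge of internal broker IPs or leader election, on the caller side.
req = build_produce_request("http://localhost:8082", "payments",
                            [{"amount": 42, "currency": "USD"}])
```

The point of the proxy is exactly this asymmetry: the caller only speaks HTTP, while the proxy holds all of the sophisticated client-side state.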
And for the majority of applications — whether it's your machine learning, your fraud detection, or your real-time ad bidding — most people just use that. And I think that Kafka has ultimately replaced RabbitMQ, at least in the market and what we've seen.
[00:27:41] Unknown:
And going back to the specifics of your implementation, what are some of the ways that the overall design or the product direction has changed or evolved since you first began working on it, and are there any initial assumptions that you had going into the project that have been updated or invalidated in the process?
[00:28:01] Unknown:
I would say the main thing that changed for us was our ability to add coprocessors. So let me give you a little bit of background here. For most APIs currently — let's use gRPC, to be efficient with our time — you define, inside a proto file, the data exchange format, the data structure you're going to exchange, and then you run it through an IDL generator like protoc, the gRPC compiler. It takes the schema and generates stubs for a multitude of languages — Python, Java, it doesn't matter. And that's roughly how a lot of APIs work: a known serialization format, a known endpoint, and a known way to interact with these endpoints.
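To make the contrast concrete, here is a hypothetical Python sketch of what schema-plus-codegen buys you — an agreed data shape and round-trip serialization, and nothing more; the `Person` type and fields are invented for illustration:

```python
import json
from dataclasses import dataclass, asdict

# What an IDL like protobuf gives you: an agreed-upon shape for the data...
@dataclass
class Person:
    first_name: str
    last_name: str
    ssn: str  # PII travels along untouched; serialization says nothing about it

def encode(p: Person) -> bytes:
    """Stand-in for generated serializer code."""
    return json.dumps(asdict(p)).encode()

def decode(raw: bytes) -> Person:
    """Stand-in for generated deserializer code."""
    return Person(**json.loads(raw))

wire = encode(Person("Ada", "Lovelace", "000-00-0000"))
restored = decode(wire)  # round-trips faithfully, but gives no data guarantees
```

The schema guarantees the shape of the data, not anything about its contents — which is the gap the coprocessor idea below is aimed at.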
What doesn't exist to this day in a way that is scalable — and I'll mention the specific details in a second — is a way to provide users with a data API that goes beyond serialization, that gives deep data guarantees. And so something that has evolved from the beginning of the product is our idea of coprocessors. So what's a coprocessor? Red Panda allows for inline WASM transforms. As you're sending the data, we can run it through a WASM transform — you can upload either JavaScript or WASM, we run it inside the V8 engine — and then we can materialize that transformation to a different topic. This may sound trivial, but I'm gonna walk you through an example: one of our customers, where we took five clusters of 20 nodes each and reduced them to one 20-node cluster.
And this goes to the idea of the data API. What they were doing is they were sending information — for the sake of this discussion, let's say it had three fields: first name, last name, and some PII information. Right? And the other clusters were literally materialized views of this original cluster. The way it's done today is you use Kafka Streams or Flink or Spark or some other job to consume data from one topic and push it to another topic. We think that that is not scalable. The reason it's not scalable is because you're consuming network resources from the clusters that you're consuming from and producing to. And so our answer to that — and we're still in the early stages of this, and we think it's the right approach; ultimately, the market will decide — is that giving people the ability to upload WASM transform scripts basically decouples the production of data from the consumption of data with a data API.
So one example here is you can push some person object with some PII information. And of course, the simple thing to do is to say, okay, I'm gonna remove all of the PII fields and push the records to a separate topic. So let's take change data capture: you have Debezium pushing your transactions out of a database, and now you can run an inline transform and guarantee that the consumers of a particular topic will never see PII information. But then you can even go beyond that. Right? You can say, I am going to transform the same three fields — first name, last name, and some PII information, let's take the Social Security number — and the output is gonna have the same three fields. So it's not just about removing or adding fields, or data enrichment — which, by the way, is useful — it's about changing the meaning of the data.
And so being able to upload inline transformations is, I think, what is fundamentally missing from being able to give people API-like guarantees — like you do for your microservices, but now for your data. You can upload a transform and say, I am going to divorce the producers of this data from the consumers and the versioning of my sources of data, and guarantee — whether it's for your machine learning workloads — a uniform contract, a uniform serialization format with a uniform set of fields, or I'm gonna enrich some data with a few extra things. This one-shot transformation is, I think, what is missing from being able to truly expose the idea of data APIs, which, again, goes beyond serialization. It's about deep, introspective guarantees on the data.
And what's useful about our particular implementation is that you get per-stream correctness guarantees, and because the transforms execute locally, you're not consuming network transfer. You're not consuming extra CPU across clusters. You're simply pushing the data through a materialization engine — the V8 engine — and saying, hey, run this function, and whatever I get back, I'm gonna write to the two topics. Right? So it's very cheap for us to do in a very scalable way, because the only thing that is increasing is disk. But from a business perspective, it now gives you the ability to guarantee GDPR compliance. You can say: all of the tuples that come through this stream are guaranteed to have the Social Security number removed for everyone in here, or the driver's license information removed, or the address masked, and so on.
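A toy model of the coprocessor idea described above — the real transforms are JavaScript/WASM executed in V8 inside Red Panda; this Python sketch, with invented field names, only illustrates the semantics of materializing a PII-stripped topic:

```python
# Records from an input topic are run through a user-supplied function, and
# the result is materialized to a derived topic.  Field names are made up.
PII_FIELDS = {"ssn", "address", "drivers_license"}

def strip_pii(record: dict) -> dict:
    """One-shot transform: drop PII fields before any consumer sees them."""
    return {k: v for k, v in record.items() if k not in PII_FIELDS}

def materialize(input_topic: list, transform) -> list:
    """Apply the transform inline, yielding the records of the derived topic."""
    return [transform(r) for r in input_topic]

source = [{"first_name": "Ada", "last_name": "Lovelace", "ssn": "000-00-0000"}]
safe_topic = materialize(source, strip_pii)
# Records in safe_topic carry no "ssn" key, so downstream consumers of the
# materialized topic can never observe that field.
```

Because the transform runs inside the broker rather than in an external Kafka Streams/Flink/Spark job, no second cluster has to consume and re-produce the data over the network.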
[00:33:08] Unknown:
Yeah, it's definitely a very interesting and useful approach to that problem. And that brings me to my next question, which is to explore some of the ways that you are able to innovate on the overall space beyond just performance, given that you are tying yourself to the specifics of the Kafka API, and some of the room for growth or other areas of the streaming ecosystem that are not being thoroughly explored at the current point in time.
[00:33:35] Unknown:
That's always so fun. I think the most fun, as a technologist, is building, so every time someone asks me this question, it's a lot of fun for me to talk about. We're solving real problems right now, and to some extent, we had to start with something. You know, I had Emacs and GCC on my computer, and I was like, okay, I'm gonna build the system. Eighteen months later, we have a system that is compatible with Kafka. And that's where we had to start. Right? We had to start with better guarantees, better performance — all of these things that we've been talking about through the podcast. I think the future, though, is much more interesting, in particular on the disaster recovery and compliance side, and the idea that we call shadow indexing.
And so shadow indexing is: you push your data to the Red Panda cluster, and before we delete a file locally from disk, we upload it to S3 or Google Cloud Storage or Azure Blob — it doesn't matter, some durable, very cheap storage. Right? So that's the first part. That ability is tiered storage — Kafka is working on a KIP for it, and Pulsar has something like that — but that's not enough. What we also upload is the shadow index, which is basically self-describing metadata about the state. What that allows us to do is have a true global Kafka cluster. How does that work? Let's say that you're running in US East and you wanna read from US West. US East pushes its data, working as a regular Red Panda cluster, and uploads that data to S3. When US West wants to read data that was written to US East, the usual way is MirrorMaker, which is how most of this is done today. Right? You have this thing that mirrors the topic, and it consumes resources at the Raft level, the disk level, the compression, the network — on both sides of the cluster.
Instead, we call the S3 API: hey, S3, CopyObject — an API that effectively allows you to move the object, or basically point the source at the other data center, let's say US East. Okay, but fundamentally, what is the shadow index? The shadow index only transfers a little bit of metadata — we're talking kilobytes for, like, a petabyte range of actual data. So we don't consume any cluster resources, and we have this shadow index, this metadata, that tells us: oh, if you wanna read the entire historical data of this topic, you can.
It is in US East; let me copy the data to the US West S3 — which is very cheap, like two cents per gigabyte — and only materialize what the client asks for. Why is this a game changer, I think, for the future of streaming? Fundamentally, what you're giving people is the unification of historical file system access and real-time access, through the exact same Kafka API. And so, of course, on US West, if you request historical data that was written in US East, it'll have to fetch it from S3 and materialize it on some scratch space of the file system, but it'll give you the data. Say you're a machine learning person in an organization — and by the way, this keeps the access controls and everything — you don't have to understand where the data comes from. Right? Red Panda will manage the hierarchy of data. It's not just uploading. It's the ability to query historical and real-time data with the exact same Kafka API.
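The cross-region read described above leans on S3's server-side CopyObject call. Here is a minimal sketch of the arguments that would be handed to boto3's `s3.copy_object` — the bucket names and object key are hypothetical, invented for illustration:

```python
def cross_region_copy_args(source_bucket: str, dest_bucket: str,
                           object_key: str) -> dict:
    """Arguments for an S3 server-side copy (boto3's s3.copy_object).

    The copy happens entirely inside S3, so neither Red Panda cluster
    spends CPU, disk, or network bandwidth moving the log segment.
    """
    return {
        "Bucket": dest_bucket,  # e.g. the reading region's bucket
        "Key": object_key,
        "CopySource": {"Bucket": source_bucket, "Key": object_key},
    }

# Hypothetical layout: one uploaded log segment for a topic partition.
args = cross_region_copy_args("redpanda-us-east", "redpanda-us-west",
                              "payments/0/segment-0001024.log")
# s3_client.copy_object(**args)   # only run this against a real bucket
```

The actual call is commented out because it needs real credentials and buckets; the point is that the data path is S3-to-S3, with only kilobytes of shadow-index metadata crossing between clusters.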
So all of that is offloaded onto the storage tier. And I think really that idea of leveraging the cloud provider's native services — truly becoming cloud native. "Cloud native" is really thrown around; my view of it is that you leverage the cost-saving infrastructure of Amazon, with S3, or of Azure, and all of these dynamic compute resources, to build a better system. That, to me, is what cloud native is. And so S3 allows us to give a global view of Kafka, with unlimited replicas in any cluster, super cheap — I mean, it's literally two cents per gigabyte of transfer, and that's what you're paying on your S3 bill — without the need to materialize the data locally, and without overloading that cluster. So from a system performance perspective, you could literally proxy reads of petabytes and petabytes of data with a very small memory footprint on the box. So that's a big one. That, coupled with our inline WASM transforms — these one-shot transformations — which, you know, don't solve all the use cases; if you're merging a ton of streams, there are other systems, and Flink is excellent, in my opinion.
They do a better job of merging streams and doing more computation. But this one-shot transformation, giving data API guarantees, combined with our idea of shadow indexing, is, I really think, the future of streaming. It's like: read it anywhere, it's scalable, it's cheap, you can just run it. It gives you safety guarantees, it gives you correctness guarantees — we give you a linearizable Raft implementation, low latency, and high throughput for every cluster.
[00:38:38] Unknown:
If you're looking for a way to optimize your data engineering pipeline with instant query performance, look no further than Qubz. Qubz is next-generation OLAP technology built for the scale of big data, from UST Global, a renowned digital services provider. Qubz lets users and enterprises analyze data on the cloud and on-premise with blazing speed, while eliminating the complex engineering required to operationalize analytics at scale. With an emphasis on visual data engineering, and connectors for all major business intelligence tools and data sources, Qubz allows users to query OLAP cubes with sub-second response times on hundreds of billions of rows.
To learn more and sign up for a free demo, visit dataengineeringpodcast.com/qubz today. For people who are interested in exploring Red Panda and deploying it into production or a test environment, what is actually involved in getting a cluster set up, joining the nodes together, and managing the auto-tuning that you mentioned? I'm also curious to understand some of the challenges or complexities that you face in being able to automatically discover those optimal settings, and how cloud environments complicate that problem.
[00:39:58] Unknown:
By the time this podcast is out on September 22nd, we're gonna have a free download. People can just go to our website — we'll put the link in the show notes — and try it themselves; they get their own kind of private repo. The auto-tuning and discovery happen as long as you're running systemd. This auto-tuning, because it works at the hardware level, literally runs at hardware speed, which, like I mentioned, is really anticlimactic when you're trying to show it to someone who doesn't understand the system. You're like, oh, look, I did all these cool things — and it's just a bunch of terminal lines. But anyway, that happens automatically, and then Red Panda starts. Now, the exciting part of this question is configuration. The reason we chose Raft — actually, the original protocol I wrote was chain replication, but when you add node removal and node addition and a bunch of other stuff, you end up with a thing that looks exactly like Raft, except that Raft gives you a proof on configuration changes. There are two types of configuration changes in Raft. I'm only gonna talk about the one that is gonna be available on September 22nd, which is called joint consensus.
And so the reason for choosing Raft — in addition to safety, and to having this log completeness guarantee, which says there are no gaps in the log or you can't write data — was sound configuration changes. When you add a node to the cluster, you give it basically three or four IPs, just for failure recovery, and it'll ask the current nodes: hey, who is the leader of the controller in the cluster? The controller is a very low-throughput group — it's really only used when you're creating a topic or adding a node; it's basically otherwise idle. So there's one node in the cluster that does this, and it uses the Raft leader election to move around in case there is a failure. Anyway, you ask the controller, and the controller initiates the joint consensus protocol, which is exactly how you add or remove nodes. Right? So you can expand or shrink the cluster like you would expect: you start up a node, it goes through these sound configuration changes — you add it, you make sure it has the controller metadata up to date, and so on — and then it can join the cluster and start voting. That's how you add a node. Ultimately, some of the open source systems — and we're probably going to open source a large part of this later this year — have offloaded a lot of this complexity onto the users. Instead, we unloaded the complexity of node discovery and configuration changes onto the system itself, and we took on the complexity of implementing the joint consensus algorithm to add and remove nodes. So it's really, I think, quite trivial: on any node, you say, install Red Panda.
It gives you the binary, you give it the seed IPs, and then it's up and running from that point.
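For a concrete picture of the "seed IPs" setup described here, a minimal node configuration might look like the fragment below — the field names follow the general shape of a Red Panda `redpanda.yaml` of that era, but treat them as illustrative rather than authoritative:

```yaml
# Illustrative redpanda.yaml fragment -- exact schema may differ by version.
redpanda:
  data_directory: /var/lib/redpanda/data
  node_id: 2                 # unique per node
  rpc_server:
    address: 10.0.0.12
    port: 33145
  seed_servers:              # the three or four IPs used to find the controller
    - host:
        address: 10.0.0.10
        port: 33145
    - host:
        address: 10.0.0.11
        port: 33145
```

On startup, the node asks the seed servers who the controller leader is, and the controller drives the joint-consensus configuration change that admits the node.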
[00:42:45] Unknown:
Digging into the business side of things, what is your overall approach to being able to grow this platform and expand its capabilities and ensure that it's sustainable from a technological and business perspective?
[00:43:01] Unknown:
The first thing that we had to do was really be API compatible. Basically, that opened up every enterprise in the world — it's amazing how many Kafka installations there are. I expect something like 120,000 total Kafka installations to date, so the market for us is really huge. We have seen demand in the high-volume space — pushing a couple of petabytes per week, let's say four, five, six gigabytes per second, huge volumes per cluster — and in the finance sector. Right? And the reason is that the one cares about safety, and the other cares about throughput and scalability. What we haven't really addressed is the middle of the market, the majority. What's good about where we started is that an oil company is always gonna have money to process, like, IoT devices; same thing with the largest players in the analytical space, and the largest CDN and telco providers. That's the big volume.
On the finance side, hedge funds really have no price sensitivity, so that's also a good market for us. But ultimately, this won't be a massively successful product, I think, unless we give it to the masses. And so on September 22nd, we're gonna have a free download trial where anyone can just try it. It's the entire product. They can test it; they can see if what we're saying is true. Hopefully, they find bugs — it's great when your customers find faults, and you fix them and make the system better. And then, towards the later part of the year, we want to make this available for a lot of people that don't have the money to pay. What that means is probably a community license of sorts.
My current thinking on this is similar to the CockroachDB license, where if you're a cloud provider, you have to pay. If you are a vendor of data services where what you're really selling is Kafka, plus or minus some proxies, you have to pay. But for basically all other users, it's free to use. I think that's how we move forward. So that's one. And the second one is, I think there's this largely unaddressed, huge developer ecosystem — the JavaScript and Python developers, and so on — because our system is so easy to run. No one complains, honestly, about running Nginx or running Node.js; it's a single binary that you run. To some extent, I think the reason Cockroach has seen a lot of growth is that the system is really easy to run — it was one binary. So we put a lot of effort into making it super, super easy to run — no tunables, auto-discoverable — where people get to just focus on their applications.
And so I think that's a large market where we're seeing some pickup in traction: a lot of Node developers wanna do real-time fraud detection. They have the same problems that some of the Java enterprise shops have, but maybe they don't have the internal sophistication to manage a JVM system, to understand ZooKeeper, snapshots, TLS, leader election — all of these really complicated things. They just want a system that gets out of the way, and that they can interact with — streams where they can simply upload another JavaScript or WASM script to get the data API concept I've been talking about. And so I think those are how we address maybe the three parts of the market. We already target basically the Fortune 2,000 with what we have today. They care about reducing the number of nodes from 300 to 3 — this is actually a real use case, and I can talk more about it later. They care about hardware reduction, cost, performance optimization, and getting back engineering capacity. On the finance side — the very high end — they have no price sensitivity; they care about safety and data correctness guarantees.
The majority of the market — most businesses running three, five, seven nodes — will benefit from our open source. So that's the strategy. And then hopefully, towards the end of the year, we're gonna announce our cloud. Our cloud is effectively what every data vendor gives you: you offload everything to the vendor. They offer you a mutual-TLS authenticated port, you just push to this port and consume from this port, and they take care of the management. That's what we're thinking in terms of business.
[00:47:04] Unknown:
In terms of the ways that you have seen some of your initial customers using Red Panda, what are some of the most interesting or innovative or unexpected use cases that you've seen?
[00:47:14] Unknown:
We have one incredibly intriguing use case. We've only heard of it from this one customer of ours, and I had never even thought of it — and I've been doing streaming for 10 years. They are fundamentally a database company. Right? They provide some services, but at its core, they're a cloud native database company. And they're actually thinking — because of our tiered storage, this idea we call shadow indexing — that Kafka, in terms of offset management, gives you a totally addressable log. Right? From offset zero all the way up to uint64 max, you can address every single record in a Kafka log. If you couple that with our shadow indexing idea — which means any cluster, for any topic, can go and read any batch in the history of the log, and we manage the data hierarchy, what is materialized on disk versus what's uploaded to durable storage like S3 — then what they're using us for is that they actually upload a megabyte of data for every record batch. Every record is a megabyte blob in their own format — we don't really know what format it is; they just send a megabyte. When they receive queries from their users — let's say somebody connects Tableau or something like that and runs a query — their query processors actually fetch arbitrary offsets in the lineage of this totally ordered log to fetch blobs, in a way that gives them hierarchical storage, with low latency and high throughput for their streaming writes. So data that's hot is gonna come back hot, and data that's cold is still gonna come back relatively fast, because we can fetch it from S3, decode it really fast, and return it to the user. So they get this totally addressable log with automatic data hierarchy management. That's probably the most technically interesting use case.
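A sketch of how a shadow index can make the log totally addressable — the segment names and base offsets below are hypothetical; the point is that kilobytes of metadata suffice to map any logical offset to the object that holds it:

```python
import bisect

# Hypothetical shadow-index metadata: for each uploaded segment, the base
# offset it starts at and the object key it lives under in S3.
SHADOW_INDEX = [
    (0,     "topic/0/segment-00000000.log"),
    (1_000, "topic/0/segment-00001000.log"),
    (2_500, "topic/0/segment-00002500.log"),
]

def locate(offset: int) -> str:
    """Map a logical offset in the totally addressable log to the object
    holding it: binary-search for the last segment whose base <= offset."""
    bases = [base for base, _ in SHADOW_INDEX]
    i = bisect.bisect_right(bases, offset) - 1
    if i < 0:
        raise KeyError(f"offset {offset} predates the log")
    return SHADOW_INDEX[i][1]

# A cold read of offset 1_700 fetches segment-00001000 from S3 and scans
# forward; a hot read of a recent offset is served straight from local disk.
```

The broker, not the client, decides whether a given offset is served from local disk or materialized from object storage — which is the automatic data hierarchy management described above.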
And for a business use case: we are talking to a cancer lab, and they basically wanna do DNA sequencing — because we can manage such large volumes of data with such a small amount of hardware, they want to retest cancer patients before they leave the wet lab. So, you know, you spit in a tube or urinate in a cup, and they analyze the sample and do some sequencing on it. If they push that data to Red Panda and run something like SparkML on it, they can crunch through gigabytes of data per second. And you can just tell your patient: hey, can you wait in the waiting room for the next two hours while I go through, let's say, a terabyte of data — and then you have the results. Being able to test a patient before they leave is maybe the most interesting business use case. So I think those are the two wildest use cases that we've seen with Red Panda.
[00:50:06] Unknown:
From talking to you and learning more about the system, Red Panda is definitely a very interesting and impressive piece of technology and something that I'm personally looking forward to experimenting with. But what are the cases where it's the wrong choice?
[00:50:19] Unknown:
Yeah. We don't support transactions yet. From customers' experience — the cool thing about being in the data space is that it really doesn't matter what our claims are or what our competitors' claims are; ultimately, the clients will tell you what worked and what didn't — a lot of them have scalability limitations running the current implementation of Kafka transactions. And I would say it's probably less than 10% of people that are running Kafka transactions. So we're not a good fit if you require Kafka transactions. Currently, we only solve the base API, which is deceptively simple to say — remember that Kafka has over 160 APIs, with versioning and so on.
So we cover the majority of them, but we don't cover transactions. We are basically thinking through how to improve the throughput of transactions. Right? Transactions are really not for low latency; they're for data safety. What people care about is pushing a bunch of transactions per second — they're not trying to optimize the last 500 microseconds or the last 200 microseconds. That's not what users are using transactions for, upstream. We did write a draft for transactions, but we didn't like the performance of it, or the design trade-offs, so that's still very much in active research — it's probably going to land in, I would say, Q1 next year. If Kafka transactions are fundamental to how you're building your use case, we're not a good fit there. For the other types of systems, I think we're really the future of streaming, with our coprocessors, with the WASM engine, and because we also give you the entire ecosystem. Right? We have the schema registry. We have an HTTP proxy.
We give you API compatibility, and we give you these two things — coprocessors and shadow indexing — which are the future of streaming, leveraging cloud native technologies. But if you're definitely in the transaction space, I think currently Kafka is probably your only option.
[00:52:22] Unknown:
Are there any other aspects of the overall streaming ecosystem or the Kafka space or the work that you're doing at Vectorized and Red Panda that we didn't discuss that you'd like to cover before we close out the show or any particular ways that the audience can engage with you to help you drive this product forward?
[00:52:39] Unknown:
Yeah, for sure. I really encourage people to try it. It's gonna be free to download, free to run — there are gonna be no limits right now. So download it, try it, and let us know what you think. We'd just love to see how we can improve. That's really the first thing. Ultimately, all of our features have come from users. When you start a company, you kinda have this grand idea of how things are gonna work out. For me, it was: I wanna build the fastest storage engine, I'm gonna solve all the problems. And people just wanted, you know, a better Kafka.
Now we're building the future with coprocessors and so on. I really think, especially if you are in the JavaScript community or the Python community, and you have always wanted to try out and build real-time streaming pipelines in a way that's easy for you to manage — it's a single binary — we really are calling for this community to help us create the system that they wish they had. We think that the ability to upload these coprocessors, giving these data API guarantees, is gonna be super useful for, effectively, TensorFlow shops, where you push data to a topic. They really are not latency sensitive or throughput sensitive; they just want convenience. And so this one binary is, I think, addressing the need for a simple system that gives them data API guarantees.
So I think that's really the future. And later this year, a big part of the product — probably the majority — is going to be open source and available for people to hack on. It is written in C++20, with coroutines and futures, so probably not a lot of contributors — but if you are looking to hack around this, stay tuned. The best way to reach out to me is Twitter, at vectorizedio, or my personal Twitter, emaxerrno — the largest errno in the Linux kernel. Feel free to reach out, and we're happy to start the conversation. One of the things that we're most excited about: I'm a Latino founder, and there aren't that many Black, Latino, and female engineers working on hard distributed systems.
We have created a scholarship called Hack the Planet, which we're super psyched about. If there are any listeners that are underrepresented in tech, definitely reach out — the application process is open. This came about, honestly, recently, through all of the recent events, which are very sad. We wanted to do something, and I think the best thing, in addition to donating to the organizations that we donate to, is really to donate our time. We have experts that have run some of the largest systems in the world — at Microsoft, at Red Hat, at Akamai — and we can have a significant impact in your life if you're listening to this podcast.
You get to keep the entire IP. You get to work on the project that you want — we don't want any of the IP. We wanna give you money to hack on something that is hard, that is in distributed systems, as long as you're part of this, and you get the mentoring. You get to talk to me, the CEO, once a month about anything, whether it's starting a business or running distributed systems. And you also get mentorship from all of our senior architects — Dennis, Noah, Meha, Ben. Mostly everyone that I work with has been in this business for, like, 10, 15, 20 years. We really hope that you take advantage of it. Definitely hit up vectorized.io/scholarship.
We want to help you. Yeah. And we'd love to be the change that we want to see today.
[00:56:09] Unknown:
For anybody who does want to get in touch or reach out about any of those things, I'll have your contact information in the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:56:23] Unknown:
It's a really fun question, to always look at the future. You know, the biggest piece missing to me is tooling that gives deep guarantees about the data. We have an idea and we have an opinion, which is our coprocessors, but, ultimately, people will decide if that's useful. You know? I think that the way we advance is we ask some of our users, we propose a system, we build a system that we think is the future. There are other alternatives that I can think of, but I think tooling that gives deep introspective guarantees about the data. So 1 example: whether you're a database or a streaming system, it ultimately doesn't matter. What we have to move away from is kind of this low bits level, like Vectorized. You know, that's for us, for very few people to really spend all of their time and expertise focusing on moving data around at the bits level. But from a business perspective, what I think is missing is, you know, building systems that give users higher guarantees. 1 example is if you put data on any system, that system guarantees data provenance, which is that only the right people have access to the data through the entire transformation of the data. Whether that system goes through Red Panda and moves through HBase and then back, it doesn't matter what the pipeline is. But, like, that kind of data provenance, you know, just keeping all the permissions around all of the systems, something that manages that, I think, is missing.
The second 1, I would say, is the ability to give users higher level guarantees. Like, as long as you put JSON in here, we'll make sure that no private information is leaked. But those things are guaranteed by the storage system. In a way, I really think that GDPR compliance is a contract that databases have to adhere to, or fundamentally, the storage system. Like, we have to be able to map some of these more complicated business metrics and goals that they're trying to achieve, which is, like, you know, protect the users' data and so on, and map it onto the storage systems themselves that own the actual data, you know, and be able to expose it to developers in a way that is easy to consume. I think that there's a lot of tooling that is missing there.
[00:58:30] Unknown:
Well, I appreciate you taking the time today to join me and discuss all the work that you've been doing with Red Panda and Vectorized and exploring and expanding the streaming space. It's definitely a very interesting product and 1 that I plan to take a closer look at myself. So I appreciate all the time and energy you've put into that, and hope you enjoy the rest of your day. Thanks, Tobias. Talk to you soon. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Alexander Gallego
Alexander's Background and Journey
Red Panda: A Kafka Replacement
Operational Simplicity and Auto Tuning
Red Panda's Implementation and Compatibility
Future of Streaming and Kafka Ecosystem
Coprocessors and Data APIs
Innovations and Shadow Indexing
Setting Up Red Panda
Business Strategy and Market Approach
Interesting Use Cases
Limitations and Future Plans
Engaging with the Community
Biggest Gaps in Data Management Tooling
Closing Remarks