Summary
The Hadoop platform is purpose-built for processing large, slow-moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast analytics on fast-moving data. To fill this need the Kudu project was created with a column-oriented table format that was tuned for high volumes of writes and rapid query execution across those tables. For a perfect pairing, they made it easy to connect to the Impala SQL engine. In this episode Brock Noland and Jordan Birdsell from PhData explain how Kudu is architected, how it compares to other storage systems in the Hadoop orbit, and how to start integrating it into your analytics pipeline.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Brock Noland and Jordan Birdsell about Apache Kudu and how it is able to provide fast analytics on fast data in the Hadoop ecosystem
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Kudu is and the motivation for building it?
- How does it fit into the Hadoop ecosystem?
- How does it compare to the work being done on the Iceberg table format?
- What are some of the common application and system design patterns that Kudu supports?
- How is Kudu architected and how has it evolved over the life of the project?
- There are many projects in and around the Hadoop ecosystem that rely on Zookeeper as a building block for consensus. What was the reasoning for using Raft in Kudu?
- How does the storage layer in Kudu differ from what would be found in systems like Hive or HBase?
- What are the implementation details in the Kudu storage interface that have had the greatest impact on its overall speed and performance?
- A number of the projects built for large scale data processing were not initially built with a focus on operational simplicity. What are the features of Kudu that simplify deployment and management of production infrastructure?
- What was the motivation for using C++ as the language target for Kudu?
- If you were to start the project over today what would you do differently?
- What are some situations where you would advise against using Kudu?
- What have you found to be the most interesting/unexpected/challenging lessons learned in the process of building and maintaining Kudu?
- What are you most excited about for the future of Kudu?
Contact Info
- Brock
- @brocknoland on Twitter
- Jordan
- @jordanbirdsell
- jbirdsell on GitHub
- PhData
- Website
- phdata on GitHub
- @phdatainc on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Kudu
- PhData
- Getting Started with Apache Kudu
- Thomson Reuters
- Hadoop
- Oracle Exadata
- Slowly Changing Dimensions
- HDFS
- S3
- Azure Blob Storage
- State Farm
- Stanley Black & Decker
- ETL (Extract, Transform, Load)
- Parquet
- ORC
- HBase
- Spark
- Impala
- Netflix Iceberg
- Hive ACID
- IOT (Internet Of Things)
- Streamsets
- NiFi
- Kafka Connect
- Moore’s Law
- 3D XPoint
- Raft Consensus Algorithm
- STONITH (Shoot The Other Node In The Head)
- Yarn
- Cython
- Pandas
- Cloudera Manager
- Apache Sentry
- Collibra
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello. Welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them. So check out Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And don't forget to go to dataengineeringpodcast.com/chat to join the community and keep the conversation going. Your host is Tobias Macey. And today, I'm interviewing Brock Noland and Jordan Birdsell about Apache Kudu and how it is able to provide fast analytics on fast data in the Hadoop ecosystem.
[00:01:09] Unknown:
So, Brock, could you start by introducing yourself? Yep. I'm Brock Noland, and I'm cofounder of phData. And I've been contributing to Apache Kudu, I don't know, for a very long time, since before it was even publicly released. And I'm also coauthor of the Getting Started with Apache Kudu book by O'Reilly. And I've been involved in the Hadoop ecosystem for quite some time. And Jordan, how about yourself?
[00:01:36] Unknown:
Jordan Birdsell, and I've been involved in kinda the Hadoop space for probably 7 or 8 years now, starting at State Farm, moving on through a couple other companies before coming into phData. In that time, I became a committer and PMC member for Apache Kudu by creating the Python client for Kudu.
[00:02:00] Unknown:
Going back to you, Brock, do you remember how you first got involved in the area of data management?
[00:02:04] Unknown:
Yeah. I was working at Thomson Reuters, and, ultimately, what we were trying to do at that point was scale up a website that they have. So Thomson Reuters, 1 of their core businesses is legal research, and they had a very old, antiquated system and they really wanted to revamp it so it felt a lot more like Google when you do legal research. And so, you know, instantaneous results, like, you search all the data with just natural language. And so we were basically taking this back-end search system, which is built very, very similar to how Google is built internally, and trying to scale it up like a 100x.
And, so in order to do that, we had to take the performance metrics on how the system was performing and also, you know, honestly, just the log data, and have it so that we could run queries over that data very, very quickly. And the data was at such a large volume that we ran into Hadoop, and started using Hadoop to store and analyze the data, whereas we had been using Oracle Exadata prior to that point. And, Jordan, do you remember how you first got involved in the area of data management?
[00:03:18] Unknown:
Yeah. So I I started my career at, GEICO, about 11 years ago, working in their, pricing systems. So that was kind of a mixture of both analytics and operational, data management strategies. Then when I went to State Farm, I started in kind of a pure research capacity with advanced analytics and, and kinda what became the data science group. And then, you know, continued on that path. And then even at State Farm moved into, sort of operationalizing, data science applications and and things to that effect, and then moved on to do effectively the same thing at Stanley Black and Decker kind of, with more of a cloud native, approach.
[00:04:03] Unknown:
So can you start by explaining a bit about what the Kudu project is and the original motivation for building it?
[00:04:11] Unknown:
Yeah. I'll go ahead and get started on that. So, you know, ultimately, what we found is there's a lot of use cases on top of the Hadoop platform. It's a very general platform used to solve a lot of different problems. Right? And there ends up being just a lot of use cases that require, you know, fairly rapid modifications. The kind of, like, first use case attached to that is slowly changing dimensions. So those are kind of like the least quick updates, I would say. And so they come in, you know, daily or even monthly. And, you know, ultimately, what that requires is that we update some data in HDFS. And there's a bunch of tricks on how to do that in HDFS, S3, Azure Blob Storage, whatever it is. There's a bunch of tricks to handle those, but those tricks lead to complexity. And that complexity inhibits the number of users that can actually use the platform. And so that was really kind of, you know, 1 of the motivating factors for Kudu. And then another 1 is that you have all these IoT use cases or IoT-like use cases where, you know, data is starting to come in as a slow trickle or a stream of data as opposed to, you know, in microbatches or even large batches. And dealing with that kind of data, ultimately, if you stick it in HDFS or S3 or Azure Blob Storage, you have to compact that kind of data if you trickle it in. And those compaction jobs are, again, like, just really boring to write, first of all, and result in complexity that limits the use case and limits the availability of the platform.
And so, like, those 2 use cases, I think, are the really motivating factors, for the platform.
[00:06:09] Unknown:
Yeah. And I'll jump in there to just kind of reinforce that. You know, kind of the original reason why I got involved in the Kudu project, and why my company at the time, State Farm, was very interested in Kudu, was for that reason alone: the level of complexity that our ingest and ETL applications had grown to was just unmanageable. You know, we were having a hard time, for 1, getting, you know, our existing data and ETL resources to convert over to the skill set. For 1, because the applications were complex, but it was just a huge shift. But then just the amount of development hours we were spending trying to make all these complex architectures work, when something like Kudu completely eliminates that problem. And you mentioned the Hadoop ecosystem a few times and the fact that Kudu was built to
[00:06:58] Unknown:
fit a niche that the other components weren't quite able to achieve, at least in a single tool. So I'm wondering if you can talk a bit more about how it fits into the Hadoop ecosystem overall and some of the other technologies that it's more commonly used with. Yeah. Absolutely.
[00:07:14] Unknown:
So ultimately, Kudu is a storage system within the general Hadoop ecosystem. And so, I mean that pretty broadly. You know, there's a whole bunch of different components that are, you know, in the, quote, Hadoop ecosystem, end quote. And this is a storage system. So it's a columnar storage system that complements HDFS. So like on top of HDFS or S3 or Azure Blob Storage, you'd put, you know, like Parquet or ORC on top of those to get, like, fast columnar scans. And, you know, basically, a competitor to that would be to use Kudu. And in that competitive landscape, it would replace both HDFS and also the columnar storage on top. So it's both like a format and also a storage system from a very high-level perspective.
So it takes care of the replication, internally
[00:08:16] Unknown:
as well. Yeah. And I would say too, you know, there's also a little bit of overlap into HBase with random reads as well. Kudu's main focus, I think, from a scan perspective is on analytic scans. So,
[00:08:31] Unknown:
that's where the HDFS and Parquet piece comes in. So Yeah. So kinda like the overarching goal of Kudu is to be, like, almost as fast as Parquet on HDFS for analytical scans, and then almost as fast as HBase for, like, single row gets and stuff. And, ultimately, it kind of meets in the middle of those 2 use cases, or those 2 examples. There's a lot of use cases that fit there. And then, you know, from a scan perspective, it's typically Spark or Apache Impala that are used to scan,
[00:09:11] Unknown:
the database. And in terms of other projects that sort of jumped out at me as being in a similar space as I was doing some research to prepare for the show, it put me in mind of the Iceberg table format that's being built for hooking into the Hive metastore and the Hive SQL interface. So I'm wondering if you can do some compare and contrast between the use cases that Kudu enables and what Iceberg is enabling, or are they just completely orthogonal to each other? Yeah. Good question. I can go first on that 1. So, yes,
[00:09:47] Unknown:
Project Iceberg, or Iceberg and Hive ACID, are kind of what I would call slow-motion Kudu. So they're built for handling updates, really on that, like, slowly changing dimension side. So if you just have a trickle of updates and you don't care how quickly they're applied, you know, that's really where they fit. So they're kind of similar ideas, but Kudu, because it also extends down into the storage layer, could handle a much, much higher volume of updates. And so, you know, what we see is people often using either, you know, a partial SSD system or a full SSD host for storing Kudu data. And, you know, you're looking at tens of thousands of updates per second on that kind of host, in terms of, you know, possibilities. And so it's built to handle much more rapid inserts and updates than those systems. Whereas those systems are a very similar idea, but really built for really slow-moving data. And I think the tagline for Iceberg is something like it's a table format for large, slow-moving tabular data. And I think Hive ACID is very, very similar in that regard. And in terms of the types of use cases and application and system design patterns that Kudu lends itself to, I'm wondering what are some of the ones that you found to be the most commonly applied and some of the most useful.
Yeah. So, I'll go ahead and give that a shot again. Jordan, please step in when you want. And so, like, Kudu is really, really good at IoT data. So, like, what we see often is, and this doesn't have anything to do with updates, but what we see often is IoT devices in the wild, data streaming in, and very often, there's someone that wants to perform, like, per-user queries on that data. So, like, for example, customer service that wants to be able to look at a particular user and query, you know, that data so they can help that user on the phone. And then also, of course, like, you wanna be able to do analytics on that data over time, you know, across users. Right? And so both of those use cases, Kudu is really, really good at. And so that's a really common, just like home run use case that we see for Kudu, where you can store the data once and fulfill both those use cases. And data can be queried, you know, seconds after it's generated. And then another 1 is, you know, you have some relational system and you want to basically, and this is, you know, pretty simple, but you wanna basically mirror that relational system in your analytical system, but you don't want this lag. So oftentimes, there's this lag where, you know, at night, we do a batch copy of all the updates and stuff. Since Kudu can handle those updates in real time, we can mirror that relational system into the Hadoop platform, and it's up to the second, if you will. I was gonna jump in on that second 1. So, you know, I've spent a lot of time in data science use cases, and specifically with data science, you know, oftentimes, Hadoop or the Hadoop ecosystem
[00:13:01] Unknown:
becomes kind of your general-purpose application development environment or model training environment. And so these RDBMS offloads, if you will, or replicas, kind of become quite critical to those projects, you know. Yeah. A lot of data science use cases use images and things like that, but, you know, they typically have to augment these use cases with relational data that's coming from an ERP system. So the ability to manage that from an IT perspective in a clean way is something that Kudu really does well. And in terms of that replication, would you just be using a change data capture feed to be able to replay those events, or is there specific support for things like write-ahead log replication
[00:13:43] Unknown:
or the MySQL binlog replication for being able to populate those updates?
[00:13:50] Unknown:
Yeah. So in terms of getting those updates in, you know, we really see that falling, you know, kind of outside of Kudu, in the tool space. And so, you know, things like StreamSets or NiFi or Kafka Connect, all of those things are very common tools to ingest those CRUD operations. And then Kudu is just a way that we can really easily provide an up-to-the-second reflection of those CRUD operations.
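To make that concrete, here is a minimal sketch (an editorial illustration, not something shown in the episode) of what applying a trickle of CRUD events to Kudu can look like with the kudu-python client discussed later in the conversation. The master host, table name, columns, and sample events are made up, and an existing Kudu table with a matching schema is assumed.

```python
# Hypothetical sketch: mirroring CRUD events into Kudu with the kudu-python client.
# Host, table, and column names are invented; a table with this schema is assumed.
import kudu

client = kudu.connect(host='kudu-master.example.com', port=7051)
table = client.table('customers_mirror')
session = client.new_session()

# Pretend these events came from a CDC/ingest tool such as StreamSets, NiFi,
# or Kafka Connect.
events = [
    {'op': 'upsert', 'row': {'id': 42, 'name': 'Ada'}},
    {'op': 'delete', 'row': {'id': 7}},
]

for event in events:
    if event['op'] == 'upsert':
        # Upserts make the mirror idempotent: inserts and updates look the same.
        session.apply(table.new_upsert(event['row']))
    elif event['op'] == 'delete':
        # Deletes only need the primary key column(s).
        session.apply(table.new_delete(event['row']))

# Flush once per micro-batch; the rows are then visible to Impala or Spark scans.
session.flush()
```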
[00:14:21] Unknown:
And as far as the overall project design, I found it interesting that you decided to leave the interface as a set of language bindings and support the Impala SQL engine rather than implementing your own SQL interface directly in the project. So I'm curious what the motivation was for that design choice. Like, that's the great thing about, like, the broader Hadoop ecosystem
[00:14:47] Unknown:
is that there's you know, it it's basically kind of like a deconstructed database. Right? We've taken a database and pulled it apart, so each individual component is separate. And, so we can have something like Kudu come in and just implement, you know, pretty small, external user, you know, interface, and then, we can integrate all the the technologies on top of that that 1 interface. So we don't have to do SQL. Right? Because there's already, you know, a whole host of SQL engines in the in the Hadoop platform. And so, you know, I think that makes implementing Kudu a lot easier, and it makes it more it makes it smaller and more focused, so it can really focus on what it does well. And so I think I I I guess I see that as a function of the Hadoop ecosystem, which is a a dissected or or a dismembered database, if you will, versus, you know, the old world where we had, you know, fully vertically integrated databases.
[00:15:47] Unknown:
And does that design choice make it then easier to join across the different data sources? So the rapidly updating information that's being stored in Kudu and some of the more slowly changing dimensions that would be accessible via Hive or HDFS or HBase?
[00:16:04] Unknown:
Yeah. I would say absolutely. So, you know, by by using something like a query engine like Impala, that can access all of those those, different storage layers and Kudu included, you know, or HDFS. You know, it enables, that query engine, you know, and that project team to to access and join those,
[00:16:26] Unknown:
disparate data sources together. Whereas had Kudu developed their own query engine, you know, that that may not have been as likely. Yeah. And I think that, like, you know, you can very easily, especially with customers that are in the cloud, like, you can see this, like, you know, pretty clear storage hierarchy where it's like, you know, data that that's changing, or that's being ingested rapidly, like, makes a ton of sense to to be in Kudu. And then data that you're gonna is gonna be scanned, you know, but isn't isn't updating, it isn't, you know, being ingested concurrently, you know, that that's really like an HDFS and parquet slash ORC use case. Right? And then from an archive perspective, like, data that's gonna be queried periodically, you've got s 3, which is, you know, quite a bit more latent, from a query perspective.
And, and then even with s 3, you know, you can basically get cheaper storage by, you know, degrading your access to that data. And then all of that that whole spectrum, can be accessed with Impala and Spark.
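As a rough illustration of that spectrum (an editorial sketch, not an example from the episode), the snippet below reads a Kudu table through the kudu-spark connector and joins it with colder Parquet data in the same Spark session. It assumes the kudu-spark connector jar is on the Spark classpath, and the master address, table name, and paths are invented.

```python
# Hypothetical sketch: one Spark session spanning the storage tiers described above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kudu-plus-parquet").getOrCreate()

# Hot, continuously updated data served by Kudu tablet servers.
events = (spark.read
          .format("org.apache.kudu.spark.kudu")
          .option("kudu.master", "kudu-master.example.com:7051")
          .option("kudu.table", "impala::default.device_events")
          .load())

# Colder, append-only history stored as Parquet on HDFS or S3.
history = spark.read.parquet("hdfs:///warehouse/device_events_archive")

# The same DataFrame API covers both sources, so joining them is ordinary Spark.
combined = events.join(history, on="device_id", how="inner")
combined.groupBy("device_id").count().show()
```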
[00:17:28] Unknown:
And in terms of the overall architecture of the Kudu project, can you dig through that a bit more and explain how, for instance, the storage layer is architected and implemented and some of the design choices that have proven to be most beneficial and then some of the overall evolution of the system from when it first began to where it is now? Yeah. So, you know, like, 1 thing that's changing quite a bit within storage architecture
[00:17:56] Unknown:
is that, like, disks are becoming much, much faster from a seek perspective. Right? So we've got SSDs, where seeks are, you know, very, very low cost. And then you've even got, you know, newer memory-slash-disk type technologies like 3D XPoint by Intel. And so, ultimately, you know, the big reason why you, in the past, wouldn't necessarily have, like, a columnar database that you could update, you know, at a very high rate is because of disk seeks. And so, ultimately, that's not a bottleneck anymore. And so Kudu is a full columnar database under the covers.
And, you know, it was really built with the best-of-breed columnar encodings and compressions, so forth and so on, internally. And you can have a true columnar database where you can go and update a single row or get a single row very, very efficiently because of that change in underlying storage technology. And so, you know, that's, like, 1 of the key kind of, like, underpinnings from an architectural perspective that Kudu is based on. It's kind of like this underlying change in storage. And then another piece is that if you look at, like, CPU performance, like, single core CPU performance, it's ultimately kind of flatlining.
Like, you know, Moore's Law is dead or dying. Right? And so, like, what very likely comes out of that is that, like, single core CPU performance for software is gonna become increasingly more important, you know, as time proceeds. And so, Kudu was implemented in C++ and has the ability, from a roadmap perspective, to push down projections and filters from a query engine, so it can actually apply them within the storage layer as native code. And so, you know, those 2 things, I think, are 2 kind of, like, underlying primitives that, you know, required building a different database. You couldn't really do those things
[00:20:19] Unknown:
in an existing database. And then, to kinda compare it to some of the other things that we've been talking about: so, like, its on-disk format, you know, Kudu kinda resembles Parquet, I guess you could say, or is very similar. There's some distinct differences though, particularly when it comes to kind of enabling the random reads like we talked about, so that it can kinda perform similar to HBase,
[00:20:43] Unknown:
for the random reads. But yeah. Yeah. And the other thing, you know, it's often in the same conversation as HBase. And a lot of people feel or think of HBase as, like, a columnar database, but ultimately, it's not. So you basically end up creating in HBase a column family, and that column family stores a whole bunch of columns. And then if you read the HBase best practices, they say, don't create too many column families. So you can't, like, create a column family per column. And, ultimately, what that means is you're storing a whole bunch of columns together that aren't the same type, and you can't apply these extremely efficient columnar encodings on top of HBase for a whole host of reasons, 1 of them being that it's actually not a columnar database. It's a column-family-oriented database, which means it's more similar to a row-oriented database than a column-oriented database in practice. That same thing applies to Cassandra as well.
[00:21:51] Unknown:
And you mentioned that the implementation for Kudu is done in C++ whereas a lot of the other ecosystem components for Hadoop in particular, but also the broader big data ecosystem, have generally been Java or at least JVM runtime languages. So I'm curious what led to the decision of using C++ for the implementation target and what you found to be the impact in terms of system efficiency, but also development velocity?
[00:22:21] Unknown:
Yeah. So Kudu, the server component of Kudu, is implemented in C++. There's also a C++ client. So Impala, for example, is probably the most prominent user of the C++ client. There's also a Java client and a Python client as well. And, from a client perspective, you can kinda choose any language. Obviously, the JVM languages are included in that. In terms of, you know, choosing C++ on the back end, you know, really the driving force behind that was CPU efficiency. If you look at, you know, the trends in CPUs, ultimately the performance is leveling off. And so single core CPU performance is increasingly becoming important.
And, you know, there's a lot of kind of, you know, junk benchmarks, I guess, on the Internet. But if you spend time really doing a strong benchmark of Impala, which is implemented in C++, versus the JVM-based SQL engines, you can find places pretty easily where you see that impact from a CPU perspective. And so I think that was really important. Another piece is, with HBase, for example, and all the SQL engines too, you end up spending a lot of time doing JVM tuning. And, you know, it's a problem. It's always kinda getting better, but it's never really gone away. And so for us as operators of the Kudu platform, you know, we just spend a lot less time thinking about the JVM. And I think that's been a huge operational win as well. And then another piece, I guess, is that, like, if you run a Kudu or a Cassandra system, you think about compactions more than you would like. And so Kudu has a pretty novel philosophy from a compaction perspective.
And, you know, ultimately, it's kind of like: always be compacting. There's an internal algorithm that's used to decide if the benefit of the compaction is greater than the cost. And so from our perspective, we've seen that, you know, we don't have to think about compactions very much with Kudu either. We've seen that be a big 1 as well. And
[00:24:53] Unknown:
going back to the storage layer, I know that Kudu supports clustering across multiple instances. So I'm wondering how that manifests as far as replicating data across nodes and some of the complexities or challenges that you faced in the process of building out those capabilities.
[00:25:12] Unknown:
Yeah. Absolutely. So, internally, Kudu uses Raft. So it uses Raft for replication across nodes. So the typical pattern is that you, you know, you have a table in Kudu, that table is partitioned by some key, either a range or hash or both, into tablets. And then those tablets are replicated 3 times using the Raft protocol. And, Jordan and I definitely don't wanna take credit for that or a lot of the early work. You know? We're both contributors, and he's a committer to the project. But other folks were definitely doing that work. But, like, you know, what we see there is that, you know, ultimately, when you do an update or you do an operation on top of Kudu, if you haven't configured your client otherwise, you're going to wait until the data's committed to 2 of the 3 nodes before it's, you know, successfully considered committed. And so the level of safety is higher than with HDFS. So HDFS ultimately does not use a quorum to replicate data internally between the 3 nodes. And S3 is a little bit different, but ultimately, the safety of your updates is quite strong. And we found that people are less surprised by the way Kudu behaves than with S3 or even HDFS.
There's less times when people are like, oh, I lost the last 10 seconds of data, that surprises me, because the guarantees, I think, are a little more aligned with human expectations.
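As a small, hypothetical sketch of the layout just described (again assuming the kudu-python client, and a client version that exposes the n_replicas argument), this is roughly what creating a hash-partitioned, 3-replica table looks like; the names are made up for illustration.

```python
# Hypothetical sketch: a Kudu table is hash partitioned into tablets, and each
# tablet is replicated (3 copies here), with Raft handling leadership and commits.
import kudu
from kudu.client import Partitioning

client = kudu.connect(host='kudu-master.example.com', port=7051)

# Schema: an integer primary key plus a timestamp column.
builder = kudu.schema_builder()
builder.add_column('device_id').type(kudu.int64).nullable(False).primary_key()
builder.add_column('ts', type_=kudu.unixtime_micros, nullable=False)
schema = builder.build()

# 8 hash buckets on the primary key -> 8 tablets spread across tablet servers.
partitioning = Partitioning().add_hash_partitions(column_names=['device_id'],
                                                  num_buckets=8)

# Each tablet gets 3 Raft replicas; by default a write is acknowledged once a
# majority (2 of 3) of replicas have committed it.
client.create_table('device_readings', schema, partitioning, n_replicas=3)
```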
[00:27:01] Unknown:
And in terms of the Raft protocol for managing the overall cluster state and things like leader election, I'd be interested to hear about some of the discussion that went into deciding on Raft as the consensus algorithm and some of the overall benefits that it has provided
[00:27:29] Unknown:
compared to some of the other solutions that were put forth? Yeah. So with, you know, like with ZooKeeper, you know, you can use ZooKeeper to elect a leader. Right? But when you do that, like, you also need to make sure that the storage, wherever state is stored, because people typically don't store, you know, large volumes of state in ZooKeeper, you need to make sure that state is protected as well. So you really need to, like, elect a leader and then also deny all possible, you know, previous leaders from updating the state.
And so, like, you know, sometimes people will use ZooKeeper with, like, STONITH, which is, you know, a shoot-the-other-node-in-the-head-when-you-take-over-as-leader type of technology. And, I think, you know, Todd Lipcon, the creator of Kudu, right before he implemented Kudu, he implemented NameNode HA for the HDFS NameNode. And he spent a lot of time looking at STONITH and other approaches for fencing that storage from, you know, previous leaders who may not have realized they're not leaders anymore. And he decided that using Raft for both leader election and then also replication of data was a stronger approach to providing availability and consistency. So, you know, long story short is that ZooKeeper solves 1 problem, the leader election, but then you also need to do storage fencing. And those 2 things are very, very related and, you know, have a lot of interdependencies.
So implementing Raft themselves, since they were gonna have to do the data replication anyway,
[00:29:07] Unknown:
made a lot of sense to the Kudu team. And it also has the benefit of being 1 less component that you need to build out and maintain and scale independently from the Kudu system that you are actually primarily caring about. Unless you're already running ZooKeeper for other systems, it's just more headache and operational complexity that's not necessary. Yep. Yep. I think that makes a lot of sense. And, you know, if you're,
[00:29:32] Unknown:
you know, there's certainly a lot of users that are gonna be running the full stack of, like, ZooKeeper, HDFS, Yarn, so forth and so on, along with Kudu and Impala. But there's also users out there that use Kimpala, which is, you know, really just Kudu and Impala together. And having those users have yet another service dependency as well wouldn't be great.
[00:29:57] Unknown:
And in terms of ingestion of data into the Kudu system,
[00:30:01] Unknown:
do people primarily just use the language APIs that are exposed in Kudu, or would they generally use Impala as the mechanism for executing insert statements, or is it just a very sort of diverse range of approaches to building on top of Kudu? So I'll speak from the data engineering side, and then Jordan, you can speak from more of the data science side. And so, Impala can do inserts into Kudu, but, like, Impala is a big MPP query engine. And so, you know, if you're gonna do, like, single row inserts, it's not the most efficient approach. If you're gonna do single row inserts, typically, you're gonna use either the client API, Java, C++, Python, or a tool that's built on top of that. And so, like, most of the Kudu data that we're aware of is actually ingested with StreamSets. So with StreamSets, you know, you have a visual editor, it's an ETL tool, right, where you can ingest data from a ton of different sources and, without writing any SQL, you can dump it into Kudu. And so that's a really, really common approach for users. The probably second most common approach that we see is for IoT-type use cases. You would, you know, your API, your microservice, or whatever it may be that's serving those IoT-type clients, it would use that API directly.
Typically, the Java API, I think across the APIs, the Java API is by far the most used, either through Scala or through Java directly, and it's just interacting with Kudu directly. So I'd say that's the most common approach. Impala will also be used maybe if you're populating a table from, you know, sometimes in a workflow, you'll be populating a Kudu table from Parquet, and that'd be a great use case for Impala to do that work. And so on the data science side,
[00:31:58] Unknown:
largely depends on who's doing the ingestion. But, you know, oftentimes, data scientists are gonna do some of their own ingestion. So Impala becomes kinda more prevalent there just because data scientists typically are a little more comfortable with SQL than, say, going in and writing, you know, Java or C++ directly. Of course, we have the Python API as well. However, this is probably 1 of my regrets in writing that, is that I kinda stuck with the same semantics that the C++ client uses. So as a side note, the Python client just wraps the C++ API. But the semantics of doing writes through the Python client admittedly have a little bit of an engineering-like feel. I did put some effort into wrapping those and giving it kind of a Python feel, but, ultimately, it's not as comfortable as, say, something like, you know, giving them a pandas-like feel to the client. But, Yeah. Another thing I'd say, I would be remiss if I didn't,
[00:32:59] Unknown:
mention Spark. Spark is kind of, you know, what I consider the default ETL tool for big data. And so there's lots of Spark-to-Kudu
[00:33:10] Unknown:
ingestion work, going on as well. And your comment about the Python client brings up the question of some of the challenges or overall incentives for keeping the different language interfaces feature complete in relation to each other and the overall design approach to make sure that it's easy to use any of the different language run times, but also make them fairly idiomatic for the language that you're targeting.
[00:33:40] Unknown:
Yeah. And and, honestly, that's, you know, being so Java and and c plus plus, well, Java is actually entirely separate client, whereas c plus plus and Python are are kind of bound together pretty closely. But feature parity has been, keeping feature parity has been pretty high priority, and largely mostly what I what I kinda deal with at this point given that the Python client's created, but, just keeping that up to speed. And then and then the other piece, like you mentioned, is, you know, keeping it aligned with kind of the the natural feel for those languages, and and that's been a challenge too. I can't speak as much for, say, Spark, and some of the others, although I've done a patch or or 2. But, you know, in Python, particularly, trying to to keep keep it, in line with kind of the Python way, has been a challenge. But, you know, and again, like I said, that 1 of the regrets is not changing that direction early on. But, you know, ultimately, a lot of that can be addressed through, you know, just a kind of functional layer on top. So And in terms of deployment and maintenance
[00:34:46] Unknown:
of Kudu, I know that there's been a fair amount of focus put on enabling operational simplicity in terms of actually getting it deployed and maintained, but also in terms of observability for the system. So can you talk through some of the capabilities of Kudu that simplify that overall process and what's involved in actually getting it deployed and also some of the upgrade life cycles that you might have to undergo?
[00:35:14] Unknown:
Yeah. So, in terms of, like, you know, installation and upgrades and stuff, most people just use Cloudera Manager to do that, and it's really, really slick, very easy to do. In terms of things that I think Kudu has done really, really well, like, in this general area, it's on the observability front. You know, with C++, you know, things like stack traces typically aren't implemented. There's typically no real good way to get, you know, stack traces from a C++ process. And so they've instrumented Kudu and have the ability to do that. I think, like, honestly, like, there's lots of fancy observability stuff that's super valuable in Kudu and also coming out generally. But when you have a process and you can just say, you know what, give me 10 stack traces in a row, once a second, that right there is just so valuable, that you can understand what's going on, what's expensive.
And so that's a feature that is absolutely invaluable. You know, on the metric side, they've done a ton of work to make sure that Kudu is providing the metrics that are, you know, important around managing that platform. It's also instrumenting the memory allocator. You can very, very easily figure out, like, where is memory being allocated? Who is using memory within a process? So, like, I've always said, like, the thing about the JVM that's so awesome is that it's so transparent. You know? It's easy to figure out what's going on inside of it. And, you know, because of their decision to use C++, they've spent a lot of time replicating a lot of that transparency inside of Kudu.
And I'm personally big on that. And, you know, I mean, for not being a JVM project, I think that Kudu is really the most transparent process that I've ever operated before. So props to Todd for setting that direction early on. And, for me, given what it's doing and given the scale that we have with customers, it's just not that big of a headache. In terms of, you know, you kinda mentioned upgrades, I guess. The early versions of Kudu, you know, like, when it was, you know, 1.0 through 1.3, which, you know, I don't know, is a year or 18 months ago now, there were some definite, like, challenges with upgrading and using those systems. But today, we push out Kudu updates to customers without a lot of worry, and it's a very, very easy process to do that. And in terms of the overall system design,
[00:37:54] Unknown:
are there any particular edge cases that you have found to be problematic or that would be useful to have a heads up about for any new users? And if you were to start the entire project over today, what are some of the aspects of the system that you would design differently?
[00:38:10] Unknown:
So I think the number 1 is that, like, let's say you have a table and you wanna reload that table. What you don't wanna do is go and do a delete on all the rows in the table and then, like, re-ingest. You'd prefer actually to drop the table, recreate it, and then re-ingest. And, ultimately, it's because Kudu's gotta sort through all those delete operations in order to scan the table. And so, like, there's work to, like, improve how that behaves, and I think some of it has even made the latest release, but, like, today, you just definitely don't wanna do that. Just drop the table and recreate it. And that's hit a number of people. That's probably been the biggest gotcha, I think, that people have found. And then, I guess, the... go ahead, Jordan. Oh, yeah. I was just gonna say, on the analytics side as an example,
[00:39:01] Unknown:
a lot of times, the datasets that we use are really, really wide. And, you know, Kudu doesn't handle these incredibly wide tables very well. So a lot of times, we have to chunk those up or kinda get them back into a denormalized state. But, you know, really, once you go beyond, like, 300 columns or fields, it becomes kinda complicated. But, you know, I've seen use cases, and this is gonna sound ridiculous, where we have up to, like, 15,000 columns, you know, to try and build a predictive model from. So those kinds of use cases are challenging
[00:39:34] Unknown:
in Kudu. Yeah. And then the only other thing I'd say is that I wouldn't say there's something I can think of that I'd, like, definitely do differently. Well, I mean, kind of, I guess. So, ultimately, like, even today, the security around Kudu is very, very coarse grained. Like, you can write to Kudu, or you can't write to Kudu. And so it's really, really coarse grained. And so there's work going on to integrate Kudu with Apache Sentry, so you can do role-based access control on a per-table basis. And if I could have changed anything, like, I would have prioritized that work earlier on. And so there's been a number of use cases where, you know, we haven't been able to use Kudu, and it would have simplified things, because of that really, really coarse grained security. So having that prioritized earlier, I think, would've been nice. But, you know, that's really kind of the only piece that I have, or complaint I have, I guess.
[00:40:32] Unknown:
I agree. And so with that in mind, what are some of the situations where you would advise against using Kudu, whether because of those security restrictions or just in terms of the overall system attributes?
[00:40:43] Unknown:
So to be clear on the security end, a lot of, we'll call it, the access patterns for utilizing data within Kudu, so ignoring the ingestion for a moment, go through Impala. So meaning, data scientists or business analysts come in and use Kudu data through Impala. And so Impala is integrated with Sentry, and then the data can be secured at a role-based level that way. The situations where it's not is when you're going through 1 of the client APIs, you know, Java or Spark or Python or what have you. So these are the situations where that becomes an issue. Generally, what we've seen people do is just force everyone through Impala, where you can, and then, you know, whitelist use cases, essentially, to the API,
[00:41:28] Unknown:
where possible. Brock, did you wanna add to that? Like, most of the time, we can work around it, you know, using that approach that Jordan mentioned. But, you know, I think it'd be more ideal. Like, security is coming, and it would have been better to have it, you know, earlier. That's kinda what I'd say there. In terms of use cases where Kudu doesn't really make sense: for example, Kudu has a string and a binary type, but right now the default max size is 64 KB. And so if you're bringing in big blobs, big images, things of that nature, it doesn't really make sense to put that in Kudu. You probably don't need that in a columnar database. So we have a number of use cases where we'll bring that blob in as a side channel, put it somewhere, and then put the rest of the table in Kudu. And then, you know, another use case is, we talked about changes, dealing with CDC-type data. Right? Continually changing data. Well, just like with the deletes, you know, as I mentioned, like, if the whole dataset changes every day, that's a problem as well because of the way Kudu is internally architected. And some of those things are being improved, but, you know, that would be a use case where we'd wanna spend a lot of time thinking about how to architect that. By default, it wouldn't work very well out of the box.
[00:42:42] Unknown:
And in terms of your own experiences working with Kudu and building on top of it, what have you found to be some of the most interesting or unexpected or challenging lessons learned?
[00:42:52] Unknown:
Yeah. So this is kinda more of a personal note, but this was, for me, my first open source project really being, like, heavily engaged in. And, all the while, I was doing that while working at a company that was, you know, financial-based. So they had a lot of strict intellectual property rules and stuff. And none of this really has to do with the project, but it was an interesting experience for me, learning to get through all of that, and kind of my first journey into open source. Yeah. And I'd say,
[00:43:24] Unknown:
2 things. So first, on the, you know, lessons learned or whatever. I'd say, initially, Kudu was really targeted and architected at these really, really long tables, you know, really, really large tables that are used for IoT-type use cases, and it excels at that. But, you know, there's so many enterprise data sets in all these companies that are constantly changing and stuff. It's really, really useful to use it to be the storage layer for just constantly changing enterprise data sets that maybe aren't a billion rows, and there've been constant improvements in Kudu for that use case. But it was kind of something that the Kudu team didn't really see out of the gate. And so, you know, it always helps to get in front of users, and certainly, the Kudu team did this. But get in front of users and try and understand how they're gonna use the technology.
And, you know, I think that was just an insight that was a little missed out of the gate, I guess. And there was some work to kinda catch up with that use case. Yeah. So I guess the lesson learned is, you know, always get out in front of and talk about use cases as a community, get out in front of users and understand how they're, you know, thinking about using the platform, a broad spectrum of users. And then, you know, I guess, in terms of things that I'm really surprised by or proud of, or, you know, kind of a combination of those: you know, all these pure SSD boxes are becoming really common because of the cloud. And it's amazing the throughput that Kudu as a pure columnar database can do on top of a pure SSD box in AWS, for example. It's really amazing to have a truly pure columnar database and yet have, you know, just amazing throughput in terms of, you know, CRUD operations per second. So that's really cool to see. I think that's kind of seeing the future a little bit, when working on those use cases.
[00:45:28] Unknown:
And what are you most excited about for the future of Kudu in terms of new features or improvements
[00:45:36] Unknown:
or just overall community adoption or anything along those lines? The first thing I'd say is kinda 1 we already talked about, which is kind of that finer-grained access control, which is coming through Sentry integration. That's being worked on as we speak. And I would hope, you know, this isn't like an official statement, but I would hope that that would be completed, you know, end of Q1 or maybe even early Q2. And I think once that's released, the barrier to entry for most large enterprises is gonna be eliminated. So, you know, for the ones that haven't joined in, you know, that's been a big problem. So I think once that's gone, we'll see a lot more adoption of Kudu, and I look forward to that. Yeah. And I'd say from my perspective,
[00:46:18] Unknown:
you know, we've been really involved in the community since the very beginning. And so, modulo the security piece that Jordan just mentioned, I really feel that it's in a place where, you know, adoption is starting to drive upwards, and I think that it's gonna continue. And so I'm really excited to see more and more use cases be implemented on top of Kudu and in the growing community,
[00:46:48] Unknown:
of Kudu. And are there any other aspects of Kudu or fast analytics
[00:46:54] Unknown:
or the Hadoop ecosystem or anything else that we didn't cover yet which you think we should discuss before we close out the show? No. I I I think we, we did a good job. And so, yeah, I I would just, encourage folks to, you know, to to try Kudu and, you know, bring the feedback to the community. It's got a very active, Slack channel where, where you'll find, the original authors and and contributors and committers like, Jordan and myself,
[00:47:20] Unknown:
along with a with a large number of users as well. So give it a try and reach out with questions. Alright. Well, for anybody who does want to get in touch with either of you and follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. We'll go with you first, Brock. Yeah. You know,
[00:47:48] Unknown:
I think Jordan's gonna speak to the same thing, so I'll just give it a short sentence. But what we see, right, in the future is that, you know, there's a bunch of machine learning models being used to help really make decisions for businesses. Right? And, like, if you look inside Google, or you look inside Facebook, ultimately, there's a lot of production machine learning models. And we really believe, you know, the enterprise is gonna have a lot of production machine learning models in the future.
And yet, if you talk with, you know, a lot of our customers, you know, or prospects, like, there just aren't any machine learning models in production. In fact, a lot of them don't know they have to do that.
[00:48:36] Unknown:
And so, to me, that's kind of the glaring gap in the ecosystem, and I'll let Jordan finish it off. It's really his space. Yeah. Yeah. So that's largely what my team's focused on here at phData. But I'd say from a tooling perspective, there's a few gaps. 1, you know, there's really no 1 platform for this. Right? There's a lot of people trying to address it in, you know, minute areas with kind of push-button solutions to go to production. And inevitably, those types of solutions just don't work for large enterprises, Fortune 500. And so this is a challenge. Right? So that puts you back to, like, basically these really complex ecosystems where you kinda have to string together a bunch of tools, kinda like how Hadoop would have been pre-Cloudera. The challenge right now is that we don't have anyone really stringing those together. So, you know, that's something that we focus on, as well as kind of the engineering gap. I also feel there's kind of a gap in skills between, say, a data scientist and a data engineer, kind of falling into the machine learning engineer space, and that's not that prevalent of a role yet in the market. So, but it's on the upswing, so I think things are looking good there. Then the other thing I would say, these are kind of, like, old ideas and things that there have been tools out there for a while, but I feel like they're not evolving as quickly as, say, some of the other components, and that would be kind of in the space of data quality and data cataloging.
And, again, there are people like Collibra and others that are coming out with stuff, but I still, to this day, am seeing a lot of clients, large clients, develop their own solutions for these, and I just feel like there could be better frameworks in place. But yeah. Well, thank you both for taking the time today to join me and discuss your experiences
[00:50:22] Unknown:
working with the Kudu platform. It's definitely a very interesting project and 1 that seems to have a very well defined place in the Hadoop ecosystem. So thank you both for the time and effort you put into that and for your time today, and I hope you enjoy the rest of your day. Thank you. Yep.
[00:50:42] Unknown:
Thanks.
Introduction and Guest Introduction
Brock Noland and Jordan Birdsell's Background
Overview of Apache Kudu
Kudu's Role in the Hadoop Ecosystem
Comparing Kudu with Iceberg and Hive ACID
Common Use Cases for Kudu
Kudu's Architecture and Design Choices
Data Replication and Raft Protocol
Data Ingestion Methods
Deployment, Maintenance, and Observability
Challenges and Lessons Learned
Future of Kudu
Biggest Gaps in Data Management Tooling
Closing Remarks